What is our primary use case?
When we started with Dynatrace we were an on-prem organization. We used it in the early days as an APM, the way most people used it.
Our usage of Dynatrace has grown over the years, not as much in terms of capacity as in usability. It is now used by four departments within our organization. It originally started with just my group, which is IT, and then we rolled it out to development because they saw the advantage of being able to identify bottlenecks in existing code. We've rolled it out to operations, where they use Session Replay to troubleshoot customer-specific issues. And the sales department uses it to gauge productivity: how many visits we get to a particular page, how many times people watch a particular video, how many take a certain practice exam, etc.
Those use cases are all in addition to its core use, which is to help us keep our infrastructure running.
We're currently using the Dynatrace SaaS, the Dynatrace ONE product. We're not using anything in the old, modular product. It fits very well for us. We are a cloud organization. We're all Azure now. We migrated from on-prem to cloud about three years ago.
How has it helped my organization?
The automated discovery and analysis definitely help us to proactively troubleshoot production and pinpoint underlying root cause, both from a code perspective as well as an infrastructure perspective. When we get an alert, or we're seeing a degradation in performance, Dynatrace will lead us down the path: Where do we need to look first? It will tell us that it has analyzed so many trillions of dependencies and that it thinks that the problem is "here," and it will point to a query or a line of code or perhaps to a system or to a container that is not functioning properly. Depending on what the problem is, it saves us an enormous amount of time in troubleshooting and identifying problems.
I estimate it has cut our mean time to identification at least in half, if not more. Before, we were relegated to combing through logs. We would take Splunk, look for the error, find out where it was occurring, how many times it was occurring — do all that type of investigation that you normally need to do. We don't have to do that anymore because it's all automated.
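For teams that want to pull those automatically identified problems into their own scripts or dashboards, Dynatrace exposes them through its Problems API (v2). The sketch below is illustrative only: the environment ID and token are placeholders, and it constructs the authenticated request rather than sending it.

```python
import urllib.request

# Placeholder environment ID and API token; substitute your own values.
ENV_ID = "abc12345"
API_TOKEN = "dt0c01.EXAMPLE"

def build_problems_request(env_id: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a GET request against the Dynatrace
    Problems API v2, which lists the problems Davis has detected."""
    url = f"https://{env_id}.live.dynatrace.com/api/v2/problems"
    return urllib.request.Request(url, headers={"Authorization": f"Api-Token {token}"})

req = build_problems_request(ENV_ID, API_TOKEN)
print(req.full_url)
```

Sending the request (for example with `urllib.request.urlopen`) returns a JSON list of problems that can be fed into ticketing or reporting tools.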
As for decreasing our mean time to repair, it's closer to 60 to 70 percent, although it really depends on the problem. The reason is that we no longer have to spend such drastic amounts of time troubleshooting. We take its recommendation, and the time we spend is verifying that Dynatrace was right. We'll test out a quick fix in dev, take it to QA, and then push it to production.
I operate an entire stack on four people, and the only way I'm able to do that is by automating as much as I can and having tools that I can rely on to reduce time-dependent tasks. Dynatrace has allowed me to function and keep my people productive without working them 24/7. Dynatrace works 24/7 for me.
Another thing that Dynatrace gives us is very deep visibility, not only into user actions but systems interactions. How are the systems relating to each other? Are the right systems talking to the right systems? When we first deployed Dynatrace five years ago, it showed us, through its Smartscape tool, that we had servers talking to servers they shouldn't be talking to. That was quite an eye-opener. I've noticed that a lot of companies are trying to copy what Dynatrace came out with in its Smartscape, but to me, it is the best visualization tool of your app stack and network that you'll ever put together, and you don't have to do anything. The system puts all that together. You deploy your one agent, it maps out the system, and you can see everything from application to network to infrastructure connectivity. It depends what you want to see, but it's all Smartscape'd out. You can tell what traffic is going in which direction and where it's going.
In addition, when I first started using Dynatrace, I had a routine. I would come into the office early and go through all of the night's activities. I would check for any problems we had: Was anything broken, were there any active alerts? With Dynatrace Davis, I started getting those reports automatically, through Amazon Alexa, and I do that on my drive to work. Instead of having to go in early and spend time in the office, I'm able to stay at home a little later, have breakfast with the family. Then, when I'm in the car, I invoke Alexa to give me my Dynatrace morning report, which will include my Apdex rating, any open problems, and a summary of closed problems. It's probably one of the least advertised aspects of Dynatrace, and one which I think is among the most highly efficient tools that they offer.
The amount of time we have to devote to maintaining Dynatrace is next to nothing. The time that we spend in Dynatrace is actually using it. We're using it to look at what's happening, what's going on, is something broken, or do we have an alert? We go in to find out what's wrong. Maintaining it is really almost nonexistent.
Another advantage is that it is much more of a proactive tool than it is one for putting out fires. Of course, it helps us tremendously if we have to put out a fire, but our goal is to never have a fire. We want to make sure that any deployments we put out are fully tested in all aspects of use, so that when things are fully deployed, there isn't any need for a rollback. In the last three years, we've had to roll back a production deployment once. I don't attribute all of that to Dynatrace, but I do attribute a large part of it.
It has increased our uptime because we find out about problems before they're problems. The one goal that my team has, above anything else, is to know about problems before the customer does. If the customer is telling us there's a problem, we have failed. We are so redundant and so HA-built, that there is absolutely no reason for us not to be able to circumvent an issue that is under our control, and to prevent any type of a work stoppage or outage. We can't help it if the internet goes down or if Microsoft has a core problem, but we can certainly help by making sure that it's not our application stack or our infrastructure. I would estimate our uptime is better by at least 20 percent.
In the end, it has decreased our time to market with new innovations and capabilities, because anything that reduces time-to-produce decreases time to market. Once the code has actually been developed, it's in testing and deployment and that's where my window of efficiency is. I can't control how long it takes to build something, but I can control how long it takes to fully test it and deploy it. And there, it has saved us time.
Before we had Dynatrace, and a lot of the processes that Dynatrace has helped us put into place, everything was manual. And the more manual work you have, the more margin for human error you have.
What is most valuable?
The most valuable features really depend on what I'm doing. The most unique feature that Dynatrace offers, in my opinion, is Davis. It's an AI engine and it's heavily integrated into the core product.
Session Replay not only allows us to watch the user session as a video playback, but to see the individual steps happening behind the scenes, from a developer perspective. It gives us every single step a user takes in a session, and we can see each call to every server as the user goes through the site. If something is broken or not running optimally, it's going to come up in Session Replay.
We also use the solution for dynamic microservices within a Kubernetes environment. We are in the process of converting from Docker Swarm to Kubernetes, but that is in its infancy for us and will grow as our Kubernetes deployments grow. Dynatrace's functionality in this is really good.
We use JIRA as well as Jenkins. We have a big DevOps push right now and Dynatrace is an integral part of that push. We're using Azure DevOps, and tying in Dynatrace, Jenkins, and JIRA and trying to automate that whole process. So Dynatrace plays a role in that as well.
In terms of the self-healing, we use the recommendations that it provides. I'd say the Davis engine runs at about 90 percent accuracy in its recommendations. We have yet to allow automated remediation, which is our ultimate goal. It's going to be a bit before we get comfortable with anything doing that type of automated work in production. But I feel that we're as close as we've ever been and we're getting closer.
User management is extremely easy. I hate to use the word "easy," but it really is. And it's a lot easier today than it was when we first started with Dynatrace. We create a lot of customized dashboards for both the executive and management teams, central to their areas of oversight. It used to take quite a bit of time to create dashboards. Now it even has an automated tool that takes care of that: you just tell it what you want it to present and everything falls together. It also has templated dashboards that you can customize.
The single agent does all of it. Once you deploy the one agent to your environment, it's going to propagate itself throughout the environment, unless you specifically tell it not to. It is the easiest thing that we've ever owned, because we don't have to do anything to it. It self-maintains. Every once in a while we'll have to reinstall the agent on something or a new version will come out and we'll want to deploy it, but for the most part, it's set-it-and-forget-it.
What needs improvement?
I would love to see Dynatrace get more involved in the security realm. I get badgered by so many endpoint protection companies. It seems like a natural fit to me, that Dynatrace should be playing in that space.
I'd also like to see some deeper metrics in network troubleshooting. That's another area that it's not really into.
For how long have I used the solution?
We're in our fifth year of using Dynatrace. We were the very first paying customer for the new platform, Dynatrace ONE. We used it right at launch.
What do I think about the stability of the solution?
The stability has been phenomenal. I'm not going to say that Dynatrace has never had an outage, but I've never had an outage where Dynatrace wasn't available for me. It's always been there. It's always there when I need it. It's always on. Our uptime is five-nines, and we do attribute a large portion of our ability to maintain that figure to Dynatrace.
What do I think about the scalability of the solution?
In terms of scalability, we don't have anything that it can't do. As we add to our infrastructure, it scales. Yes, every time we add a node, we're going to spend more. But it's up to me to decide if I want to monitor everything or a set of everything. My philosophy is to monitor all of production. Anything that is deployed to production is being monitored by Dynatrace.
From a dev and test perspective we don't monitor like that. We keep a secondary Dynatrace instance that we use in the event that we need to troubleshoot something in development, but for the most part, our Dynatrace usage is relegated to production. And that's for cost reasons.
We have four environments in our builds. We have production, where we cover everything. We have a development environment, which is a subset of production, with different copies. We have QA, which is where everything goes from development for final testing. And then we have staging, which is the final step before it's pushed to the production clusters.
As we add to production, we add to Dynatrace. That is always going to be the plan. We will not deploy anything to production that doesn't have Dynatrace on it.
I don't get involved in the minutiae, but from what the guys tell me, with Linux servers you don't even blink. They have to watch Windows servers a little more because Windows is more intensive. Windows itself doesn't tend to perform very well when you first build it; you've got to massage it and get it to where you want it to be. Dynatrace helps us with that, but Windows is more finicky.
We have about 50 users of Dynatrace between infrastructure, development, operations, and sales.
How are customer service and technical support?
Their technical support is the best ever. I know I sound like a broken record, but we get chat support on the Dynatrace site from a high-level tech in the US who has the answers to our questions, not from a first-level rep who's going to ask if your machine is booted up. And if the techs can't answer a question, they open a ticket and get back to us later. It's the best support model I've ever had the pleasure of working with.
Which solution did I use previously and why did I switch?
We were using New Relic at the time. We were having a lot of frustrations with that in terms of its dashboarding capabilities, and the amount of time that my people had to spend keeping it updated and running correctly. We started looking at other products and we ended up settling on Dynatrace. Aside from its major capabilities, what Dynatrace ended up doing for us was to assist us in our migration to the cloud, because it gave us the sizing recommendations and the baselines that we needed to formulate what we were going to start with in Azure.
New Relic was the primary APM at the time and we were just very frustrated with it. We started looking at other products and really didn't see much of a difference in the competition, differences that would warrant going through the change, until we came upon what was then called Ruxit and is now called Dynatrace.
The biggest difference was that the other solutions required overhead. My biggest complaint was the amount of time we had to spend with those tools; they're supposed to save you time, not take up more of it. Dynatrace was the first one to actually deliver on that promise.
We ran hybrid for a year, collecting data on both ends, using Dynatrace both on-prem and in the cloud, and now it's all cloud.
How was the initial setup?
The setup is really not much different, whether you're an on-prem organization or a cloud or even a hybrid. It's still the one agent. I have no experience with their AppMon product, so I can't tell you how much easier the new product is versus the old. But I can tell you that this product that we have been using is the easiest thing we've ever had. The only comment I got from my systems team is, "Why didn't we get this sooner?"
I am not the norm when it comes to policy and procedure. I tend to buck the trends a little bit. If I have a new product that I feel is going to be advantageous to the company and my team as a whole, then once we've done our due diligence, we will just deploy it. I know that larger companies with different criteria and regulations have to follow different channels and paths, through security and infrastructure and storage, etc. But ultimately, as long as you have "air-cover," and by that I mean an executive sponsor who believes in what you're doing, then you really should be able to get it done with minimal effort.
We were fully up and running in a week. It took me longer to remove New Relic than it did to deploy Dynatrace. We only needed one person to deploy Dynatrace. One of my systems people took care of it. I took care of the administrative stuff, creating the initial dashboards and getting the payments set up and so forth, but my systems people took care of the actual deployment of the one agent.
What about the implementation team?
I didn't hire any contractors or deployment services. I signed up for Dynatrace's free trial and we went to town.
What was our ROI?
From a monitoring-tool perspective, Dynatrace has saved us money through consolidation of tools. We used to use a number of them: PRTG, Pingdom, an additional paid Azure service that we no longer need, and Splunk for log mining. Just in the tools we eliminated, it has saved us $30,000, and there are more soft dollars I could add to that.
I'm not sure how you come up with an ROI because it's pretty much all soft dollars. It's a line item in my budget that doesn't have to grow unless we grow. We have not experienced a base-price increase from Dynatrace.
What's my experience with pricing, setup cost, and licensing?
Dynatrace is not the cheapest product out there and it's not the most expensive product out there. In our business, you get what you pay for.
Dynatrace has a place for everybody. How you use it and what your budgetary limitations are will dictate what you do with it. But it's within everybody's reach. If you're a small organization and you have a large infrastructure, you may not be able to monitor the whole thing. You may have to pick and choose what you want to monitor, and you have the ability to do so. Your available funds are going to dictate that.
The only additional costs that I incur are for additional log storage space, which is like $100 a year.
What other advice do I have?
My advice would be to compare and compare again. Everybody's offering free trials, and I know that they're a pain to do, but compare the products, apples for apples. Everybody's going to compare costs, but be sure to compare the functionality. Are you getting what you pay for? Are you getting the bang for your buck out of what the product is returning to you? If all you need to know is "my server's down," you can probably get by with the cheapest thing out there. But if you want to know why the server is down, or that the server is about to go down and you need to do something, then you want a product like Dynatrace.
I go to their Perform conference every year, and it's amazing to me to see the loyalty and dedication from the customer side. It's like a family reunion every year when we go to Perform. I hope we have it next year.
From a core-product perspective, Dynatrace is doing everything that we ever asked for. Everything that we've ever wanted to monitor, it has always been there first.