Dynatrace Review

AI identifies all the components of a response-time issue or failure, hugely benefiting our triage efforts

What is our primary use case?

We're a health plan, a health insurer. We're not a big one; we have about a million members. We are growing by adding new business, and we're looking to expand into the government programs: Medicare and Medicaid. Right now we provide individual and family, large corporate, self-insured, and a couple of other types of health plans.

We are headquartered in Minnesota, outside of Minneapolis. We have a data center in Minnetonka and one in another suburb. We do most of our work on-premises; we don't have much in the cloud for our core backroom applications. We use a package from HealthEdge, a company in Boston, to do our claims processing, membership, enrollment, etc.

Our main use case is application performance monitoring, right at Dynatrace's sweet spot. First, we wanted to know what the performance of our healthcare and our health claims processing system was. Then we wanted to be able to segment it by where the transaction response time is spent. We also wanted to get into the deep dive of the Java profile, because HealthEdge is a Java application that runs on several JVMs. We wanted not only to get into the Java code but to get into the SQL that's created to call into the database, which is where the response-time problems are. 

We're using Dynatrace SaaS now. It's the newest version.

How has it helped my organization?

Since we have OneAgent, we have real-user monitoring. So not only do we know the response time and availability from our synthetic checks, but we also know what real users experience on our website. Our service desk seldom gets calls, but if, say, you as a member had trouble with something, we can go back and find exactly what you did and why the response was poor. We've used that many times to find errors. The latest were JavaScript errors, caused by a setting in Internet Explorer, that were annoying members. But members don't call our service desk and say, "Hey, your website sucks." So we have to look at the data and ask, "Geez, why does Internet Explorer have these huge JavaScript errors?" And then we find out.

We found an error where developers had used a Google API to load a Google Map and help a member find a place where they could attend a Medicare workshop. The API allowed only a fixed number of calls per hour, and we saw that, usually about 45 minutes after the hour, that transaction was failing. It turned out that we had used up our 1,000 allocated calls and, when the hour turned over, it worked again. Dynatrace integrates all things monitoring from an application perspective: synthetic, real users, and Java deep-dive.
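That failure pattern is what a fixed hourly quota produces. The following is a minimal Python sketch, assuming a simple fixed-window counter that resets on the clock hour; only the 1,000-call limit comes from the review, while the class and function names and the steady rate of 22 calls per minute are hypothetical. It shows why failures cluster around 45 minutes past each hour.

```python
from dataclasses import dataclass


@dataclass
class HourlyQuota:
    """Hypothetical fixed-window quota: at most `limit` calls per clock hour."""
    limit: int = 1000
    current_hour: int = -1
    used: int = 0

    def try_call(self, minute: int) -> bool:
        hour = minute // 60
        if hour != self.current_hour:
            # A new clock hour starts a fresh window and resets the counter.
            self.current_hour = hour
            self.used = 0
        if self.used >= self.limit:
            return False  # quota exhausted until the hour turns over
        self.used += 1
        return True


def first_failure_minutes(calls_per_minute: int, total_minutes: int,
                          limit: int = 1000) -> list[int]:
    """Return, for each hour with a failure, the first minute a call failed."""
    quota = HourlyQuota(limit=limit)
    failures: dict[int, int] = {}
    for minute in range(total_minutes):
        for _ in range(calls_per_minute):
            if not quota.try_call(minute):
                # setdefault keeps only the first failing minute per hour.
                failures.setdefault(minute // 60, minute)
    return [failures[hour] for hour in sorted(failures)]


print(first_failure_minutes(22, 120))  # prints [45, 105]
```

At 22 calls per minute the quota runs out 45 minutes into each hour, and calls succeed again as soon as the next hour begins, which matches the symptom described above.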

Dynatrace provides us with a huge benefit for triage because by the time a Dynatrace problem is open, AI has identified all the components and where the response-time issue is or where the failure is. It's really mindless. We don't have to try to pull out a map and figure out how the application looks. 

And Dynatrace has a feature called Smartscape. I don't use it a lot because their AI is so good that I've never had to dig through it myself. But if I were to go through it, it would go from data centers to hosts to processes to services and applications, showing how they're all linked together. So it has a topology view. We use that sometimes for performance testing, which another part of my team does. They need to know which pieces are involved, and Smartscape shows them.

But from a day-to-day event-management and IT operations-center perspective, the Dynatrace AI is what has identified the failing component. The dashboard has all the problems. They open up these problems, which are already events in our ServiceNow environment, and these problems have the call-path and everything else laid out in them. So I've never had to dig into the Smartscape to figure out where my failure is. The Dynatrace AI has done that for me.

What we found early on in our HealthRules environment was that the response-time problems were, 99 percent of the time, in the SQL we throw at the database. As the DBAs would say, "It's not the database, it's the bad SQL." Dynatrace helped us focus on that immediately and get away from: "Is it the network? Is it the server? Is it too busy?" Those are all the different things the vendor wants to throw at you. A year or two ago, I went to Boston to help the vendor. I took them right through the code and the response times and said, "Here's the piece of SQL that makes this particular function slow." Dynatrace got us there in minutes. They said, "Well, your server might be too busy, it might be your network," and I could say, "No, it's none of that. Here's the response time of that transaction and here is its decomposition. The thing runs for 13 seconds and spends 12 seconds on this one piece of SQL. I think that's where your problem is." Dynatrace was a huge help there.
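Once a transaction is decomposed, the 13-second example comes down to simple arithmetic. A minimal Python sketch, assuming the breakdown is already available as seconds per component: only the 13-second total and the 12-second SQL figure come from the review, while the component names and the split of the remaining second are hypothetical. It finds the dominant contributor and its share of the response time.

```python
def dominant_component(breakdown: dict[str, float]) -> tuple[str, float]:
    """Return the component that took the most time and its share of the total."""
    total = sum(breakdown.values())
    if total <= 0:
        raise ValueError("breakdown must contain positive durations")
    name = max(breakdown, key=breakdown.__getitem__)
    return name, breakdown[name] / total


# Hypothetical decomposition of the 13-second transaction described above.
transaction = {"sql_query": 12.0, "java_code": 0.7, "network": 0.3}
name, share = dominant_component(transaction)
print(f"{name}: {share:.0%} of response time")  # prints "sql_query: 92% of response time"
```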

The solution has decreased our MTTR by well over 50 percent and maybe by as much as 90 percent. First of all, it enabled us to actually identify problems. Before, it was endless war rooms without any real identification of the cause. Dynatrace has driven that almost to zero: when a problem is opened, we already know the root cause.

As for mean time to repair, since we know what we need to repair, we can point the developer right at it. It has decreased that by 50 to 60 percent.

It has also dramatically improved our uptime. One of the biggest problems with JVMs, of course, is garbage collection and memory saturation. When a memory leak develops, Dynatrace shows the memory increasing steadily and opens a problem. The team works on the problem proactively and either fixes it or schedules graceful downtime. If they have to shut down the environment, they can roll through our three servers, which run in a type of HA arrangement. So, without any disruption to the client, we've been able to fix things that would have turned into major outages of the whole environment. It's a definite help on the preventive side.
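The review doesn't describe how Dynatrace detects a leak, so the following is only a minimal Python sketch of the general idea, not Dynatrace's method. It assumes heap usage is sampled after each garbage collection at a regular interval (raw heap usage has a sawtooth shape from collections, which post-GC samples avoid), fits an ordinary least-squares line, and flags a leak when the slope exceeds a threshold. The function names, sample values, and the 1 MB-per-sample threshold are hypothetical.

```python
def memory_trend_slope(samples_mb: list[float]) -> float:
    """Ordinary least-squares slope of memory usage, in MB per sample interval."""
    n = len(samples_mb)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(samples_mb) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples_mb))
    var = sum((x - mean_x) ** 2 for x in range(n))
    return cov / var


def looks_like_leak(post_gc_samples_mb: list[float], min_slope_mb: float = 1.0) -> bool:
    """Flag a leak when post-GC memory climbs by at least min_slope_mb per sample."""
    return memory_trend_slope(post_gc_samples_mb) >= min_slope_mb


print(looks_like_leak([100, 102, 104, 106, 108]))  # prints True (slope 2.0)
print(looks_like_leak([100, 101, 100, 99, 100]))   # prints False (slope -0.2)
```

A real detector would also need to tolerate noise and restarts; the fixed threshold here is just the simplest possible choice.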

In terms of time to market, the in-house team that works on our web portal interface were early adopters of the technology on our team and learned what works and what doesn't. Dynatrace has significantly decreased their time to market. They're not really part of the development cycle, but the way they use it, the things they say about it, and the reports they've made indicate that it has probably cut their portal development effort by nearly 50 percent.

It has also helped us consolidate tools. We got rid of some New Relic, and we got rid of an older tool that was a great early innovator in this space but was acquired by CA or Microsoft. We were still paying licenses for it and were able to consolidate it. We were about to buy a network tool to help with the ACI conversion on our network side, mainly to tell us who an application is talking to on the network. We use Dynatrace for that instead, so we saved tens of thousands of dollars by not acquiring that tool. We also used to pay an outsourced company to do our synthetic monitoring, and we converted all of that. Once we had Dynatrace in-house, we could do it ourselves, and that saved $20,000 to $30,000 a year. If I were to look, there's probably more I could do with Dynatrace. I have to focus on the core system right now, but I think they'll move into the SNMP monitoring space soon, if they're not already there. And the plugins on the ActiveGates have a lot of capabilities we could use. We already monitor our VMware environment with it.

We've started to use the Apdex score in all of our communications. It's a standard metric for indicating how websites are performing. That idea is baked into Dynatrace, and we've built on it throughout our company. The weekly service quality reports, which are emailed to all Dynatrace users, are starting to get some notice. From the web portal side, they show the Apdex: are users satisfied, tolerating, or frustrated? They show those percentages, where users are coming from geographically, and which browsers most of your users are using. They also show how much of the usage is mobile versus desktop, which has proved very valuable to our digital experience people. Things like that are a huge benefit, and I didn't even know they existed when I bought it.
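The Apdex calculation itself is short. Under the standard Apdex definition, a sample is satisfied if its response time is at or below a threshold T, tolerating if it is above T but at or below 4T, and frustrated otherwise. The score is (satisfied + tolerating / 2) / total. Here is a minimal Python sketch with hypothetical sample times and a hypothetical 1-second threshold. It illustrates the formula and is not Dynatrace's exact implementation.

```python
def apdex(response_times_s: list[float], threshold_s: float) -> float:
    """Standard Apdex score: (satisfied + tolerating / 2) / total samples."""
    if not response_times_s:
        raise ValueError("need at least one sample")
    satisfied = sum(1 for t in response_times_s if t <= threshold_s)
    tolerating = sum(1 for t in response_times_s
                     if threshold_s < t <= 4 * threshold_s)
    # Everything slower than 4T counts as frustrated and contributes nothing.
    return (satisfied + tolerating / 2) / len(response_times_s)


# Hypothetical page load times in seconds, with T = 1 second:
# 2 satisfied, 2 tolerating, 1 frustrated.
print(apdex([0.5, 1.0, 2.5, 3.0, 5.0], 1.0))  # prints 0.6
```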

What is most valuable?

In addition to monitoring the HealthRules ecosystem, which is typical BusinessWorks, Oracle databases, and JVMs for transactions, we do a lot of web monitoring. With Dynatrace, we have synthetic checks and real-user monitoring of all of our websites, the places where members and providers interact with us over the web. We monitor their response times with Dynatrace, and it's all integrated in one place. We actively monitor our websites synthetically from two or three geographic locations. Our business is in nine states, so we're not international by any means. We sell health insurance to members in states including Oklahoma, Kansas, North and South Dakota, Wisconsin, and Minnesota. We monitor those synthetically.

It also instruments .NET and BusinessWorks out of the box.

It has an integration with ServiceNow, which is great. Dynatrace creates tickets for issues, and its AI finds the root cause. We have integrated that with our ServiceNow to generate events and incidents, so that all of our event management will be done in ServiceNow; we're working on that now. As for self-healing, we don't use Dynatrace for that, although we could. We've gone down the path of trying ServiceNow's Orchestration, but we may come back to Dynatrace for it, depending on how that works out.

In addition to ServiceNow, there is a CMDB integration: when Dynatrace discovers a problem, the Dynatrace ID correlates to a CI in the CMDB, and that's how we open an incident or event. We don't need to do the correlation ourselves. If an event turns into an incident, the correlation is done automatically by the Dynatrace application in the ServiceNow store. It syncs the CMDB's entries, the CIs, with the Dynatrace IDs, so that every piece of the response-time puzzle that Dynatrace has can be assigned to a CI in our CMDB. We are actively working on improving discovery in our CMDB, as it's not the most robust. Dynatrace is a huge help there because OneAgent discovers all these things for us, so it helps with ServiceNow discovery as well.

The Dynatrace panel generally lets you know how many users it affects, and how many transactions or events in that application it affects. We don't use that a lot. That's beyond our capability right now, but I don't see any reason why it wouldn't be quite useful to assign severity from that.

What needs improvement?

On licensing: I would like to put it everywhere in infrastructure-only mode, and I want the cost of doing that to be reasonable.

From a technological standpoint, there's the OneAgent versus the plugins they have. The OneAgents talk to local gateways, which were called security gateways when they first came out, and the gateways communicate out to the Dynatrace cloud to store all the performance data. Instead of every agent going out to the cloud, there's just one spot, and security likes that. Dynatrace has since renamed those gateways ActiveGates, and now there are different plugins we can run on them. Sometimes the plugins are designed for things where you can't put in an agent, like an Oracle Exadata instance or an Oracle appliance. We can't put a OneAgent on those because they're not a standard Linux or Windows OS, so the ActiveGate solution is better there. But development of those plugins sometimes seems to be moving very fast, and they're not complete. They don't yet function quite as easily as the OneAgents do, but I have hopes that's going to get better. We have tried the MQ, Citrix, and Oracle ActiveGate plugins. They could be sharper. It's the right direction to go; it just seems like it could be smoother.

For how long have I used the solution?

I have been using Dynatrace for close to three years at my current company, and before that I used an earlier Dynatrace product, DC RUM, at a previous job.

What do I think about the stability of the solution?

I had one problem early on with WebLogic, where Dynatrace was not stable and actually affected one of the WebLogic components. That component was instrumented because we thought it needed to be, but it didn't. When we decided not to instrument it, the problem went away.

But that's the only stability issue I've ever had with it. That was the only time it's caused an outage or been responsible for high resource consumption. Typically the OneAgent is well under 1 percent CPU utilization and takes very little memory.

It's used constantly by several teams. They use the Dynatrace mobile app on their phones to get notified of problems in the environment before ServiceNow even notifies them. Our platform services team is responsible for the HealthEdge environment; if we were a bank, it would be all the backroom functions. It's where we pay claims, enroll members, credential providers, and maintain all that stuff. That support team has the app on their phones, and so does our portal team, so it's used constantly. I hear about it when it's not available or when something odd is going on with the mobile app.

What do I think about the scalability of the solution?

It could handle a much larger environment. I add ActiveGates mainly for redundancy. I don't think I need as many as I have. I could scale it out very large. I don't see any limitations. I've never had a problem with that other than my checkbook.

We've tried scaling it to cloud-native environments a little bit. We have a few things that are off-premises, like Microsoft Dynamics and Salesforce, which are in the cloud. We have a cloud-based application that does provider credentialing, as well. We don't have anything that we own in the cloud, so we can't instrument AWS or anything like that with it.

How are customer service and technical support?

Tech support has generally been pretty good; we get a good response. They have a thing called Dynatrace ONE, and I find tech support is best if I engage it through a chat window in Dynatrace. There's a place right in the tool where you can get hold of a Dynatrace ONE person, and they'll look at your problem right away. That seems to work better than the old model of calling support or sending an email, because you would go back and forth: "Send me more doc. What about this? Send me that." The Dynatrace ONE agent gathers everything he needs, acting as something like level-one triage, and once he has all that, if he doesn't know what the problem is, he'll open the incident for you and it's done. I like that part. The traditional approach of sending an email or opening a ticket online takes too long. The Dynatrace ONE agent available through chat is a great concept, and I encourage my team to use it rather than opening a ticket. And that's included in the standard licensing.

How was the initial setup?

For our deployment, we did the first 40 in less than an hour. That took part of one person's time, and he maintains it all now. We have close to 200 nodes with OneAgent on them, four ActiveGates, synthetic monitoring, and plugins for MQ and Citrix, among other things. That takes three-fourths of a person on my team. I've federated support for a lot of the portal-side work: our portal team developers fell in love with it so much that I just let them run with it and install it as needed, and I give them more and more administrative rights. If you add their time, it works out to the equivalent of about one person.

We have close to 100 users. Some of them are just management who use the reports. Some of them are the portal team who are administrators, just like my team, and the majority are in IT. We're starting to take it out to our sales organization, as they're interested in the response time and other things.

What was our ROI?

We see big-time ROI in performance tuning, that is, improving application performance. We have teams using it constantly to make our digital experience better, performance-wise and availability-wise. Another part of my group does load testing. They use LoadRunner to build a load test and use Dynatrace to monitor it after every new release of the HealthRules code, to tell them what's better and what's not. There's a huge ROI on load testing and performance testing.

There is also incident response, preventive incident response. One day the CIO came into my boss's office, and my boss was able to tell him that Dynatrace had seen a problem, it was fixed, and we didn't have an outage. The CIO looked at him and said, "That's how it's supposed to work, right?" What the CIO had been promised for 10 years, he finally saw "in the wild": an instance where we preemptively discovered a problem and fixed it. That's a huge win.

Also, reporting and analytics are huge: knowing the response time and how many users use it, just the simple things.

I'm not sure how to estimate how much Dynatrace has saved us overall, but it has to have saved us on the order of millions.

What's my experience with pricing, setup cost, and licensing?

We license it for two environments, typically all of production and all of one lower environment, usually our staging environment. If there is a downside to Dynatrace, the only thing I can think of would be the cost. If it were cheaper, I'd have it in all my environments. I don't think they're charging more than it's worth, by any means. It's just that good software costs money.

They have the OneAgent, which you buy and install. You can run it in infrastructure-only mode and pay less. The cost is a bit funny: it's calculated based on the memory size of the server you put it on. Sixteen gigabytes of memory, for instance, is one host unit, and a host unit costs you, say, $1,000. (I don't recall the actual cost; I'd have to look at our contract.) The switch they've added for infrastructure-only mode cuts that cost to about one-sixth or one-seventh of the cost of a full host agent. You won't get the deep-dive response-time metrics, but you'll get the infrastructure data, which sometimes is all you want.
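The arithmetic behind that rule of thumb can be sketched in a few lines of Python. This is a rough estimate under loud assumptions, not Dynatrace's pricing. It uses the review's simplification that 16 GB of memory equals one host unit and scales linearly, the reviewer's illustrative $1,000 per host unit, and a one-sixth factor for infrastructure-only mode. Dynatrace's real host-unit table and prices are contract-specific and may differ. The constant and function names are hypothetical.

```python
# All three constants are assumptions taken from the review's rough figures,
# not Dynatrace's official host-unit table or price list.
GB_PER_HOST_UNIT = 16.0        # "Sixteen gigabytes of memory ... is one host unit"
PRICE_PER_HOST_UNIT = 1000.0   # illustrative "$1,000", not a real contract price
INFRA_ONLY_FACTOR = 1 / 6      # reviewer estimates one-sixth to one-seventh


def estimated_cost(server_memory_gb: list[float], infra_only: bool = False) -> float:
    """Rough license estimate: host units scale linearly with server memory."""
    host_units = sum(gb / GB_PER_HOST_UNIT for gb in server_memory_gb)
    rate = PRICE_PER_HOST_UNIT * (INFRA_ONLY_FACTOR if infra_only else 1.0)
    return host_units * rate


# Three hypothetical servers with 16, 32, and 64 GB of memory (7 host units).
print(f"{estimated_cost([16, 32, 64]):.2f}")                   # prints 7000.00
print(f"{estimated_cost([16, 32, 64], infra_only=True):.2f}")  # prints 1166.67
```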

In addition to the host agent fee, which is based on the server's memory size and was the first thing I bought, they charge for the metrics we collect through the ActiveGate plugins. They charge you per metric.

So the three principal things they charge you for are OneAgent, how many metrics you collect through the ActiveGate, and digital experience monitoring (DEM) units. DEM units are basically the cost of the synthetic checks, per test. Those things are quite reasonable in cost. The biggest cost is the OneAgent.

The cost to get us up, my first allocation, was under $100,000. My first PO was for about $60,000 and it covered almost our whole production HealthRules environment. We started out with 40 host units and we've grown to 200-plus, and we're a small place. Down the street is a health-related business and I think they have 20,000 host units.

Which other solutions did I evaluate?

We started by looking at industry reviews and selected the top four or five in the upper-right quadrant: Dynatrace, AppDynamics, New Relic, and, briefly, what at that time was a CA product, or it might've been BMC.

We evaluated the four of them on paper and then brought two in for a trial, a proof of concept: Dynatrace and AppDynamics. Ultimately we selected Dynatrace.

There were several advantages to Dynatrace. Dynatrace was new. Its presence in the cloud was nice, but I could also run it on-prem if I wanted to, and at the time I didn't know which way I was going to go, or which way security would allow me to go. AppDynamics was cloud-only at the time.

For installation, Dynatrace was trivial compared to AppDynamics. AppDynamics had an engineer on-site for two or three weeks, and they still couldn't meet all of our use cases, which were pretty simple. I evaluated AppDynamics first. Then I went to Dynatrace, and they said, "Well, download it, install it, and call us if you have any questions." I thought, "Well, geez, don't I get any hand-holding or anything?" It turned out that was because I didn't need it. It was that simple: you download it, install it, and it injects itself. You can control it. It was engineered for ease of use, by far. So the installation was night-and-day different.

We have a lot of TIBCO BusinessWorks code around that, which we wanted to instrument, and with AppDynamics we had to go into every business process and change the startup. We had hundreds of them, and that was a real pain: we had to select which ones and do the work. Dynatrace, on the other hand, discovers them. Dynatrace has a concept called OneAgent, which you install on the server, and it discovers things that you can monitor. You just click on them and say, "I want these monitored," or "Don't monitor these." It takes care of all that work, and that was a huge difference. I didn't need a huge staff to maintain it, and I didn't need much time from the support teams to help me with monitoring, which is good because they don't have it. We were able to do the monitoring ourselves.

Then, once it was up and running, the use cases were pretty simple. One was to create a business-level dashboard of response time, and I don't think AppDynamics ever got that out for me. 

Dynatrace is easy to use from that perspective. It's easy to install and maintain. I have a small team and one person is my Dynatrace SME, but he does other things as well, so it's not even a full-time job.

What other advice do I have?

I've been doing this for close to 30 years. I've worked for software vendors and I've worked for major companies and now I'm at this small healthcare organization. The "holy grail" has always been the ability to decompose response time and Dynatrace has done that and integrated all of my APM needs in one tool. That is the biggest benefit to me. I can do application performance, from web to Java deep-dive, in one place. That's probably why it costs so much.

If you're thinking about Dynatrace, consider how easy it is to install and maintain. It has broad coverage and it's easy to use. I don't know how the rest of the market even competes anymore; it must be on cost.

As an APM tool, I'd probably rate it at nine out of ten. There are a few rough edges, but I think that's mainly because they're trying to do the right thing too fast.

**Disclosure: IT Central Station contacted the reviewer to collect the review and to validate authenticity. The reviewer was referred by the vendor, but the review is not subject to editing or approval by the vendor.