What is our primary use case?
Our primary use cases are operational awareness, health of the systems, and impact on users. Other use cases include proactive performance management, system checkouts (as we investigate the ability to manage configuration and integration to the CMDB), some usage of it from a product perspective in terms of application usage, and I use it to manage and improve the user experience by understanding user behaviors.
We are in both Azure and AWS. We have both on-premise and cloud Kubernetes environments that we're running in. In fact, we have been using less efficient deployment methodologies. We haven't encountered any limitations in scaling to cloud-native environments.
We have only used version 1.192 of the Dynatrace product. We have not used any previous versions.
How has it helped my organization?
It has improved our critical incident response by exposing critical issues impacting the environment, improving our ability to respond to those events before clients are impacted, and helping us resolve them more quickly. We have studied use cases that showed a 70 percent improvement in response time for an active event, with further improvement when the same issue recurred.
The solution's use of a single agent for automated deployment and discovery helps our operations significantly. Oftentimes when you are looking at endpoint management, centralized monitoring teams need access to data across systems. They need to manage agents deployed throughout the organization. Remote polling of data can be helpful, but it's not deep enough, especially for APM capabilities. Having one agent significantly simplifies that functionality in such a way that it enables a very small team to manage a very large environment with very limited overhead. It provides the ability for external teams to manage it because they don't need any deeper knowledge of the application than installing the agent. They have the ability to integrate the agent into deployments and to do the work with very limited overhead.
The automated discovery and analysis help us to proactively troubleshoot production and pinpoint the underlying root cause. We have had scenarios where we can see end user impact. In one use case, an individual system in a cluster of nine for a content management system was having an issue. Through Dynatrace, we were able to quickly identify the one host that was having a problem, take it out of the active cluster, recycle that application instance, bring it back, and reintroduce it to the cluster in a very efficient manner. Historically, these processes took multiple hours to diagnose and identify the instance, then do the work. With Dynatrace, we were able to do the work in less than 20 minutes from when the issue first occurred to its resolution. There have been similar scenarios where we could quickly identify issues in infrastructure and back-end services.
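To illustrate the idea of singling out one bad host in a cluster, here is a minimal sketch. It is not how the Dynatrace AI works internally; it only compares each host's error rate to the cluster median. The host names, the error rates, and the `factor` and `floor` thresholds are all assumptions made up for the example.

```python
from statistics import median


def find_unhealthy_hosts(error_rates, factor=3.0, floor=0.01):
    """Return hosts whose error rate is far above the cluster median.

    error_rates maps a host name to its error rate (0.0 to 1.0).
    floor keeps the baseline from collapsing to zero on a quiet cluster.
    """
    baseline = max(median(error_rates.values()), floor)
    return sorted(
        host for host, rate in error_rates.items() if rate > factor * baseline
    )


# Hypothetical nine-node content management cluster with one misbehaving host.
cluster = {f"cms-{i:02d}": 0.004 for i in range(1, 10)}
cluster["cms-07"] = 0.31

unhealthy = find_unhealthy_hosts(cluster)
print(unhealthy)
```

In this made-up data set only `cms-07` stands out, which is the host you would pull from the active cluster, recycle, and reintroduce.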
Out-of-the-box, it's the best product that I've seen. Its ability to associate application impact, as well as root cause from an infrastructure standpoint, is by far ahead of anything that I have seen due to its ability to associate infrastructure anomalies to applications. We are still on our journey of identifying the right business KPIs to see how we can associate this data.
Dynatrace is doing an excellent job of giving us 360-degree visibility of the user experience across channels in most technologies. We are working with Dynatrace to expose the full transparency to the mainframe, as we have transactions that call from the cloud onto the mainframe and back out to other services. This is a critical visibility that isn't there yet. Otherwise, with a lot of the cloud and historical systems, we do see a lot of transparency of transaction trace across the environment.
What is most valuable?
- Automated discovery
- Automated deployments
- The AI
These are probably the most important, because together they give us traceability, from the end user's transactions all the way through the back-end systems. We are still working through the mainframe integration, but the scenarios where we can integrate through the mainframe are very useful.
We can see issues that occur, sometimes before the clients do. Before we have client (or end user) calls for issues, we are able to start troubleshooting and even resolve those issues. We can quickly identify the root cause and impact of the issues as they occur, and this is very helpful for providing the best client experience.
We have found the self-management of the management cluster and Dynatrace processes to be highly reliable. There have been minimal issues with managing the infrastructure.
We've targeted deployment of the real-user monitoring to the most critical applications in the company to understand if there's something that's happening in the environment and the user impact. This is to be able to understand the blast radius of issues, helping us understand if an issue is impacting one app or multiple applications. We can then quickly diagnose where the common event is (the root cause), resolve it, and then leverage the product to validate healthy user traffic after completion by seeing transactions be processed again.
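The blast radius reasoning can be sketched in a few lines. This is only an illustration under assumed data: the application names, component names, and the dependency map are hypothetical, and in practice the product discovers these relationships itself.

```python
def shared_components(dependencies, impacted_apps):
    """Return the components that every impacted application depends on."""
    impacted = [set(dependencies[app]) for app in impacted_apps]
    if not impacted:
        return set()
    return set.intersection(*impacted)


def blast_radius(dependencies, component):
    """Return every application that depends on the given component."""
    return sorted(app for app, deps in dependencies.items() if component in deps)


# Hypothetical application-to-component dependency map.
deps = {
    "online-banking": {"auth-service", "db-cluster-1", "cms"},
    "mobile-app": {"auth-service", "api-gateway"},
    "claims-portal": {"auth-service", "db-cluster-2"},
    "intranet": {"cms"},
}

# Three applications report user impact at the same time.
candidates = shared_components(
    deps, ["online-banking", "mobile-app", "claims-portal"]
)
affected = blast_radius(deps, "auth-service")
print(candidates, affected)
```

The intersection points at the common event (here the assumed `auth-service`), and the reverse lookup lists which applications to watch for healthy traffic once it is fixed.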
From a synthetic standpoint, we use the synthetics in two ways:
- We do lower-level infrastructure pings (HTTP pings), primarily to validate individual technology services on the back-end, i.e., the API endpoints.
- We use the front-end synthetics to validate user experience 24/7. When you have low usage periods, you are still able to validate the availability and performance of services to the organization. Oftentimes, changes may be implemented to reduce risk during lower usage times and the synthetics can be valuable to validate during that time.
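For readers who want a feel for what a lower-level HTTP ping does, here is a minimal sketch using only the Python standard library. It is not the Dynatrace synthetic monitor; the function name, the latency threshold, and the result fields are assumptions for illustration. The `opener` parameter exists only so the check can be exercised without a network.

```python
import time
import urllib.request


def http_check(url, timeout=5.0, max_latency=1.0, opener=urllib.request.urlopen):
    """Issue one GET request and report availability and latency."""
    started = time.monotonic()
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
    except OSError as exc:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        return {"url": url, "available": False, "error": str(exc)}
    elapsed = time.monotonic() - started
    return {
        "url": url,
        "available": 200 <= status < 300,
        "slow": elapsed > max_latency,  # assumed threshold, tune per service
        "latency_s": round(elapsed, 3),
    }
```

Calling `http_check` with the URL of an API health endpoint on a schedule, including during low usage periods, gives a basic availability and performance signal of the kind described above.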
It has been very easy to deploy and obtain basic information.
It's very good from a problem troubleshooting perspective.
What needs improvement?
I find the out-of-the-box features to be extremely valuable. However, there will be gaps and challenges as you go into a much broader set of infrastructure technologies to consume the necessary information. This will be a challenge for the company. The thing they need to focus on is the ease of integrating external data sources, which can then also contribute to the AI. There is a ton of value out-of-the-box, but moving to the next steps will be an interesting journey. I know this is something they are focused on now. When bringing in other telemetry, whether it be network devices, databases, or other third-party products that all integrate into a larger ecosystem, there will be a lot of successes, but there will also be some challenges on this journey.
There is some complexity in the alarm processing logic within the product between the alert policies and problem notifications.
The user session query capability should be expanded to cover application data and other telemetry within the system. Currently, analyzing the data outside of dashboards requires exporting it to other reporting systems. If you want to do higher level reporting, then this may make sense. However, there is a desire to be able to do some of that analysis within the product.
There continues to be some opportunity to expose the infrastructure from a broader reporting standpoint. Overall, the opportunity is in the reporting capability and the ability to more flexibly expose or pivot the data for deeper analysis. Oftentimes, the solution is good at looking narrowly at information, but when you want to broaden that perspective, that's where the challenges come in. At this point, it requires the export of data to external systems to do this.
Adoption lagged primarily due to:
- The prioritization of monitoring as a functionality when teams do their work, as our teams are more focused on business functionality than nonfunctional requirements.
- Getting familiar with the navigation of the product. With our implementation, we have a single node where people get access to all the data within the enterprise. They're able to see everything. It takes time working through the process and getting the correct set of tags and everything else in place to allow them to filter and limit data to what they need to see and can consume. It takes some time for them to understand the data, what's there, and how to consume it as we learn how to limit the data sets to what they really want to see.
For how long have I used the solution?
What do I think about the scalability of the solution?
At this point, we have about 1700 host units. We're monitoring 2000 to 3000 systems. We have 300 to 500 users a month using the systems with approximately 700 users overall.
How are customer service and technical support?
Their Tier 0 is better than most companies that I have ever worked with. Normally, I'll get useful information even at that initial level/Tier 0.
The in-app chat is extremely helpful. It helps not only with the ability for me to troubleshoot, but the ability for the rest of the organization to ask how-to questions. We have hundreds of those chats across the organization per month which are leveraged by end users.
Everything else is as expected when working through engineering and our product specialists, who have been helpful.
How was the initial setup?
The initial setup and implementation are almost too easy. With real-user monitoring and all the application monitoring, you are introducing change into the environment. It is so easy to set up, configure, and implement that you can get way ahead of your organization technically from where they are from a usability standpoint. We have run into virtually no technical limitations in implementing the product. It has purely been from the ability to get users to adapt, understand, and leverage the value of the product.
We implemented and installed the Dynatrace platform (and everything) within a couple of days. In certain environments, the product was deployed and instrumented overnight. Onboarding teams and providing the required training took months. We were able to technically implement the product from non-production into production within a month, with everything deployed and instrumented, but it took us another eight to nine months to onboard individual teams into adopting and leveraging the product. From there, the rollout is really limited more by organizational change, communication, and facilitating training with teams and their technical capabilities. Key teams have adopted the product and used it very quickly. Therefore, we saw value within four weeks of deployment from our centralized critical incident teams, but product adoption by application and development teams has lagged.
If you are implementing Dynatrace, the first thing is to not underestimate your users and their experience: provide them personal service to onboard and consume the information, then leverage the product on the front-end. Technically, the product is so easy to implement and deploy that it is easy to get ahead of the rest of the organization's ability to adopt it. The data starts presenting itself before users are ready and able to consume it, so you need to factor that into your implementation.
What was our ROI?
The solution has decreased both our MTTI and MTTR.
In 2018, we were having on average one issue per day, which is one of the reasons we purchased the product that year. Last year, we reduced outage time by 70 to 80 percent as an organization. While Dynatrace is part of driving that avoidance as well as the reduced outage time, it's impossible for us to draw a direct correlation to its impact because there are so many other factors at play in an organization. Change management processes and everything else could also have influenced that. However, we know that it contributed to the increased uptime, to the point where we've decided to invest significantly more in the product.
What's my experience with pricing, setup cost, and licensing?
It's understandable to do a smaller scale initial evaluation. However, once you identify the product's value, don't hesitate on scope and scale: maximize the initial investment and take the opportunity to make a bulk investment in the product.
Which other solutions did I evaluate?
We have other competitive products. The automated instrumentation will be extremely valuable as we look to consolidate our solution set. The ability to quickly gain insight gives us good information that we can use. There will be a challenge internally with our teams, since application teams were never exposed to infrastructure information and infrastructure teams have never been exposed to application or end user information. Organizationally, we have to change, because people are now going to see this insight and need to figure out how to leverage it for good, which will be helpful. It will be a game changer in terms of how we can identify and respond to events in the organization from the point of view of data and analysis, as opposed to tribal knowledge and fear.
Dynatrace was initially brought in to eliminate one competitive APM product. We are now on to eliminating the second, and we'll be consolidating all APM on the Dynatrace platform. We are also in the process of consolidating other infrastructure monitoring products on the platform. We expect there will be a small incremental investment from a purely licensing standpoint to consolidate the products, but we expect realization of a significant amount of benefit from the capabilities it provides from root cause analysis, impact analysis, transaction trace observability in the environment, the reduced administrative costs of disparate products, and the ability to integrate data. However, a lot of these were not measured previously because we had a lot of disparate tools across disparate teams managing things. Therefore, we can't measure the savings but we expect it will be significant.
We have CA APM Introscope, New Relic, and AppDynamics. We are users of all three of these products, though we are probably using AppDynamics the least. We have almost completely migrated away from Broadcom and are starting the replacement of New Relic.
Holistically, Dynatrace's traceability starts from the user endpoint, meaning the ability to trace a transaction from a user session all the way through other technologies. We've had more comprehensive traces than with other products. Other products do not offer an easy interface to see the trace of the user session in a comprehensive way. Dynatrace offers the ability to go from mobile to microservices to mainframe and trace across all those platforms. It also has the ability to automatically correlate user transactions to applications, then to the underlying infrastructure components. Another Dynatrace benefit is the whole function of the AI as well as bringing in other external data sources. For example, we are looking at DataPower and F5 data integrations, and also at incorporating those into the trace. Finally, there is support for legacy technologies. Traceability, the AI, and support for legacy and mainframe technologies are the big positive differentiators, and together they help us come to a conclusive root cause analysis.
CA APM Introscope and New Relic have simpler interfaces to consume data. With Dynatrace, you need to develop plugins to obtain easier API interfaces for pushing data into other products. This is a little easier with the other products. The New Relic Insights product is a stronger reporting feature than what Dynatrace provides.
There are also other products that we are looking at eliminating in other product suites, such as Broadcom UIM, Microsoft SCOM, and Zabbix. We have a lot of open source solutions where we're looking to roll out infrastructure, then consolidate and centralize data. The primary functions and capabilities get into mobile-to-mainframe traceability in order to simplify or expedite impact and root cause analysis processes for the teams. The solution also has the ability to support our modern technologies running in AWS and Kubernetes cluster microservices as well as traceability all the way through the mainframe.
What other advice do I have?
We have integrated our notification systems through PagerDuty, Slack, and our auto ticketing app in order to generate incident records. The integrations with PagerDuty and Slack are effective. We're in the process of migrating some tools to ServiceNow. Thus, we are working on synchronizing events with ServiceNow while also evaluating the CMDB integration with ServiceNow. There are some recent capabilities for automating discovery and relationship building that make this look more attractive and that we're looking forward to, but we have not yet implemented them. The integration to ServiceNow will be good.
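As a rough idea of what a chat notification integration involves, the sketch below turns a problem record into the simple JSON body (`{"text": ...}`) that a Slack incoming webhook accepts. The problem record and its field names (`severity`, `title`, `state`, `impacted`) are hypothetical and are not the Dynatrace notification format; the sketch builds the payload only and does not send anything.

```python
import json


def slack_payload(problem):
    """Build a Slack incoming-webhook JSON body from a problem record."""
    impacted = ", ".join(problem["impacted"])
    text = (
        f"[{problem['severity']}] {problem['title']} "
        f"({problem['state']}), impacted: {impacted}"
    )
    return json.dumps({"text": text})


# Hypothetical problem record, for illustration only.
example = {
    "severity": "ERROR",
    "title": "Failure rate increase on auth-service",
    "state": "OPEN",
    "impacted": ["online-banking", "mobile-app"],
}

payload = slack_payload(example)
print(payload)
```

The same record could be mapped to an incident in a paging or ticketing tool by swapping the formatting function for one that matches that tool's documented request body.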
The desire is to have Dynatrace help DevOps focus on continuous delivery and shift quality issues to pre-production. We are not there yet. The vision is there and it makes sense with the information that we see, but we have not had the opportunity. Even though we've been using the product now for two years, we're only now just starting an effort to roll the product out across the enterprise and replace competitive products for application infrastructure monitoring. We'll then have the opportunity for that full CI/CD integration or NoOps opportunity.
We will be rolling out to some highly dense environments in the near future. We haven't run into any performance issues yet. The only issue that we ran into previously is with the automated instrumentation of the product. We accidentally disabled the competitive products that teams were using as we were evaluating Dynatrace. You can get in front of yourself in rollout.
We don't have the solution’s self-healing functionality integrated into the automation product. Dynatrace doesn't have the self-healing capability of restarting services. Therefore, from a monitored application perspective, we haven't enjoyed that capability yet.
We are in the process of testing some parts of the session replay. We see value there and are working through understanding the audit and compliance impacts of leveraging this feature.
Based on my experience and history with these products, I would rate it at least a nine (out of 10). It's been far superior to other products in its capabilities and comprehensiveness, especially across both cloud and legacy technologies (like mainframes and server-based monolithic applications).
Which deployment model are you using for this solution?