Top 8 Application Performance Management (APM) Tools

Dynatrace, Datadog, AppDynamics, Aternity, New Relic APM, Azure Monitor, ITRS Geneos, Apica Synthetic
  1. Dynatrace
    In my experience, Dynatrace is scalable. The agent deployment is the most valuable feature. You don't need to do any configuration; you just deploy the agents, and it can automatically detect your infrastructure. That was the greatest feature that we saw in Dynatrace. If there is any database, it can detect it automatically and present everything to you.
  2. Datadog
    Because of our client focus, it is easy for us to sell. This is because it is easy to use and easy to set up. I have found error reporting and log centralization the most valuable features. Overall, Datadog provides a full package solution.
  3. AppDynamics
    Technical support is helpful. It is easy to gain visibility into complex environments with AppDynamics. It has the ability to combine operational information about the environment with business information, with strong Business iQ support.
  4. Aternity
    The infrastructure data, especially the CPU and memory data, is per second, which makes it outstanding compared to other solutions. Its licensing cost is very low for us.
  5. New Relic APM
    The simplicity of the dashboard is very good. Working with the solution is very easy. It's user-friendly.
  6. Azure Monitor
    Azure Monitor is very stable. For us it is really just a source for Dynatrace: it collects data and monitors the environment and the infrastructure, and it is fairly good at that.
  7. ITRS Geneos
    The solution is used across the entire investment banking division, covering environments such as electronic trading, algo-trading, fixed income, FX, etc. It monitors that environment and enables a bank to significantly reduce downtime. Although hard to measure, since implementation we have probably seen some increased stability because of it, and we have definitely seen teams become a lot more aware of their environment. Consequently, we can be more proactive in challenging and improving previously undetected weaknesses.
  8. Apica Synthetic
    There are several features that are really good. The first one is the flexibility and the advanced configuration that Apica offers when it comes to configuring synthetic checks. It provides the ability to customize how the check should be performed, and it is very flexible in the number of synthetic locations it can use. It allows us to run scripts from different locations all over the world, and they have a really good number of these locations.

Advice From The Community

Read answers to top Application Performance Management (APM) questions.
Rony_Sklar
Hi peers,  How is synthetic monitoring used in Application Performance Management (APM)?  How does it differ from real user monitoring?
NetworkOb0a3 (Network Operation Center Team Leader at a recruiting/HR firm with 1,001-5,000 employees)
Real User

I think different shops may use the term differently, so with regard to an industry standard the other replies may be more appropriate.

Where I work, we refer to SEUM (Synthetic End User Monitoring) and UX monitoring, both user-experience monitors, as simulating actual human activities and setting various types of validations. These validations may be load times for images, text, or pages, or validating an expected action based on the steps completed by the monitor. We target all aspects of infrastructure and platform for standard monitoring, and then for any user-facing service we try to place at least one synthetic/UX monitor on top of the process.

I often find the most value from our synthetics comes in the form of historical trending. A great example of a NOC win: patch X was applied and we noticed a consistent 3-second additional time required to complete UX monitor step Y. Another value from synthetics is quickly assessing actual user impact. More mature orgs may have this all mapped out, but I have found that many NOCs will see alarms on several services yet not be able to determine what this means to an actual user community until feedback comes in via tickets or user-reported issues. Seeing the standard alarms tells me what is broken; seeing which steps are failing in the synthetics tells me what this means to our users.

I think that one of the great benefits of an open forum like this is getting to consider how each org does things. There are no wrong answers, just some info that applies better to what you may be asking.
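To make the validation idea concrete, here is a minimal sketch of a scripted check in Python. It assumes the `requests` library, a hypothetical target URL, and illustrative thresholds; a full SEUM tool would drive a real browser and run the same steps from multiple locations.

```python
import time
import requests

# Hypothetical target and thresholds -- adjust for your own service.
URL = "https://example.com/login"
MAX_LOAD_SECONDS = 3.0
EXPECTED_TEXT = "Sign in"

def run_synthetic_check(url: str) -> dict:
    """Fetch a page the way a scripted monitor would and validate the result."""
    start = time.monotonic()
    response = requests.get(url, timeout=10)
    elapsed = time.monotonic() - start

    failures = []
    if response.status_code != 200:
        failures.append(f"unexpected status {response.status_code}")
    if elapsed > MAX_LOAD_SECONDS:
        failures.append(f"load time {elapsed:.2f}s exceeded {MAX_LOAD_SECONDS}s")
    if EXPECTED_TEXT not in response.text:
        failures.append(f"expected text {EXPECTED_TEXT!r} not found")

    return {"url": url, "elapsed": elapsed, "failures": failures}

if __name__ == "__main__":
    result = run_synthetic_check(URL)
    if result["failures"]:
        # In a real NOC this would raise an alert, not just print.
        print("SYNTHETIC CHECK FAILED:", "; ".join(result["failures"]))
    else:
        print(f"OK in {result['elapsed']:.2f}s")
```

Run on a schedule, the recorded timings also give you the historical trend described above.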

Brian Philips
User

There is actually a place and a need for both synthetic and real-user experience monitoring. If you look at the question from the point of view of what you are trying to learn, detect, and then investigate, the answer should be that you want to be proactive in ensuring a positive end-user experience.

I love real user traffic. There are a number of metrics that can be captured and measured, and what can be learned is controlled by the type and kind of data source used: NetFlow, logs, and Ethernet packets. Response time, true client user location, the application command executed, the response to that command from the application including exact error messages, direct indicators of server and client physical or virtual performance; the list goes on and on. Highly valuable information for app-ops, networking, cloud, and data center teams.

Here is the challenge, though: you need to have real user traffic to measure user traffic. The number of transactions and users, the volumes of traffic, and the path of those connections are great for measuring over time, for baselining, for triage, and for finding correlations between metrics when user experience is perceived as poor. The variation in these same metrics, though, makes them poor candidates for measuring efficiency and proactive availability. Another challenge is that real user traffic is now often encrypted, so exposing that level of data has a cost that is prohibitive outside of the data center, cloud, or co-lo. These aspects are often controlled by different teams, so coordinating translations and time intervals of measurements between the different data sources is a "C"-level initiative.

Synthetic tests are fixed in number, duration, transaction type, and location. A single team can administer them, but everyone can use the data. Transaction types and commands can be scaled up and down as needed for new versions of applications and microservices living in containers, virtual hosts, clusters, physical hosts, co-los, and data centers. These synthetic transactions also determine availability and predict end-user experience long before there are any actual end-users. Imagine an organization that can generate transactions, and even make phone calls of all types and kinds in varying volumes, a few hours before a geographic workday begins. If there is no version change in software and no change control in networking or infrastructure, yet there is a change from baseline or a failure to transact, IT has time to address the issue before a real user begins using the systems or services. These transactions, fixed in number and time, are very valuable in anyone's math for comparison and SLA measurements, and they do not need to be decrypted to get a command-level measurement.

Another thing to consider is that these synthetic tests also cover SaaS and direct cloud access, as well as third-party collaboration tools (WebEx, Zoom, Teams, etc.). Some vendors' offerings integrate them with their real-user measurements and baselines out of the box, to realize the benefit of both and provide even more measurements, calculations, and faster triage. Others may offer integration points like APIs or webhooks and leave it up to you.

The value and the ROI are not so much one or the other. Those determinations for an organization should be measured by how you responded to my original answer: "you want to be proactive in ensuring a positive end-user experience."

Diego Caicedo Lescano
Real User

Synthetic monitoring and real user monitoring (RUM) are two very different approaches to measuring how your systems are performing. While synthetic monitoring relies on automated, simulated tests, RUM records the behavior of actual visitors on your site and lets you analyze and diagnose it.

Synthetic monitoring is active, while real user monitoring is passive; that means the two complement each other.

Sunder Rajagopalan
Real User

Synthetic monitoring helps simulate traffic from various geographic locations 24/7 at some regular frequency, say every 5 minutes, to make sure your services are available and performing as expected. In addition, running synthetic monitoring with alerts on some of your critical services that depend on external connections, like payment gateways, will help you catch any issues with those connections proactively and address them before your users experience any problem with your services.

Michael Sydor
Real User

Synthetics, in production, are best used when there is little or no traffic, to help confirm that your external access points are functioning. They can also be used to stress-test components or systems, simulating traffic to test firewall capacity or message queue behavior, among many other cases. You can also use synthetics to do availability testing during your operational day, again usually directed at your external points. Technology for cloud monitoring is generally synthetics, and the ever-popular speedtest.net is effectively doing synthetics to assess internet speed. The challenge with synthetics is maintaining those transactions. They need to be updated every time you make changes to your code base (that affect the transactions) and to cover all of the scenarios you care about. There are also the hardware requirements to support the generation and analysis of what can quickly become thousands of different transactions. Often this results in synthetics being run every 30 minutes (or longer), which, of course, defeats their usefulness as an availability monitor.


Real user monitoring is just that: real transactions, not simulated. You use the transaction volume to infer availability of the various endpoints, and baselines for transaction type and volume to assess that availability. This eliminates the extra step of keeping the synthetics up to date and of living with the intervals at which you have visibility into actual traffic conditions. But it takes extra work to decide which transactions are significant and to establish the baseline behaviors, especially when you have seasonality or time-of-day considerations that vary greatly.
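As a rough illustration of that baseline idea, the sketch below compares the current transaction volume for an endpoint against an hour-of-day baseline built from history. The data structures, thresholds and numbers are hypothetical; a real APM backend would do this per transaction type and handle seasonality properly.

```python
from statistics import mean, stdev

def build_hourly_baseline(history: dict[int, list[int]]) -> dict[int, tuple[float, float]]:
    """history maps hour-of-day (0-23) to transaction counts observed on past days."""
    return {hour: (mean(counts), stdev(counts))
            for hour, counts in history.items() if len(counts) > 1}

def volume_looks_anomalous(current_count: int, hour: int,
                           baseline: dict[int, tuple[float, float]],
                           z_threshold: float = 3.0) -> bool:
    """Flag the endpoint if current volume deviates strongly from its usual level for this hour."""
    if hour not in baseline:
        return False  # not enough history to judge
    avg, sd = baseline[hour]
    if sd == 0:
        return current_count != avg
    return abs(current_count - avg) / sd > z_threshold

# Hypothetical usage: 14:00 usually sees ~1000 checkouts; today we see 120.
history = {14: [980, 1010, 1050, 995, 1020]}
baseline = build_hourly_baseline(history)
print(volume_looks_anomalous(120, 14, baseline))  # True -> likely an availability problem
```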


However, I'm seeing that the best measure of transaction performance is to add user sentiment to your APM.  Don't guess at what the transaction volume means - simply ask the user if things are going well, or not!  This helps you narrow down what activities are significant, and thus what KPIs need to be in your baseline.


A good APM Practice will use both synthetics and real-user monitoring - where appropriate!  You do not choose one over the other.  You have to be mindful of where each tool has its strengths, what visibility they offer and the process that they need for effective use.



Hani Khalil
Real User

Actually, RUM gives you the value after the fact.

I mean, once the customer has been impacted, RUM will show it. Synthetic user monitoring keeps testing your service 24/7, and you will be notified if there is an issue, without requiring any real user to interact with your service. The two components complement each other.

ATHANASSIOS FAMELIARIS
Real User

Synthetic monitoring refers to proactive monitoring of application components' and business transactions' performance and availability. Using this technique, the availability and performance of specific critical business transactions per application are monitored by simulating user interactions with web applications and by running transaction-simulation scripts.

By simulating user transactions, the specific business transaction is constantly tested for availability and performance. Moreover, synthetic monitoring provides detailed information and feedback on the reasons for performance degradation and loss of availability, and with this information performance and availability issues can be pinpointed before users are impacted. Tools supporting synthetic monitoring normally include features like complete performance monitoring, continuous synthetic transaction monitoring, detailed load-time metrics, monitoring from multiple locations, and browser-based transaction recording.

On the other hand, Real User Monitoring (RUM) allows recording and observation of real end-user interactions with the applications, providing information on how users navigate the applications, which URLs and functions they use, and with what performance. This is achieved by recording time-stamped availability data (status, error codes, etc.) and performance data from an application and its components. RUM also helps in identifying the most commonly used or most problematic business transactions, so they can be properly configured for synthetic monitoring, as described previously.

Tjeerd Saijoen
Vendor

In real-user monitoring the load on the systems is different every time, based on the total number of users, applications, batch jobs, etc., while in synthetic monitoring we use what we call a robot, firing the same transaction, for example, every hour. Because it is the same transaction every time, you can determine the performance of the transaction. If you do this in DevOps, you can monitor the transaction before actually going live and minimize the risk of performance problems before going into production.

Rony_Sklar
Hi peers, With so many APM tools available, it can be hard for businesses to choose the right one for their needs.  With this in mind, what is your favorite APM tool that you would happily recommend to others?  What makes it your tool of choice?
Hani Khalil
Real User

I have tested a lot of APM tools and most of them do the same job with different techniques and different interfaces.

One of the best tools I tested is called eG Enterprise. This tool provided the required info and data to our technical team. We also found great support from the eG technical team during the implementation. One of the main factors was cost, and they can challenge a lot of vendors on that.

Silvija Herdrich
User

Hi, I recommend Dynatrace. Companies can focus on their business instead of wasting time and money on different tools and complex analysis. In today's world, companies would need more and more specialized employees just to do what Dynatrace can deliver in minutes via artificial intelligence. The IT world is changing, but companies can't quickly change their monitoring tools and educate people. Customers, suppliers, partners and employees expect perfect IT. The risk of doing something that does not deliver full observability fast enough is too high. No matter what you do - run applications in your company, develop apps for employees and customers, build an e-commerce channel - IT is important for success, and this can be guaranteed if you know about challenges in your IT before others tell you.

Abbasi Poonawala (Yahoo!)
Real User

My favourite APM tool is New Relic. The monitoring dashboard shows exact method calls with line numbers, including external dependencies, for apps of any size and complexity.

Pradeep Saxena
Real User

My favourite APM tool is Azure Monitor; from it I can check Application Insights. I can also check when an application crashed.

GustavoTorres
User

My favorite APM tool is Dynatrace; its one-agent approach enables fast and agile deployment.

Ravi Suvvari
Real User

Agree, well explained.

reviewer1352679 (IT Technical Architect at an insurance company with 5,001-10,000 employees)
Real User

Our organization is large and has a long history. We had a lot of on-premise, monolithic applications and tons of business logic included in places it shouldn't be. This caused a lot of pain when implementing new architectures. Several years back we implemented new architectures using microservice apps, client-side browser processing, and ephemeral systems based on Kubernetes. During the transition, DevOps teams were given full rein to use whatever tool they wanted, including open source. A handful of tools were used more pervasively, including New Relic, Prometheus, CloudWatch, OMS, Elasticsearch, Splunk, and Zabbix. As can be imagined, this caused a lot of issues coordinating work and responding to incidents. Three years back we did an evaluation of all tools and pulled Dynatrace into the mix.


Dynatrace was easily the most powerful solution to provide APM and simplify the user experience into a "single pane of glass". We are also working to integrate several other data sources (Zabbix, OMS, CloudWatch and Prometheus) to extend the data set and increase the leverage of the AI engine.


Why Dynatrace?


- Most comprehensive end-to-end tracing solution, from browser to mainframe
- Entity (aka CI) mapping to relate RUM to applications to hosts. This includes mapping of entities such as Azure, AWS, Kubernetes, and VMware
- An AI engine that uses the transaction traces and entity mapping to consolidate alerts and accelerate impact and root cause analysis


There are several other features such as simplified/automated deployment, API exposure and analytics tools.

Pradeep Saxena
Real User

Azure Monitor gives application insights by ingesting metrics and log data of many variants: OS, application, CPU, memory, etc. We can visualise and analyse what's going on in the application.

Rony_Sklar
Hi community members, I have some questions for you:  What is ITOM? How does it differ from ITSM?  Which products would you recommend to make up a fully defined ITOM suite?
Tjeerd Saijoen
Vendor

ITOM is a range of products integrated together; it includes infrastructure management, network management, application management, firewall management, and configuration management. You have a choice of products from different vendors (BMC, IBM, Riverbed, ManageEngine, etc.).


ITSM is a set of policies and practices for implementing, delivering and managing IT services for end users.


Syed Abu Owais Bin Nasar
Real User

One difference is that ITSM is focused on how services are delivered by IT teams, while ITOM focuses more on event management, performance monitoring, and the processes IT teams use to manage themselves and their internal activities.


I would recommend BMC TrueSight Operations Management (TSOM), an ITOM tool. TrueSight Operations Management delivers end-to-end performance monitoring and event management. It uses AIOps to dynamically learn behavior, correlate, analyze, and prioritize event data so IT operations teams can predict, find and fix issues faster.


For more details:
https://www.bmc.com/it-solutio...

Nick Giampietro
User

Rony, ITOM and ITSM are guidelines (best practices) with a punch list of all the things you need to address in managing your network and the applications which ride on them. 


Often the range of things on the list is relatively broad, and while some software suites offered by companies will attempt to cover ALL the items on the list, typically the saying "jack of all trades, master of none!" comes to mind here.


In my experience, you can ask this question in a Google search and come up with multiple responses, each covering a small range of the best practices.


My suggestion is to meet with your business units and make sure you know what apps are critical to their success and then meet with your IT team to ask them how they manage those applications and make sure they are monitoring the performance of those applications. Hopefully, both teams have some history with the company and can provide their experiences (both good and bad) to help you prioritize what is important and key IT infrastructure that needs to be monitored.  

Like most things in life, there is more than one way to skin the cat. 

reviewer1195575 (Managing Director at a tech services company with 1-10 employees)
Real User

There are two letters which define a core "difference" in these definitions and one which defines a common theme.
O, for Operations, is the first pointer: the IT function of using IT infrastructure to keep the business satisfied. That involves day-to-day tasks but also longer-term planning. Ideally, Operations teams DON'T firefight except in rare circumstances, but have information at hand on the health of all objects that could impact the business directly or indirectly. Monitoring collects data; then correlation and analysis help extract useful information to deduce the situation and take corrective action. The functions available in toolsets may automate parts of that, but cases where they become 100% automatic are rare.


S points to service delivery to users, hence ITSM is mostly about serving users. For many, ITSM is in fact the help desk or ticket management. Of course, within ITSM there's a lot more to it: a lot of analytics of operations data, as well as the history of past incidents and of the fixes that affected service delivery in the past. ITSM may also include commitments; so-called SLAs/SLOs are contracts that describe the quality of service expected and committed to.


M, for Management, means that more than tools is needed for both. People are needed even if automation is highly present, as all automation requires design and modification. Change is constant.
Management means processes for standardisation of data, tasks and their execution, etc. It also means data collection, cleansing, handling, analysis, protection, access and many other aspects without which risks are taken and delivery of service becomes more hazardous.


ITIL and other formalised standards of conduct in the IT world have proven to be vital ways of driving standardisation, and shouldn't be ignored.


With the emergence of modern application landscapes and DevOps, there's a tendency to "imagine" doing away with ITOM and ITSM.
Like everything, they need to evolve, and they have over the last couple of decades, but getting some of the basics correct goes a long way to ensuring IT serves the business as a partner.


Hani Khalil
Real User

Hi,


ITOM is IT Operations Management which is the process of managing the provisioning, capacity, cost, performance, security, and availability of infrastructure and services including on-premises data centers, private cloud deployments, and public cloud resources.


ITSM refers to all the activities involved in designing, creating, delivering, supporting and managing the lifecycle of IT services.


I tried Micro Focus OBM (formerly HP OMi) and it's good. You also have Applications Manager from ManageEngine.

Ariel Lindenfeld
Let the community know what you think. Share your opinions now!
it_user342780 (Senior Software Engineer Team Lead at THE ICONIC)
Vendor

Speed to get data into the platform is one of our most important metrics. We NEED to know what is going on right now, not 3-4 minutes ago.

reviewer1528404 (Engineer at a comms service provider with 1,001-5,000 employees)
Real User

1. Ability to correlate
2. Machine learning/AI-based thresholds
3. Ease of configuration (in bulk)

it_user364554 (COO with 51-200 employees)
Vendor

Hi,
Full disclosure: I am the COO at Correlsense.
Two years ago I wrote a post about just that, "Why APM Projects Fail". I think it can guide you through the most important aspects of APM tools.

Take a look, and feel free to leave a comment:
http://www.correlsense.com/enterprise-apm-projects-fail/
Elad Katav

reviewer1608147 (CTO at Kaholo)
User

There are so many monitoring systems in every company that produce alerts; this is what's called the alert-fatigue phenomenon. Moreover, most alerts are handled manually, which means it takes a long time and costs a lot of money to resolve an event. I think companies should evaluate the remediation part of monitoring systems: what is there to instantly and automatically resolve the problems that come up, instead of just alerting?

Ravi Suvvari
Real User

Tracing ability, such as record-level latency; capabilities to give good predictions in advance; history storage; pricing; support; etc.

David Fourie
Real User

Full stack end-to-end monitoring including frontend and backend server profiling, real user monitoring, synthetic monitoring and root cause deep dive analysis. Ease of use and intuitive UX. 

it_user229734 (IT Technical Testing Consultant at adhoc International)
Vendor

In order to evaluate/benchmark APM solutions, we can base the comparison on the five dimensions defined by Gartner:

1. End-user experience monitoring: the capture of data about how end-to-end application availability, latency, execution correctness and quality appeared to the end user.
2. Runtime application architecture discovery, modeling and display: the discovery of the various software and hardware components involved in application execution, and the array of possible paths across which those components could communicate that, together, enable that involvement.
3. User-defined transaction profiling: the tracing of events as they occur among the components or objects as they move across the paths discovered in the second dimension, generated in response to a user's attempt to cause the application to execute what the user regards as a logical unit of work.
4. Component deep-dive monitoring in an application context: the fine-grained monitoring of resources consumed by, and events occurring within, the components discovered in the second dimension.
5. Analytics: the marshalling of a variety of techniques (including behavior-learning engines, complex-event processing (CEP) platforms, log analysis and multidimensional database analysis) to discover meaningful and actionable patterns in the typically large datasets generated by the first four dimensions of APM.

Separately, we benchmarked some APM solutions internally based on the following evaluation groups:
Monitoring capabilities
Technologies and framework support
Central PMDB (Performance Management DataBase)
Integration
Service modeling and monitoring
Performance analysis and diagnostics
Alert/event management
Dashboards and visualization
Setup and configuration
User experience
and we got interesting results.
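For anyone running a similar internal benchmark, a simple weighted scoring matrix keeps the comparison repeatable. The sketch below is a generic illustration rather than the scoring actually used here; the groups, weights, tools and scores are made up.

```python
# Hypothetical weighted scoring of APM candidates across evaluation groups.
WEIGHTS = {
    "monitoring_capabilities": 0.25,
    "technology_support": 0.20,
    "integration": 0.15,
    "diagnostics": 0.20,
    "setup_and_configuration": 0.10,
    "user_experience": 0.10,
}

# Scores on a 1-5 scale, filled in by the evaluation team (illustrative values).
SCORES = {
    "Tool A": {"monitoring_capabilities": 5, "technology_support": 4, "integration": 4,
               "diagnostics": 5, "setup_and_configuration": 3, "user_experience": 4},
    "Tool B": {"monitoring_capabilities": 4, "technology_support": 5, "integration": 3,
               "diagnostics": 4, "setup_and_configuration": 4, "user_experience": 5},
}

def weighted_score(scores: dict) -> float:
    """Combine per-group scores into a single comparable number."""
    return sum(WEIGHTS[group] * score for group, score in scores.items())

for tool, scores in sorted(SCORES.items(), key=lambda kv: weighted_score(kv[1]), reverse=True):
    print(f"{tool}: {weighted_score(scores):.2f}")
```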

it_user178302 (Senior Engineer at a financial services firm with 10,001+ employees)
Real User

Most vendors have similar transaction monitoring capabilities, so I look at the end-user experience monitoring features to differentiate: not only RUM (mobile and web) but also active monitoring through synthetics.


Application Performance Management (APM) Articles

Shibu Babuchandran
Regional Manager/ Service Delivery at ASPL Info Services
Sep 14 2021
The Essential Guide to AIOps

What Is AIOps?

AIOps is the practice of applying analytics and machine learning to big data to automate and improve IT operations. These new learning systems can analyze massive amounts of network and machine data to find patterns not always identified by human operators. These patterns can both identify the cause of existing problems and predict future impacts. The ultimate goal of AIOps is to automate routine practices in order to increase accuracy and speed of issue recognition, enabling IT staff to more effectively meet increasing demands.

History and Beginnings

The term AIOps was coined by Gartner in 2016. In the Market Guide for AIOps Platforms, Gartner describes AIOps platforms as “software systems that combine big data and artificial intelligence (AI) or machine learning functionality to enhance and partially replace a broad range of IT operations processes and tasks, including availability and performance monitoring, event correlation and analysis, IT service management and automation.”

AIOps Today

Ops teams are being asked to do more than ever before. In a common practice that can sometimes even feel laughable, old tools and systems never seem to die. Yet the same ops teams are under constant pressure to support more new projects and technologies, very often with flat or declining staffing. To top it off, increased change frequencies and higher throughput in systems often mean the data these monitoring tools produce is almost impossible to digest.

To combat these challenges, AIOps:

•Brings together data from multiple sources: Conventional IT operations methods, tools and solutions aggregate and average data in simplistic ways that compromise data fidelity (as an example, consider the aggregation technique known as “averages of averages”). They weren’t designed for the volume, variety and velocity of data generated by today’s complex and connected IT environments. A fundamental tenet of an AIOps platform is its ability to capture large data sets of any type while maintaining full data fidelity for comprehensive analysis. An analyst should always be able to drill down to the source data that feeds any aggregated conclusions.

•Simplifies data analysis: One of the big differentiators for AIOps platforms is their ability to correlate these massive, diverse data sets. The best analysis is only possible with all of the best data. The platform then applies automated analysis on that data to identify the cause(s) of existing issues and predict future issues by examining intersections between seemingly disparate streams from many sources.

•Automates response: Identifying and predicting issues is important, but AIOps platforms have the most impact when they also notify the correct personnel, automatically remediate the issue once identified or, ideally, execute commands to prevent the issue altogether. Common remedies such as restarting a component or cleaning up a full disk can be handled automatically so that the staff is only involved once typical solutions have been exhausted.

Key Business Benefits of AIOps

By automating IT operations functions to enhance and improve system performance, AIOps can provide significant business benefits to an organization. For example:

•Avoiding downtime improves both customer and employee satisfaction and confidence.

•Bringing together data sources that had previously been siloed allows more complete analysis and insight.

•Accelerating root-cause analysis and remediation saves time, money and resources.

•Increasing the speed and consistency of incident response improves service delivery.

•Finding and fixing complicated issues more quickly improves IT’s capacity to support growth.

•Proactively identifying and preventing errors empowers IT teams to focus on higher-value analysis and optimization.

•Proactive response improves forecasting for system and application growth to meet future demand.

•Adding “slack” to an overwhelmed system by handling mundane work, allowing humans to focus on higher-order problems, yielding higher productivity and better morale.

Data Is Vital for AIOps

Data is the foundation for any successful automated solution. You need both historical and real-time data to understand the past and predict what’s most likely to happen in the future. To achieve a broad picture of events, organizations must access a range of historical and streaming data types of both human- and machine-generated data.

Better data from more sources will yield analytics algorithms better able to find correlations too difficult for humans to isolate, allowing the resulting automation tasks to be better curated. For example, it’s not hard in most semi-modern monitoring systems to automate some sort of response. However, if response times slow down an application, AIOps would help ensure the correct automated response and not just the “knee-jerk” response that’s statically connected. Adding more capacity to a service may in fact make a slowdown worse if the bottleneck isn’t related to capacity. And it certainly can result in unintended and unnecessary costs in cloud environments. Thus, having the right data to make more complete decisions results in better outcomes.

For total visibility, it's necessary to access data in one place across all of your IT silos. It's important to understand the underlying data supporting your services and applications, defining KPIs that determine health and performance status. As you move beyond data aggregation, search and visualizations to monitor and troubleshoot your IT, machine learning becomes the key to achieving predictive analysis and automation.

Key AIOps Use Cases

According to Gartner, there are five primary use cases for AIOps:

1. Performance analysis

2. Anomaly detection

3. Event correlation and analysis

4. IT service management

5. Automation


1. Performance analysis:

It has become increasingly difficult for IT professionals to analyze their data using traditional IT methods, even as those methods have incorporated machine learning technology. The volume and variety of data are just too large. AIOps helps address the problem of increasing volume and complexity of data by applying more sophisticated techniques to analyze bigger data sets to identify accurate service levels, often preventing performance problems before they happen.

2. Anomaly detection:

Machine learning is especially efficient at identifying data outliers — that is, events and activities in a data set that stand out enough from historical data to suggest a potential problem. These outliers are called anomalous events. Anomaly detection can identify problems even when they haven’t been seen before, and without explicit alert configuration for every condition.

Anomaly detection relies on algorithms. A trending algorithm monitors a single key performance indicator (KPI) by comparing its current behavior to its past. If the score grows anomalously large, the algorithm raises an alert. A cohesive algorithm looks at a group of KPIs expected to behave similarly and raises alerts if the behavior of one or more changes. This approach provides more insight than simply monitoring raw metrics and can act as a bellwether for the health of components and services.
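As a hedged illustration of the trending idea, here is a minimal single-KPI detector in Python. It assumes a simple rolling window and z-score; production AIOps platforms use far more sophisticated models, and the window size and threshold here are arbitrary.

```python
from collections import deque
from statistics import mean, stdev

class TrendingKpiDetector:
    """Raise an alert when a KPI deviates sharply from its own recent history."""

    def __init__(self, window: int = 60, z_threshold: float = 4.0):
        self.history = deque(maxlen=window)
        self.z_threshold = z_threshold

    def observe(self, value: float) -> bool:
        """Return True if this observation looks anomalous versus the window."""
        anomalous = False
        if len(self.history) >= 10:            # need some history before judging
            avg, sd = mean(self.history), stdev(self.history)
            if sd > 0 and abs(value - avg) / sd > self.z_threshold:
                anomalous = True
        self.history.append(value)
        return anomalous

# Hypothetical usage: response-time samples in milliseconds.
detector = TrendingKpiDetector()
for sample in [120, 118, 125, 119, 122, 121, 117, 124, 120, 123, 950]:
    if detector.observe(sample):
        print(f"Anomaly: response time {sample} ms")
```

A cohesive check, as described above, would run the same kind of logic across a group of related KPIs and alert when one of them diverges from its peers.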

AIOps makes anomaly detection faster and more effective. Once a behavior has been identified, AIOps can monitor and detect significant deviations between the actual value of the KPI of interest versus what the machine learning model predicts. Accurate anomaly detection is vital in complex systems as failures often exist in ways that are not always immediately clear to the IT professionals supporting them.

3. Event correlation and analysis:

The ability to see through an "event storm" of multiple, related warnings to identify the underlying cause of events. The reality of most complex systems is that something is always "red" or alerting. It's inevitable. The problem with traditional IT tools, however, is that they don't provide insights into the problem, just a storm of warnings. This creates a phenomenon known as "alert fatigue": teams see a particular alert that turns out to be trivial so often that they ignore the alert even on the occasions when it's important.

AIOps automatically groups notable events based on their similarity. Think of this as drawing a circle around events that belong together, regardless of their source or format. This grouping of similar events reduces the burden on IT teams and reduces unnecessary event traffic and noise. AIOps focuses on key event groups and performs rule-based actions such as consolidating duplicate events, suppressing alerts or closing notable events. This enables teams to compare information more effectively to identify the cause of the issue.
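A toy version of that grouping step might look like the sketch below: events are bucketed by a fingerprint (service plus normalized message) within a short time window, so one noisy condition produces one group instead of hundreds of alerts. The field names and window size are assumptions, not any particular product's schema.

```python
from collections import defaultdict

WINDOW_SECONDS = 300  # group events that occur within 5 minutes of each other

def fingerprint(event: dict) -> str:
    """Normalize an event so duplicates from the same condition look identical."""
    return f"{event['service']}::{event['message'].lower().strip()}"

def group_events(events: list) -> list:
    """Cluster similar events that are close together in time."""
    buckets = defaultdict(list)          # fingerprint -> list of groups
    for event in sorted(events, key=lambda e: e["timestamp"]):
        key = fingerprint(event)
        if buckets[key] and event["timestamp"] - buckets[key][-1][-1]["timestamp"] > WINDOW_SECONDS:
            buckets[key].append([])      # too far apart in time: start a new group
        if not buckets[key]:
            buckets[key].append([])
        buckets[key][-1].append(event)
    return [group for groups in buckets.values() for group in groups]

events = [
    {"service": "checkout", "message": "DB connection refused", "timestamp": 100},
    {"service": "checkout", "message": "db connection refused", "timestamp": 130},
    {"service": "search",   "message": "High GC pause",         "timestamp": 140},
]
for group in group_events(events):
    print(len(group), "event(s):", fingerprint(group[0]))
```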

4. IT service management (ITSM): A general term for everything involved in designing, building, delivering, supporting and managing IT services within an organization. ITSM encompasses the policies, processes and procedures of delivering IT services to end-users within an organization. AIOps provides benefits to ITSM by letting IT professionals manage their services as a whole rather than as individual components. They can then use those whole units to define the system thresholds and automated responses to align with their ITSM framework, helping IT departments run more efficiently.

AIOps for ITSM can help IT departments to manage the whole service from a business perspective rather than managing components individually. For example, if one server in a pool of three machines encounters problems during a normal-load period, the risk to the overall service may be considered low, and the server can be taken offline without any user-facing impact. Conversely, if the same thing were to happen during a high-load period, an automated decision could be taken to add new capacity before taking any poor-performing systems offline.
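A much-simplified version of that load-aware decision could look like the following sketch; the pool size, load figures and capacity numbers are invented for illustration.

```python
def handle_unhealthy_server(healthy_servers: int, current_load: float,
                            peak_capacity_per_server: float) -> str:
    """Decide what to do with one unhealthy server in a pool, given current demand."""
    # Capacity left if we pull the unhealthy node out of rotation right now.
    remaining_capacity = (healthy_servers - 1) * peak_capacity_per_server
    if current_load <= remaining_capacity:
        return "take server offline for repair (no user-facing impact expected)"
    return "add replacement capacity first, then take the poor performer offline"

# Normal-load period: 3-server pool, light traffic -> safe to pull the node.
print(handle_unhealthy_server(healthy_servers=3, current_load=400, peak_capacity_per_server=300))
# High-load period: same pool, heavy traffic -> scale out before removing it.
print(handle_unhealthy_server(healthy_servers=3, current_load=850, peak_capacity_per_server=300))
```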

In addition, AIOps for ITSM can help:

• Manage infrastructure performance in a multi-cloud environment more consistently

• Make more accurate predictions for capacity planning

• Maximize storage resource availability by automatically adjusting capacity based on forecasting needs.

• Improve resource utilization based on historical data and predictions

• Manage connected devices across a complex network

5. Automation: Legacy tools often require manually cobbling information together from multiple sources before it’s possible to understand, troubleshoot and resolve incidents. AIOps provides a significant advantage — automatically collecting and correlating data from multiple sources into complete services, increasing the speed and accuracy of identifying necessary relationships. Once an organization has a good handle on correlating and analyzing data streams, the next step is to automate responses to abnormal conditions.

An AIOps approach automates these functions across an organization’s IT operations, taking simple actions that responders would otherwise be forced to take themselves. Take for example a server that tends to run out of disk space every few weeks during high-volume periods due to known-issue logging. In a typical situation, a responder would be tasked with logging in, checking for normal behavior, cleaning up the excessive logs, freeing up disk space and confirming nominal performance has resumed. These steps could be automated so that an incident is created and responders are notified only if normal responses have already been tried and have not remedied the situation. These actions can range from the simple, like restarting a server or taking a server out of load-balancer pools, to more sophisticated, like backing out a recent change or rebuilding a server (container or otherwise).
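That disk-space scenario could be scripted roughly as below: try the known-safe cleanup first, verify, and only open an incident if the automated response did not help. The paths, threshold and notification call are placeholders for whatever your own tooling provides.

```python
import shutil
import subprocess

DISK_PATH = "/var/log/app"           # hypothetical volume with known-issue logging
USAGE_THRESHOLD = 0.90               # act when the disk is more than 90% full

def disk_usage_fraction(path: str) -> float:
    usage = shutil.disk_usage(path)
    return usage.used / usage.total

def clean_old_logs(path: str) -> None:
    # Known-safe remediation: delete compressed logs older than 7 days.
    subprocess.run(["find", path, "-name", "*.gz", "-mtime", "+7", "-delete"], check=True)

def open_incident(summary: str) -> None:
    print("INCIDENT:", summary)      # placeholder for a real ticketing/paging integration

def remediate_disk_pressure() -> None:
    if disk_usage_fraction(DISK_PATH) < USAGE_THRESHOLD:
        return                                        # nothing to do
    clean_old_logs(DISK_PATH)                         # automated first response
    if disk_usage_fraction(DISK_PATH) >= USAGE_THRESHOLD:
        # Normal response exhausted: now involve a human.
        open_incident(f"{DISK_PATH} still above {USAGE_THRESHOLD:.0%} after log cleanup")

if __name__ == "__main__":
    remediate_disk_pressure()
```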

AIOps automation can also be applied to:

•Servers, OS and networks: Collect all logs, metrics, configurations and messages to search, correlate, alert and report across multiple servers.

•Containers: Collect, search and correlate container data with other infrastructure data for better service context, monitoring and reporting.

•Cloud monitoring: Monitor performance, usage and availability of cloud infrastructure.

•Virtualization monitoring: Gain visibility across the virtual stack, make faster event correlations, and search transactions spanning virtual and physical components.

•Storage monitoring: Understand storage systems in context with corresponding app performance, server response times and virtualization overhead.

•Application monitoring: Identify application service levels and suggest or automate response to maintain defined service level objectives.

AIOps and the Shift to Proactive IT

One of the primary benefits of AIOps is its ability to help IT departments predict and prevent incidents before they happen, rather than waiting to fix them after they do. AIOps, specifically the application of machine learning to all of the data monitored by an IT organization, is designed to help you make that shift today.

By reducing the manual tasks associated with detecting, troubleshooting and resolving incidents, your team not only saves time but adds critical “slack” to the system. This slack allows you to spend time on higher-value tasks focused on increasing the quality of customer service. Your customer experience is maintained and improved by consistently maintaining uptime.

AIOps can have a significant impact in improving key IT KPIs, including:

• Increasing mean time between failures (MTBF)

• Decreasing mean time to detect (MTTD)

• Decreasing mean time to investigate (MTTI)

• Decreasing mean time to resolution (MTTR)

IT organizations that have implemented a proactive monitoring approach with AIOps have seen significant improvements across a variety of IT metrics.

How to Get Started With AIOps

The best way to get started with AIOps is an incremental approach. As with most new technology initiatives, a plan is key. Here are some important considerations to get you started.

Choose Inspiring Examples

If you’re evaluating AIOps solutions, platforms and vendors for your organization, you’ve got a big task ahead of you. The most challenging aspect may not be the evaluation process itself, but gaining the support and executive buy-in you need to conduct the evaluation.

If you choose inspiring examples of other, similar organizations that have benefited from AIOps — and have metrics to prove it — you’ll have a much easier time getting the go-ahead. A good partner can help you do that.

Consider People and Process

It’s obvious that technology plays an important role in AIOps, but it’s just as important to make a plan to address people and process.

For example, if an AIOps solution identifies a problem that’s about to happen and pages a support team to intervene, a responder might ignore the warning because nothing has actually happened yet. This can undermine trust in the AIOps solution before it has a chance to be proven in operation.

It’s also important to give IT teams the time to work on building, maintaining and improving systems. This vital work can’t be assigned as a side project or entry-level job if you expect meaningful change. Put your best people on it. Make it a high priority so other work can’t infringe on it. AIOps practices are iterative and must be refined over time; this can only be done with a mature and consistent focus on improvement.

You’ll also need to re-examine and adjust previously manual processes that had multiple levels of manager approval, like restarting a server. This requires trust in both technology and team practices. Building trust takes time. Start with simple wins to build cultural acceptance of automation. For example, be prepared to build historical reports that show previous incidents were correctly handled by a consistent, simple activity (such as a restart or disk cleanup) and offer to automate those tasks on similar future issues. Choose a solution that allows for “automation compromise” by inserting approval gates for certain activities. Over time, those gates should be removed to improve speed as analytics proves its value in selecting correct automation tasks.

Finally, include in your plans a campaign to reassure staff that AIOps is not intended to replace people with robots. Show them how AIOps can free up key resources to work on higher-value activities — limiting the unplanned work your teams have to endure each day.

The Bottom Line: Now Is the Time for AIOps

If you’re an IT and networking professional, you’ve been told over and over that data is your company’s most important asset, and that big data will transform your world forever. Machine learning and artificial intelligence will be transformative and AIOps provides a concrete way to leverage its potential for IT. From improving responsiveness to streamlining complex operations to increasing productivity of your entire IT staff, AIOps is a practical, readily available way to help you grow and scale your IT operations to meet future challenges. Perhaps most important, AIOps can solidify IT’s role as a strategic enabler of business growth.

Evgeny Belenky: Great article, @Shibu Babuchandran! Thank you for sharing your knowledge with…
Tjeerd Saijoen
CEO at Rufusforyou
Sep 03 2021

ICT is getting more and more complex: today I have several systems in Chicago, several more in Amsterdam, and if you need to protect your environment you will need to check on-premises systems, the cloud at Amazon, and the cloud at Microsoft Azure.

Why is Performance related to security?

For the following reasons: 

Today we need more than one tool to protect our environment. You need anti-spoofing, antivirus, firewalls, protection against DDoS, etc. All these tools can slow performance, and if you experience performance slowdowns, it affects both your end-users and your business.

This can affect your profits. For example, if I sell airline tickets online, without performance problems I can sell 10,000 an hour, but due to performance slowdowns I sell only 7,500. That is a loss of profit, and the planes will leave with empty seats.

If a hacker attacks our systems, a performance tool capable of detecting unusual behavior will alert us, because most of the time CPU usage will go up and transaction times will go down.

Are Security and Performance enough or do we need more?

If we take security and performance seriously, we need more. What do we need and why?

Automation is the key: if a hacker tries to penetrate your systems, you'll get alerts from your security and performance tools. Then you need to do something, and if you have to do it manually, an event will be sent to your service desk tool and a ticket will be created. Your helpdesk team will then start processing the ticket. Before this process is finished, a hacker could already have broken into your system.

Now suppose we have an automation tool. It is possible to automate everything. Some policies we activate in our automation tool need to block, for example, a part of the network, or require a system restart once the policy is activated.

Because of this, you have a lot of work to do separating the action rules. For example, golden rules require a restart and thus need to be scheduled through your change management process, unless they require immediate action. Silver rules require direct action, but with a review by a technical engineer before the action is taken. Bronze rules result in fully automated action, as sketched below.
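A rough sketch of how such rule tiers could be encoded follows. The tier names come from the article; the policies, event fields and dispatch logic are hypothetical placeholders, not any specific product's configuration.

```python
from enum import Enum

class Tier(Enum):
    GOLD = "gold"      # needs a restart: schedule via change management unless urgent
    SILVER = "silver"  # act directly, but only after an engineer reviews it
    BRONZE = "bronze"  # safe to execute fully automatically

# Hypothetical mapping of automation policies to tiers.
POLICY_TIERS = {
    "isolate_network_segment": Tier.GOLD,
    "block_suspicious_ip": Tier.SILVER,
    "restart_failed_agent": Tier.BRONZE,
}

def dispatch(policy: str, urgent: bool = False) -> str:
    """Decide how an automation policy should be executed."""
    tier = POLICY_TIERS[policy]
    if tier is Tier.BRONZE:
        return f"auto-executing {policy}"
    if tier is Tier.SILVER:
        return f"queueing {policy} for engineer review"
    # GOLD: restart-class change
    if urgent:
        return f"executing {policy} immediately (emergency change)"
    return f"scheduling {policy} through change management"

print(dispatch("restart_failed_agent"))
print(dispatch("isolate_network_segment", urgent=True))
```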

Now we have several tools to improve performance and secure our environment. 

What is the fiscal impact? A lot. If we calculate that, on average, a minimum of 3 agents or licenses are needed per server, each often costing around $70 per month, that equals $210 per server per month.

You’ll probably need one engineer to keep this running and several engineers to check the monitors. If I compare this amount with some vendors, it can easily become much more. 

Besides, the capacity those agents use is taken from my servers: you'll need more CPU resources, more memory, and more disk space.

Is it possible to reduce this? Yes, by using integrated software: we have 2 agents with integrated performance and security running in a SaaS delivery model for our customers, reducing the price and checking all kinds of environments for security, performance, networking, and automation.

If your systems are blocked by ransomware, it will be a lot more expensive. So a proactive approach, joined with automation, can protect your systems better - never 100%, but it will come close.

Shibu Babuchandran: Very good insights about correlation for security with performance.
Johann Delaunay: Interesting positioning and way of thinking, thank you very much for the…
Tjeerd Saijoen
CEO at Rufusforyou
May 06 2021

How are security and performance related to each other?

Today a lot of monitoring vendors are on the market; most of the time they focus on a particular area, for example APM (Application Performance Monitoring) or infrastructure monitoring. Is this enough to detect and fix all problems?

How are performance and security related?

Our landscape is changing rapidly. In the past we had to deal with one system. Today we are dealing with many systems in different locations: your own data center (on-premises); then on-premises plus, for example, AWS; and now on-premises plus AWS plus Azure, and it doesn't stop. Hackers now have more locations and a better chance of finding a weak spot in the chain, and if performance slows down, where is the problem?

Because of this you need many different monitoring tools, and they don't monitor your application or OS parameter settings. For example, I have a web server with a parameter that sets the number of concurrent users to 30. A monitoring tool will probably tell you more memory is required; you add more expensive memory and you get the same result, while the real solution is to adjust that parameter setting.

We have had several applications running for years while the total number of end users has grown rapidly; most people don't adjust the parameters because they are not aware that they exist or what the right values are.

How are performance and security related to each other? If systems are compromised, you will also see unusual behavior in performance: for example, a performance drop while more CPU is being allocated. For this you need monitors capable of looking holistically at the complete environment, checking parameter settings and alerting on unusual behavior. Also look for a single dashboard that covers your whole environment, including the cloud. Don't look for a flashy dashboard; a functional dashboard is more important. Most important: is the tool capable of giving advice on what to do, or does it only tell you there is a problem in the database without telling you that the buffer setting on DB xxx needs to be adjusted from 2400 MB to 4800 MB?

If we have the right settings, performance will increase, and better performance means more transactions. More transactions mean more selling and more business.

Caleb Miller: Good article, but the spelling and grammatical errors are pretty blatant.
Tjeerd Saijoen
CEO at Rufusforyou
Mar 29 2021

End-users can connect through different options: via the cloud (AWS, Microsoft Azure or other cloud providers), via a SaaS solution, or from their own datacenter. The next option is multi-cloud and hybrid - this makes it difficult to find the reasons for a performance problem.

Now users have to deal with many options for their network. You have to take into account problems such as latency and congestion, and Covid-19 has now added a new layer. Normally you work in an office as an end-user and your network team takes care of all the problems. Now everybody is working from home, and many IoT devices are connected to our home networks - are they protected? It is easy for a hacker to use these kinds of devices to enter your office network.

How can we prevent all of this? With a security tool like QRadar or Riverbed. The most important thing to know is that you don't need an APM solution only. Many times I hear people say, "We have a great APM solution." Well, that is great for application response times; however, an enterprise environment has many more components, like the network, load balancers, switches and so on. Also, if you're running Power machines you have to deal with microcode and sometimes with HACMP - an APM solution will not monitor this.

Bottom line: you need a holistic solution.  
