What is our primary use case?
We are in four public clouds: AWS, Azure, GCP, and Oracle Cloud, although we only have a small footprint in Oracle. We are monitoring all the virtual server environments as well as all the services in those environments, and alerting on various set points depending on what it is: virtual, server, or service.
We are also monitoring our colos. We have on-prem hardware, networking, and server solutions that we are monitoring with LogicMonitor. We are in both the cloud and on-prem. The breadth of cloud and on-prem that we have is a good use case for LogicMonitor.
How has it helped my organization?
We have very fine-tuned alerting that lets us know when there are issues and identifies exactly where each issue is, so we can troubleshoot and resolve it quickly, hopefully before the customer even notices. It also gives us some insight into potential issues coming down the road through our environmental health dashboards.
The breadth of its ability to monitor all our environments, putting it in one place, has been helpful. This way, we don't have to manage multiple tools and try to juggle multiple balls to keep our environment monitored. It presents a clear picture to us of what is going on.
When I first started, it was less granular in terms of the fine tuning and the ability to tune out specific servers running high CPU. Keeping a global general standard has really helped. We now modify the environment where we need to alert and ignore those areas where we're not as concerned. This has helped our company in ways that maybe management doesn't even realize, e.g., we're not waking up our engineers in the middle of the night. Therefore, there is more job satisfaction in being able to get a good night's sleep. For example, we had one team that was being alerted every couple hours, which was ridiculous when you're on call and need to sleep. This was one of my first prime objectives when I started: To improve the quality of life, so we don't have as much turnover in our engineering support staff.
What is most valuable?
At the top of the list of most valuable features is the ability to modify and add data sources, to use other people's data sources, and the LM Exchange itself. It gives LogicMonitor a lot of flexibility. It gives the end user the ability to monitor just about anything that can connect to a network and send data, which is nice. You can take the data sources for what you are trying to do, then modify and adjust them to your new parameters or use cases. With a lot of other applications, you either don't have the option at all (because you have to use what they have out-of-the-box) or it takes a lot of work to enable monitoring something new. That is the best thing about being an administrator of LogicMonitor.
I have written my own data sources in a number of cases. We have also leveraged existing data sources and modified them to fit our specific cases. We don't typically publish them, but I know with the LM Exchange that it's becoming easier to do that.
I know management very much likes the dashboard presentations that LogicMonitor has. They are very comprehensive. You can pull in other things and add them in as a widget. You can see more than just what is in LogicMonitor, as it gives a single pane of glass for whatever management is interested in or whatever environment they're looking at when they are monitoring software metrics. Then, it is all presented in one location, which is really nice.
We have SLAs for uptime on all our infrastructure: hardware, servers, and storage. I have spun up a number of services based on the specific metrics for all those devices, then determined SLAs based on the uptime of those metrics. We have a nice SLA dashboard that shows the uptime of all of our environments, so when my manager or his manager comes to me and asks, "What was the uptime of our environments or this area in storage?", I can quickly look at the dashboard and tell him. Therefore, I really like that feature.
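As a rough illustration of the arithmetic behind an uptime SLA, here is a minimal Python sketch. It is not LogicMonitor's own calculation; the function name, the sample encoding (True, False, None), and the `count_no_data_as_down` switch are assumptions made for the example. It shows why polling intervals with no data need an explicit policy, since they change the reported percentage.

```python
from typing import Optional, Sequence


def uptime_percent(
    samples: Sequence[Optional[bool]],
    count_no_data_as_down: bool = False,
) -> Optional[float]:
    """Return uptime as a percentage of polled intervals.

    Each sample is True (up), False (down), or None (no data collected).
    Hypothetical encoding, chosen only for this illustration.
    """
    up = sum(1 for s in samples if s is True)
    down = sum(1 for s in samples if s is False)
    missing = sum(1 for s in samples if s is None)

    # Policy decision: either ignore gaps or treat them as downtime.
    if count_no_data_as_down:
        down += missing

    total = up + down
    if total == 0:
        return None  # nothing usable to report on
    return 100.0 * up / total


# 98 good polls, 1 failed poll, 1 poll with no data
week = [True] * 98 + [False] + [None]
print(uptime_percent(week))        # gaps ignored
print(uptime_percent(week, True))  # gaps counted as downtime
```

Whichever policy is chosen, applying it consistently across reports matters more than the choice itself.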
Another dashboard that we find valuable is environmental health. We have a number of dashboards for all of our products. We have product teams for whom we created dashboards to look at the product, not just see what's happening now or in the past, e.g., what is currently having an issue. We also use it for forecasting, where we potentially might see an issue with storage on this server with a CPU that generally runs high or if there is an increasing trend in network traffic on the pipe. The environmental health dashboards have helped us stay ahead of potential issues that were coming down and ensure we had uptime for our customers' environments.
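To make the forecasting idea concrete, here is a minimal Python sketch that fits a straight line to daily storage usage and estimates how many days remain before a limit is reached. It is only an illustration of trend-based forecasting under the assumption of one sample per day and roughly linear growth; it is not how LogicMonitor computes its forecasts, and the function name and parameters are hypothetical.

```python
from typing import Optional, Sequence


def days_until_full(
    usage_percent: Sequence[float], limit: float = 100.0
) -> Optional[float]:
    """Estimate days until usage reaches `limit`, via a least-squares line.

    Assumes one sample per day, oldest first. Returns None when there is
    too little data or the trend is flat or decreasing.
    """
    n = len(usage_percent)
    if n < 2:
        return None

    x_mean = (n - 1) / 2
    y_mean = sum(usage_percent) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(usage_percent))

    slope = sxy / sxx
    if slope <= 0:
        return None  # not growing, so no projected fill date

    intercept = y_mean - slope * x_mean
    day_at_limit = (limit - intercept) / slope
    # Express the result relative to the most recent sample.
    return max(0.0, day_at_limit - (n - 1))


print(days_until_full([70, 72, 74, 76, 78]))
```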
LogicMonitor has the flexibility to enhance networking gear as well as handle our unique environment: servers, hardware, cloud, and Kubernetes. There are a lot of features that we like about LogicMonitor.
I would rate it a nine out of 10 in terms of alerting. It is doing everything that we wanted it to do. We did a lot of tweaking in the last year and a half. In the last two years, since I have gotten really familiar with the product, I have been able to mesh with the teams to learn what we need to alert on. Previous to my arrival, we were sending a lot of alerts to teams, waking them up in the middle of the night. We have cleaned up a bit of their garbage so we are pretty clean in terms of what we're alerting on. It is doing a good job of letting us know when there is a problem in the environment, which is nice.
What needs improvement?
I have struggled a bit with the SLA calculations though, because I have some issues with the reporting having no data. However, I have worked around those issues and we have a solid process for reporting the SLA.
Automated remediation of issues has room for improvement. I don't know how best to handle it, but I know that they're kind of working on it. I know there are some resources that can do automated remediation. I would like them to improve this area so it could be completely hands-free, where it detects an issue, such as, if a CPU is running high. There are ways to do it even now, but it's a bit more involved. Also, for a LogicMonitor program, it really depends upon the hardware and environment that it is running on to make that call.
In terms of when it alerts, there are times when we do get alert storms because one device kind of fails on an interface where there are a number of things. Even if only one out of the five things on the interface fails, then everything on the interface will alert.
I would like it to be able to create network maps and connectivity structures so you don't have to do it manually. This piece hasn't been a big hitch for us, but I imagine there are other customers who would really like to see the mapping piece of it grow and become a little bit more automated.
For how long have I used the solution?
I personally have been using it for almost three years. The company has been using it for six years.
What do I think about the stability of the solution?
The stability is very good. There are times when we get specific alerts based on if there are issues with this piece or that, but those generally haven't affected us.
What do I think about the scalability of the solution?
It can handle scaling. It is like any other cloud service. There is a cost associated with scaling, so we currently don't monitor all of our environments. We monitor just the customer-facing production environments. It would be nice if we could monitor our dominant environments, but we will have to pay a lot more due to the scaling issue. So, there's a balance there between what we would like and what we are willing to pay for.
We have had issues in the past with data collection. Maybe it is due to pushing the limits of what LogicMonitor can do, or even the devices it's monitoring. For example, we have a couple of F5s that are heavily used with a number of data sources on them, and the SNMP couldn't actually pull all the information back in time, which was causing blind spots.
We have probably close to 100 users who use LogicMonitor, not all of them on a regular basis:
- We have infrastructure engineers who maintain the infrastructure of our environment.
- We have product engineers who maintain the IT server environments for the products. They work closely together with the infrastructure engineers.
- We have our automation team and DevOps team who use LogicMonitor to do performance modeling on their environment and learn the automation processes that they have. They also use the API fairly heavily.
- We have software engineers on the teams who are monitoring specific server processes.
There are heavier and lighter users in all those areas. We have primary admins who administer LogicMonitor, and we're the heaviest users of it.
How are customer service and technical support?
Their technical support is very good. When we have an issue, they are usually knowledgeable enough to handle it. If not, they at least know what the issue is. It seems like they're sitting right next to a DevOps software engineer because it doesn't take them long to escalate to the developers. They are very good at getting back to us. I would give them 10 out of 10 in terms of their response.
Which solution did I use previously and why did I switch?
LogicMonitor has become our standard for all the products. Each product is basically an acquisition, e.g., we got rid of Datadog recently and phased out Splunk. The other solutions all came with their own tools, and we have gotten rid of all those other tools. A lot of that happened before I joined.
How was the initial setup?
I was not involved in the initial setup.
I was at the company for enabling the cloud and Kubernetes, which was a fair amount of work to pull that information in and reconfigure the cloud devices. We had them monitored as regular resources, but needed to migrate them over to monitoring them as cloud devices. It was a fair amount of work with no good way to automate it.
What was our ROI?
We haven't had as big a cost for downtime, so that has saved us a lot of money.
I am on a call every Monday where we evaluate all the alerting that has been done in the previous week. We have gone from constant complaints two years ago down to basically nothing.
When we spin up new servers and network devices, we have NetScans that are going on in LogicMonitor. It's a weekly scan on each subnet. If it detects a new device, then it will look it up in the DNS. From there, we have everything named appropriately, such that LogicMonitor can, using property sources, figure out who the device belongs to and what the device does. This is in addition to it doing standard SNMP network monitoring for the device to determine what it is. It uses that information, along with the name and property sources, to automatically assign where that device goes in our resource tree, then starts monitoring that device. That has been a lot of work, but it has been very fruitful in terms of being able to be hands-free and hands-off for bringing new devices into LogicMonitor. This saves us about five man-hours a week.
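The name-based placement described above can be sketched in a few lines of Python. This is only an illustration of the idea, not an actual LogicMonitor property source: the naming convention (`<team>-<role>-<env>-<index>`), the `classify` function, and the `group_path` layout are all hypothetical.

```python
import re
from typing import Dict, Optional

# Hypothetical convention: <team>-<role>-<env>-<index>, e.g. "billing-web-prod-03"
NAME_PATTERN = re.compile(
    r"^(?P<team>[a-z]+)-(?P<role>[a-z]+)-(?P<env>prod|stage|dev)-(?P<index>\d+)$"
)


def classify(hostname: str) -> Optional[Dict[str, str]]:
    """Derive owner, role, and a resource-tree path from a DNS name.

    Returns None when the name does not follow the convention, so the
    device can be routed to a manual review group instead.
    """
    short_name = hostname.split(".", 1)[0].lower()  # drop the domain part
    match = NAME_PATTERN.match(short_name)
    if match is None:
        return None

    props = match.groupdict()
    props["group_path"] = f"{props['env']}/{props['team']}/{props['role']}"
    return props


print(classify("billing-web-prod-03.example.com"))
print(classify("printer7"))
```

The approach only works as well as the naming discipline behind it, which is why names that don't match are returned as None rather than guessed at.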
Which other solutions did I evaluate?
When we were evaluating software packages (and we were already using LogicMonitor at that point), LogicMonitor became one of the few solutions that ended up on our short list because it can handle cloud and on-prem. They are really good at both. Solutions like Datadog don't give you the option to monitor on-prem hardware. They assume that you are just in the cloud, because why would anyone be on-prem when there is cloud available? Then you can spend a lot of money in the cloud.
What other advice do I have?
We have used dynamic thresholds in only a couple of cases. We didn't necessarily see the application of dynamic thresholds in looking at critical alerts, so we haven't used them a whole lot. Also, we haven't really leveraged the AI pieces of LogicMonitor. We are at a point with our tuning where we haven't needed to do so. If teams started complaining about specific alerts, like specific servers showing increasing or decreasing trends, then we would probably do it, but we have been able to handle those concerns with static thresholds at this point.
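For readers unfamiliar with the distinction, the following Python sketch contrasts a static threshold with a simple dynamic one. The dynamic check here is a plain mean-plus-or-minus-k-standard-deviations band, chosen only for illustration; it is an assumption of this example and not LogicMonitor's algorithm. The function names and the value of `k` are hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence


def static_alert(value: float, threshold: float) -> bool:
    """Alert when the value crosses a fixed set point."""
    return value > threshold


def dynamic_alert(value: float, history: Sequence[float], k: float = 3.0) -> bool:
    """Alert when the value strays more than k standard deviations
    from the mean of recent history (illustrative band only)."""
    if len(history) < 2:
        return False  # not enough history to define "normal"
    baseline = mean(history)
    spread = stdev(history)
    return abs(value - baseline) > k * spread


cpu_history = [40, 42, 41, 39, 43, 40, 41]
print(static_alert(60, threshold=90))  # below the fixed set point
print(dynamic_alert(60, cpu_history))  # far outside the usual range
```

A static threshold stays quiet until a hard limit is crossed, while a band like this flags unusual behavior for that particular server, which is the kind of trend-based complaint mentioned above.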
I would rate the solution a nine out of 10.