What is our primary use case?
We use it to make sure that proper tuning is done for the existing monitoring.
In addition, our university has a number of schools and each is a customer of the main IT organization that manages and provides support for all the colleges, like the law school, the business school, the medical school, the arts school, etc. The goal, and one of the main use cases that we were planning and thinking about, was to be able to onboard all the devices, all the applications, all the databases, as required by individual schools.
We also wanted them to be able to create their own dashboard, tweak it, manage it, delete from it, and add to it.
It's deployed as a SaaS model. LogicMonitor is out in the cloud.
How has it helped my organization?
When we were using Nagios, we had alerts, but there was only red, yellow, and green. Here, the good thing is that you have escalation: level one, two, and three, which are clearly defined, along with the action that needs to be taken at each level. The clear escalation chain and tuning help, because we don't want to wake up the director for 80 percent of the cases. That would be ridiculous. But when necessary, the right people should be alerted, especially for the production environment. If something has been "red" or there has been no interaction for half an hour, it's important to know that and to take the necessary actions.
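The escalation behavior described above can be sketched in a few lines. This is only an illustration of the idea, not LogicMonitor's implementation or API: the `Stage` class, the `current_stage` function, the recipient addresses, and the 30-minute waits are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Stage:
    """One level of a hypothetical escalation chain."""
    level: int
    recipients: tuple
    wait: timedelta  # time without acknowledgement before the next level is used


# Illustrative chain only; addresses and intervals are made up for this sketch.
CHAIN = (
    Stage(1, ("owning-team@example.edu",), timedelta(minutes=30)),
    Stage(2, ("team-lead@example.edu",), timedelta(minutes=30)),
    Stage(3, ("director@example.edu",), timedelta(0)),  # last level, nothing beyond it
)


def current_stage(raised_at, now, acknowledged, chain=CHAIN):
    """Return the stage to notify right now, or None once the alert is acknowledged."""
    if acknowledged:
        return None
    elapsed = now - raised_at
    for stage in chain[:-1]:
        if elapsed < stage.wait:
            return stage
        elapsed -= stage.wait  # this level's window has passed, move to the next
    return chain[-1]


raised = datetime(2024, 1, 1, 9, 0)
# 45 minutes without acknowledgement: level 1's window has expired, level 2 is active.
print(current_stage(raised, raised + timedelta(minutes=45), acknowledged=False).level)  # prints 2
```

The point of the design is that the director-level stage is only reached after the earlier levels have had their chance to respond, which is what keeps most alerts away from senior staff.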
That's a key thing, being a production-operations team member, because I don't want my team to be flooded with all the noise of alerts for something which can be tackled by a specific team. Having escalation chains, so that the alert goes to the right team to look into it and take action, means the prod-ops team doesn't even need to look into it. We don't even need to ticket it. We only stay aware of it through the daily alert dashboards. That has made a big difference in our overall resource planning, because previously we had 400 to 450 daily alerts. By using this feature we cut that down to 150 to 200 which are "candidate alerts" that production-operations needs to take action on. They may require creating a ticket, or calling the right people, or doing some activity that needs intervention or escalation to the next level. We have been able to cut down on our resources. We don't need to have four members actively looking into the dashboard. We can validate things with one or two employees.
LogicMonitor has also helped to consolidate the number of monitoring tools we need. We had some third-party monitoring, four or five things, and they're all consolidated with LogicMonitor. The only exception is IBM Tivoli Workload Scheduler. But what we did was we integrated that via Slack. I'm not really sure why we weren't able to consolidate TWS. The plan is to get rid of TWS, but we could not do so immediately, until there is an alternate route. But apart from that, everything has been consolidated using LogicMonitor.
We were especially able to consolidate third-party cloud monitoring for AWS. There were discussions about how we could also integrate or combine Azure monitoring resources through LogicMonitor. The team has mentioned that it has plug-ins that it can use to combine that. We also had separate backup scheduling software, a tool that had separate monitoring, and that has also been combined with LogicMonitor.
And LogicMonitor has absolutely reduced the number of false positives compared to how many we were getting with other monitoring platforms. At a minimum, they have been reduced by 50 percent. Further tuning and working through the learning curve helped to bring the number down. Within the first two or three months, we were able to bring the false positives down by 50 percent. That's a big achievement. That is the main reason we initiated this project of getting into LogicMonitor. There have been further talks internally about how we can eliminate them further, and bring them down by 70 percent compared to the false positives we were getting. That's our goal. So far, it has reduced the time we used to spend on them by 50 percent, both offshore and onsite, as we have an offshore team in India that works 24/7. We used to have multiple people in each shift and we have reduced that down to a single person in each shift. That's a big step in the right direction.
What is most valuable?
Tuning is one of the main components. We like to make sure that only the right alerts are escalated, and that alerts are being sent to the right members, as opposed to every alert being broadcast to everybody. The main thing is the escalation chains. We feel that is a very good thing, rather than sending all the information to everybody at each level. Making those sorts of changes doesn't require much effort out-of-the-box. You just need to create the basic entities, such as the different people, the contacts or email groups, and the data sources and events that should trigger alerts.
Another feature from the technical aspect, the back-end, is the ability to allow individual users or customers to have their own APIs. They're able to make changes using the plugins covered by LogicMonitor. That is a very powerful feature that is more attractive to our tech-savvy customers.
In terms of basic functionality, from a normal user's perspective, the escalation chains and the tuning part that are embedded in LogicMonitor are the two most important things.
Among my favorite dashboards are the alert dashboards. Being a prod-ops team, we took the out-of-the-box alerts dashboard given by LogicMonitor and we have kept on tweaking it by adding more columns and more data points. The alert dashboard is something which is very key for us as a team. In general, it gives us more in-depth information about uptime, the SLAs, etc. LogicMonitor has done a good job of providing very user-friendly dashboards, out-of-the-box. There are so many things that we are still learning about it, how we can use it better, but the alerts dashboard is my favorite.
The reporting is something which I have explored, to send me an email every day with how many alerts, in particular how many critical alerts, there were. It's a good starting point. The reporting can be sent in both HTML and Excel and is accessible on the dashboard after you log in. These two things are very good. This is the first feature I looked at once we went live, because I want to know things on a day-to-day basis and a weekly basis. I activated the email feature because I want it to send daily, weekly, and monthly reports of my alert dashboard data.
We use LogicMonitor's ability to customize data sources and it's a must, because ours is a very heterogeneous, complex environment. Changing data sources is important for at least some of the deployments. For other organizations, it may not really be required to change the default data sources provided by LogicMonitor. But here, it was important to change them. That's where the capabilities of the embedded APIs really helped us. I'm not part of the team that makes those changes, but I worked actively with the teams that did, and I always got very positive feedback from them on how they would get the right answers from LogicMonitor. They had to make a lot of changes to the data sources, for each customer, and it worked out well.
What needs improvement?
There are a few things that could have been done better with the reporting. It could have a more graphical interface.
The dashboards can be improved. They are good, but there is a pain point. To show things to management, to explain pain points to other customers, to show them exactly where we can do better, the dashboarding could be better. Dashboards need to show the key things. Nobody is going to go into the ample details of Excel sheets or HTML.
Automation can also be improved.
Finally, while this is a very good tool for monitoring and responding, if there was a way they could do something like PagerDuty or another third-party solution for alerting, integrate both monitoring and alerting, that would be an ideal scenario.
For how long have I used the solution?
I have been using LogicMonitor for close to a year. If I remember correctly, LogicMonitor was implemented in my organization as a replacement for Nagios. I was actively involved in that project right from the beginning of verification through going live. In the initial stages we may not have been actively using it, but we started learning about the tool and how to implement it about a year ago.
What do I think about the stability of the solution?
Overall, the stability has been good. We didn't have any issues during the phase after we set up and went live.
The performance was also pretty good. We didn't have to wait for a response for any of the attributes on the dashboard or reporting.
LogicMonitor has the ability to alert you if the cloud loses contact with the on-prem collectors. We had a challenge within one or two months of deployment. The problem was the way we were using the collectors. We were actually using our Nagios server as one of the collectors. We were trying to eliminate that server altogether, because it was giving duplicate alerts.
Initially, we had a challenge of not getting any alerts when the connection to the collector was lost. Later on, we found that some routing-table or firewall changes were needed. I would attribute that more to the learning curve and to finding out what the best practices are.
Since correcting that problem, we haven't had an issue of any collector being down. There's no question about any of the alerting.
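The general idea behind a "collector down" alert is a heartbeat check: if a collector has not reported within some window, it is treated as unreachable. The sketch below shows that idea only. It is not how LogicMonitor implements it, and the function name, the collector names, and the five-minute threshold are assumptions for illustration.

```python
from datetime import datetime, timedelta


def stale_collectors(last_seen, now, max_silence=timedelta(minutes=5)):
    """Return the names of collectors whose last heartbeat is older than max_silence.

    last_seen maps a (hypothetical) collector name to the time of its last heartbeat.
    """
    return sorted(name for name, seen in last_seen.items() if now - seen > max_silence)


now = datetime(2024, 1, 1, 12, 0)
heartbeats = {
    "collector-a": now - timedelta(minutes=1),
    "collector-b": now - timedelta(minutes=12),  # silent for too long
}
print(stale_collectors(heartbeats, now))  # prints ['collector-b']
```

A check like this only works if the heartbeat traffic can actually reach the cloud side, which is why routing and firewall rules matter so much here.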
What do I think about the scalability of the solution?
When we provided information about the number of servers, the number of end-users, and the number of networks that were part of Nagios back then, the impression we got from LogicMonitor was that they could expand and double that, if things were to grow. There is scalability in that environment to support a big data buffer. So there should not be any problem with scalability.
In terms of DR, discussions are still going on as to what would happen if there were a disaster.
As a whole, the organization has to use a monitoring tool. It could be Nagios, it could be LogicMonitor. There was a phase in which most of the schools were using both in parallel. But one after another, they are all happy to be using LogicMonitor. Usage-wise now, it's only LogicMonitor. Nagios has been cut down, so nobody is looking for any monitoring system apart from LogicMonitor.
There are some schools that still need to tweak it and tune it, because they have not given it much attention or have not really been required to actively monitor their solutions. We know where the priorities are, which school is the top priority and which schools were using Nagios more actively. But all the major customers that were using Nagios, once we unplugged it, have been happy with the LogicMonitor implementation. There are a few schools which are not actively using any monitoring system. They may get to the stage of actively using it, but, university-wide, everybody is using LogicMonitor. There is no other monitoring tool out there.
How are customer service and technical support?
We have evolved and have kept on making changes, as per the requirements of the customers, and one good thing about LogicMonitor is that it has a very good support system. We have had chat sessions with them to ask questions which help each school, and the IT organization as a whole, to evolve a better monitoring and alerting tool.
The way LogicMonitor support responded during our initial setup was amazing. That's something I really enjoyed a lot. They never said something like, "This question should not be asked," or "This question is not a candidate for the chat session." For every question we would get a reasonably quick answer which we would be able to implement right away. They would also log in remotely and help if something was beyond an individual's capability. That helped us to migrate and complete this process more quickly. LogicMonitor has a highly talented support team that can answer the questions and help the customer right away. It's been wonderful.
I don't see that happening with all vendors. With other organizations, when you submit questions in the chat session, they'll take the request and they'll say, "Okay, we'll get back to you." LogicMonitor — and it's a differentiating factor — is there to provide solutions right away, rather than putting it into their ticketing system and escalating to level-2 and to level-3.
I really don't know if that level of service is only for specific customers, based on the contractual terms and conditions, or if it is the way they do it for everybody. If this is the way they do it for every customer, they should definitely be very proud of the way they are doing it. Their team is there to help support the customer instantly, versus taking their own sweet time.
I would encourage LogicMonitor to continue that same level of expertise, of people being there 24/7 to support customers. That would be a big differentiating factor compared to competitors.
Which solution did I use previously and why did I switch?
The main reason for migrating to LogicMonitor from Nagios was to eliminate the noise of alerts. It may have been because alerts were not properly tuned, but the visibility with Nagios was not complete. It became a bottleneck.
Only one or two people had active access to tune things. If anything had to be done, there was just one guy who had to do it. We wanted to move towards a self-managed model. LogicMonitor is a solution which can be in that category, once it's deployed and there is a transfer of knowledge to each school.
We want each department to self-manage: manage their own dashboards and create their own reports based on their requirements. If they have a new device coming up, they can spin up a new AWS instance and onboard that, etc. It's the initial phase which is going to be challenging. But once we have the handover call with the individual customer, it's going to be easy, and that was not possible in Nagios.
We also wanted to have a proper escalation chain, which was not present in Nagios. That's something we have made use of in LogicMonitor.
Finally, we switched to use fewer resources and to speed up turnaround.
How was the initial setup?
The initial setup is complex. It's too picky. I'm a hands-on technical guy and, although I don't call myself an SME, I know everything from networking, servers, databases, and firewalls to clustering, support, and operations. The initial phase is definitely a little bumpy for somebody who's not completely technically savvy. I understand that it's because there are so many features involved, and there are so many ways of onboarding and using the custom APIs, etc. To me, LogicMonitor looks like too much of a tech-savvy company. There's good and bad in that. It depends on how you look at it.
The automated and agentless discovery, deployment, and configuration are good. We used that a lot initially. They did a good job with that. One thing that could be done is to make the naming conventions — adding different names like the IPs, the DNS lookup — a little better. They could eliminate some of the duplicate entries when you're onboarding it. I saw a lot of duplicate entries, which goes into the licensing. Apart from that, the way they provide a template or a flat file to the system for onboarding is good.
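One way to guard against duplicate entries before onboarding is to clean the flat file first. The sketch below is a minimal illustration under assumed conditions: the `hostname` and `ip` column names, the sample rows, and the `dedupe_devices` helper are hypothetical and do not reflect LogicMonitor's actual import format.

```python
import csv
import io


def dedupe_devices(rows):
    """Keep the first row seen for each IP address or normalized hostname."""
    seen_ips, seen_names, unique = set(), set(), []
    for row in rows:
        ip = row["ip"].strip()
        # Lowercase and drop a trailing dot so "WEB01.example.edu." matches "web01.example.edu".
        name = row["hostname"].strip().lower().rstrip(".")
        if ip in seen_ips or name in seen_names:
            continue  # same device listed again under another name or address
        seen_ips.add(ip)
        seen_names.add(name)
        unique.append({"hostname": name, "ip": ip})
    return unique


# Hypothetical flat file: the same web server appears three times.
SAMPLE = """hostname,ip
web01.example.edu,10.0.0.5
WEB01.example.edu.,10.0.0.5
10.0.0.5,10.0.0.5
db01.example.edu,10.0.0.9
"""

devices = dedupe_devices(csv.DictReader(io.StringIO(SAMPLE)))
print(len(devices))  # prints 2
```

Because duplicates count toward licensing, removing them before import is cheaper than cleaning up afterwards.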
As for monitoring things out-of-the-box, it seemed that our database team spent more time in configuring stuff, whether MySQL or Oracle, etc. Now, LogicMonitor has come up with a very easy way for configuring and monitoring database components out-of-the-box. But that's something which I felt was a little bit of a pain point. I don't know whether it was that our team made it more complicated or LogicMonitor didn't handle it out-of-the-box.
Apart from that, LogicMonitor has done a good job of out-of-the-box monitoring of the basic resources within the servers — memory, CPU, disk configuration, etc. — as well as for HTTP, the web components.
While I wasn't actively involved in the planning for the implementation, I picked up things from the team which was actively involved in planning and implementation. The process was primarily to engage with LogicMonitor. Our team — the product owner and team members — worked together and was in touch with LogicMonitor to gather all the existing features that were available and how we would make use of all that. That was the initial phase during which we got to know the product completely.
We mapped all of the devices which were in Nagios to make sure we onboarded everything that was in Nagios to LogicMonitor.
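A mapping exercise like this comes down to a set comparison between the two inventories. The following is a rough sketch only: the host names are invented, and how the two lists would be exported from Nagios and LogicMonitor is left out because the review does not describe it.

```python
def normalize(name):
    """Lowercase a host name and drop surrounding whitespace and any trailing dot."""
    return name.strip().lower().rstrip(".")


def missing_from_target(source_hosts, target_hosts):
    """Return hosts present in the old inventory but absent from the new one."""
    target = {normalize(h) for h in target_hosts}
    return sorted({normalize(h) for h in source_hosts} - target)


# Hypothetical inventories for illustration.
old_inventory = ["web01.example.edu", "DB01.example.edu.", "mail01.example.edu"]
new_inventory = ["web01.example.edu", "db01.example.edu"]
print(missing_from_target(old_inventory, new_inventory))  # prints ['mail01.example.edu']
```

An empty result would mean every device from the old system has a counterpart in the new one, which is the condition to check before unplugging the old monitoring.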
We had several internal discussions where we told the schools how we were actively engaging with LogicMonitor to make sure that we would go in phases. The initial phase was knowledge-transfer, the second one was to onboard a school, or at least one application, to make sure that it was tested completely and then remove that from Nagios. We took time to make sure that they were getting proper monitoring and proper alerts, out-of-the-box.
While doing that, we found that there were a few things which were not properly configured in LogicMonitor, compared to Nagios. The goal was to improve on Nagios, minimize the false alerts, and have better features for reporting, dashboarding, escalation chains, etc.
We had six to seven people actively involved in the process. Two to three were purely technical, and made use of LogicMonitor support very extensively, especially for some of the customized activities like using custom APIs. From the LogicMonitor side, there were two to three members from the front-office who were actively involved, and on the technical side they designated a couple of people whom we could directly contact on a day-to-day basis. We had a daily, separate session with each of our teams, like networking, business, operations, and DevOps, so that each team could ask questions about its pain points and get better information so that we could do things ourselves and, for things that were beyond us, to learn how they could help. We had a month of one-on-one sessions with them, every day, for two or three hours.
When we initially started the engagement with the LogicMonitor team, they came onsite to run a one-week session with all the key stakeholders: the customers, the technical team, and back-end operations team. That was a very useful session that helped kickstart things. At that point, not everybody knew completely how LogicMonitor works and how we could plan to migrate from Nagios to LogicMonitor. What were the things that we could retain? What were the things that we could just ignore? Overall, the exposure to LogicMonitor during that one-week phase, in terms of customer-engagement, was really a great experience for me. We also had the ability to quickly use the chat session online and ask questions.
The implementation team's role and its way of engaging with the customer was amazing. That's something which I really appreciated. That helped me. Once the engagement was over and the contract started, the online support was available. If we had a problem, we could type in our question or our problem right away. The support team would respond and fulfill our requirements. They would fix the problem.
Our deployment took two to three months. That includes the visits by the LogicMonitor team to do some knowledge transfer and give hands-on experience to some of the key stakeholders. But during that time, not all places within the university were onboarded. Some schools were not really interested. I don't think they were properly updated. That was more of an internal issue, because we were doing our own "selling" to tell them what the differences are between LogicMonitor and other things. We had to tell them that Nagios was going to be pulled and that they would be completely in the dark if they were not moving to LogicMonitor. So during those three months, there were still quite a few schools which were not migrated to LogicMonitor or didn't onboard all of their resources. But the majority of them were done in three months.
In terms of maintenance, we have three to four people involved. One guy was actively involved in the Nagios implementation and its maintenance. He was part of decommissioning that and completely taking ownership of LogicMonitor's technical aspects. One person is the product owner who interacts with all the stakeholders, the different schools, to make sure that they have their requirements met using LogicMonitor. One is a manager. And there is a person from the business point of view, who provides his pain points, and what they're seeing on a day-to-day basis. So those four people are actively dedicated — I would not call it to maintenance — but to the day-to-day LogicMonitor stuff.
There are the users as well. Each school has its own applications and services that they offer internally. I don't have exact numbers but there are about 20 of them.
What was our ROI?
It allows us to accomplish more with less by minimizing the false alerts.
And by giving the "keys" to the individual owners, it makes things faster.
Also, as I mentioned, we don't need to have as many people in each monitoring shift, in the 24/7 environment. Previously, we had alerts that went to everybody and everybody was up and looking into why we had a given problem. Now that we are splitting the problems into different buckets, we are not tapping into all our resources' time. That's an area where we're saving. As a rough ballpark, we are saving about 50 percent of the resources from an operations perspective.
What's my experience with pricing, setup cost, and licensing?
We have a separate team involved in licensing. I wasn't involved in that.
Which other solutions did I evaluate?
I believe they evaluated two or three other tools, but I was not part of that process.
What other advice do I have?
For the initial phase, rather than having only one or two functional guys participating, it's always good to have one or two technical folks in the discussions. That helps a lot. You don't want surprises if an organization decides to go live with this tool, and then realizes that technical things are not on board with the ideas of the functional team. That's something I can say based on my journey and experience.
Another important thing is to keep having internal conversations, so that you value and give importance to everybody. It's good to educate them. Use the help of the LogicMonitor support team for internal question/answer sessions and do anything that will help them feel more comfortable. It's not about two or three members being really happy with this. LogicMonitor is something which can only be successful in automation if all the key teams and team players are on the same page.
The biggest lesson has been how we could make everybody be part of the mission. Previously, monitoring used to be in the hands of one or two, and each of them had a lot of overhead to deal with. But by doing this, we have reduced the complaints from individuals and each stakeholder. They know how they're configured. They know what the escalation chain is, so they're confident. If there is something not working, it's because of the way they have it configured.
By doing this we have minimized the internal noise. We have given everyone the opportunity to know the pain involved in monitoring and what it takes to have a better monitoring system in place, and how each person can contribute and think outside the box. They know how to put into place the right parameters and the right numbers. Previously, 70 or 80 percent of things were escalated internally. There was no involvement of the particular customer. If there was a problem for a team, it was somebody's problem, not their problem. Now, it has all become their problem. This is a very high-level benefit of using tools like LogicMonitor, which involves everybody more.
I would give LogicMonitor an eight out of 10. There are a few things that LogicMonitor is also learning from their experience with the customer. Most of the customers are giving feedback to LogicMonitor for improvements and to make changes. I'm sure that very soon it will be a 10, but at this point in time, from my experience and journey, it's an eight.