What is our primary use case?
We are using version 5.9.5 as well as 5.9.7. Ours is a huge database infra so we are using two different environments to monitor our corporate and store servers.
We have thousands of DB servers that have to be monitored across our environment, which includes Lowe's stores across the U.S. We support the infra and the monitoring for all the stores and the store-related applications, as well as the servers which support those applications. With the DB servers being an integral part of all those applications, we thought we should have a separate monitoring tool just to monitor them.
Foglight ensures that we have a dedicated monitoring tool to monitor all our store and corporate DB servers.
How has it helped my organization?
There is something called SQL PI which Foglight offers, although it is only available for Oracle and SQL Server at this point in time. That feature is really helpful for us in terms of assessing the performance of a database, or seeing what kind of consumption is happening that could cause performance issues in a database. To an extent, Foglight is helping us to find the root cause. From a very high-level overview, across the whole Lowe's organization, if you ask me what kind of SQL query is taking a very long time to run and might trigger a performance issue, we are able to get that from Foglight. In terms of metrics, we look at the tablespace data, the database space, the log space, as well as the availability of the databases, up or down, and how often downtime is occurring. These are some of the critical metrics which help us to measure the performance of the databases and improve it if required.
We have more than seven different types of platforms in Foglight. The fact that it enables us to monitor multiple platforms is a really cool feature. It gives us a single pane of glass to look at all the different types of databases in one place. That helps our database teams. We are the monitoring team that provides the monitoring solutions for our customers who are the different types of DB guys or DB teams. We are monitoring Oracle, SQL, MongoDB, Db2, Sybase, and Cassandra databases. Foglight gives all these DB teams a single tool to log into and look at the databases. They can see the performance of a database at any point in time, or they can take a look at the alerts for their databases. While the teams don't use it proactively to see if there are any long-running queries, they can always pull a report for the past month or so and see what kinds of queries are taking a very long time. That can help the database guys to ensure that the SQL query performance is improved. It also has a feature, out-of-the-box, to display long-running queries, which is really helpful.
What is most valuable?
It's important that Foglight supports different databases, starting with Oracle and including SQL, MySQL, Apache Cassandra, Db2, and MongoDB.
Another good thing about Foglight is that we can visualize all the different types of DB servers that we are monitoring in a single pane of glass. It gives a 360-degree overview of each database that we are monitoring. That includes what kind of resource utilization is happening and which DB parameters are being monitored, as well as the different DB parameters that are offered for each database type.
The UI is pretty simple. You don't need to write any custom scripts, the kind that are used in open source tools. You need to ensure that the DB servers that you're going to add have the necessary DB permissions, and the DB users, for the DB server to connect to Foglight. Once you have that, no matter what type of database, it's just a matter of clicking to add a database that you want to monitor. From the UI perspective, Foglight is good. They can improve it a bit here and there, but overall it's an okay UI.
What needs improvement?
There have been times where the database guys have used Foglight to find the root cause and it has taken longer than anticipated. One type of feedback we have gotten from our DB guys, especially when it comes to root cause, is that Foglight can improve. There are tools on the market that actually show where the issue is happening. It could be a performance issue or it could be another issue that is causing the database to go down. What we have been told by our DB guys is that Foglight should improve when it comes to root cause analysis.
There are thousands of objects within the Foglight Management Server. At times what happens is that these objects consume a lot of resources and that causes the database, or Foglight itself, to go down. To then identify which object is consuming a lot of resources is really difficult. At times it's very cumbersome. It would help if they could ensure that the performance of their tool is improved. Maybe they can try to eliminate some of those thousands of objects and just keep the important ones that are really necessary.
Or if they can come up with a way to let customers know which objects are causing, or could potentially cause, performance issues and then give the customer an option to change the threshold on those objects, that would help. I'm stressing this point because there have been cases where Foglight has gone down and, because of that, monitoring of all the database servers has been impacted. One of the reasons was that some of the host processes, and the objects related to the databases, were breaching the default threshold. It takes us some time to identify that, change the threshold, and work with the Quest team to bring the tool back up. Foglight should really work on that and come up with a handy solution.
For how long have I used the solution?
I have been using Quest Foglight for Databases for one and a half years.
What do I think about the stability of the solution?
The system is not that stable. We have been facing a lot of issues. We built a new store environment of Foglight, an environment for monitoring the Lowe's store servers, which are all Db2 servers. The objective is to monitor 800 Db2 servers in each Foglight instance. Up to 150 Db2 servers, the environment was working fine. The moment it crossed 150 or 160, we started having a lot of stability issues. The Agent Manager was restarting frequently, like every five minutes. Because of that a lot of alerts were generated. We have been working with Quest since last week but no resolution has been found.
When it comes to stability, they really need to work on that and then find a way to handle a larger number of databases, regardless of the platform. They need to find a way to handle more load on the Foglight Management Servers and the Agent Managers.
What do I think about the scalability of the solution?
When we initially procured Foglight, the intention was to monitor only our corporate DB servers, of which there are around 500 to 600, including both production and QA environments.
We have now set up another six new Foglight environments to monitor our store DB servers. There are 1,800 Lowe's stores and each store has 2 DBs. So altogether we are trying to monitor 3,600 DB servers, which is a huge infra. I have heard Quest saying that they have not seen such a huge infra where any of their customers is monitoring thousands of DB servers.
The usage is really increasing and we have not even added one-sixth of the servers that we are planning to monitor. But we are facing stability issues already. I'm really worried about what will happen if we start adding more servers than those I just talked about.
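To put those numbers side by side, here is a rough sketch in Python. It uses only the figures mentioned in this review. The even split across the six environments and the idea of sizing instances at the roughly 150-server level where things were still stable are my own assumptions for illustration, not anything Quest recommends.

```python
# Back-of-the-envelope capacity figures taken from the review text.
STORES = 1_800
DBS_PER_STORE = 2
STORE_ENVIRONMENTS = 6      # new Foglight environments for the stores
TARGET_PER_INSTANCE = 800   # stated objective per Foglight instance
STABLE_UP_TO = 150          # instability was seen past roughly this count

total_store_dbs = STORES * DBS_PER_STORE

# Assumption: servers are spread evenly across the six environments.
per_environment = total_store_dbs // STORE_ENVIRONMENTS

# "Not even one-sixth" of the planned servers have been added so far.
one_sixth = total_store_dbs // 6

# Hypothetical: instances needed if each one stayed at the stable level.
# Negated floor division gives a ceiling without importing math.
instances_at_stable_load = -(-total_store_dbs // STABLE_UP_TO)

print(f"total store DB servers:        {total_store_dbs}")
print(f"per environment (even split):  {per_environment}")
print(f"one-sixth of the total:        {one_sixth}")
print(f"fits the 800-per-instance goal: {per_environment <= TARGET_PER_INSTANCE}")
print(f"instances at ~150 servers each: {instances_at_stable_load}")
```

On paper an even split sits under the 800-per-instance goal, which is why the instability at around 150 servers is so worrying.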
How are customer service and technical support?
When we started with Foglight at Lowe's, it was a new tool for us. We didn't have any background in using the tool. The support guys were really helpful in setting up the environment for us, and whatever issues we were facing at those earlier stages, and even today, we got—and are getting—correct support from Quest.
Although there are times where it takes longer than expected to resolve an issue, at the end of the day they try to find out the root cause and ensure that correct solutions are provided.
If it's a Sev 2 issue, they try to resolve it within a day. We have a dedicated support person from Quest who is supporting us on a daily basis. That means we can go through the pending issues every day, for an hour or so, and ensure that the support is given on time, right then and there.
But we have been having an issue since last week and, although we have been working together with that person, it has not been resolved yet. There are cases where the first-line support has to go back to their R&D team to come up with a solution. That's what is happening in this particular case. On average, it doesn't take them more than a day to resolve an issue, but in extreme cases like this, it is taking more than a week to come up with the proper solution.
Which solution did I use previously and why did I switch?
Prior to Foglight, we used an open source monitoring tool called Nagios. We used that to monitor both our infra and databases. Because it was an open source tool, we needed to write a lot of custom scripts. Foglight offers a lot of out-of-the-box, database-related metrics, which the DB teams here are looking for. Foglight has helped us avoid a lot of the time needed to create custom scripts, compared to when we were using Nagios.
Ultimately we switched because we're monitoring thousands of servers. Since Nagios is an open source tool, there is a limit on the number of servers that you can monitor in a single instance. We had close to 50 Nagios instances, which were monitoring all our infra servers, including database monitoring. We wanted to have a single pane of glass to view all our database servers. We wanted one tool to monitor just the DB servers. That's the whole point of having Foglight in place.
How was the initial setup?
It took us some time to get accustomed to Foglight installations. The very first time, we had help from Quest support. After a couple of installations, it was okay. But I'm sure that they could make the installation process much simpler. The total installation process shouldn't take more than an hour, with all the configurations set up. They need to bring that time down to something like that.
Prior to installation, there are a lot of prerequisites that the customer needs to take care of. For example, building a new machine to be a Foglight Management Server or the Agent Manager, as well as the database server. You need to work with the architects to build the architecture based on the number of servers or the type of monitoring that you're going to do.
In terms of the architecture, Foglight has a Management Server (FMS) which is connected to the Agent Manager and the database. The DB agents are installed on the Agent Manager, which communicates with the FMS, and the data is sent to the database. Since ours is a huge infra, we needed to build a lot of machines to start with. To set up our corporate environment, we had to procure more than 10 or 12 different types of servers.
Since Foglight supports multiple databases, each Agent Manager has a limit on the number of DB servers it can monitor. Let's take Db2 servers as an example. If you are planning to monitor more than 800 Db2 servers, you need to have an Agent Manager with a lot of resources. When I say a lot of resources, that means you should have an eight-core CPU, 48 GB of RAM, and 100 GB of storage, minimum.
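As a rough illustration, a pre-build check against those minimums could look like the Python sketch below. The `HostSpec` class and `shortfalls` function are hypothetical helpers, not part of Foglight, and the only real numbers in it are the eight cores, 48 GB of RAM, and 100 GB of storage quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostSpec:
    cpu_cores: int
    ram_gb: int
    storage_gb: int


# Minimums quoted in this review for an Agent Manager monitoring 800+ Db2 servers.
AGENT_MANAGER_MIN = HostSpec(cpu_cores=8, ram_gb=48, storage_gb=100)


def shortfalls(host: HostSpec, minimum: HostSpec = AGENT_MANAGER_MIN) -> dict:
    """Return how far each resource falls below the minimum (empty if none do)."""
    gaps = {}
    for name in ("cpu_cores", "ram_gb", "storage_gb"):
        missing = getattr(minimum, name) - getattr(host, name)
        if missing > 0:
            gaps[name] = missing
    return gaps


# Hypothetical candidate machine: enough disk, but short on CPU and RAM.
candidate = HostSpec(cpu_cores=4, ram_gb=32, storage_gb=200)
print(shortfalls(candidate))
```

An empty result means the machine meets the stated minimums; it says nothing about whether the Agent Manager will actually be stable at that load.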
These are requirements that not every organization can handle. Foglight has to find a way to reduce these resource dependencies. That is something they need to work on.
We have three people who look after the maintenance and the operations side of Foglight. We have a senior software engineer, a software engineer, and me, as lead engineer, who look after all the rules and tasks. Sometimes Foglight works against us: you need to do regular patching, and if you're adding more servers you need to work with the vendor again. There are a lot of issues in terms of maintaining Foglight. It's really painful. We have about 200 users of the solution, who are all database admins for the different DB platforms. Occasionally, application teams use it as well.
What about the implementation team?
We did reach out to PSO, which is a third-party vendor for Quest, to integrate Foglight with our event management tool. Every time we want to create customized rules, we also need to reach out to them.
That is really painful. I have to pay for custom rules that our DB guys are looking for. I cannot create them on my own because there are a lot of attributes and variables that you need to be aware of, from the application, when creating a custom rule. PSO is the only vendor that can do that. That causes a delay.
What was our ROI?
We still have a couple more years before our license expires. Hopefully, starting next year, we will see benefit, in terms of ROI, from using Foglight.
What's my experience with pricing, setup cost, and licensing?
As far as I know, compared to the other tools on the market, Foglight is okay in terms of pricing and licensing.
Apart from the enterprise license we have, there is the cost of the third-party integration that we talked about. If you need to integrate, you need to procure an additional license from PSO.
If you want to set up, say, five new Foglight instances, and you want to integrate all five of them through the third party, you need to procure an additional license for each of those instances, starting at around $1,000 each. That's something I have talked about with the vendor, and something they should work on. Maybe they could include all those integration licenses as a package.
What other advice do I have?
We have had a lot of stability issues since we brought Foglight in to Lowe's. From the stability standpoint, Foglight really has to improve.
I know that Foglight is capable of monitoring OS parameters as well as cloud DB instances, but we're not really using those features. We're just using Foglight to monitor the DB infra, purely from the database metric standpoint.
The time it saves us when it comes to a root cause analysis differs from case to case. There are instances where the metrics that we are monitoring on the DB servers have really helped us to narrow down the root cause. For example, it could be an ORA-600 error which is causing our Oracle database server to have a performance issue. If that's the case, Foglight raises an alert and sends an email to the DB team. As a result, they may disable that particular alert or look into the alert. They may end up opening a case with Oracle.
Which deployment model are you using for this solution?
Which version of this solution are you currently using?
5.9.5 and 5.9.7