What is our primary use case?
We stood up an event management group and our responsibility is to monitor the entire company, globally: systems, applications, and infrastructure. We're modeling those out as services. We've got about 800 services that we're modeling out from the CMDB right now and monitoring pretty much everything.
We are big users of the service models. We use CA's SDM system, which we're evaluating. But in the meantime, we wrote the interface between TrueSight and CA to cut tickets and also, in reverse, to show ticket statuses in TrueSight. We're also going through a process of onboarding our services for event management, where we go through a checklist of about eight different items and bring them on as a service with SLAs. Some individuals on our Service Desk are dedicated - and eventually all will be - to 24/7, 365 monitoring of the services, the events, and the applications.
One of the primary things we're doing is using this as a vehicle, within our "One-IT" initiative - which includes event management - to truly bring people together from a cultural and technological perspective. The goal is that everybody will have the same place to see what's going on. No longer will they have to worry about their application. Is it the database? Is it the network? And how long do they have to spend trying to figure it out? Culturally, the Service Desk is coordinating some of those impacts when they happen, so that the right people are on the call, based on what the service model says. All in all, it's a very flexible tool, which means it's complex but very powerful.
We're using Operations Management, Capacity Optimization, some App Visibility with some of the Synthetic scripting and we're just starting to deploy some Java agents on some app servers.
How has it helped my organization?
With the service modeling, once we managed to build our import process to get our CMDB impact models and services into TrueSight, that was a big win. Because once we integrate it with SolarWinds, people will actually be able to see when there's a problem with a plant, and they will know if it is a network problem or a server problem. With the service models, they can get right down to the impact of any issue. We're working on some other things to make that easier, like event correlation. So if the network goes out at a plant, they don't need to know that there are problems connecting to 60 servers; rather, they've got a problem with the router.
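The event-correlation idea above can be sketched roughly as follows: when a router is down, suppress the flood of "server unreachable" events for servers behind it and surface only the root cause. This is a minimal illustration, not TrueSight's actual rule syntax; the device names and topology map are hypothetical.

```python
# Hypothetical topology: which router each server sits behind.
TOPOLOGY = {
    "srv-plant-01": "rtr-plant",
    "srv-plant-02": "rtr-plant",
    "srv-hq-01": "rtr-hq",
}

def correlate(events):
    """Collapse server-down events behind a down router into one root-cause alert."""
    down_routers = {e["device"] for e in events if e["type"] == "router_down"}
    correlated = []
    for e in events:
        if e["type"] == "server_down" and TOPOLOGY.get(e["device"]) in down_routers:
            continue  # symptom of the router outage; suppress it
        correlated.append(e)
    return correlated

events = [
    {"device": "rtr-plant", "type": "router_down"},
    {"device": "srv-plant-01", "type": "server_down"},
    {"device": "srv-plant-02", "type": "server_down"},
    {"device": "srv-hq-01", "type": "server_down"},  # behind a healthy router; keep it
]
print(correlate(events))
```

The operator sees two events instead of four: the router outage itself, and the one server problem that is genuinely unrelated to it.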
We're currently looking at either consolidating the other monitoring tools that we have around the organization or connecting them for the single-pane-of-glass goodness. We're bringing in data from SolarWinds, we're bringing in data from Oracle's OEM, and we're integrated with an application that monitors desktops. It generates an event and a ticket is cut to the regional support people. They will go to the desktop and say, "Your disk is in danger of imminent failure. We need to go ahead and clone that guy and replace it before you're down." So we're definitely going with a single pane of glass. In terms of our IT ops management, that means it's getting better. We're trying to be more proactive instead of reactive. We've only been heavily into this for nine or ten months, so the actual, long-term impacts aren't measurable yet. We're still baselining where we are.
The single pane of glass is a big improvement.
There is also the ability to do predictive and corrective, especially for some services which we're monitoring out in the field which are critical to various plant components. It used to be that they would go down and the plant would call. Now we're detecting that they're down, we're restarting them, and we're letting somebody know there's an issue. That's also a big improvement in our manufacturing capabilities. Culturally, it is bringing people together with one place to look and giving them something to talk about when there's an issue. It's bringing IT together. The collaborative and predictive stuff is actually starting to improve.
We're not doing a tremendous amount of preventative stuff yet - unless you count when your disk is three percent from being full and you need to do something before it fills up. We're not using some of the more advanced features of the predictive analytics yet. We are starting to look at some data analytics though. We have a data analytics group which we stood up, a couple of people who are starting to use data analytics to do some things.
It's improving the overall operation, but the impact is going to be measured a little bit later. We've seen some cost deferrals and some cost savings from support renewals we haven't had to do on other tools. We haven't seen the major cost impacts yet - we have spent a lot - but on cost avoidance for various support tools we have saved close to $1,000,000. In the nine months we've been operational, we've deferred cost on at least two tools: one was about $750,000 and the other was $250,000 for maintenance.
It also helps to maintain the availability of our infrastructure across a hybrid, complex environment. I used to work at FedEx, and we're not as environmentally complex as FedEx because we consolidate a lot of stuff on the ERP. But if you throw manufacturing in there, we have pretty much every flavor of platform. As with most deployments, we've got three-tier and four-tier applications. You throw the network and some load-balancers in there and it's fairly complex. If you can use a service model to see exactly what's working and what's not, it really gives you the ability to isolate problems quickly.
The solution has also helped to reveal underlying infrastructure issues that affect app performance. Let's say there is a system that is occasionally slow but you don't know why. Then you find out that it was supposed to be configured to use a large number of LDAP servers for authentication but somebody had configured it to one. When you compare the times at which the systems people were having trouble logging on and you look at the CPU and memory usage on your LDAP server, you begin to put things together, without actually analyzing configuration files. You can figure out that the system is configured improperly. When they dig in, they find that it's only talking to one LDAP server. It gives us that kind of diagnostic capability, by looking at everything, and the ability to pin things down.
In terms of root cause analysis, we're still working that through. But mean time to repair is going down because it's becoming much more obvious. Between the events that people are looking at which are prioritized, and the service models which show the actual impacts to the relationships, it's becoming much easier. Depending on the event, it's gone from about four to five hours down to 20 minutes. When it works, it's significant. A lot of it is cultural. When you go from everybody monitoring their own stuff and not talking to anybody else, to everybody looking at the same single pane of glass, and you throw a Service Desk on top of that, which is performing incident management and coordinating some things - between the technology and the culture and the process changes, you're going to see some pretty dramatic improvements.
BMC just did a custom KM for us. Typically, on a given server, we want to know when a drive is three percent from full. But we've got some mixes of drives - servers which have anywhere from a 100-gig drive to a terabyte drive - and the percentages that we are worried about are not the same. This request came from our SQL group. BMC was able to adjust the alert parameters based upon the size of the logical drives. That was definitely a business innovation. I think that was good for BMC too. Although that's a custom KM which we just deployed, I suspect they will make it part of their standard tool kit.
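The logic of that custom KM can be sketched like this: instead of a flat "alert at 3% free," the free-space threshold scales with the size of the logical drive. The tiers below are purely illustrative assumptions, not BMC's actual parameters.

```python
GIB = 1024 ** 3  # bytes in a gibibyte

def free_space_threshold(drive_size_bytes):
    """Return the free-space fraction below which we should alert.

    Illustrative tiers: on a small drive, 10% free can vanish quickly,
    while on a terabyte drive, 3% free is still roughly 30 GB.
    """
    if drive_size_bytes <= 100 * GIB:
        return 0.10
    if drive_size_bytes <= 500 * GIB:
        return 0.05
    return 0.03

def should_alert(drive_size_bytes, free_bytes):
    """True if the drive's free space has dropped below its size-aware threshold."""
    return free_bytes / drive_size_bytes < free_space_threshold(drive_size_bytes)

# 3% free on a 100 GB drive triggers an alert...
print(should_alert(100 * GIB, 3 * GIB))
# ...while ~5% free on a 1 TB drive does not.
print(should_alert(1024 * GIB, 51 * GIB))
```

The point is that one fixed percentage cannot serve both ends of the size range, which is exactly the problem the SQL group raised.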
What is most valuable?
From a TrueSight perspective, we love the Capacity Optimization. We manage to collect almost all our capacity information through agents, without having to deploy a capacity agent. We've already saved some money. We're now provisioning more for obsolescence than we are for expansion because we now know exactly what we've got. One of the nice things about it is that we've now put Capacity Optimization in all our plants and mills, where the money's actually made.
The flexibility of the MRL is great. The ability to use native KMs to connect a lot of what we're doing - the hardware monitoring, consolidated applications like SharePoint - is great. We're using native monitoring capabilities for all our server hardware, for visibility into applications, for URLs, for webpage response and accuracy, and for monitoring network throughput in a lot of particular instances. We're using lightweight protocols for pinging, for DNS, for LDAP. We use the scripting KMs for a lot of stuff that we have to script ourselves. We're also doing a lot of SNMP polling for devices. We've got some places where we really couldn't use a traditional agent, and we deployed a Java agent that we wrote. For example, we might be monitoring UPSes out in the field using a Raspberry Pi and pushing that data back up. The problem with UPSes out in the field, when you have thousands of them, is that you don't know that the battery's bad until the power goes out. This gives us the ability to have them report back via SNMP.
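A rough sketch of that field-UPS check: a small agent (such as one on a Raspberry Pi) polls each UPS over SNMP and raises an event when the battery is bad, so we find out before the next power cut rather than after. The `snmp_get` function here is a hypothetical stand-in for a real SNMP client, and the host names are invented; the OID is `upsBatteryStatus` from the standard UPS-MIB (RFC 1628).

```python
UPS_BATTERY_STATUS_OID = "1.3.6.1.2.1.33.1.2.1.0"

# upsBatteryStatus values per RFC 1628 (UPS-MIB).
BATTERY_STATUS = {1: "unknown", 2: "normal", 3: "low", 4: "depleted"}

def snmp_get(host, oid):
    """Hypothetical SNMP GET; a real agent would use an SNMP library here.

    For this sketch we return canned integer values simulating device replies.
    """
    canned = {"ups-plant-07": 3}  # simulate one UPS with a low battery
    return canned.get(host, 2)    # everything else reports normal

def check_ups(host):
    """Poll one UPS; return an event string if its battery needs attention."""
    status = BATTERY_STATUS.get(snmp_get(host, UPS_BATTERY_STATUS_OID), "unknown")
    if status in ("low", "depleted"):
        return f"EVENT: {host} battery {status} - schedule replacement"
    return None

for ups in ("ups-plant-07", "ups-plant-08"):
    alert = check_ups(ups)
    if alert:
        print(alert)
```

In practice the event would be forwarded into the event-management console like any other source, so a failing battery in the field shows up on the same single pane of glass.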
What needs improvement?
I can only speak from my perspective because I don't know if some of the issues that we've had are industry-wide or not. For instance, we've got a lot of Microsoft stuff here, and the SCOM interface is very difficult to use. They don't have support for SCCM and some other things, so you have to go directly.
The one piece that I would love to see is a general-purpose, configurable agent which would be a framework that you can deploy on anything, whether it be Java or anything else. It would allow you to easily deploy it on a platform that they support.
The KMs and some of the user interface are a little bit quirky. That's the stuff that they will eventually get to. TrueSight is a fairly new platform revision for BMC. I'm seeing a lot of those simple platform things, where you have to go here to do this and you have to go there to do that. They're working very hard to integrate everything into the same simple console. I think that a lot of the issues we have are going to slowly, or maybe rapidly, disappear.
For how long have I used the solution?
We installed it a couple of years ago. We started ramping up and have been using it since then. We really went hot and heavy about nine months ago. We moved from Windows to Linux in January so that's when we really started to invest in event management work with it.
What do I think about the stability of the solution?
On Windows we went to application HA and, quite honestly, it was terrible. They'll tell you it's terrible - or they should. We are very religious about patching, so when you go to multi-node HA stuff and you've got the Windows guys patching your stuff every Saturday night, you become very unstable. What we did was we moved to Linux so that the patching wasn't necessary as often. And we went to operating-system and hardware-level failover with Oracle Solaris virtual machines, and we've been incredibly stable since then.
What do I think about the scalability of the solution?
Regarding scalability, so far, so good. We've got about 22,000 devices that we're working with, of which about 8,000 are directly monitored. The rest are coming in from SolarWinds, the network, and some other things. We're running three TSIMs and one parent, so four infrastructure managers. We've got integration servers all over North and South America and Europe. It's very scalable.
As for increasing the usage of it, the foremost thing in our pipeline is to continue to bring on applications. As part of the service onboarding that I talked about, we're bringing in major applications and sitting down with the service owners. We're going through everything they could possibly want monitored and showing them what we can do for them. We're putting those thresholds in place, training their teams, and bringing their teams on as users. Slowly, over the next year to year-and-a-half, we will bring in all of IT.
How are customer service and technical support?
Tech support varies, it depends on who you get. The first-tier is pretty good. If you get the right guy, it's outstanding. They've actually brought on a lot of new people, but they seem to work together as a team. I won't say they're bad, but I don't like tech support for most companies. Overall, they're on par.
Which solution did I use previously and why did I switch?
Prior to BMC, from a monitoring perspective, we were using 65 other solutions. One of my missions is to either integrate them or consume them. Bringing on TrueSight was the vision of a guy who's no longer here. He fully understood the need for a single pane of glass. He understood, fully, the need to bring light to the monitoring situation. We did some evaluations and proofs of concept and decided on TrueSight.
Quite honestly, if you're a large corporation, you can go look at the studies and you can justify it that way, but if you stop and think about how much better your organization can run, and the things that you need to do from an operations management perspective - and you think about the automation that you can put in place - it's a no-brainer. It's just a matter of choosing which tool.
How was the initial setup?
The initial setup was complex, no doubt, especially by the time you bring in Professional Services, if you opt to. We didn't follow the standard model because we didn't want them to come in, drop in a configured system, say, "Here's the book on how it works," and then walk away. We wanted them to participate in every aspect of it. We took a lot of it on ourselves, where they told us what to do and we did it. Because we worked with Pro Services that way, it took longer than it probably should have, but we know more about it than we would have as a result. It's a very flexible product, which means it's a very complex product. We had enough servers and monitors that we had to bring up a multi-tiered, large number of TSIMs. It was because of our service models that we introduced a lot of the complexity ourselves.
Because we're pushing full sets of service models out of our CMDB and into TrueSight to use as a service model, we have to put them at a top level of a TSIM so that all the other TSIMs that feed into them can show up as impact models. We went to a three-tiered architecture with presentation on top, a service management infrastructure manager in the middle, and the integration managers below. So a lot of the complexity in our particular configuration was due to the fact that we didn't want to have to figure out where those services belong, or which piece belonged on which TSIM. We wanted to punch them out to the top and then let TrueSight worry about it. So in the long run, it was complex to install but it is much easier to maintain.
The deployment took about three months. There was one person from BMC and about five people from our side, altogether. We had DBAs involved, the hardware guys involved, and the network guys involved. It was probably the equivalent of three people full-time, off and on. Every department that would touch this thing was involved at some point.
There is a team of five employees and myself who are not only maintaining it but doing all the monitoring configuration - working with users to collect monitoring requirements, setting thresholds and writing custom MRL and PSL.
At the cultural level, when we first started it up, people would say, "I have my own monitoring tool and I don't need you people. I'll do my thing." Now, they're saying, "You're doing things for these other people, can you help me out?" It's really grown organically. We had to put the team together so quickly that we never had what should have been in place: a major deployment plan where all of the pieces fall together. We're starting to work on that now.
What about the implementation team?
We worked directly with BMC. We didn't use any third-party.
What's my experience with pricing, setup cost, and licensing?
The only possible additional cost that I can mention, that you might not be aware of, is that it uses Oracle partitioning, if you use Oracle. There are Oracle partitioning fees that go with that.
Which other solutions did I evaluate?
We looked at some other options. BMC has been around a long time. If you look at the industry ratings, it's way up there, top-right quadrant, along with a couple of other solutions. Its flexibility and its capabilities dovetailed with what we wanted to do and we liked their people. They have a good attitude.
What other advice do I have?
My advice is that it's not going to be as easy as you think, but it's going to be worth more than you think when you get it done. It depends on your situation. It depends on how far advanced you are in operations management. For us, this was a complete cultural, technological, and process overhaul. It wasn't just replacing one tool with another. It wasn't just putting a tool in place. It was an entire IT renewal and it's still going on.
It's been a long, hard road, both from a cultural perspective and from a technology perspective, just getting people to realize the value. But once they do, they're willing to bend over backward for you.
We had some false alerts. In my job, a red light means it's bad and a green light means it's good. There should never be a light you think is green when things are actually bad. We had some of that at the beginning, more our fault than anybody else's. But once we got to the point where the signals were good and people could appreciate what they were getting, we became a very different organization.
The biggest lesson I've learned from it is that you can talk about it, you can visualize it, you can proselytize about it, but until you have a single pane of glass which is actually up and running with a lot of stuff connected to it, you just can't really appreciate the value of it.
The functionality of the solution is not helping, so much, in terms of business innovation. We're not doing business process monitoring at this point. While it might be that the business is not complaining as much, I don't measure that. But from an innovation perspective, it has had people look at things and say, "Well, if you can do this, can you do that?" We get a lot of requests for strange things, some we can do, some we can't. But it's getting people to think about things that hadn't really come up before.
It's a really good tool, and most of the issues we've got, they've either fixed or they're fixing to fix. So a nine out of ten is right.