BMC TrueSight Operations Management Review

Enables us to monitor a hugely diverse set of hardware products from multiple manufacturers


What is our primary use case?

We're actually hosting the software and providing services to our customers based on all the capabilities that are within TrueSight. We are a very large, global, hardware maintenance provider for data centers. We mostly service the high-end data storage and networking equipment that you would find in data centers and in cloud environments. 

A couple of years ago we started on a journey to really improve our ability to maintain and service our customers. This was all about connectivity, getting connected to those servers and storage platforms. We wanted to get connected to everything that we were maintaining around the world so that we could really implement a "diagnosis before dispatch" approach.

With this solution, we gather all the data from a server that has failed, and we do all the troubleshooting, the problem and root-cause determination - we call that triage - before we ever send a field engineer or anyone to the site. So when we do send a part or do send a field engineer, we know exactly what the root cause of the problem is and what they need to do to fix it. 

How has it helped my organization?

We are using this solution to scale our business and to drive greater efficiencies. The other side of it is that it's much better for our end customers because they no longer have to monitor their own environments for hardware failures. We do that for them. They don't have to recognize that a server has failed. They don't have to pick up the phone or send us an email to open a ticket and send us files to help us troubleshoot the problem. We're really reducing a lot of the effort required on the customer's side to manage their IT environment using this tool because we can detect the failure and troubleshoot it remotely. And, when we do implement the corrective action, we're pretty certain of the root cause, based on the technology and the capabilities of TrueSight.

It has improved our time to repair. Measured from the time the incident is logged to the time the customer is back up and running, it has improved by 33 percent or more. It has also improved our ability to fix it right on the first call. It gives us the root cause of the problem, it automates that whole triage, and it gives us the part number of what's failed. We're now at somewhere around a 97 percent first-time fix rate. And that's only going to get better as we get more experienced with the product. And that's important to our customers. When we come out, we're going to fix it right on the first call and not have to come again and again and again. That's really important to the uptime of their IT.

We have a graphical representation of this very thing. It shows the old way of service delivery, in which the customer first had to recognize they had a problem. Once they recognized they had a problem, they had to call in or email and open a ticket. Once they opened a ticket, the whole troubleshooting process would begin. We were often calling them as many as eight times per ticket, just to get information about the failure. That was taking a lot of time from the customer. After that, we would have to dispatch someone with the right part or the right solution, and oftentimes we either brought the wrong part, or we had to bring a handful of parts, which was costly for us and would drive up the cost of the service for the customer. And often there would be a repeat call, because we might not have brought the right part or have sent the right level of skill out on that call. That was the old way of doing it.

The new way of doing it for the end-customer is that we call them to let them know we have spotted a problem with their server, for instance, and that we're working on it. We don't have to bother them for log files or diagnostic logs or any of that information anymore because it all comes packaged with the alert from TrueSight. The customer really only hears from us two times now: once, when we open the ticket to let them know we've seen a problem and again after we've resolved it.

Another example is that many of our customers have equipment in co-location centers and offsite data centers, where they don't even have anyone to see that there's a problem. Now, we are driving a lot of efficiency for them. They don't have to send people out to check on problems anymore or pay somebody who is running the co-lo to go out and check on something. We're able to see it all remotely through the monitoring tool. That's another huge benefit that we've heard about from our customers.

The solution provides us with a single pane of glass where we can ingest data and events from many technologies. In terms of our IT ops management, we have a unique deployment. We actually have it running in our own shop. Everything that we deploy to our customers we deploy internally first. But we've really licensed and implemented TrueSight to drive our services business. We're supporting all of our customers' data centers with the product. We're not connected to all of those yet. We just officially launched the solution in January of 2018. We've got about a year-and-a-half in production with the product and we're getting good adoption. The real answer to its effect on our IT ops management is not so much our internal deployment. It's more about the deployment that we're leveraging for all of our 16,000-plus customers globally.

We've had a number of cases where, through the analytics in TrueSight, we've actually been able to predict failures. For instance, we've already had a couple of cases where, if we see a hard drive on a storage array is going to fail, we'll actually send the part out ahead of the failure. That allows us to replace that drive before it fails - and on the customer's planned downtime. In the old model, it fails, it's down. The customer waits for us to come out, swap it out, and bring everything back up. In the predictive model, we know it's going to fail, we send the part out ahead of the failure, and we replace that drive on the customer's scheduled downtime. The more of that we can do - and as we expand beyond hardware into operating system, application, and the other layers of infrastructure - we'll be able to exploit the machine learning and the AIOps to a greater degree than what we're doing today on the hardware side.

The way we talk to our customers about the functionality of the solution across IT ops management to support business innovation is that because we've significantly reduced the amount of time they have to spend managing service tickets, they have more time to focus on their digital strategies. We say, "Hey, we're giving you some time back. You don't have to spend all this time interacting with your service provider. You're just going to hear from us when you have a problem and after we've fixed it. We won't bother you for log files and all those things." We're actually giving them time to allow them to do more value-added work, like working on their strategic initiatives and their digital transformation initiative. I think we'll be able to expand on that as we go forward.

What is most valuable?

The ability of this platform to monitor the very diverse assets that we maintain around the world is its most valuable feature. We service over 350,000 data center assets. These assets come in the form of servers, storage arrays, networking devices, etc. We've calculated that we service and support over 36,000 data centers around the world.

We're not really tied in with the manufacturers, but we support a vast array of manufacturers' equipment, like HP, IBM, Cisco, Dell, EMC, Hitachi; and I could go down the line. We have a very diverse install base under contract and TrueSight can connect to all of those and monitor all those different platforms. Many of our customers have as many as 20 tools in their IT environments to try to monitor all this stuff. We can do it all with one, and we're hosting it for them. So it really gives us the ability to take some of that burden off the end customer.

The other really important thing to us, and the reason we chose TrueSight, is not only to monitor and to capture failures and alerts when things fail out there, but to do what we call "automated triage." No matter who manufactured the equipment, when we get the message that tells us something has failed, it always looks the same. Whether it's EMC or Dell or IBM, whatever the equipment might be, TrueSight always returns the event in a standard format which gives us the manufacturer, the model, the serial number. It even gives us a list of what has failed, whether it's a hard drive or power supply, for example. It even gives us the part number of that specific device in that specific machine. That really helps automate the troubleshooting and the triage process. That's a big feature for us.
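As a rough illustration of what that standard event format means in practice, here is a minimal Python sketch. It is not TrueSight's actual schema or API. The `HardwareEvent` class, the `normalize` function, the `vendor_a` and `vendor_b` labels, and every raw field name are hypothetical, made up only to show the idea of mapping differently shaped vendor messages onto the five fields mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HardwareEvent:
    """Vendor-neutral event carrying the five fields named in the review."""
    manufacturer: str
    model: str
    serial_number: str
    failed_component: str
    part_number: str


# Hypothetical per-source field names (standard name -> raw name).
# Real vendor payloads will look different; these are placeholders.
FIELD_MAPS = {
    "vendor_a": {
        "manufacturer": "mfr",
        "model": "mdl",
        "serial_number": "sn",
        "failed_component": "fru_type",
        "part_number": "fru_pn",
    },
    "vendor_b": {
        "manufacturer": "Vendor",
        "model": "ProductModel",
        "serial_number": "Serial",
        "failed_component": "Component",
        "part_number": "PartNo",
    },
}


def normalize(source: str, raw: dict) -> HardwareEvent:
    """Translate one raw vendor message into the standard event shape."""
    mapping = FIELD_MAPS[source]  # raises KeyError for an unknown source
    return HardwareEvent(**{std: raw[src] for std, src in mapping.items()})
```

With something like this, two messages that describe the same failure in different vendor formats come out as identical `HardwareEvent` objects, so the triage step only ever has to handle one shape.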

The solution's event management capabilities are proven. We always like to say it performs as advertised. We evaluated over a dozen products before we chose TrueSight, and we found it to be very good at monitoring at the hardware level, which is core to our business. The ability for it to capture those failures, to capture all the events from that very diverse set of equipment which we maintain out there, means we are very impressed with the performance.


In terms of the breadth of the solution's monitoring capabilities, I've already addressed the different types of products, the different manufacturers. The diversity of what we service out there is amazing, and it can really monitor just about everything that we maintain out in the field. But the other aspect of the breadth is the fact that not only does it do hardware really well, but it's really going to help us start to add to our portfolio of services. We're going to be able to use this to monitor operating systems and applications and software and networks, and even all the way to end-user experience. Ultimately, we're going to be able to move into other areas of service, based on the breadth of what it can do in the total IT infrastructure.

For how long have I used the solution?

In production, we have been using it for about a year-and-a-half.

What do I think about the stability of the solution?

We're in a very stable environment now but it took a little time for us to get there. That's because of the multi-tenancy, the scalability, and the volume of traffic that we're driving through BMC's platform. Those are very different from what BMC is used to. It's potentially hundreds, potentially thousands of customers, with a lot of equipment in their data centers flowing through. We are now in a very stable place in production. We feel very comfortable going forward, scaling it out, and adding thousands of customers to it. It took us a little bit of time to get there and we needed a lot of support from BMC, but we feel good about it right now.

What do I think about the scalability of the solution?

We have a unique use case because BMC typically sells this solution into enterprises that are deploying it within their IT, versus to a managed services provider like us where we're supporting thousands of customers. Multi-tenancy and the scalability have been challenges along the way, as we've grown. But BMC has really been a great partner helping us address those things.

Building that kind of scale and multi-tenancy into the product would serve companies that deploy it the way we do. It's a little different than what BMC is used to, but that would be one thing I would put out there. If anything could have gone better as we were ramping this up and adding a lot of volume to it, I would say it's the scalability. That would be one thing that could be improved.

How are customer service and technical support?

BMC's technical support has been great. They've been by our side. They've been working with us. They could have just said, "Look, our product wasn't built to do that. Good luck." But they didn't. They stuck with us and they're still with us today helping us optimize and do things better. They've been a great technology partner for us.

If you previously used a different solution, which one did you use and why did you switch?

Most of the storage products have a native "call home" feature. It's like email alerting, so when a hard drive fails on the storage array, it will send an email. A lot of the manufacturers did that for the warranties. It would send them an email and they could take care of the warranty claims. What we did was redirect those emails to us, because most of what we do is after the warranties have ended on a product. We were getting all these emails from potentially thousands of things that we were maintaining out there, and every email looked different. Emails from HP looked different than those from EMC which looked different than the ones from IBM or Hitachi. Everything was in a different format. It took a long time to sift through these emails to figure out what was actually wrong, and it was very inefficient. That's how we were doing monitoring.

We also had a little black box that we built internally that was using SNMP and some other technologies. But a lot of customers don't want some rogue hardware in their data center. It's a security concern. So that was very limited in its deployment. Overall, by and large, we really weren't monitoring. We were very crude in our methods and there was a very limited number of things that we were monitoring at the time I came in.

That's when we started thinking, "You know, if we either build or buy a world-class monitoring platform and get it connected to everything, we could really differentiate ourselves in the market." That's what led us to start evaluating some commercial, off-the-shelf things like BMC.

How was the initial setup?

We got it up and running pretty quickly. We had it up within three months, and that included buying hardware and building the whole infrastructure, so it was a little more than just installing the software.

Then we did what I call a controlled deployment. We had about 10 to 15 customers in a pilot program. We ran that over about a six-month period before we went live in production.

What about the implementation team?

We had a consulting firm that worked with us, a firm which BMC had brought to the table named Column Technologies. That experience was not good. BMC had said these guys were one of the best partners they had, and they probably are. It could have been Column Technologies or it could have been anybody else that they brought in.

Our implementation was so unique and different compared to what they were used to. They were used to going into an end-user and helping them get this solution deployed within their own IT environment, to manage their own back-office IT. But that's not how we were doing it. We were putting it in as a service platform to manage thousands of customers and hundreds of thousands of devices, potentially. So the implementation was very different.

BMC had to work with us pretty extensively on how we were configuring and putting this in to make it work the way we needed it to work. I'm not going to pick on the consultant that much or criticize them too heavily because this installation was very different than what they were used to doing.

We got a lot of support from BMC because it required it. We needed the guys who built the product to help us get this thing implemented in such a way that it would support our business model. Ultimately, we solved those problems and we're in good shape now. But there were some startup issues, that's for sure.

What was our ROI?

I don't know that I have a number available. When we embarked on this journey we had some business-case assumptions about what our internal savings would be. We've got a little more work to do to come up with those numbers. We need to get more volume deployed before we can say we have a reliable percentage of OpEx reduction.

What's my experience with pricing, setup cost, and licensing?

Pricing is all volume-driven. I think we were paying between $80 and $85 per license. That's per unit, for a perpetual license. You pay it one time and then, every year, you pay 20 percent of that for annual maintenance and support. 

But now that we've grown, we've purchased tens of thousands of licenses and the cost per license has gone down to something like less than $30. 

I wouldn't call it an agent cost because the way they price it is based on the number of things you have connected. You can connect hundreds of things to a single agent but you're paying by the number of things. That's how you use the licenses. So it's really priced by endpoint, not by agent.
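To make the pricing model concrete, here is a small Python sketch of the arithmetic as I've described it: a one-time perpetual license fee per connected endpoint, plus 20 percent of that fee every year for maintenance and support. The function name and parameters are hypothetical, the numbers in the example are illustrative rather than our actual bill, and it assumes maintenance is always calculated on the original purchase price.

```python
from decimal import Decimal


def total_license_cost(
    endpoints: int,
    unit_price: Decimal,
    years: int,
    maintenance_rate: Decimal = Decimal("0.20"),
) -> Decimal:
    """Perpetual license cost plus annual maintenance over a number of years.

    Licenses are counted per connected endpoint, not per agent.
    """
    upfront = unit_price * endpoints  # one-time perpetual license fee
    annual_maintenance = upfront * maintenance_rate  # paid every year
    return upfront + annual_maintenance * years


# Illustrative only: 1,000 endpoints at $80 each, held for three years.
example = total_license_cost(1000, Decimal("80"), 3)
```

In that example, the $80,000 paid up front plus $16,000 a year in maintenance comes to $128,000 over three years. Because the count is by endpoint, putting hundreds of devices behind a single agent doesn't change the total.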

Which other solutions did I evaluate?

When we were just starting the journey, we looked at ScienceLogic, Centerity Monitor, and we looked at CA. We also looked at the Microsoft product. Those represent a handful of the products we evaluated.

What other advice do I have?

If we had to do it all over again, we would have spent a lot more time in the early going on planning the architecture, on how we were going to build this out. That could have saved us some pain, once we got it up and running and started adding customers and expanding it. If we had spent a little more time with BMC, planning architecturally how we were going to design this to support the scale we needed, it would have helped. That was a lesson learned. And that would be some advice I would give. Depending on how you're planning to use the tool, make sure you spend some time on the systems architecture and the design of how you're going to implement it, to make sure it's going to meet your needs. Make sure it's going to scale appropriately and do what you need it to do.

Our goal is to get this solution connected to every single customer that we're maintaining equipment for, because of the efficiencies and the improvement in the end-user experience. When I say we support over 350,000 assets in 36,000 data centers around the world, that is our maintenance business. We're working to connect TrueSight to all of that. We have sold - not quite yet deployed, but we have sold - about 33,000 licenses, which means assets. We've deployed just under 10,000 of those so far. So we're making good headway and we're very pleased with how it's performing so far.

One lesson that we've learned is that we're now in a great position to expand our portfolio of services which we offer to our customers, well beyond hardware. Without this technology, we could never get there. Prior to us putting this in, it was all done manually: phone calls, emails, people driving to the site to try and diagnose problems. The way we were doing business was very manual, inefficient, and not scalable. And we were growing so fast. There's no way we could have scaled to where we are today, or to where we want to go, even in our core business.

The other lesson we're learning now is that our customers are asking us to do more and this technology is going to help us do more for them and expand our business. It will enable us to expand our portfolio of services. That's our biggest lesson. When we started out it was really all about driving operational efficiency in our hardware maintenance business. And now we've learned we're in a very good position to move into other services, based on the capabilities this platform brings to us, beyond hardware - into application monitoring and operating system and network and all the other pieces of the infrastructure. We can start to support them going forward.

It has completely changed our way of thinking about our strategy going forward. It's amazing.

At this point in time, I'd rate it a ten out of ten. We've got something really unique here. We built some integrations, some things of our own around it. And we're feeling really good about it.

Disclosure: IT Central Station contacted the reviewer to collect the review and to validate authenticity. The reviewer was referred by the vendor, but the review is not subject to editing or approval by the vendor.