BMC TrueSight Operations Management Benefits
We brought the product in to handle the following: We're in 35,000 data centers today. We have 16,000 customers and we support about 400,000 assets. Those are big numbers. The pieces of storage equipment we provide have something native from the equipment manufacturers, the OEMs, called "phone home." When these devices start having a problem, they send out an email that says, "I'm having this problem." To put that into perspective, we were trending towards 2,000,000 emails at the end of 2017, and growing. We would have to read 2,000,000 emails to find out what was going on. Fewer than seven percent of those actually contained a problem we really had to read, and well below one percent were actually a service event.

Before we brought in TrueSight, there were 8.2 touches via email or phone call after a ticket had come in, from exchanging log files with the customer through to our resolving the issue. And on the customer side, they had somebody who had to look at the equipment to make sure it was actually working. From those 8.2 touches, we're down to two with TrueSight.

And here's the big difference. Instead of these devices sending all of that information out in emails, it's captured in the Knowledge Module, the policy, and the agent, on the customer side of the firewall. When TrueSight installs, it takes a week to come up with what's called a dynamic baseline. It says, "For this piece of equipment in your environment, these are the key performance indicators that we're going to watch." We can see events live when they happen. There are predictive and proactive warnings of failures or potential problems. But the only thing that's ever communicated to us is when there's a failure. So we can see all the chatter, and we can look at it by customer, but we don't really need to.
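TrueSight's actual dynamic-baseline algorithm is proprietary, but the general idea described here - learn a normal range for each KPI from a training window, then only surface readings that fall outside it - can be sketched roughly as follows. The sample data and the three-sigma band are illustrative assumptions, not the product's real method:

```python
from statistics import mean, stdev

def learn_baseline(samples, k=3.0):
    """Learn a simple normal range (mean +/- k standard deviations)
    from a training window of KPI readings."""
    m, s = mean(samples), stdev(samples)
    return (m - k * s, m + k * s)

def is_anomalous(reading, baseline):
    """Flag a reading only when it falls outside the learned range."""
    low, high = baseline
    return reading < low or reading > high

# One week of simulated latency readings for one KPI, in ms.
training_week = [10, 12, 11, 9, 13, 10, 11, 12, 10, 11]
baseline = learn_baseline(training_week)

print(is_anomalous(11, baseline))   # normal chatter: stays on the customer side
print(is_anomalous(95, baseline))   # clear deviation: worth raising an event
```

In this toy version, ordinary fluctuation never crosses the firewall; only a reading far outside the learned band would generate an event.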
And if it's a predictive event, it will send us a notice saying, "We think this part is going to fail in two weeks," and we can help that customer. But ultimately, what we get is a service ticket: "Failed part at this location. Here's the part number, the serial number, and the recommended remediation." That comes into our support center. Eventually, when we have it all set up the way we envision it, the info will come into the support center, a ticket will be created and automatically assigned to a tech, and the tech will reach out to the customer. We haven't turned that on yet. Right now, it comes in and we read it. We call the customer and say, "You have a failure." In most cases, the customer didn't know it yet, because it's that fast. We call them up and say, "You have a problem. We have the part, and when would you like Larry to come on site?" Because it's storage, they have to schedule downtime. Then we go out on site, we fix it, and we're done. So it's two physical touches now: We call them, and they say, "Yes, it's completed."

So 2,000,000 emails have pretty much gone away, and it all gets done at the customer site. What we see now, instead, is a couple of hundred to 1,000 service events, versus millions of emails. And we have the right part, the right chassis, the right location. In our industry, there is about a 75 to 78 percent first-time fix rate, meaning repair personnel do not have to go back to a given site within a week. As a company, we were at about an 86 percent first-time fix rate. With TrueSight, we've never gone below 98 percent. It's all done with software. I read all of the service emails from our customers. Customers are used to finding a log file and talking to our expert - and if a customer has five different pieces of equipment, there are five different experts involved. Now, they send a note in and they'll say, "This is resolved. I just want to make sure this process is working the way it's supposed to.
I didn't call anybody. You called me to tell me I had a problem that I wasn't quite aware of. Now, I have a part, it's fixed, and we're good. Is that how it's supposed to work?" It's funny, because they were used to eight interactions with us, as opposed to two. It's really cool. It's taking an extremely manual process and, with the AI piece, literally helping us make better decisions. It's what AI is all about. It's really amazing. I'm excited about it because now, instead of our support center people trying to find the right part, they're calling the customer and saying, "By the way, you have a problem. We have a solution for you, and we notice in the same cluster you may have a failure in a week. Would you like us to look at that while we're there?" It's predictive, proactive maintenance. That is what it enables us to do, versus reactive. Today, when we are proactive, it's for a fan, or heat, or a battery. We get notice that they are about to fail, and they fail pretty quickly thereafter. But when we start getting to operating systems, there are days, as you know, when you have gone onto your computer and it's been slow. On those days of the month, you can probably look in your network and find that there was a big push to get something done. With TrueSight, we'll be able to start proactively predicting these events before they happen, and rerouting the customer so they don't notice a slowdown. Our tagline is all about uptime. TrueSight helps us deliver that. It helps us deliver it up front. View full review »
We don't use APM. We used to, but we line-item nixed that for various reasons a few years ago. We also don't use ITDA, their next-gen log monitoring tool. So we're truly just within the TSOM interface, as well as doing synthetics. That being said, the Knowledge Modules that BMC brings to the market are what make the implementation work across our varied infrastructure and applications. It's critical to have those Knowledge Modules. If we had to write things ourselves, or use a more generic monitoring environment and then build additional scripts on top of that to monitor the Kubernetes of the world, or the WebLogics of the world, or the Oracles and SQLs of the world - if we had to write scripts ourselves to bring back particular monitoring components and performance metrics and so on - that would be a heavy burden that would keep us from implementing. We don't often run into something that we haven't been able to monitor. It's just a matter of getting people to the table to tell us what they need. When it comes to incident management, we get most of our data from TrueSight. For log data, because we don't use the ITDA interface - it would be an effective interface - we go to our SIEMs, since we're already pumping data to another system there. But TrueSight definitely gives us a view into the health of our business services, which is our primary goal for implementing monitoring. We try very hard not to use event management. What I mean by that is that we do not have a typical NOC. We don't have ten people staring at screens and then escalating as necessary. Along those same lines, we don't spam our incident management environment with events from TrueSight. With a lot of customers I've met over the years, that's essentially the old-school way of doing things. Instead, we create events that are truly actionable. If we don't have an actionable event, we don't create it.
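A rough illustration of that "only actionable events" approach: anything that is neither past its threshold nor predicted to breach one is simply dropped, and the rest alerts the owning team directly rather than flooding a central queue. The event shape, threshold, and routing table below are hypothetical, not TrueSight's actual model:

```python
# Hypothetical routing table: event category -> team responsible.
EVENT_ROUTES = {"database": "dba-team", "web": "web-ops", "storage": "storage-team"}

def route_actionable(events, threshold=90):
    """Forward only actionable events (at/past the threshold, or
    predicted to breach it) straight to the owning team; drop the rest."""
    alerts = []
    for e in events:
        actionable = e["value"] >= threshold or e.get("predicted_breach", False)
        if actionable:
            alerts.append((EVENT_ROUTES[e["category"]], e["name"]))
    return alerts

events = [
    {"name": "cpu high", "category": "web", "value": 95},
    {"name": "disk filling", "category": "database", "value": 70, "predicted_breach": True},
    {"name": "memory normal", "category": "web", "value": 40},
]
print(route_actionable(events))
# [('web-ops', 'cpu high'), ('dba-team', 'disk filling')]
```

The third event never becomes an alert at all, which is the point: no NOC screen, no spammed incident queue.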
We use their baseline technology to ensure that we're only sending items that are either about to have a problem or have passed the threshold of having a problem. If you're talking about typical event management - where you create an event, it gets forwarded to some other system, and there's a notification about it somewhere else, the whole ITSM cycle - we don't use it for that. We use it for creating smart events that send alerts directly to the teams responsible. As I described before, we have many distributed teams rather than a centralized NOC. In terms of TrueSight helping to maintain the availability of our infrastructure, it's an interesting question because of our distributed systems. We have 8,000 hosts across about 40 different teams, and we have 600 different applications that we run. For those critical tier-one apps, teams are highly involved in their day-to-day operations and watch them very closely. Having those two things - the actionable alerts, and the ability to see the health of a system at any given time and check it against what normal looks like for those applications - gives the teams that use it in such a manner the information they need to be confident that their availability is as it needs to be, or better. As far as a hybrid environment goes, we have our own hosting environment because we are the cloud to our clients. So we're not necessarily in that situation. We don't use assets other than what's in our hosting environment. In the past, one of our biggest problems was plain old infrastructure incidents: basic availability incidents where a server or an application, an interface or an endpoint, may not have been available and no one noticed it until some downstream business end-result brought it to our attention. We've essentially eliminated 90 percent or more of those. It has been at least three years since we've done any numbers.
But at the time, we might have had ten to fifteen Sev-One incidents a month. When we last measured it, we were down to one. That was within a couple of years of implementing an enterprise monitoring strategy. As for root cause, when a team is engaged in monitoring to its full extent, we're usually able to get to root cause pretty darn quick. For example, if a team has many servers that could potentially be impacting an application or a business service, tracking something down across those multiple servers and multiple owners could be really tedious and time-consuming. It would be on the order of hours, or at least many minutes, depending on the scope of the issue. With well-implemented monitoring, for our Sev-One apps, teams are able to get to the solution almost immediately. If we have monitoring set up properly, the actionable event will tell them precisely where a critical component has failed, and they can resolve it. When it's a different type of incident, one we might not have a particular monitor for, they're able to use the performance data, availability data, and other related alerts to get to their issue much faster than they used to. Having a good monitoring implementation has made a world of difference to our operations teams. So much so that, if you think back five years, which is an eternity in the IT world, when there was a Sev-One incident back then, someone would walk around tapping people on the shoulder all over the floor. That was very time-consuming. But now, in a well-monitored environment, they're able to collaborate quickly, say, "It looks like this is the problem right here," and get right to the root cause. It's helped our mean time to remediation, and I'm being conservative here, by about 70 to 80 percent. That's an absolutely huge impact. View full review »
We are using this solution to scale our business and to drive greater efficiencies. The other side of it is that it's much better for our end customers, because they no longer have to monitor their own environments for hardware failures. We do that for them. They don't have to recognize that a server has failed. They don't have to pick up the phone or send us an email to open a ticket, and send us files to help us troubleshoot the problem. We're really reducing a lot of the effort required on the customer's side to manage their IT environment using this tool, because we can detect the failure and we can troubleshoot it remotely. And when we do implement the corrective action, we're pretty certain of the root cause, based on the technology and the capabilities of TrueSight. It has improved our time to repair. From the time we get the incident logged to the time we get the customer back up and running, it has improved by 33 percent or greater. It has also improved our ability to fix it right on the first call. It gives us the root cause of the problem, it automates that whole triage, and it gives us the part number of what has failed. We're now at somewhere around a 97 percent first-time fix rate. And that's only going to get better as we get more experienced with the product. That's important to our customers: when we come out, we're going to fix it right on the first call, and not have to come again and again and again. That's really important to the uptime of their IT. We have a graphical representation of this very thing. It shows the old way of service delivery, in which the customer first had to recognize they had a problem. Once they recognized they had a problem, they had to call in or email to open a ticket. Once they opened a ticket, the whole troubleshooting process would begin. We were often calling them as many as eight times per ticket, just to get information about the failure. That was taking a lot of time from the customer.
After that, we would have to dispatch someone with the right part or the right solution, and oftentimes we either brought the wrong part, or we had to bring a handful of parts, which was costly for us and would drive up the cost of the service for the customer. And often there would be a repeat call, because we might not have brought the right part or have sent the right level of skill out on that call. That was the old way of doing it. The new way of doing it for the end-customer is that we call them to let them know we have spotted a problem with their server, for instance, and that we're working on it. We don't have to bother them for log files or diagnostic logs or any of that information anymore because it all comes packaged with the alert from TrueSight. The customer really only hears from us two times now: once, when we open the ticket to let them know we've seen a problem and again after we've resolved it. Another example is that many of our customers have equipment in co-location centers and offsite data centers, where they don't even have anyone to see that there's a problem. Now, we are driving a lot of efficiency for them. They don't have to send people out to check on problems anymore or pay somebody who is running the co-lo to go out and check on something. We're able to see it all remotely through the monitoring tool. That's another huge benefit that we've heard about from our customers. The solution provides us with a single pane of glass where we can ingest data and events from many technologies. In terms of our IT ops management, we have a unique deployment. We actually have it running in our own shop. Everything that we deploy to our customers we deploy internally first. But we've really licensed and implemented TrueSight to drive our services business. We're supporting all of our customers' data centers with the product. We're not connected to all of those yet. We just officially launched the solution in January of 2018. 
We've got about a year and a half in production with the product and we're getting good adoption. The real answer to its effect on our IT ops management is not so much our internal deployment. It's more about the deployment that we're leveraging for all of our 16,000-plus customers globally. We've had a number of cases where, through the analytics in TrueSight, we've actually been able to predict failures. For instance, we've already had a couple of cases where, if we see that a hard drive on a storage array is going to fail, we'll actually send the part out ahead of the failure. That allows us to replace that drive before it fails - and on the customer's planned downtime. In the old model, it fails, it's down. The customer waits for us to come out, swap it out, and bring everything back up. In the predictive model, we know it's going to fail, we send the part out ahead of the failure, and we replace that drive on the customer's scheduled downtime. As we do more of that - and as we expand beyond hardware into the operating system, application, and other layers of infrastructure - we'll be able to exploit the machine learning and AIOps to a greater degree than we do today on the hardware side. The way we talk to our customers about the functionality of the solution across IT ops management, to support business innovation, is that because we've significantly reduced the amount of time they have to spend managing service tickets, they have more time to focus on their digital strategies. We say, "Hey, we're giving you some time back. You don't have to spend all this time interacting with your service provider. You're just going to hear from us when you have a problem and after we've fixed it. We won't bother you for log files and all those things." We're actually giving them time to do more value-added work, like working on their strategic initiatives and their digital transformation initiatives.
I think we'll be able to expand on that as we go forward. View full review »
Learn what your peers think about BMC TrueSight Operations Management. Get advice and tips from experienced pros sharing their opinions. Updated: April 2020.
419,052 professionals have used our research since 2012.
Because we've used it for so long, we've been measuring results for eons. The standard metric that we use, given to us by our CIO, is that 70 percent or more of our outages need to be alert-driven, not customer-driven. So, if a customer calls in and says, "Hey, I'm having an issue logging in to PeopleSoft," which is one of our applications, we should have already known that there was an issue and handled the alert prior to the customer calling in. A decade ago, we were using Microsoft's and HP's product sets to monitor, but it was disparate. The alerts weren't aggregated and we never knew who they would go to. Therefore, we missed a lot of opportunities to be proactive in our organization. Hence the reason we moved to the product which, at that time, was called ProactiveNet - and then it became BPPM and TrueSight, as it is today. We were able to flip that situation, and we have been able to meet that metric for five years running. We had one blip in the year prior to that, and in the years before that, we were knocking it out of the park. So our metric is whether we get the alert before someone has to call in, and we're successful in meeting that some 80 to 90 percent of the time. In addition to that, when we look out across the industry, most organizations have anywhere from five to fifteen people who are dedicated to monitoring. We have two. We're able to run the entire stack, along with its complementary adjacency tools, with two people. That was one of the many reasons that we made the migration from other products to ProactiveNet/BPPM/TSOM. At that time, we were a one-man band and really needed to be able to move quickly, but also to maintain a product without requiring tons of manpower to make it work. The improvements that BMC has made over the last two to three years are really about revamping and consolidating the console, so that it is truly a single console that you can run with a single individual, should you need to.
We have 342 apps in our ecosystem and my team manages around 280 of those from a support-platform standpoint. And because we have two individuals who are dedicated to the monitoring, they partner with the rest of our admin organization to drive exactly how things need to be alerted. We review them quarterly. That is a testament to a really solid product - that it only takes one or two people to really run the thing and administrate it, versus having an entire staff whose only job that is. The solution provides a single pane of glass where we can ingest data and events from many technologies. I am one of the few, at least according to BMC, who has screens up in my hallways, and I show our top 20 applications from a criticality standpoint - what's most important to our organization, the things that I have to run. Everyone sees what's up on those boards every day. I go to them two or three times a day. Because we have that single pane of glass, we see where we're having issues organizationally, and we're able to rally resources - whether it's engineering, operations, or our development group - to solve the problem and get those things from red/yellow back to green/blue. The single pane of glass was a key piece of what we needed to be successful as a monitoring organization. In terms of the availability of our infrastructure, ours is not a hybrid environment, per se. We don't really measure and/or monitor - because of legalities with most of these SaaS providers - how well their systems perform. But what we do is measure any of the interfaces that touch or route to those applications, and we have an uptime measurement of about 99 percent for most of our apps. We have a dashboard for that, which is managed out of the ITSM group. They partner with us and pull all of our monitoring data to figure out two key metrics: total uptime and uptime excluding maintenance.
Those are the two keys which enable us not only to showcase to our customer base how well the systems are performing, but how often they really are available. BMC has helped to reveal underlying infrastructure issues that affect app performance. Four years ago, PeopleSoft was running slow in regard to our payroll run. We run payrolls weekly. If you know anything about payroll, you've got to hit a certain deadline to be able to send the check file to the bank for those direct deposits to show up in people's bank accounts. It's a really sensitive issue when people don't get their checks. With the monitoring tools, we were able to triangulate that it was not an application issue but actually a storage issue. Our solid-state storage was having a firmware issue which was causing slow IO turnover, and therefore slowing down the entire payroll process. We were able to triangulate that that was the issue and decide what we needed to do - which was to move the storage so that the application could continue to perform. We met the need and were able to get the payroll cut just in time, so everyone got their checks. It was a big win. As for reducing IT ops costs: year over year, my operational expenses grow by three percent, which is mostly salary increases. I've gone from 12 resources to roughly 55 resources organizationally, while growing from 80 apps to 280 apps over the last eight years. Our operational costs have only gone up because of the use of licenses, not because of human capital. The tool has helped us work smart, not hard, and leverage the technology. We haven't needed to grow our operational expenses to accommodate the new functionality or the new applications that come into our ecosystem. We just set up the monitoring and it does its thing. View full review »
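The two metrics described in the review above - total uptime and uptime excluding maintenance - are simple to compute from outage records. A minimal sketch, assuming outages are recorded as (minutes, planned-maintenance?) pairs; the exact formulas an ITSM dashboard uses may differ:

```python
def uptime_metrics(period_minutes, outages):
    """Compute (total uptime %, uptime % excluding planned maintenance)
    from outage records given as (minutes, is_maintenance) pairs."""
    total_down = sum(m for m, _ in outages)
    unplanned_down = sum(m for m, maint in outages if not maint)
    maint_minutes = total_down - unplanned_down
    total_uptime = 100 * (period_minutes - total_down) / period_minutes
    # Exclude the maintenance window from the measurement period entirely.
    measured = period_minutes - maint_minutes
    excl_maint = 100 * (measured - unplanned_down) / measured
    return round(total_uptime, 2), round(excl_maint, 2)

# A 30-day month with one 120-minute maintenance window and a 43-minute outage.
month = 30 * 24 * 60  # 43,200 minutes
print(uptime_metrics(month, [(120, True), (43, False)]))
```

Reporting both numbers shows customers what raw availability was and how much of the downtime was planned rather than unexpected.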
With the service modeling, once we managed to build our import process to get our CMDB impact models and services into TrueSight, that was a big win. Because once we integrate it with SolarWinds, they will actually be able to see when there's a problem with the plant, and they will know whether it is a network problem or a server problem. With the service models, they can get right down to the impact of any issue. We're working on some other things to make that easier, like event correlation. So if a network goes out at a plant, they don't need to know that there are problems connecting to 60 servers; rather, they've got a problem with the router. We're currently looking at either consolidating the other monitoring tools that we have around the organization or connecting them, for the single-pane-of-glass goodness. We're bringing in data from SolarWinds, we're bringing in data from Oracle's OEM, and we're integrated with an application that monitors desktops. It generates an event and a ticket is cut to the regional support people. They will go to the desktop and say, "Your disk is in danger of imminent failure. We need to go ahead and clone that guy and replace it before you're down." So we're definitely going with a single pane of glass. In terms of our IT ops management, that means it's getting better. We're trying to be more proactive instead of reactive. We've only been heavily into this for nine or ten months, so the actual, long-term impacts aren't measurable yet. We're still baselining where we are. The single pane of glass is a big improvement. There is also the ability to do predictive and corrective work, especially for some services we're monitoring out in the field which are critical to various plant components. It used to be that they would go down and the plant would call. Now we're detecting that they're down, we're restarting them, and we're letting somebody know there's an issue.
That's also a big improvement in our manufacturing capabilities. Culturally, it is bringing people together with one place to look and giving them something to talk about when there's an issue. It's bringing IT together. The collaborative and predictive stuff is actually starting to improve. We're not doing a tremendous amount of preventative stuff yet - unless you count when your disk is three percent from being full and you need to do something before it fills up. We're not using some of the more advanced features of the predictive analytics yet. We are starting to look at some data analytics though. We have a data analytics group which we stood up, a couple of people who are starting to use data analytics to do some things. It's improving the overall operation, but the impact is going to be measured a little bit later. We've seen some cost deferrals and some cost savings with some support renewals we haven't had to do on some other tools. But we haven't seen the major cost impacts yet. We have spent a lot, but on cost-avoidance for various support tools we have saved close to $1,000,000. In the nine months we've been operational, we've deferred cost on at least two tools. One was about $750,000 and the other was $250,000 for maintenance. It also helps to maintain the availability of our infrastructure across a hybrid, complex environment. I used to work at FedEx and we're not as environmentally complex as FedEx because we consolidate a lot of stuff on the ERP. But if you throw manufacturing in there, we have pretty much every flavor of platform. As with most deployments, we've got three-tier and four-tier applications. You throw the network and some load-balancers in there and it's fairly complex. If you can use a service model to see exactly what's working and what's not, it really gives you the ability to look at some things. The solution has also helped to reveal underlying infrastructure issues that affect app performance. 
Let's say there is a system that is occasionally slow, but you don't know why. Then you find out that it was supposed to be configured to use a large number of LDAP servers for authentication, but somebody had configured it to use one. When you compare the times at which people were having trouble logging on with the CPU and memory usage on your LDAP server, you begin to put things together, without actually analyzing configuration files. You can figure out that the system is configured improperly. When they dig in, they find that it's only talking to one LDAP server. It gives us that kind of diagnostic capability, by looking at everything, and the ability to pin things down. In terms of root cause analysis, we're still working that through. But mean time to repair is going down because the cause is becoming much more obvious. Between the events that people are looking at, which are prioritized, and the service models, which show the actual impacts and relationships, it's becoming much easier. Depending on the event, it's gone from about four to five hours down to 20 minutes. When it works, it's significant. A lot of it is cultural. When you go from everybody monitoring their own stuff and not talking to anybody else, to everybody looking at the same single pane of glass - and you throw a Service Desk on top of that, performing incident management and coordinating things - between the technology, the culture, and the process changes, you're going to see some pretty dramatic improvements. BMC just did a custom KM for us. Typically, on a given server, we want to know when a drive is within three percent of full. But we've got a mix of drives - servers which have anywhere from a 100-gig drive to a terabyte drive - and the percentages that we are worried about are not the same. This request came from our SQL group. BMC was able to adjust the alert parameters based upon the size of the logical drives. That was definitely a business innovation.
I think that was good for BMC too. Although that's a custom KM which we just deployed, I suspect they will make that part of their standard tool kit. View full review »
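The custom KM's exact logic isn't spelled out above, but the idea - scaling the low-free-space alert threshold to the size of the logical drive, so the absolute headroom stays meaningful - might look something like this sketch. The tier boundaries and percentages are made-up examples, not BMC's actual values:

```python
def free_space_alert_pct(drive_gb):
    """Pick a low-free-space alert threshold (percent free) by drive size:
    small drives alert at a higher percentage than large ones.
    Tier boundaries here are illustrative, not BMC's actual values."""
    if drive_gb <= 100:
        return 10   # a 100 GB drive alerts with ~10 GB free
    elif drive_gb <= 500:
        return 5
    else:
        return 3    # a 1 TB drive at 3 percent still has ~30 GB free

def should_alert(drive_gb, free_gb):
    """Alert when free space falls to or below the size-based threshold."""
    free_pct = 100 * free_gb / drive_gb
    return free_pct <= free_space_alert_pct(drive_gb)

print(should_alert(100, 8))    # 8% free on a 100 GB drive -> alert
print(should_alert(1024, 80))  # ~7.8% free on a 1 TB drive -> no alert
```

The same percentage reading triggers an alert on the small drive but not on the large one, which is exactly the problem the SQL group's request addressed.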
We have one application, which is fairly large. In the past, we had Level 1 and 2 NOC support teams who were responsible for watching dashboards. When they saw an issue in the application, they would call Level 2 or 3 support and escalate the call, if necessary. Now, through the use of this product, we have been able to reduce the headcount by five people, as we are able to eliminate the eyes on the glass. We no longer have people watching the dashboard. We have events which are processed automatically through the system and get to the right people. We had six people in L1, and now have one. So we reduced five out of six headcount, which is pretty significant. Also, the average length of time used to be 45 minutes before we had the right engineer on the line, fixing the problem. Now, it's probably three to five minutes. The solution has affected our end-user experience management very positively. Our application teams are very excited about what we're doing with the reduction in headcount. More importantly, the automation it has brought has streamlined so many manual tasks. The teams are very happy with the way things are going. The solution will help us maintain the availability of our infrastructure across a hybrid or complex environment. Right now, we can get to an event scenario or problem quicker than we used to. We are right on the cusp of releasing our service impact modeling. This will help us tremendously because we have a multi-cloud, as well as an on-premise, environment. Any component should show its impact across applications, regardless of where it's located. It has definitely helped in these environments. We have improved our ability to get to a root cause because of the way their tools work. When a problem happens, it lights up a certain model in red. If you follow it down to the lowest level of the diagram, you'll see which member of the tree is the lowest one affected.
So, if it's a database saying, "I'm out of disk space," it may create all types of chaos. Following that tree down, you'll see that the lowest level is the database server, and it has a disk space event. Right there, that's the root cause of all your application issues. So it has helped us get to the root cause more quickly. We're just now gaining momentum on the adoption of this product. We have seen, with a database out of disk space, that because we can get to the root cause quicker, it can be remediated faster. We can also reduce the number of people who have to be on outage calls. There is no need to have network people on a call if it's a database issue. We let them deal with other things, so our operation becomes more efficient. The database people know exactly what the problem is, and quickly. View full review »
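The root-cause walk described above - following the impact tree down from a red application node to the deepest impacted node - can be sketched as a small recursive search. The node structure here is a hypothetical simplification of a service model, not TrueSight's actual data format:

```python
def lowest_failing(node):
    """Descend a service-impact tree from a red (impacted) node to the
    deepest red node, the likely root cause. Each node is a dict with
    'name', 'red' (impacted?), and 'children'."""
    if not node["red"]:
        return None
    for child in node.get("children", []):
        found = lowest_failing(child)
        if found:
            return found
    return node["name"]  # no red child below us: we're the lowest

tree = {
    "name": "billing app", "red": True, "children": [
        {"name": "web tier", "red": False, "children": []},
        {"name": "database server", "red": True, "children": [
            {"name": "disk volume", "red": True, "children": []},
        ]},
    ],
}
print(lowest_failing(tree))  # -> disk volume
```

Everything above the disk volume lights up red, but the walk skips the healthy web tier and lands on the out-of-space volume, which is why only the database people need to join the call.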
One case that we like to use a lot: We have a customer who uses F5 load balancers, and they were managing them with CA products. Those load balancers were generating around 11,000 tickets a month. Just moving them from CA to TrueSight, and replicating the same rules, they went from 11,000 tickets a month to 400 tickets a month. TrueSight did a much better job of doing the same thing. From there, we were able to tune it and got it down to about 40 tickets a month. While this is an extreme example (I don't usually see this type of improvement), it shows the power that is there. We are able to more quickly identify problems and get an engineer on them to restart services, etc. It is not fixing the customer's bugs. They've got buggy apps, and they go down all the time. It is just that we can get them back online faster. View full review »
TrueSight has helped to reduce IT operations costs. The solution has also helped to reveal underlying infrastructure issues that affect app performance. The solution has application monitoring called Application Performance Management. It's an improvement on the old, traditional TMR. It's integrated within the TrueSight solution. It will notify regarding application performance and report issues with applications. View full review »
A lot of customers, if they're not using these products, don't know that they have an IT issue until one of their own customers contacts them and says, "I've got a problem." With TSOM, they are able to be more proactive. The IT department gets alerted more quickly, and sometimes they can resolve issues before the customer even knows that there is an issue. This solution helps our customers reveal underlying infrastructure issues that affect app performance. It has good monitoring all the way from storage up to the servers. All the things I'm seeing in the cloud now are very good as well. View full review »