What is our primary use case?
We use it primarily for monitoring. My organization is an application support organization and part of what we need to do is make sure that our infrastructure is running tip-top so that those applications can run the same way. We use the tool to do both application monitoring and infrastructure monitoring, all the way down to storage services and the like at the OS layer. We have full breadth and are able to triangulate what types of issues we're experiencing before our end-users experience those issues.
It monitors our entire platform. Everything in production, every single app, is monitored through the tool. As new applications come into our ecosystem, we have a process. The project team sits down with us. We talk about what the product's capabilities are. Most of the PMs already know that because they've been here for a long time. We set it up, and we move on to the next app. We're expanding it as new tools or new functionality or new applications come into the ecosystem.
How has it helped my organization?
Because we've used it for so long, we've been measuring results for eons. The standard metric that we use, given to us by our CIO, is that 70 percent or more of our outages need to be alert-driven, not customer-driven. So, if a customer calls in and says, "Hey, I'm having an issue logging in to PeopleSoft," which is one of our applications, we should have already known that there was an issue and handled the alert prior to the customer calling in.
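As a rough sketch of how that alert-driven metric could be computed (the outage schema below is hypothetical, for illustration only, and is not anything pulled from TSOM):

```python
def alert_driven_ratio(outages):
    """Fraction of outages detected by an alert before a customer called in.

    Each outage is a dict with a 'detected_by' field of either
    'alert' or 'customer' - a hypothetical schema for illustration.
    """
    if not outages:
        return 0.0
    alert_first = sum(1 for o in outages if o["detected_by"] == "alert")
    return alert_first / len(outages)

outages = [
    {"id": 1, "detected_by": "alert"},
    {"id": 2, "detected_by": "alert"},
    {"id": 3, "detected_by": "customer"},
    {"id": 4, "detected_by": "alert"},
]
# 3 of 4 outages were alert-driven: 0.75, above the 70 percent target.
ratio = alert_driven_ratio(outages)
```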
A decade ago, we were using Microsoft's and HP's product sets to monitor, but it was disparate. The alerts weren't aggregated and we never knew who they would go to. Therefore, we missed a lot of opportunities to be proactive in our organization. Hence the reason we moved to the product which, at that time, was called ProactiveNet - and then it became BPPM and TrueSight, as it is today. We were able to flip that situation and we have been able to meet that metric for five years running. We had one blip in the year prior to that, and in the years before that, we were knocking it out of the park. So our metric is whether we get the alert before someone has to call in, and we're successful in meeting that some 80 to 90 percent of the time.
In addition to that, when we look out across the industry, most organizations have anywhere from five to 15 people who are dedicated to monitoring. We have two. We're able to run the entire stack, along with its complementary adjacency tools, with two people. That was one of the many reasons that we made the migration from other products to ProactiveNet/BPPM/TSOM. At that time, we were a one-man band and really needed to be able to move quickly, but also to maintain a product without requiring tons of manpower to make it work. The improvements that BMC has made over the last two to three years are really revamping and consolidating the console so that it is truly a single console that you can run with a single individual, should you need to.
We have 342 apps in our ecosystem and my team manages around 280 of those from a support-platform standpoint. And because we have two individuals who are dedicated to the monitoring, they partner with the rest of our admin organization to drive exactly how things need to be alerted. We review them quarterly. That is a testament to a really solid product - that it only takes one or two people to really run the thing and administrate it, versus having an entire staff and that's all they do.
The solution provides a single pane of glass where we can ingest data and events from many technologies. I am one of the few, at least according to BMC, who has screens up in my hallways, and I show our top 20 applications from a criticality standpoint - what's most important to our organization, things that I have to run. Everyone sees what's up on those boards every day. I go to it two or three times a day. Because we have that single pane of glass, we see where we're having issues organizationally and we're able to rally resources - whether it's engineering, operations, or our development group - and solve the problem and get those things from red/yellow back to green/blue. The single pane of glass was a key piece of what we needed to have to be successful as a monitoring organization.
In terms of the availability of our infrastructure, ours is not a hybrid environment, per se. We don't really measure and/or monitor - because of legalities with most of these SaaS providers - how well their systems perform. But what we do is measure any of the interfaces that touch or route to those applications, and we have an uptime measurement of about 99 percent for most of our apps. We have a dashboard for that which is managed out of the ITSM group. They partner with us and pull all of our monitoring data to figure out two key metrics: total uptime and uptime excluding maintenance. Those are the two keys which enable us not only to showcase to our customer base how well the systems are performing, but how often they really are available.
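A minimal sketch of those two metrics, assuming downtime and planned maintenance are tracked in minutes (the function and the sample numbers are illustrative, not our actual dashboard logic):

```python
def uptime_metrics(period_minutes, downtime_minutes, maintenance_minutes):
    """Compute the two key availability metrics.

    total uptime: fraction of the whole period the app was up.
    uptime excluding maintenance: the same, but planned maintenance
    windows are removed from both the downtime and the period.
    """
    total = (period_minutes - downtime_minutes) / period_minutes
    adjusted_period = period_minutes - maintenance_minutes
    unplanned_down = downtime_minutes - maintenance_minutes
    excluding_maintenance = (adjusted_period - unplanned_down) / adjusted_period
    return total, excluding_maintenance

# A 30-day month is 43,200 minutes; say 600 minutes of downtime,
# 400 of which were a planned maintenance window.
total, excl = uptime_metrics(43_200, 600, 400)
# total uptime is about 98.6%; excluding maintenance, about 99.5%.
```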
BMC has helped to reveal underlying infrastructure issues that affect app performance. Four years ago, PeopleSoft was running slow in regard to our payroll run. We run payrolls weekly. If you know anything about payroll, you've got to hit a certain deadline and be able to send the check file to the bank for those direct deposits to show up in people's bank accounts. It's a really sensitive issue when people don't get their checks. With the monitoring tools, we were able to triangulate that it was not an application issue but that it was actually a storage issue. Our solid-state storage was having a firmware issue which was causing slow turnover for the IO, and therefore it was slowing down the entire process of payroll. We were able to triangulate that that was the issue, decide what we needed to do - which was move the storage so that the application could continue to perform. We met the need and were able to get the payroll cut just in time so everyone could get their checks. It was a big win.
As for reducing IT ops costs, year over year, my operational expenses grow by three percent, which is mostly salary increase. I've gone from 12 resources to roughly 55 resources organizationally, while growing from 80 apps to 280 apps over the last eight years. Our operational costs have only gone up because of the use of licenses, not because of human capital. The tool has helped us work smart, not hard, and leverage the technology. We haven't necessarily needed to grow our operational expenses to accommodate the new functionality or the new applications which come into our ecosystem. We just set up the monitoring and it does its thing.
What is most valuable?
The solution's event management capabilities are fantastic. We take a best-of-breed approach. If, on the network side, they use a different tool, we pull all that data in so that we have a single console. It's kind of like the monitor of monitors. We're able to aggregate all the different types of data sets, whether it's log data, app data, OS data, infrastructure data, or network data. We're able to aggregate all those events, correlate them, and be able to say we're having an event. Just because we have one or two alerts doesn't necessarily mean that we're having an event. It's when we get several of those that "trip the wire" that we're able to say, "Okay, we are having an event." And the tool allows us to aggregate all of that so that we're managing event-driven versus alert-driven.
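That "trip the wire" idea - several correlated alerts within a short window constituting an event - can be sketched like this (the alert schema, the threshold of three, and the ten-minute window are all hypothetical choices for illustration, not TSOM's actual correlation engine):

```python
from collections import defaultdict
from datetime import datetime, timedelta

def correlate(alerts, threshold=3, window=timedelta(minutes=10)):
    """Declare an event for an app when `threshold` or more related
    alerts arrive within `window`. Alerts are (timestamp, app, message)
    tuples - a hypothetical schema for illustration."""
    by_app = defaultdict(list)
    for ts, app, msg in sorted(alerts):
        by_app[app].append(ts)

    events = []
    for app, stamps in by_app.items():
        for i, start in enumerate(stamps):
            if len([t for t in stamps[i:] if t - start <= window]) >= threshold:
                events.append(app)
                break
    return events

alerts = [
    (datetime(2020, 1, 6, 9, 0), "PeopleSoft", "slow logins"),
    (datetime(2020, 1, 6, 9, 2), "PeopleSoft", "high IO wait"),
    (datetime(2020, 1, 6, 9, 5), "PeopleSoft", "batch queue backlog"),
    (datetime(2020, 1, 6, 9, 0), "Exchange", "disk 80% full"),
]
# Three PeopleSoft alerts inside ten minutes trip the wire and become
# an event; the lone Exchange alert stays a low-level alert.
```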
The breadth of the solution's monitoring capabilities is also fantastic. A lot of IT organizations that I talk with use a conglomerate of tools to manage their monitoring and it ends up being pocketed. We don't have that problem because we are using it as the monitor of monitors and therefore we are able to take advantage of all of its bells and whistles. As well, we can feed in additional alert data, crunch that, and react appropriately and accordingly, proactively versus reactively. We'll get several low-level alerts saying, "Hey, this may be an issue," and we're able to proactively look at that before it becomes a critical outage. We use almost every aspect of the tool, with the exception of some of the automation because we haven't gotten there and found the need for it. But we're rapidly starting to take advantage of those pieces as well.
A use-case example would be if we have a drive filling up on a particular server for a particular application. If that's a known issue, we can actually orchestrate through the automation component of TSOM to be able to say, "Hey, when we see this type of alert, go try one of these three things and if that fixes the problem, go away. And if it doesn't, go ahead and escalate that as a ticket and we'll have a human go touch that server and remediate the issue." So we're right on the cusp of beginning that journey.
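That runbook logic - try a few known fixes, escalate to a human only if none works - can be sketched as follows. Every function name here is hypothetical, for illustration; none of this is TSOM's automation API:

```python
def remediate_disk_full(server, actions, open_ticket):
    """Try each known remediation in order; if none clears the alert,
    escalate a ticket so a human can go touch the server."""
    for action in actions:
        if action(server):       # action returns True when the alert clears
            return "resolved"
    open_ticket(server)
    return "escalated"

tried = []

def rotate_logs(server):
    tried.append("rotate logs")
    return False                 # didn't free enough space

def purge_temp(server):
    tried.append("purge temp")
    return False                 # still above the threshold

def expand_volume(server):
    tried.append("expand volume")
    return True                  # alert cleared

result = remediate_disk_full(
    "app-srv-01",
    [rotate_logs, purge_temp, expand_volume],
    open_ticket=lambda s: tried.append("ticket"),
)
# The third remediation succeeds, so no ticket is opened.
```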
In addition, the entire root-cause analysis functionality within the tool is quite useful. It really comes down to how admins want to leverage it. There are what I call "old-school admins" who want to get on the box and solve it themselves. Then you have the "new-school admins" who go straight to the monitoring tools. It clearly shows you the root-cause analysis - this is the probable cause - and then they're able to go remediate it more quickly. We use that extensively within the operations team and the products team, which is the team that I own. I don't think the engineering team is quite there yet, but they're beginning to see the value of wanting to see that data and start using the tool themselves.
Regarding mean time to remediation, when I took over this organization, I and the rest of the group were working about 100 hours a week, just trying to keep our major systems running. It wasn't until eight months later, when we actually implemented a more mature monitoring system, that we turned the corner and people were working 60 hours. And now it's somewhere between 40 and 50 hours a week, which is much more maintainable and realistic in the industry. We were doing everything we could to keep those systems running, and we had no idea what would be in the next box of chocolates that we would open up, back when we first started this. There's a direct correlation with TSOM and the BMC product sets that have helped us be successful in working smart and not hard, like we did back in the day.
What needs improvement?
Specifically around application performance monitoring, BMC is definitely not the market leader. The Dynatraces, the New Relics and the like are more of the market leaders in that space. I would like to see them grow that space a little bit more aggressively. It has not really been their bread and butter.
They've been highly focused on their cloud initiatives. I don't know anyone in the industry who has solved how to monitor cloud, SaaS-based systems, because all of those systems are usually linked through other systems. That would be another area where it would be nice to see if they could find innovative ways to do that.
The third piece would be around out-of-the-box automation. We all have particular types of alerts and events where all we really need to do is be able to turn the functionality on versus creating the functionality. BMC is already addressing that in many cases.
For how long have I used the solution?
We've used it in probably three incarnations of what it is today, so it's been about ten years.
What do I think about the stability of the solution?
We don't have any issues. We're in an HA format, so if we do have any issues, things fail over quickly and we don't miss a beat. It's the heartbeat of our products, the fact that we provide monitoring services to our businesses, so monitoring can't be down. It can't have a bad day. TrueSight Operations is a highly stable product. It is a beast. It runs really well. There isn't a lot of care or feeding that we have to do to keep it healthy.
What do I think about the scalability of the solution?
It's highly scalable. We continue to add more servers and more applications within the ecosystem easily and quickly. We continue to review all of those quarterly to make sure that the way that we've tuned the monitoring is still accurate and that it's meeting the needs of both the admins and the business.
How are customer service and technical support?
We have a great relationship with BMC. We're probably different than the average bear. We've got a great account team. When we call customer support, we get answers pretty quickly. We don't have to call them very often, which is a good thing for any vendor. You don't want to have to call support a lot. But when we do, it's usually because we can't figure it out and we're able to get the answers pretty quickly through their organization.
Which solution did I use previously and why did I switch?
We used HP and then we used Microsoft System Center Operations Manager (SCOM).
How was the initial setup?
Back in the day, the initial setup was very complex. As it stands today, upgrades are really very easy. It's basically just a matter of refreshing old hardware, turning the system on, and making sure that it picks up all of the agents. Setting up today is infinitely more simple than it was even three or five years ago.
BMC is innovating even further and working towards containerization so that we won't have to do upgrades anymore. We'll just overlay. They've really taken into account how to consolidate consoles so that there aren't so many bits and pieces. That has made it easier for them to do upgrades. Installing the system or deploying the system only takes a couple of weeks in an organization of our size, where it used to, when we originally did it, take four months.
For the latest one that we did, we had all the technical bits and pieces done within four weeks. Then we slowly rolled it out as we sunsetted particular agent groups. The total roundtrip was six months to have it fully deployed, embedded, and working in the system.
At this point, we do an upgrade every three years, and every five to six years we're upgrading our hardware. This year we actually went fully virtual. Our engineering organization still takes a good bit of time to build servers. We were able to get virtual machines within weeks of the initial setup of the product, and we were able to roll to virtual machines, versus physical machines, relatively simply. It was basically a point-and-shoot install. We pulled over all of our policies and procedures that were already canned - and that was another thing that was more of a challenge in years past because we would have to redo them. This time, all that got pulled in and we were up and running within weeks.
What about the implementation team?
We partnered with BMC this time. Typically, we use a third party, but in talking with BMC about where we were at - as we use them primarily for consultative services - we said, "Hey, what's the best way to go ahead and do the upgrade and the migration?" They gave us the cut plan and then we actually did the physical work ourselves, which saved us some $200,000 in project fees.
With two guys running the system day-to-day, and consultative services from BMC to tell us, "Okay, this is how you do it," we were able to execute both the upgrading project, as well as administrating the product, while still running on the old system. It says a lot about the product's ease of use and capabilities.
Now, my guys are really smart and I'll give them all the credit. They're smarter than the average bears. But the reality is that it's rare to find a product where the people who are running it can be doing a major upgrade at the same time.
What was our ROI?
The very fact that we've been on it for ten years is a testament. We continue to make the investment. We continue to pay the renewal because the return has been fantastic. I don't have any specific data points other than the fact that we've been on the product for ten years. There's a reason for that.
What's my experience with pricing, setup cost, and licensing?
There are no costs in addition to the standard licensing fees. It's a straightforward contract.
Which other solutions did I evaluate?
Every three years, we reevaluate the space. That's just part of the culture that we've established. No one tool stays forever at the top, but BMC's monitoring capabilities and their discovery asset tools are top-of-stack, typically, in any of the research that we do. We continue to use them and we continue to have a great relationship with BMC.
What other advice do I have?
Keep it simple. Make sure that you understand, architecturally, how your applications and your data center are set up. It makes your life easier to know exactly what you're going to need to monitor.
The biggest lesson I have learned from using this solution is to really take full advantage. I joke with the BMC guys that TSOM is like AutoCAD, the engineering tool that people use to design and draw. We only scratch the surface of its full capabilities. The thing that I've learned is that it's a good idea to take advantage of all the bells and whistles as quickly as you can because it really pays dividends to do so.
We are using a little bit of the solution's machine-learning and analytics. That's an adjacency tool called IT Data Analytics and we feed that into our overall, single pane of glass monitoring. I don't know that we've taken full advantage of that quite yet. It is on the roadmap. We'll probably get to that, realistically, next year and in '21, where, as we're seeing those analytics, we will actually link automation to it. So when we see something we'll actually do something. We're a fairly small shop and therefore scale is not an absolutely necessary thing, but it is something that we are striving to move towards. It has affected our application performance in bits and pieces. It's not something that I'd wave the banner on quite yet. We have pocketed instances where ITDA has come back and told us that there was an issue, and we were able to remediate proactively versus reactively. I don't know that we're leveraging the tool's full capabilities where I can say that I have a use case where this was a big win for us.
I don't think that the monitoring tool, TSOM itself, has created or helped to support any business innovation.
As for users of the solution, I have the two admins and then I have, say, half of my organization that consumes it as a tool, so there are about 12 to 15 users. Each of those people is an application admin. Their primary responsibility is the applications that they support. The monitoring is a tool for them to use to ensure that those systems are healthy and top-notch.
I have a senior manager who manages the space. He also manages our asset-discovery tools along with all of our web and third-party space. He is a busy guy but it's all managed under one leader. There are the two folks who administrate it. It's really a very small human-capital resource footprint, in comparison to what it does technologically.
I give TrueSight Operations a nine out of ten. There are always bits and features from other products that we wish we would see in it. Usually, we see them pretty quickly.