We use it for business service and infrastructure monitoring. We use the full gamut of utilities from them and monitoring in the platform.
We don't use APM. We used to. We line-item nixed that for various reasons a few years ago. We also don't use the ITDA, their next-gen log monitoring tool. So we're truly just within the TSOM interface, as well as doing synthetics. That being said, the Knowledge Modules that BMC brings to the market are what make the implementation possible across our varied infrastructure and applications. It's critical to have those Knowledge Modules. If we had to write things ourselves, or to use a more generic monitoring environment, and then build additional scripts on top of that to monitor the Kubernetes of the world, or the WebLogics of the world, or the Oracles and SQLs of the world - if we had to write scripts ourselves to bring back particular monitoring components and performance metrics and so on - that would be a heavy burden that would keep us from implementing. We don't often run into something that we haven't been able to monitor. It's just a matter of getting people to the table to tell us what they need.
When it comes to incident management, we get most of our data from TrueSight, log data, because we don't use the ITDA interface. It would be an effective interface, but for logging we go to our SIEMs, since we're already pumping data to another system there. But TrueSight definitely gives us a view into the health of our business services, which is our primary goal for implementing monitoring.
We try very hard not to use event management. What I mean by that is that we do not have a typical NOC. We don't have ten people staring at screens and then escalating as necessary. Along those same lines, we don't spam our incident management environment with events from TrueSight. With a lot of customers I've met over the years, that's essentially the old school way of doing things. Instead, we create events that are truly actionable. If we don't have an actionable event, we don't create it. We use their baseline technology to ensure that we're only sending items that are either about to have a problem or have passed the threshold of having a problem. If you're talking about typical event management, where you create an event and it gets forwarded to some other system, there's a notification about it somewhere else - the whole ITSM cycle - we don't use it for that. We use it for creating smart events that create alerts directly to the teams responsible. As I described before, we have many distributed teams rather than a centralized NOC.
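To illustrate the "actionable events only" idea described above, here is a minimal sketch. It does not show how TrueSight computes its baselines, which this review does not describe. As an assumption for illustration, the baseline is a band of mean plus or minus `k` standard deviations over recent samples, and the names `baseline_band`, `classify`, and `hard_threshold` are hypothetical.

```python
from statistics import mean, stdev

def baseline_band(history, k=3.0):
    """Return an assumed 'normal' band: mean +/- k sample standard deviations."""
    mu = mean(history)
    sigma = stdev(history)
    return mu - k * sigma, mu + k * sigma

def classify(value, history, hard_threshold, k=3.0):
    """Return 'critical', 'warning', or None when no event should be created."""
    if value >= hard_threshold:
        return "critical"   # has passed the threshold of having a problem
    low, high = baseline_band(history, k)
    if value < low or value > high:
        return "warning"    # outside normal: may be about to have a problem
    return None             # within normal, not actionable, so no event

# Hypothetical CPU-utilization samples (percent) for one host
history = [40, 42, 41, 39, 43, 40, 41, 42]
for sample in (41, 60, 95):
    print(sample, classify(sample, history, hard_threshold=90))
```

Only the `warning` and `critical` results would be routed to the owning team. Everything else is dropped rather than forwarded, which is the difference from forwarding every event into an incident management system.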
In terms of TrueSight helping to maintain the availability of our infrastructure, it's an interesting question because of our distributed systems. We have 8,000 hosts across about 40 different teams, and we have 600 different applications that we run. For those critical tier-one apps, teams are highly involved in their day-to-day operations and watching them very closely. Having those two things - the actionable alerts and the ability to see what the health of their system is at any given time, and to be able to check it against what normal looks like for those applications - gives the teams that use it in such a manner the information they need to be confident that their availability is as it needs to be, or better. As far as a hybrid environment goes, we have our own hosting environment because we are the cloud to our clients. So we're not necessarily in that situation. We don't use assets other than what's in our hosting environment.
If, in the past, one of our biggest problems was just plain old infrastructure incidents, basic availability incidents where a server or an application, an interface or an endpoint, may not have been available and no one noticed it until some downstream, business end-result brought it to our attention, we've essentially eliminated 90 percent or more of those. It has been at least three years since we've done any numbers. But at the time, we might have had ten to 15 Sev-One incidents a month. When we last measured it, we were down to one. That was within a couple of years of implementing an enterprise monitoring strategy.
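As a quick check of the figures above, the drop from ten to 15 Sev-One incidents a month down to one works out as follows. The helper name `reduction_pct` is just for illustration.

```python
def reduction_pct(before, after):
    """Percentage reduction from 'before' to 'after'."""
    return 100 * (before - after) / before

low_end = reduction_pct(10, 1)    # starting from 10 incidents a month
high_end = reduction_pct(15, 1)   # starting from 15 incidents a month
print(f"{low_end:.1f}% to {high_end:.1f}%")
```

That gives roughly 90 to 93 percent, which is consistent with the "90 percent or more" estimate.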
As for root cause, when a team is engaged in monitoring to its full extent, we're usually able to get to root cause pretty darn quick. For example, if a team has many servers that could potentially be impacting an application or a business service, tracking something down across those multiple servers and multiple owners could be really tedious and time-consuming. It would be on the order of hours, or at least many minutes, depending on the scope of the issue. With well-implemented monitoring, for our Sev-One apps, they're able to get to the solution almost immediately. If we have monitoring set up properly, the actionable event will tell them precisely where a critical component has failed and they can resolve it. Where it's a different type of incident that we might not have a particular monitor for, they're able to use the performance data, availability data, and other related alerts to get to their issue much faster than they used to. Having a good monitoring implementation has made a world of difference to our operations teams. It's so much so, that if you think back five years, which is an eternity in the IT world, when there was a Sev-One incident back then, someone would walk around tapping people on the shoulder all over the floor. That was very time-consuming. But now they're able to collaborate quickly and say, "It looks like this is the problem right here," in a well-monitored environment, and get right to the root cause.
It's improved our mean time to remediation, and I'm being conservative here, by about 70 to 80 percent. That's an absolutely huge impact.
We have many operational teams, and for any given team their requirements are different. One team is more reliant on infrastructure monitoring, because they are processing-heavy. Another team might be more reliant on endpoint monitoring where we're ensuring that the third-party endpoints they rely on are up and available. Another team may have fairly immature applications, so that they would rely heavily on log monitoring to catch all the errors that may come up. From a consumer-function standpoint, there isn't any feature that stands out. They're all important because all of our consumers are important.
From an administrative standpoint, what stands out in TrueSight is the ability to implement quickly. When they have a requirement to monitor something, we're able to turn that on quickly in their environment. We're able to set up new apps within a day. Most of the work in monitoring is working with the teams, evangelizing, educating, and making sure that they're bringing their smart requests to the table so that they get visibility into their business service. If the implementation wasn't as easy as it is, it would hinder and probably decrease the adoption of monitoring. But because we can turn requests around pretty quickly and adjust things as teams need adjustment for their different release schedules, administratively, we're able to respond and keep pace with the business and the technology that they're implementing. That is a critical function for us.
Stability is one of the areas where we have identified challenges with TrueSight, and those are details that I'm not at liberty to share at this point.
We've been able to implement all the hosts that we care to implement on a couple of servers, with minimal maintenance. We don't use their high-availability solution. We don't really require it because the underlying infrastructure is relatively robust. We haven't had any problems with the scalability. Had we been a couple of times larger, there would've been more to implement server-wise.
The other thing about our implementation is that we send a lot more performance data to our implementation of TrueSight than the typical BMC environment might. We send everything server-side for analysis rather than keeping everything agent-side or emphasizing agent-side, as I've seen a lot of other clients do. I think the tide is turning. I think more people are doing what we're doing where we just push all the data for potential analysis. But we've been able to accomplish what we need without too much infrastructure.
They had an advisory board. We, as a group, and even I specifically, had been asked by them what they needed to continue doing. One of those was continuing to build out Knowledge Modules in various technologies. Some of the ones BMC has made available, we've implemented, and some of the ones BMC has made available don't impact us and we haven't implemented. But I've been in discussions where they say, "What do we need to do," and Knowledge Modules is one of those areas where they've made a commitment to continue adding to them, and we appreciate that.
When we first started, we did not have a monitoring program at anything resembling an enterprise-type level. We were at about 4,000 hosts and we were really not monitoring anything except for a few services. At that, it was bare-bones monitoring. We monitored, maybe, half of our environment at bare-bones.
We went on this journey six-plus years ago to have an enterprise monitoring solution that focuses on business services. One of the reasons we did that is because of the number of incidents that we had that really should never have happened. Now that we're a number of years in, and we've implemented monitoring and brought teams around in the direction of business service rather than just an executable's use of a CPU, we have far fewer incidents.
As a general trend, we're much more capable of seeing what's out there and monitoring what our issues are and taking care of it before the business incident occurs. I don't have any particularly recent examples where our monitoring was able to resolve an incident after it happened. Of course, I don't get notified when people say, "Oh, look, I resolved this," because it's part of their daily operations to find an issue and resolve it. So it's not necessarily a newsflash anymore for us.
It doesn't happen quite as frequently as it used to, but they continue to build Knowledge Modules every time there are new products on the market. They need to create Knowledge Modules for the implementation to be enhanced. That's one of the key features of Operations Management. That's definitely something that helps us take advantage of everything BMC has. They're not resting on their laurels. They're building things out.
The complexity of our environment demanded the complexity of the implementation. More than half of the effort that we had in implementing monitoring was based on the way we did our program. We were basically starting at zero and bringing teams up to speed, evangelizing, educating, getting people onboard.
The implementation of TrueSight itself was just a software implementation. It had its bumps and bruises. None of us were versed in BMC software. There were some learning curves as would typically be expected for any application of this scope, magnitude, and impact.
We had an overall strategy of doing proofs of concept for various, widespread technologies. We took that success and did a wide-to-narrow type of advertisement. We told everybody what was going on and then we brought more specific people into the room and said, "These are good targets for you to implement." During and after that evangelizing and advertising, we started implementing tier-one applications as an onboarding effort. We did that in a deep-dive fashion where we would sit down and interview these teams and really come to understand what makes their business service tick. A lot of our evangelization effort was actually in changing the focus of operations teams to think from a business service perspective. That paid dividends later when people were more interested in monitoring the actual functions of their applications rather than just the infrastructure of their application. We've been able to change mindsets over the course of a number of years. The first two or three years we were doing implementations. That was when we did most of that work.
From there, we worked as much as possible to allow folks to implement their own where possible, rather than centralizing it, so that people could keep up with their own demands. We were somewhat limited in TrueSight due to some of the RBAC controls not quite being what we wanted as far as delegating out administrative privileges for implementation. But because we were able to turn requests around pretty well, that burden wasn't too heavy.
From tier-one apps, we kept going and kept educating, bringing people to the table. When new applications come to our company, we still reach out and educate new teams, bring them to the table and use the onboarding process we built and solidified over the course of the first couple of years.
During the first three years, we had two-and-a-half FTEs for implementation. That was for the full program, not just the TrueSight component. It included all those interviewees, all those educational components, all the training, etc. The full program. The actual pressing of the buttons was about half of that. Once you stand it up and start connecting things, it's a matter of administratively using the tool to execute.
Typically, our company builds knowledge for implementing infrastructure/operations activities like this from the ground up. We did not use a third-party. BMC was instrumental in our success in that they made resources available to us, implementation-wise as well as development- and support-wise.
The solution hasn't helped reduce costs in a measurable fashion. That's a measure that we wouldn't undertake. There might be soft-cost benefits, such as
Life at our company as an operations person is nicer now because you have confidence that what you're doing makes a difference, that the business service that you're working on is healthy. The business is happier when we're able to talk to them intelligently and say, "I can actually show you that we've been up and successful."
It has helped in our ability to work on smarter things rather than silly incidents. If we eliminate incidents, then we're doing better work. We're able to do the good work of business rather than the sad work of recovery. That's not only quality of life but it's also the ability to get things done. So I know that, at some level, we're doing more with less because of our monitoring. But we don't have any hard numbers from a monitoring perspective.
We're end-of-lifeing it now. Overall, the licensing costs of BMC are a challenge for us in that they're hard costs, whereas open-source monitoring has soft costs, where it's harder to line-item. It's harder to see the cost of implementation for other things. So that change of direction is taking place. It doesn't mean the cost isn't there; it's just soft dollars rather than hard dollars.
We looked at Microsoft SCCM. And, because we had a partnership with CA, we looked at their tools. There were a couple of other minor players we looked at which just didn't have the scope of what we needed to do, because of the breadth of technologies that we use. In the bakeoff, we came down to BMC and Microsoft.
It was a long time ago, so I don't know that it's fair to judge at this point, but from a monitoring perspective, the whole Microsoft suite really wasn't there. There was a lot of scripting. It was easy to identify that the administrative burden was going to be high in that implementation. Conversely, with the BMC stuff, out-of-the-box, administratively, you click and implement. That is one of our components of success, our ability to implement quickly.
On the soft side, BMC as a partner was much more interested in our success than the Microsoft folks were at the time. It's very hard to quantify unless you're there sitting in front of them at the table and working with them, consuming their knowledge. It really is a great partnership.
BMC is at a critical point in redefining TSOM, how it's built. Anybody looking at BMC now needs to jump on the new version of TSOM and skip the current versions. I would wait until their new environment is ready. It will be containerized. Anyone implementing BMC can get used to the environment in a PoC but they shouldn't implement until their new stuff is out. I expect it to be that much different.
Make sure that you have stakeholder buy-in and that they are able to provide the resources with the correct knowledge to implement in a smart fashion. Everybody's definition of "smart" is going to be slightly different. We really home in on the business service side to make sure that our business functions are healthy and that we're able to understand what's normal and what is out of normal. We work with the teams, even from the point that they're in development of projects, to make sure we're ahead of what's going on rather than reactive. But that means the buy-in of multiple teams: development, operations, support. That amount of effort requires stakeholders with decision-making capabilities to say that it's a priority for them.
We knew up front - and we've been able to validate our assumption - that monitoring doesn't do any good unless you are analyzing your business service for what are the critical components to observe. That's an educational effort and an implementation project. It's that upfront effort that will make your monitoring successful. Where we've been able to engage teams and teams have remained engaged, we've been the most successful in that. We took that to heart upfront, we made that part of our route to success, and we put the effort in. Our monitoring's been successful because of that. If we didn't do that, and we didn't constantly engage teams to make sure that they were aware of capabilities including the ability to give us feedback, and that we can implement quickly, we wouldn't be here. We wouldn't have advanced as far as we have. Most of that advancement was in the first two or three years, and we've just been riding that wave of success since then.
Keep in mind that most companies don't go from nothing to an enterprise monitoring solution; they go from one monitoring solution to another. But if there's anyone in the boat that we were in, where they are the size we were with no monitoring solution, they'll be in the pain that we were in. Implementing a good monitoring program, not just the tool, but a program around it, can make a world of difference to the operations teams, and subsequently to the business as well.
For those teams that are utilizing TrueSight, they don't rely on other monitoring environments. Some of those teams rely on those actionable alerts almost exclusively, and don't really use TrueSight's single pane of glass. We do have some teams that consume TrueSight and use it on a daily basis to ensure that they don't have any events, whether or not they've risen to the level of action. They'll also proactively look at some components, either business function components or infrastructure components, to ensure that they're working as designed and within the parameters of normal.
I don't think the functionality of Operations Management helps to support our business innovation. Our business runs forward and headlong into innovation, regardless of whether or not IT can keep up. We were never an impediment, other than cost. The way we run our overall IT environment is very open and flexible. Monitoring is a way for us to give business the confidence that what we're implementing is healthy, but it doesn't impact their interest in being able to implement what's new. They've always been able to do that and continue to be able to do that.
In terms of machine-learning, I mentioned above the baselining which, depending on how it's implemented, might be called machine-learning, but in TrueSight they just have a straight calculation-type of activity. We have other monitoring solutions that we're implementing as well, and that topic may be more applicable to them, but not in the TrueSight world. The TrueSight world is a straight application implementation. It's nothing exciting on that end.
I have to give our BMC partners a lot of credit for where they're planning to take TrueSight based on their roadmap, although it is speculative. I don't think the areas for improvement from us would be any different than anything they've already heard.
If someone were to implement the full suite of BMC products, you'd have to give it a nine out of ten. TSOM by itself, I have to give it a seven out of ten.