BMC TrueSight Capacity Optimization Review

Enables us to right-size systems to free up resources, and identify performance problems down to the process level

What is our primary use case?

  • We collect performance data on everything, whether it's private cloud or physical servers. 
  • We provide sizing information and performance analysis, and we do forecasting and modeling for when we need more resources. 
  • We identify oversubscribed systems in the environment. Instead of buying new hardware, we often find that we have simply given systems too much processing power, and our tools let us identify that very quickly.

So one of our primary tasks is capacity management. 

We certainly use this tool extensively for diagnosing performance problems, and we're looking to leverage its capabilities as we migrate to the cloud. The tool already lets us model workloads moving from on-prem physical or virtual systems to the cloud. We're also looking at expanding into some of the new features, like Cloud Cost Control. We're very excited about the way BMC is growing.

How has it helped my organization?

On the capacity management side, just within the last month, we went back to review three systems and said, "Oh my gosh, these are way oversubscribed." We looked at their maximum needs, trimmed them down, and freed up a lot of resources. We estimate that right-sizing those systems saved us close to $100,000, because the freed-up resources meant we didn't have to buy new hardware.
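Right-sizing of this kind comes down to comparing what a system was given with what it actually peaks at. The minimal Python sketch below illustrates that arithmetic only. The inputs (an allocated vCPU count and utilization samples expressed as a percentage of that allocation), the function name `recommend_vcpus`, and the 20 percent headroom default are hypothetical assumptions for illustration, not how TrueSight Capacity Optimization computes its recommendations.

```python
import math
from typing import Sequence


def recommend_vcpus(allocated_vcpus: int,
                    cpu_util_pct: Sequence[float],
                    headroom: float = 0.20) -> int:
    """Suggest a vCPU allocation from the observed peak plus headroom.

    Hypothetical sketch: cpu_util_pct holds CPU-utilization samples as a
    percentage of the currently allocated vCPUs; headroom is an assumed margin.
    """
    if not cpu_util_pct:
        raise ValueError("need at least one utilization sample")
    # Convert the peak percentage into the number of vCPUs actually consumed.
    peak_vcpus = allocated_vcpus * max(cpu_util_pct) / 100.0
    # Add headroom, round up to whole vCPUs, and never go below 1.
    return max(1, math.ceil(peak_vcpus * (1.0 + headroom)))


# Usage: a 16-vCPU VM that never peaks above 22 percent utilization.
print(recommend_vcpus(16, [12.0, 18.5, 22.0, 15.0]))  # prints 5
```

Using the peak rather than the average mirrors the "max needs" approach described above; a real sizing exercise would also consider memory, IO, and growth forecasts.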

There are countless other examples. In our customer system, we have identified many performance problems and pinpointed them down to the process level, whether it was a memory problem or a hung-thread problem. The tool has let us get to the bottom of all of these.

We've used it on almost every system in the company, including the telephony system. Our telephony vendor is notorious with us now: we have always been able to pinpoint a memory leak that keeps getting reintroduced into their code after releases. They'll address it and get it fixed, and then we'll get a new release and, all of a sudden, the memory leak is back. Our tools allow us to see those things.

We also already do correlation analysis with the business. We know when we're expecting a campaign to increase business, and we know that our field force is going to be doing a lot more quotes. We run those analyses to make sure we have enough capacity on the floor to handle the expected new load. 

It helps maintain the availability of our infrastructure across our hybrid environment. We are now collecting on hyper-converged (HCI) environments such as Nutanix, as well as on Exadata. We have a real impact on availability because we make sure we catch anything unusual. 

One feature of TrueSight Capacity Optimization I haven't mentioned is exception reporting, which we've enabled successfully. Rather than waiting until something, like CPU or memory, is saturated, we track what's normal. For a given time slot, say every Monday at 10:00 AM, exception reporting looks at the last 30 days and establishes a norm. If this Monday is 50 percent above or below that norm, it's flagged in our exception report. It's another automated report that runs daily and is reviewed. We've caught hundreds of memory-, CPU-, and IO-related issues through exception reporting, long before they could crash systems and impact availability.
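The exception-reporting logic described above (a per-time-slot norm over a trailing 30-day window, flagged at 50 percent above or below) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not BMC's implementation: the function name `is_exception`, the `(timestamp, value)` input format, matching on weekday and hour, and using the mean as the "norm" are all hypothetical choices.

```python
from datetime import datetime, timedelta
from statistics import mean
from typing import Iterable, Tuple


def is_exception(history: Iterable[Tuple[datetime, float]],
                 when: datetime,
                 value: float,
                 window_days: int = 30,
                 threshold: float = 0.50) -> bool:
    """Return True if `value` deviates from the norm for the same weekday and
    hour over the preceding `window_days` days by more than `threshold`."""
    start = when - timedelta(days=window_days)
    # Baseline: samples in the trailing window that share the weekday and hour.
    baseline = [v for ts, v in history
                if start <= ts < when
                and ts.weekday() == when.weekday()
                and ts.hour == when.hour]
    if not baseline:
        return False  # no norm established yet, so nothing to compare against
    norm = mean(baseline)
    if norm == 0:
        return value != 0  # assumed rule: any activity against a zero norm is unusual
    # Flag anything more than `threshold` above or below the norm.
    return abs(value - norm) / norm > threshold


# Usage: four earlier Mondays at 10:00 AM averaged 40 percent CPU.
history = [(datetime(2024, 1, day, 10, 0), 40.0) for day in (1, 8, 15, 22)]
monday = datetime(2024, 1, 29, 10, 0)
print(is_exception(history, monday, 65.0))  # True: 62.5 percent above the norm
print(is_exception(history, monday, 55.0))  # False: 37.5 percent above the norm
```

Comparing against the same slot rather than an overall average keeps a normally busy Monday morning from being flagged, while still catching a quiet slot that suddenly spikes.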

As for revealing underlying infrastructure issues that affect application performance, the solution identifies whether we have a CPU or memory constraint that is impacting an application. We have also used it very effectively when a saturated chassis caused some strange IO problems. Multiple applications and servers were having problems, and it wasn't until we tied them together, realized they were in the same chassis, and stacked the performance data for those servers that we found, "Oh, we have an IO bottleneck." We used the tool to identify that and prove it to the storage teams. That kind of hardware or infrastructure limitation impacts applications, and the solution can definitely help identify it.
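The "stacking" step described above, summing the performance data of every server in a chassis and comparing the total against what the chassis can sustain, might look like the following minimal Python sketch. The function name `saturated_chassis`, the IOPS metric, the per-chassis capacity figures, and the 90 percent limit are hypothetical assumptions; the sketch also assumes all servers' samples are aligned on the same intervals.

```python
from typing import Dict, List


def saturated_chassis(server_iops: Dict[str, List[float]],
                      chassis_of: Dict[str, str],
                      chassis_capacity_iops: Dict[str, float],
                      utilization_limit: float = 0.90) -> Dict[str, List[int]]:
    """Stack per-server IOPS by chassis and return, per chassis, the sample
    intervals where the combined load exceeds utilization_limit of capacity.

    Assumes every server's samples cover the same, aligned intervals.
    """
    stacked: Dict[str, List[float]] = {}
    for server, samples in server_iops.items():
        # One running total per interval for the chassis this server lives in.
        totals = stacked.setdefault(chassis_of[server], [0.0] * len(samples))
        for i, iops in enumerate(samples):
            totals[i] += iops
    hot: Dict[str, List[int]] = {}
    for chassis, totals in stacked.items():
        limit = chassis_capacity_iops[chassis] * utilization_limit
        busy = [i for i, total in enumerate(totals) if total > limit]
        if busy:
            hot[chassis] = busy
    return hot


# Usage: two app servers share chassis "A"; neither looks alarming on its own.
iops = {"app1": [3000, 5000, 2000], "app2": [4000, 4500, 1000], "db1": [1000, 1000, 1000]}
placement = {"app1": "A", "app2": "A", "db1": "B"}
capacity = {"A": 10000, "B": 10000}
print(saturated_chassis(iops, placement, capacity))  # {'A': [1]}
```

The point of the example is that each server in isolation stays well under the limit; only the per-chassis total reveals the bottleneck.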

We believe it helps reduce our mean time to remediation. When we're involved and using the solution, we usually get to the bottom of things a lot quicker, I'd estimate at least 50 percent quicker. Sometimes people have problems and don't call us, and it can go on for hours, but it never goes on for a day anymore; we've proven ourselves too valuable. In the past, they would try to figure things out for hours, or a day or two, before somebody finally came to us. We help point everybody in the right direction very fast.

These are all real things we do every day. And that's not all; there's more we could be doing.

What is most valuable?

There are some wonderful, very rich, deeply embedded aspects of the tool that allow things like workload characterization. Workload characterization is super-important because it lets us figure out what a workload really consists of. For example, many people know that Word.exe is the executable for Microsoft Word. Everybody knows their executable, but they don't always know what it does; it also launches other things. This tool has the insight to track those things and to know: "Oh, you wanted this executable, but it started this one, so you must want this, too." It tells you what it had to add, and what the executable was using or spawning. There are so many fellow customers I know who don't use this level of the tool.

There are also the modeling capabilities in the tool. 

Then there are some of the new features, which I presented on at BMC Engage. We ingest a ton of business data. We are an insurance company, so we have business data like how many quotes are done and how many policies are sold per hour. The correlation engine in the new TrueSight Capacity Optimization components is wonderful. We can do correlation analysis over months of data, and then run models to tell our business: "If you do 1,000 more quotes an hour, we're going to have to upgrade, and we're going to need this much more hardware." 

These are wonderful things. Our business partners love it. Our infrastructure people love it because they count on us, so they aren't surprised by something like, "Oh, all of a sudden we have to go buy servers? We didn't have that in the budget." 

There are the old, rich features that are incredibly important to me, and then there are some of these new features, which are wonderful. We're very excited about the future: where the tool's headed and what it can do for us.

There is also an event component. We currently enter events manually, and they help with custom views. For example, when a software release goes out, that's an event, and all the charts and graphs for that system show it as a flag. The charting is all automatic, published on a website, and refreshed every morning. Users can see, "Oh, why is my system high? Well, you know what, it happened right after this release." That kind of event management is part of Capacity Optimization. We also enter events other than releases, such as hardware failures. We know there are also capabilities to interact with the change management system. We'd like to investigate that, but it's a down-the-road project.

What needs improvement?

Since I've already had a sneak peek at the next releases, I'm very happy about what's going to be included. 

I would like to see continued support for the legacy parts of the tool, the old, seasoned parts that are very valuable to me. That is a message I continue to give to BMC: All the new stuff's great, but don't take away this really important stuff. My biggest fear is that I might lose some of my old functionality that is still extremely valuable. I want to make sure we don't lose any functionality and that they keep delivering as they have been. I don't have anything more to ask than what they're offering.

For how long have I used the solution?

We've used this product for over 20 years.

What do I think about the stability of the solution?

As a customer of over 20 years, I'd say it's pretty stable. In fact, it's not just pretty stable, it's rock-solid. We use a lot of other tools and ingest a lot of data from them, but this is certainly the most stable of all the products and tools I've seen. We don't use any other tools specifically for this job, but we do interact with many other software vendors' products, which have far more problems than we ever have.

What do I think about the scalability of the solution?

We haven't had an issue with scalability to this point. As far as I can say, it can scale. As capacity planners, we know where the limits are and how to scale. Scaling is what we use this tool to do for everything else, so we can do the same for the tool itself when we need to.

There's a cost per license in the current model. We are actually talking with BMC about how that might look in the future. There are some opportunities there. But so far it's very scalable and BMC will work with customers if they need that help.

How are customer service and technical support?

Because we've used the product for 20 years, we have direct lines to product management and to the leads.

But I'll be honest: when we just call in to technical support, I get very frustrated. The front-line support people really don't know that much. They've got a script and they want certain information. I always supply that information when I open a case because I know they want it, and they still come back and ask for it a second time. That frustrates me. I haven't opened a case like that for a long time. I know who to talk to, and I just email them. I tend to circumvent tech support a little bit.

The fact of the matter is that we have very few problems. It goes back to stability. Really, truly, rarely do I have to go and open up anything. If I do, it's usually for new functionality. The old stuff is solid. Rock solid.

Which solution did I use previously and why did I switch?

We did not have a capacity management discipline or a capacity planning team in 1998 or 1999. Our senior management wanted us to have that team, and I was chosen to investigate and find tools. We narrowed it down to two tools that we liked very much, and BMC was the one we chose. Obviously, 22 years later, we're still using it, and the other one doesn't exist anymore.

How was the initial setup?

The initial setup was pretty straightforward. There are some technical hurdles to get over. We did write our silent installs for Windows, Linux, and Unix systems. Whenever there's a new release, a new agent, we have to redo those. But it's not the hardest technical problem to solve, if you're installing software. It's no different than any other software, in my mind.

What about the implementation team?

We did it ourselves. Again, 20 years ago, the tool looked very different in terms of how it was deployed. We have created our own deployment model and our own methodology outside and around the tools. We've developed a number of in-house scripts, which I've maintained for 20 years, that in my mind enhance how the tool collects, processes, and manages data.

Everything still works fine. We've had no problems with upgrades.

What's my experience with pricing, setup cost, and licensing?

Right now, the licensing structure is by server. Everybody is licensed somewhat differently, depending on how big they are, how many licenses they have.

Which other solutions did I evaluate?

We did a bake-off. We were buying a lot of servers from Dell at the time, and they opened up their labs to us. We brought both tools in, along with people from both companies, and ran load against some servers. We had the vendors measure the servers and forecast how they would behave and when they would break. Then we used more computers in the lab to generate more load and saw who was right.

I wouldn't say that either tool was more right than the other. What was interesting at the time was that the competitor was really new and did not share their information; they weren't letting us see how they were doing the work. There was one person from BMC, while there were three people from the other vendor. They huddled around, kept us out, and kept saying, "We'll tell you in a minute, we'll let you know." The one person from BMC we worked with was wonderful. He sat down, was open, let us play around, and explained how it worked. It worked. It has worked. End of story.

In its space, TrueSight Capacity Optimization is a ten out of ten. We all have room for improvement, but in its space, it's a ten. Believe me, I have been challenged, and we have actually done proofs of concept with today's competitors. TeamQuest is one, though it has a new name now because it was bought by another company; that's usually the closest one. Metron-Athene has also been acquired. Those are the two competitors we have evaluated. In my mind, based on how we use the product, they are maybe a six or seven out of ten.

What other advice do I have?

Ask questions, ask to meet other customers. Ask for customer references, for whatever products you're looking at. Talk to others who are using the products. If they're your peers, they know what the pains are. I'm not saying there's no pain with this product. There is pain, sure. But in its space, it's a ten.

Some of the data management is painful. Some of the new features haven't been implemented in a way that gets to the level of detail I would like. For example, the Visualizer parser doesn't extract everything it should from the Visualizer files. We've had to put in a workaround, but the workaround is not as accurate as what's in the file.

Any time you're managing large environments with thousands and thousands of servers, there's some overhead to that. That's going to be true with, and a challenge for, all tools; it's not specific to BMC.

The biggest thing I have learned from using this product is that I have a passion for capacity management and, more importantly, for performance analysis. Things have changed quite a bit over these 20 years. Servers are very different today than they were 20 years ago, there are now hyper-converged infrastructures, and VMware didn't exist 20 years ago. All of these provide different metrics that you have to look at, and this tool has grown with all of that. BMC has done a great job at partnering. When VMware first came out, BMC was quick to partner with them. VMware brought important metrics that were unlike anything you'd known before, and you had to know about them.

BMC has stayed with new technologies as they come out. Today they were talking about a new partnership with a company that is big into RPA bots.

Disclosure: IT Central Station contacted the reviewer to collect the review and to validate authenticity. The reviewer was referred by the vendor, but the review is not subject to editing or approval by the vendor.