
BMC TrueSight Operations Management Overview

BMC TrueSight Operations Management is the #2 ranked solution in our list of top Event Monitoring tools. It is most often compared to BMC Helix Monitor.

What is BMC TrueSight Operations Management?

TrueSight Operations Management is a broad ITOM solution that delivers performance monitoring, event management, end-user experience management, AIOps use cases, and automated remediation and ticketing. It deploys machine learning and analytics to dynamically learn behavior, correlate, analyze, and prioritize event data so IT operations teams can predict, find, and fix issues faster across complex, hybrid environments. TrueSight Operations Management provides a converged view of application and infrastructure performance across physical, virtual, multi-cloud, and container environments. With visibility into web, mobile, and on-premises applications, TrueSight provides the insight IT operations needs to deliver high-quality digital services quickly and effectively enough to keep pace with business demands. TrueSight helps IT ensure that the applications and services that drive the business continue to perform optimally by examining operational norms, automatically revealing abnormalities, measuring service impact, and proactively identifying risk.

BMC TrueSight Operations Management is also known as ProactiveNet.

BMC TrueSight Operations Management Customers

Ensono, Transamerica, Boston Scientific, Park Place Technologies, inContact, TD Ameritrade, PNC Bank

Archived BMC TrueSight Operations Management Reviews (more than two years old)

JB
Monitoring Architect at a manufacturing company with 10,001+ employees
Real User
We have reduced headcount and shrunk the mean time to resolve

Pros and Cons

  • "We have one application, which is fairly large. In the past, we had Level 1 and 2 NOC support teams who were responsible for watching dashboards. When they saw an issue in the application, they would call Level 2 or 3 support and escalate the call, if necessary. Now, through the use of this product, we have been able to reduce the headcount by five people, as we are able to eliminate the eyes on the glass. We no longer have people watching the dashboard. We have events which are processed automatically through the system and get to the right people. We had six people in L1s, and now have one. So, we reduced five out of six headcount, which is pretty significant."
  • "In a large company of our size, we need multiple people in our company trained. So, I have to take the training classes. Then, I have to go and train the rest of my organization. I would prefer to say to the other people on my team, "Go to this link and..." Or, "Here's a list of training sessions that you can go to which are online and that are free." I think it would help the adoption of their product in the marketplace, personally."

What is our primary use case?

From a senior management perspective, they want to understand, when there is an outage, what the impact of that outage is across the entire suite of the company's products. We have an Event Manager that integrates all of our monitoring tools. Since we are a large company, we have about 26 different monitoring tools in use. The idea is to get all of them into a framework that can feed a model displaying the impact of an outage.

How has it helped my organization?

We have one application, which is fairly large. In the past, we had Level 1 and 2 NOC support teams who were responsible for watching dashboards. When they saw an issue in the application, they would call Level 2 or 3 support and escalate the call, if necessary. Now, through the use of this product, we have been able to reduce the headcount by five people, as we are able to eliminate the eyes on the glass. We no longer have people watching the dashboard. We have events which are processed automatically through the system and get to the right people. We had six people in L1s, and now have one. So, we reduced five out of six headcount, which is pretty significant. 

Also, the average length of time used to be 45 minutes before we had the right engineer on the line, fixing the problem. Now, it's probably three to five minutes.

The solution has affected our end-user experience management very positively. Our application teams are very excited about what we're doing with the reduction in headcount. More importantly, the automation it has brought has streamlined so many manual tasks. The teams are very happy with the way things are going.

The solution will help us maintain the availability of our infrastructure across a hybrid or complex environment. Right now, we can get to an event scenario or problem quicker than we used to. We are right on the cusp of releasing our service impact modeling. This will help us tremendously because we have a multicloud as well as an on-premises environment. Any component should show the impact across its applications, regardless of where it's located. It has definitely helped in these environments.

We have improved our ability to get to a root cause because of the way their tools work. When a problem happens, it lights up the affected part of the model in red. If you follow the tree down to its lowest member, you'll see which component sits at the bottom. So, if a database is saying, "I'm out of disk space," it may create all kinds of chaos above it. Following that tree down, you'll see the lowest level is the database server, and it has a disk-space event. Right there, that's the root cause of all your application issues. So, it has helped us get to the root cause more quickly.
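
To illustrate the idea, here is a minimal sketch of walking a service-impact tree down to its lowest impacted member. The node structure and event names are hypothetical and are not BMC's actual impact model:

```python
# Hypothetical illustration: walk a service-impact tree downward and return the
# deepest impacted node, which is treated as the probable root cause.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    name: str
    impacted: bool = False          # "lit up in red" in the model
    events: List[str] = field(default_factory=list)
    children: List["Node"] = field(default_factory=list)

def probable_root_cause(node: Node) -> Optional[Node]:
    """Follow impacted branches to the lowest impacted member of the tree."""
    if not node.impacted:
        return None
    for child in node.children:
        deeper = probable_root_cause(child)
        if deeper is not None:
            return deeper
    return node  # no impacted children below -> this is the lowest impacted node

# Example: a database server out of disk space causes chaos higher up the tree.
db = Node("db-server-01", impacted=True, events=["disk space critical"])
app = Node("payroll-app", impacted=True, children=[db])
service = Node("payroll-service", impacted=True, children=[app])

cause = probable_root_cause(service)
print(cause.name, cause.events)  # -> db-server-01 ['disk space critical']
```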

We're just now gaining momentum on the adoption of this product. With the example of a database out of disk space, because we can get to the root cause quicker, we know what the root cause is. It can be remediated faster, and we can also reduce the number of people who have to be on outage calls. There is no need to have network people on a call if it's a database issue. We let them deal with other things, so our operation becomes more efficient. The database people know exactly what the problem is, and quickly.

What is most valuable?

The most valuable feature is the event management piece of it. We have it integrated with a number of our different products. Thus, we can create events into a single Event Manager, which will create a Remedy ticket for us. This is a huge feature for us.

We have 26 different monitoring tools. The way this product works, it allows us to define a custom event class. We can take all of our monitoring tools and say, "If you can put an event into this specific format, then we have a way of creating a common event across all of our monitoring tools." By doing that, we have a single back-end process that acts on all of the events. So, we only do a data transformation upfront when we are receiving events. This simplifies our back-end.
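
For illustration only, here is a minimal sketch of the kind of upfront transformation described above: each tool gets a small adapter that maps its raw events into one common format, and a single back-end process acts on the result. The field names and adapters are hypothetical, not BMC's actual event classes.

```python
# Hypothetical sketch: map events from different monitoring tools into one
# common event format so a single back-end process can act on all of them.

COMMON_FIELDS = ["source_tool", "host", "severity", "message", "timestamp"]

def from_tool_a(raw: dict) -> dict:
    """Adapter for a tool that reports 'hostname', 'level', 'text', 'time'."""
    return {
        "source_tool": "tool_a",
        "host": raw["hostname"],
        "severity": raw["level"].upper(),
        "message": raw["text"],
        "timestamp": raw["time"],
    }

def from_tool_b(raw: dict) -> dict:
    """Adapter for a tool that reports 'node', 'sev', 'summary', 'ts'."""
    return {
        "source_tool": "tool_b",
        "host": raw["node"],
        "severity": {1: "CRITICAL", 2: "MAJOR", 3: "MINOR"}.get(raw["sev"], "INFO"),
        "message": raw["summary"],
        "timestamp": raw["ts"],
    }

def process(event: dict) -> None:
    """Single back-end action, e.g. open a ticket and page the right team."""
    assert all(k in event for k in COMMON_FIELDS)
    print(f"[{event['severity']}] {event['host']}: {event['message']}")

# Each tool only needs an adapter; the back-end process stays the same.
process(from_tool_a({"hostname": "db01", "level": "critical",
                     "text": "out of disk space", "time": "2019-06-01T10:00:00Z"}))
process(from_tool_b({"node": "web07", "sev": 2,
                     "summary": "high response time", "ts": "2019-06-01T10:01:00Z"}))
```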

The solution has helped to reveal underlying infrastructure issues affecting app performance. We constantly have network issues. The network team had been capturing them, but they weren't integrated into any impact model. By integrating them into an impact model, we can now catch them and see their impact on our applications.

What needs improvement?

It's a complex system. The implementation is fairly challenging. They have done a good job lately of getting videos out there. We would like more videos and self-training, though. Right now, you have to go to BMC's training classes to get a good understanding of the product, and those training classes are very expensive. While I understand they are a business and trying to make money, a lot of their competition has training available via YouTube. There is much more accessibility to competitors' training. 

In a large company of our size, we need multiple people in our company trained. So, I have to take the training classes. Then, I have to go and train the rest of my organization. I would prefer to say to the other people on my team, "Go to this link and..." Or, "Here's a list of training sessions that you can go to which are online and that are free." I think it would help the adoption of their product in the marketplace, personally.

It's a far more complex technology to deploy than I perceived at the beginning. I would have thought that the integration between their products would have been more seamless than it has been. This is what has made it a lot more complex than I anticipated.

From a technical standpoint, some of their products still have a dependency on Oracle databases, and they are very well integrated in the cloud for a lot of their components. There is another database technology, Postgres, which they are partially integrated with. However, if they were to get all of their platforms integrated with Postgres, it would be much less expensive for companies such as mine to go to high availability, etc. The architecture really needs to be upgraded. I know they're doing a lot of this, but they need to keep doing it and accelerate the process so they can remain competitive.

For how long have I used the solution?

We have been working with the product for the last year. We went live with the product in April.

What do I think about the stability of the solution?

Stability of the product is about a seven out of 10. As far as stability goes, it has mostly been very good. With some of the newer stuff on 11.3, we have had to call support a lot of times and get a patch sent to us because certain things just don't work. Those pieces have hurt stability, but once you get it running, it's very good.

What do I think about the scalability of the solution?

The overall scalability of its platform and the ability to support its website is pretty good. We have a couple people on our team who seem like they are pretty proficient at it. They can do things rather quickly. 

I don't use PATROL. It is pretty good if you use their native products, like PATROL for monitoring. We integrate other monitoring tools into TSOM, so we don't use PATROL. I am familiar with it though, and I have been trained on it. I feel like it's pretty labor-intensive to manage. For example, if I have a number of different classes of servers, there are a lot of screens that I have to fill out, deploy, and push out to my systems. There has to be a more efficient way to do this. My company is always pressuring us to be more scalable. It is not very scalable in the administration of its monitoring. It could be better.

For TrueSight Operations Manager, there are a limited number of people who use it, no more than 15 to 20 system administrators and support personnel, mostly in administrative functions. The reason there are so few users is that all the events are automated. Most of our support teams and users look at Remedy, and there are over 3,000 users looking at Remedy. So, a lot of the users of our overall system have no need to look at a TrueSight console. Their work is done through the way we have designed the system. They get a Remedy ticket and what's called a PagerDuty notification. When they get those two things, they know there's an issue, and all the information is contained within those two systems. They don't need to go to the TrueSight console.
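
As a rough illustration of that automated flow (event in, Remedy ticket created, PagerDuty notification sent), the sketch below uses placeholder functions; it is not the actual Remedy or PagerDuty API.

```python
# Hypothetical sketch of the automated flow: event in -> Remedy ticket + PagerDuty page.

def create_remedy_ticket(event: dict) -> str:
    # Placeholder: a real integration would call the Remedy API here.
    ticket_id = f"INC-{abs(hash(event['message'])) % 100000:05d}"
    print(f"Remedy ticket {ticket_id} created for {event['host']}")
    return ticket_id

def page_on_call(event: dict, ticket_id: str) -> None:
    # Placeholder: a real integration would call the PagerDuty Events API here.
    print(f"Paging {event['team']} on-call: {event['message']} ({ticket_id})")

def handle_event(event: dict) -> None:
    """No console watching: every event is turned into a ticket and a page."""
    ticket_id = create_remedy_ticket(event)
    page_on_call(event, ticket_id)

handle_event({"host": "db01", "team": "DBA", "message": "out of disk space"})
```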

How are customer service and technical support?

The technical support team is very good. I wish there were more of them. This includes the ones who I work with on the phone, as well as their field technical people. They are very good.

I don't know if their technical support differs from their project team, but we are constantly rotating people in and out of our project because they get different assignments within BMC. Thus, I wish there were more technical support people with more longevity on our account. We will have a CMDB person assigned to us on the project from BMC, but in just three weeks, we'll find out, "Oh, that person has been reassigned, and they have to go to another account, where they have to do something different." We are constantly having to retrain people coming in from BMC. So, there is no permanence with their people on our projects.

This issue of changing technical staff is not limited to BMC. However, their resource pool seems sort of small.

We are constantly facing issues with having to call support because things didn't work as we expected them to, and I don't know why that is. We use the BMC Atrium CMDB product (service impact model), and publishing service impact models seems to be challenging and problematic. We are constantly calling support, who gives us a bug fix, which fixes the problem. However, those bugs shouldn't have existed in the first place; if somebody already knows how to fix an issue, it shouldn't have happened at all.

Which solution did I use previously and why did I switch?

BMC is one of our longest-running partnerships. We have been using Remedy for many years. We have been using parts of this system since 1998. However, we have never put it all together in the way that we're doing now. We didn't replace anybody else. We had used their products before, but not to their full advantage.

How was the initial setup?

It's a complex system. We were dealing with a highly customized Remedy system which caused us a lot of issues. We had to wait for a Remedy upgrade to occur before we could deploy our systems. We were at this for about a year, and most of that time was waiting to get the Remedy implementation in place. Once the Remedy implementation and upgrade were completed, there were a lot of challenges with our CMDB data and the integration of the CMDB to a service model along with the publishing of a service model. 

We have Remedy, a service model, TrueSight Operations Manager, and TSIM. With a lot of technologies in play, making them all work together has been challenging, since each one of them is a fairly sophisticated technology. BMC could do something to make it easier.

It took about three months to deploy the core technology which solved our problem. We have been waiting a very long time on the Remedy upgrade, which was over a year. However, this was because our company had highly customized the prior Remedy version. Without that in the equation, the technology took us around three months to deploy. 

We are still enhancing it. That time frame was just to get it deployed. To make full use of it and get the full benefit, that will take well over a year. Both the technology and the organization using it need to mature.

Right now, four or five of our core products are monitored and feeding this environment. Because we've been successful at it, we anticipate integrating more of our products and the monitoring of those products into our system. We have already built the integrations for the different monitors. It is just getting the different teams to want to use this system. That's why it's an organizational maturity thing. We could take them on very quickly, but there has to be a willingness on their part to do so. Part of our strategy is to make them want to use this system. That's on the event side. 

On the service impact side, we're working with senior management. This Friday, we have a demo with the CIO with this technology, because he is the one who is putting the pressure on the different application teams to onboard with us.

We have a multiyear onboarding strategy, where we're onboarding more applications and integrating them into this particular environment. Today, they are being monitored by their own support teams, who are now beginning to see the success that we are having. The challenge that we are having organizationally is, when we onboard their applications, we expose the issues of their products through Remedy tickets and outages. A lot of times, these teams want to hide that. So, we have political issues, as well as technological hurdles to deal with.

What about the implementation team?

We did a lot of it ourselves, since we had the knowledge in-house, specifically on the event management side. We were an ADDM environment. So, we had bits and pieces of technology knowledge in our company, but in order to pull it all together, we used Wipro, as well as BMC in India to drive this. That's still the case. We're still using them to get this whole thing deployed in various pieces. 

Overall, our experience with BMC and Wipro has been positive. However, there have been challenges because the technical people have moved from our account to another account. We have a rotating team of people, which gets very challenging for continuity.

For deployment and maintenance of TrueSight, we need about four people. For the whole enterprise solution, we need 25 people for 24/7/365 support.

What was our ROI?

We have reduced headcount and shrunk the mean time to resolve. That's how we justify the expense of the product. It has really worked out.

It has helped us reduce IT Ops costs. We were able to replace the headcount of five out of six Level 1 technicians. We repurposed those people to higher level tasks. Without this solution, we would not have replaced that job function.

What's my experience with pricing, setup cost, and licensing?

We did a five-year, multimillion dollar deal.

We haven't licensed the solution's machine-learning and analytics to deploy artificial intelligence for IT ops.

Which other solutions did I evaluate?

We looked at some of their competitors, but because of the technology and the base of knowledge that we already had in place, it made sense to stick with BMC. We decided to focus on making their products integrate the way they were supposed to and were designed to. That's what we've been doing since we had the knowledge and license in-house.

We also evaluated ServiceNow and BigPanda.

On the pros side, BMC and ServiceNow were very similar products. My biggest concern with BMC was they seemed to be declining in market penetration versus ServiceNow, which has been expanding considerably over the last several years. That was my biggest concern with moving forward with BMC. The pros with BMC were that we already had the knowledge in-house and the technology was proven for us. We knew it was fairly solid, so we felt confident that we would be successful with it.

One of the other differences between the two companies is the marketing organization from ServiceNow was a lot more consistent than from BMC. We probably get more calls even today from the ServiceNow account rep than we do from our BMC team. They show up every once in a while, and they do a big dog and pony show, then they go away for a bit. So, I don't think their marketing is as strong as it should be, or we're not a big enough customer for them. However, with the amount of investment we have in their product, they should be around more often.

There is one piece of BMC technology that we decided not to use. That's their Atrium Orchestrator. We use a different third-party orchestrator called Ayehu. We just found the Atrium Orchestrator from BMC to be too complex.

What other advice do I have?

Make sure you have knowledgeable people on your staff. Give yourself plenty of time for deployment; if you think it will take three months, make it six months. Look at other companies' past experience on time to deploy, knowledge, and staffing requirements.

The solution's event management capabilities are very good. In some ways, they are based on very old technology. I first started using it way back in the late nineties, and the basic core of the product does not appear to have changed much since then. Back then, it was a very good product, so that's not necessarily a bad thing. The other thing the company has done since then is enhance the web portal, of which I have a very positive impression.

The website is fairly new, and it could be a little bit better. However, if I were to compare it to some of the other tools out there, it has a much nicer GUI and presentation. The web presentation is much more advanced than BMC's TSOM server.

We still have multiple panes of glass. E.g., we have an Event Manager screen along with a Remedy screen. We're getting closer to a single pane of glass and have fewer panes of glass than before. Where we had a lot of dashboards before, we now don't have anything, as we've replaced all of them. So, there are no panes of glass in our support. If you are a support person at our company, you are not looking at a screen. Instead, you are looking at your cell phone, because we reach out to you when there's a problem and you don't have to look at anything.

We are using about five percent of our environment. We have what is called a limited deployment right now, because we have so much integration and automation going on. We needed to mature the support teams and the rest of the organization as a whole in what we're doing. Once we have achieved that, I anticipate 100 percent of our applications are going to be feeding this system. After that, we will greatly extend our use.

Disclosure: IT Central Station contacted the reviewer to collect the review and to validate authenticity. The reviewer was referred by the vendor, but the review is not subject to editing or approval by the vendor.
Doug Greene
Sr. Director Operations at a comms service provider with 10,001+ employees
Real User
Enables us to triangulate, using multiple sets of data - including log, app, OS, network, and more - and find issues

Pros and Cons

  • "The solution's event management capabilities are fantastic. We do a best of breed. If, on the network side, they use a different tool, we pull all that data in so that we have a single console. It's kind of like the monitor of monitors. We're able to aggregate all the different types of data sets, whether it's log data, app data, OS data, infrastructure data, or network data. We're able to aggregate all those events and then correlate and be able to say we're having an event."
  • "Specifically around application performance monitoring, BMC is definitely not the market leader. The Dynatraces, the New Relics and the like are more of the market leaders in that space. I would like to see them grow that space a little bit more aggressively. It has not really been their bread and butter."

What is our primary use case?

We use it primarily for monitoring. My organization is an application support organization, and part of what we need to do is make sure that our infrastructure is running tip-top so that those applications can run, consequently, the same way. We use the tool to do both application monitoring as well as infrastructure monitoring, all the way down to storage services and things like that on the OS layer. We have a full breadth and are able to triangulate what types of issues we're experiencing before our end-users experience those issues.

It monitors our entire platform. Everything in production, every single app, is monitored through the tool. As new applications come into our ecosystem, we have a process. The project team sits down with us. We talk about what the product's capabilities are. Most of the PMs already know that because they've been here for a long time. We set it up, and we move on to the next app. We're expanding it as new tools or new functionality or new applications come into the ecosystem.

How has it helped my organization?

Because we've used it for so long, we've been measuring results for eons. The standard metric that we use, given to us by our CIO, is that 70 percent or more of our outages need to be alert-driven, not customer-driven. So, if a customer calls in and says, "Hey, I'm having an issue logging in to PeopleSoft," which is one of our applications, we should have already known that there was an issue and handled the alert prior to the customer calling in.

A decade ago, we were using Microsoft's and HP's product sets to monitor, but it was disparate. The alerts weren't aggregated and we never knew who they would go to. Therefore, we missed a lot of opportunities to be proactive in our organization. Hence the reason we moved to the product which, at that time, was called ProactiveNet - and then it became BPPM and TrueSight, as it is today. We were able to flip that situation and we have been able to meet that metric for five years running. We had one blip in the year prior to that, and in the years before that, we were knocking it out of the park. So our metric is whether we get the alert before someone has to call in, and we're successful in meeting that some 80 to 90 percent of the time.

In addition to that, when we look out across the industry, most organizations have anywhere from five to 15 people who are dedicated to monitoring. We have two. We're able to run the entire stack, along with its complementary adjacency tools, with two people. That was one of the many reasons that we made the migration from other products to ProactiveNet/BPPM/TSOM. At that time, we were a one-man band and really needed to be able to move quickly but also be able to maintain a product and not require tons of manpower to make the product work. The improvements that BMC has made over the last two to three years have really revamped and consolidated the console so that it is truly a single console that you can run with a single individual, should you need to.

We have 342 apps in our ecosystem and my team manages around 280 of those from a support-platform standpoint. And because we have two individuals who are dedicated to the monitoring, they partner with the rest of our admin organization to drive exactly how things need to be alerted. We review them quarterly. That is a testament to a really solid product - that it only takes one or two people to really run the thing and administrate it, versus having an entire staff and that's all they do.

The solution provides a single pane of glass where we can ingest data and events from many technologies. I am one of the few, at least according to BMC, who has screens up in my hallways, and I show our top 20 applications from a criticality standpoint - what's most important to our organization, things that I have to run. Everyone sees what's up on those boards every day. I go to it two or three times a day. Because we have that single pane of glass, we see where we're having issues organizationally and we're able to rally resources - whether it's engineering, operations, or our development group - and solve the problem and get those things from red/yellow back to green/blue. The single pane of glass was a key piece of what we needed to have to be successful as a monitoring organization.

In terms of the availability of our infrastructure, ours is not a hybrid environment, per se. We don't really measure and/or monitor - because of legalities with most of these SaaS providers - how well their systems perform. But what we do is measure any of the interfaces that touch or route to those applications, and we have an uptime measurement of about 99 percent for most of our apps. We have a dashboard for that which is managed out of the ITSM group. They partner with us and pull all of our monitoring data to figure out two key metrics: total uptime and uptime excluding maintenance. Those are the two keys which enable us not only to showcase to our customer base how well the systems are performing but how often they really are available.
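
As an aside, the two metrics mentioned can be computed roughly as follows. The exact definitions here (maintenance counted as downtime for total uptime, and removed from the denominator for the adjusted figure) are an assumption made for illustration:

```python
# Illustrative only: compute total uptime and uptime excluding maintenance
# for a reporting period, given outage minutes and planned maintenance minutes.

def uptime_metrics(period_minutes: int, outage_minutes: int,
                   maintenance_minutes: int) -> tuple:
    """Return (total uptime %, uptime % excluding planned maintenance)."""
    # Total uptime: planned maintenance counts against availability.
    total_uptime = 100.0 * (period_minutes - outage_minutes - maintenance_minutes) / period_minutes
    # Excluding maintenance: shrink the denominator by the planned windows.
    adjusted_period = period_minutes - maintenance_minutes
    uptime_excl_maintenance = 100.0 * (adjusted_period - outage_minutes) / adjusted_period
    return total_uptime, uptime_excl_maintenance

# Example: a 30-day month with 90 minutes of outages and 240 minutes of maintenance.
total, excl = uptime_metrics(30 * 24 * 60, outage_minutes=90, maintenance_minutes=240)
print(f"total uptime: {total:.2f}%, excluding maintenance: {excl:.2f}%")
```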

BMC has helped to reveal underlying infrastructure issues that affect app performance. Four years ago, PeopleSoft was running slow in regard to our payroll run. We run payrolls weekly. If you know anything about payroll, you've got to hit a certain deadline and be able to send the check file to the bank for those direct deposits to show up in people's bank accounts. It's a really sensitive issue when people don't get their checks. With the monitoring tools, we were able to triangulate that it was not an application issue but that it was actually a storage issue. Our solid-state storage was having a firmware issue which was causing slow turnover for the IO, and therefore it was slowing down the entire process of payroll. We were able to triangulate that that was the issue, decide what we needed to do - which was move the storage so that the application could continue to perform. We met the need and were able to get the payroll cut just in time so everyone could get their checks. It was a big win.

As for reducing IT ops costs, year over year, my operational expenses grow by three percent, which is mostly salary increase. I've gone from 12 resources to roughly 55 resources organizationally, while growing from 80 apps to 280 apps over the last eight years. Our operational costs have only gone up because of the use of licenses, not because of human capital. The tool has helped us work smart, not hard, and leverage the technology. We haven't necessarily needed to grow our operational expenses to accommodate the new functionality or the new applications which come into our ecosystem. We just set up the monitoring and it does its thing.

What is most valuable?

The solution's event management capabilities are fantastic. We do a best-of-breed. If, on the network side, they use a different tool, we pull all that data in so that we have a single console. It's kind of like the monitor of monitors. We're able to aggregate all the different types of data sets, whether it's log data, app data, OS data, infrastructure data, or network data. We're able to aggregate all those events and then correlate and be able to say we're having an event. Just because we have one or two alerts doesn't necessarily mean that we're having an event. It's when we get several of those that "trip the wire" that we're able to say, "Okay, we are having an event." And the tool allows us to aggregate all of that so that we're managing event-driven versus alert-driven.

The breadth of the solution's monitoring capabilities is also fantastic. A lot of IT organizations that I talk with use a conglomerate of tools to manage their monitoring and it ends up being pocketed. We don't have that problem because we are using it as the monitor of monitors and therefore we are able to take advantage of all of its bells and whistles. As well, we can feed in additional alert data, crunch that, and react appropriately and accordingly, proactively versus reactively. We'll get several low-level alerts saying, "Hey, this may be an issue," and we're able to proactively look at that before it becomes a critical outage. We use almost every aspect of the tool, with the exception of some of the automation because we haven't gotten there and found the need for it. But we're rapidly starting to take advantage of those pieces as well.

A use-case example would be if we have a drive filling up on a particular server for a particular application. If that's a known issue, we can actually orchestrate through the automation component of TSOM to be able to say, "Hey, when we see this type of alert, go try one of these three things and if that fixes the problem, go away. And if it doesn't, go ahead and escalate that as a ticket and we'll have a human go touch that server and remediate the issue." So we're right on the cusp of beginning that journey.
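
Here is a minimal sketch of that try-known-fixes-then-escalate pattern; the fix commands and the ticketing call are placeholders, and this is not TSOM's actual orchestration interface:

```python
# Hypothetical sketch of the "try a few known fixes, then escalate" pattern.
import shutil
import subprocess

KNOWN_FIXES = [
    ["logrotate", "--force", "/etc/logrotate.conf"],  # rotate and compress logs
    ["find", "/tmp", "-mtime", "+7", "-delete"],      # purge week-old temp files
    ["journalctl", "--vacuum-size=500M"],             # trim the system journal
]

def disk_usage_percent(path: str = "/") -> float:
    total, used, _free = shutil.disk_usage(path)
    return 100.0 * used / total

def escalate_ticket(summary: str) -> None:
    # Placeholder: a real setup would call the ticketing system's API here.
    print(f"TICKET OPENED: {summary}")

def remediate_full_disk(threshold: float = 90.0) -> None:
    for fix in KNOWN_FIXES:
        if disk_usage_percent() < threshold:
            return  # problem cleared, nothing to escalate
        try:
            subprocess.run(fix, check=False)
        except OSError:
            continue  # this fix isn't available on the host, try the next one
    if disk_usage_percent() >= threshold:
        escalate_ticket("Disk still above threshold after automated remediation")

remediate_full_disk()
```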

In addition, the entire root-cause analysis functionality within the tool is quite useful. It really comes down to how admins want to leverage it. There are what I call "old-school admins" who want to get on the box and solve it themselves. Then you have the "new-school admins" who go straight to the monitoring tools. It clearly shows you root cause analysis: This is the probable cause, and then they're able to go remediate it more quickly. We use that extensively within the operations team and the products team, which is the team that I own. I don't think the engineering team is quite there yet, but they're beginning to see the value of wanting to see that data and start using the tool themselves.

Regarding mean time to remediation, when I took over this organization, I and the rest of the group were working about 100 hours a week, just trying to keep our major systems running. It wasn't until eight months later, when we actually implemented a more mature monitoring system, that we turned the corner and people were working 60 hours. And now it's somewhere between 40 and 50 hours a week, which is much more maintainable and realistic in the industry. We were doing everything we could to keep those systems running, and we had no idea what would be in the next box of chocolates that we would open up, back when we first started this. There's a direct correlation with TSOM and the BMC product sets that have helped us be successful in working smart and not hard, like we did back in the day.

What needs improvement?

Specifically around application performance monitoring, BMC is definitely not the market leader. The Dynatraces, the New Relics and the like are more of the market leaders in that space. I would like to see them grow that space a little bit more aggressively. It has not really been their bread and butter. 

They've been highly focused on cloud initiative. I don't know anyone in the industry who has solved how to monitor cloud, SaaS-based systems, because all of those systems are usually linked through other systems. That would be another area where it would be nice to see if they could find innovative ways to be able to do that.

The third piece would be around out-of-the-box automation. We all have particular types of alerts and events where all we really need to do is be able to turn the functionality on versus creating the functionality. BMC is already addressing that in many cases.

For how long have I used the solution?

We've used it in probably three incarnations of what it is today, so it's been about ten years.

What do I think about the stability of the solution?

We don't have any issues. We're in an HA format so if we do have any issues, things fail over quickly and we don't miss a beat. It's the heartbeat of our products, the fact that we provide monitoring services to our businesses, so monitoring can't be down. It can't have a bad day. TrueSight Operations is a highly stable product. It is a beast. It runs really well. There isn't a lot of care and feeding that we have to do to make sure that it stays healthy.

What do I think about the scalability of the solution?

It's highly scalable. We continue to add more servers and more applications within the ecosystem easily and quickly. We continue to review all of those quarterly to make sure that the way that we've tuned the monitoring is still accurate and that it's meeting the needs of both the admins and the business.

How are customer service and technical support?

We have a great relationship with BMC. We're probably different than the average bear. We've got a great account team. When we call customer support, we get answers pretty quickly. We don't have to call them very often, which is a good thing for any vendor. You don't want to have to call support a lot. But when we do, it's usually because we can't figure it out and we're able to get the answers pretty quickly through their organization.

Which solution did I use previously and why did I switch?

We used HP and then we used Microsoft System Center Operations Manager (SCOM).

How was the initial setup?

Back in the day, the initial setup was very complex. As it stands today, upgrades are really very easy. It's basically just a matter of refreshing old hardware, turning the system on, and making sure that it picks up all of the agents. Setting up today is infinitely more simple than it was even three or five years ago. 

BMC is innovating even further and working towards containerization so that we won't have to do upgrades anymore. We'll just overlay. They've really taken into account how to consolidate consoles so that there aren't so many bits and pieces. That has made it easier for them to do upgrades. Installing the system or deploying the system only takes a couple of weeks in an organization of our size, where it used to, when we originally did it, take four months.

The latest one that we did, we had all the technical bits and pieces done within four weeks. Then we slowly rolled it out as we sunsetted particular agent groups. The total roundtrip was six months to have it fully deployed and embedded and working in the system.

At this point, we do an upgrade every three years, and every five to six years we're upgrading our hardware. This year we actually went fully virtual. Our engineering organization still takes a good bit of time to build servers. We were able to get virtual machines within weeks of the initial setup of the product, and we were able to roll to virtual machines, versus physical machines, relatively simply. It was basically a point-and-shoot install. We pulled over all of our policies and procedures that were already canned - and that was another thing that was more of a challenge in years past because we would have to redo them. This time, all that got pulled in and we were up and running within weeks.

What about the implementation team?

We partnered with BMC this time. Typically, we use a third party, but in talking with BMC about where we were at - as we use them primarily for consultative services - we said, "Hey, what's the best way to go ahead and do the upgrade and the migration?" They gave us the cut plan and then we actually did the physical work ourselves, which saved us some $200,000 in project fees.

With two guys running the system day-to-day, and consultative services from BMC to tell us, "Okay, this is how you do it," we were able to execute both the upgrading project, as well as administrating the product, while still running on the old system. It says a lot about the product's ease of use and capabilities.

Now, my guys are really smart and I'll give them all the credit. They're smarter than the average bears. But the reality is that it's rare to find a product where the people who are running it can be doing a major upgrade at the same time.

What was our ROI?

The very fact that we've been on it for ten years is a testament. We continue to make the investment. We continue to pay the renewal because the return has been fantastic. I don't have any specific data points other than the fact that we've been on the product for ten years. There's a reason for that.

What's my experience with pricing, setup cost, and licensing?

There are no costs in addition to the standard licensing fees. It's a straightforward contract.

Which other solutions did I evaluate?

Every three years, we reevaluate the space. That's just part of the culture that we've established. No one tool stays forever at the top, but BMC's monitoring capabilities and their discovery asset tools are top-of-stack, typically, in any of the research that we do. We continue to use them and we continue to have a great relationship with BMC.

What other advice do I have?

Keep it simple. Make sure that you understand, architecturally, how your applications and your data center are set up. It makes your life easier to know exactly what you're going to need to monitor.

The biggest lesson I have learned from using this solution is to really take full advantage. I joke with the BMC guys that TSOM is like AutoCAD, the engineering tool that people use to design and draw. We only scratch the surface of its full capabilities. The thing that I've learned is that it's a good idea to take advantage of all the bells and whistles as quickly as you can because it really pays dividends to do so.

We are using a little bit of the solution's machine-learning and analytics. That's an adjacency tool called IT Data Analytics and we feed that into our overall, single pane of glass monitoring. I don't know that we've taken full advantage of that quite yet. It is on the roadmap. We'll probably get to that, realistically, next year and in '21, where, as we're seeing those analytics, we will actually link automation to it. So when we see something we'll actually do something. We're a fairly small shop and therefore scale is not an absolutely necessary thing, but it is something that we are striving to move towards. It has affected our application performance in bits and pieces. It's not something that I'd wave the banner on quite yet. We have pocketed instances where ITDA has come back and told us that there was an issue, and we were able to remediate proactively versus reactively. I don't know that we're leveraging the tool's full capabilities where I can say that I have a use case where this was a big win for us.

I don't think that the monitoring tool, TSOM itself, has created or helped to support any business innovation.

As for users of the solution, I have the two admins and then I have, say, half of my organization that consumes it as a tool, so there are about 12 to 15 users. Each of those people is an application admin. Their primary responsibility is the applications that they support. The monitoring is a tool for them to use to ensure that those systems are healthy and top-notch.

I have a senior manager who manages the space. He also manages our asset-discovery tools along with all of our web and third-party space. He is a busy guy but it's all managed under one leader. There are the two folks who administrate it. It's really a very small human-capital resource footprint, in comparison to what it does technologically.

I give TrueSight Operations a nine out of ten. There are always bits and features from other products that we wish we would see in it. Usually, we see them pretty quickly.

Disclosure: IT Central Station contacted the reviewer to collect the review and to validate authenticity. The reviewer was referred by the vendor, but the review is not subject to editing or approval by the vendor.
ME
Service Delivery Manager at a financial services firm with 1,001-5,000 employees
Real User
Knowledge Modules are what make the implementation possible across our varied infrastructure, but RBAC controls need some work

Pros and Cons

  • "From an administrative standpoint, what stands out in TrueSight is the ability to implement quickly. When they have a requirement to monitor something, we're able to turn that on quickly in their environment. We're able to set up new apps within a day."
  • "We were somewhat limited in TrueSight due to some of the RBAC controls not quite being what we wanted as far as delegating out administrative privileges for implementation. But because we were able to turn requests around pretty well, that burden wasn't too heavy."

What is our primary use case?

We use it for business service and infrastructure monitoring. We use the full gamut of utilities from them and monitoring in the platform.

How has it helped my organization?

We don't use APM. We used to. We line-item nixed that for various reasons a few years ago. We also don't use ITDA, their next-gen log monitoring tool. So we're truly just within the TSOM interface, as well as doing synthetics. That being said, the Knowledge Modules that BMC brings to the market are what make the implementation possible across our varied infrastructure and applications. It's critical to have those Knowledge Modules. If we had to write things ourselves, or use a more generic monitoring environment and then build additional scripts on top of that to monitor the Kubernetes of the world, or the WebLogics of the world, or the Oracles and SQLs of the world - if we had to write scripts ourselves to bring back particular monitoring components and performance metrics and so on - that would be a heavy burden that would keep us from implementing. We don't often run into something that we haven't been able to monitor. It's just a matter of getting people to the table to tell us what they need.

When it comes to incident management, we get most of our data from TrueSight rather than from log data, because we don't use the ITDA interface. It would be an effective interface, but for logging we go to our SIEMs, since we're already pumping data to another system there. But TrueSight definitely gives us a view into the health of our business services, which is our primary goal for implementing monitoring.

We try very hard not to use event management. What I mean by that is that we do not have a typical NOC. We don't have ten people staring at screens and then escalating as necessary. Along those same lines, we don't spam our incident management environment with events from TrueSight. With a lot of customers I've met over the years, that's essentially the old school way of doing things. Instead, we create events that are truly actionable. If we don't have an actionable event, we don't create it. We use their baseline technology to ensure that we're only sending items that are either about to have a problem or have passed the threshold of having a problem. If you're talking about typical event management, where you create an event and it gets forwarded to some other system, there's a notification about it somewhere else - the whole ITSM cycle - we don't use it for that. We use it for creating smart events that create alerts directly to the teams responsible. As I described before, we have many distributed teams rather than a centralized NOC.
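
To illustrate sending only actionable events, the sketch below suppresses samples that fall inside a learned band of normal behavior. The simple mean/standard-deviation baseline is an assumption for illustration; TrueSight's own baselining is more sophisticated:

```python
# Rough sketch: suppress alerts that sit inside the learned "normal" band and
# only emit events that fall outside it (i.e., actionable events).
from statistics import mean, stdev

def baseline(history, k: float = 3.0):
    """Learn a simple band of normal behavior from historical samples."""
    m, s = mean(history), stdev(history)
    return m - k * s, m + k * s

def actionable_events(history, samples):
    low, high = baseline(history)
    return [x for x in samples if x < low or x > high]

# Example: response times (ms); only the spike outside the band becomes an event.
history = [120, 130, 125, 118, 122, 128, 131, 119, 127, 124]
print(actionable_events(history, samples=[126, 133, 480, 121]))  # -> [480]
```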

In terms of TrueSight helping to maintain the availability of our infrastructure, it's an interesting question because of our distributed systems. We have 8,000 hosts across about 40 different teams, and we have 600 different applications that we run. For those critical tier-one apps, teams are highly involved in their day-to-day operations and watching them very closely. Having those two things - the actionable alerts and the ability to see what the health of their system is at any given time, and to be able to check it against what normal looks like for those applications - gives the teams that use it in such a manner the information they need to be confident that their availability is as it needs to be, or better. As far as a hybrid environment goes, we have our own hosting environment because we are the cloud to our clients. So we're not necessarily in that situation. We don't use assets other than what's in our hosting environment.

In the past, one of our biggest problems was plain old infrastructure incidents: basic availability incidents where a server, an application, an interface, or an endpoint may not have been available and no one noticed it until some downstream business end-result brought it to our attention. We've essentially eliminated 90 percent or more of those. It has been at least three years since we've done any numbers, but at the time, we might have had ten to 15 Sev-One incidents a month. When we last measured it, we were down to one. That was within a couple of years of implementing an enterprise monitoring strategy.

As for root cause, when a team is engaged in monitoring to its full extent, we're usually able to get to root cause pretty darn quick. For example, if a team has many servers that could potentially be impacting an application or a business service, tracking something down across those multiple servers and multiple owners could be really tedious and time-consuming. It would be on the order of hours, or at least many minutes, depending on the scope of the issue. With well-implemented monitoring, for our Sev-One apps, they're able to get to the solution almost immediately. If we have monitoring set up properly, the actionable event will tell them precisely where a critical component has failed and they can resolve it. Where it's a different type of incident that we might not have a particular monitor for, they're able to use the performance data, availability data, and other related alerts to get to their issue much faster than they used to. Having a good monitoring implementation has made a world of difference to our operations teams. It's so much so, that if you think back five years, which is an eternity in the IT world, when there was a Sev-One incident back then, someone would walk around tapping people on the shoulder all over the floor. That was very time-consuming. But now they're able to collaborate quickly and say, "It looks like this is the problem right here," in a well-monitored environment, and get right to the root cause.

It's helped our mean time to remediation, and I'm being conservative here, by about 70 to 80 percent. That's an absolutely huge impact.

What is most valuable?

We have many operational teams, and for any given team their requirements are different. One team is more reliant on infrastructure monitoring, because they are processing-heavy. Another team might be more reliant on endpoint monitoring where we're ensuring that the third-party endpoints they rely on are up and available. Another team may have fairly immature applications, so that they would rely heavily on log monitoring to catch all the errors that may come up. From a consumer-function standpoint, there isn't any feature that stands out. They're all important because all of our consumers are important. 

From an administrative standpoint, what stands out in TrueSight is the ability to implement quickly. When they have a requirement to monitor something, we're able to turn that on quickly in their environment. We're able to set up new apps within a day. Most of the work in monitoring is working with the teams, evangelizing, educating, and making sure that they're bringing their smart requests to the table so that they get visibility into their business service. If the implementation wasn't as easy as it is, it would hinder and probably decrease the adoption of monitoring. But because we can turn requests around pretty quickly and adjust things as teams need adjustment for their different release schedules, administratively, we're able to respond and keep pace with the business and the technology that they're implementing. That is a critical function for us.

For how long have I used the solution?

We've been using TrueSight Operations Management for almost six years.

What do I think about the stability of the solution?

Stability is one of the areas where we have identified challenges with TrueSight, but those are details I'm not at liberty to share at this point.

What do I think about the scalability of the solution?

We've been able to implement all the hosts that we care to implement on a couple of servers, with minimal maintenance. We don't use their high-availability solution. We don't really require it because the underlying infrastructure is relatively robust. We haven't had any problems with the scalability. Had we been a couple of times larger, there would've been more to implement server-wise. 

The other thing about our implementation is that we send a lot more performance data to our implementation of TrueSight than the typical BMC environment might. We send everything server-side for analysis rather than keeping everything agent-side or emphasizing agent-side, as I've seen a lot of other clients do. I think the tide is turning. I think more people are doing what we're doing where we just push all the data for potential analysis. But we've been able to accomplish what we need without too much infrastructure.

How are customer service and technical support?

They had an advisory board. We, as a group, and even I specifically, had been asked by them what they needed to continue doing. One of those was continuing to build out Knowledge Modules in various technologies. Some of the ones BMC has made available, we've implemented, and some of the ones BMC has made available don't impact us and we haven't implemented. But I've been in discussions where they say, "What do we need to do," and Knowledge Modules is one of those areas where they've made a commitment to continue adding to them, and we appreciate that.

Which solution did I use previously and why did I switch?

When we first started, we did not have a monitoring program at anything resembling an enterprise-type level. We were at about 4,000 hosts and we were really not monitoring anything except for a few services. At that, it was bare-bones monitoring. We monitored, maybe, half of our environment at bare-bones.

We went on this journey six-plus years ago to have an enterprise monitoring solution that focuses on business services. One of the reasons we did that is the number of incidents that we had that really should never have happened. Now that we're a number of years in, and we've implemented monitoring and brought teams around in the direction of business service rather than just an executable's use of a CPU, we have far fewer incidents.

As a general trend, we're much more capable of seeing what's out there and monitoring what our issues are and taking care of it before the business incident occurs. I don't have any particularly recent examples where our monitoring was able to resolve an incident after it happened. Of course, I don't get notified when people say, "Oh, look, I resolved this," because it's part of their daily operations to find an issue and resolve it. So it's not necessarily a newsflash anymore for us.

It doesn't happen quite as frequently as it used to, but they continue to build Knowledge Modules every time there are new products on the market. They need to create Knowledge Modules for the implementation to be enhanced. That's one of the key features of Operations Management. That's definitely something that helps us take advantage of everything BMC has. They're not resting on their laurels. They're building things out.

How was the initial setup?

The complexity of our environment demanded the complexity of the implementation. More than half of the effort that we had in implementing monitoring was based on the way we did our program. We were basically starting at zero and bringing teams up to speed, evangelizing, educating, getting people onboard.

The implementation of TrueSight itself was just a software implementation. It had its bumps and bruises. None of us were versed in BMC software. There were some learning curves as would typically be expected for any application of this scope, magnitude, and impact.

We had an overall strategy of doing proofs of concept for various, widespread technologies. We took that success and did a wide-to-narrow type of advertisement. We told everybody what was going on and then we brought more specific people into the room and said, "These are good targets for you to implement." During and after that evangelizing and advertising, we started implementing tier-one applications as an onboarding effort. We did that in a deep-dive fashion where we would sit down and interview these teams and really come to understand what makes their business service tick. A lot of our evangelization effort was actually in changing the focus of operations teams to think from a business service perspective. That paid off in dividends later when people were more interested in monitoring the actual functions of their applications rather than just the infrastructure of their application. We've been able to change mindsets over the course of a number of years. The first two or three years we were doing implementations. That was when we did most of that work.

From there, we worked as much as possible to allow folks to implement their own where possible, rather than centralizing it, so that people could keep up with their own demands. We were somewhat limited in TrueSight due to some of the RBAC controls not quite being what we wanted as far as delegating out administrative privileges for implementation. But because we were able to turn requests around pretty well, that burden wasn't too heavy.

From tier-one apps, we kept going and kept educating, bringing people to the table. When new applications come to our company, we still reach out and educate new teams, bring them to the table and use the onboarding process we built and solidified over the course of the first couple of years.

During the first three years, we had two-and-a-half FTEs for implementation. That was for the full program, not just the TrueSight component. It included all those interviewees, all those educational components, all the training, etc. The full program. The actual pressing of the buttons was about half of that. Once you stand it up and start connecting things, it's a matter of administratively using the tool to execute.

What about the implementation team?

Typically, our company builds knowledge for implementing infrastructure/operations activities like this from the ground up. We did not use a third-party. BMC was instrumental in our success in that they made resources available to us, implementation-wise as well as development- and support-wise.

What was our ROI?

The solution hasn't helped reduce costs in a measurable fashion; that's not a measurement we would undertake. There might be soft-cost benefits, such as:

  • impact on the quality of life for operations folks
  • our ability to show our clients that the services we provide to them are healthy
  • giving the business teams, our relationship teams, the ability to speak intelligently, rather than just colloquially, about how our systems are running.

Life at our company as an operations person is nicer now because you have confidence that what you're doing makes a difference, that the business service that you're working on is healthy. The business is happier when we're able to talk to them intelligently and say, "I can actually show you that we've been up and successful." 

It has helped in our ability to work on smarter things rather than silly incidents. If we eliminate incidents, then we're doing better work. We're able to do the good work of business rather than the sad work of recovery. That's not only quality of life but it's also the ability to get things done. So I know that, at some level, we're doing more with less because of our monitoring. But we don't have any hard numbers from a monitoring perspective.

What's my experience with pricing, setup cost, and licensing?

We're end-of-lifing it now. Overall, the licensing costs of BMC are a challenge for us in that they're hard costs, whereas open-source monitoring has soft costs that are harder to line-item; the implementation cost gets buried in other things. So that change of direction is taking place. It doesn't mean the cost isn't there; it's just soft dollars rather than hard dollars.

Which other solutions did I evaluate?

We looked at Microsoft SCCM. And, because we had a partnership with CA, we looked at their tools. There were a couple of other minor players we looked at which just didn't have the scope of what we needed to do, because of the breadth of technologies that we use. In the bakeoff, we came down to BMC and Microsoft.

It was a long time ago, so I don't know that it's fair to judge at this point, but from a monitoring perspective, the whole Microsoft suite really wasn't there. There was a lot of scripting. It was easy to identify that the administrative burden was going to be high in that implementation. Conversely, with the BMC stuff, out-of-the-box, administratively, you click and implement. That is one of our components of success, our ability to implement quickly. 

On the soft side, BMC as a partner was much more interested in our success than the Microsoft folks were at the time. It's very hard to quantify unless you're there sitting in front of them at the table and working with them, consuming their knowledge. It really is a great partnership.

What other advice do I have?

BMC is at a critical point in redefining TSOM, how it's built. Anybody looking at BMC now needs to jump on the new version of TSOM and skip the current versions. I would wait until their new environment is ready. It will be containerized. Anyone implementing BMC can get used to the environment in a PoC but they shouldn't implement until their new stuff is out. I expect it to be that much different.

Make sure that you have stakeholder buy-in and that they are able to provide the resources with the correct knowledge to implement in a smart fashion. Everybody's definition of "smart" is going to be slightly different. We really hone in on the business service side to make sure that our business functions are healthy and that we're able to understand what's normal and what is out of normal. We work with the teams, even from the point that they're in development of projects, to make sure we're ahead of what's going on rather than reactive. But that means the buy-in of multiple teams: development, operations, support. That amount of effort requires stakeholders with decision-making capabilities to say that it's a priority for them.

We knew up front - and we've been able to validate our assumption - that monitoring doesn't do any good unless you are analyzing your business service for the critical components to observe. That's an educational effort and an implementation project. It's that upfront effort that will make your monitoring successful. Where we've been able to engage teams and teams have remained engaged, we've been the most successful. We took that to heart upfront, we made that part of our route to success, and we put the effort in. Our monitoring has been successful because of that. If we hadn't done that, and hadn't constantly engaged teams to make sure they were aware of the capabilities, including the ability to give us feedback and have us implement quickly, we wouldn't be here. We wouldn't have advanced as far as we have. Most of that advancement was in the first two or three years, and we've just been riding that wave of success since then.

Keep in mind that most companies don't go from nothing to an enterprise monitoring solution; they go from one monitoring solution to another. But if there's anyone in the boat that we were in, where they are the size we were with no monitoring solution, they'll be in the pain that we were in. Implementing a good monitoring program, not just the tool, but a program around it, can make a world of difference to the operations teams, and subsequently to the business as well.

Teams that are utilizing TrueSight don't rely on other monitoring environments. Some of those teams rely almost exclusively on the actionable alerts and don't really use TrueSight's single pane of glass. We do have some teams that consume TrueSight and use it on a daily basis to ensure that they don't have any events, whether or not they've risen to the level of action. They'll also proactively look at some components, either business function components or infrastructure components, to ensure that they're working as designed and within the parameters of normal.

I don't think the functionality of Operations Management helps to support our business innovation. Our business runs forward and headlong into innovation, regardless of whether or not IT can keep up. We were never an impediment, other than cost. The way we run our overall IT environment is very open and flexible. Monitoring is a way for us to give business the confidence that what we're implementing is healthy, but it doesn't impact their interest in being able to implement what's new. They've always been able to do that and continue to be able to do that.

In terms of machine learning, I mentioned the baselining above, which, depending on how it's implemented, might be called machine learning, but in TrueSight it's just a straight calculation-type activity. We have other monitoring solutions that we're implementing as well, and that topic may be more applicable to them, but not in the TrueSight world. The TrueSight world is a straight application implementation. It's nothing exciting on that end.
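
For readers who want a feel for what a calculation-based baseline looks like, here is a minimal sketch in Python. It is only an illustration of the general idea of learning a normal band and flagging readings outside it; it is not BMC's actual baselining algorithm, and the sample values are made up.

```python
from statistics import mean, stdev

def build_baseline(samples):
    """Learn a simple baseline (mean and standard deviation) from historical samples."""
    return mean(samples), stdev(samples)

def is_abnormal(value, baseline, n_sigma=3):
    """Flag a reading that falls outside the expected band around the baseline."""
    mu, sigma = baseline
    return abs(value - mu) > n_sigma * sigma

# Hypothetical hourly CPU utilization history for one server
history = [22, 25, 24, 27, 23, 26, 25, 24, 28, 26]
baseline = build_baseline(history)
print(is_abnormal(29, baseline))  # False: within the learned band
print(is_abnormal(95, baseline))  # True: far outside the band, worth an event
```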

I have to give our BMC partners a lot of credit for where they're planning to take TrueSight based on their roadmap, although it is speculative. I don't think the areas for improvement from us would be any different than anything they've already heard.

If someone were to implement the full suite of BMC products, I would have to give it a nine out of ten. TSOM by itself, I have to give a seven out of ten.

Disclosure: IT Central Station contacted the reviewer to collect the review and to validate authenticity. The reviewer was referred by the vendor, but the review is not subject to editing or approval by the vendor.
SF
IT Manager at a manufacturing company with 1,001-5,000 employees
Real User
Single pane of glass has resulted in dramatic improvements; it is bringing people together

Pros and Cons

  • "We're using native monitoring capabilities for all our server hardware, for visibility for applications, for URLs, for webpage response and accuracy, and for monitoring network throughput in a lot of particular instances. We're using lightweight protocols for pinging, for DNS, for LDAP."
  • "The one piece that I would love to see is a general-purpose, configurable agent which would be a framework that you can deploy on anything, whether it be Java or anything else. It would allow you to easily deploy it on a platform that they support."

What is our primary use case?

We stood up an event management group and our responsibility is to monitor the entire company, globally: systems, applications, and infrastructure. We're modeling those out as services. We've got about 800 services that we're modeling out from the CMDB right now and monitoring pretty much everything.

We are big users of the service models. We use CA's SDM system, which we're evaluating. But in the meantime, we wrote the interface between TrueSight and CA to cut tickets and also to, in reverse, give ticket statuses in TrueSight. We're also going through a process of onboarding our services for event management where we go through a checklist of about eight different items and bring them on as a service with SLAs. Some individuals on our Service Desk - and eventually all will be - are dedicated to doing 24/7, 365 monitoring of the services, the events, and the applications.
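
As a rough idea of what such a two-way bridge can look like, the sketch below relays a monitoring event into a ticketing system's REST API and reads the ticket status back. The URL, field names, and credentials are hypothetical placeholders; they are not the actual TrueSight or CA SDM interfaces the reviewer built.

```python
import requests

TICKETING_URL = "https://ticketing.example.com/api/incidents"  # hypothetical endpoint
AUTH = ("svc_monitoring", "change-me")                          # placeholder credentials

def cut_ticket(event):
    """Create an incident from a monitoring event and return the new ticket ID."""
    payload = {
        "summary": event["message"],
        "ci": event["host"],
        "severity": event["severity"],
        "source": "TrueSight",
    }
    resp = requests.post(TICKETING_URL, json=payload, auth=AUTH, timeout=10)
    resp.raise_for_status()
    return resp.json()["id"]

def fetch_ticket_status(ticket_id):
    """Read the ticket status back so it can be shown next to the originating event."""
    resp = requests.get(f"{TICKETING_URL}/{ticket_id}", auth=AUTH, timeout=10)
    resp.raise_for_status()
    return resp.json()["status"]
```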

One of the primary things we're doing is using this as a vehicle, within our "One-IT" initiative - which includes event management - to truly bring people together from a cultural and technological perspective. The goal is that everybody will have the same place to see what's going on. No longer will they have to worry about their application. Is it the databases? Is it the network? And how long do they have to spend trying to figure it out? Culturally, the Service Desk is coordinating some of those impacts when they happen, so that the right people are on the call, based on what the service model says. All in all, it's a very flexible tool, which means it's complex but very powerful.

We're using Operations Management, Capacity Optimization, some App Visibility with some of the Synthetic scripting and we're just starting to deploy some Java agents on some app servers.

How has it helped my organization?

With the service modeling, once we managed to build our import process to get our CMDB impact models and services into TrueSight, that was a big win. Because once we integrate it with SolarWinds, they will actually be able to see when there's a problem with the plant, and they will know if it is a network problem or a server problem. With the service models, they can actually get right down to the impact of any issue. We're working on some other things to make that easier, like event correlation. So if a network goes out at the plant, they don't need to know that there are problems connecting to 60 servers; rather, they've got a problem with the router.
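
Conceptually, the event correlation described here suppresses downstream alerts when a shared upstream dependency is already known to be down. A minimal, hypothetical sketch of that idea (this is not TrueSight's rule syntax):

```python
# Map each monitored device to the upstream device it depends on (hypothetical topology).
DEPENDS_ON = {
    "server-01": "plant-router",
    "server-02": "plant-router",
    "server-03": "plant-router",
}

def correlate(events):
    """Keep only the root-cause event when its dependents fail at the same time."""
    down = {e["device"] for e in events if e["state"] == "down"}
    kept = []
    for e in events:
        upstream = DEPENDS_ON.get(e["device"])
        if upstream in down:
            continue  # suppressed: the upstream failure already explains this event
        kept.append(e)
    return kept

events = [
    {"device": "plant-router", "state": "down"},
    {"device": "server-01", "state": "down"},
    {"device": "server-02", "state": "down"},
]
print(correlate(events))  # only the plant-router event survives
```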

We're currently looking at either consolidating the other monitoring tools that we have around the organization or connecting them for the single-pane-of-glass goodness. We're bringing in data from SolarWinds, we're bringing in data from Oracle's OEM, and we're integrated with an application that monitors desktops. It generates an event and a ticket is cut to the regional support people. They will go to the desktop and say, "Your disk is in danger of imminent failure. We need to go ahead and clone that guy and replace it before you're down." So we're definitely going with a single pane of glass. In terms of our IT ops management, that means it's getting better. We're trying to be more proactive instead of reactive. We've only been heavily into this for nine or ten months, so the actual, long-term impacts aren't measurable yet. We're still baselining where we are.

The single pane of glass is a big improvement.

There is also the ability to do predictive and corrective, especially for some services which we're monitoring out in the field which are critical to various plant components. It used to be that they would go down and the plant would call. Now we're detecting that they're down, we're restarting them, and we're letting somebody know there's an issue. That's also a big improvement in our manufacturing capabilities. Culturally, it is bringing people together with one place to look and giving them something to talk about when there's an issue. It's bringing IT together. The collaborative and predictive stuff is actually starting to improve.

We're not doing a tremendous amount of preventative stuff yet - unless you count when your disk is three percent from being full and you need to do something before it fills up. We're not using some of the more advanced features of the predictive analytics yet. We are starting to look at some data analytics though. We have a data analytics group which we stood up, a couple of people who are starting to use data analytics to do some things.

It's improving the overall operation, but the impact is going to be measured a little bit later. We've seen some cost deferrals and some cost savings with some support renewals we haven't had to do on some other tools. But we haven't seen the major cost impacts yet. We have spent a lot, but on cost-avoidance for various support tools we have saved close to $1,000,000. In the nine months we've been operational, we've deferred cost on at least two tools. One was about $750,000 and the other was $250,000 for maintenance.

It also helps to maintain the availability of our infrastructure across a hybrid, complex environment. I used to work at FedEx and we're not as environmentally complex as FedEx because we consolidate a lot of stuff on the ERP. But if you throw manufacturing in there, we have pretty much every flavor of platform. As with most deployments, we've got three-tier and four-tier applications. You throw the network and some load-balancers in there and it's fairly complex. If you can use a service model to see exactly what's working and what's not, it really gives you the ability to look at some things.

The solution has also helped to reveal underlying infrastructure issues that affect app performance. Let's say there is a system that is occasionally slow but you don't know why. Then you find out that it was supposed to be configured to use a large number of LDAP servers for authentication but somebody had configured it to one. When you compare the times at which the systems people were having trouble logging on and you look at the CPU and memory usage on your LDAP server, you begin to put things together, without actually analyzing configuration files. You can figure out that the system is configured improperly. When they dig in, they find that it's only talking to one LDAP server. It gives us that kind of diagnostic capability, by looking at everything, and the ability to pin things down.

In terms of root cause analysis, we're still working that through. But mean time to repair is going down because it's becoming much more obvious. Between the events that people are looking at which are prioritized, and the service models which show the actual impacts to the relationships, it's becoming much easier. Depending on the event, it's gone from about four to five hours down to 20 minutes. When it works, it's significant. A lot of it is cultural. When you go from everybody monitoring their own stuff and not talking to anybody else, to everybody looking at the same single pane of glass, and you throw a Service Desk on top of that, which is performing incident management and coordinating some things - between the technology and the culture and the process changes, you're going to see some pretty dramatic improvements.

BMC just did a custom KM for us. Typically, on a given server, we want to know when a drive is within three percent of being full. But we've got a mix of drives, servers which have anywhere from a 100-gig drive to a terabyte drive, and the percentages that we are worried about are not the same. This request came from our SQL group. BMC was able to adjust the alert parameters based upon the size of the logical drives. That was definitely a business innovation. I think that was good for BMC too. Although that's a custom KM which we just deployed, I suspect they will make that part of their standard tool kit.
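
The idea behind that custom KM can be sketched in a few lines: instead of one flat percentage for every drive, the alert threshold is derived from the size of the logical drive. The tier boundaries below are invented for illustration and are not BMC's actual parameters.

```python
def free_space_threshold_pct(capacity_gb):
    """Return the percent-free level at which to alert, based on drive size (illustrative tiers)."""
    if capacity_gb <= 100:
        return 10.0   # small drives: 10% free may be only a few GB, so alert early
    if capacity_gb <= 500:
        return 5.0
    return 3.0        # terabyte-class drives: 3% free is still tens of GB

def should_alert(capacity_gb, free_gb):
    free_pct = 100.0 * free_gb / capacity_gb
    return free_pct <= free_space_threshold_pct(capacity_gb)

print(should_alert(100, 8))    # True: 8% free on a 100 GB drive
print(should_alert(1000, 80))  # False: 8% free on a 1 TB drive is still 80 GB
```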

What is most valuable?

From a TrueSight perspective, we love the Capacity Optimization. We manage to collect almost all our capacity information through agents, without having to deploy a capacity agent. We've already saved some money. We're now provisioning more for obsolescence than we are for expansion because we now know exactly what we've got. One of the nice things about it is that we've now put Capacity Optimization in all our plants and mills, where the money's actually made.

The flexibility of the MRL is great. The various abilities to use native KMs to connect to a lot of things that we're doing with the hardware monitoring into the consolidated stuff, like SharePoint, is great. We're using native monitoring capabilities for all our server hardware, for visibility for applications, for URLs, for webpage response and accuracy, and for monitoring network throughput in a lot of particular instances. We're using lightweight protocols for pinging, for DNS, for LDAP. We use the scripting KMs for a lot of stuff that we have to script ourselves. We're also doing a lot of SNMP polling for devices. We've got some places where we really couldn't use a traditional agent and we deployed a Java agent that we wrote. For example, we might be monitoring UPS's out in the field using a Raspberry Pi and pushing that data back up. The problem with UPS's out in the field, when you have thousands of them, is that you don't know that the battery's bad until the power goes out. This gives us the ability to enable them to report back via SNMP.
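
As a rough sketch of what polling a field UPS from a small device might look like, the snippet below shells out to the net-snmp snmpget command. It is a generic illustration, not the reviewer's actual agent; the OID is the battery-charge object from the standard UPS-MIB (RFC 1628), which you should verify against your own hardware.

```python
import subprocess

# upsEstimatedChargeRemaining from the standard UPS-MIB (RFC 1628); verify for your UPS model.
CHARGE_OID = "1.3.6.1.2.1.33.1.2.4.0"

def battery_charge_pct(host, community="public"):
    """Poll a UPS over SNMP and return its remaining battery charge as a percentage."""
    out = subprocess.run(
        ["snmpget", "-v2c", "-c", community, "-Oqv", host, CHARGE_OID],
        capture_output=True, text=True, check=True,
    )
    return int(out.stdout.strip())

if __name__ == "__main__":
    charge = battery_charge_pct("192.0.2.50")  # placeholder UPS address
    if charge < 80:
        print(f"UPS battery degraded: {charge}% remaining charge")
```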

What needs improvement?

I can only speak from my perspective because I don't know if some of the issues that we've had are industry-wide or not. For instance, we've got a lot of Microsoft stuff here, and the SCOM interface is very difficult to use. They don't have support for SCCM and some other things so you have to go directly. 

The one piece that I would love to see is a general-purpose, configurable agent which would be a framework that you can deploy on anything, whether it be Java or anything else. It would allow you to easily deploy it on a platform that they support.

The KMs and some of the user interface are a little bit quirky. That's the stuff that they will eventually get to. TrueSight is a fairly new platform revision for BMC. I'm seeing a lot of those simple platform things, where you have to go here and do this and you have to go there to do that. They're working very hard to integrate everything into the same simple console. I think that a lot of the issues that we have are going to slowly, or maybe rapidly, disappear.

For how long have I used the solution?

We installed it a couple of years ago. We started ramping up and have been using it since then. We really went hot and heavy about nine months ago. We moved from Windows to Linux in January so that's when we really started to invest in event management work with it.

What do I think about the stability of the solution?

On Windows we went to application HA and, quite honestly, it was terrible. They'll tell you it's terrible - or they should. We are very religious about patching, so when you go to multi-node HA stuff and you've got the Windows guys patching your stuff every Saturday night, you become very unstable. What we did was we moved to Linux so that the patching wasn't necessary as often. And we went to operating-system and hardware-level failover with Oracle Solaris virtual machines, and we've been incredibly stable since then.

What do I think about the scalability of the solution?

Regarding scalability, so far, so good. We've got about 22,000 devices that we're working with, of which about 8,000 are directly monitored. The rest are coming in from SolarWinds, the network, and some other things. We're running three TSIMs and one parent, so four infrastructure managers. We've got integration servers all over North and South America and Europe. It's very scalable.

In terms of users, it's mostly IT right now and a few business people. We've also got 300 to 400 service providers who log on and look at things occasionally. A lot of them just use the ticketing system. They don't actually get into BMC. They just work their tickets and close their tickets.

As for increasing the usage of it, the foremost thing in our pipeline is to continue to bring on applications. As part of the service onboarding that I talked about, we're bringing in major applications and sitting down with the service owners. We're going through everything they could possibly want monitored and showing them what we can do for them. We're putting those thresholds in place, training their teams, and bringing their teams on as users. Slowly, over the next year to year-and-a-half, we will bring in all of IT.

How are customer service and technical support?

Tech support varies, it depends on who you get. The first-tier is pretty good. If you get the right guy, it's outstanding. They've actually brought on a lot of new people, but they seem to work together as a team. I won't say they're bad, but I don't like tech support for most companies. Overall, they're on par.

Which solution did I use previously and why did I switch?

Prior to BMC, from a monitoring perspective, we were using 65 other solutions. One of my missions is to either integrate them or consume them. Bringing on TrueSight was the vision of a guy who's no longer here. He fully understood the need for a single pane of glass. He understood, fully, the need to bring light to the monitoring situation. We did some evaluations and proofs of concept and decided on TrueSight.

Quite honestly, if you're a large corporation, you can go look at the studies and you can justify it that way, but if you stop and think about how much better your organization can run, and the things that you need to do from an operations management perspective - and you think about the automation that you can put in place - it's a no-brainer. It's just a matter of choosing which tool.

How was the initial setup?

The initial setup was complex, no doubt, even when you bring in Professional Services, if you opt to. We didn't follow the standard model because we didn't want them to come in, drop in a configured system, say, "Here's the book on how it works," and then walk away. We wanted them to participate in every aspect of it. We took a lot of it on ourselves, where they told us what to do and we did it. We worked with Pro Services to do it, so it took longer than it probably should have, but we know more about it than we would have as a result. It's a very flexible product, which means it's a very complex product. We had enough servers and monitors that we had to bring up a multi-tiered, large number of TSIMs. It was because of our service models that we introduced a lot of the complexity ourselves.

Because we're pushing full sets of service models out of our CMDB and into TrueSight to use as a service model, we have to put them at a top level of a TSIM so that all the other TSIMs that feed into them can show up as impact models. We went to a three-tiered architecture with presentation on top, a service management infrastructure manager in the middle, and the integration managers below. So a lot of the complexity in our particular configuration was due to the fact that we didn't want to have to figure out where those services belong, or which piece belonged on which TSIM. We wanted to punch them out to the top and then let TrueSight worry about it. So in the long run, it was complex to install but it is much easier to maintain. 

The deployment took about three months. There was one person from BMC and about five people altogether. We had DBAs involved, the hardware guys involved, and the network guys involved. It was probably three people full-time, but off and on. Every department that would touch this thing was involved at some point.

There is a team of five employees and myself who are not only maintaining it but doing all the monitoring configuration - working with users to collect monitoring requirements, setting thresholds and writing custom MRL and PSL.

At the cultural level, when we first started it up, people would say, "I have my own monitoring tool and I don't need you people. I'll do my thing." Now they're saying, "You're doing things for these other people, can you help me out?" It's really grown organically, and we've had to put a team together so quickly that we never had what should have been in place: a major deployment plan where all of the pieces fall together. We're starting to work on that now.

What about the implementation team?

We worked directly with BMC. We didn't use any third-party.

What's my experience with pricing, setup cost, and licensing?

The only possible additional cost that I can mention, that you might not be aware of, is that it uses Oracle partitioning, if you use Oracle. There are Oracle partitioning fees that go with that.

Which other solutions did I evaluate?

We looked at some other options. BMC has been around a long time. If you look at the industry ratings, it's way up there, top-right quadrant, along with a couple of other solutions. Its flexibility and its capabilities dovetailed with what we wanted to do and we liked their people. They have a good attitude.

What other advice do I have?

My advice is that it's not going to be as easy as you think, but it's going to be worth more than you think when you get it done. It depends on your situation. It depends on how far advanced you are in operations management. For us, this was a complete cultural, technological, and process overhaul. It wasn't just replacing one tool with another. It wasn't just putting a tool in place. It was an entire IT renewal and it's still going on.

It's been a long, hard road, both from a cultural perspective and from a technology perspective, just getting people to realize the value. But once they do, they're willing to bend over backward for you.

We had some false alerts. In my job, the red light means it's bad and the green light means it's good. There should be no light that you think is green when things are actually bad. We had some of that at the beginning, more our fault than anybody else's. But once we got to the point where the signals were good and people could appreciate what they were getting, we became a very different organization.

The biggest lesson I've learned from it is that you can talk about it, you can visualize it, you can proselytize about it, but until you have a single pane of glass which is actually up and running with a lot of stuff connected to it, you just can't really appreciate the value of it.

The functionality of the solution is not helping, so much, in terms of business innovation. We're not doing business process monitoring at this point. While it might be that the business is not complaining as much, I don't measure that. But from an innovation perspective, it has had people look at things and say, "Well, if you can do this, can you do that?" We get a lot of requests for strange things, some we can do, some we can't. But it's getting people to think about things that hadn't really come up before.

It's a really good tool and most of the issues we've got, they've either fixed or they're fixing to fix. So a nine out of ten is right.

Disclosure: IT Central Station contacted the reviewer to collect the review and to validate authenticity. The reviewer was referred by the vendor, but the review is not subject to editing or approval by the vendor.
SL
Sr Application Engineer BMC at a tech services company with 1,001-5,000 employees
MSP
Integration of the monitoring and Console access is valuable and event management is a strong point

Pros and Cons

  • "Using the TrueSight platform we can monitor server performance and notify the customers using the integrated ticketing for events. We can let them know if there are any issues with a server, or application, or database."
  • "One of the things that the TrueSight environment is missing is some of the HA abilities. The data collection server called the ISM doesn't really have the HA functionality or workload balancing. It was missing from the previous product as well. It's missing redundancy."

What is our primary use case?

We are using it to monitor open systems and some iSeries systems.

How has it helped my organization?

TrueSight has helped to reduce IT operations costs.

The solution has also helped to reveal underlying infrastructure issues that affect app performance. The solution has application monitoring called Application Performance Management. It's an improvement on the old, traditional TMR. It's integrated within the TrueSight solution. It will notify regarding application performance and report issues with applications.

What is most valuable?

One of the valuable features is the integration of the monitoring and the Console access.

We manage our open systems. Using the TrueSight platform we can monitor server performance and notify the customers using the integrated ticketing for events. We can let them know if there are any issues with a server, or application, or database.

The solution's event management capabilities are a strong point for TrueSight. They are based on the previous BMC Event Manager which was very stable and pretty powerful. It was an excellent product.

What needs improvement?

One of the things that the TrueSight environment is missing is some of the HA abilities. The data collection server called the ISM doesn't really have the HA functionality or workload balancing. It was missing from the previous product as well. It's missing redundancy.

In addition, it needs some details such as auditing inside the product - there is no auditing for the policies.

What do I think about the stability of the solution?

It's pretty stable. TrueSight uses a major BMC product called Patrol, and Patrol has been around for many years. It's one of the best products and it's pretty stable.

What do I think about the scalability of the solution?

In addition to the traditional Patrol Agent, BMC TrueSight added the predictive functionality so we can predict a trend instead of having a static threshold. We can let people know, in addition to what is happening, what is going to happen. We can predict that and have the ability to do a cost analysis.
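
The difference from a static threshold can be illustrated with a small sketch that fits a trend to recent samples and estimates when the metric will cross its limit. This is purely a conceptual illustration, not BMC's predictive algorithm, and the usage numbers are invented.

```python
from statistics import linear_regression  # requires Python 3.10+

def intervals_until_breach(samples, limit):
    """Estimate how many sample intervals remain before a rising metric crosses its limit."""
    x = list(range(len(samples)))
    slope, intercept = linear_regression(x, samples)
    if slope <= 0:
        return None  # flat or falling trend: no breach predicted
    return (limit - samples[-1]) / slope

# Example: daily filesystem usage in percent (hypothetical numbers)
usage = [70, 71, 73, 74, 76, 77, 79]
print(intervals_until_breach(usage, limit=90))  # roughly 7 more days at the current growth rate
```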

How are customer service and technical support?

BMC's technical support is pretty good. They do have ups and downs. In the past, it was very good but there was a certain period of time where they had support from overseas, from India. The quality of support was not as good as the traditional one, but I do see that it is getting better now.

How was the initial setup?

The initial setup was pretty straightforward. The documentation was pretty good. The deployment was not very buggy, and the Patrol Agent was pretty stable.

What about the implementation team?

We deployed it ourselves.

Which other solutions did I evaluate?

We did a comparison with a different product on the market. We had a CA product which I believe was called Spectrum. We compared BMC with that and InSoft. We felt that the BMC product was much better than CA's product. We also had an HPE product in the old days, and BMC is a better solution.

We had BMC for a long time. We had multiple products which we compared, and BMC is a better solution, so we removed the CA product. BMC is better in terms of support. It takes fewer people to support, it's easier to configure, and easier to change the configuration. It's also easier to change the special settings. And it's easier to maintain.

What other advice do I have?

BMC products are very good. All products have pros and cons. For example, all the enterprise monitoring solutions are not really set up for multi-tenancy. BMC products are very stable and the support is good, and the configuration, especially, is easier to do. I think it will come down in pricing, although the cost is not something I am involved in.

We started using TrueSight in the early stages. Like every product, TrueSight, as a new product of BMC, was going to take some time until BMC improved it, got all the bugs out, and got all the features added. It's not perfect, but I do see improvement. When a product is in its infancy, it will always have some issues. I do see BMC trying to improve that. It's getting better now. It's pretty stable. It's a very good tool for traditional open systems and mid-range.

I would rate TrueSight Operations Management at eight out of ten. It's not a ten, because, as I mentioned, it is missing some capabilities in HA solutions. In the past, we had load-balancing HA. Now, it has to rely on an external load balancer to achieve HA.

But I have to say that my view is limited because we do not have the whole suite of BMC products. There are certain things we do not own, like automation and deployment. If we had the full BMC suite, I would probably give it a ten.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
George Klarmann
CEO at Transcendence IT
Real User
It has good monitoring all the way from storage up to the servers

Pros and Cons

  • "Its event management capabilities are very open and flexible. I haven't seen a use case scenario with a customer that we couldn't actually solve the problem for, so it's really good. There are some interesting things that happen in an enterprise network (things that people don't normally expect), and the event management product is very flexible. You can solve problems as far as your imagination can go with it."
  • "I would like to see a little more out-of-the-box event correlation and expanded AIOps type capabilities. Where you can train your artificial intelligence operations to be able to memorize an issue once you encounter one scenario, so if you encounter that same problem, you can get to the root cause very quickly."

What is our primary use case?

My customers use it to monitor their enterprise and define their services. With a lot of the AIOps features and things like Service Impact Manager, my customers are able to reduce their mean time to repair (MTTR). MTTR is very important for companies who are reliant on their critical IT applications.

How has it helped my organization?

A lot of customers, if they're not using these products, don't know that they have an IT issue until one of their own customers contacts them and says, "I've got a problem." With TSOM, they are able to be more proactive. The IT department gets alerted more quickly and, sometimes, they can resolve issues before the customer even knows that there is an issue.

This solution helps our customers to reveal underlying infrastructure issues that affect app performance. It has good monitoring all the way from storage up to the servers. Now, all the things I'm seeing in the cloud are very good, as well.

What is most valuable?

A lot of the integrations with all the other BMC products are fantastic, because it has a great discovery tool which can model applications and integrate those into TSOM. Then TSOM, once an alert is detected, can automatically create tickets in the ITSM system, which is Helix.

Its event management capabilities are very open and flexible. I haven't seen a use case scenario with a customer that we couldn't actually solve the problem for, so it's really good. There are some interesting things that happen in an enterprise network (things that people don't normally expect), and the event management product is very flexible. You can solve problems as far as your imagination can go with it.

What needs improvement?

I would like to see a little more out-of-the-box event correlation and expanded AIOps-type capabilities, where you can train your artificial intelligence operations to memorize an issue once you encounter one scenario, so that if you encounter that same problem again, you can get to the root cause very quickly.

What do I think about the stability of the solution?

It is very stable.

What do I think about the scalability of the solution?

It's very scalable. It is horizontally scalable. Right now, I'm hearing good things from the product team that they want to do some things as far as vertical scalability, as well.

How are customer service and technical support?

I really don't have interactions with support myself directly. Some people who work for my company do that.

The technical support does a good job. I used to be a BMC consultant back until 2013, and in those times, the support was very good.

What was our ROI?

It does help reduce IT operations costs. A lot of times, people are doing things manually. They might have a network operating center where they just monitor screens all the time, so you can reduce labor costs. From a scale perspective, if they have thousands and thousands of servers in their data centers, you can determine which ones are more critical over the other ones and focus on the critical pieces, not the pieces which don't matter.

Which other solutions did I evaluate?

I have customers who use other products. Sometimes, they ask me to evaluate other products that they're considering.

I am working with one large company right now who is looking at the ServiceNow Event Management product, and it's a little immature right now. Therefore, I told them about that. There are also other products that feed into the ServiceNow product set which are very expensive and very difficult to implement. This is one place where I have made the recommendation of BMC, specifically over ServiceNow.

What other advice do I have?

Understand your use cases. Take a look at the use-case-to-product fit. I don't really recommend many other products. We are sort of committed to the BMC product set because it's good. We have a lot of experience with it, and we came from a company that was acquired by BMC Software. The product manager for that company, April Hickel, has done very well. She is the product manager for TSOM now. I know her, her innovative capabilities, and her whole team. I've been working with them for a long time, so I know not only that the product is good, but that the roadmap is good and the people behind it are very good.

If you have a good imagination you can solve anything, but you need the right tool to apply it to.

Disclosure: My company has a business relationship with this vendor other than being a customer: Partner.
SO
Sr Manager at a tech services company with 1,001-5,000 employees
MSP
It covers so many different technologies which can roll up into a single console

Pros and Cons

  • "It is breadth. It covers so many different technologies which can roll up into a single console."
  • "The noise reduction for ticketing works much better than we have seen in a lot of other companies."
  • "I definitely would like to see more improvement in the self-diagnostics. I need to know when anything is not working or collecting, long before our customer finds it."

What is our primary use case?

My company is a data center service provider. We host and manage IT for all types of different companies, using TrueSight to manage and monitor the health, performance, and availability of all our customers' environments: networks, servers, databases, websites, and all their back-end IT.

Right now, the focus is pushing DevOps and AIOps in our more traditional data center management. We are not using it in the cloud space today. Therefore, the focus is the traditional data center space, but for us, that is a very large space.

How has it helped my organization?

One case that we like to use a lot: We have a customer who uses F5 load balancers, and they were managing them with CA products. Those load balancers were generating around 11,000 tickets a month. Just moving them from CA to TrueSight, and replicating the same rules, they went from 11,000 tickets a month to 400 tickets a month. TrueSight did a much better job of doing the same thing. Then from there, we were able to tune it. We got it down to about 40 tickets a month. While this is an extreme example (I don't usually see this type of improvement), it shows the power that is there.

We are able to identify problems more quickly and get an engineer on them to restart services, etc. It is not fixing the customer's bugs. They've got buggy apps, and those go down all the time. It is just that we can get them back online faster.

What is most valuable?

  • It is breadth. It covers so many different technologies which can roll up into a single console.
  • The noise reduction for ticketing works much better than we have seen in a lot of other companies. 
  • We're starting to get into the machine learning pieces to further enhance the intelligence of events.

What needs improvement?

Continue to improve the maturity of the product overall. 

I definitely would like to see more improvement in the self-diagnostics. I need to know when anything is not working or collecting, long before our customer finds it.

I would like to see continued improvement in integration with some of their partners. We use a lot of Intuity software. While the connections are good, they could be better. We use App Visibility as part of the TrueSight suite. Previously, we were a big BMC TMRT customer. They gave up a lot of features of TMRT to get App Visibility in, features that our customers used. They still complain about this weekly: when are we going to get this report or view back?

When we took this issue back to BMC, they said, "It wasn't an upgrade from TMRT. It's a brand new product. It just happens to be serving the same market." From my user standpoint, we went from BMC TMRT to BMC App Visibility, giving up all these features. For us, it was an upgrade on which we lost features. I need that stuff back, at the end of the day, as a service provider. The customers need to feel comfortable that the data is there. They need to have accurate SLA-type reports. The SLA reports that we get from TrueSight today are unfortunately worthless. They only go to the whole integer, so they all show 100 percent when we've got contracts at 99.996 percent, because everything rounds to 100. Well, if we were actually at 99.95 percent, that's an SLA miss. Things like this are a problem. We have to do all this manually on the side. And we can't roll back, as the versions that we used to use are long out of support.
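
The rounding problem is easy to demonstrate: an availability figure has to be carried to at least three decimal places before it can be compared to a 99.996 percent target, otherwise a real miss rounds up to 100 and disappears. The downtime figure below is invented for illustration.

```python
def availability_pct(total_minutes, downtime_minutes):
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes

# One 30-day month with 22 minutes of downtime (hypothetical)
month = 30 * 24 * 60
avail = availability_pct(month, 22)

target = 99.996
print(f"{avail:.3f}%")   # 99.949% -> an SLA miss against a 99.996% contract
print(f"{avail:.0f}%")   # 100%    -> the miss vanishes when rounded to a whole integer
print(avail >= target)   # False
```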

The biggest issue is probably the gaps in the reporting that I need for my end customers. That is a very public and embarrassing "I can't give you the report that you need." Also, the reliability of the ISNs needs improving. Having a customer find a machine that stopped collecting before we do is not what you want when you're a service provider.

For how long have I used the solution?

We have been a BMC client since 2001. We've been through many generations of the product.

What do I think about the stability of the solution?

The stability has a bit more maturing to do. There is still room for improvement. Overall, it's pretty good, depending on which layer you're looking at. At the highest level, which is the presentation server, we find that we have to restart it every two months or so, just because it stops responding. I would like that to be a bit better. We don't have any real understanding of what's causing it. The next layer down is the infrastructure manager level. That's probably about the same; every couple of months it stops responding. Then you go farther down to the data collection layer, the ISN level. Those aren't as stable as they need to be. They will go for six months fine, then fail three times in a row in two weeks. It doesn't give us a good alarm and, unfortunately, we've missed an event; the customers notice something because the ISN didn't pass its events along. So, a little more maturity is needed here.

What do I think about the scalability of the solution?

It's scaling fairly nicely, but not as large as we would like. We are not seeing the type of scalability that BMC claims. For example, they say that you can run 900 agents against an ISN. We find the ISN stability goes down when you hit 500 or 600, so you're only at two-thirds of the stated capacity. I forget how many millions of things the TSIM was supposed to be able to handle; we are nowhere near that capacity. We're spinning up more TSIMs because it's just not scaling as advertised.

How are customer service and technical support?

Technical support is a mixed bag. Some tickets go in and are handled very quickly and well. However, we have had tickets which go in and have been out there for months, and some of them were fairly complex. They will go up to Tier 2 or Tier 3, then park. I'm assuming that we're running into a software bug, or something, but those tickets that stall out are frustrating.

How was the initial setup?

It was complex. I wish we had put Professional Services into the deal. Being a service provider, we are attached to companies all over the world with very strict auditing and security requirements. Therefore, designing the architecture to work in that environment was fairly complex. I was just talking to a product owner about the problems that we still have.

Once we got the architecture in place, the deployment went fairly smoothly. The policy creation and management were much more complex than in their previous products. It is probably more powerful, but not as easy to administer.

They have rolled what were previously multiple separate products into a single product. They've had to do some consolidation, or adjustments, to merge them quickly and get the product to ship. This left some things missing. Some features that used to be there, features that we used to use, are gone. So there are pain points as we figure out how to work around the new gaps.

What about the implementation team?

We did it ourselves. 

Globally, I've got six engineers and 12 operators who worked on the deployment. This is a sizable group. However, I'm currently supporting global operations of a couple hundred clients, and they're major clients. 

What was our ROI?

TrueSight has helped reduce IT operations costs. From a software standpoint, I have been able to eliminate a lot of other tools, saving approximately half a million dollars a year in other maintenance costs. That is easy savings. The more important one is the labor savings: more reliable, simplified tickets. 

The time savings are recognized by the operations teams, not my team. Therefore, it's hard to know the time savings, but if an operations person takes at least 15 minutes to analyze a ticket and their ticket volume is reduced by 10,000 a month, then TrueSight does save time.
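
To make that estimate concrete, here is the back-of-the-envelope arithmetic under the reviewer's own assumptions of 15 minutes per ticket and 10,000 fewer tickets per month; the 160 working hours per month used to translate it into headcount is my own assumption.

```python
minutes_per_ticket = 15
tickets_avoided_per_month = 10_000

hours_saved = tickets_avoided_per_month * minutes_per_ticket / 60
print(hours_saved)        # 2500.0 operator-hours saved per month
print(hours_saved / 160)  # ~15.6 full-time operators' worth of effort, assuming ~160 h/month
```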

We've been reducing ticket noise five to ten percent every year, and it has been cumulative. This means fewer tickets, less noise, and less operator intervention.

What's my experience with pricing, setup cost, and licensing?

It is a large, complex product. So, there is a commitment of manpower to deploy it, as it is not a cheap product.

We license per named endpoint for most of the products: servers, network devices, databases, etc. You pay for the initial license and maintenance. The way my company looks at it is that we figure out our monthly cost over five years, and right now, we are between five and six dollars. We need to get that down to about four dollars. That includes the maintenance.

There is a big upfront cost when you buy the license, then there is annual maintenance. We look at it this way: if I bought a license and paid for maintenance for five years, then averaged it out, what would my monthly cost be? We have had some of the competing tools come in around four dollars. This one is coming in at a premium, which is why I don't have it deployed as widely as I would like. Therefore, we're in negotiations right now. If I can get it down to the four-dollar range, I will triple my deployment in a year and a half. If they could get me to the right price point, there are 10,000 to 15,000 servers that I would install it on.
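
The "monthly cost over five years" yardstick can be written out explicitly. The license and maintenance figures below are invented placeholders used only to show the arithmetic; they are not BMC's actual prices.

```python
def monthly_cost_per_endpoint(license_fee, annual_maintenance, years=5):
    """Amortize the upfront license plus maintenance over a multi-year horizon, per named endpoint."""
    total = license_fee + annual_maintenance * years
    return total / (years * 12)

# Hypothetical numbers: $150 upfront license and $30/year maintenance per endpoint
print(round(monthly_cost_per_endpoint(150, 30), 2))  # 5.0 -> in the five-to-six dollar range described
```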

Which other solutions did I evaluate?

As we've acquired other companies, we've picked up pretty much every other tool set out there: CA, IBM, SolarWinds, etc. We have played with pretty much everything. The BMC TrueSight platform wins probably 80 percent of the time if you look feature by feature. It's a good, strong platform. Its ability to run on all the OSs that I've got is a huge thing. We do a lot with IBM iSeries, and a lot of vendors don't cover that. So, this is a big positive on the platform.

Being able to roll everything up to a single database and a single feed out for reporting is a very big positive. The same type of consolidation rules that didn't work under CA just work when you write them in BMC. Things like that make BMC great.

What other advice do I have?

You really want to plan out your policy and architecture in great detail before you start any deployments. It is a complex product. You don't want to have to go redo it. Pick a small environment, test out your plan, test it out a second time, beat it up, and once you're happy with it, then go nuts by deploying it everywhere. It's great once it's there, you just have to get past that design hurdle, because there are things that aren't necessarily intuitive.

I have a mixed bag impression of the usability. The end user experience is mostly good, as it's a very clean interface. There are some quibbles with it. You have to drill into a lot of layers to get into the data that you want. However, when you hit "Back", it takes you all the way back out of the tree. Then, you have to redrill into all those layers. That is a bit of an annoyance for end users. From an administration side, it is still sort of heavy, and policies are very complex. Therefore, it takes a fairly senior level engineer to build it and get it to work well. But, once it's working well, I can monitor tens of thousands of things.

Definitely get multiple references from each of the clients, since all salesmen lie. They all promise the best possible scenario, and I have found that, depending on the client, you get very different experiences. So, the claims that the BMC sales guys have made are all achievable in a perfect environment. No one has a perfect environment.

Claims from CA, I have found to be outright fabrications, such as, "We can do this." Then, we buy the product. "Oh well, you actually need Professional Services, and you're going to need like three years of custom coding." Millions of dollars down the drain with them. 

Other vendors have different levels. They all come in very rosy, and sometimes too much. So, talk to people who have really done it. Take their advice. Don't assume that they didn't know what they were doing. There are a lot of good engineers out there. If the company is struggling, assume you will also struggle.

Disclosure: IT Central Station contacted the reviewer to collect the review and to validate authenticity. The reviewer was referred by the vendor, but the review is not subject to editing or approval by the vendor.
Paul Mercina
Director Product Management at Park Place Technologies
Real User
Enables us to monitor a hugely diverse set of hardware products from multiple manufacturers

Pros and Cons

  • "The ability of this platform to monitor the very diverse assets that we maintain around the world is its most valuable feature... We support a vast array of manufacturers' equipment, like HP, IBM, Cisco, Dell, EMC, Hitachi... We can do it all with [this] one [solution]."
  • "We have a unique use case because BMC typically sells this solution into enterprises that are deploying it within their IT, versus to a managed services provider like us where we're supporting thousands of customers. Multi-tenancy and the scalability have been challenges along the way, as we've grown... If anything could have gone better as we were ramping this up and adding a lot of volume to it, I would say it's the scalability. That would be one thing that could be improved."

What is our primary use case?

We're actually hosting the software and providing services to our customers based on all the capabilities that are within TrueSight. We are a very large, global, hardware maintenance provider for data centers. We mostly service the high-end data storage and networking equipment that you would find in data centers and in cloud environments. 

A couple of years ago we started on a journey to really improve our ability to maintain and service our customers. This was all about connectivity, getting connected to those servers and storage platforms. We wanted to get connected to everything that we were maintaining around the world so that we could really implement a "diagnosis before dispatch" approach.

With this solution, we gather all the data from a server that has failed, and we do all the troubleshooting, the problem and root-cause determination - we call that triage - before we ever send a field engineer or anyone to the site. So when we do send a part or do send a field engineer, we know exactly what the root cause of the problem is and what they need to do to fix it. 

How has it helped my organization?

We are using this solution to scale our business and to drive greater efficiencies. The other side of it is that it's much better for our end customers because they no longer have to monitor their own environments for hardware failures. We do that for them. They don't have to recognize that a server has failed. They don't have to pick up the phone or send us an email to open a ticket and send us files to help us troubleshoot the problem. We're really reducing a lot of the effort required on the customer's side to manage their IT environment using this tool because we can detect the failure, we can troubleshoot it remotely. And, when we do implement the corrective action, we're pretty certain of the root cause, based on the technology and the capabilities of TrueSight.

It has improved our time to repair. From the time we get the incident logged to the time we get the customer back up and running, it has improved that by 33 percent or greater. It has also improved our ability to fix it right on the first call. It gives us the root cause of the problem, and it automates that whole triage, it gives us the part number of what's failed. We're now at somewhere around a 97 percent first-time fix rate. And that's only going to get better as we get more experienced with the product. And that's important to our customers. When we come out, we're going to fix it right on the first call and not have to come again and again and again. That's really important to the uptime of their IT.

We have a graphical representation of this very thing. It shows the old way of service delivery, in which the customer first had to recognize they had a problem. Once they recognized they had a problem, they had to call in or email and open a ticket. Once they opened a ticket, the whole troubleshooting process would begin. We were often calling them as many as eight times per ticket, just to get information about the failure. That was taking a lot of time from the customer. After that, we would have to dispatch someone with the right part or the right solution, and oftentimes we either brought the wrong part, or we had to bring a handful of parts, which was costly for us and would drive up the cost of the service for the customer. And often there would be a repeat call, because we might not have brought the right part or have sent the right level of skill out on that call. That was the old way of doing it.

The new way of doing it for the end-customer is that we call them to let them know we have spotted a problem with their server, for instance, and that we're working on it. We don't have to bother them for log files or diagnostic logs or any of that information anymore because it all comes packaged with the alert from TrueSight. The customer really only hears from us two times now: once, when we open the ticket to let them know we've seen a problem and again after we've resolved it.

Another example is that many of our customers have equipment in co-location centers and offsite data centers, where they don't even have anyone to see that there's a problem. Now, we are driving a lot of efficiency for them. They don't have to send people out to check on problems anymore or pay somebody who is running the co-lo to go out and check on something. We're able to see it all remotely through the monitoring tool. That's another huge benefit that we've heard about from our customers.

The solution provides us with a single pane of glass where we can ingest data and events from many technologies. In terms of our IT ops management, we have a unique deployment. We actually have it running in our own shop. Everything that we deploy to our customers we deploy internally first. But we've really licensed and implemented TrueSight to drive our services business. We're supporting all of our customers' data centers with the product. We're not connected to all of those yet. We just officially launched the solution in January of 2018. We've got about a year-and-a-half in production with the product and we're getting good adoption. The real answer to its effect on our IT ops management is not so much our internal deployment. It's more about the deployment that we're leveraging for all of our 16,000-plus customers globally.

We've had a number of cases where, through the analytics in TrueSight, we've actually been able to predict failures. For instance, we've already had a couple of cases where, if we see a hard drive on a storage array is going to fail, we'll actually send the part out ahead of the failure. That allows us to replace that drive before it fails - and on the customer's planned downtime. In the old model, it fails, it's down. The customer waits for us to come out, swap it out, and bring everything back up. In the predictive model, we know it's going to fail, we send the part out ahead of the failure, and we replace that drive on the customer's scheduled downtime. The more of that we can do - and as we expand beyond hardware into operating system, application, and the other layers of infrastructure - we'll be able to exploit the machine learning and the AIOps to a greater degree than what we're doing today on the hardware side.

The way we talk to our customers about the functionality of the solution across IT ops management to support business innovation is that because we've significantly reduced the amount of time they have to spend managing service tickets, they have more time to focus on their digital strategies. We say, "Hey, we're giving you some time back. You don't have to spend all this time interacting with your service provider. You're just going to hear from us when you have a problem and after we've fixed it. We won't bother you for log files and all those things." We're actually giving them time to allow them to do more value-added work, like working on their strategic initiatives and their digital transformation initiative. I think we'll be able to expand on that as we go forward.

What is most valuable?

The ability of this platform to monitor the very diverse assets that we maintain around the world is its most valuable feature. We service over 350,000 data center assets. These assets come in the form of servers, storage arrays, networking devices, etc. We've calculated that we service and support over 36,000 data centers around the world.

We're not really tied in with the manufacturers, but we support a vast array of manufacturers' equipment, like HP, IBM, Cisco, Dell, EMC, Hitachi; and I could go down the line. We have a very diverse install base under contract and TrueSight can connect to all of those and monitor all those different platforms. Many of our customers have as many as 20 tools in their IT environments to try to monitor all this stuff. We can do it all with one, and we're hosting it for them. So it really gives us the ability to take some of that burden off the end customer.

The other really important thing to us, and the reason we chose TrueSight, is not only to monitor and to capture failures and alerts when things fail out there, but to do what we call "automated triage." No matter who manufactured the equipment, when we get the message that tells us something has failed, it always looks the same. Whether it's EMC or Dell or IBM, whatever the equipment might be, TrueSight always returns the event in a standard format which gives us the manufacturer, the model, the serial number. It even gives us a list of what has failed, whether it's a hard drive or power supply, for example. It even gives us the part number of that specific device in that specific machine. That really helps automate the troubleshooting and the triage process. That's a big feature for us.
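
To make the "automated triage" idea concrete, here is a minimal sketch of what such a normalized failure record could look like. This is illustrative only - the field names and values are assumptions, not BMC's actual event schema.

```python
# Illustrative only -- not BMC's actual event schema. A normalized hardware
# failure record of the kind described above might carry these fields.
from dataclasses import dataclass, field
from typing import List

@dataclass
class HardwareFailureEvent:
    manufacturer: str                      # e.g. "EMC", "Dell", "IBM"
    model: str
    serial_number: str
    failed_components: List[str] = field(default_factory=list)  # e.g. ["hard drive"]
    replacement_part_number: str = ""      # part to dispatch

def triage(event: HardwareFailureEvent) -> str:
    """Same triage logic regardless of which vendor raised the alert."""
    parts = ", ".join(event.failed_components) or "unknown component"
    return (f"{event.manufacturer} {event.model} (S/N {event.serial_number}): "
            f"replace {parts}, part #{event.replacement_part_number}")

# Placeholder values only, to show the call shape.
print(triage(HardwareFailureEvent("ExampleVendor", "Model-X", "SN-0001",
                                  ["hard drive"], "PN-1234")))
```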

The solution's event management capabilities are proven. We always like to say it performs as advertised. We evaluated over a dozen products before we chose TrueSight, and we found it to be very good at monitoring at the hardware level, which is core to our business. The ability for it to capture those failures, to capture all the events from that very diverse set of equipment which we maintain out there, means we are very impressed with the performance.


In terms of the breadth of the solution's monitoring capabilities, I've already addressed the different types of products, the different manufacturers. The diversity of what we service out there is amazing, and it can really monitor just about everything that we maintain out in the field. But the other aspect of the breadth is the fact that not only does it do hardware really well, but it's really going to help us start to add to our portfolio of services. We're going to be able to use this to monitor operating systems and applications and software and networks, and even all the way to end-user experience. Ultimately, we're going to be able to move into other areas of service, based on the breadth of what it can do in the total IT infrastructure.

For how long have I used the solution?

In production, we have been using it for about a year-and-a-half.

What do I think about the stability of the solution?

We're in a very stable environment now but it took a little time for us to get there. That's because of the multi-tenancy, the scalability, and the volume of traffic that we're driving through their platform. Those demands are very different from what BMC is used to. It's potentially hundreds, potentially thousands of customers, with a lot of equipment in their data centers flowing through. We are now in a very stable place in production. We feel very comfortable going forward, scaling it out, and adding thousands of customers to it. It took us a little bit of time to get there and we needed a lot of support from BMC, but we feel good about it right now.

What do I think about the scalability of the solution?

We have a unique use case because BMC typically sells this solution into enterprises that are deploying it within their IT, versus to a managed services provider like us where we're supporting thousands of customers. Multi-tenancy and the scalability have been challenges along the way, as we've grown. But BMC has really been a great partner helping us address those things.

Building that kind of scale and multi-tenancy into the product would better serve companies deploying it the way we are. It's a little different from what BMC is used to, but that would be one thing I would put out there. If anything could have gone better as we were ramping this up and adding a lot of volume to it, I would say it's the scalability. That would be one thing that could be improved.

How are customer service and technical support?

BMC's technical support has been great. They've been by our side. They've been working with us. They could have just said, "Look, our product wasn't built to do that. Good luck." But they didn't. They stuck with us and they're still with us today helping us optimize and do things better. They've been a great technology partner for us.

Which solution did I use previously and why did I switch?

Most of the storage products have a native "call home" feature. It's like email alerting, so when a hard drive fails on the storage array, it will send an email. A lot of the manufacturers did that for the warranties. It would send them an email and they could take care of the warranty claims. What we did was redirect those emails to us, because most of what we do is after the warranties have ended on a product. We were getting all these emails from potentially thousands of things that we were maintaining out there, and every email looked different. Emails from HP looked different than those from EMC which looked different than the ones from IBM or Hitachi. Everything was in a different format. It took a long time to sift through these emails to figure out what was actually wrong, and it was very inefficient. That's how we were doing monitoring.

We also had a little black box that we built internally that was using SNMP and some other technologies. But a lot of customers don't want some rogue hardware in their data center. It's a security concern. So that was very limited in its deployment. Overall, by and large, we really weren't monitoring. We were very crude in our methods and there was a very limited number of things that we were monitoring at the time I came in.

That's when we started thinking, "You know, if we either build or buy a world-class monitoring platform and get it connected to everything, we could really differentiate ourselves in the market." That's what led us to start evaluating some commercial, off-the-shelf things like BMC.

How was the initial setup?

We got it up and running pretty quickly. We had it up within three months because we had to buy hardware and build the whole infrastructure, so it was a little more than just installing the software.

Then we did what I call a controlled deployment. We had about ten to 15 customers in a pilot program. We ran that over about a six-month period before we went live in production.

What about the implementation team?

We had a consulting firm that worked with us, a firm which BMC had brought to the table named Column Technologies. That experience was not good. BMC had said these guys were one of the best partners they had, and they probably are. It could have been Column Technologies, it could have been anybody that they brought in. 

Our implementation was so unique and different compared to what they were used to. They were used to going into an end-user and helping them get this solution deployed within their own IT environment, to manage their own back-office IT. But that's not how we were doing it. We were putting it in as a service platform to manage thousands of customers and hundreds of thousands of devices, potentially. So the implementation was very different.

BMC had to work with us pretty extensively on how we were configuring and putting this in to make it work the way we needed it to work. I'm not going to pick on the consultant that much or criticize them too heavily because this installation was very different than what they were used to doing.

We got a lot of support from BMC because it required it. We needed the guys who built the product to help us get this thing implemented in such a way that it would support our business model. Ultimately, we solved those problems and we're in good shape now. But there were some startup issues, that's for sure.

What was our ROI?

I don't know that I have a number available. When we embarked on this journey we had some business-case assumptions about what our internal savings would be. We've got a little more work to do to come up with those numbers. We need to get more volume deployed before we can say we have a reliable percentage of OpEx reduction.

What's my experience with pricing, setup cost, and licensing?

Pricing is all volume-driven. I think we were paying between $80 and $85 per license. That's per unit, for a perpetual license. You pay it one time and then, every year, you pay 20 percent of that for annual maintenance and support. 

But now that we've grown, we've purchased tens of thousands of licenses and the cost per license has gone down to something like less than $30. 

I wouldn't call it an agent cost because the way they price it is based on the number of things you have connected. You can connect hundreds of things to a single agent but you're paying by the number of things. That's how you use the licenses. So it's really priced by endpoint, not by agent.
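
As a rough worked example of that per-endpoint model - the fleet size and time horizon below are assumptions chosen purely for illustration, not figures from this review:

```python
# Hypothetical illustration of the per-endpoint perpetual-license model
# described above: a one-time fee per monitored endpoint, plus 20 percent
# of that fee per year for maintenance and support.
license_price = 82.50        # midpoint of the quoted $80-$85 per endpoint
maintenance_rate = 0.20      # 20% of the license price, per year
endpoints = 1_000            # assumed fleet size, for illustration only
years = 5                    # assumed horizon

one_time = endpoints * license_price
annual_maintenance = one_time * maintenance_rate
total = one_time + annual_maintenance * years
print(f"Up-front: ${one_time:,.0f}; per year: ${annual_maintenance:,.0f}; "
      f"{years}-year total: ${total:,.0f}")
# -> Up-front: $82,500; per year: $16,500; 5-year total: $165,000
```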

Which other solutions did I evaluate?

When we were just starting the journey, we looked at ScienceLogic, Centerity Monitor, and we looked at CA. We also looked at the Microsoft product. Those represent a handful of the products we evaluated.

What other advice do I have?

If we had to do it all over again, we would have spent a lot more time in the early going on planning the architecture, on how we were going to build this out. That could have saved us some pain, once we got it up and running and started adding customers and expanding it. If we had spent a little more time with BMC, planning architecturally how we were going to design this to support the scale we needed, it would have helped. That was a lesson learned. And that would be some advice I would give. Depending on how you're planning to use the tool, make sure you spend some time looking at the architecture in the systems and the architectural design of how you're going to implement it to make sure it's going to meet your needs. Make sure it's going to scale appropriately and do what you need it to do.

Our goal is to get this solution connected to every single customer that we're maintaining equipment for, because of the efficiencies and the improvement in the end-user experience. When I say we support over 350,000 assets in 36,000 data centers around the world, that is our maintenance business. We're working to connect TrueSight to all of that. We have sold - not quite yet deployed, but we have sold - about 33,000 licenses, which means assets. We've deployed just under 10,000 of those so far. So we're making good headway and we're very pleased with how it's performing so far.

One lesson that we've learned is that we're now in a great position to expand our portfolio of services which we offer to our customers, well beyond hardware. Without this technology, we could never get there. Prior to us putting this in, it was all done manually. Phone calls, emails, people driving to the site to try and diagnose problems. It was very manual and inefficient and not scalable the way we were doing business. And we were growing so fast. There's no way we could have scaled to where we're at today or scale to where we want to go, even in our core business.

The other lesson we're learning now is that our customers are asking us to do more, and this technology is going to help us do more for them and expand our business. It will enable us to expand our portfolio of services. That's our biggest lesson. When we started out it was really all about driving operational efficiency in our hardware maintenance business. And now we've learned we're in a very good position to move into other services, based on what the capabilities of this platform bring to us, beyond hardware - into application monitoring and operating system and network and all the other pieces of the infrastructure. We can start to support them going forward.

It has completely changed our way of thinking about our strategy going forward. It's amazing.

At this point in time, I'd rate it a ten out of ten. We've got something really unique here. We built some integrations, some things of our own around it. And we're feeling really good about it.

Disclosure: IT Central Station contacted the reviewer to collect the review and to validate authenticity. The reviewer was referred by the vendor, but the review is not subject to editing or approval by the vendor.
John_Rooney
Vice President of Managed Services at Park Place Technologies
Real User
Enables us to proactively service our customers and even warn them about problems before they occur

Pros and Cons

  • "The fact that they have a very integrated relationship with Sentry Software, the Knowledge Module, is valuable... The richest feature for us is the number of Knowledge Modules that we can load into the product to add breadth of service to the customer. It enables us to move up the operational stack from hardware, to operating system, to application, and to cloud... That enables us to provide one pane of glass over all those layers - hardware, OS, app, and cloud."
  • "Reporting would be an area for improvement in TrueSight... We have almost 800 customers today on TrueSight and just under 10,000 assets. We need to be able to give a customer some information. If the customer's product fails, they'll ask us, "Did it have a problem beforehand?" We have all those events and we know all the problems it had beforehand. We have to be able to give them access to that kind of reporting. That's an enhancement that we need."

What is our primary use case?

Park Place Technologies brought in TrueSight for three reasons. The first reason was the Presentation Server - the architecture. 

The second reason was the fact it has the AIOps piece. 

The third reason was their partner, called Sentry Software, out of France. We are a hardware maintenance company. We're probably one of the largest providers, worldwide, for replacing drives and storage equipment. We brought TrueSight in as a means of seeing if we could reduce the number of physical touches on a service ticket from eight to two. We've been accomplishing that with TrueSight and the Sentry software.

We provide post-warranty support for storage equipment and data center equipment. For example, if it's a VNX piece of storage gear that goes off warranty, we come in and we maintain it at a high level off of what the customer paid the OEM. We do the parts and the service in 35,000 data centers worldwide. TrueSight is enabling us to get that done in an automated fashion.

Sentry is the Knowledge Module we use in TrueSight. It has all the information about the storage equipment that we maintain. It tells us the part, the chassis, serial number, and all the detail that we spend a lot of time on phone calls with the customer trying to ascertain. We're doing that automatically now.

How has it helped my organization?

We brought the product in to handle the following: We're in 35,000 data centers today. We have 16,000 customers and we support about 400,000 assets. Those are big numbers. The pieces of storage equipment we provide have something native from the equipment manufacturers, the OEM, called "phone home." What happens is, when these devices start having a problem they send out an email that says, "I'm having this problem." To put that into perspective, we were trending towards 2,000,000 emails at the end of 2017, and growing. We would have to read 2,000,000 emails to find out what was going on. Something lower than seven percent actually had a problem we really had to read, and something well below one percent of those were actually a service event.

Before we brought in TrueSight, there were 8.2 touches via email or phone call after the ticket had come in, including exchanging log files with the customer through to our resolving it. And on the customer side, they had somebody having to look at the equipment to make sure it was actually working. From those 8.2 physical connections with them, we're down to two with TrueSight.

And here's the big difference. Instead of these things sending all of that information out in those emails, it's captured in the Knowledge Module, the policy and the agent, on the customer side of the firewall. When TrueSight is installed, it takes a week to come up with what's called a dynamic baseline. It says, "For this piece of equipment in your environment, these are the key performance indicators that we're going to watch for." We can see events live when they happen. There are predictive and proactive warnings of failures or potential problems. But all that we ever get, the only thing that's communicated to us, is when there's a failure. So we can see all the chatter and we can look at that by customer, but we don't really need to. And if it's a predictive event, it will send us a notice saying, "We think this part's going to fail in two weeks," and we can help that customer.
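
For readers who want a feel for what a dynamic baseline does, here is a toy sketch of the general idea. It is not TrueSight's actual algorithm; the one-week window and the three-sigma threshold are arbitrary assumptions.

```python
# Toy version of a dynamic baseline: learn a normal range for a KPI from
# recent history, then flag readings that fall outside it.
from statistics import mean, stdev

def outside_baseline(history, latest, sigmas=3.0):
    """True if `latest` deviates from the learned baseline by more than `sigmas`."""
    mu, sd = mean(history), stdev(history)
    return abs(latest - mu) > sigmas * sd

week_of_readings = [41.0, 42.5, 40.8, 43.1, 41.9, 42.2, 41.5]   # e.g. a drive temperature KPI
print(outside_baseline(week_of_readings, 55.0))   # True  -> worth a predictive event
print(outside_baseline(week_of_readings, 42.0))   # False -> normal behavior
```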

But ultimately, what we get is a service ticket: "Failed part at this location. Here's the part number, the serial number, and the recommended remediation." That comes into our support center.

Eventually, when we have it all set up the way we envision it, the info will come into the support center and a ticket will be created and it will automatically connect to the tech and the tech will reach out to the customer. We haven't turned that on yet.

Right now, it comes in and we read it. We call the customer and say, "You have a failure." In most cases, the customer didn't know they had it yet, because it's that fast. We call them up and say, "You have a problem. We have the part, and when would you like Larry to come on site?" Because it's storage, they have to schedule downtime. Then we go out on site, we fix it, and we're done. So it's two physical touches now: We call them and they say, "Yes, it's completed."

So 2,000,000 emails have gone away, pretty much, and it all gets done at the customer site. What we see now, instead, is a couple of hundred or 1,000 service events, versus millions of emails. And we have the right part, the right chassis, the right location.

In our industry, there is about a 75 to 78 percent first-time fix rate, meaning repair personnel do not have to go back to a given site within a week. As a company, we were at about an 86 percent first-time fix rate. With TrueSight, we've never gone below 98 percent.

It's all done with software. I read all of the service emails from our customers. Customers are used to finding a log file and talking to our expert - and if a customer has five different pieces of equipment, there are five different experts involved. Now, they send a note in and they'll say, "This is resolved. I just want to make sure this process is working the way it's supposed to. I didn't call anybody. You called me to tell me I had a problem that I wasn't quite aware of. Now, I have a part, it's fixed, and we're good. Is that how it's supposed to work?" It's funny, because they were used to eight different interactions with us, as opposed to two. It's really cool.

It's taking an extremely manual process and, with the AI piece, literally helping us make better decisions. It's what AI is all about. It's really amazing. I'm excited about it because now, instead of our support center people trying to find the right part, they're calling the customer and saying, "By the way, you have a problem. We have a solution for you, and we notice in the same cluster you may have a failure in a week. Would you like us to look at that while we're there?" It's predictive, proactive maintenance. That is what it enables us to do, versus reactive.

Today, when we are proactive, it's for a fan or it's heat or it's a battery. We get notice they are about to fail and they fail pretty quickly thereafter. But when we start getting to operating systems, there are days, as you know, when you have gone on to your computer and it's been slow. On those days of the month, you can probably look in your network and find that there was a big push to get something done. With TrueSight, we'll be able to start proactively predicting these events before they happen, and rerouting the customer so they don't notice a slowdown. Our tagline is all about uptime. TrueSight helps us deliver that. It helps us deliver upfront.

What is most valuable?

The fact that they have a very integrated relationship with Sentry Software, the Knowledge Module, is valuable. We have one Knowledge Module that we're using today, which is the Sentry KM. We're bringing on the operating system Knowledge Module. The richest feature for us is the number of Knowledge Modules that we can load into the product to add breadth of service to the customer. It enables us to move up the operational stack from hardware, to operating system, to application, and to cloud. It's one presentation layer, one path with these Knowledge Modules, which we can add to it to get greater breadth.

That enables Park Place to provide one pane of glass over all those layers - hardware, OS, app, and cloud - which gives us a really good opportunity with the AIOps piece to get root cause analysis. And that's what our customers want: one pane of glass and a detailed root cause. If you've ever been in a data center when something goes wrong, the first thing they ask is, "What happened? What went wrong? Why did it break?"

It's the Knowledge Module which is the biggest feature that benefits us.

What needs improvement?

Reporting would be an area for improvement in TrueSight. In its purest form, TrueSight is an enterprise product, meaning one company would run it in its internal data centers and internal IT organization. But our company is more of a managed-service provider. We have almost 800 customers today on TrueSight and just under 10,000 assets. We need to be able to give a customer some information. If the customer's product fails, they'll ask us, "Did it have a problem beforehand?" We have all those events and we know all the problems it had beforehand. We have to be able to give them access to that kind of reporting. That's an enhancement that we need.

For how long have I used the solution?

We white label TrueSight, but it's TrueSight at its core and we've had it installed here for just under three years. Version 10.7 is our production instance and we're using version 11.3 in Azure. We're moving to a cloud platform and we're doing that with 11.3. I was hired about 15 months ago.

What do I think about the stability of the solution?

The stability of TrueSight, in its natural form, is very good. We had stability issues with it because we were doing things that were outside of that normal boundary. We were bringing in way too much information. We didn't know how to filter it. Once we got the filtering in place, it became very stable.

In our six- to nine-month process of doing the proofs of concept, when we got to that ninth month we were bringing on as many customers as we could and we were getting everything we could possibly get from all of them. It took us about three months to tune that down, with BMC's help. The product was always stable before that. The product itself didn't fail. We just overwhelmed it. If you talk about data lakes, we flooded the lake every day. And it didn't stop. We just kept bringing more stuff in. Once we added the filters and tuned those valves, it stayed up and has been running really well.
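
As a purely illustrative sketch of the kind of coarse filtering described here - the classes, severities, and field names are assumptions, not TrueSight's policy syntax:

```python
# Hypothetical coarse event filter: drop chatter classes and anything below a
# minimum severity so only actionable events reach the support center.
DROP_CLASSES = {"informational", "heartbeat"}
MIN_SEVERITY = 3   # assume a 1 (info) .. 5 (critical) scale

def should_forward(event: dict) -> bool:
    """Keep only events worth a human's attention."""
    if event.get("class") in DROP_CLASSES:
        return False
    return event.get("severity", 0) >= MIN_SEVERITY

incoming = [
    {"class": "hardware",  "severity": 5, "msg": "drive failed"},
    {"class": "heartbeat", "severity": 1, "msg": "agent alive"},
    {"class": "hardware",  "severity": 2, "msg": "fan speed fluctuation"},
]
print([e["msg"] for e in incoming if should_forward(e)])   # ['drive failed']
```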

What do I think about the scalability of the solution?

Our focus is to get 16,000 customers in TrueSight. We're walking up that scale every day. Once we figured the filtering out, we started getting the scalability. Prior to that, we were going the wrong way on scalability. But the elasticity seems to be there, the ability to scale.

How are customer service and technical support?

On a scale from one to five, with five being the best, I would give BMC technical support four-and-a-half. It's not a five because the reporting piece is still missing. We need the reporting piece. They can't give us all the help, because that help is just not fully there.

Which solution did I use previously and why did I switch?

We had our own homegrown system - an email box - that the stuff came into. That's all we had. We did not use a competing product.

We went down an RFP path over three years ago. Our company has grown pretty dramatically. Between 2015 and two weeks ago, we made 14 acquisitions. There was no way we could grow the business mechanically with an 8.2-touch model in place. The support centers would be the biggest expense in the company.

It was a two-pronged approach to looking at resolving the opportunity that our growth created. The first approach was the customer, to give them quality of service: not having to get log files, not having to figure out what's going on at their end, and not having to call us. 

The second approach, for the Park Place support center, was to give them better tools to provide better service to the customer. We wanted our support center to go from trying to figure out what the right part was, to letting the customer know they were about to have a problem. That's a big difference. Both of these approaches have happened.

If you put that around the world with our growth, we now have a global approach with regional focus and local delivery. Because the systems are reporting the information, we don't worry about time zones or language. All that stuff goes away. The machines speak MIB, and the MIB communicates through TrueSight, and we get the information. We don't have to speak the local language until we go out and fix the problem, because the customer is not calling the support center anymore. We have a global footprint with a regional focus. In APAC, they're looking at problems that could be happening overnight in the US and vice-versa, or in EMEA. The problem is resolved, the customer is communicated with, and the person providing that communication speaks the local language.

The machines are literally running this thing, and all we are is the delivery model. TrueSight crosses all those barriers. It crosses time zones, it crosses language; it has all the pieces we need to know about repair, including the part and the location. It knows everything we need to know about the equipment, all the software, the LPARs, etc. It gives that to us in the support center, we contact the customer, and then we speak the local language and we bring the part locally.

How was the initial setup?

It's very straightforward from a setup perspective. We were able to install it and get it running relatively quickly. That's not the hard part. 

The complexity comes in because, instead of it being what I would call an off-the-shelf product, TrueSight is a series of products with an encyclopedia of tools and they all add benefit. But getting those tools to work, that's where the complexity is; knowing exactly which piece to pull and to connect. An example would be putting filters in place. That took us a while. 

If you look at an average installation, it takes three to six months to get up and running. We got up way faster than that, but it has taken us about a year to get the engine to run at the capacity it's capable of. It's like gas mileage where you have to drive it properly to get the right gas mileage. That has taken us some time to do. But once we got there, we have certainly been getting everything that's promised.

Park Place was up and running within a month to two months. Our production product was probably nine months out. That's when we started figuring out the filtering. We brought everything in and opened all of the spigots up, and we had all this volume coming in. With BMC's help - they were very helpful in this capacity - we were able to turn the valves to the proper flow, so we weren't flooding the thing every day.

Our implementation strategy was to put it up in a proof of concept first in a DevOps environment because our goal was to bring it out to customers. Once we got it into production, we started bringing customers on as PoCs. We did about six months' worth of bringing on the customers, making sure we could bring it out and get its sea legs. Then we started deploying customers as fast as we could. And that's when we went from 10.5 to 10.7, and now we're moving to the new platform with 11.3.

What about the implementation team?

We installed it ourselves, but with BMC's help. We did it ourselves with them looking over our shoulder. 

To get to the 10.7 and 11.3, their services, the Premier Support, created a "cookbook" for us to do that migration. That was extremely helpful. 

And from a consulting perspective, as part of the Premier Support, we were able to get the right consultants in to help us fine-tune that motor. They would come in and look at it and say, "By the way, you can filter this stuff out because you're not actually using it." I liken it to our cell phones when our data plans are out of whack with our use. We pay way more than we need for our minutes and a consultant comes in and says, "You can do this, this, and this, and be more efficient." They've been very helpful there.

What was our ROI?

As an example I looked at recently, we had a customer that was doing 27,000 emails a month. That would mean that if we spent 30 seconds reading each email, it would total 2,700 hours per year just reading emails. And that's not solving the issues. That works out to 1.3 FTEs just reading the emails for that customer. Suppose all-in, in the US, we're paying FTEs $72,000 to $75,000 just to read emails. Our license fees are certainly less than that for that one customer.
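
The arithmetic behind those figures checks out; the only added assumption below is a 2,080-hour work year used to back into the FTE count.

```python
# Sanity-check of the figures above; the 2,080-hour work year is an assumption.
emails_per_month = 27_000
seconds_per_email = 30
hours_per_year = emails_per_month * 12 * seconds_per_email / 3600
ftes = hours_per_year / 2_080
print(f"{hours_per_year:,.0f} hours/year ~= {ftes:.1f} FTEs just reading email")
# -> 2,700 hours/year ~= 1.3 FTEs, matching the numbers above
```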

In terms of ROI, we haven't fully gotten there yet. We've reduced those 2,000,000 emails by 42 percent, so far. We haven't gotten them all done yet. But who do you think is on our list to get moved over to this solution? That customer with 27,000 emails - we're going to move them over as fast as we can.

Our ROI is to get people off the old, manual system with 8.2 touches and down to two touches. Once we start hitting critical mass, the product will certainly pay for itself in a very reasonable period of time.

What's my experience with pricing, setup cost, and licensing?

We pay license fees of between $150 and $200 per asset.

In terms of the product's pricing, we don't pay per item and it's not crazy. It's cost-effective enough for us to offer it for free on storage, and we've got some 4,000 storage assets using the product every day.

We bought a large block of licenses. Interestingly enough, we provide TrueSight for free for our storage customers. We thought it was that important, to give them the licenses for the Knowledge Module and the policy. We do charge for network and we do charge for servers.

There is an enterprise software license fee, and then you pay a percentage for your maintenance, and then Premier Support. For example, if you buy a two-year license for the product, then the maintenance fee is added to that for two years at X percent a year. Then there's a small fee on top of that for Premier Support, which I would highly recommend to a company. Standard support gives you normal support processes, while Premier Support is 24/7. It's at a much higher level of support. For a production environment, I would strongly recommend it. In comparison to the extra cost, the value of Premier Support is very worthwhile.
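
As a hedged illustration of how that structure adds up - the base fee and percentages below are invented for the example, since the review describes only the shape of the model, not the rates:

```python
# Hypothetical numbers only -- the review gives the structure (license fee,
# annual maintenance percentage, small Premier Support uplift) but not the rates.
base_license = 100_000       # assumed two-year enterprise license fee
maintenance_rate = 0.20      # assumed annual maintenance percentage ("X percent a year")
premier_uplift = 0.05        # assumed small additional annual fee for Premier Support
years = 2

maintenance = base_license * maintenance_rate * years
premier = base_license * premier_uplift * years
total = base_license + maintenance + premier
print(f"License ${base_license:,}, maintenance ${maintenance:,.0f}, "
      f"Premier ${premier:,.0f} -> {years}-year total ${total:,.0f}")
# -> License $100,000, maintenance $40,000, Premier $10,000 -> 2-year total $150,000
```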

Which other solutions did I evaluate?

We did an RFP and looked at seven products. Although I wasn't here when the company did that, I know they looked at Nagios.

One of the two key reasons for choosing TrueSight was the AI piece, the artificial intelligence; that's the promise of the future for us. We get some of it today. We have the predictive and proactive parts today, but we're going to grow that as we grow up that stack to go to OS, and application, and cloud, to get more AI value.

And the other one was the knowledge module relationship they have with Sentry Software. We're in storage hardware. That's the number one product out in the market. Sentry is a partner with BMC and that has been the lifeblood of our whole "global, regional, local" approach.

What other advice do I have?

The advice I would give is not to make a mistake and think it's an off-the-shelf product like Office 365. Understand that it's a very robust set of tools and procedures. You really should define what you want to do with it before you bring it in the door. If you had asked us before we brought it in, we had an idea, but we didn't know exactly how we wanted to utilize it and that was because we didn't know the capabilities of it. We thought we could do X, and we found out what we really needed to do was Y. It was that gap that we had to fill, and that took us time. So the better you can define your requirements, the quicker you'll gain the true value with your outcome.

Believe me, we're seeing true value. But if we had had a better definition of what we needed up front... We thought we had all the information in the RFP but we probably didn't. I'm not sure you ever can do that, but do a good job of architecting the scope or the spec of what you're trying to do and then get their input. They can give you that information and that's when you get your true value. When those two things meet, you get the value prop.

Working with BMC has been interesting. It's been very helpful. They're part of our team, which is great. They bring their partners to the table. Their partners don't have an agenda. Everything that we get done is literally for us as the end-user and for our customers. I've not had that often before with software companies.

They invest in customer satisfaction to the point that we've asked them to implement some things that are a little bit beyond the normal scope of TrueSight. We're using it for 800 customers in an instance of TrueSight, where it really should be one TrueSight for one customer. And they've helped us make all that work, arm-in-arm.

With Sentry it has been a team effort. Sometimes we don't know who on the call is not on our team. We're all having the same conversation, and it's not a situation where "BMC said," or "Sentry said," or "we said." It's one common unit. We had a call yesterday about architecture and making that whole piece work. I said to their architect, "Gee, you know I really like that document you put together." He said, "John, you can use any piece of that you possibly want. Go ahead and take it and do anything you need to do with it, make it work your way." That doesn't happen very often, where someone is building their own thing and they come back to you and say, "Yeah, you can use it any way you want. Just make sure it works for you."

We have 11 people who are installing agents and policies at our customers' sites. Their job is the implementation with our customers.

In terms of people actually running TrueSight within the company and our IT infrastructure, we have parts of a couple of people. It's a part of their job. It's almost like shift work. We have a part of a full-time person on a daily basis engaged with TrueSight care and feeding. Running the product requires less than two people, all-in.

We will be hiring a new person to be a TrueSight architect, because we're bringing on more of those KMs and we need somebody who can help us do the rules management. They're not going to be here running the product, they're going to be adding new features.

Overall, I would give the product a very solid nine. If I had the reporting piece, I would give it a ten. It has provided more value than we expected and it does what it says it's going to do. You can't ask for more from a product than that.

Disclosure: IT Central Station contacted the reviewer to collect the review and to validate authenticity. The reviewer was referred by the vendor, but the review is not subject to editing or approval by the vendor.
it_user814440
Lead Engineer at a tech services company with 501-1,000 employees
Real User
Proactive monitoring helps minimize downtime, although requires lots of servers

How has it helped my organization?

It helps to minimize downtime of applications by enabling proactive monitoring.

What is most valuable?

  • Wide support for monitoring
  • Strong event management
  • Service management capability
  • Baselining (analytics)
  • Easy to integrate other tools with it

What needs improvement?

Deployment requires lots of resources (servers). It has too many consoles. Pricing is very high.

For how long have I used the solution?

More than five years.

What do I think about the stability of the solution?

No stability issues. We can make it stable by allocating enough resources.

What do I think about the scalability of the solution?

No issues with scalability. We can increase resources vertically, according to growth in infrastructure.

How are customer service and technical support?

Support quality is good.

Which solution did I use previously and why did I switch?

Have not used any other product.

How was the initial setup?

It’s a bit complex.

Disclosure: My company has a business relationship with this vendor other than being a customer: Solution partner.
ITCS user
It Consultant at a tech services company with 11-50 employees
Consultant
Eases cross launch between multiple tools and improved monitoring views and dashboards

Pros and Cons

  • "It provides common administration, and a Single Sign-On Platform with RBAC, which eases the cross launch between multiple tools"

    What is our primary use case?

    Performance and Availability monitoring.

    Putting all the infrastructure and application and various monitoring into a Service Context for Service Monitoring. 

    How has it helped my organization?

    Faster, more efficient, and better views for the operators. A more centralized approach to managing the infrastructure. Improved app visibility features.

    What is most valuable?

    TrueSight Operations Manager is a combination of different components (applications), like the Presentation Server, Impact Manager, App Visibility Manager, and IT Data Analytics, but it provides seamless integration and a holistic view with Application and Infrastructure Health views.

    It provides common administration, and a Single Sign-On Platform with RBAC, which eases the cross launch between multiple tools, removes the need to configure users for all the different components, and improves the monitoring views.

    What needs improvement?

    There are no broader areas of improvement. It would vary, environment by environment. As such, there are no outstanding bugs or defects that are not documented.

    For how long have I used the solution?

    One to three years.

    What do I think about the stability of the solution?

    No. TSOM 10.7 is quite stable provided it is installed following the vendor's recommendations, which are built from experience drawn from customers and the complexity of the environment.

    What do I think about the scalability of the solution?

    No issues with scalability. The customers for whom I have implemented this have ranged from small to very large. I have never faced any deployment challenges in any of these cases.

    How are customer service and technical support?

    Excellent support from the vendor. Support technicians and developers are all available to help if there is an issue. Support cases are tracked and the resolution (of the same) is pushed to be done faster.

    Which solution did I use previously and why did I switch?

    No, I have always worked with BMC Solutions for infrastructure and application monitoring.

    How was the initial setup?

    Yes, the setup after the design of the solution was pretty straightforward. The vendor has a lot of free Webinars where they will explain the best practices to design a solution and the best ways to implement it. These guidelines can be used to build custom guidelines for the customer.

    What about the implementation team?

    Implemented with our in-house team; we have also been interacting with the vendor's team, which has excellent expertise in TrueSight Operations Manager.

    What's my experience with pricing, setup cost, and licensing?

    I have not dealt with the pricing or licensing, so I cannot comment.

    Which other solutions did I evaluate?

    Not applicable.

    What other advice do I have?

    It is quite an efficient tool. There are continuous improvements being performed to satisfy the customer needs, but like any other tool or automation, it has some issues.

    TrueSight offers a global solution with the possibility of end-to-end integration.

    Disclosure: My company has a business relationship with this vendor other than being a customer: Partners.
    ITCS user
    Senior Performance Analyst and BMC ProactiveNet administrator at a government with 10,001+ employees
    Real User
    The tailoring of the knowledge modules has been particularly useful

    What is our primary use case?

    Monitoring applications and servers. We also monitor individual pieces of management software, like WebLogic.

    How has it helped my organization?

    Proactively monitoring 24/7/365 on all of our servers. This allows technical staff to focus on other areas and our operators can monitor the systems.

    What is most valuable?

    The tailoring of the knowledge modules has been particularly useful as I can streamline the agents to only report on critical events.

    What needs improvement?

    The knowledge modules could be more lightweight in size. At present, the installation packages can be quite large.

    For how long have I used the solution?

    One to three years.
    Disclosure: I am a real user, and this review is based on my own experience and opinions.
    it_user802980
    Senior Software Engineer with 201-500 employees
    Vendor
    Online documentation is often incorrect/incomplete. It is helpful to be able to apply rule-based routing to alerts.

    Pros and Cons

    • "It is very helpful to be able to apply rule-based routing to alerts."
    • "TSOM's ability to consolidate alerts into a single location and provide filtering of alerts is great."
    • "It has provided us with a single location to host all events to be viewed/monitored by our NOC. This has greatly helped them to streamline their processes."
    • "BMC's solutions for cloud monitoring (monitoring of AWS and Azure resources) are very poor in stability and customization."
    • "BMC's online documentation is often incorrect or incomplete."

    What is our primary use case?

    We utilize BMC TSOM to monitor our entire infrastructure and all applications that lie therein. Our infrastructure is hosted both in our datacenters and in cloud hosted services (AWS and Azure).

    How has it helped my organization?

    It has provided us with a single location to host all events to be viewed/monitored by our NOC. This has greatly helped them to streamline their processes.

    What is most valuable?

    TSOM's ability to consolidate alerts into a single location and provide filtering of alerts is great. It is very helpful to be able to apply rule-based routing to alerts as well.

    What needs improvement?

    • BMC's solutions for cloud monitoring (monitoring of AWS and Azure resources) are very poor in stability and customization. 
    • BMC's online documentation is often incorrect or incomplete.

    For how long have I used the solution?

    One to three years.
    Disclosure: I am a real user, and this review is based on my own experience and opinions.
    Benjamin McKrill
    Enterprise Monitoring Automation Engineer at a healthcare company with 10,001+ employees
    Real User
    Top 20
    Allows our operations team to have one single application to reference when investigating issues in our environment

    What is our primary use case?

    We utilize BMC TrueSight Operations Management to proactively monitor all of our physical and virtual server environments. Coupled with Entuity for TrueSight Operations Management, we can have a holistic view of our Network and Server environments health in a single pane of glass.

    How has it helped my organization?

    It allows our operations team to have one single application to reference when investigating issues in our environment.

    What is most valuable?

    Signature baselines, which have allowed us to fine tune many of our events and significantly reduce the number of events generated.

    What needs improvement?

    I would really like to see out-of-the-box support for monitoring uninterruptible power supplies.

    Disclosure: I am a real user, and this review is based on my own experience and opinions.
    ITCS user
    Enterprise Monitoring Automation Administrator at a healthcare company with 10,001+ employees
    Real User
    We can verify uptimes as another source of keeping devices in compliance

    What is our primary use case?

    We use it to scan and monitor our server environment. This allows us to monitor devices which are introduced as they are spun up, to see that there are no unknown devices, then we can verify uptimes as well as patching as another source of keeping devices in compliance.

    How has it helped my organization?

    Allows reliable access to server hardware info, uptime statuses, current patching, and much more. This allows us to make sure we have an updated inventory, as we feed this into our inventory system along with info from Atrium CMDB.

    What is most valuable?

    The ability to pull hosts together to show what processes are running, so it can be used for change management.

    What needs improvement?

    More modules for less popular applications and better documentation. Documentation can be great at times, but lacking in other areas.

    For how long have I used the solution?

    One to three years.
    Disclosure: I am a real user, and this review is based on my own experience and opinions.
    ITCS user
    Project Manager with 51-200 employees
    Vendor
    The ability to fulfill the role as a manager of managers is fantastic. We integrated a number of other monitoring tools into BMC.​

    Pros and Cons

    • "I believe that the ease of use and UI is great"
    • "I think the ease of deployment needs to be looked at. It would be great if the deployment was faster and easier."

    How has it helped my organization?

    I have used the BMC product in two separate instances. The first was as a monitor of monitors for an ops bridge, providing a single view of all monitoring tools reporting into one source; this worked extremely well.

    The other instance was as a managed service looking at multiple different customers across South Africa.

    What is most valuable?

    I believe that the ease of use and UI is great. The ability to fulfill the role as a manager of managers is fantastic. We integrated a number of other monitoring tools into BMC.

    What needs improvement?

    I think the ease of deployment needs to be looked at. It would be great if the deployment was faster and easier.

    What do I think about the stability of the solution?

    We experienced no issues with stability on either BMC or HP.

    What do I think about the scalability of the solution?

    The only issue we experienced with scalability was that the maximum growth needs to be catered for in the initial build. Planning needs to be done carefully.

    How are customer service and technical support?

    Technical support from BMC was good, although we sometimes had to wait a little longer for a response, which complicated things with the client.

    Which solution did I use previously and why did I switch?

    The companies I worked for were BMC shops from start to finish and made use of Remedy, BCO, Control-M, etc. The companies wanted best of breed.

    How was the initial setup?

    The setups were not complex but there was a large amount of pre-deployment and planning that went into the solutions.

    What's my experience with pricing, setup cost, and licensing?

    The solutions are not the cheapest but are robust and stable. The license model is rather complex, and BMC often changes it.

    Which other solutions did I evaluate?

    Other products were evaluated, such as HP and IBM as well as various opensource solutions.

    What other advice do I have?

    My advice would be not to cut back on planning time or testing time: UAT, SIT, and FIT.

    Also make sure that you have the correct infrastructure in place and cater for the intended growth.

    Disclosure: I am a real user, and this review is based on my own experience and opinions.
    ITCS user
    CEO at a tech services company
    Consultant
    Before choosing this product, we evaluated other options, and we still do. Mainly, it ends in a mixture of tools, and using open source-based tools reporting up into it.

    Pros and Cons

    • "The Event Management is outstanding; it is still the most interesting part of the product."
    • "The sizing (which is difficult), the maintenance of it, and the upgrade paths. This is a difficult area which is not easy to cover, as every client has a different approach to implementing the product."

    How has it helped my organization?

    We do work as independent consultants, but mainly the focus is on a crisp and reliable base layer for Service Level and Business Service Management with a working CMDB. In order to map the data and events correctly, you have to have a solid foundation.

    What is most valuable?

    The Event Management is outstanding; it is still the most interesting part of the product.

    What needs improvement?

    The sizing (which is difficult), the maintenance of it, and the upgrade paths. This is a difficult area which is not easy to cover, as every client has a different approach to implementing the product.

    What do I think about the stability of the solution?

    Stability is mainly a sizing issue. The product needs to be correctly sized and architected. For this, you need skill and experience. If you follow this advice, you will have no issues. If you implement without a plan or architecture, you will be lost.

    What do I think about the scalability of the solution?

    This is related to stability. You need to know what you have, then all will go well.

    How are customer service and technical support?

    Customer Service:

People buy from people. If your account rep is a good one, all goes well. It is hard to answer this in general; I have seen light and shadow, as one could say.

    Technical Support:

    Support has room for improvement. Very often, you find yourself answering the very same questions over and over again. I would give it a 6-7.

    Which solution did I use previously and why did I switch?

Some of my clients came from other solutions; they switched mainly because those solutions were outdated or had been discontinued. The same applies in the other direction, especially if the clients had the wrong account rep.

    How was the initial setup?

Initial setup seems to be easy. The deeper you go, the more you need to know about the product, especially about its agents. Some functions are under-represented, especially the Agent Consoles, which are a little too basic compared to the old versions, so you still use a mix of versions, which leads to no savings in hardware at all. HA setups are complex (best to use vMotion). Ports are not that well documented. Again, experience is the point. If you have known the products under the hood for a long time, you will do well; otherwise, you might run into problems. This is the same for lots of products in the area. If you know what you are doing, all goes well.

    What about the implementation team?

    We normally do these kinds of implementations; I am a consultant, not a real end-user, as the clients no longer have the expertise on board (no matter which product they use).

    What was our ROI?

Monitoring is like insurance. If you have it, you feel safe. If you do not have it and run into an accident, you will wish you had it.

    What's my experience with pricing, setup cost, and licensing?

Use conservative figures for hardware, monitored servers, and effort. The product is not cheap. But as with other products, you get what you pay for.

    Which other solutions did I evaluate?

Before choosing this product, we evaluated other options, and we still do. Mainly, it ends up as a mixture of tools, with open-source-based tools such as Zabbix, OP5, or Nagios XI reporting up into it.

    What other advice do I have?

    Estimate enough time for the implementation. Never trust anyone who tells you that you will be finished in three months. Calculate at least one year for all tasks.

    Disclosure: My company has a business relationship with this vendor other than being a customer: We are a consulting partner of BMC, as we are for other vendors. But we do not sell any licenses at all, for any vendor. We do pure consulting, also for other products. We simply report and present different options, and the client decides what to use.
    ITCS user
    Performance Management Consultant with 51-200 employees
    Vendor
    BMC BPPM Architecture Size Scale and Capacity Introduction

    BMC BPPM Architecture v9.5 – Lean, Mean, Analytics-Crunching Machine

    BMC released the latest update to its ProactiveNet Performance Management (BPPM) suite in January of this year. The BPPM 9.5 Sizing and Scalability upgrade represents a tremendous increase in capacity without associated new hardware cost.

    If you’re introducing BPPM for the first time, you will, of course, have to buy hardware, but if you’re upgrading from a prior version to 9.5, you can receive 9.5’s many benefits and enhancements without paying for any new hardware. In fact, you may actually be able to reduce your hardware footprint. You’ll be able to gain the new abilities and new capacity now by deploying 9.5.

Check out our “Size, Scale and Hardware” presentation, where we will show you some enterprise examples of exactly how this release can dramatically reduce your hardware footprint, saving you thousands of dollars in system costs and hundreds of man hours in administrative costs.

    See how 9.5 compares to versions 8.6 and 9.0 with regards to sizing and capacity.

    http://advantisms.wistia.com/medias/ua5li1146g?emb...

    This new release makes it a great time to upgrade or add BPPM to your enterprise monitoring software options. The new features in 9.5 make it more useful than ever, and the capacity increases are incredible.

    To demonstrate the vast improvements in size and scale in BPPM 9.5, here’s an apples-to-apples comparison of the last three versions of BPPM. Specifically, we’re looking at the benchmarks associated with a Large Hybrid BPPM infrastructure: Data, Event, and Service Impacts. These are the maximum benchmark counts, based on the current best practices deployment approach. As you can see, these numbers are huge.

• 1,700,000 Total Attributes/Parameters. Attributes/parameters are monitored items, such as the CPU % Utilization rate. This is more than triple the 8.6 maximum of 500,000, and demonstrates a 1:1 capability with the BPPM Integration Service Server in 9.5. BPPM 9.0 had a maximum of 1,200,000 attributes, so 9.5 allows 500,000 more attributes than 9.0 did.
• 250,000 instances per server, which includes your database instances, log files, processes, and services. This is an increase from the 65,000 on 8.6, almost four times as many instances, and roughly double the 120,000 to 150,000 instances allowed on 9.0.
    • 20,000 enterprise devices, which are your systems and network components across your enterprise. This is double the 10,000 capability of 8.6, and equal to the 20K allowed on 9.0. This maximum supports the demands of most large enterprises.
    • Up to 100 simultaneous end users, increased from 30 on 8.6 and 50 on 9.0. The number of supported users has doubled between 9.0 and 9.5.
    • 40,000 intelligent events per day, up from the 2,000 per day on 8.6. This increase is off the charts.
    • 350,000 external events, compared to 200,000 on both 8.6 and 9.0.

The most impressive part of the capacity and capability increases from 8.6 to 9.5 is that they come with no increase in hardware requirements, as you might otherwise expect. This is virtually unheard of in the tech industry, in which new capabilities and capacities almost always require increased hardware capacity to go with them.

    Think about one of the old household devices you have sitting around – perhaps an old iPhone or a computer that’s a few years old. Chances are, you’ve run across a piece of software or an app you’ve tried to install, only to find that your old hardware isn’t capable of running the new enhanced software. If you want to run the app, you’ll have to get a new iPhone. BMC, on the other hand, has managed to create a new version that works with your old hardware, so your enterprise won’t have to foot the bill for hardware upgrades just to run this software.

    Let’s take a more specific look at the hardware needs for the BPPM versions. All require 64-bit architecture. Additionally, the requirements across all three versions are pretty similar, hence not needing to upgrade hardware:

    • Windows 2008 R2
    • Intel Core i7
    • 2×4 Core, or 8 core total
• 3.067 GHz on 8.6 and 9.0; 2.2 GHz on 9.5. That’s right – the required clock speed actually went down on 9.5 despite the capacity increases.
    • A recommended 32 GB of memory for Data, Event, and Service Impacts.

    If you have a deployment of 8.6 or 9.0 and are running close to the maximum number of monitored instances, now would be a good time to start designing your migration path to a 9.5 architecture. In summary, this upgrade can gain you tremendous technical capacity and capability, without incurring the cost of new hardware.

If you would like to see more BPPM 9.5 content covering other new BPPM 9.5 features, hands-on presentations, and a series on "Understanding BPPM Analytics", be sure to check out the blog I write for here.

    http://blog.advantisms.com

I hope you find this information useful! If it is well received, I'll be sure to write follow-up posts.

    Have a  GREAT day!

    Disclosure: I am a real user, and this review is based on my own experience and opinions.
    ITCS user
    Performance Management Consultant with 51-200 employees
    Vendor
    Introducing the BMC BPPM 9.5 Central Monitoring Admin Policy Console

    BMC Patrol Agent Configuration Automation using the (TrueSight) BPPM Central Monitoring Administration Console (CMA)

Have you ever been frustrated to discover that your monitoring failed because one of your Patrol agents isn’t configured correctly? After you investigated, you were told that someone sent you an email or called and left a voice mail telling you that some set of systems was ready for monitoring, and you didn’t get them. Everyone knows how adequate email and phone messages are, right?

Communication breakdowns involving your Patrol Agent infrastructure are nothing new. They’ve been around for many, many years. I know them very well. Everyone is very busy, and that only compounds the problem. There are so many things that can go wrong with keeping all your agent configurations in sync and up to date. Wouldn’t it be nice if this could all be automated somehow?

There is a new capability you need to be aware of: the BPPM 9.5 Central Monitoring Administration (CMA) Console. The CMA was introduced with BPPM 9.0, but it wasn’t flexible enough to be useful in very many situations. One of the key features in that release was the Policy Management interface. Although useful, its ability to truly manage your Patrol Agent infrastructure outside of Patrol Configuration Manager (PCM) was very limited. Well, that all changes with CMA 9.5.

    With the release of the 9.5 BPPM CMA Console, and the greatly expanded Policy capabilities, you’ve never been so close to real-time Patrol Agent configuration automation. Say hello to your new little friend, the BPPM CMA Configuration Policy.

    http://advantisms.wistia.com/medias/nvn9c6862k?emb...

    BPPM Agent Configuration Policies – A Brief History of the BPPM 9.0 CMA Introduction

BPPM 9.0 introduced configuration policies for the first time with the CMA. A CMA Policy is supposed to replace the need for manually deploying configuration settings using Patrol Configuration Manager (PCM). Unfortunately, with the 9.0 policies you had little choice with respect to the policy “selector criteria”. The selector criterion is the mechanism that engages the CMA Policy.

    You were able to specify the use of one item, the BPPM Tag, as the policy selector, which meant that you had to create a separate Policy and BPPM Tag for every possible scenario.

    If you worked with the CMA in version 9.0, you know first hand how limited that was. Chances are you looked at it, scratching your head, and moved on.

    The 9.0 CMA release allowed you to deploy a simple Policy with three configuration options: Monitor, Threshold and Server Policy Configurations. CMA 9.0 made these three administrative options available for the first time but the overall policy capabilities were limited and ultimately became more work to manage than continuing to use PCM. They’ve been greatly expanded with version 9.5.

    The BPPM CMA 9.5 Brings Patrol Agent Configuration Automation 

    With the release of the 9.5 BPPM CMA Console, the Policy capability features available grew from three in version 9.0, to a total of nine.

The additional features include seven total monitoring Configuration Policy options, one blackout option, and one staging Policy option. Nine in all, compared to only three before. And the Policy “Selector Criteria” specifications, the item(s) which engage the Policy, have gone from one, the BPPM Tag, to eight. The newly added, diverse selector abilities now allow for creating simple or very complex activation conditions. With all of those new features, CMA 9.5 allows for dynamic automation of your Patrol Agent configurations like never before.

Here are the 7 new BPPM 9.5 CMA Policies and a description of how they can be used.

    Monitoring Configuration – You can use this feature for filtering or turning the monitoring configurations off or on, based on your selectors. In the associated webinar, I construct one of these policies as an example, showing how they can be used to disable a specific monitor, for a specific OS, running in a specific environment.

Filter Configuration – This is a helpful addition to CMA 9.5. Filter Configuration allows you to specify what monitoring data is not meant to go into the BPPM database. With this new feature, you can specify the attributes and parameters that you want to stream into the BPPM console and see, without storing them in the database.

Agent Threshold – This policy allows for setting traditional monitoring thresholds at the Patrol Agent level. It lets you specify the alert threshold settings you would otherwise set and deploy within PCM or from the Patrol Console, down to the agents. These can now be set, and take effect, as soon as the agent checks into the BPPM infrastructure.

    Server Thresholds – These thresholds are set at the BPPM server level. You can set Absolute, Signature and Intelligent thresholds within a policy based on the same selectors as the lower agent level.

Agent Configuration – This new policy has several capabilities. It allows for setting up agent-specific settings like the default monitoring account. You can also use this feature to specify polling intervals for the Patrol Knowledge Module (KM) collectors. The KM collector gathers the information at polling intervals, and depending on how you construct the selectors, you can now change these intervals within the CMA console, outside of PCM.

Server Configuration – This feature is ideal for managing Groups within the BPPM Operations Console. For example, if you have servers associated with an application named “NewApp,” you can use this policy to group all the servers in one location within the Operations Console. By deploying a “NewApp” tag to all the involved systems, the Patrol Agents check into BPPM, see the policy, and automatically add the servers to the group you specify. If the group doesn’t exist, it will be created, and all the NewApp systems will be placed within that group for viewing, automatically.

Configuration Variables – This last option allows for the manual creation of any agent configuration variable the agent may need. The key feature here, though, is the ability to import your existing PCM configurations.

    This new CMA brings real automation into the daily maintenance associated with your Patrol Agent infrastructure. Quit playing phone and email tag with your system and application administrators and see how to put this to work right now.

    To see this new CMA Policy in action, be sure to check out this hands-on video introduction.

    http://advantisms.wistia.com/medias/nvn9c6862k?emb...

    To read about and see the CMA put a Patrol Agent Blackout into action, check this out.

    Putting the BMC Blackout Policy to Work


    To read about and see the CMA handle the Patrol Agent event streams and give you a brand new, centrally focused Event Management mechanism, check this out.

    Simplified Patrol Agent Event Management


    New Update!!

    How to automate New Patrol Agent Package Deployments with CMA Policies.  I'll show you step by step how to use a CMA Policy to automatically baseline your new Patrol Agents the moment they come up on the network, using your existing PCM configurations.

    Automating The Configuration Deployment of Your New Patrol Agent Builds


    To read more about (TrueSight) BPPM 9.5, be sure to check out the blog on the topic located here.

    http://blog.advantisms.com

    Disclosure: I am a real user, and this review is based on my own experience and opinions.
    ITCS user
    Performance Management Consultant with 51-200 employees
    Vendor
    Start Using BPPM Analytics, Signature and Intelligent Thresholds and get rid of false alerts

    Start using BPPM Analytics and become a monitoring genius!

Performance Management of your business services requires an ability to understand the past behavior of all your key monitoring components. Do you know if your current alert thresholds are the result of a person’s quick guess or assumption? Does your monitoring repeatedly generate large amounts of false alerts, leaving you struggling to find a solution?

Once you understand how BMC’s BPPM Analytics works, using Signature Thresholds and Intelligent Thresholds, you’ll have just what you need to look like a monitoring genius.

Doing what you’ve done before will not work for you going forward

It was not long ago that everyone had to rely on guesses or assumptions for specifying alert thresholds. When enterprises consisted of very few devices, you could rely on an individual’s expert knowledge to guide you, and in most cases you might actually get most of the alert settings correct. Some settings were still likely to be incorrect, but with fewer devices to alert on it wasn’t a chronic problem. That simply is not the case any longer. Using the same approach today or tomorrow will quickly put you in the hot seat, and your monitoring reputation in jeopardy.

If your check engine light comes on over and over while you’re driving, without any issue ever being found with your car, will you continue to trust it? Of course not – why would you? The same is true of your business’s performance management monitoring. If you continuously alert incorrectly, causing your support teams to be notified falsely over and over again, the impression will be the same as a bogus check engine light. In a very short time everyone will lose faith in your monitoring.

    Using BPPM Analytics to manage your Big Data

    With enterprises today consisting of many thousands of devices, we are truly in the age of overwhelming Big Data. Managing that Big Data takes intelligence in an automated manner, working at the machine level. This is why you hear “Analytics” mentioned just as often as “Big Data”.

Luckily, you don’t have to be an expert in the past behavior of the monitoring. Using BPPM, it is done for you automatically. BPPM’s Analytics capabilities, tied to Signature and Intelligent Thresholds, have an out-of-the-box (OOTB) capability to notify you about performance abnormalities that are associated with key monitoring components.

    Start Using Signature Thresholds and Intelligent Thresholds

BPPM Analytics takes the raw monitoring data and uses it to form historical averages that are then used to establish a normal “Baseline” of operations. These baselines are then used with two new types of thresholds: Signature Thresholds and Intelligent Thresholds. These terms are thrown around a lot, but if asked, could you explain what they are, or ask your team to implement them specifically?

    If you said no, you aren’t alone. Advantis is here to help. We’ve found this to be very common in fact, and it’s why we are taking these steps. The good news is, since you’re here reading this, you are only a few minutes away from gaining an informed understanding of these items. We help managers, directors and executives understand these principles to allow them to make informed decisions around their monitoring. Time is precious, and this knowledge is even more valuable.
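
To make the idea concrete, here is a minimal sketch in Python of how a baseline built from historical averages can drive a signature-style abnormality check. It is purely illustrative and is not BMC's actual algorithm; the hourly grouping and the two-standard-deviation band are assumptions made for the example.

from statistics import mean, stdev

def hourly_baseline(history):
    # history: list of (hour_of_day, value) samples collected over past weeks
    by_hour = {}
    for hour, value in history:
        by_hour.setdefault(hour, []).append(value)
    # Baseline per hour = historical average, with a band of +/- 2 standard deviations.
    return {h: (mean(v), 2 * stdev(v)) for h, v in by_hour.items() if len(v) > 1}

def is_abnormal(baseline, hour, value):
    avg, band = baseline[hour]
    return abs(value - avg) > band

# Example: CPU % utilization samples taken at 09:00 and 14:00 on previous days
history = [(9, 42.0), (9, 45.5), (9, 40.8), (9, 44.1), (14, 71.0), (14, 68.3), (14, 73.9)]
baseline = hourly_baseline(history)
print(is_abnormal(baseline, 9, 91.0))   # True: far outside the learned 09:00 band
print(is_abnormal(baseline, 9, 43.0))   # False: normal for this hour

The point of the sketch is simply that the "normal" range is learned from history per time window rather than guessed, which is the same principle a Signature Threshold applies.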

    We recently put together a video demonstration to help you take the first steps to understand these new abilities. No sales pitch or confusing jargon. It’s all spelled out plainly and simply. After watching this presentation, if you still have questions, you’re one click away from answers.

So take a moment to watch, and let us help you look like a monitoring genius!

    Our Video demonstration of BPPM Analytics and what you need to know in order to use it.

    Video Presentation of Understanding BMC BPPM Analytics

What are the five user-specific types of dynamic BMC BPPM Baselines available for you to use with Signature Thresholds? What makes them different, and how would you use them? We cover that here.

    http://www.advantisms.com/bmc-bppm-baselines-part-2/

    And what if you want to keep some of your absolute thresholds, but make them more intelligent and dynamic? We show you how to upgrade your static thresholds and make them BPPM Intelligent Thresholds with this post.

    http://www.advantisms.com/how-to-setup-a-bppm-intelligent-threshold/

    To find out more about the BMC BPPM product, be sure to check out our online blog located here.

    http://www.advantisms.com/advantis-blog/

    If you would like to get your BPPM design, implementation or upgrade started, simply click on the link below.

    Contact Advantis

    Disclosure: I am a real user, and this review is based on my own experience and opinions.
    it_user76896
    BMC TrueSight & PATROL Consultant at World Opus Technologies
    Vendor
    Before implementing consider: Scalability, High Availability, Implementation Repeatability and Standardization

    BPPM Implementation Considerations

    Part 1: Meet your business requirements

Three years after BMC ProactiveNet Performance Management (BPPM) was released, most BPPM customers have reached the conclusion that a BPPM implementation is more than just software installation. But what makes a BPPM implementation a successful one? What do you need to consider before diving into installation details?

    "BPPM Implementation Consideration" blog series will try to address several important considerations at requirement level and architecture level. Implementing BPPM is a lot like building a house. Many considerations at requirement level and architecture level are like the foundation of the house. They need to be determined at the very beginning.

    The most important consideration in BPPM implementation is your business requirements. The management of your organization, your entire implementation team, and other stakeholders should have a clear understanding on a list of business requirements that your BPPM implementation is expected to meet. Then you will need to translate this list of business requirements into a list of technical requirements with a category assignment such as mandatory, strategic, cost-saver, and nice-to-have.

Only then can you map each technical requirement into a list of detailed BPPM features and prioritize the implementation of each feature. This will become your project scope. Based on your project scope, you can plan your project timeline and budget. If you outsource your BPPM implementation to a consulting company, it is critical that you do your homework on your business requirements and technical requirements first. Then work closely with the architect (not just the project manager) of the consulting company to determine the project scope.

However, many new BPPM customers I have talked to seem to do it backwards. They came up with a budget first, without knowing exactly what BPPM features to implement and how long the implementation would take. Then they picked a list of BPPM features to implement from the product datasheet, without knowing how each feature relates to their business bottom line.

As an example, here is the process taken at one of my past clients. One of the top business requirements was to cut down the cost of Remedy Gateway licenses from multiple monitoring software vendors. This was translated into a technical requirement like this: alerts from multiple monitoring tools must be integrated into one alert management tool to communicate with Remedy for ticket creation. This requirement was categorized as a cost-saver. The technical requirement was then mapped to these BPPM features: event-to-BPPM-cell integration through API and SNMP traps, msend API installation, SNMP trap adapter high-availability implementation, custom BPPM cell MRL rules to process events from multiple vendors, IBRSD high-availability implementation, and event-to-ticket categorization in the BPPM cell. The return was a 6-figure annual license saving, year after year, with an investment of a 5-figure consulting fee. This ROI went straight to the business bottom line.

    Part 2: Keep the total cost of ownership in mind

    When you build a house for yourself, you don't just consider the cost of building, you also consider the cost of maintaining the house and utility bills when you live there. Similarly when you implement BPPM, in addition to implementation cost, you also need to keep the total cost of ownership in mind.

After talking to several BPPM customers, I noticed that they all have an operations team at least twice the size of the teams at my clients, just to keep BPPM operations going. What is worse is that their operations team also needs to have the implementation skill set to constantly patch up the implementation.

    Before you even start implementation, consider the following aspects:

    1) Scalability: When your environment grows with more servers, more applications, or more integration, will your architecture still work? How easy would it be to split horizontally (based on processing steps) and vertically (based on incoming traffic)?

2) Upgrade: What can you do right now to make future upgrades easier? You may want to consider having a naming convention, saving configuration in a separate repository, and documenting everything consistently.

3) High Availability: High availability not only helps with business continuity, it also keeps your team from constantly fighting fires. You have several options for high availability: application-level failover, OS-based failover, active/active load balancing, or duplication. Which option would best fit your needs for each BPPM component, and how much would it cost? For example, a native application-level failover might be your best choice for BPPM cells if your business cannot afford to miss a server-down alert. But a simple duplication of the PATROL 7 console is probably sufficient compared to OS-based failover, which would cost nearly twice as much.

    4) Implementation Repeatability: Do you keep an accurate implementation document so that installation and configuration of each BPPM component is repeatable? You need to implement everything on a test system first and carefully document everything as you go. Production deployment should be a straightforward 'follow the doc' process. It also gives you a perfect opportunity to update the implementation document for anything you have missed.

A common mistake I have seen is to start the implementation directly on a production system. After several months of figuring things out, it finally went live with many junk files sitting under the implementation directory. Then you realize that you actually needed a test system, because you won't be able to make and test changes otherwise. Now you don't know how to configure your test system to make it identical to your production system, since you have lost track of what made the production system work and what did not.

    5) Operations Standardization: Do you have a standard operations procedure document? For example, if a new server is added into your PeopleSoft Payroll application, do you have a document containing the steps for the operations team to add that server to PATROL, BPPM integration service, BPPM cell, BPPM server, BPPM GUI, and automated Remedy ticketing?

    Part 3: Achieve the highest ROI through integration

    In addition to monitoring solutions from BMC, most enterprises nowadays also use monitoring software from other vendors, open source, and even home-grown scripts scheduled by cron job. Having a group of NOC operators watching the GUIs of all monitoring software in a NASA-like environment is simply not efficient. What is worse is when you have to pay the license fee for each monitoring software to connect with the back-end ticketing system.

The BPPM/BEM cell provides an extremely flexible and robust API and adapters to integrate with just about any monitoring software out there. Whether you are running monitoring tools from other commercial vendors such as IBM and Microsoft, or you use open source tools like Nagios, it is fairly straightforward to integrate alerts from these tools into the BPPM/BEM cell using either its OS API or its SNMP adapter. If you use home-grown scripts, all you need to do is add an API call at the end.
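
As a rough illustration of what "adding an API call at the end" of a home-grown script can look like, the sketch below shells out to the cell's msend command-line utility from Python. The cell name, event class, and slot values are made up for the example, and the option letters shown (-n cell, -a event class, -r severity, -m message, -b slot assignments) reflect typical msend usage but should be verified against your own BPPM/BEM version and knowledge base.

import shutil
import subprocess

def send_event_to_cell(message, severity="WARNING"):
    # Post an event into the BPPM/BEM cell at the end of a home-grown check script.
    # Cell name, event class and slot names below are illustrative only.
    subprocess.run([
        "msend",
        "-n", "prod_cell",                # destination cell (assumption)
        "-a", "CUSTOM_SCRIPT_EV",         # event class from your knowledge base (assumption)
        "-r", severity,                   # severity
        "-m", message,                    # event message text
        "-b", "mc_host=dbserver01;mc_tool=homegrown_check",  # extra slots (assumption)
    ], check=True)

# Existing home-grown check, with the API call appended at the end.
usage = shutil.disk_usage("/")
if usage.used / usage.total * 100 > 90:
    send_event_to_cell("Disk usage above 90% on dbserver01", severity="CRITICAL")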

If your back-end ticketing system is Remedy, the out-of-the-box two-way integration (IBRSD) between the BPPM/BEM cell and Remedy is more efficient than the Remedy gateways for other monitoring tools. It is fairly straightforward to configure two instances of IBRSD as active/active failover, so your chance of waking up at 3am to fight a fire is very slim. Since the license for IBRSD is included in the price of BPPM/BEM, you instantly cut costs when you stop paying for the Remedy gateway license for other monitoring tools.

    Other added benefits include reduced maintenance effort for other monitoring software, less customization in Remedy, consistent ticket information for all monitoring tools, and possible event correlation between events from different monitoring tools. You will also make your NOC team's job easier.

I understand that it is not always easy to convince people who work on other monitoring software to integrate into BPPM/BEM, due to organizational silos and technical complexity. It is important to pick the right candidate for the first BPPM/BEM integration. Once the ROI is obvious, people will become more supportive of BPPM/BEM integration. In addition, it is also important to set up a consistent framework for all integrations, since BMC does not provide a standard for integration. Once you have set up a consistent framework for one-way and two-way integration, your next integration will become much easier.

    At one of my past clients, it took our BPPM/BEM team three months to work with the other team to finish our first integration because the integration project had the lowest priority with the other team. Once everyone saw how well the integration worked and how much license fee it saved, our second integration took only 4 weeks to finish. Subsequently our third integration took only three days to finish.

    Part 4: Monitor the monitors

    The purpose of BPPM is to monitor your IT infrastructure. It is important that the monitors themselves are up and running all the time.

A good BPPM implementation not only monitors your IT infrastructure, it also monitors each and every BPPM component, including the BPPM server, BPPM agent, BPPM cell, PATROL agent, PATROL adapter service/process, SNMP adapter service/process, IIWS service/process, IBRSD service/process, etc. The self-monitoring metrics include component status and connection status.

The events alerting that a BPPM component is down or a BPPM connection is down are mostly sent to the connected BPPM cell automatically. Some of the self-monitoring events require quick activation. You need to identify those events, as they have different event classes and message formats, and you need to notify the right people about them.

Some components may have multiple ways to be monitored, and you just need to pick the one that works best in your environment. For example, when a PATROL agent loses its connection with the PATROL Integration Service, you can see an event sent directly from the PATROL agent, another event from the PATROL LOG KM if you configured it to monitor the IS-connection-down log entry, and yet a third event from the PATROL Integration Service if you activated it in the BPPM GUI.

    You may need to reword the message of a self-monitoring event for better readability as some messages are not clear at all. For example, by default, PATROL agent connection down event contains the following slots:

    cell='PatrolAgent@server1@172.118.2.12:3181';
    msg='Monitored Cell is no longer responding';

    You may want to reword the message to look like this:

    msg='PatrolAgent@server1@172.118.2.12:3181 is no longer responding';

    because it is the PATROL agent that is no longer responding, not the cell.

For the notification method, the most reliable way is a local email fired from the cell that receives the self-monitoring events. Since your path to the ticketing system may be down when your BPPM components are experiencing problems, your back-end ticketing system should not be the only way to send notifications for your self-monitoring alerts. It should be used in addition to your local email notification.
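
As a minimal sketch of that local-email fallback, the Python snippet below sends a plain message through a local SMTP listener on the cell host. The addresses are placeholders, and how you trigger it (for example, from a cell action script) is up to your own design; it is shown only to illustrate keeping a notification path that does not depend on the ticketing system.

import smtplib
from email.message import EmailMessage

def notify_locally(event_msg):
    # Send a plain local email for a self-monitoring alert, independent of the ticketing path.
    msg = EmailMessage()
    msg["Subject"] = "BPPM self-monitoring alert"
    msg["From"] = "bppm-cell@localhost"        # placeholder address
    msg["To"] = "monitoring-team@example.com"  # placeholder address
    msg.set_content(event_msg)
    with smtplib.SMTP("localhost") as smtp:    # assumes a local SMTP listener on the cell host
        smtp.send_message(msg)

notify_locally("PatrolAgent@server1@172.118.2.12:3181 is no longer responding")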

    Part 5: Customize at the right place

    Unless you are a very small business, you will need to customize BMC out-of-box solutions to address the particular issues in your IT environment. It is unrealistic to expect a one-size-fits-all solution from BMC. Fortunately BPPM was developed with customization in mind. It provides extensive tools to help you develop your own solutions that seamlessly extend BMC out-of-box solutions.

The BPPM suite has three major components: BMC ProactiveNet, BPPM Cell (BEM), and PATROL. Both the BPPM Cell and PATROL are more than 10 years old. One of the primary reasons they are still going strong today is that they both allow you to add your own solutions to them seamlessly.

Before you start developing your own custom solutions, take a step back to think about what options you have and where you should place your customization. What would be the impact on accessibility and resource consumption on the underlying servers? What would be the impact on deployment of your custom solutions? What would be the impact on future maintenance and upgrades?

    In PATROL, you can develop custom knowledge modules and you can also plug in your own PSL code as a recovery action into a parameter. In BPPM Cell, you can develop your own event classes, MRL code, dynamic tables, and action scripts to extend the out-of-box knowledge base.

    In general, if you have a choice between customizing PATROL and customizing BPPM Cell to manage events, customizing BPPM Cell would require less effort and result in less impact to the servers that are being monitored. Here are a few reasons:

1) PATROL is running on servers you don't own, have limited access to, and may not be familiar with. For example, I was recently helping a client debug a custom KM running on AS400. I had to get help from the AS400 sysadmin just to add one line to its PSL code.

    2) PATROL is often sharing the server with mission critical applications. Poorly written PSL code could potentially impact the mission critical applications negatively.

    3) The same custom knowledge module may need to be running on more than one server, thus requiring more time to deploy and upgrade.

4) BPPM Cell is running on your own infrastructure server. It is infinitely scalable as a peer-to-peer architecture. If resources ever become an issue, you can add more cells, either on the same server or on a different server (even with a different operating system). You can split a cell horizontally by processing phases, or you can split a cell vertically by event sources.

    Disclosure: I am a real user, and this review is based on my own experience and opinions.
    ITCS user
    Technical Account Manager at a tech services company with 51-200 employees
    MSP
    Benefits include Reduced Cost, Improved Availability and Service Quality

    What is most valuable?

    It's an Integrated Platform and Event Management Console.

    How has it helped my organization?

    Goal
    • Outage Avoidance

    Benefits

    • Improved Service Quality
    • Improved Availability
    • Enable Virtual & Cloud
    • Reduced Cost

    Key Capabilities

    • Predictive Service Impact
    • Self-Learning Analytics & Predictive Root Cause
    • Dynamic Virtualization & Cloud Analytics
    • Continuous Deep Dive Diagnostics
    • Unified BSM Architecture

    What needs improvement?

I don't see any need for improvement as of now.

    Which solution did I use previously and why did I switch?

    No

    How was the initial setup?

Straightforward in terms of deployment.

    What about the implementation team?

An in-house team.

    Which other solutions did I evaluate?

    Yes, HP Tools, CA Tools.

    What other advice do I have?

This is the best product I have experienced so far.
    Disclosure: I am a real user, and this review is based on my own experience and opinions.
    ITCS user
    Technical Account Manager at a tech services company with 51-200 employees
    MSP
    BPPM - Movement of Oracle Database from a RAC to a Standalone.

    Hi Everyone,

I would like to share one of my experiences from a project where we had to move our BPPM Oracle RAC database instance to an Oracle standalone database instance. All we had was a backup of our RAC database, with all the data in it.

We restored the DB backup from RAC to the standalone Oracle DB instance.

Further to this, we made changes in the pronet.conf file on the BPPM Application Server.

We need to replace the Oracle RAC entries with the Oracle standalone instance entries. While making these changes, keep the BPPM Application Server completely down.

RAC instance entries which need to be removed:

    pronet.api.database.oracle.rac=true
    pronet.api.database.oracle.rac.count=1
    pronet.api.database.oracle.rac.host.1=abc.xxx.com
    pronet.api.database.oracle.rac.port.1=1655
    pronet.api.database.sid=ABC

Standalone entries which need to be added:

    pronet.api.database.hostname=abc.xyz.com
    pronet.api.database.oracle.rac=false
    pronet.api.database.portnum=1655
    pronet.api.database.sid=ABC

Once this is done, turn the BPPM application back on.

    Your BPPM will be up and running over Oracle Standalone Database.
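
For repeatability, the same pronet.conf edit can also be scripted. The Python sketch below is a hedged illustration only: the file path is a placeholder (adjust it to your installation), the values are the ones from the example above, and you should take a backup and keep the BPPM Application Server down while running it.

# Replace the Oracle RAC entries in pronet.conf with the standalone entries.
CONF_PATH = "pronet.conf"   # placeholder; point this at the actual file in your installation

RAC_KEYS = (
    "pronet.api.database.oracle.rac.count",
    "pronet.api.database.oracle.rac.host.1",
    "pronet.api.database.oracle.rac.port.1",
)
STANDALONE = {
    "pronet.api.database.hostname": "abc.xyz.com",
    "pronet.api.database.oracle.rac": "false",
    "pronet.api.database.portnum": "1655",
    "pronet.api.database.sid": "ABC",
}

# Drop the RAC keys (and any old copies of the standalone keys), then append the new values.
with open(CONF_PATH) as f:
    kept = [line for line in f
            if line.split("=")[0].strip() not in RAC_KEYS + tuple(STANDALONE)]

with open(CONF_PATH, "w") as f:
    f.writelines(kept)
    f.write("\n".join(f"{key}={value}" for key, value in STANDALONE.items()) + "\n")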

    ./Anuparn Padalia

    Disclosure: I am a real user, and this review is based on my own experience and opinions.
    it_user7950
    Consultant at a tech consulting company with 51-200 employees
    Consultant
    BPPM has the potential to be a market beating product however, the investment required is significant

    This article is a review of BMC ProactiveNet Performance Manager (BPPM) version 8.6 and its key sub-components.

    The main key sub-components include:

    > ProactiveNet Analytics

    > ProactiveNet Event Management (formerly Mastercell)

    > ProactiveNet Performance Manager (i.e. PATROL)

Versions Reviewed

• BPPM Event Manager: 8.6
• BPPM Analytics: 8.6
• PATROL Central: 7.8.10
• PATROL Central Operator – Web Edition: 7.8.10
• PATROL Agent: 3.9.00.1i
• PATROL for UNIX Servers: 9.10.00.02

    Key Capabilities

    Event Management

    BPPM Event Management (previously known as Mastercell or BEM) is the component that replaces PATROL Enterprise Manager or PEM (previously known as CommandPost).

    BPPM introduces a programming language called MRL. MRL is not as flexible as PERL or REX which can both be used in PEM, but MRL does include many in-built features such as policies that make the design of rules slightly easier.

    PEM used to perform event management using up to 5 transformers or scripts written in PERL. PEM was effectively a tool box whereby all the intelligence is provided by the PERL scripts which enrich the events using a number of lookup files.

Which product is better, PEM or BPPM? BPPM is arguably a better event management platform. Although MRL is frustrating to work with, the in-built capabilities mean that you don’t have to develop everything from scratch. BPPM is generally a good event management platform.

    Threshold Management

PATROL Configuration Manager (PCM) is one of the best threshold management tools in the industry. The threshold management capabilities in BPPM (aka ProactiveNet) are poor in comparison. BMC state that they will include PCM functionality in the next release of BPPM.

    The limitations of Threshold management in BPPM are numerous:

    • BPPM has no local thresholds that can be applied across multiple servers.
    • Local thresholds can only be defined via the GUI.
    • Local thresholds can’t be migrated from one environment to another.
• Migration of global thresholds can be performed using an export/import utility – but it is not simple.
    • The GUI for managing thresholds is cumbersome and not intuitive.

On the plus side, the different types of thresholds in BPPM are very powerful. BPPM has Absolute, Intelligent, Signature and Predictive thresholds. These thresholds are statistically based and will generate events when a statistical anomaly is detected. The product will automatically calculate trends using linear regression and variations based upon hourly, daily or weekly patterns. However, the statistics will not eliminate threshold management, as BMC have sometimes claimed. Many thresholds are Boolean in nature – either good or bad – and are therefore not appropriate for statistical analysis. Statistical analysis is only appropriate for about 20% to 30% of thresholds, and the analysis consumes a lot of CPU cycles.

    Ease of Implementation

    BPPM is undeniably a complex product. Far too complex in my opinion. There are many other much simpler solutions such as HP SiteScope or CA Nimsoft which can be implemented much faster. In addition, the BMC Product Set has gradually got more and more complex over the years. The solution is really three products bundled together:

    • MasterCell which BMC purchased about 7 years ago.
    • ProactiveNet which BMC purchased about 4 years ago.
    • PATROL which BMC purchased about 20 years ago.

MasterCell is a great event management product. ProactiveNet has perhaps been oversold by BMC – and the value is overstated. The autonomous thresholds can only be applied to 20% to 30% of parameters anyway. PATROL was originally a great product – but it has become bloated and complex after years of poor product management.

As an illustration of how complex the BPPM solution has become, consider the following comparison of the old solution with PEM and the new BPPM solution (version 8.6):

• Number of Servers: 3 (DEV, DR and PROD) vs. 11 (3 DEV, 3 TEST, 5 PROD)
• Number of Connections to the Agents: 2 (PEM and RT Server) vs. 3 (BIIP3, BPPM Adaptor, RT Server)
• Number of Adaptors: 1 (RT Server) vs. 3 (RT Server, BPPM Adaptor, BIIP3)
• Dynamic Policy Files (for Rules): 5 rule files vs. 12 rule files
• Forms for Threshold Management: 1 (PCM) vs. 2 (TEST and PROD BPPM Servers)

    Extensibility

The PATROL agent has always been very extensible. There is a rich API and many different ways to write an interface. PATROL Central has no API and therefore cannot be extended. Both BPPM and PEM are very extensible and can be extended through a variety of scripting languages such as PHP or PERL.

    Blackout

BMC has never provided a web form that allows staff in the Operations Bridge to black out servers or services for upcoming outages due to planned maintenance. This customer (mentioned in this review) had to write its own web GUI for Blackout. This is an Apache and PHP solution that allows the shift operators to configure blackouts. It required 25 days of development to alter the blackout web form and migrate this functionality from PEM to BPPM.

    Administration

    Routine Daily Admin Tasks

    For an environment of 500 Agents, BPPM requires from 0.5 to 1 FTE to keep the lights on - depending on the experience of the person. Typical daily tasks include the following:

    • Restarting Agents. For an environment of 500 Agents, you can expect that 1 agent will crash per day. The most common cause is probably history file corruption. History files can grow to beyond 4 GB if not managed.
    • Checking the Consoles. Most environments will end up with a hierarchy of BPPM Event cells. The Administrator needs to log into each Console to verify that events are being:
      • De-duplicated properly;
      • Propagated correctly from one cell to the next;
  • that incidents are being raised correctly – if Automatic Incident Generation (AIG) is configured.
• Managing Thresholds. The Administrator will get on average one request per day to change a threshold or verify that a threshold is in place. For example, an ORACLE DBA may say that there was a SEV2 incident last night related to table locking: "Could you please check that instance DW_PROD is monitored for locking?" It can take from 30 minutes to 2 hours to investigate each request and write an email suggesting and agreeing the new threshold. Perhaps longer if a meeting is required.
• Managing Rules. Changes to the BPPM Rules occur about once per month and need to be performed using change control. Rule changes require a code change to the MRL, and the cells will need to be bounced.
• Commissioning and Decommissioning New Agents. Agent commissioning usually occurs every few months and may involve up to 20 virtual hosts associated with one physical machine. The commissioning process is fairly involved (in fact, all the admin steps are complex). See below.
• Deploying KMs. When the support teams deploy new infrastructure software such as WebSphere or ORACLE, the associated PATROL Knowledge Module (KM) will also need to be deployed. Each deployment may take 1-3 hours and will require change control. Input will be required from the SME. For example, the ORACLE DBA may be required to type in the system password for ORACLE during the KM configuration process.

    PATROL Agent Commissioning

    The Agent commissioning process for configuring monitoring for a new server consists of the steps shown below:

1. Ping Host – Ping the host to verify that the hostname is correct.
2. Install Agent – Install the agent using the Solaris package.
3. Update Event Rules – Edit the BPPM enrichment file abc_host.csv.
4. Apply to PROD Cell – Import abc_host.csv into the PROD cell.
5. Apply to TEST Cell – Import abc_host.csv into the TEST cell.
6. Update PING Test (primary) – Update the PING test configuration on the primary server to ensure the host is up.
7. Update PING Test (secondary) – Update the PING test configuration on the secondary server to ensure the host is up.
8. Configure UNIX KM – Use PCM to give the agent the standard configuration for the UNIX KM.
9. Update BIIP3 – Update the BIIP3 config so that the agent can talk to the Event Management cell.
10. Agent Restart – Restart the agent to ensure that the agent configuration takes effect.
11. Update PCO Web Console – Update the PCO Web Console so that the agent appears in the PATROL console.
12. Update Work Request – Update the work request to indicate the job is complete.

    If additional Monitoring is required for ORACLE or WEBLOGIC or some other Application, then there are additional configuration steps that are required.

    Programming Languages

There are two languages to learn with BPPM:

• MRL, or Mastercell Rule Language – This is a fairly unique programming language.
• PSL, or PATROL Script Language – This language is similar to PERL. The complexity lies in the functions that need to be learned.

    Summary of Administration

Administration of BPPM is overly complex. The product has evolved over the course of the last 20 years. As new components have been added via acquisition, the product has become increasingly complex and time-consuming to administer.

    Architectural Considerations

    Any Solution Design for BPPM should consider the following key questions:

    Question

    Details

    How does the design allow for rule tracing?

    Using the trace log is not practical due to the volume of events. A good solution is to assign a Unique ID to each rule and then configure each rule to add an entry to a new slot called “matching_rules”.

    How does the design specify rule execution order?

    It is often difficult to design rules because of confusion about rule execution order. It is good practice to split all mrl files into mrl files for new rules and mrl files for refine rules. So you get: new_mcxp.mrl and refine_mcxp.mrl. The files then should be grouped in the .load file by stage, so you have refine rules followed by new rules … etc.

    Does the DEV environment have the same number of cells as the TEST and OAT environments?

    Don’t be tempted to have fewer cells in the DEV environment. It is tempting to have fewer cells in order to limit the number of zones (servers) required. This is a mistake. Rule execution order is greatly affected by the propagation (or not) of slots between cells and the configuration of mcell.propagate.

    Does the design specify the configuration of mcell.propagate?

    The design should specify the configuration of all mcell config files – including mcell.propagate, mcell.dir etc.

    Is BIIP3 included in the Design?

    BIIP3 is essential in order to forward PATROL events to the cells for any cells that are not event class 11 and 39. These events are explicitly generated by the PSL event_trigger() function. It is impossible for BPPM Analystics (ProactiveNet) to collect these events because they have no associated metric.

    Threshold Management

    If thresholds are being migrated fro PCM to BPPM, How will the thresholds be migrated from BPPM server to another? Has the export / import process been thoroughly tested? (because is has serious issues).

    I would advise migrating the thresholds to BPPM as a Phase II activity or wait for BPPM v9.

    Export Thresholds from PCM

    Does the design specify using a tool for extracting all the thresholds from PCM into a spreadsheet? (I have a PERL tool to do this).

    Testing

    Does the Design provide for at least a month of end-to-end testing once the rules have been completed.

    Monitoring the Monitoring

    Does the Design incorporate monitoring of the monitoring? Will an event be generated if the BIIP3 Adapter fails?

    Event Storm

    If the BIIP3 Adaptor looses connection to multiple agents every half an hour and then regains the connection 30 seconds later this will create 200 new AGENT_DOWN events (mc_adapter_control). The de-dup rule will not work because the AGENT_UP event closes the AGENT_DOWN event. What rule is going to prevent this event storm?

    Time-out Policies

    Does the Design specify timeout policies for all the main top level event classes such as MC_CELL.. and EVENT. Does the cell start reasonably quickly with 2000 events? What about 20,000 events?

    DDE Enrichment

    Does the Design fully specify the Enrichment files that will be used?

    DDE Synchronization

    Are the DDE config files pulled or pushed into the cells? How are the DDE cfg files synchronized between cells?

    Blackout

    Has a Web site been included in the Design for Blackout by the Operations Bridge? BPPM does have a “Schedule downtime” facility – but this is entirely inappropriate for operators and does not account for BIIP3 events.

    Blackout Dev

    If a blackout GUI is a requirement, has a month of Development been allocated (using something like Apache and PhP)?

    BPPM Analytics

    Does the Design discuss the possibility of implementing BPPM Analytics as a second phase?

    Reporting

    Does the design include Event Reporting to drive Continuous Improvement? Key reports are total events grouped by:

    • ·Day, Week, Month
    • ·Object Class
    • ·Application
    • ·Service
    • ·Support Group

    Reporting DEV

    If reporting is a requirement, does the Design include time to implement the BMC reporting tool or 2 weeks of development using PhP and mquery.

    AIG

    Does the Design Include Automatic Incident Generation? (AIG). Semi-automatic incident generation an option – whereby an operator creates a ticket by right clicking on an event. Is this option considered and discussed in the design?

    Failover

    Is failover considered? How is the configuration replicated? Replicated DISK?

    Training

    Does the project plan include time for training the staff in the Operations Bridge? What about 2nd level support?

    Go-live

    Is the Go-Live big bang or Phased? Phased is preferred for risk mitigation but will require operators to run two consoles in parallel.

    Audible Alarm

    Is an audible alarm a requirement? If so, then this will require a few days of development to configure a web page that uses a sound file and "mquery -s COUNT".

    BPPM Classes

    BPPM has a number of event classes, as shown below, which all inherit from the CORE_EVENT class.

    CORE_EVENT

    • EVENT
      • MC_CELL_EVENT
      • MC_UPDATE_EVENT
      • MC_SMC_ROOT
      • MC_MCCS
      • MC_CLIENT_BASE
        • MC_CLIENT_CONTROL
        • MC_CLIENT_ERROR
      • MC_ADAPTOR_BASE
        • MC_ADAPTER_CONTROL
        • WIN_EVENTLOG
        • LOGFILE_BASE
        • SNMP_TRAP
      • PEM_EV
      • PATROL_EV
      • PPM_EV
        • ALARM
    • MC_CELL_CONTROL
      • MC_CELL_START
      • MC_CELL_STOP
      • MC_CELL_TICK
      • MC_CELL_STATBLD_START
      • MC_CELL_STATBLD_STOP
      • MC_CELL_DB_CLEANUP
      • MC_CELL_CONNECT
      • MC_CELL_CLIENT
      • MC_CELL_DESTINATION_UNREACHABLE
      • MC_CELL_HEARTBEAT_EVT
      • MC_CELL_RESOURCES
      • MC_CELL_ACTION_RESULT
      • MC_CELL_PUBLISH_RESULT
    • IAS_EVENT
      • IAS_START
      • IAS_STOP
      • IAS_SYNCH_EVENT
      • IAS_REINIT
      • IAS_LOGIN
      • IAS_ERROR

    Mastercell Rule Language (MRL)

    Mastercell Rule Language (or MRL) is the language used to develop event management rules within BPPM. The administrator can develop 11 different types of rules, as listed under "Rule Phases" below. The language is simple and relatively easy to learn in terms of both the syntax and the built-in functions. The most difficult concept to grasp is the execution order, as explained below; one of the most common problems is to misunderstand the execution order and find that the rules are not executing in the desired sequence. The other cause of frustration is the lack of common statements such as looping structures (do, while, for, until) which one takes for granted in other languages. It is possible to iterate over a list structure using the listwalk() function call, and the New rule phase also has a limited capability to loop over events using the Updates clause. Fortunately, the need to loop is fairly rare, but at times the lack of standard statements can be a cause of frustration.

    The biggest problem with MRL is the slow cycling speed when debugging code. Compared to PHP or Perl, it takes at least ten times as long to stop, compile and restart, so debugging cycles are ten times as long and productivity is similarly affected. True, it is not necessary to write pages and pages of code - but typically one will write about 8-15 pages of MRL for each project. Eight pages of PHP (tested and debugged) takes one to two days; eight pages of MRL (tested and debugged) takes two to four weeks. In addition, one should allow for an extra month of end-to-end testing before production go-live to test the rules with real events - and to allow for all possible scenarios to play out and for all the bugs to emerge. These rules of thumb apply to companies of 5,000 to 10,000 employees. For larger organizations, you should allow more time.

    Execution Order

    • Rules are processed in order according to their rule phase, as shown below.
    • Rules are executed in the order in which they appear in the .load file.
    • Rules are executed in the order in which they appear in the .mrl file.
    • Policies are executed in order of the specified "execution order".

    Rule Phases

    Rules are executed in the order shown below.

    1. Refine - A Refine rule verifies the validity of incoming events and collects additional data for an event before it is sent through the remaining rule phases where further processing takes place.

    2. Filter - Filter rules limit the number of incoming events by discarding those events that need no additional processing or analysis. Filter rules compare incoming events to the event condition formulas (ECFs) contained in the rule to determine if an event is discarded or proceeds to further processing. An incoming event is processed through each Filter rule until a Filter rule discards the event, or all Filter rules are exhausted. An event must match all the Filter rules to be accepted.

    3. Regulate - Use Regulate rules to handle time-frequency accumulations of events or repetitive occurrences of events. An event is considered a repetition of another if the event has the same values for all the slots that are defined with the dup_detect=yes facet in the BAROC definition of its event class.

    4. New - Use New rules to execute an action when a new event is received, for example increasing the severity level for an event or updating an existing event with new event data. New rules determine if an event becomes permanent and is placed in the repository.

    5. Abstract - Abstract rules create high-level, or abstract, events based on low-level events. A new abstract event starts at the New rules phase, skipping the Filter and Regulate rules phases. With Abstract rules, you can keep low-level events with cells in the lower level of the cell hierarchy, abstract the data from low-level events into high-level events, and propagate them to a higher-level cell. A high-level cell in the hierarchy can consolidate abstract events from several low-level cells and prevent a large number of abstracted technical events for which no consolidating rules apply.

    6. Correlate - Correlate rules build an effect-to-cause relationship between an event that occurs as a result of another event. Correlate rules execute whenever a cause or an effect event is received. The relationship between correlated events can be broken.

    7. Execute - The Execute rule performs a specified action when a slot value has changed in the repository. The specified action, which is either internal to the cell or running an external executable, is based on the characteristics of one or more events.

    8. Threshold - The Threshold rule counts the number of events that match the criteria you specify; if the number of these events exceeds the amount allowed within a time frame, the Threshold rule executes. An event is considered a repetition of another if the event has the same values for all the slots that are defined with the dup_detect=yes facet in the BAROC definition of its event class.

    9. Propagate - A cell uses Propagate rules to forward events or messages to one or more destination cells or gateways. For example, a Propagate rule can escalate an event from a lower-level cell to a higher-level cell in an environment.

    10. Timer - Use Timer rules to create timed triggers to call a rule. Timer rules are evaluated when a timer expires.

    11. Delete - The purpose of Delete rules is to perform actions before an event is discarded from the repository, such as a rule that suppresses data that has no meaning without an event instance. Delete rules are evaluated whenever an event is deleted from the repository or when events are deleted using the Delete flag in the mposter command.

    PATROL Configuration Manager (PCM)

    PATROL Configuration Manager (PCM) is a configuration tool used for PATROL agents. The tool is mainly used for configuring Thresholds and is very effective at this task.

    Operation

    PCM is similar in concept to the Windows registry editor. The main form consists of two TreeView panes, as shown in the screenshot below. The left TreeView is used to configure hosts, which are arranged in groups (such as ORACLE). The right-hand TreeView is used to manage the RuleSets, which can also be arranged into groups. RuleSets are linked to hosts by dragging them from right to left and dropping them onto the leaves marked "LinkedRuleSets". The user then invokes a command called "Apply RuleSets". The RuleSets are applied to each Agent in the same order as they appear in the hierarchy on the left; RuleSets linked to lower-level nodes take precedence and "override" higher-level group RuleSets.

    (Screenshot: PCM main form)

    Typical Use Case

    The use of PCM typically follows a four-step process. Administrators must perform the following:

    1. Select an Agent as a master and configure this Agent using the PATROL Central Operator (PCO) Console.
    2. Copy the configuration into PCM.
    3. Apply the configuration to other similar Agents using PCM.
    4. Restart the Agents in order for the configuration to take effect.

    Weakness

    The key weaknesses of this configuration process are the following:

    1. PCM and PCO are separate tools. Ideally, the configuration tool (PCO) and the configuration distribution tool (PCM) should be the same product. This would eliminate step 2 above.
    2. Step 4 should not be necessary. Restarting the Agents can easily be performed using PCM - but the problem is that all active events are regenerated. This means that all agents must be blacked out for up to an hour before any restart - otherwise staff in the Operations Bridge will see hundreds of duplicate events that they have already handled over the last few hours.

    Desired State Management

    The key benefit of PCM is that it can be used to manage a desired state for each Agent: if you apply the configuration once or a thousand times, the result is exactly the same. The hierarchy allows one to set global or default configuration using the higher nodes in the left TreeView and then to override it with local (host-specific) configuration using the lower nodes. This hierarchy works extremely well.

    Policies

    The Policies feature within BPPM Event Management is generally a well-executed feature within the product and has sufficient flexibility to meet most customers' needs. The Dynamic Data Enrichment (DDE) policies allow the user to manage the rules externally using Comma-Separated Value (CSV) files.

    The key thing to keep in mind is that DDE policies match based on Best Fit and not First Match. For example, if the file contains a rule for the hostname pattern "fred*" (the star is a wildcard), an event from the host "frederick" will match a more specific "frederick" entry rather than "fred*", even if "fred*" appears first in the CSV file. The rules are loaded into a hash memory structure within the product. The benefit of Best Fit is that the execution time for finding a match is predictable, irrespective of the number of lines in the CSV file (and there could be thousands). The disadvantage of Best Fit is that the matching can be out of sequence and counter-intuitive. Best practice in this case is to keep the CSV files simple, and to give each enrichment file only one purpose. For example, the customer used in this review originally started with 5 enrichment files in their old PATROL Enterprise Manager (PEM) environment. After implementing BPPM, the customer ended up with 11 DDE enrichment files: the total number of lines was lower, but the number of files was higher.

    When migrating from PEM to BPPM, the enrichment files should be "Normalized" - by minimizing the number of lookup columns in order to reduce the probability of out-of-order rule matching.

    BMC Standard Policies

    • Closure - A Closure policy closes a specified event when a separate specified event is received.

    • Blackout - A Blackout policy might be used during a maintenance window or holiday period.

    • Component Based Enrichment - Enriches the definition of an event associated with a component by assigning selected component slot definitions to the event slots.

    • Enrichment - Enriches the definition of an event associated with a component by assigning selected component slot definitions to the event slots.

    • Correlation - Correlation relates one or more cause events to an effect event, and can close the effect event. The cell maintains the association between these cause-and-effect events.

    • Escalation - Escalation raises or lowers the priority level of an event after a specified period of time. A specified number of event recurrences can also trigger escalation of an event. For example, if the abnormally high temperature of a storage device goes unchecked for 10 minutes, or if a cell receives more than five high-temperature warning events in 25 minutes, an escalation event management policy might increase the priority level of the event to critical.

    • Notification - Notification sends a request to an external service to notify a user or group of users of the event. A notification event management policy might notify a system administrator by means of a pager about the imminent unavailability of a mission-critical piece of storage hardware.

    • Propagation - Propagation forwards events to other cells or to integrations to other products.

    • Recurrence - Recurrence combines duplicate events into one event that maintains a counter of the number of duplicates.

    • Remote - Remote action automatically calls a specified action rule provided the incoming event satisfies the remote execution policy's event criteria.

    • Suppression - Suppression specifies which events the receiving cell should delete. Unlike a blackout event management policy, the suppression event management policy maintains no record of the deleted event.

    • Threshold - Threshold specifies a minimum number of duplicate events that must occur within a specific period of time before the cell accepts the event. For events allowed to pass through to the cell, the event severity can be escalated or de-escalated a relative number of levels, or set to a specific level. If the event occurrence rate falls below a specified level, the cell can take action against the event, such as changing the event to closed or acknowledged status.

    • Timeout - Timeout changes an event status to closed after a specified period of time elapses.

    • Component Based Blackout - Specifies which events the receiving cell should classify as unimportant and therefore not process. The events are logged for reporting purposes. A Component Based Blackout event management policy might specify that the cell ignore events generated from a component or device based on the component selection criteria for this policy.

    Typical DDE Enrichment Files

    • Host.csv - Assign Location and HostType (DEV, TEST or PROD) based on host name. Lookup columns: HostName. Data columns: Location, PhysicalServer, HostType.

    • HostSuppress.csv - Filter out events based on hostname (e.g. when a new Agent is installed). Lookup columns: HostName. Data columns: HostSuppress (YES/NO).

    • Application.csv - Assign an application name to each event. Lookup columns: ApplicationClass, Parameter. Data columns: Application.

    • ObjectSuppress.csv - Filter out troublesome parameters based on event class. Lookup columns: ApplicationClass, Parameter, EventClass. Data columns: ObjectSuppress (YES/NO).

    • ApplicationSuppress.csv - Filter out events based on application. Lookup columns: Application. Data columns: ApplicationSuppress (YES/NO).

    • HostBlackout.csv - Blackout hosts for planned outages based on time frame. Lookup columns: HostName, PhysicalServer, Location. Data columns: TimeFrame.

    • Service.csv - Assign a Service name to all events. Lookup columns: Host, Instance, HostType. Data columns: Service, SupportGroup.

    • ServiceSuppress.csv - Filter out events based on service. Lookup columns: Service. Data columns: ServiceSuppress (YES/NO).

    • ServiceBlackout.csv - Blackout services for planned outages during a particular time frame. Lookup columns: Service. Data columns: TimeFrame.

    • ServiceDowngrade.csv - Downgrade severity for particular services. Lookup columns: Service. Data columns: SeverityCode (e.g. 12333).

    • TextMessage - Change message text for certain parameters. Lookup columns: ApplicationName, Parameter, EventClass. Data columns: NewMessage.

    Note: Severitycode of 12333 downgrades MAJOR (4) and CRITICAL (5) to MINOR (3).

    Issues

    PATROL Agent Restart

    If the PATROL agent's configuration is changed, then the agent usually requires a restart. Unfortunately, the PATROL Agent regenerates all active events (any parameter that exceeds a threshold) when the agent is restarted. This means that an agent must be blacked out whenever it is restarted.

    PATROL Agent History Corruption

    The Agent history file will always get corrupted if it exceeds 4 GB, because there is a 4 GB file size limit on Solaris. The history file will frequently exceed this limit on busy servers running messaging services such as Tuxedo or MQ (simply because there is a lot to monitor). The history file may also get corrupted for other reasons. When the history is corrupted, the Agent will generate an event for every attempt to store a parameter value. This problem can generate hundreds of events every few minutes from just one host, which can easily overload a cell and a BIIP3 Adaptor (see BIIP3 Cache File Corruption below).

    With 500 UNIX Agents, you should expect one agent to get corrupt history about every 2 weeks.

    BIIP3 Cache File Corruption

    If the BIIP3 cache file is corrupted, the BIIP3 can get stuck on one event and keep generating the event. I have seen 4 million repeated events in a cell due to this problem.

    BIIP3 cache file corruption may be caused by overload (see PATROL Agent History Corruption above).

    I have seen this problem occur twice within 3 months.

    The workaround is to clear the cache file and restart the BIIP3 Adaptor.

    BIIP3 Agent Connection Drops

    In certain situations, the BIIP3 Adaptor may lose connection with all the agents every half an hour and then regain the connection almost immediately. This causes a flapping AGENT_DOWN and AGENT_UP condition that is not de-duplicated - because the AGENT_UP clears the AGENT_DOWN event. This issue can generate thousands of events and thousands of new Incidents (assuming Automatic Incident Generation is implemented).

    The best workaround is to create a new rule for MC_ADAPTER_CONTROL (AGENT_DOWN) events and set them initially to severity INFO. If the Agent is truly down, then the second agent-down event (which occurs 3 minutes later) should be configured in the rule to set the severity back to WARNING or ALARM.

    The problem is also solved by restarting the BIIP3 Adapter. I therefore suggest that all customers schedule a restart of the BIIP3 adaptors once per day. No events are lost because the BIIP3 adapter (and the PATROL Agent) caches all events.

    I have seen this problem about once per month with a population of 500 agents.

    BPPM Threshold Migration

    The migration of both global and local thresholds from one BPPM Analytics instance to another must be performed by hand. There is an export / import mechanism for global thresholds, but as of July 2012 this mechanism is unreliable. There is no import / export mechanism for local (host-specific) thresholds.

    BPPM Local Instance thresholds

    BPPM Analytics does not support instance-specific thresholds across hosts. In other words, you cannot set a default threshold for FSCapacity across all file systems, then set an instance-specific threshold that applies only to the root file system, and then apply that instance-specific threshold to all hosts. The instance-specific threshold must be individually defined on every host. With 500 hosts, this becomes unfeasible, and there is no script or API that can be used to automate the task.

    BPPM – Missing Hosts

    With this release of BPPM, the PATROL Agents are connected to BPPM Analytics using the BPPM Adaptor. When you use the graphing facility to graph parameters in BPPM, some of the hosts do not appear - even though they are connected via the Adapter. At the time of this writing, this case is open with BMC and is unresolved.

    BPPM does not support Custom Event Catalogues

    PATROL events that are triggered using the event_trigger() PSL function are not supported by BPPM Analytics (ProactiveNet). This forces all customers who use PATROL agents to implement both the BIIP3 Adapter (for event_trigger() events) and the BPPM Adapter (for all standard PATROL metrics that have an underlying parameter).

    This means that the adapter layer with a BPPM implementation is quite complex. There are three Adapters attached to every agent on three separate ports. The Adapters are the RTServer, the BIIP3 Adapter, and the BPPM Adapter.

    This complexity means that the implementation becomes fragile, complex to administer and fundamentally unreliable.

    LOG monitoring

    It is difficult to define catch-all rules using the standard BMC Log Monitoring KM. For example, it is possible to create a catch-all rule that triggers on the search string "ALARM". You then give this definition a custom origin, which might be something like "LOG.BANKING_app_log.alarm", and create a custom event message that inserts the line from the log file into the text of the message (this can be done with the syntax "%1-"). The problem occurs at the event management layer: all events that match this rule get rolled up into one event as duplicates - despite the fact that each event represents a different line from the log file and a different problem.

    The workaround is to change the de-duplication rules at the event management layer. Be careful: if the rules are improperly defined, you can make the product vulnerable to an event storm - which may only manifest itself a month or two later.

    Monitoring of the monitoring is insufficient.

    Typical Project

    Project Background

    The review was conducted after an upgrade project in which every component within an old PATROL environment was upgraded. The project was driven by the customer's internal audit organization, which reviewed the company's products and determined that PATROL Enterprise Manager (PEM) was no longer supported and that the whole environment should therefore be upgraded.

    Project Phases

    The project consisted of a number of separate projects which could have been undertaken individually. The customer chose to perform all three projects simultaneously, which increased the risk, complexity and length of the overall project.

    • Phase 1 - Solution Design
    • Phase 2 - Upgrade of the PATROL Agents and Knowledge Modules
    • Phase 3 - Replacement of PEM with BPPM Event Manager
    • Phase 4 - Introduction of BPPM Analytics

    Project Timescales

    The Solution Design phase was conducted in late 2011 and the implementation was started immediately after the New Year in 2012. Phase 3 of the solution was finally put into production on Thursday 28th June 2012.

    Phase 4 of the project has not yet been completed. Phase 4 was removed from the project scope when the customer fell behind on delivery. Currently, there are no plans to complete this phase of the project.

    The customer contracted several months of consultancy from BMC Software. BMC performed the initial solution Design and much of the initial configuration of the event management rules.

    Resources

    The resources assigned to the project, consisted of the following:

    • BMC Consultant - ~3 months
    • Customer SME - 7 months full time
    • Independent Consultant - 4 months
    • Customer UNIX Engineers (2 engineers) - 4 months
    • Customer Infrastructure Architect - 1 month
    • Customer Project Manager - 2 months
    • Customer Delivery Manager - 2 months
    • Management Involvement (Project Sponsor + Resource Manager) - 1 month
    • Total - 24 months

    Lessons Learned

    The project overran initial estimates - both in terms of schedule and cost. The following issues were encountered:

    • Solution Design - The event management rules had to be completely redesigned, which delayed the project by about a month. The customer's old rules used First Match, whereas BPPM only supports Best Fit. The complexity of the customer's rules was not properly analysed or understood during the design phase.

    • Documentation - The design of the event management rules was not properly documented. When it became evident that the design had to be changed, the lack of documentation slowed understanding, and some of the thinking had to be repeated and the design documented properly.

    • Thresholds - The customer spent over a month trying to migrate their thresholds from PATROL to BPPM. This task was complex due to the different format of the thresholds, and the customer experienced many issues with the migration tools, which did not work properly. Managing thresholds in BPPM is not as easy as managing thresholds in PATROL (using PATROL Configuration Manager). In the end the customer abandoned the attempt to introduce BPPM Analytics; the autonomous alerts only covered 20% of the thresholds anyway, so the benefit of BPPM Analytics was not compelling.

    • Testing - The customer underestimated the time required for comprehensive testing. Testing should have been planned earlier, started earlier and resourced appropriately. At least a full month of end-to-end testing was required.

    • Technical Lead - Technical leadership was lacking through some parts of the project. Initially, the BMC Consultant was the technical lead; towards the end, an independent consultant was the technical lead. There were issues of continuity.

    Project Phases

    The project consisted of four project phases. Phase 2 and Phase 4 were optional and were not required in order for the customer to meet its audit deadline. In the end, Phase 4 was abandoned.

    Summary and Conclusion

    Component Rating (1-5 Stars)

    BMC ProactiveNet Performance Manager (BPPM) is really three products bundled into one suite, so it makes sense to rate each component individually.

    • BMC BPPM v8.6 Analytics (formerly ProactiveNet) - The product appears to have reasonably good quality control and the graphing is good. The threshold management features are poor - but BMC says this is being fixed in the next release. I am not convinced by the whole concept of using statistics: statistical analysis uses a lot of CPU, which makes scalability an issue, and only about 30% of monitored metrics are appropriate for statistical analysis. BMC's claim that this product removes the need for threshold management is an exaggeration, and 70% of thresholds will still need to be managed using absolute-value (i.e. standard) thresholds. Score: 3.

    • BMC BPPM v8.6 Event Management (formerly Mastercell) - This product is one of the strongest event management products around. There are challenges with using the MRL rule language - but generally this product works well. I question BMC's bundling of this product with ProactiveNet and would like to see it available as a stand-alone component. Developing and debugging rules is time-consuming and difficult. Only time will tell if this product continues to be a good event management platform. Score: 3.

    • BMC PATROL 7.8.10 - Twenty years ago, PATROL was the best monitoring solution of its type. Since then the product has become bloated and overly complex. PCM was a great addition and makes the management of thresholds relatively easy and repeatable. The product has not changed much in about 8 years; four years ago, BMC were going to retire it, and today PATROL is an integral part of BMC's BPPM strategy. The KMs and the breadth of monitoring save this product from a lower rating. Score: 3.

    Rating according to Capabilities (Score 1-10)

    The list below compares capability scores for the previous version (with PEM) and the latest version (BPPM v8.6):

    • Event Management: 3 (previous) / 4 (latest)
    • Threshold Management: 5 / 2
    • Analytics / Graphs: 3 / 5
    • Ease of Implementation: 3 / 2
    • Extensibility / Interfaces: 4 / 4
    • Operator Form for Blackout: 1 / 1
    • Average Score: 3.2 / 3.0

    Components

    • Previous version (with PEM): PATROL and associated KMs; PATROL Central Operator; PATROL Enterprise Manager (PEM)
    • Latest version (BPPM v8.6): PATROL and associated KMs; PATROL Central Operator; BPPM Event Management; BPPM Analytics (ProactiveNet)

    Conclusion

    The score for BPPM has not improved with this revision. The product is more complex and more difficult to implement, and thresholds are more difficult to administer. The improvement in capability associated with anomaly detection is not convincing, has not been proven to this customer, and is only relevant for 30% of parameters. BMC must work hard to improve administration and ease of implementation.

    The combination of BPPM Analytics (ProactiveNet), BPPM Event Management (Mastercell) and PATROL has the potential to be a market beating product. However, the investment required is significant. Time will tell if BMC delivers on this vision.

    Disclosure: I am a real user, and this review is based on my own experience and opinions.
    ITCS user
    Consultant with 51-200 employees
    Vendor
    TM ART to BPPM Integration – Tips and Tricks

    If you’ve been reading this blog, you probably know that BMC ProactiveNet Performance Manager (BPPM) is a centralized event management system that acts as a single-pane-of-glass for many IT Operations teams as well as other functional groups within IT organizations. BPPM attempts to bring together as much information as possible about the health of an IT organization from external tools to get an overall view of the environment.

    One good way to measure the overall health of a complex system with many moving parts is by injecting synthetic transactions and measuring their response time. BMC Transaction Management Application Response Time (TM ART) is a tool that does just that. It runs scheduled synthetic transactions from remote locations against business applications and tracks the availability, accuracy and response time of those transactions.

    Wouldn’t it be nice to see TM ART measurements in BPPM?

    Fortunately, the TM ART integration to BPPM is native to both products – no customization needed. The integration works by way of a data adapter that connects from a BPPM Agent to the TM ART Central Server using HTTP(S). Data is retrieved on a scheduled basis for all of the TM ART projects that are configured and accessible in TM ART. The data is stored in BPPM, so Intelligent thresholds can be defined to trigger events against it, just like any other data source.

    Analysis with TrueLog

    Besides sending data to BPPM, the TM ART application also runs diagnostics during failures (availability and accuracy only) and captures those as TrueLogs in TM ART. Reviewing the TrueLog can go a long way toward identifying the cause of an event that was generated in BPPM. Typically, events have an associated TrueLog that demonstrates the transaction output errors or discrepancies. Since the TrueLog is such a powerful tool for analyzing transactions, here is how you can incorporate them into your events.

    First, you need to take advantage of TM ART’s ability to execute actions in response to errors to generate a TrueLog of the transaction. This action is optional and must be enabled when creating or configuring the monitor. There are three options, as you can see below.

    Once the Generate TrueLog option is enabled in TM ART, you can take advantage of the built-in context Diagnostic in BPPM for all TMART Intelligent Events, shown below:

    Notice the name of the action called 'Run Now + TrueLog'; it does exactly what it states. It makes a connection to TM ART, logs into the UI and generates a brand-new TrueLog for the monitor in question. Since this is a manual action, the end user could be creating a new TrueLog at a different time than the event, which may or may not be very helpful. To get a more timely TrueLog from the Diagnostic, you may want to convert it into an Event Rule so that it runs automatically whenever an event arrives from TM ART. In our testing, the automated diagnostic ran between 6 and 12 seconds after the original event in TM ART.

    Cross Launching TrueLog

    If you follow the steps above, in the resulting TMART Execution Log area, you would see two TrueLogs – the one created by TMART and the one created by the BPPM Event Rule a number of seconds later. You may wonder if this duplication is necessary. So did we.

    Although it might seem intuitive to turn off the TrueLog creation in TM ART and just enable the manual diagnostic or the Event Rule in BPPM, this will fail because the BPPM actions rely on the Generate TrueLog option. Therefore, the ‘Generate TrueLog’ flag can’t be set to ‘Never’.

    A more effective approach is to add a context-sensitive link from the event in BPPM directly to the existing TrueLog in TM ART. This allows you to cross-launch from the event in BPPM to a very specific page in the Projects Execution Logs:

    Example: https://<TMART Server>:<port>/bmc/DEF/Monitoring/Monitoring?pId=8&mainTab=4

    The key variable in the above URL is the pId number (8 in the example), which you can usually parse from the mc_smc_alias slot using an MRL rule in BPPM. Once that value is known, the whole URL can be placed in the mc_object_uri slot, and it will automatically become an active hyperlink in any event with that value. The end result is a quick way to launch TM ART from an event in BPPM and get to the TrueLog for analysis.

    Disclosure: I am a real user, and this review is based on my own experience and opinions.
    ITCS user
    Consultant with 51-200 employees
    Vendor
    Upgrading BPPM – Is it too late?

    Your monitoring tools need to work properly, and to accomplish that, they must be upgraded periodically.

    With your mountain of issues, horse-choking responsibilities, and meetings out the wazoo, it’s easy to miss upgrade deadlines. But, you still need to know that the right information about the status of your environment is reaching the right person at the right time, every time. Those upgrades keep your service performance consistent and operating smoothly.

    BPPM Upgrades

    If you’re using BMC’s ProactiveNet Performance Management (BPPM) software, you know that regular upgrades are necessary to keep this flagship product working effectively. Each version since 7.7 in 2005 has needed a significant upgrade to reach the latest generally available (GA) version which is currently 9.0.

    One thing you may not know is that any version before 8.5 can't be upgraded directly to 9.0. An enterprise using an earlier version must first upgrade to 8.5, and only from that point can it be upgraded to 9.0. Also, BMC support for each BPPM version quickly moves to limited support and then expires; staying up to date is the only way to have access to support.

    Not even your mom will support you forever…

    At this writing, version 8.5 will be unsupported after October 31, 2013. Those organizations still using version 8.5 need to arrange for upgrades before that time.

    Moreover, version 8.6 can be upgraded to 9.0, but 8.6 will be changed to limited support this summer on July 31, 2013 … and will be completely unsupported after July 31, 2014.

    So what are the options for an enterprise using a version nearing the end of its support?

    Basically, there are two.

    The first, and less expensive, is to initiate an “over the top” upgrade to 8.5, then another to 9.0. The down side of this approach is the monitoring time lost during the upgrade. Since monitoring is usually needed 24/7, it can be detrimental to go offline for the time needed to do the upgrade.

    Each upgrade will result in lapsed monitoring time of a number of hours.

    Because of unknown factors such as the size of the database, the number of devices, rules and reports, as well as the number of thresholds that will be used, it’s difficult to predict the length of time monitoring will be disabled during an upgrade.

    More Challenges

    There are more issues, too. For instance, upgrading a BPPM Agent/ Integration Service (IS) before the server is upgraded makes the connection between the two obsolete. BPPM components aren’t backward compatible. By the same token, once the server is upgraded, BPPM Agents also have to be upgraded before the system will work. In a large environment, bringing all these components up to speed is even more difficult as well as time consuming.

    Then add this to the mix: the extent of customizations can have a huge impact on the upgrading process. Some of the customized files will likely be lost and have to be restored. Best practices may require updates to other related native files, too. Customizations to the knowledge base must be accompanied by careful review and documentation prior to an upgrade.

    All of these issues are about updates of BPPM alone. However, in most environments, BPPM is integrated with multiple other applications, such as Patrol, Transaction Management Application Response Time (TM ART), Configuration Management Database (CMDB), Blade, and Remedy. All of the tools in the given environment must operate smoothly together. There are strict version dependencies between each of these products that must align. In some cases, customers may be prevented from upgrading BPPM until CMDB and ITSM have been upgraded to a supported version.

    So … the big question: What is the alternative? If upgrading leads to all these complications, how is the enterprise to avoid them?

    The answer? A calculated migration.

    The Benefits of Migration

    A migration involves new hardware, installing the latest BPPM version, testing and integrating, and then gradually migrating system functionality to the new system.

    1.) A significant benefit of this approach is the new hardware. The use of new hardware, and possibly a new operating system or enhanced gold images, creates a far better platform for BPPM in the long term.

    There have been a number of enhancements added to BPPM between 8.5 and 8.6, not the least of which was the support for an external Oracle database. Changing from the native Sybase database is not possible during an upgrade, and once the upgrade is complete it isn’t an option to upgrade to Oracle later.

    The only way to move to Oracle, if that's a good decision for your organization, is to perform a complete install of BPPM.

    2.) The other benefit of a migration over an upgrade relates to the monitoring outage discussed before. The monitoring outage with an upgrade can be several hours or more, but with a migration the outage is usually not more than a few minutes.

    It is true that during the migration it's necessary to manage both systems at once, but that's usually for a short time, and afterwards your new system is up and running smoothly.

    So here’s a list of the pros and cons of each approach:

    Upgrade BPPM

    Pros:

    • No new hardware required (therefore less expensive)

    Cons:

    • No database changes allowed
    • May require multiple upgrades to reach the final version that supports direct upgrade to 9.0 (each upgrade is costly and time-consuming)
    • Unknown monitoring outage window
    • Customizations can be lost
    • Incompatibility of integrations

    Migrate BPPM

    Pros:

    • Controlled upgrade strategy / timeline
    • New fresh hardware / operating system
    • Database changes allowed

    Cons:

    • Users / operators will have to watch two consoles until full migration is complete
    • Incompatibility of integrations

    You can see that the list of pros in the migration strategy outweighs the list in the upgrade approach. However, no two environments are identical so decisions need to be made based on the best approach for you, in your environment.

    Disclosure: I am a real user, and this review is based on my own experience and opinions.
    Buyer's Guide
    Download our free BMC TrueSight Operations Management Report and get advice and tips from experienced pros sharing their opinions.