What is our primary use case?
We use every component of the Server Automation tool except for physical provisioning. We use it for compliance on servers and remediation. We do application installation, patching, server builds. We have an external SaaS tool that we actually use to build the framework of the server, and then TrueSight Server Automation is used to push down the post-build steps: which ancillary applications are needed and what's needed for operational support. We do qualifications of our servers through TrueSight Server Automation. We do configuration management, data collection, inventory reporting. Pretty much everything.
How has it helped my organization?
Take patching as an example. Prior to using TrueSight Server Automation, we used SMS for patching. It was very manually intensive. Every month we had one week where we had seven implementers who were on call the whole time executing these jobs. There was no scheduling. They were triaging on the fly, trying to fix issues. It was very expensive and it was demotivating for the employees - knowing they had to do this every month. They were scheduling their vacations around it. It was rough. It's not the position you want to be in.
Once we got TrueSight Server Automation in, and we were able to take a step back and re-analyze our process, we noticed that it provides the capabilities to move to a more automated process. Now the data is all driven from the CMDB, which is owner-controlled data, not IT-controlled data. So the owners get to tell us when we're going to do this effort, and if they want to make a change, they change it in the source, and we then reflect that into all of our automation processes without any manual intervention.
Now, right before patch week starts, I have an automated job that schedules all of the jobs for patching. We've created a set of triage scripts that we've handed down to operators - not even operations staff, but literally operators - who manage all of our patching process now. They're the ones that do the analysis of what the issues are. They follow their triage scripts. They find issues. They know what to do to execute. If there are outliers, there are on-call people they would call, which doesn't happen too often. We've been able to take this very heavy manual process and turn it into a fully automated process which we've been able to hand down to lower-tier staff who are going to be on call anyway. They're already there. Now our staff can schedule their vacations and they can have a life outside of IT.
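The scheduling step described above can be sketched roughly as follows. This is a minimal, hypothetical model: the record fields and function names are illustrative, not the tool's actual schema or API. In the real environment, the records would come from the CMDB and the jobs would be created through the automation tool.

```python
from dataclasses import dataclass

# Hypothetical, simplified CMDB record: the patch window is owner-controlled
# data, so changing it in the CMDB automatically changes the schedule.
@dataclass
class CmdbRecord:
    server: str
    patch_window: str  # owner-defined, e.g. "Sat 02:00"
    os: str

def schedule_patch_jobs(records):
    """Group servers into patch jobs keyed by their owner-defined window."""
    jobs = {}
    for rec in records:
        jobs.setdefault(rec.patch_window, []).append(rec.server)
    return jobs

records = [
    CmdbRecord("web01", "Sat 02:00", "linux"),
    CmdbRecord("db01", "Sun 03:00", "windows"),
    CmdbRecord("web02", "Sat 02:00", "linux"),
]
# Each window becomes one scheduled job covering all of its servers,
# regardless of operating system.
jobs = schedule_patch_jobs(records)
```

The point of the design is that no human re-enters scheduling data: the owner edits the source record, and the automated job picks it up the next time it runs.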
We also took it one step further and created a portal site so that when a user logs in, they're presented with any of the servers they own or support, again based on CMDB data. We give them the ability to enable/disable patching. They can initiate reboots on their servers. We've also taken it beyond just patching, to controlling the patching process without IT intervention. So if the Exchange group says, "Oh, we're doing this big maintenance procedure this weekend. We can't patch our servers," they can go to this site, disable patching for a whole block of servers, give their justification, and it just happens. The only user involved is the owner who made the initial request.
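The portal's opt-out behavior could be sketched like this. It's a minimal, hypothetical model (the real portal persists this state and feeds it back into the scheduled jobs); the names are illustrative only.

```python
# Owner-initiated opt-outs, set through the portal: server -> justification.
disabled = {}

def disable_patching(servers, justification):
    """An owner disables patching for a block of servers, with a reason."""
    for s in servers:
        disabled[s] = justification

def servers_to_patch(all_servers):
    """The scheduler honors the opt-outs without any IT involvement."""
    return [s for s in all_servers if s not in disabled]

disable_patching(["exch01", "exch02"], "Exchange maintenance this weekend")
remaining = servers_to_patch(["exch01", "exch02", "web01"])
```

The justification is recorded so there's an audit trail, but the only person in the loop is the owner who made the request.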
And this solution has helped reduce IT ops costs. It's tough to estimate by how much. The tool has been in place in our company for around nine years. There was very heavy adoption at first. Millions of dollars were saved with some of the processes. What's really hard to estimate is that when we came in, there were 1,000 servers. We had no automation tool. We couldn't do compliance. We only checked whether we were meeting our standards when an auditor requested it, doing these tasks on demand. We always found problems. Then we were trying to fix them at the last minute so that we could present the auditors with something clean.
To be able to create a compliance job that's going to identify and fix this content ahead of time has reduced a whole lot of man-hours. We've really looked more at our time savings than our cost savings. At the end of the day, if we're saving time on having operations staff doing some repeatable event, we can reallocate them to do something else. I don't really see the cost savings, I see the time savings. And then we can have them working on things that are more towards the level that they should be working at, building more content.
Among the operations staff, in the first year alone, we probably saved 6,000 hours. We were then able to increase that. It's at a pretty set level now. We're very mature in the product, so it's now just a matter of utilizing the content we have. Now we just get efficiencies from not having to manually log in to a server and install software. We still get some time savings, but we don't really build metrics around those anymore.
What is most valuable?
Among the most valuable features are its flexibility and ability to work across multiple operating systems. I can execute some form of data collection without having to worry about whether I'm working on a Linux box or a Windows box, or about the underlying OS. I can do these collections, get the results, and put them together in a uniform format, which makes it easier to present back to management. That way we can track issues and have near real-time results, allowing for the time the collection takes.
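The normalization step could look something like this. It's a hedged sketch: the raw field names are invented for illustration, and real collections would return far richer data.

```python
# Hypothetical sketch: raw collection results differ by OS, but get
# normalized into one uniform record format for management reporting.
def normalize(raw):
    """Map OS-specific fields onto one common shape."""
    if raw["os"] == "windows":
        return {"host": raw["ComputerName"], "os": "windows", "value": raw["Data"]}
    return {"host": raw["hostname"], "os": raw["os"], "value": raw["output"]}

results = [
    {"os": "windows", "ComputerName": "WIN01", "Data": "enabled"},
    {"os": "linux", "hostname": "lnx01", "output": "enabled"},
]
# Every record now has the same keys, so one report covers both platforms.
uniform = [normalize(r) for r in results]
```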
We're a very heavy user of the patching methodology. We use their CMDB as well. We have ties into that to provide data or attributes for each of the servers. And we have a fully automated patching process which has been absolutely phenomenal. Again, it's across operating systems, so we can patch Windows systems and Linux systems at the same time, in the same windows. It impacts different people, but it's really seamless in terms of who's working with the system itself. That provides a huge benefit because it makes the post-patch application reboot stack easier to handle. If we have an app that comes down during patching, we can shut down the web services, shut down the mid-tier, shut down the database, and bring them all back in the appropriate order, and the operating system doesn't matter. We can handle it all in one application. So it's more efficient and easier to track in case there are issues.
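The stack-ordering logic described above can be sketched as follows: shut the tiers down top to bottom, patch, then bring them back bottom up, regardless of each tier's operating system. This is a simplified illustration, with hypothetical tier names and a pluggable step runner.

```python
STACK = ["web", "mid-tier", "database"]  # shutdown order, top to bottom

def patch_stack(run_step):
    """Orchestrate an OS-agnostic stack patch via a caller-supplied runner."""
    for tier in STACK:                # web -> mid-tier -> database
        run_step("stop", tier)
    run_step("patch", "all")
    for tier in reversed(STACK):      # database -> mid-tier -> web
        run_step("start", tier)

# Record the actions to show the ordering; a real runner would execute
# jobs against the actual servers.
log = []
patch_stack(lambda action, tier: log.append((action, tier)))
```

Because the ordering lives in one place, a mixed Windows/Linux stack comes back up in the right dependency order every time.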
Compliance is also huge. We're getting ready to take an even deeper step into it. BMC provides out-of-the-box templates for CIS compliance and PCI compliance. We've been looking hard, with our cybersecurity team, at the CIS compliance. These packages even provide remediation for some of these components. By tying it to Atrium Orchestrator, our workflow tool, we'll be able to have a closed loop where we identify a compliance issue, cut CRs, get them approved, and then be able to execute these CRs and more seamlessly fix these issues on the fly. We might schedule an execution of the job, which would then cut a CR. When that CR is approved, we can just go execute it on whatever scheduled approval time we have. The only person who's involved is the approver, defining when we can do it. We're really looking hard into integrations with other tools, especially our change management to be able to kick off automation and execute.
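The closed loop described above could be modeled like this. It's a hedged sketch of the workflow, not the vendor's API: a compliance scan finds an issue, a change request (CR) is cut, and remediation executes only once the CR is approved. All class and state names here are illustrative.

```python
class ChangeRequest:
    """Minimal stand-in for a CR in the change-management system."""
    def __init__(self, issue):
        self.issue = issue
        self.state = "open"

    def approve(self):
        self.state = "approved"

def remediate_if_approved(cr, remediate):
    """Only run the fix once change management has signed off."""
    if cr.state != "approved":
        return "waiting for approval"
    remediate(cr.issue)
    return "remediated"

fixed = []
cr = ChangeRequest("ssh root login enabled")   # found by a compliance scan
status_before = remediate_if_approved(cr, fixed.append)
cr.approve()                                   # the only human step
status_after = remediate_if_approved(cr, fixed.append)
```

The design goal is exactly what the text describes: the approver defines when the fix can run, and everything on either side of that approval is automated.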
What needs improvement?
I would like to see a better methodology for handling REST calls and integration into the APIs. They add new APIs as they add functions, but they've missed some from older components which they still haven't added in. Some of the APIs are there but the CLI calls are not there. I do a lot of development work. We do a lot of very deep, customized work. So that makes it a little harder.
I would also like to see more integration with other vendors, like automation out of Splunk or working with a vendor like Datadog for monitoring. I would like to be able to easily integrate with their tool to be able to initiate automation from monitoring events found with other vendors. I've found that although the tool is very powerful, and you can build all kinds of integrations yourself, there's a lot of upfront configuration to get them working with these vendors for which they've not built integrations. So although it's possible, it's a little more complicated than it should be. They should have these frameworks already built out to make it easier.
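What the missing out-of-the-box integration would do can be sketched roughly like this: map an incoming monitoring event (from a tool like Splunk or Datadog) to an automation job and trigger it via a REST call. The endpoint path, event types, and job names are all hypothetical; the `post` function is stubbed here, where a real integration would POST to the automation tool's API.

```python
# Hypothetical mapping from monitoring event types to automation jobs.
EVENT_TO_JOB = {
    "disk_full": "cleanup-temp-files",
    "service_down": "restart-service",
}

def handle_event(event, post):
    """Translate a monitoring event into a job execution via REST."""
    job = EVENT_TO_JOB.get(event["type"])
    if job is None:
        return None  # no automation defined for this event type
    post(f"/api/jobs/{job}/execute", {"target": event["host"]})
    return job

# Stubbed transport records the call instead of hitting a real endpoint.
calls = []
job = handle_event({"type": "disk_full", "host": "web01"},
                   lambda url, body: calls.append((url, body)))
```

This is the kind of glue the review says is possible today but requires a lot of upfront configuration; a pre-built framework would ship this mapping layer ready-made.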
What do I think about the stability of the solution?
The application itself has been very stable. In the beginning, the upgrade processes were very manual and prone to error. Over the past few years, they've made a lot of changes which have really streamlined that and helped it quite a bit.
Periodically they introduce some regression bugs around cleanup of the databases, which is always a concern, so we do heavy testing in Dev to make sure that hasn't happened. We've found some issues that fortunately didn't affect production but which could have.
But for the most part, the application itself is very stable. We restart the app when we patch our servers every couple of months and that's really about it.
Agent-wise the Linux/Unix agents are extremely reliable. They're very stable.
On the Windows side, it was not as good. It's gotten better. We probably still have six or seven percent of our agents with issues, about 400 or 500. The issues are almost all technical. The install creates a service account, and sometimes that service account's password will become corrupt so the agent can't actually start. That's the biggest one we run into. The rest is mostly self-inflicted. Installation requires admin access. I have processes for how the account is supposed to be created and the agent installed, but people don't follow them. So we run into issues there. There are also upgrades: this last one was very bad. About ten percent of my agents failed updates due to hung MSI processes. We had to manually log in and clean that up. Prior to that we actually had really good success, so I'm hoping this is just an outlier due to their rebranding work.
What do I think about the scalability of the solution?
It's very scalable. In fact, they've actually introduced all kinds of optimizations. We originally started our environment out with eight application servers, and we had a job server on each one, so we had eight job servers and four authentication servers. We had a lot of processing power. But over time it's actually gotten more efficient and works better. Now we're actually down to four. We've been able to scale back our infrastructure while still maintaining the same level of job processing. We used to have 50 job threads per job server. Now we can go up to 100. Using those efficiencies and benefits that they've added, we've been able to reduce our infrastructure. We've been very pleased with that.
How are customer service and technical support?
Their level-two tech support is fantastic. Their level-one support can be very iffy. They immediately come in thinking you know nothing about the product, although I've already gone through a bunch of triage steps. The most frustrating part is that when I open a ticket, I can't attach screenshots to it. I'll get a notification of the ticket, and then I'll reply back with my addendums: here's a screenshot of what's going on, here's the log data. There are a couple of them who will reply back - without even looking at the ticket - and say, "Can you send me a screenshot and this log data?" And I'm thinking, "Just look at the ticket. I did that three hours ago."
There are a few like that, but they have some really good support people, too. I really like working with their patching guy, Joe. He's fantastic. And John in reporting is phenomenal. So they've got some really good resources. Fortunately, I've worked with them long enough that the issues I bring are normally really deep issues, so I get escalated to engineering fairly quickly, and that's been very helpful as well.
Which solution did I use previously and why did I switch?
Our company used SMS before BMC. They switched because they wanted a tool that was going to be more encompassing. SMS handled only Windows. They had no automation at all for the Linux/Unix environment. They were looking for something that was going to be cross-platform, one place to go, and not have to deal with multiple tools.
If we were to replace just the patching process, we'd have to go with WSUS for Windows, and we'd have to stand up Satellite for RedHat. Then we'd have to set up a SUSE patching tool, and others. There are multiple components and tons of infrastructure that would have to come into play just for one function. Having one tool that can cover a lot of different topics, with one place to go, is very beneficial. That was one of the primary drivers.
How was the initial setup?
The initial setup was fairly standard.
I've done a lot of installs of the software. There are certain strategies that we developed in the post-sales consulting we did - as opposed to pre-sales - for how we would set up environments. We did that because it made it easier for support to support the product because they knew that we had a consistent methodology for how we built out the environment. We followed the standard playbook that we had in consulting. We based the size of the environment on the number of servers and estimated job counts, to be able to hit the thresholds.
It's been a while since I've been there. I've been with this company for eight years now, but at the time we had a really good setup, playbooks for how we did the initial installations.
What about the implementation team?
BMC consultants came in and did the original installation. At that time, I was one of the consultants, so the company's experience with the consultants was very good. My part of the engagement was actually installing the Network Automation software. But there were two other consultants who came in to do the Server Automation software. They were colleagues of mine and we all helped each other as we went through the project.
After the initial implementation was done, I was hired on to do additional consulting because I knew both products. After that contract ended they were trying to fill a gap. That's when they opened the position, and I moved to this company.
What was our ROI?
We definitely have a great return on investment.
What's my experience with pricing, setup cost, and licensing?
We've had an ELO for a long time, a licensing agreement that was a three-year rolling contract. We did that for two three-year periods and that included all the BMC products.
We're currently doing assessments of all the tools we use. We've got a new VP in and he wants all new assessments. So we went to a year-to-year contract to finish the assessments and figure out our long-term strategy. Right now, we're year-to-year on a maintenance contract.
Which other solutions did I evaluate?
We're not evaluating other solutions in the automation space. I'm looking at a monitoring replacement. We have the BMC monitoring tool today, but it's a very old version, and we've not been able to get it successfully upgraded. So we're going to replace monitoring. It's not an if - we're going to. We've got a project running right now to select that vendor. BMC is obviously one that we're looking at.
What other advice do I have?
My advice would be: Don't allow yourselves to become the CMDB. Don't be your database of truth. Make sure that that data is properly managed in the appropriate tool. TrueSight Server Automation is not a CMDB. It is a resource that can provide that relationship data, but at the end of the day, that's not the business you want to be in. Focus on automation, and let the tool do what it does best.
The other piece of advice which I feel is just as important is that TrueSight Server Automation is not a monitoring tool. Don't allow someone to create automation and run a scheduled job every 15 minutes just because it can perform that function. That's not the right use of the tool. Use the right tools for the job. You can use your monitoring tool to find what you're looking for and then trigger automation off of that, but use the right tool for the right job.
The biggest lesson I've learned from using this solution is that having all of your eggs in one basket is good, but when things start to go south it can be very impactful. If you have a solution that's as encompassing as this, make sure you have a very good DR strategy, so that if something does go wrong, you can recover from it very quickly, because the tool touches so much. Now, after some issues, we have a DR environment in another data center. It's about half the size of our standard environment, so it can cover the bulk of what we do day-to-day. We have too many business-critical processes tied to the automation and scheduled content. If we do have an actual outage - and we've had that before - it does impact our business. So have a good DR strategy.
I don't think anything's perfect, so I would rate this solution at eight out of ten. It's very powerful. It's very flexible. There are some components of it that need to be modernized. Their REST API calls are still very immature. They have very limited integrations as I was mentioning earlier. The agents themselves are listening-agents only, so there's no active component. There's no self-healing. If an agent goes down, I don't know until I go check the agent to see if it's down. There's some modernization that I'd like to see around that. I know that's on the roadmap. I don't know where they're at in the development process, but I know it's on the roadmap, and I'm looking forward to those capabilities.
If they get all of the modernization of the agents up, then I would actually increase it to a nine. But again, nothing's perfect, so I would never give anything a ten.