What is our primary use case?
Our use case: Planning for sizing servers as we move them to the cloud. We use it as a substitute for VMware DRS. It does a much better job of leveling compute workload across an ESX cluster. We have a lot fewer issues with ready queue, etc. It is just a more sophisticated modeling tool for leveling VMs across an ESX infrastructure.
It is hosted on-prem, but we're looking at their SaaS offering for reporting. We do some reporting with Power BI on-premise, and it's deployed to servers that we have in Azure and on-prem.
How has it helped my organization?
The proactive monitoring of all our open enrollment applications has improved our organization. We have used it to size applications that we are moving to the cloud. Therefore, when we move them out there, we have them appropriately sized. We use it for reporting to current application owners, showing them where they are wasting money. There are easy things to find for an application, e.g., they decommissioned the server, but they never took care of the storage. Without a tool like this, that storage would just sit there forever, with us getting billed for it.
The solution handles applications, virtualization, cloud, on-prem compute, storage, and network in our environment, everything except containers because they are in an initial experimentation phase for us. The only production apps we have which use containers are a couple of vendor apps. Nothing we have developed, that's in use, is containerized yet. We are headed in that direction. We are just a little behind the curve.
Turbonomic understands the resource relationships at each of these layers (applications, virtualization, cloud, on-prem compute, storage, and network in our environment) and the risks to performance for each. It gives you a picture across the board of how those resources interact with each other and which ones are important. It's not looking at one aspect of performance; instead, it is looking at 20 to 30 different things to give recommendations.
It provides a proactive approach to avoiding performance degradation. It's looking at the trends and when is the server going to run out of capacity. Our monitoring tools tell us when CPU or memory has been at 90 percent for 10 minutes. However, at that point, depending on the situation, we may be out of time. This points out, "Hey, in three weeks, you're not going to be looking good here. You need to add this stuff in advance."
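The difference between a threshold alert and a trend-based warning can be sketched in a few lines. This is only an illustration, assuming a plain least-squares line over daily utilization samples; it is not how Turbonomic computes its recommendations, and the function name `days_until_threshold` and its parameters are hypothetical.

```python
from typing import Optional, Sequence


def days_until_threshold(daily_util: Sequence[float],
                         threshold: float = 90.0) -> Optional[float]:
    """Estimate days after the last sample until utilization (%) hits `threshold`.

    Fits a least-squares straight line to one sample per day.
    Returns 0.0 if the last sample is already at or over the threshold,
    and None if there is too little data or the trend is flat or falling.
    """
    n = len(daily_util)
    if n < 2:
        return None
    if daily_util[-1] >= threshold:
        return 0.0

    # Ordinary least squares with x = day index 0..n-1.
    mean_x = (n - 1) / 2
    mean_y = sum(daily_util) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(daily_util))
    slope = sxy / sxx
    if slope <= 0:
        return None  # no growth, so no projected exhaustion

    intercept = mean_y - slope * mean_x
    crossing_day = (threshold - intercept) / slope
    return max(0.0, crossing_day - (n - 1))


if __name__ == "__main__":
    # CPU growing one point per day from 60%: the 90% line is weeks away,
    # so a threshold-based monitor would still be silent today.
    print(days_until_threshold([60, 61, 62, 63, 64]))
```

A real forecast would need to handle seasonality and noisy data, but even this crude line gives a lead time that a "90 percent for 10 minutes" alert cannot.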
We are notifying people in advance that they will have a problem as opposed to them opening tickets for a problem.
We have response-time SLAs for our applications. They are all different; it just depends on the application. Turbonomic has helped our ability to meet those SLAs by catching performance problems before they start to occur. We are getting proactive notifications. If we have a sizing problem and growth over a trended period of time shows that we're going to run out of capacity, we act on it rather than wait for the application team to open a ticket saying, "Hey, we're seeing latency in the application. Let's get 30 people on a bridge to research the latency." The bridge never happens and the 30 people never get on it, because we proactively added capacity before it ever got to that point.
Turbonomic has saved human resource time and cost involved in monitoring and optimizing our estate. For our bridges, when we have a problem, we are willing to pay a little bit extra for infrastructure. We're willing to pull a lot more people than we're probably going to need onto our bridge to research the problem, rather than maybe getting the obvious team on, then having them call two more, and then the problem gets stretched out. We tend to ring the dinner bell and everybody comes running, then people go away as they prove that it's not their issue. So, you could easily end up with 30 to 40 people on every bridge for a brief period of time. Those man-hours rack up fast. Anything we can do to avoid that type of troubleshooting saves us a lot of money. Even more importantly, it keeps us productive on other projects we're working on, rather than at the end of the month going, "We're behind on these three projects. How could that have happened?" Well, "Remember there was that major problem with application ABC, and 50 people sat on a bridge for three days for 20 hours a day trying to resolve it."
In some cases you completely avoid the situation. A lot of our apps are really complex. A simple resource add in advance to a server might save us from having a ripple effect later. As an example, say we have a major application, application A, and to get data it calls an API in another application, application B. Of the data it asks for, 80 percent is in application B, but 20 percent is in the next app, application C. So application B makes another API call to application C to get that data, adds it to its own data, and sends it all back to application A. Sometimes a minor performance problem in application C causes an outage in application A. Those types of problems can be a nightmare to diagnose, especially if those relationships aren't documented well. It is very difficult to quantify the savings, but if we can avoid problems like that, then the savings are big.
We are using monitoring and thresholds to assure application performance. It is great, but at the point where our monitoring tools are alerting, then we already have a problem in a lot of cases, though not always. The way we have things set up, we get warnings when resource utilization reaches 80 percent, because we try to keep it at 70 percent. We get alerts, which is kind of like, "Oh no," but we can do something about it when the applications are at 90 percent. The problem is there are so many alerts and it's such a huge environment. Because there is too much work going on, they get ignored. So, they can work into the 90s, and you end up a lot more often in a critical state. That's why the proactive monitoring of all our open enrollment stuff is really beneficial to us.
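The thresholds described above (warnings at 80 percent, alerts at 90 percent) can be written down as a small sketch. The function names and the idea of returning only the most urgent servers first are assumptions for illustration, not a description of our monitoring tools.

```python
from typing import Dict, List, Tuple


def classify_utilization(pct: float,
                         warn_at: float = 80.0,
                         alert_at: float = 90.0) -> str:
    """Map a utilization percentage to 'ok', 'warning', or 'alert'."""
    if pct >= alert_at:
        return "alert"
    if pct >= warn_at:
        return "warning"
    return "ok"


def triage(servers: Dict[str, float]) -> List[Tuple[str, float, str]]:
    """Return only servers in 'warning' or 'alert', highest utilization first.

    Sorting by utilization is one simple way to keep the worst cases
    from being lost in a long list of notifications.
    """
    flagged = [
        (name, pct, classify_utilization(pct))
        for name, pct in servers.items()
        if classify_utilization(pct) != "ok"
    ]
    return sorted(flagged, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical server names and utilization figures.
    for row in triage({"app01": 72.0, "app02": 93.5, "app03": 84.0}):
        print(row)
```

Even with ordering, this stays reactive: every server on the list is already past its threshold by the time it shows up.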
What is most valuable?
You have different groups who probably use almost everything. We use it for sizing of servers, and if somebody feels like their server needs additional resources, we validate it with the solution. We have a key part of the year called "open enrollment", where we really can't afford anything to be down or have any problems. We monitor it on a daily basis, and contact server owners if Turbonomic adds a forward-looking recommendation that they are running low on space. So, it keeps us safe. It is easy to monitor the virtual infrastructure and make sure there is capacity. However, with the individual VMs, in production alone, there are 12,000 of them. How do you keep up with those on an individual basis? So, we use Turbonomic to point out the individual VMs that are a little low.
Turbonomic provides specific actions that prevent resource starvation. They make memory recommendations and are very specific about recommendations. It looks at the individual servers, then it puts them in a cluster. At the end of the day, it comes back, and goes, "I can't fit these on here. There's not enough I/O capacity." Or, "There's just not enough memory, so you need to add two hosts."
What needs improvement?
For implementing the solution’s actions, we use scheduling for change windows and manual execution. The issue for us with automation is that we are considering starting to do hot adds, but there are some problems with Windows Server 2019 and hot adds. It is a little buggy. So, if we turned that on for a cluster that has a lot of Windows Server 2019 machines, we would see blue screens, and a lot of applications would be affected as well. Depending on what you are adding, cores or memory, the server doesn't necessarily even take advantage of it at that moment. A reboot may be required, and we can't do that until later. So, that decreases the benefit of real-time execution. For us, there is a lot of risk with real-time.
You can't add resources to a server in the cloud. If you have an Azure VM, you can't go add two cores to it because it's not going to have enough processing power. You would have to actually rebuild that server on top of a new server image which is larger. They have certain sizes available, so instead of an M3, we can pick an M4, but then we need to reboot the server and have it come back up on that new image. As an industry, we need to come up with a way to handle that without an outage. Part of that is just having cloud applications built properly, but we don't. That's a problem, but I don't know if there is a solution for it. That would be the ultimate thing that would help us the most: if we could automatically resize servers in the cloud with no downtime.
The big thing is the integration with ServiceNow, so it's providing recommendations to configuration owners. So, if somebody owns a server, and it's doing a recommendation, I really don't want to see that recommendation. I want it to give that recommendation to the server owner, then have him either accept or decline that change control. Then, that change control takes place during the next maintenance window.
For how long have I used the solution?
What do I think about the stability of the solution?
Because of the size of our company, earlier versions were slow. However, they rearchitected the product about a year or 18 months ago and containerized parts of it, so we could expand and contract. Performance has been good since then.
I've a couple of guys who support it. We upgrade six or seven times a year. We are upgrading fairly often, so we are very close to current.
We have one guy spending maybe three weeks of the year doing upgrades. The upgrades are easy and fairly frequent, but there are almost always enhancements with these releases.
There are probably 50 people using it now. There are a handful who use it almost every day for sizing and infrastructure. We have a capacity management team who uses it all day long, every day. There are also multiple cloud teams and application teams who have been given access, so they can use it to appropriately size and work on their own applications. We are in the process of automating that to get that data out to everybody. There are a lot of other key teams who have found out what we were doing, and are like, "Can we have access to it now? So, we don't have to wait?" We are like, "Sure."
What do I think about the scalability of the solution?
The scalability is good. I don't see any issues at all.
We were initially on the high-end of their customers. We ran two instances of it for a while, just because there was a limit of like 10,000 devices per system, and we were significantly past that.
Just from a server perspective, we are running about 26,000 servers right now, where 97 to 98 percent are virtualized. One person can't get a handle on that. Even figuring out what direction to look, you need to have tools to help you.
How are customer service and technical support?
The technical support is good. We actually rarely call them. We have done quite a bit of work with them. Because of the number of purchases, they provided a TAM to work with us. So, we have kept that TAM around on an ongoing basis. We pretty much just call them, and they handle any support issues. From a support perspective, it has been one of the better experiences.
If it stops doing its thing and moving VMs around, it will be many days before it is going to have any impact on the environment, because everything is configured so well. From that perspective, it is an easier application to support than if you have a VMware host crash and trap a bunch of VMs on it.
Which solution did I use previously and why did I switch?
We started using Turbonomic as a replacement for VMware DRS, which handled the VM placement.
We knew we were having some performance issues and ready queue problems that we felt could be improved. We worked with VMware for a while to tweak settings without a lot of success. So, we saw what Turbonomic said that they could do. We tried it, and it could do those things, so we bought it.
From a compute standpoint, Turbonomic provides us with a single platform that manages the full application stack. When we originally started, we were primarily looking for something that would make better use of our existing infrastructure. Because it does a much better job of putting VMs together on hosts, we were able to save money immediately just by implementing it. At the time, we were non-cloud. There was a period of time where we just couldn't put anything into the cloud for security reasons. We have moved past that now and are moving to the cloud. This solution has a lot more use cases for that, e.g., sizing workloads for the cloud and monitoring workloads in the cloud.
How was the initial setup?
It's incredibly easy to set up. It took a couple of days. You spend more time building servers and getting ready for it.
It gathers its own data from vCenter. It doesn't touch the actual servers at all. Same thing with the different cloud vendors. It looks at your account information. It doesn't actually have to touch the servers themselves.
As far as the product goes, it's not agent-based. It can gather information and start making recommendations within two or three days, then better recommendations within a week. After that, you're good. It doesn't get much easier.
What about the implementation team?
We did the implementation ourselves. It took one guy to deploy it.
My group built a couple of the VMs that we needed and installed it. It took a couple of days. As far as gathering information, you don't have to put agents on any servers or anything like that. You give it a user ID for vCenter. We have multiple vCenters.
What was our ROI?
The open enrollment applications are all mission-critical apps. If they go down, then the clock starts ticking on its way to seven-digit sales losses. It helps us avert situations like this multiple times a week. We are constantly using it to watch and notify application owners. If we don't use Turbonomic for this, then what would typically happen is the node recommendations that they would get from Dynatrace would start showing them that there is latency in their app. If they started digging into Dynatrace, then it would come up, going, "I'm running at 90 percent CPU all the time. I better get some more CPU." Well, Turbonomic tells us two weeks before that happens, that, "We need to be adding CPUs." So, it has a proactive nature. There are a lot of other tools in play that are monitoring what is happening. For our managers, Turbonomic helps us figure out what is going to happen.
We use Turbonomic to help optimize cloud operations, and that has reduced our cloud costs. We have a lot of applications that we run which are very cyclical. In the fourth quarter of the year, they get the crap beat out of them. The other three quarters of the year, they are not used a whole lot. Without Turbonomic, would the application be appropriately resized for those other nine months of the year? Probably not.
It has helped save cloud costs by seven figures.
The tool itself is not free, but it's easily a positive ROI. It's hard to measure the benefit of just doing the DRS and optimizing our virtual infrastructure. I just can't stress enough how much better a job it does of stacking VMs onto a set of ESX infrastructure. If you're using Turbonomic and looking at a cluster, you will see pretty much even utilization across a set of hosts. If you let VMware manage it, you will see one host at 95 percent, then another at five percent. Everything is running fine, and that's all they care about. However, if something starts going wrong on the host that is running at 95 percent, then you may see some degradation as VMs, like rats leaving a sinking ship, try to get out to that five percent host. Because it does a better job of balancing things, it utilizes infrastructure better, so you need fewer servers to host the same number of VMs.
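To make "even utilization across a set of hosts" concrete, here is a toy placement sketch. It assumes a single load number per VM and uses a simple greedy rule (place the largest VM on the currently least-loaded host). This is not Turbonomic's model, which weighs many more factors, and the names `balance_vms` and `vm_loads` are hypothetical.

```python
import heapq
from typing import Dict, List


def balance_vms(vm_loads: Dict[str, float], host_count: int) -> List[List[str]]:
    """Greedy placement: largest VM first, always onto the least-loaded host.

    Returns one list of VM names per host, indexed by host number.
    """
    # Min-heap of (current total load, host index).
    heap = [(0.0, i) for i in range(host_count)]
    placement: List[List[str]] = [[] for _ in range(host_count)]

    for name, load in sorted(vm_loads.items(), key=lambda kv: kv[1], reverse=True):
        total, idx = heapq.heappop(heap)  # least-loaded host so far
        placement[idx].append(name)
        heapq.heappush(heap, (total + load, idx))
    return placement


if __name__ == "__main__":
    # Hypothetical VMs with CPU demand in arbitrary units.
    loads = {"a": 40, "b": 30, "c": 20, "d": 10}
    print(balance_vms(loads, host_count=2))
```

With these four VMs on two hosts, both hosts end up carrying 50 units, instead of one host holding most of the load while the other sits nearly idle.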
We have probably reduced our server purchase by a million dollars, just having Turbonomic manage the VDI infrastructure. Before they were static, so they just put an X number of VMs on each host, e.g., there are 70 VMs on that one, then it goes onto the next one. If we saw hotspots, then we would manually try and move a VM or two around.
We are using Turbonomic now to manage that and the supercluster feature that lets us migrate across clusters, which is really key for the VDIs, because we had infrastructure that wasn't well utilized 24 hours a day. So, we were buying lots of extras. The reason for that was we have developers in India, tons of people offshore, and people in the Philippines. As those people come and go, the utilization of different clusters shifts radically. So, if you're trying to have enough infrastructure to manage each cluster individually, then it takes a lot more than if you're managing it as a whole. That is one of the things that we use it for.
What's my experience with pricing, setup cost, and licensing?
When we have expanded our licensing, it has always been easy to make an ROI-based decision. So, it's reasonably priced. We would like to have it cheaper, but we get more benefit from it than we pay for it. At the end of the day, that's all you can hope for.
We paid for our TAM, but I'm sure it's embedded in the cost. However, that's optional. Obviously, you can do it all yourself: Open all your own support tickets and just send in an email to your TAM. Our TAM has access to log in, because she's set up as a contractor for us. So, she can actually get in and work with us.
Which other solutions did I evaluate?
There weren't a lot of other options available at the time, but we did look at three others. I know there are other companies on the market. I don't remember which ones were competing with it at the time. There was only really one other in that space at the time, and there's a bunch now. Then, VMware was there competing as well, saying, "You just don't have it configured right. We can do better," but they really couldn't.
The model behind the scene that Turbonomic uses to make decisions just has a better way of balancing resources. It considers a lot more factors.
We use other tools to provide application-driven prioritization, to show us how top business applications and transactions are performing.
What other advice do I have?
Unfortunately, a lot of our infrastructure in the cloud is still legacy. So, we can't make full use of it to go out and resize a server, because it will bring the application down. However, what we are doing is setting up the integration with ServiceNow. This puts a change control out to make the recommended change, and the owner of the server can approve that change. Then it will take place within a maintenance window.
We don't manage resources in real-time. Most of our applications just don't support that. We don't have enough changes required that it would be mutually beneficial to us, so we aren't doing that yet, but we're headed in that direction.
It would be a big stretch for us to actually use Turbonomic to take resources away from servers. Our company has a philosophy, decided four or five years ago, that the most important thing for us is for our applications to be up. So, if we waste a little money on the infrastructure to bolster applications when there is a problem, that is okay. We even have our own acronym for it: margin of error (MOE). Typically, we are looking to have at least 30 percent free capacity on any server or cluster at any given time, which is certainly not running in the most efficient way possible, but we're okay with that. While we may spend three million dollars more a year on infrastructure, an hour-long outage might cost us a million dollars. So, if there is a major problem with big performance degradation, then we want to have the capacity to step up and keep that application afloat while they figure out the issue.
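The MOE rule and the cost trade-off behind it come down to simple arithmetic, sketched below. The figures are the rough ones quoted above (30 percent free capacity, about three million dollars a year in extra infrastructure, about a million dollars per outage hour); the function names are made up for the example.

```python
def has_margin(used: float, capacity: float, moe: float = 0.30) -> bool:
    """True if at least `moe` (as a fraction) of capacity is still free."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return (capacity - used) / capacity >= moe


def breakeven_outage_hours(extra_spend_per_year: float,
                           outage_cost_per_hour: float) -> float:
    """Outage hours per year that the extra headroom must prevent to pay for itself."""
    return extra_spend_per_year / outage_cost_per_hour


if __name__ == "__main__":
    print(has_margin(used=60, capacity=100))   # 40% free: meets the 30% MOE
    print(has_margin(used=80, capacity=100))   # 20% free: below the MOE
    print(breakeven_outage_hours(3_000_000, 1_000_000))
```

On those numbers, the extra capacity pays for itself if it prevents about three hours of outage a year.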
It projects the outcome of if you are going to move from one set of infrastructure to another, then it will make a recommendation. For example, if I'm moving from one type of server to another type of server where there are different core counts, faster cores, and faster memory, then it will tell me in advance, "You need fewer resources to make that happen because you are moving to better equipment."
Biggest lesson learnt: What you should do is obvious; it is just difficult to get people to do it. You need to have servers grouped and reported up to an executive level that can show the waste. Otherwise, you are working with server owners who have multiple priorities. They have a release that's due in two weeks which will impact their bonus at the end of the year, etc. If you hit them up and go, "Hey, you're wasting about a thousand dollars a week on this server, and more on the others, so we need to resize them," they don't care. On an individual application or server basis, it's not a big deal. However, across a 26,000-server environment, $10,000 here or there pretty quickly becomes real money. That is the biggest challenge: competing priorities. You have one group trying to manage infrastructure for the least possible amount while getting the best performance, and you have other people who have to deliver functionality to a business unit. If they don't, the business unit will lose a million dollars a day until they get it. Those are tough priorities to compete with.
Build that reporting infrastructure right from the beginning. Make sure you have your applications divided up by business unit, so you can take that overall feedback and write it up when you are showing it to a senior executive, "Hey look, you are paying for infrastructure. You are spending a million dollars more a month than you should be."
I would rate this solution as an eight (out of 10). It is a great app. The only reason I wouldn't give them a higher rating is from a reporting standpoint. That's just not their focus, but better reporting would help. We use an app called Cloud Temple with them, who is actually a partner of theirs. Turbonomic will tell you reporting is not what they see as their core competency, and they are going to take actions to optimize your environment. However, at the same time, they have done these partnerships with another company who does better reporting.
Which deployment model are you using for this solution?