What is our primary use case?
I'm one of the administrators in our data centers. My title is Site Reliability Engineer, so I use it as an end user, to administer machines and monitor application performance.
The purpose for Nutanix, in general, was to reduce our footprint within our data centers, to scale down to a single point for all of our compute and storage, which it does very well. We're using Prism Pro to access all of the different clusters; we're able to get to them through one interface.
How has it helped my organization?
We use another monitoring tool, Splunk, to work with the information we're able to derive from the application, and we're able to retrieve data on a frequent basis. We are able to find information about different VMs, or historical data regarding those machines. That has been greatly beneficial for us in determining problems with our application: which machines move when there's an HA event and, if there's a failure, which machines were involved in the problem and where they're migrating to. It gives us a great deal of detail and it has helped improve our processes to determine where problems lie, where machines are going and what's happening with them, in near real-time. It's helped our troubleshooting process a great deal to have that information at our fingertips.
What is most valuable?
The most valuable feature for me is being able to find a machine, regardless of which cluster it's located in, as quickly as possible, and being able to work on it. A lot of times we are called upon to troubleshoot an issue. That usually means there's a problem that needs quick attention. Being able to find machines, ascertain their status, and do so in a timely manner, are processes that are very critical to our business needs.
What needs improvement?
I've used other products that are similar in nature and they can be very complex, but they have good documentation to back them up. Nutanix is no exception to that. Their documentation is quite extensive but can be challenging to read if you don't know the product firsthand. Still, it is very good at describing the features and functionality that you're looking for. But something to improve upon might be the ease of access to documentation, and helping users understand which information is going to provide the detail they need to complete their job.
The integration with Splunk is a little lacking, and this is something that we've worked on with Nutanix quite extensively in the last year or two. It didn't really have a good integration. They built some dashboards, where they were trying to kind of recreate Prism. Prism is its own utility; it works well for what it does. But it doesn't provide us quite the detail that we are looking for or the historical data that we were after. So we had to build our own custom apps for Splunk. Since doing that, we have been working with Nutanix to try and improve, to some extent, what they put out for the public. But in general, we've done some of our own customizing of our own dashboards.
So the integration itself has not been great, but the work that we have done on our own towards Splunk has been really good. On the plus side for Nutanix, the API calls that allow you to retrieve information about their product are incredible. The amount of data that you can retrieve is immense. The downside would be how to best utilize that data once you have it. That's where it's lacking, and I know that they're taking strides to improve that.
The types of data I'm referring to are CPU statistics and memory usage; when there's an HA event; where machines were located and where they're being moved to. At times, if a node fails or goes down for any reason, or there's a memory failure, it has to live-migrate those machines somewhere else. Being able to identify what those machines are, where they're going, and what impact that has on the infrastructure is a real help to someone like me. That helps me to know what the impact is going to be on our clients and how quickly we can get the system back up to a stable and fully functional state. If we had a problem with a server, being able to look back in historical data and determine what led up to that event is another use for the data. We have roadmapping graphs that show growth in storage and CPU usage, for predicting when we need to purchase more. There's quite a lot of information there that we use to help with our job.
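To give a rough idea of the "where did the machines go" question, here is a minimal sketch that compares two snapshots of VM placement and lists the VMs that changed hosts. The snapshot shape (a plain mapping of VM name to host name), the function name, and the sample names are all hypothetical; this is not the format the Nutanix API returns, just an illustration of the comparison.

```python
def find_migrations(before, after):
    """Return (vm, old_host, new_host) for every VM whose host changed.

    `before` and `after` are hypothetical snapshots: dicts mapping a VM
    name to the name of the host it was running on at that moment.
    """
    moves = []
    for vm, old_host in before.items():
        new_host = after.get(vm)
        # Skip VMs missing from the later snapshot and VMs that stayed put.
        if new_host is not None and new_host != old_host:
            moves.append((vm, old_host, new_host))
    return sorted(moves)


# Made-up sample data: node-a fails and its VMs land elsewhere.
before = {"app-01": "node-a", "app-02": "node-a", "db-01": "node-b"}
after = {"app-01": "node-c", "app-02": "node-b", "db-01": "node-b"}

for vm, old, new in find_migrations(before, after):
    print(f"{vm}: {old} -> {new}")
```

Run against snapshots taken before and after an HA event, a comparison like this would show which machines were affected and where they ended up.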
One thing I would really like for them to do is to correlate multiple machines together, multiple VMs, and get a bigger picture of CPU usage or memory usage. That's a real challenge in Prism Pro that we overcome utilizing Splunk. That might be something they could work on, but we found ways of utilizing the data that they provide already through REST or API calls and having access to it through a Splunk interface.
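As an illustration of what correlating multiple VMs can look like once the data is outside Prism, here is a small sketch that averages CPU usage across a chosen group of VMs at each sample time. The sample layout (timestamp, VM name, CPU percent), the function name, and the values are assumptions made for the example, not our actual Splunk searches or the Nutanix data format.

```python
from collections import defaultdict


def group_cpu_average(samples, vm_group):
    """Average CPU percent across the VMs in `vm_group`, per timestamp.

    `samples` is a hypothetical list of (timestamp, vm_name, cpu_percent)
    tuples; VMs outside `vm_group` are ignored.
    """
    by_time = defaultdict(list)
    for timestamp, vm, cpu_pct in samples:
        if vm in vm_group:
            by_time[timestamp].append(cpu_pct)
    return {ts: sum(vals) / len(vals) for ts, vals in sorted(by_time.items())}


# Made-up samples for two web VMs and one unrelated database VM.
samples = [
    ("10:00", "web-01", 40.0),
    ("10:00", "web-02", 60.0),
    ("10:00", "db-01", 90.0),
    ("10:05", "web-01", 50.0),
    ("10:05", "web-02", 70.0),
]

averages = group_cpu_average(samples, {"web-01", "web-02"})
print(averages)
```

The same grouping idea applies to memory usage or any other per-VM metric.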
I've been wanting them to improve and mature their Prism interface. With our utilization of Splunk, I found that we tie those together pretty well. Having them revamp the entire product to try and make it better would be a real challenge.
For how long have I used the solution?
I have been using Nutanix, in general, and subsequently Prism Pro, for the past three years.
What do I think about the stability of the solution?
By and large the stability of Prism Pro is very good.
I do feel that we seem to run into a lot of problems with memory DIMMs within the Nutanix servers. Maybe they're overly cautious, but we do seem to get frequent failures for nodes that are removed for possible memory issues, or just the possibility that there could be a memory issue. If overly cautious is a downside, they're overly cautious. But if that means that our systems perform well and we don't get errors of data corruption, then it's all for the better.
Their systems are very resilient and their uptime is very good, as they automatically live-migrate machines off to different nodes in the same cluster. They do that very well.
Having the cluster live outweighs having a single node fail, and that's the whole point of having multiple nodes. From that standpoint, the last time we had a system down because of Nutanix was probably two years ago. And the cause was a network issue, which was something outside of their control. One cluster could not talk to another cluster and it went into a panic state and started shutting down VMs. It wasn't that Nutanix went offline. We had a network issue. They went into a protective mode to protect the data. That may be leaning towards the overly cautious, but we had zero corruption with any of our actual VMs. It did bring our application down, but everything was functional once we got the network issues worked out.
What do I think about the scalability of the solution?
The scalability is fantastic. Anytime you need more hardware, you just throw it in and it consumes it and starts working with it.
The only downside is the size of the clusters. As you start growing out towards 20 or more nodes, it becomes unwieldy and slows down the administrative processes. Users and administrators have to be aware that they have to scale out their clusters in addition to scaling out nodes when they have to increase capacity. That just goes along with understanding how the systems work and where their peak performance is, and making sure that you build out correctly.
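A back-of-the-envelope way to think about that build-out is sketched below: given a total node count and a comfortable cluster size, split the nodes into evenly sized clusters. The cap of 20 is only the rough point where things started to feel unwieldy for us, not a Nutanix limit, and the function name is made up for the example.

```python
import math


def plan_clusters(total_nodes, max_nodes_per_cluster=20):
    """Split `total_nodes` into the fewest clusters that respect the cap,
    keeping cluster sizes as even as possible.

    The default cap of 20 is an assumption taken from our own experience,
    not a vendor-documented limit.
    """
    if total_nodes <= 0:
        return []
    cluster_count = math.ceil(total_nodes / max_nodes_per_cluster)
    base, extra = divmod(total_nodes, cluster_count)
    # `extra` clusters get one additional node so the sizes add up exactly.
    return [base + 1] * extra + [base] * (cluster_count - extra)


print(plan_clusters(50))
```

With 50 nodes and a cap of 20, the sketch suggests three clusters rather than two full ones and a small leftover.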
We have about 20 users of Prism Pro and they range from automation technicians to engineers to site reliability engineers, to those who actually administer the system. We have two staff for deployment and maintenance of Nutanix. Their roles are to maintain and upgrade and monitor the Nutanix infrastructure.
Our shop is 100 percent Nutanix. We do have some bare-metal servers that have functions for other applications, but all of our compute runs on Nutanix. So our use of it is rather extensive. We utilize it in all of our data centers exclusively.
How are customer service and technical support?
Their support is second to none. Anytime you have an issue, they know what they're doing. They get the right people involved and your issues are taken care of in a very timely manner. Their support is fantastic. I hate giving people a 10 out of 10, because I think there's always room for improvement, but their support is really close to a 10. They're responsive and knowledgeable. And when they don't know the answer, they quickly get to someone who does.
Which solution did I use previously and why did I switch?
Previously, we were running on Hyper-V from Microsoft. We found that it didn't suit our needs. We needed the compute, the storage, and everything under one roof, which Nutanix provides for us. Also, Nutanix's solution is more elegant than Hyper-V because you're able to bring multiple servers together into a cluster and maintain your VMs in a cluster of servers. That's as opposed to a single point of failure with one server or one array or the like.
How was the initial setup?
I wasn't deeply involved with the initial setup, but I think that it was fairly simple. I do know that anytime we need to add more infrastructure, the integration with additional nodes or adding a new chassis is extremely simple and well laid-out. They excel at that.
What about the implementation team?
We did work with an integrator and we had two sales engineers from Nutanix who assisted with that process. They were fantastic. Nutanix is a great team to work with.
What was our ROI?
I'm not privy to the numbers, but I think our ROI is quite high for Nutanix.
The main contributing factor is being able to have all of our infrastructure in one location. We use Nutanix not just for the software, the hypervisor, but for the entire solution. We're utilizing their chassis and their nodes. Having that all in one place, and being able to just add more hardware as we grow our infrastructure, is incredibly useful. It allows us to grow as we need and when we need. That alone allows us to dictate what drives our costs (when we need compute, and how much compute we need) and allows us to stay ahead of our growing client base.
In addition to that, their uptime allows us to have the performance and reliability that our customers demand.
What's my experience with pricing, setup cost, and licensing?
It's cost-effective. It's not necessarily cheap, but it's also not inordinately expensive. It comes down to how much you use it to offset some of the costs. If you're all-in with Nutanix, and you have a lot of nodes, it drives down the cost.
Which other solutions did I evaluate?
I know Hyper-V was a consideration. We may have also considered VMware.
What other advice do I have?
Do your homework and make sure to get some engineers involved at Nutanix who can assist you. You'll run into issues that they can help steer you around. Nutanix is willing to help if you are willing to ask. The system is not without its complexities. It has a lot of features and there are a lot of things that you can do with it. If you engage the professionals at Nutanix, they can steer you in the right direction. You should utilize them.
Prism Pro can be quite complex, if you want it to be. At its heart there are a lot of features available. If you utilize it for simple purposes, then you can get simple answers. The ease of use really depends on what level of technicality you want to have with it. But in general, the interface is well laid-out. There's a little bit of a learning curve in making sure you're going to the right location and knowing what you're trying to locate. But otherwise, I feel that the interface is well laid-out and intuitive to use.
Some other things they've done recently have been great, such as tying events back to documentation, which is something that they are working on right now.
The biggest lesson I've learned from using the solution is that you get what you pay for. Nutanix has been a great company to work with. As I said, their support is fantastic. If you're going to use someone for your critical business needs, make sure that it's a company that's going to stand behind you and help make your job better and easier.
Which deployment model are you using for this solution?