VMware SRM Review

Enables us to get a lot of server images successfully but it has connectivity issues with auto-recovery


What is our primary use case?

We are in technology and services but we also do enterprise architecture and strategic planning. We always work on the customer side, but we work very closely together with key partners and key vendors in the industry. This includes VMware, but other vendors as well. We realize solutions on the customer's behalf and we are also always solution-oriented and committed to delivering what the client needs. That is why we work intensively and closely with vendors like VMware.  

With VMware SRM, we had a technical account manager in place before we started with the product, and level three support on standby in case we encountered issues. As it happened, we encountered a lot of issues.

We integrated the product at the same time partly because of discovery and partly because we want to stay vendor agnostic. We work with whatever the client has if it is a viable product. One might be using Hyper-V and another one might be using KVM (Kernel-based Virtual Machine) or Xen Project or AHV (Acropolis HyperVisor). We treat them equally to do what they need and also work with other parties, like Red Hat or Nutanix or whatever other solutions are necessary. Of course, we take our experiences from every client and every project with us on to the next opportunity.  

What is most valuable?

What I like the most about SRM is the delta sync. We typically approach a project from an architecture perspective and we do service grouping. For example, take a situation where we plan to do a migration. We decide to go with a setup where there is a front-end portal server, there are duplication servers and there is one back-end database server. This means there are four separate VMs each representing one particular service. To get the services across, we have to wait until we have the full image replication complete. By the time we kick it off, the replication has already begun to trickle in. You can parameterize a little bit. When you really want to do the migration — probably during a service outage on the weekends as it is for production — the majority of the data is already migrated to the other side. That helps a lot because you do not need to have a tremendous service outage with this model compared to doing it in a more traditional way.  

Of course, VMware SRM is not the only solution that is capable of doing this anymore. But if you have a heterogeneous environment, where the environments are not equal on both sides, this solution can be an advantage. In our situation, we had completely different technical specs and technology foundations at the source and target. In this case, the product really is an enabler, on the condition that you have the same hypervisor on the other side.

What needs improvement?

I would say a lot could be changed to improve the product in terms of troubleshooting and supportability. I think about every two weeks, we had an incident somewhere in the software stack. There were problems that we faced with the vRA (vRealize Automation) multiple times. We had to fix the problem and redeploy it more than once to get it to work properly. Then we had to completely redo our replication. That is a big drawback because it means we had to cancel other plans that had already been scheduled.  

To summarize briefly: the software needs substantial improvements in quality and functionality before it is truly useful.

For support of VMware version 3, a more recent patch needs to be released. Fixes were released a few times, but we had already upgraded to those latest levels and the known compatibility problems were still not fixed.

The replication advantage the product has does not work for all VMs. For example, if a VM is big (in one case our VM was 42 terabytes) and has a high rate of change, the data just does not get across in the migration. So the product is really not able to handle either very big VMs or a very high change frequency. I remember we tried it with one Data Mart SQL database where we run continuous ETL (Extract, Transform and Load) jobs. The data reloads on a daily basis, and the replication takes too long to complete. The afternoon after the migration started, we were at roughly 50%. By the evening, we were at 70%. Then the data reloaded, so we scrapped the replication and started all over again. We found no means to accelerate it. By the time you appear to be making progress, you have to redo the migration. So that is another disadvantage when trying to use SRM.
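
The catch-up problem above can be illustrated with a back-of-the-envelope model. This is only a sketch under stated assumptions, not how SRM schedules or measures anything: the function name and its parameters (`size_tb`, `throughput_tb_per_hour`, `change_tb_per_hour`) are hypothetical, and the model assumes a constant transfer rate and a constant rate of changed data that has to be sent again.

```python
from typing import Optional


def hours_to_converge(size_tb: float,
                      throughput_tb_per_hour: float,
                      change_tb_per_hour: float) -> Optional[float]:
    """Estimate hours until the initial sync catches up, or None if it never does.

    Simplified model (an assumption, not SRM's algorithm): each hour the link
    moves `throughput_tb_per_hour` of data, while the source VM changes
    `change_tb_per_hour` of data that must be transferred again.
    """
    if size_tb < 0 or throughput_tb_per_hour <= 0 or change_tb_per_hour < 0:
        raise ValueError("size and change rate must be >= 0, throughput > 0")

    # Net progress per hour is what the link moves minus what gets dirtied.
    net_rate = throughput_tb_per_hour - change_tb_per_hour
    if net_rate <= 0:
        # The source changes at least as fast as we can copy: no convergence.
        return None
    return size_tb / net_rate


if __name__ == "__main__":
    # 42 TB is the VM size from the review; both rates are made-up examples.
    print(hours_to_converge(42, throughput_tb_per_hour=2.0, change_tb_per_hour=0.25))
    print(hours_to_converge(42, throughput_tb_per_hour=1.0, change_tb_per_hour=1.0))
```

In this model, a full daily reload behaves like a change rate close to the link throughput, which is why the replication appears to progress and then has to start over.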

There are a lot of minor things that need to be in place on both sides of the migration to make it work. If something goes wrong in the middle of the migration, you will have a tough time trying to troubleshoot it. The product has an insufficient method of logging, an insufficient level of operability, and an insufficient level of detailed technical tracing. This lack of information means you cannot immediately pinpoint the issues to troubleshoot them. It cost us multiple weekends of lost time while trying to troubleshoot because we did not get this information from the product.

But the things I would like to see for sure in a new release are:  

  • Fix all minor connectivity issues with auto-recovery.  
  • Auto-diagnose, auto-identify, and auto-correct issues as they occur and at least try to fix the issues a few times before allowing it to fail. If the fix is not successful then at least inform users that the fix attempt was made and the particular area where the issue is suspected so that users do not lose hours to troubleshooting.  
  • Open up the solution to be more environmentally agnostic. It should not be so strongly integrated with vCenter. It should be loosely coupled with vCenter and allow other solutions.  
  • Make the product more robust and much faster. Many replications we have initiated took two weeks before going to the switchover. A lot happens in two weeks. It seems like an eternity when you have no idea why replications stalled over that long of a period of time.  
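
The auto-recovery behavior asked for in the list above could look roughly like the following. This is a minimal sketch of the general retry-and-report pattern, not anything SRM provides: `run_with_auto_recovery`, `RecoveryReport`, and the `suspected_area` label are hypothetical names introduced for illustration.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RecoveryReport:
    """What the user is told after the automatic attempts finish."""
    succeeded: bool
    attempts: int
    suspected_area: str
    errors: List[str] = field(default_factory=list)


def run_with_auto_recovery(step: Callable[[], None],
                           suspected_area: str,
                           max_attempts: int = 3,
                           delay_seconds: float = 0.0) -> RecoveryReport:
    """Run `step`, retrying a few times before letting it fail.

    Every failure is recorded, so even an unsuccessful run tells the user
    that recovery was attempted and where the problem is suspected.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    errors: List[str] = []
    for attempt in range(1, max_attempts + 1):
        try:
            step()
            return RecoveryReport(True, attempt, suspected_area, errors)
        except Exception as exc:  # broad on purpose: this is a generic wrapper
            errors.append(f"attempt {attempt}: {exc}")
            if attempt < max_attempts:
                time.sleep(delay_seconds)  # wait before the next attempt
    return RecoveryReport(False, max_attempts, suspected_area, errors)
```

A caller would wrap one step, for example a hypothetical site pairing check, and show the returned report to the operator instead of a bare failure.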

For how long have I used the solution?

I used this between 2018 and 2019. In total, I used it for a year and a half.

What do I think about the stability of the solution?

The solution is not stable enough. If there are glitches in the process, it is not auto recovering from the issue. It is not even attempting to bring back a steady operational state. So stability is not sufficiently addressed.  

What do I think about the scalability of the solution?

The product promises to be scalable. You can add multiple vRA's — as many as you want per what you want to do. But then again, you are bound by physical constraints. For example, if you want to have multiple vRA's with multiple targets, that does not work. They have to all be directed towards one individual target. It could be multiple data stores, but it still has to be directed to the same target.  

In one case, we wanted to extend to an additional target, so we initiated two targets. Of course, the targets had two different configurations, two different data stores, and so on. That will not work. So that is where scalability ends.  

We had to do a complete reconfiguration with new targets. Then push everything over to a new target, then destroy it again, and bring it back to the first. We have done that on a few occasions, back and forth, and it is quite a cumbersome process. It should not be the case.  

Again, this particular case was kind of an advanced setup. But we have also tried multiple vRAs with just one target. Even there we encountered synchronization issues, because the vRAs need to stay in sync and that may not happen.

Internal software synchronization issues amongst the vRA's paralyze the replications. There are some bugs in this functionality as well. We tried to patch them up using fixes provided from the VMware lab. Eventually, we ended up on version 6.5.1. Later on, those patches disappeared, apparently because VMware understood the patches did not fix the problems — or maybe created more.  

Because of all these issues, we are no longer using the product for the moment. This is because of all the problems and the fact that there is an ongoing license cost as well. I think at the peak we had 10 users. These were admins and engineers. I was using the product as a solutions design architect. But right now I would never use it unless it is for disaster recovery or rehearsal or something like that.  

The advice that I would give to other people who are looking into implementing this solution is that every software product comes with flaws. Products can evolve very rapidly. I think in our case that it was quite a good learning experience. It was a good learning experience for VMware as well — as they acknowledged. They said they would work on improvements in the various areas I brought up to them, and I liked that they will be making the effort.  

But if considering this product, I would also look at other compelling products, like Zerto, for example, or other replication tools like the Sun virtual platform. You could look at the ease-of-use of Nutanix. Their process for replication is very different compared to what SRM offers. But the ease-of-use comes with constraints. You do not always have the choice to have equal foundations for both source and target. Then there are backup solutions like Rubrik and Veeam. There are certainly alternatives out there that are categorically different product types with other ways to accomplish similar things. But a lot of what is potentially a viable choice depends on the use case.  

My recommendation would be to prepare carefully. Together with the vendor, mimic your own live environment in testing as closely as possible to the existing architecture. Let the vendor prove that they are value-added resellers. Make sure you have tested in a representative setup at their facilities and can achieve what you are trying to achieve before going on to attempt to deploy and use it in your own environment.

I do not think SRM is fully ready yet for a hybrid context where the workload is working across multiple clouds and on-premises. It is an evolving product.  

How was the initial setup?

In a simple situation, the setup is a piece of cake. However, as soon as you start to work across various deployments based on various levels, the setup is much more cumbersome and much more complex. You need to deal with the interoperability issues like checking the vCenter on the left side and the vCenter on the right side, what is the ESX (Elastic Sky X) level, et cetera. You may need to downgrade your expectations accordingly, to make it still work.  

Also, if you have network routing in between two completely different, distinct environments, that can give you quite a lot of headaches as well. To give you an idea: in the initial setup of one migration, we just could not connect both VMs end-to-end. The site manager would not connect. The vRAs were connecting, but the site manager was not. It turned out to be a network routing issue. In actuality, the "issue" was not an issue. The routing was just working like it should, following the default gateway. It just could never connect to the other site manager.

At times you really need to go back down to the very basics yourself, and even then there may be no clarity about why it will not connect. It follows the route, the stage-gate goes through, and the connection does not happen.  

Then also the checkpoint restart is a problem. There is no checkpoint restart. What I mean by that is you can have eight VMs to migrate over a coming weekend and something goes wrong after the process is initiated, or somebody made a mistake in the service grouping. When you see this problem, you think you just need to remap, recalibrate, and then relaunch it. But there is no history track of what is already replicated. The service grouping does not reflect in that result. You need to start all over again. So there is no checkpoint for the restart. There is a checkpoint for an individual VM, but not for multiple VMs.  
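
What I mean by a checkpoint for multiple VMs can be sketched as follows. This is only an illustration of the idea, not SRM functionality: `replicate_group`, the `replicate_vm` callback, and the JSON checkpoint file are all hypothetical. The point is that progress is recorded per VM in the group, so a relaunch skips what has already been replicated.

```python
import json
from pathlib import Path
from typing import Callable, Iterable, List


def replicate_group(vm_names: Iterable[str],
                    replicate_vm: Callable[[str], None],
                    checkpoint_file: Path) -> List[str]:
    """Replicate a group of VMs, skipping those already recorded as done.

    Returns the names replicated during this run. If `replicate_vm` raises,
    the checkpoint file still lists every VM that finished before the error.
    """
    done = set()
    if checkpoint_file.exists():
        done = set(json.loads(checkpoint_file.read_text()))

    replicated_now: List[str] = []
    for name in vm_names:
        if name in done:
            continue  # finished in an earlier run, do not start over
        replicate_vm(name)  # may raise; earlier progress is already saved
        done.add(name)
        # Persist after every VM so a crash loses at most the VM in flight.
        checkpoint_file.write_text(json.dumps(sorted(done)))
        replicated_now.append(name)
    return replicated_now
```

With something like this, fixing a mistake in the service grouping and relaunching would only redo the VMs that had not completed, rather than the whole group.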

As far as the time it takes to deploy, that will vary. We have had different levels of complexity in our deployments. We initially had a simple setup that was done in two days, but there were no different networks involved, no different vCenters, and also it was intra-cluster. When done like this it was very easy.  

It was a completely different story for the more complex setups. I think it took us about six weeks with a lot of effort. There was a lot of alignment, a lot of verification, a lot of troubleshooting, and a lot of diagnostics to get it working end-to-end on both sides. It was really too much time to take with that kind of project.  

What other advice do I have?

On a scale from one to ten where one is the worst and ten is the best, I would rate VMware SRM as about a five. I am not open to giving a positive recommendation as the product stands. It is a little generous to give it a five considering all the issues.  

This review focuses a lot on the weaknesses of the product. But we were actually able to use the solution to get quite a lot of server images across successfully, especially if the servers were relatively small, like a parasitic thermal server or an ordinary file server. That type of project went fine. So, if your use case is entry-level, beginning, and maybe intermediate, I think you will be fine using the product. But even if you do not have a lot of complexity and you try to work with this in a really big enterprise and a multi-region, multi-datacenter environment, you will have a lot of challenges ahead for sure.

We have used it as a migration tool in support of a big transformation. I would think twice before using it for continuity on a permanent basis, and I would think three times until further enhancements have been made successfully to improve its utility.

Which deployment model are you using for this solution?

On-premises
Disclosure: My company has a business relationship with this vendor other than being a customer: partner