What is our primary use case?
This is our main backup system. All of our VMs, our hardware hosts; everything is backed up using Rubrik.
Disaster recovery is one of the options we have explored, so that in case of a big disaster we could utilize their image conversion to run our VMs on AWS, but that is just a proof of concept at this stage. We have tested it. It works. But we don't have a proper plan in place for that.
We have only one physical server that we are protecting with it and the rest are all virtual servers. We have around 400 server VMs and all of them are protected using Rubrik. Most of our environment, around 90 percent, is VMware, while 10 percent of our environment is Hyper-V.
With the VMs we are also taking backups of our CIFS shares. We have our file clusters running Windows Servers so we are taking backups using the SMB mount. We have NFS clusters as well, for the Linux side, which we're backing up using the built-in NFS connectors. We explored SQL backups, but right now we are using our SQL Server to dump the data and then the files are being backed up. We're not directly backing up SQL using Rubrik.
How has it helped my organization?
The SLA-based policy automation has had a very good effect on our data protection operations. We came from Commvault and we used to have tape backups. It was a full-time job for one of our sys admins to update the tape library, replace the tape cartridges, recycle them, scratch them, and then bring them back. It was a huge process. We were using offsite storage for our tape backups, which were continuously going back and forth from our campus. Now it's all automated. We barely have to manage anything. We are now consumers of the service rather than the people setting it up. It was a one-time setup and we just maintain it now.
It saves us time when it comes to managing backups because we barely do anything, other than just verify. We get a daily report to see if any of the VMs are out of our SLA. The only action item we have, if something is out of SLA, is to verify what happened, why the backup failed or missed its window. Given that it was tape before, it has gone from hours to minutes. It used to be more proactive, where we were continuously checking everything and replacing the tapes and making sure that everything went through. Now, it's more of a reactive situation, where we only look at a backup when there is an issue.
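The daily check described above can be sketched in a few lines. This is only an illustration, not Rubrik's report format or API: the `VmStatus` record, the `out_of_sla` helper, and the 24-hour window (inferred from the once-a-day backup policy) are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class VmStatus:
    """Hypothetical row from a daily backup report (not a Rubrik data type)."""
    name: str
    last_successful_backup: datetime


def out_of_sla(rows, now, window=timedelta(hours=24)):
    """Return the names of VMs whose last good backup is older than the window."""
    return [r.name for r in rows if now - r.last_successful_backup > window]


if __name__ == "__main__":
    now = datetime(2021, 1, 10, 8, 0)
    report = [
        VmStatus("web-01", datetime(2021, 1, 10, 1, 0)),  # backed up 7 hours ago
        VmStatus("db-02", datetime(2021, 1, 8, 1, 0)),    # missed its window
    ]
    # Only the VMs in this list need a person to look at them.
    print(out_of_sla(report, now))
```

An empty result means nothing needs attention that day, which matches the reactive workflow: someone only investigates when a VM shows up in the list.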
It has also definitely reduced the time we spend on recovery testing, because it can do Live Mounts and that does not require an actual recovery. So our VMs are instantly available. And the file restore feature allows us to explore the file system of every VM, instead of restoring it, and then just restore the files that we need, and that has been amazing so far as well. Within a few minutes, we have either the VM or the files available. I don't even know how to compare it to Commvault and the tape backups. When I joined Harvard, they already were on Rubrik and we were decommissioning Commvault, so I know a little bit about the process. We do classroom recordings at Harvard Law School and those were still going to Commvault. That was the last project that I was involved in and I saw the crazy amount of work involved where we had to bring all the tape libraries back from safe storage.
And when it comes to recovery time itself, it's an instant recovery in most circumstances. The exception is when we have to recover something that's more than three days old. In our environment, after something is more than three days old it goes to an archival location on S3. When we restore data that is between three and 42 days old it is downloaded from S3 and then made available. For us, that situation is a little bit slower compared to the Live Mount. Depending on the size of the VM, it could take anywhere from a few minutes to a few hours. But if the data is on premises, it is available instantly.
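The retention tiers described above can be written down as a small lookup. This is a sketch under stated assumptions: the function name, the tier labels, and the handling of the exact 3-day and 42-day boundaries are illustrative choices, not Rubrik configuration.

```python
LOCAL_DAYS = 3       # snapshots this recent are still on the on-premises cluster
RETENTION_DAYS = 42  # total retention under the SLA described in this review


def restore_source(age_days):
    """Classify where a snapshot of the given age would be restored from."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    if age_days <= LOCAL_DAYS:
        return "local"    # available almost instantly, for example via Live Mount
    if age_days <= RETENTION_DAYS:
        return "s3"       # must be downloaded from the archive first, so slower
    return "expired"      # outside the retention window


if __name__ == "__main__":
    for age in (1, 10, 50):
        print(age, restore_source(age))
```

The point of the split is that only restores in the "s3" tier pay the download cost, which is where the minutes-to-hours range comes from.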
We don't have to worry about the solution too much, which definitely has helped our productivity. Most of our workflow is automated, where VMs are automatically added. The SLA is automatically assigned. Things are automatically archived. Anyone can take action. We have on-call people who look at the reports and take action as needed.
What is most valuable?
There is a live-restore feature, Live Mount, which lets us instantly recover a VM from a past backup and attach it directly to our VMware environment. Rubrik acts as the disk for it. It's like an instant restore. Within a few minutes our VM is up and running. And then, if we want to restore it permanently, we can just migrate it to our actual storage.
Rubrik's web interface is very simple to use. We have a very simple SLA configured so that everything is backed up every day. Any new VMs we configure in our environment automatically get added, the SLA is automatically assigned to them. All the VMs, after three days, are archived to AWS S3, and then there's a life cycle on the AWS side to work with that.
The archival functionality is one of the main features because the Rubrik that we have has about 60 or 70 TB of total local storage, which is definitely not enough for our data. We have around 140 TB of data stored on AWS and, without the archival feature, we would have to buy at least three times the number of nodes that we currently have to keep all the data secure for 42 days, based on our SLA. It's definitely saving us on costs. It also gets us away from having to keep redundancy on the data, because if we were storing it on-premises we would have to make sure that we have redundancy and offsite storage. Now, all of that is AWS. We no longer have to worry about that.
What needs improvement?
Capacity reports could definitely be improved. It's hard to determine what is using the space and why. For instance, you can see that some host is using 2 TB on the Rubrik node when the disk space on that host is only 400 GB. It's hard to explain how there can be 2 TB of data on local storage when nothing has changed on the host for the past three days.
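One way to spot cases like this is to compare each host's source disk size with the space its backups occupy. The sketch below assumes the figures have been exported by hand into simple `(name, source_gb, stored_gb)` tuples; that layout, the `oversized_backups` helper, and the host names are hypothetical and do not reflect the format of Rubrik's capacity report. It also treats 2 TB as 2000 GB.

```python
def oversized_backups(rows, ratio=1.0):
    """Return (name, stored/source) pairs where backups exceed `ratio` times the source size."""
    flagged = []
    for name, source_gb, stored_gb in rows:
        if source_gb > 0 and stored_gb / source_gb > ratio:
            flagged.append((name, round(stored_gb / source_gb, 1)))
    return flagged


if __name__ == "__main__":
    rows = [
        ("host-a", 400, 2000),  # the 400 GB host using 2 TB from the example above
        ("host-b", 500, 350),   # backups smaller than the source disk
    ]
    print(oversized_backups(rows))
```

A check like this only shows which hosts look wrong, not why; explaining the extra space is the part the capacity report does not currently help with.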
They have improved a lot on the SLA reports. We used to get a lot of false alerts because a snapshot was missed. In the reports it would remain "non-compliant" with the SLA for 42 days, until the 42 cycles were done. They've removed that. Now, if a VM misses an SLA and you take another snapshot, or an automatic backup runs, the SLA report is automatically corrected to show that it's protected.
Most of their documentation for cloud stuff can be improved. This could be old information, as we did the PoC last year and maybe their documentation has been updated now, but we literally had to contact support every day, and at every step for things like, "Okay, what do we do with the AMIs? How do we get Rubrik configured? How do we convert the image?" None of that was available in a single documentation format. It was spread around in different documentation.
For how long have I used the solution?
I've been using Rubrik for the past three years, since I joined Harvard, but I think it was deployed on-premises four or five years back.
What do I think about the stability of the solution?
It's very stable. We have had an instance where one of the nodes was offline for no reason, but working with their support it was determined that there was a cache issue and they fixed it.
We don't have to worry about backups. We have been using it for more than four years and so far there hasn't been a single incident where we have had any issues recovering any of the files or VMs. It is very robust, continuously updating.
What do I think about the scalability of the solution?
They have everything available by API, which is a good thing because this is the way that things are going forward with an API-first infrastructure. In terms of their physical nodes you can also scale them, but there's a requirement of always increasing in sets of three more nodes. We have one Brik and four nodes currently, and to increase our storage we would have to buy three more nodes, which is kind of a limitation. It would have been nice if we could just buy one node and increase that way, gradually, instead of buying three large nodes. But I can't complain about it. That's probably their infrastructure.
We are using it for everything except our media storage. Our classroom recordings are archived directly to Glacier and everything else goes through Rubrik. The reason for that is that we don't want on-premises storage of the media. These are large video recordings and it would be very expensive to store them locally. Rubrik keeps a local copy for three days, for regular backups. We are actually testing a new feature where you can connect to NAS storage and no data is stored locally, only metadata. Everything else is archived. We have tested this feature with their support. They showed it to us but we haven't acquired the license to start using it yet.
Only sys admins have access to Rubrik in our organization. Currently, 10 of our sys admins have access to the system.
How are customer service and technical support?
Rubrik support is amazing. When we are involved in upgrades we always open a ticket and there is a tech person joined through a tunnel and looking at the upgrade while it's being done. It's like everything is off our shoulders in terms of managing it. If something goes wrong, they're always available to support us.
Every time we've opened a ticket with them, even to explore new features, we have always gotten an instant response, and even when it comes to trial licenses. The whole proof of concept project we did on AWS for DR was provided from their support, and it has been amazing. The experience has been really good.
Most of the time, their turnaround time for tickets is less than 24 hours, especially with high-priority tickets. Recently, we have had some issues with our VM storage sizes not reflected properly. We were looking at a capacity report and we were seeing some of the VMs using way more storage on Rubrik than they should. This has been a difficult problem and they have continued to escalate it to different engineers. That is the longest interaction we have had and the issue is still pending.
We are not running the bleeding edge, so it is possible that switching to 5.2 would already improve the deduplication, which might be the reason that this is happening. They are looking into it. They have suggested a couple of actions on our end, namely to delete those backups, archive them, and restart the backups, but they're still looking into it.
Which solution did I use previously and why did I switch?
We wanted to get away from tapes. We tried Veeam but it did not work very well for us. There were a couple of shortcomings which we couldn't maintain, plus it wasn't cloud-ready at that moment, at least not to the extent that Rubrik was.
Rubrik was very fresh in the market at that time, but it was bringing features that we were looking for. We were already set on using either Azure or AWS and it had the needed support for them.
How was the initial setup?
I've been involved with upgrades but not an install because we just have the one on-premises device. I've been involved in multiple proofs of concept. For example, they launched a couple of features along the way where we were testing cloud workloads and converting our images to native AWS images so that we could use AWS as a disaster recovery site in the future, if needed. All of our backups are going to AWS.
Upgrades are very straightforward. Their support is always with us, so we haven't had any hiccups during the upgrades. They go very smoothly. I've been involved in multiple upgrades, and we were at some point running the bleeding edge software, when we were looking for some features that were available, without any issues. So we did upgrade to the latest and greatest version. Our general policy is to stay one version behind to iron out all the bugs. But with Rubrik we have attempted to run the latest version, to use the features, and it has been stable enough for us and the upgrades have gone smoothly.
We usually block out a two-hour maintenance window for upgrades. There have been major upgrades which required some database work, and they have taken more time. In the move from version 4 to version 5 their whole database infrastructure was changed.
What's my experience with pricing, setup cost, and licensing?
We got grandfathered in the licensing terms. Their licensing is much more narrow now and you have to buy licenses for every cloud feature, but we got most of those things as a package.
We got really good pricing because we're in the education sector and we were one of the first big organizations to start using Rubrik.
Which other solutions did I evaluate?
Recently, when we were looking for direct backup to Glacier, we started using CloudBerry, which is a very basic product. It's a standalone install on our media servers and it backs up directly to Glacier. It's a single-unit license on a single server; there's no hardware involved with it.
The only advantage of CloudBerry is that we're not keeping an on-premises copy. When we take a backup with Rubrik it creates an on-premises copy of all of our media files and then uploads them, and that requires more storage on Briks that we don't want to spend money on. The Rubrik feature we tested, where you connect to NAS storage, wasn't available when we acquired the license from CloudBerry.
What other advice do I have?
Rubrik is an amazing product. There are some features still missing. For example, you cannot do a granular backup or restore of Active Directory. That has been on my wish list. I have posted that on their tech forum where people discuss new features and new things that they are launching. I know that it will come because they have been adding other granular backup support with VSS. The AD-level granular backup, so we can restore a single account or a single computer, is one of the last features that we are requesting. They usually do bring out whatever features we request in their next update.
We have not used the solution's ransomware recovery. I have attended a couple of seminars where they have recently been talking about that, but we haven't tested it. We haven't had any incidents which would require us to use that feature.
We have also not used its pre-built integrations or API support for integrations with other solutions. We played with a couple of features, such as the organization features to segregate some of our VMs, but we found that it was not possible the way we handle the system. We wanted to make our domain controller backups inaccessible to our backup administrators, because we wanted that to be part of the DCA job. So we explored the organizations, but the way it works we would have had to move everything into an organization and our backup administrators were taking care of everything except domain controllers. So we dropped the idea of using organizations.
In terms of downtime, I don't think Rubrik has reduced that in a meaningful way. We have a pretty redundant environment anyway. If something happens to our VMware hosts, the VMs automatically fail over to other hosts so there is rarely any downtime. We have been off physical servers for quite some time. If there were physical servers, Rubrik could help reduce downtime, but since we don't have physical servers we don't even know what the recovery would look like with Rubrik. With tapes it was crazy when something happened. If someone did not look at RAID and we had a two-drive failure or a three-drive failure, then it would be a full recovery from tape. But now, because everything is running on VMs, we have no downtime, most of the time.
Overall the product is really good. Rubrik is very competitive. Even if you now look at their positioning on the industry review sites, they are doing really well. It's a very good product. We recommended the product to our Central IT department. We are Harvard Law School, but Harvard has a Central IT which manages other schools, and they are doing a PoC right now. It's a good product to recommend.