What is our primary use case?
We use Rubrik for VM backups, NAS backups, and SQL backups. Most of what we protect is virtual. It's AHV and VMware, primarily. We have a half dozen physical machines, but most of it is virtualized. We don't do any cloud-native protection yet, although we're about to start doing Office 365.
We have the Brik as an on-prem piece and we offload all our data to Azure.
How has it helped my organization?
Not having to specify a time to run a backup with a fixed schedule is something that's really beneficial. In the past we had to schedule and try to manually stagger things over the window, to back up everything. Because Rubrik is SLA-based, you say, "Well, I need it to fit in this window here," and it just backs it up when it's most convenient for the Brik and for the third-party system. It looks at the CPU usage and says, "Okay, it's not as busy now. I know I've got time to take the backup." That's a real advantage.
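The load-aware placement described above can be sketched as a simple heuristic. This is purely illustrative — Rubrik's actual SLA-driven scheduling is proprietary and far more sophisticated — but it shows the idea of picking the quietest slot inside a window rather than a fixed time:

```python
def pick_backup_slot(window_start, window_end, cpu_by_hour, duration_hours=1):
    """Pick the least-busy start hour inside a backup window.

    cpu_by_hour: dict mapping hour (0-23) -> average CPU utilisation (0-100).
    Illustrative only -- not Rubrik's real placement logic.
    """
    candidates = range(window_start, window_end - duration_hours + 1)

    # Score each candidate start by the peak CPU over the backup's duration;
    # the best slot is the one whose worst hour is still the quietest.
    def peak(start_hour):
        return max(cpu_by_hour.get(start_hour + i, 0) for i in range(duration_hours))

    return min(candidates, key=peak)
```

For example, with a 20:00-24:00 window and CPU readings of 80/30/10/50 percent for those hours, a one-hour backup lands at 22:00, the quietest hour.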
When it comes to its archival functionality, automatic is probably the best way to do it. You set it up in the SLA to archive the data, and tell it where to put it, and it just does it. You don't have to worry about it. You don't have to check it. It just works. That's true with a lot of Rubrik's functionality. The big thing, the big benefit, it gives us is that it just works. We don't have to handhold it or check it to make sure things are still working. It does just work.
Another way it has improved our organization is recovery time. In the past, when we wanted to recover one of our SQL databases—our student record system is about 1.5 TB in size—to recover that from tape used to take about four or five days, and then get it onto a disk and have it visible in SQL Server. With Rubrik, when we've had to recover that, we've actually put it into the Live Mount capability. It runs on the Brik in the SSD layer. When we timed this, it took nine seconds to mount it so it was available in SQL Server and, within 30 seconds, it was out-performing production on queries. So within a minute you can have recovered what you might need to recover, rather than having to wait days to recover something. And if you have to completely replace the database, then you can migrate that over. Or if you have to just take some data out, you can just pull that out as well. It's an instant approach to database management, rather than having to worry about the time it takes to get data out.
And when we've had to recover a backup of SQL data, it has reduced downtime. It's allowed us to get back up and running within 10 or 15 minutes, rather than having to wait days to recover something, especially where the state needed to be adjusted as well. The impact, the downtime, is much reduced now.
When it comes to backup testing, we don't have to worry about validating that the backup has run. We can spin up a backup into Live Mount. We run our DBCC checks for SQL against the Live Mount instead of production. That helps protect the production platform performance, but it also allows us to validate that our backups are smooth and are recoverable as well. Having a backup is one thing, but proving that you can restore them has always been a bit tougher. So we pick databases on a weekly basis and recover those with Live Mounts to make sure that we can access the data in them.
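The weekly pick-and-verify rotation could be scripted along these lines. This is a hypothetical sketch, not the author's actual tooling — the function name and round-robin policy are assumptions — but it shows how to guarantee every database gets restore-tested over time:

```python
def databases_for_week(databases, week_number, per_week=2):
    """Deterministically rotate through databases so each one is
    Live Mounted and checked (e.g. with DBCC CHECKDB) over time.

    Illustrative sketch: round-robin by ISO week number.
    """
    n = len(databases)
    start = (week_number * per_week) % n
    return [databases[(start + i) % n] for i in range(per_week)]
```

Each week's selection would then be Live Mounted and validated (for SQL Server, typically a `DBCC CHECKDB` against the mounted copy instead of production).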
We also don't spend time managing backups now. That's the really important message. We used to have about half an FTE looking after our backup estate, making sure jobs were running, or changing tapes on a daily basis. That's all gone away now. If anything, it might be 0.1 of an FTE, just to keep an eye on things occasionally. Some weeks there might be two days of work we need to do, whether it's upgrade prep and then doing an upgrade, adding new bits to the backup piece, or removing things as we decommission them. But it's more operational now, rather than actually managing the backup piece itself. It's just another part of the process. Part of the business case for us was the time it was going to save us in managing backups, to add more value back into the organization.
Rubrik has given us that half an FTE back. We don't have to worry now about what the backups are doing. We can actually now focus on other things. As a result, our IT security posture has improved because we've realigned that resource to improve our IT security resource count. We're now being more proactive with our security stances. We are able to use our resources more efficiently.
Polaris, the SaaS-based framework for extracting metadata, is what the ransomware product surfaces through. You have the core Polaris product, which is the GPS, and Radar sits within that. We do have Sonar as well, the data classification and search product, which looks for data that shouldn't be in certain places. The benefit of Polaris is that I don't have to be onsite to look at it. I can log in remotely. It allows me to have visibility of what we're doing in terms of our backups. That's particularly true if a ransomware alert is triggered in the early hours. When I wake up I can have a look at that alert through the Polaris interface, rather than having to log in to my laptop and onto the VPN to get into the CDM product. Polaris is really helpful in giving us that agility.
The Sonar piece really helps because it allows us to look for data that shouldn't be in certain places, and it even helps the efficiency of platforms. For example, when our HR product creates the payroll, it actually creates a copy of that temporarily on the HR platform. When it's processed, it should be deleted or moved into archive. But when we ran Sonar against the HR platform, we actually identified that a lot of the data hadn't been tidied up as part of that process. So if that server had been compromised by either internal or external access, it would have potentially allowed a lot of that sensitive data to be leaked out. It's helped the HR team change their processes to look after the data better.
What is most valuable?
It backs up everything to Azure, so we no longer have to worry about tapes. When we went into lockdown, as a response to COVID, we didn't have to think about, "Well, we need to send people into the site to change backup tapes." That all carried on working. We could do a lot more remotely than we would have been able to do otherwise.
We also have the Radar product for ransomware detection. That looks for anomalies in our backups and will trigger an alert if it sees something that is an abnormal amount of change. That could be lots of deletes or modifications, compared to normal. Or it could be some VMs that have suddenly had a lot of folders added or deleted. We haven't had anything so far, at least, that was problematic, but it's nice to know that it's keeping an eye on how much change is happening with backups and helping us identify problems. It can detect when someone has gone in and deleted a substantial amount of data on a VM. If that's abnormal it will flag it and say, "Well, you might want to investigate this."
Our finance was doing a big refresh of non-production data. They deleted a load of log data and the app flagged it and said, "Well, this is strange activity. You might want to just check this out." I referred that to the finance team and they said, "Yeah, we're just refreshing the VMs, that's okay." That was cool, because we moved on. But if they had said, "Well, no one has touched that for months," then we would have looked at it in a bit more detail to see what it could have been. But without that alert, we wouldn't have any clue that anything happened. It's helping us keep an eye on what's normal and not on the estate. It's worth it because it doesn't always have to be external actors that are causing problems. You could have somebody internal being malicious if they're looking to leave or dissatisfied in their role, for example. It helps keep an eye on those situations as well.
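The kind of change-rate check described above can be illustrated with a simple z-score test against recent history. This is a toy model — Radar's actual detection is machine-learning based and far more involved — but it captures the idea of flagging an abnormal amount of change:

```python
from statistics import mean, stdev

def is_anomalous(history, today, z_threshold=3.0):
    """Flag today's change count if it deviates strongly from the baseline.

    history: daily changed-file counts from recent backups.
    Toy z-score check, not Radar's real model.
    """
    if len(history) < 2:
        return False  # not enough history to establish a baseline
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today != mu
    return abs(today - mu) / sigma > z_threshold
```

A VM that normally changes about 100 files a day would be flagged when a backup suddenly shows 2,000 deletes, while ordinary day-to-day variation passes quietly.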
Its web interface is really easy to use. It's just click and go. It's fast and intuitive. We've never had any problems in navigating.
What needs improvement?
Looking at how the data is broken down, we can see the total storage, but sometimes it's difficult to see how big a particular snapshot is. Across 90 days of snapshots, which one is particularly large? Looking at the data holistically could be a lot easier.
With the Radar product, it would be helpful if it gave us a bit more insight into the alerts. It might alert on an object like a VM, but what, specifically, on that VM triggered it? A bit more insight, without our having to dig, is the biggest gap they should be filling now.
For how long have I used the solution?
I've been using Rubrik for nearly three years now.
What do I think about the stability of the solution?
It hasn't gone down yet. Even when we had a power problem, and the Brik actually lost power because our UPS was failing, we turned it back on and it just picked up where it left off and carried on. It does just work, and it's intelligent enough to rebalance itself as well.
What do I think about the scalability of the solution?
Because it's hyper-converged, we can just add additional Briks and nodes to give extra capability. We introduced an edge appliance to our setup. We installed it, added it to the cluster, and it picked up some of the workloads. It was so simple, a bit like Nutanix. The fact that it is all hyper-converged means the whole scaling piece is so much simpler compared to a traditional three-tier architecture. It's just plug and go.
It's only within our IT department that there is access to the product. There are about a dozen people who can use it. But the services that we support help support the whole organization, whether it's HR, finance, or research data, or user file stores. It does touch everyone.
Which solution did I use previously and why did I switch?
Prior to using Rubrik we used NetBackup onto tape, and we used a bit of StorSimple as well. A full backup used to take us six days and 23 hours on those. We had just enough time in a week to fit it all in, and then a very small window to change the tapes and start it off again. That was an ongoing problem we'd always had, so it needed very close monitoring. If backup jobs failed it was always hard to work out why. And we had the whole tape-changing piece as well. In addition, StorSimple was quite expensive.
Rubrik reduced our backup costs and our backup time. It increased our snapshot position as well, because we're doing incremental forever. It just made the whole process so much more efficient.
How was the initial setup?
The initial setup was really straightforward. From unboxed to being in production it took less than two hours. That was with some of the networking we had to do around it as well.
But we did go a bit too fast in terms of deployment. Even though it's incremental forever, it has to do that first full backup. We pointed a little bit too much at it the first time around and it struggled to ingest it all and move forward. After 24 hours, we stopped and started again because we were still backing up through the old method as well. When we started again we slowed the pace down to happen over three or four days rather than one day. At that point we had ingested everything and, from there, it's been smooth sailing. We haven't had any problems.
The biggest thing I always say, if anyone asks, "What would you do differently?" is to slow down the initial rollout to make sure that you're not overloading the first full backups. The incremental forever won't be in position as quickly, but it will be a bit more stable.
I was the only one involved in the deployment. My platform team handles maintenance of it. I've got a junior infrastructure engineer who essentially looks after it. Her role is to look after monitoring and backups. But it's not something we ever really have to look at these days.
What was our ROI?
Our ROI is actually neutral because we're backing up more. We could never back up everything we needed to back up, and that was always a risk that we carried. While the return is neutral, we are doing a lot more than we could before.
Which other solutions did I evaluate?
We looked at Veeam, but I didn't want to have a large on-premises implementation, as that is very much an appliance model. I would have had to roll out quite a lot of infrastructure to cover it.
We looked at Druva, to see where that was in the market but that didn't really fit our model.
We looked at Cohesity as well, and they seemed to be a few months behind Rubrik, and just duplicating everything Rubrik were doing.
The main requirement we did have was that it had to support AHV as well. Three years ago, there were not many products out there that could back up both VMware and AHV.
What other advice do I have?
We haven't explored the API yet. It's been on our list for quite a while, but it's always been hard to prioritize. We have so much technical debt that we've been dealing with, rather than focusing there. As an API-first product, it makes a lot of sense to go that way. For us, it's just a matter of prioritizing it. I have had a little play with the API, to prove we can get some of the information we want out of it.
Which deployment model are you using for this solution?
On-premises
Disclosure: PeerSpot contacted the reviewer to collect the review and to validate authenticity. The reviewer was referred by the vendor, but the review is not subject to editing or approval by the vendor.