What is most valuable?
The ability to resize the cluster is what really makes it stand out over other Hadoop and big data solutions. You can do it very easily and quickly. It is a managed service from Amazon Web Services (AWS), so it removes a lot of the headaches of configuring the different environments for all the nodes in the cluster, and frees you up to do other things. You can set it up in minutes and it's very straightforward.
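As a rough illustration of the kind of resize described above, here is a minimal sketch in Python. It assumes the boto3 library is installed and AWS credentials are configured; the cluster ID, instance group ID, and region are hypothetical placeholders, and this is not the script we actually used.

```python
from typing import Any, Dict


def build_resize_request(cluster_id: str, instance_group_id: str,
                         new_count: int) -> Dict[str, Any]:
    """Build the keyword arguments for EMR's ModifyInstanceGroups call."""
    if new_count < 0:
        raise ValueError("new_count must be non-negative")
    return {
        "ClusterId": cluster_id,
        "InstanceGroups": [
            {"InstanceGroupId": instance_group_id, "InstanceCount": new_count}
        ],
    }


def resize_cluster(cluster_id: str, instance_group_id: str, new_count: int,
                   region: str = "us-west-2") -> None:
    """Ask EMR to grow or shrink one instance group of a running cluster."""
    # Imported lazily so the request builder above works without AWS access.
    import boto3

    emr = boto3.client("emr", region_name=region)  # region is an assumption
    emr.modify_instance_groups(
        **build_resize_request(cluster_id, instance_group_id, new_count)
    )


if __name__ == "__main__":
    # Placeholder IDs; only the request is printed, nothing is sent to AWS.
    print(build_resize_request("j-EXAMPLE", "ig-EXAMPLE", 10))
```

Keeping the request builder separate from the AWS call makes it possible to check the parameters without touching a live cluster.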
How has it helped my organization?
Well, I've been at two different companies, and I'll mostly relate my experience at HLI (Human Longevity) in San Diego. We used it for genomics, which is a perfect use case for big data. We managed literally terabytes of data using some of the tools that are included with EMR, like Spark and Hive. EMR is a collection of tools, and what we were able to do with them was essentially set up a genomic data warehouse of people's samples and their sequenced DNA. We were then able to quickly and easily pair that with annotation data, which essentially tells you what your genome means: what a given sequence, or certain sections of those characters, means. That was all very easy, and it allowed everyone to know where, for instance, the most recent versions of certain data lived at all times, which is really important.
What needs improvement?
There were times when they would release new versions and it seemed to end up breaking old versions, which is very strange. It could have been a red herring; something else may have changed in our environment that we never found out about. But all of a sudden one day we couldn't run our scripts to start up clusters, something we could do the day before. It was because they'd released a new version, and we had to change things around.
They have listened to the community quite a bit, including on things that we had suggested to them. They sometimes have older versions of some of these tools, because the tools are open source and Amazon creates its own version of them. For instance, the version of Hive was pretty far behind for quite a while.
They've addressed that and I think it's partially because of customers like us telling them, "Hey, there are a lot of new features that should be available but aren't in your distribution."
For how long have I used the solution?
For close to two years now.
What do I think about the stability of the solution?
No, not really. I can definitely count on it to do what it needs to do. In the last year, whenever there has been a problem, the cause has been the data we were feeding into it, not EMR itself.
You have to configure it. You may have to configure your cluster with bigger nodes or with more nodes if the shape of your data changes. That's going to be the nature of the beast with any kind of solution like this, so that's not EMR's fault.
What do I think about the scalability of the solution?
How are customer service and technical support?
I have not called them, but we had a plan where, if we had an urgent case, we could email them. There were certain people in the organization who could actually call them for mission-critical things in our department using EMR. We could either ask those people to call or email support ourselves, and we could expect a response within a couple of hours.
We did have to do that when the new version came out and broke the old version. There was also one time when it turned out that the data was the problem. There were so many logs, and we were in a time crunch, searching through them and trying to figure out what was going on. So we emailed them, and both times they were very responsive and solved the problem very quickly.
Which solution did I use previously and why did I switch?
No, not really. When I got to that company, EMR is what they were already using. My boss was very big on using managed services from Amazon because they give you an additional layer of insurance: if something goes wrong at the level of the operating system, for instance the patching of the operating system on the nodes in the cluster, that's on Amazon to take care of. We didn't have to focus on that, so we could focus on actually getting the work done.
How was the initial setup?
It was one of those things where once you figured it out, you've got it. With this big data stuff, you put in a lot of work, trying to set something up and then you sort of set it and forget it.
Amazon has made it much easier since I first started with it. Once you get the cluster set up in the graphical interface, just point and click, you can copy a script that you can run from the command line to create that same cluster. That is extremely helpful, and that's the way most people do it in production: you have a script, you run it, and the cluster comes up. So it's a one-button kind of thing.
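To give a feel for what such a creation script can look like, here is a minimal sketch in Python using boto3 rather than the command-line script the console generates. Everything specific in it is an assumption for illustration: the cluster name, the release label, the instance type, the node count, and the region. It also assumes the default EMR roles (EMR_DefaultRole and EMR_EC2_DefaultRole) exist in the account. Pinning an explicit release label is one way to keep the same script from picking up a different version from one day to the next.

```python
from typing import Any, Dict


def build_cluster_request(name: str, release_label: str, core_nodes: int,
                          instance_type: str = "m5.xlarge") -> Dict[str, Any]:
    """Build the keyword arguments for EMR's RunJobFlow call."""
    return {
        "Name": name,
        # Pin the EMR release explicitly so reruns create the same stack.
        "ReleaseLabel": release_label,
        "Applications": [{"Name": "Spark"}, {"Name": "Hive"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "Master", "InstanceRole": "MASTER",
                 "InstanceType": instance_type, "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": instance_type, "InstanceCount": core_nodes},
            ],
            # Keep the cluster running after any submitted steps finish.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # Default role names; assumed to already exist in the account.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def launch_cluster(request: Dict[str, Any], region: str = "us-west-2") -> str:
    """Create the cluster and return its ID."""
    # Imported lazily so the request builder above works without AWS access.
    import boto3

    emr = boto3.client("emr", region_name=region)  # region is an assumption
    return emr.run_job_flow(**request)["JobFlowId"]


if __name__ == "__main__":
    # Placeholder values; only the request is printed, nothing is launched.
    print(build_cluster_request("example-cluster", "emr-5.36.0", core_nodes=4))
```

Calling launch_cluster with the built request is the one-button step; the builder on its own can be reviewed or kept in version control like any other configuration.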
They tried to make it easy. It was fairly simple once you got through the complexity of everything that was involved with it.
Which other solutions did I evaluate?
Every now and then we would evaluate another vendor like Cloudera or MapR, but at the end of the day, we ended up sticking with EMR because nothing made a compelling enough argument to change.
We did try Cloudera, and we liked it quite a bit. But we already had such an investment in EMR, and Cloudera's cost, while competitive, just wasn't enough of a savings to justify switching. Then MapR came in and tried to sell us on their product, and none of us ever saw any benefit to using MapR over any of the other solutions.
Cloudera may have looked a little less expensive because Amazon EMR charges extra fees per node based on the size of the node. When you're using EMR, it can be up to 16 to 32 times the actual original cost of the nodes. For us, though, it was only about two to four times because of the size of the nodes we were using, so the penalties weren't as great. And the benefit of not having to manage the infrastructure was enough that we said, "Well, if we want Cloudera, we would have to do that to a certain extent, potentially." So, we said, "All right, well, it would be more work. So, let's just keep it with EMR."
What other advice do I have?
I would say take advantage of the documentation that exists. There are a lot of tutorials, and there's a really good community. The documentation is very thorough and very well-written, which is one of the greatest things about AWS. I don't know if this matters, but I'm a Certified Developer and Solutions Architect at the Associate level; that's not to say I wouldn't criticize them if I had anything to criticize.
I gave it a nine out of 10 because nothing is perfect. Everything can always improve but, overall, it's extremely well thought out. The cost is a bit prohibitive sometimes, but the whole world of big data and cluster computing can be very daunting, especially for someone new getting into it as a developer, or from a business perspective. Amazon makes it about as easy as it can be to dip a toe in those waters.