I do not see a big advantage of using Cloudera or Hortonworks Hadoop over AWS EMR.
I would like to know which key pain points these vendors address that AWS EMR cannot.
Here are the key points that differentiate EMR from packaged Hadoop software on a private cluster:
Amazon Web Services Elastic MapReduce (EMR) is clearly a simple and fast
way to get started with Hadoop. As with any cloud offering, the trade-off is
control and security. With your corporate data in the cloud, you are trusting
someone else, and you are somewhat limited in the types of things
you can do. AWS EMR leverages open source Apache Hadoop
components almost exclusively.
It is cheap, but not as easy to use as some of the value-add components in Hortonworks, Cloudera, or IBM products.
If I leverage IBM InfoSphere BigInsights on my own cluster, I gain
ease of use through robust tools, security that I can control, and standard SQL queries
through Big SQL instead of HiveQL. Additionally, the support would be superior.
Cost is, of course, higher with a private cluster, since you are purchasing software and/or support.
For that reason, many people get started with AWS EMR.
To summarize, the advantages of EMR are cost and open source components, while a private Hadoop cluster offers flexibility, control, security, and convenience.
Full disclosure: I work for an IBM Business Partner.
In my opinion, it is more about support and certification across Apache projects and vendor products. With plain open source, if you run into an issue, you fix the problem yourself. You can build your own distribution and deploy it with EMR, or you can take a certified distribution like CDH or HDP and have assistance.
If you are interested in why Hadoop is so important, I suggest visiting this link: