We compared Amazon EMR, Apache Spark, and Cloudera Distribution for Hadoop based on real PeerSpot user reviews.
"The initial setup is straightforward."
"In Amazon EMR it is easy to rebuild anything, easy to upgrade and has good fault tolerance."
"The ability to resize the cluster is what really makes it stand out over other Hadoop and big data solutions."
"When we migrate big jobs from on-prem to the cloud, we do it in EMR with Spark."
"The solution is scalable."
"Amazon EMR's most valuable features are processing speed and data storage capacity."
"We are using applications, such as Splunk, Livy, Hadoop, and Spark. We are using all of these applications in Amazon EMR and they're helping us a lot."
"The solution helps us manage huge volumes of data."
"There's a lot of functionality."
"The solution has been very stable."
"It's easy to prepare parallelism in Spark, run the solution with specific parameters, and get good performance."
"DataFrame: Spark SQL gives the leverage to create applications more easily and with less coding effort."
"We use Spark to process data from different data sources."
"Spark helps us reduce startup time for our customers and gives a very high ROI in the medium term."
"The solution is very stable."
"With Hadoop-related technologies, we can distribute the workload with multiple commodity hardware."
"We also really like the Cloudera community. You can have any question and will have your answer within a few hours."
"The product as a whole is good."
"The solution is reliable and stable, and it fits our requirements."
"CDH has a wide variety of proprietary tools that we use, like Impala. So from that perspective, it's quite useful as opposed to something open-source. We get a lot of value from Cloudera's proprietary tools."
"The file system is a valuable feature."
"The product provides better data processing features than other tools."
"The product is completely secure."
"The scalability of Cloudera Distribution for Hadoop is excellent."
"Amazon EMR is continuously improving, but it could add something like out-of-the-box CI/CD or integration with Prometheus and Grafana."
"The product must add some of the latest technologies to provide more flexibility to the users."
"As people are shifting from legacy solutions to other technologies, Amazon EMR needs to add more features that give more flexibility in managing user data."
"We don't have much control. If we have multiple users and they want to scale up, the cost will increase, and we don't know how we can restrict that cost."
"Modules and strategies should be better handled and notified early in advance."
"The initial setup was time-consuming."
"The product's features for storing data in static clusters could be better."
"The most complicated thing is configuring the cluster and ensuring it's running correctly."
"Apache Spark's GUI and scalability could be improved."
"I know there is always discussion about which language to write applications in and some people do love Scala. However, I don't like it."
"Apache Spark could potentially improve in terms of user-friendliness, particularly for individuals with a SQL background. While it's suitable for those with programming knowledge, making it more accessible to those without extensive programming skills could be beneficial."
"If you have a Spark session in the background, sometimes it's very hard to kill these sessions because of D allocation."
"Apache Spark should add some resource management improvements to the algorithms."
"More ML-based algorithms should be added to it to make it algorithm-rich for developers."
"Apache Spark could improve the connectors that it supports. There are a lot of open-source databases in the market. For example, cloud databases, such as Redshift, Snowflake, and Synapse. Apache Spark should have connectors present to connect to these databases. There are a lot of workarounds required to connect to those databases, but it should have inbuilt connectors."
"When using Spark, users may need to write their own parallelization logic, which requires additional effort and expertise."
"It would be useful if Cloudera had more tools like SQL Engines that offer the traditional relational database. We have to do a lot of work preparing the data outside Cloudera before getting it into the platform."
"The initial setup of Cloudera is difficult."
"There is a maximum of a one-gigabyte block size, which is an area of storage that can be improved upon."
"Currently, we are using many other tools such as Spark and Blade Job to improve the performance."
"The solution is not fit for on-premise distributions."
"There are better solutions out there that have more features than this one."
"The procedure for operations could be simplified."
"The user infrastructure and user interface need to be improved, as well as the performance. The GUI needs to be better."