We performed a comparison between Amazon EMR and Apache Spark based on real PeerSpot user reviews.
Find out in this report how the two Hadoop solutions compare in terms of features, pricing, service and support, ease of deployment, and ROI.
"The project management is very streamlined."
"We are using applications, such as Splunk, Livy, Hadoop, and Spark. We are using all of these applications in Amazon EMR and they're helping us a lot."
"When we grade big jobs from on-prem to the cloud, we do it in EMR with Spark."
"The ability to resize the cluster is what really makes it stand out over other Hadoop and big data solutions."
"In Amazon EMR it is easy to rebuild anything, easy to upgrade and has good fault tolerance."
"The solution is scalable."
"One of the valuable features about this solution is that it's managed services, so it's pretty stable, and scalable as much as you wish. It has all the necessary distributions. With some additional work, it's also possible to change to a Spark version with the latest version of EMR. It also has Hudi, so we are leveraging Apache Hudi on EMR for change data capture, so then it comes out-of-the-box in EMR."
"This is the best tool for hosts and it's really flexible and scalable."
"The tool's most valuable feature is its speed and efficiency. It's much faster than other tools and excels in parallel data processing. Unlike tools like Python or JavaScript, which may struggle with parallel processing, it allows us to handle large volumes of data with more power easily."
"The deployment of the product is easy."
"Features include machine learning, real time streaming, and data processing."
"There's a lot of functionality."
"The product is useful for analytics."
"Spark can handle small to huge data and is suitable for any size of company."
"The most valuable feature of this solution is its capacity for processing large amounts of data."
"We use Spark to process data from different data sources."
"We don't have much control. If we have multiple users, if they want to scale up, the cost will go and increase and we don't know how we can restrict that price part."
"The initial setup was time-consuming."
"The product's features for storing data in static clusters could be better."
"The dashboard management could be better. Right now, it's lacking a bit."
"Amazon EMR can improve by adding some features, such as megastore services and HiveServer2. Additionally, the user interface could be better, similar to what Apache service provides, cross-platform services."
"The legacy versions of the solution are not supported in the new versions."
"The most complicated thing is configuring to the cluster and ensure it's running correctly."
"The product must add some of the latest technologies to provide more flexibility to the users."
"The graphical user interface (UI) could be a bit more clear. It's very hard to figure out the execution logs and understand how long it takes to send everything. If an execution is lost, it's not so easy to understand why or where it went. I have to manually drill down on the data processes which takes a lot of time. Maybe there could be like a metrics monitor, or maybe the whole log analysis could be improved to make it easier to understand and navigate."
"The product could improve the user interface and make it easier for new users."
"The solution needs to optimize shuffling between workers."
"Spark could be improved by adding support for other open-source storage layers than Delta Lake."
"In data analysis, you need to take real-time data from different data sources. You need to process this in a subsecond, do the transformation in a subsecond, and all that."
"Apart from the restrictions that come with its in-memory implementation. It has been improved significantly up to version 3.0, which is currently in use."
"Its UI can be better. Maintaining the history server is a little cumbersome, and it should be improved. I had issues while looking at the historical tags, which sometimes created problems. You have to separately create a history server and run it. Such things can be made easier. Instead of separately installing the history server, it can be made a part of the whole setup so that whenever you set it up, it becomes available."
"We use big data manager but we cannot use it as conditional data so whenever we're trying to fetch the data, it takes a bit of time."
Amazon EMR is ranked 3rd in Hadoop with 20 reviews, while Apache Spark is ranked 2nd in Hadoop with 58 reviews. Amazon EMR is rated 7.8, while Apache Spark is rated 8.4. The top reviewer of Amazon EMR writes "Provides efficient data processing features and has good scalability". On the other hand, the top reviewer of Apache Spark writes "Reliable, able to expand, and handle large amounts of data well". Amazon EMR is most compared with Cloudera Distribution for Hadoop, Snowflake, Amazon Redshift, Azure Data Factory, and Microsoft Azure Synapse Analytics, whereas Apache Spark is most compared with Spring Boot, AWS Batch, Spark SQL, SAP HANA, and AWS Fargate. See our Amazon EMR vs. Apache Spark report.
We monitor all Hadoop reviews to prevent fraudulent reviews and keep review quality high. We do not post reviews by company employees or direct competitors. We validate each review for authenticity via cross-reference with LinkedIn, and personal follow-up with the reviewer when necessary.