Compare Apache Spark vs. Cloudera Distribution for Hadoop

Find out what your peers are saying about Apache Spark vs. Cloudera Distribution for Hadoop and other solutions. Updated: January 2021.
464,369 professionals have used our research since 2012.
Quotes From Members

We asked business professionals to review the solutions they use. Here are some excerpts of what they said:

Pros

Apache Spark:
"I found the solution stable. We haven't had any problems with it."
"The scalability has been the most valuable aspect of the solution."
"The most valuable feature of this solution is its capacity for processing large amounts of data."
"The solution is very stable."
"I feel the streaming is its best feature."
"The features we find most valuable are the machine learning, data learning, and Spark Analytics."
"The main feature that we find valuable is that it is very fast."
"The processing time is very much improved over the data warehouse solution that we were using."

Cloudera Distribution for Hadoop:
"The features I find most valuable is that the solution is that it is easy to install and to work with. It starts with the installation and from there on the management is very simple and centralized."
"The search function is the most valuable aspect of the solution."
"Provides a viable open-source solution for enterprise implementations and reliable, intelligent data analysis."
"We experienced many issues when we started working with Hadoop 3.0 in the Cloudera 6.0 version, so there are a lot of things that need to improve. I believe they are working on that."
"In terms of scalability, if you have enough hardware you can scale out. Scalability doesn't have any issues."
"The most valuable feature is Impala, the querying engine, which is very fast."
"We also really like the Cloudera community. You can have any question and will have your answer within a few hours."
"The most valuable feature is Kubernetes."

Cons

Apache Spark:
"It needs a new interface and a better way to get some data. In terms of writing our scripts, some processes could be faster."
"The management tools could use improvement. Some of the debugging tools need some work as well. They need to be more descriptive."
"When you first start using this solution, it is common to run into memory errors when you are dealing with large amounts of data."
"The solution needs to optimize shuffling between workers."
"When you want to extract data from your HDFS and other sources then it is kind of tricky because you have to connect with those sources."
"We've had problems using a Python process to try to access something in a large volume of data. It crashes if somebody gives me the wrong code because it cannot handle a large volume of data."
"We use big data manager but we cannot use it as conditional data so whenever we're trying to fetch the data, it takes a bit of time."
"I would like to see integration with data science platforms to optimize the processing capability for these tasks."

Cloudera Distribution for Hadoop:
"I would like to see an improvement in how the solution helps me to handle the whole cluster."
"The user infrastructure and user interface needs to be improved, as well as the performance. The GUI needs to be better."
"The solution does not support multiple languages very well and this means users need to create work-arounds to implement some solutions."
"We experienced many issues when we started working with Hadoop 3.0 in the Cloudera 6.0 version, so there is a lot of things that need to improve."
"The one thing that we struggled with predominately was support. Because it was relatively new, support was always a big issue and I think it's still a bit of an ongoing concern with the team currently managing it."
"There is a maximum of a one-gigabyte block size, which is an area of storage that can be improved upon."
"Without the big data environment, we cannot store all of this data live. We have billions of records and terabytes of storage to be used. It's not an option actually for us to have a big data environment."
"The price of this solution could be lowered."

Pricing and Cost Advice

Apache Spark:
"Apache Spark is open-source. You have to pay only when you use any bundled product, such as Cloudera."

Cloudera Distribution for Hadoop:
"When comparing with Oracle Sybase and SQL, it's cheaper. It's not expensive."
"The price could be better for the product."

Questions from the Community
Top Answer: SQreamDB is a GPU DB. It is not suitable for real-time OLTP of course. Cassandra is best suited for OLTP database use cases, when you need a scalable database (instead of SQL server, Postgres)…
Top Answer: I love every core functionality of Apache Spark. Initially they have only provided RDD basic interface to process the data across distributed cluster. Then it evolved to dataframe and dataset interface…
Top Answer: Apache Spark is available in cloud services like AWS cloud, Azure. We have to use the specific service for our use case. For example we can use AWS Glue which runs Spark for ETL process, AWS EMR…
Ranking
Apache Spark
Ranking: 1st out of 22 in Hadoop
Views: 11,334
Comparisons: 9,218
Reviews: 12
Average Words per Review: 388
Rating: 8.3

Cloudera Distribution for Hadoop
Ranking: 2nd out of 22 in Hadoop
Views: 4,915
Comparisons: 3,303
Reviews: 11
Average Words per Review: 408
Rating: 7.5
Overview

Spark provides programmers with an application programming interface centered on a data structure called the resilient distributed dataset (RDD), a read-only multiset of data items distributed over a cluster of machines and maintained in a fault-tolerant way. It was developed in response to limitations in the MapReduce cluster computing paradigm, which forces a particular linear dataflow structure on distributed programs: MapReduce programs read input data from disk, map a function across the data, reduce the results of the map, and store reduction results on disk. Spark's RDDs function as a working set for distributed programs that offers a (deliberately) restricted form of distributed shared memory.
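
To make the contrast concrete, here is a toy, single-machine sketch in plain Python. It is not Spark or Hadoop code and uses neither API; the function names (`mapreduce_pass`, `working_set_passes`, `demo`) and the JSON-lines file layout are assumptions made up for illustration. The first function mimics the linear MapReduce dataflow (read from disk, map, reduce, write to disk), while the second keeps a read-only working set in memory and reuses it across several computations, which is roughly the idea behind an RDD.

```python
import json
import os
import tempfile
from functools import reduce


def mapreduce_pass(in_path, out_path, map_fn, reduce_fn, initial):
    """One MapReduce-style pass: read from disk, map, reduce, write to disk."""
    with open(in_path) as f:
        records = [json.loads(line) for line in f]
    result = reduce(reduce_fn, map(map_fn, records), initial)
    with open(out_path, "w") as f:
        f.write(json.dumps(result) + "\n")
    return result


def working_set_passes(records, steps):
    """RDD-like idea: load the data once, then reuse it for every pass."""
    cached = tuple(records)  # read-only working set held in memory
    return [reduce(r, map(m, cached), init) for m, r, init in steps]


def demo():
    data = [1, 2, 3, 4]
    add = lambda a, b: a + b
    with tempfile.TemporaryDirectory() as tmp:
        in_path = os.path.join(tmp, "input.jsonl")
        with open(in_path, "w") as f:
            for x in data:
                f.write(json.dumps(x) + "\n")
        # Two independent passes, each re-reading the input from disk.
        total = mapreduce_pass(
            in_path, os.path.join(tmp, "sum.json"), lambda x: x, add, 0)
        squares = mapreduce_pass(
            in_path, os.path.join(tmp, "squares.json"), lambda x: x * x, add, 0)
    # The same two computations over a single in-memory working set.
    cached = working_set_passes(
        data, [(lambda x: x, add, 0), (lambda x: x * x, add, 0)])
    return total, squares, cached


if __name__ == "__main__":
    print(demo())
```

Both approaches compute the same sums; the difference is that the first touches the disk on every pass, while the second reads the data once. A real RDD additionally partitions the data across machines and can rebuild lost partitions, which this sketch does not attempt to model.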

Cloudera Distribution for Hadoop (CDH) is the world's most complete, tested, and popular distribution of Apache Hadoop and related projects. CDH is 100% Apache-licensed open source and is the only Hadoop solution to offer unified batch processing, interactive SQL, interactive search, and role-based access controls. More enterprises have downloaded CDH than all other such distributions combined.
Sample Customers
Apache Spark: NASA JPL, UC Berkeley AMPLab, Amazon, eBay, Yahoo!, UC Santa Cruz, TripAdvisor, Taboola, Agile Lab, Art.com, Baidu, Alibaba Taobao, EURECOM, Hitachi Solutions
Cloudera Distribution for Hadoop: 37signals, Adconion, adgooroo, Aggregate Knowledge, AMD, Apollo Group, Blackberry, Box, BT, CSC
Top Industries
Apache Spark
REVIEWERS
Financial Services Firm: 44%
Computer Software Company: 22%
Marketing Services Firm: 11%
Non Profit: 11%
VISITORS READING REVIEWS
Computer Software Company: 25%
Comms Service Provider: 19%
Media Company: 10%
Financial Services Firm: 10%

Cloudera Distribution for Hadoop
REVIEWERS
Financial Services Firm: 43%
Computer Software Company: 21%
Marketing Services Firm: 14%
Healthcare Company: 7%
VISITORS READING REVIEWS
Computer Software Company: 31%
Comms Service Provider: 17%
Financial Services Firm: 11%
Media Company: 6%
Company Size
Apache Spark
REVIEWERS
Small Business: 38%
Midsize Enterprise: 22%
Large Enterprise: 41%

Cloudera Distribution for Hadoop
REVIEWERS
Small Business: 26%
Midsize Enterprise: 19%
Large Enterprise: 55%

Apache Spark is ranked 1st in Hadoop with 13 reviews while Cloudera Distribution for Hadoop is ranked 2nd in Hadoop with 11 reviews. Apache Spark is rated 8.2, while Cloudera Distribution for Hadoop is rated 7.6. The top reviewer of Apache Spark writes "Good Streaming features enable to enter data and analysis within Spark Stream". On the other hand, the top reviewer of Cloudera Distribution for Hadoop writes "Open-source solution for intelligent data management and analysis". Apache Spark is most compared with Spring Boot, Azure Stream Analytics, AWS Batch, SAP HANA and Amazon EMR, whereas Cloudera Distribution for Hadoop is most compared with Amazon EMR, HPE Ezmeral Data Fabric, Cassandra, Hortonworks Data Platform and MongoDB.


We monitor all Hadoop reviews to prevent fraudulent reviews and keep review quality high. We do not post reviews by company employees or direct competitors. We validate each review for authenticity via cross-reference with LinkedIn, and personal follow-up with the reviewer when necessary.