Most Helpful Review
Researched Cloudera Distribution for Hadoop but chose Apache Spark: Good streaming features enable data entry and analysis within Spark Streaming
Find out what your peers are saying about Apache Spark vs. Cloudera Distribution for Hadoop and other solutions. Updated: March 2020.
We asked business professionals to review the solutions they use. Here are some excerpts of what they said:
The processing time is very much improved over the data warehouse solution that we were using.
The main feature that we find valuable is that it is very fast.
The features we find most valuable are the machine learning, data learning, and Spark Analytics.
I feel the streaming is its best feature.
The solution is very stable.
The most valuable feature of this solution is its capacity for processing large amounts of data.
I found the solution stable. We haven't had any problems with it.
The scalability has been the most valuable aspect of the solution.
The most valuable feature is Kubernetes.
We also really like the Cloudera community. You can ask any question and have an answer within a few hours.
The most valuable feature is Impala, the querying engine, which is very fast.
In terms of scalability, if you have enough hardware you can scale out. Scalability doesn't have any issues.
Provides a viable open-source solution for enterprise implementations and reliable, intelligent data analysis.
The search function is the most valuable aspect of the solution.
We experienced many issues when we started working with Hadoop 3.0 in the Cloudera 6.0 version, so there are a lot of things that need to improve. I believe they are working on that.
The feature I find most valuable is that the solution is easy to install and to work with. It starts with the installation, and from there on the management is very simple and centralized.
I would like to see integration with data science platforms to optimize the processing capability for these tasks.
We use big data manager but we cannot use it as conditional data so whenever we're trying to fetch the data, it takes a bit of time.
We've had problems using a Python process to try to access something in a large volume of data. It crashes if somebody gives me the wrong code because it cannot handle a large volume of data.
When you want to extract data from your HDFS and other sources then it is kind of tricky because you have to connect with those sources.
The solution needs to optimize shuffling between workers.
When you first start using this solution, it is common to run into memory errors when you are dealing with large amounts of data.
It needs a new interface and a better way to get some data. In terms of writing our scripts, some processes could be faster.
The management tools could use improvement. Some of the debugging tools need some work as well. They need to be more descriptive.
The price of this solution could be lowered.
Without the big data environment, we cannot store all of this data live. We have billions of records and terabytes of storage to be used. It's not an option actually for us to have a big data environment.
There is a maximum of a one-gigabyte block size, which is an area of storage that can be improved upon.
The one thing that we struggled with predominately was support. Because it was relatively new, support was always a big issue and I think it's still a bit of an ongoing concern with the team currently managing it.
The solution does not support multiple languages very well and this means users need to create work-arounds to implement some solutions.
The user infrastructure and user interface need to be improved, as well as the performance. The GUI needs to be better.
We experienced many issues when we started working with Hadoop 3.0 in the Cloudera 6.0 version, so there are a lot of things that need to improve.
I would like to see an improvement in how the solution helps me to handle the whole cluster.
Spark provides programmers with an application programming interface centered on a data structure called the resilient distributed dataset (RDD), a read-only multiset of data items distributed over a cluster of machines and maintained in a fault-tolerant way. It was developed in response to limitations in the MapReduce cluster computing paradigm, which forces a particular linear dataflow structure on distributed programs: MapReduce programs read input data from disk, map a function across the data, reduce the results of the map, and store reduction results on disk. Spark's RDDs function as a working set for distributed programs that offers a (deliberately) restricted form of distributed shared memory.
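The contrast described above can be sketched in a few lines of plain Python. This is a single-process illustration only, not Spark's or Hadoop's API: there is no cluster, no partitioning, and no fault tolerance, and the names `mapreduce_job`, `working_set_jobs`, and `demo` are hypothetical. The first function follows the linear MapReduce dataflow (read from disk, map, reduce, write to disk); the second loads the data once into a read-only in-memory working set and reuses it for several computations, which is the idea RDDs build on.

```python
import json
import os
import tempfile
from functools import reduce
from operator import add


def mapreduce_job(in_path, out_path, map_fn, reduce_fn, initial):
    """One MapReduce-style pass: read from disk, map, reduce, store on disk."""
    with open(in_path) as f:
        records = json.load(f)
    mapped = [map_fn(r) for r in records]
    result = reduce(reduce_fn, mapped, initial)
    with open(out_path, "w") as f:
        json.dump(result, f)
    return result


def working_set_jobs(in_path, jobs):
    """Read the input once, keep it in memory, and run several jobs on it.

    Each job is a (map_fn, reduce_fn, initial) triple. The tuple stands in
    for a read-only working set; it is NOT a real RDD.
    """
    with open(in_path) as f:
        working_set = tuple(json.load(f))
    return [reduce(r, map(m, working_set), init) for m, r, init in jobs]


def demo():
    """Run both styles on a tiny hypothetical dataset."""
    with tempfile.TemporaryDirectory() as tmp:
        in_path = os.path.join(tmp, "input.json")
        out_path = os.path.join(tmp, "output.json")
        with open(in_path, "w") as f:
            json.dump([1, 2, 3, 4], f)

        # Linear dataflow: every job pays for a disk read and a disk write.
        sum_of_squares = mapreduce_job(in_path, out_path, lambda x: x * x, add, 0)

        # Working set: one disk read, then two computations in memory.
        results = working_set_jobs(
            in_path,
            [(lambda x: x * x, add, 0), (lambda x: 1, add, 0)],
        )
        return sum_of_squares, results


if __name__ == "__main__":
    print(demo())
```

In the first style, running a second computation means another full read and write cycle; in the second, the cost of loading is paid once. A real Spark program would express the same steps through RDD or DataFrame operations distributed across workers.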
Cloudera Distribution for Hadoop is the world's most complete, tested, and popular distribution of Apache Hadoop and related projects. CDH is 100% Apache-licensed open source and is the only Hadoop solution to offer unified batch processing, interactive SQL, and interactive search, and role-based access controls. More enterprises have downloaded CDH than all other such distributions combined.
Apache Spark: NASA JPL, UC Berkeley AMPLab, Amazon, eBay, Yahoo!, UC Santa Cruz, TripAdvisor, Taboola, Agile Lab, Art.com, Baidu, Alibaba Taobao, EURECOM, Hitachi Solutions
Cloudera Distribution for Hadoop: 37signals, Adconion, adgooroo, Aggregate Knowledge, AMD, Apollo Group, Blackberry, Box, BT, CSC
Software R&D Company: 29%
Financial Services Firm: 29%
Marketing Services Firm: 14%
Software R&D Company: 35%
Comms Service Provider: 11%
Financial Services Firm: 8%
Financial Services Firm: 40%
Marketing Services Firm: 20%
Software R&D Company: 36%
Comms Service Provider: 10%
Financial Services Firm: 7%