We performed a comparison between Apache Spark, Cloudera Distribution for Hadoop, and IBM InfoSphere BigInsights [EOL] based on real PeerSpot user reviews.
"The most valuable feature of Apache Spark is its ease of use."
"Now, when we're tackling sentiment analysis using NLP technologies, we deal with unstructured data—customer chats, feedback on promotions or demos, and even media like images, audio, and video files. For processing such data, we rely on PySpark. Beneath the surface, Spark functions as a compute engine with in-memory processing capabilities, enhancing performance through features like broadcasting and caching. It's become a crucial tool, widely adopted by 90% of companies for a decade or more."
"It is highly scalable, allowing you to efficiently work with extensive datasets that might be problematic to handle using traditional tools that are memory-constrained."
"This solution provides a clear and convenient syntax for our analytical tasks."
"Apache Spark can do large volume interactive data analysis."
"The features we find most valuable are the machine learning, data learning, and Spark Analytics."
"The memory processing engine is the solution's most valuable aspect. It processes everything extremely fast, and it's in the cluster itself. It acts as a memory engine and is very effective in processing data correctly."
"The processing time is very much improved over the data warehouse solution that we were using."
"Customer service and support were able to fix whatever the issue was."
"The solution is stable."
"The search function is the most valuable aspect of the solution."
"Provides a viable open-source solution for enterprise implementations and reliable, intelligent data analysis."
"The feature I find most valuable is that the solution is easy to install and to work with. It starts with the installation, and from there on the management is very simple and centralized."
"The product as a whole is good."
"The most valuable feature is Impala, the querying engine, which is very fast."
"The data science aspect of the solution is valuable."
"InfoSphere Streams was the one core product from the platform that we were using. We were building a real-time response system and we built it on InfoSphere Streams."
"They could improve the issues related to programming language for the platform."
"I would like to see integration with data science platforms to optimize the processing capability for these tasks."
"At the initial stage, the product provides no container logs to check the activity."
"When you are working with large, complex tasks, the garbage collection process is slow and affects performance."
"When you want to extract data from your HDFS and other sources then it is kind of tricky because you have to connect with those sources."
"The product could improve the user interface and make it easier for new users."
"Apache Spark provides very good performance. The tuning phase is still tricky."
"In data analysis, you need to take real-time data from different data sources. You need to process this in a subsecond, do the transformation in a subsecond, and all that."
"The price of this solution could be lowered."
"Without the big data environment, we cannot store all of this data live. We have billions of records and terabytes of storage to be used. Having a big data environment is not optional for us."
"Cloudera Distribution for Hadoop is not always completely stable in some cases, which can be a concern for big data solutions."
"The pricing needs to improve."
"The Cloudera training has deteriorated significantly."
"There is a maximum of a one-gigabyte block size, which is an area of storage that can be improved upon."
"The solution does not support multiple languages very well and this means users need to create work-arounds to implement some solutions."
"The areas of improvement depend on the scale of the project. For banking customers, security features and an essential budget for commercial licenses would be the top priority. Data regulation could be the most crucial for a project with extensive data or an extra use case."
"The UI was not interactive: Responses used to be very slow and hang up at times."