We performed a comparison between Apache Spark and Google Cloud Dataflow based on real PeerSpot user reviews.
"Now, when we're tackling sentiment analysis using NLP technologies, we deal with unstructured data: customer chats, feedback on promotions or demos, and even media like images, audio, and video files. For processing such data, we rely on PySpark. Beneath the surface, Spark functions as a compute engine with in-memory processing capabilities, enhancing performance through features like broadcasting and caching. It's become a crucial tool, widely adopted by 90% of companies for a decade or more."
"With Spark, we parallelize our operations, efficiently accessing both historical and real-time data."
"It is useful for handling large amounts of data. It is very useful for scientific purposes."
"It's easy to prepare parallelism in Spark, run the solution with specific parameters, and get good performance."
"There's a lot of functionality."
"The product’s most valuable feature is the SQL tool. It enables us to create a database and publish it."
"The most valuable feature of Apache Spark is its flexibility."
"The deployment of the product is easy."
"The solution allows us to program in any language we desire."
"The service is relatively cheap compared to other batch-processing engines."
"The most valuable feature of Google Cloud Dataflow is its integration, which is very simple if you have the complete stack, as we do. It is overall very easy to use, user-friendly, and cost-effective if you know how to use it. The solution is very flexible for programmers: if you know how to write scripts or program in Python or any other language, it's extremely easy to use."
"The most valuable features of Google Cloud Dataflow are scalability and connectivity."
"The support team is good and it's easy to use."
"I don't need a server running all the time while using the tool. It is also easy to set up. The product offers a pay-as-you-go service."
"The best feature of Google Cloud Dataflow is its practical connectedness."
"It is a scalable solution."
"Apache Spark is very difficult to use. It would require a data engineer. It is not available for every engineer today because they need to understand the different concepts of Spark, which is very, very difficult and it is not easy to learn."
"They could improve the issues related to programming language for the platform."
"We use big data manager but we cannot use it as conditional data so whenever we're trying to fetch the data, it takes a bit of time."
"Stream processing needs to be developed more in Spark. I have used Flink previously. Flink is better than Spark at stream processing."
"Dynamic DataFrame options are not yet available."
"At the initial stage, the product provides no container logs to check the activity."
"In data analysis, you need to take real-time data from different data sources. You need to process this in a subsecond, do the transformation in a subsecond, and all that."
"It's not easy to install."
"Google Cloud Dataflow could improve by offering full, simple integration with Kafka topics. It's not that complicated, but it could improve a bit. The UI is easy to use, but the experience could be better. There are other tools available that do a better job."
"The deployment time could also be reduced."
"I would like Google Cloud Dataflow to be integrated with IT data flow and other related services to make it easier to use as it is a complex tool."
"There are certain challenges regarding the Google Cloud Composer which can be improved."
"They should do a market survey and then make improvements."
"Google Cloud Dataflow should include a little cost optimization."
"When I deploy the product locally, a lot of errors pop up which are not always caught. The solution's error logging is bad. It can take a lot of time to debug the errors. It needs to have better logs."
"The solution's setup process could be more accessible."
Apache Spark is ranked 1st in Hadoop with 60 reviews while Google Cloud Dataflow is ranked 7th in Streaming Analytics with 10 reviews. Apache Spark is rated 8.4, while Google Cloud Dataflow is rated 7.8. The top reviewer of Apache Spark writes "Reliable, able to expand, and handle large amounts of data well". On the other hand, the top reviewer of Google Cloud Dataflow writes "Easy to use for programmers, user-friendly, and scalable". Apache Spark is most compared with Spring Boot, AWS Batch, Spark SQL, SAP HANA and Cloudera Distribution for Hadoop, whereas Google Cloud Dataflow is most compared with Databricks, Apache NiFi, Amazon MSK, Amazon Kinesis and Talend Data Streams.
We monitor all Hadoop reviews to prevent fraudulent reviews and keep review quality high. We do not post reviews by company employees or direct competitors. We validate each review for authenticity via cross-reference with LinkedIn, and personal follow-up with the reviewer when necessary.