Apache Spark vs Google Cloud Dataflow Comparison 2024

Apache Spark: 60 reviews | 2,498 views | 1,884 comparisons

Google Cloud Dataflow: 10 reviews | 4,813 views | 3,977 comparisons

Comparison Buyer's Guide: Hadoop, April 2024

Executive Summary

We performed a comparison between Apache Spark and Google Cloud Dataflow based on real PeerSpot user reviews.

Find out what your peers are saying about Apache, Cloudera, Amazon Web Services (AWS) and others in Hadoop.

To learn more, read our detailed Hadoop Report (Updated: April 2024).

768,740 professionals have used our research since 2012.

Featured Review

Anonymous User

Quantitative Developer at a marketing services firm

Seamless in distributing tasks, including its impressive map-reduce functionality

I have an example. We had a single-threaded application that used to run for about four to five hours, but with Spark, it got reduced to under one...

Darasimi Ajewole

Software Engineer at Formplus

Helps to run batch-specific jobs, but notifications for error messages could be more detailed

Migrating our batch processing jobs to Google Cloud Dataflow led to a reduction in cost by 70%.

Quotes From Members

We asked business professionals to review the solutions they use.
Here are some excerpts of what they said:

Pros

Apache Spark

"Now, when we're tackling sentiment analysis using NLP technologies, we deal with unstructured data: customer chats, feedback on promotions or demos, and even media like images, audio, and video files. For processing such data, we rely on PySpark. Beneath the surface, Spark functions as a compute engine with in-memory processing capabilities, enhancing performance through features like broadcasting and caching. It's become a crucial tool, widely adopted by 90% of companies for a decade or more."

"With Spark, we parallelize our operations, efficiently accessing both historical and real-time data."

"It is useful for handling large amounts of data. It is very useful for scientific purposes."

"It's easy to prepare parallelism in Spark, run the solution with specific parameters, and get good performance."

"There's a lot of functionality."

"The product’s most valuable feature is the SQL tool. It enables us to create a database and publish it."

"The most valuable feature of Apache Spark is its flexibility."

"The deployment of the product is easy."

Google Cloud Dataflow

"The solution allows us to program in any language we desire."

"The service is relatively cheap compared to other batch-processing engines."

"The most valuable features of Google Cloud Dataflow are the integration, it's very simple if you have the complete stack, which we are using. It is overall very easy to use, user-friendly, and cost-effective if you know how to use it. The solution is very flexible for programmers, if you know how to do scripts or program in Python or any other language, it's extremely easy to use."

"The most valuable features of Google Cloud Dataflow are scalability and connectivity."

"The support team is good and it's easy to use."

"I don't need a server running all the time while using the tool. It is also easy to setup. The product offers a pay-as-you-go service."

"The best feature of Google Cloud Dataflow is its practical connectedness."

"It is a scalable solution."

Cons

Apache Spark

"Apache Spark is very difficult to use. It would require a data engineer. It is not available for every engineer today because they need to understand the different concepts of Spark, which is very, very difficult and it is not easy to learn."

"They could improve the issues related to programming language for the platform."

"We use big data manager but we cannot use it as conditional data so whenever we're trying to fetch the data, it takes a bit of time."

"Stream processing needs to be developed more in Spark. I have used Flink previously. Flink is better than Spark at stream processing."

"Dynamic DataFrame options are not yet available."

"At the initial stage, the product provides no container logs to check the activity."

"In data analysis, you need to take real-time data from different data sources. You need to process this in a subsecond, do the transformation in a subsecond, and all that."

"It's not easy to install."

Google Cloud Dataflow

"Google Cloud Data Flow can improve by having full simple integration with Kafka topics. It's not that complicated, but it could improve a bit. The UI is easy to use but the experience could be better. There are other tools available that do a better job."

"The deployment time could also be reduced."

"I would like Google Cloud Dataflow to be integrated with IT data flow and other related services to make it easier to use as it is a complex tool."

"There are certain challenges regarding the Google Cloud Composer which can be improved."

"They should do a market survey and then make improvements."

"Google Cloud Dataflow should include a little cost optimization."

"When I deploy the product in local errors, a lot of errors pop up which are not always caught. The solution's error logging is bad. It can take a lot of time to debug the errors. It needs to have better logs."

"The solution's setup process could be more accessible."

Pricing and Cost Advice

Apache Spark

"Since we are using the Apache Spark version, not the Databricks version, it is an Apache license version, the support and resolution of the bug are actually late or delayed. The Apache license is free."

"Apache Spark is open-source. You have to pay only when you use any bundled product, such as Cloudera."

"We are using the free version of the solution."

"Apache Spark is not too cheap. You have to pay for hardware and Cloudera licenses. Of course, there is a solution with open source without Cloudera."

"Apache Spark is an expensive solution."

"Spark is an open-source solution, so there are no licensing costs."

"On the cloud model can be expensive as it requires substantial resources for implementation, covering on-premises hardware, memory, and licensing."

"It is an open-source solution, it is free of charge."

Google Cloud Dataflow

"The price of the solution depends on many factors, such as how they pay for tools in the company and its size."

"Google Cloud is slightly cheaper than AWS."

"The tool is cheap."

"Google Cloud Dataflow is a cheap solution."

"The solution is cost-effective."

"On a scale from one to ten, where one is cheap, and ten is expensive, I rate Google Cloud Dataflow's pricing a four out of ten."

"On a scale from one to ten, where one is cheap, and ten is expensive, I rate the solution's pricing a seven to eight out of ten."

"The solution is not very expensive."

Questions from the Community

What do you like most about Apache Spark?

Top Answer: We use Spark to process data from different data sources.

What is your experience regarding pricing and costs for Apache Spark?

Top Answer: The solution is moderately priced.

What needs improvement with Apache Spark?

Top Answer: In data analysis, you need to take real-time data from different data sources. You need to process this in a subsecond, and do the transformation in a subsecond.

What do you like most about Google Cloud Dataflow?

Top Answer: The product's installation process is easy... The tool's maintenance part is somewhat easy.

What is your experience regarding pricing and costs for Google Cloud Data...

Top Answer: The solution is not very expensive.

What needs improvement with Google Cloud Dataflow?

Top Answer: The authentication part of the product is an area of concern where improvements are required. For some common users, the solution's authentication part is difficult to use. The scalability of the…

Ranking

Apache Spark
Ranking: 1st out of 22 in Hadoop
Views: 2,498
Comparisons: 1,884
Reviews: 60
Average Words per Review: 432
Rating: 8.7

Google Cloud Dataflow
Ranking: 7th out of 38 in Streaming Analytics
Views: 4,813
Comparisons: 3,977
Reviews: 10
Average Words per Review: 308
Rating: 7.7

Comparisons

Spring Boot vs. Apache Spark: compared 31% of the time.
AWS Batch vs. Apache Spark: compared 10% of the time.
Spark SQL vs. Apache Spark: compared 10% of the time.
SAP HANA vs. Apache Spark: compared 8% of the time.
Cloudera Distribution for Hadoop vs. Apache Spark: compared 6% of the time.

Databricks vs. Google Cloud Dataflow: compared 28% of the time.
Apache NiFi vs. Google Cloud Dataflow: compared 15% of the time.
Amazon MSK vs. Google Cloud Dataflow: compared 11% of the time.
Amazon Kinesis vs. Google Cloud Dataflow: compared 11% of the time.
Talend Data Streams vs. Google Cloud Dataflow: compared 1% of the time.

Also Known As

Google Cloud Dataflow: Google Dataflow

Overview

Spark provides programmers with an application programming interface centered on a data structure called the resilient distributed dataset (RDD), a read-only multiset of data items distributed over a cluster of machines and maintained in a fault-tolerant way. It was developed in response to limitations in the MapReduce cluster computing paradigm, which forces a particular linear dataflow structure on distributed programs: MapReduce programs read input data from disk, map a function across the data, reduce the results of the map, and store reduction results on disk. Spark's RDDs function as a working set for distributed programs that offers a (deliberately) restricted form of distributed shared memory.
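The contrast described above can be sketched in plain Python. This is a toy illustration only and does not use Spark's API: the names `write_input`, `map_reduce_from_disk`, and `WorkingSet` are hypothetical, and a local temporary file stands in for distributed storage. It shows a MapReduce-style job going back to disk for its input on every run, versus a read-only working set that is loaded once and reused in memory across several computations.

```python
import os
import tempfile
from functools import reduce


def write_input(path, numbers):
    """Store one number per line; stands in for data on distributed storage."""
    with open(path, "w") as f:
        f.write("\n".join(str(n) for n in numbers))


def map_reduce_from_disk(path, map_fn, reduce_fn):
    """MapReduce-style linear dataflow: read from disk, map, then reduce."""
    with open(path) as f:
        records = [int(line) for line in f if line.strip()]
    return reduce(reduce_fn, map(map_fn, records))


class WorkingSet:
    """Toy in-memory working set (hypothetical; not a real RDD)."""

    def __init__(self, path):
        # The input is read from disk exactly once and kept as a read-only tuple.
        with open(path) as f:
            self._items = tuple(int(line) for line in f if line.strip())

    def map(self, fn):
        # Return a new working set; the original is never modified.
        new = WorkingSet.__new__(WorkingSet)
        new._items = tuple(fn(x) for x in self._items)
        return new

    def reduce(self, fn):
        return reduce(fn, self._items)


def add(a, b):
    return a + b


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "input.txt")
    write_input(path, [1, 2, 3, 4])

    # Two MapReduce-style jobs: each one re-reads its input from disk.
    total_disk = map_reduce_from_disk(path, lambda x: x, add)
    squares_disk = map_reduce_from_disk(path, lambda x: x * x, add)

    # Working-set style: load once, then run both computations from memory.
    ws = WorkingSet(path)
    total_mem = ws.reduce(add)
    squares_mem = ws.map(lambda x: x * x).reduce(add)

print(total_disk, squares_disk)  # 10 30
print(total_mem, squares_mem)    # 10 30
```

Both styles compute the same results; the difference the sketch is meant to highlight is only where the data lives between computations. Fault tolerance and distribution across machines are not modeled here.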

Google Dataflow is a unified programming model and a managed service for developing and executing a wide range of data processing patterns including ETL, batch computation, and continuous computation. Cloud Dataflow frees you from operational tasks like resource management and performance optimization.
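As a loose illustration of what a "unified" model for batch and continuous computation can mean, the plain-Python sketch below applies one pipeline definition to both a bounded input and an unbounded one. It does not use the Dataflow service or any SDK; `running_totals` and `endless_events` are hypothetical names chosen for this example.

```python
import itertools
from typing import Iterable, Iterator


def running_totals(events: Iterable[int]) -> Iterator[int]:
    """One pipeline definition: emit the cumulative sum after each event."""
    total = 0
    for value in events:
        total += value
        yield total


def endless_events() -> Iterator[int]:
    """Hypothetical unbounded source: yields 1, 2, 3, ... forever."""
    return itertools.count(1)


# Batch: a bounded input is processed to completion.
batch_result = list(running_totals([5, 10, 15]))

# Continuous: the same pipeline runs over an unbounded source.
# Only the first four results are taken so that the example terminates.
stream_result = list(itertools.islice(running_totals(endless_events()), 4))

print(batch_result)   # [5, 15, 30]
print(stream_result)  # [1, 3, 6, 10]
```

The transform is written once and is indifferent to whether its input ever ends. A managed service would add the resource management and scaling that this sketch leaves out.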

Sample Customers

Apache Spark: NASA JPL, UC Berkeley AMPLab, Amazon, eBay, Yahoo!, UC Santa Cruz, TripAdvisor, Taboola, Agile Lab, Art.com, Baidu, Alibaba Taobao, EURECOM, Hitachi Solutions

Google Cloud Dataflow: Absolutdata, Backflip Studios, Bluecore, Claritics, Crystalloids, Energyworx, GenieConnect, Leanplum, Nomanini, Redbus, Streak, TabTale

Top Industries

REVIEWERS

Computer Software Company: 30%
Financial Services Firm: 15%
University: 9%
Marketing Services Firm: 6%

VISITORS READING REVIEWS

Financial Services Firm: 24%
Computer Software Company: 13%
Manufacturing Company: 7%
Comms Service Provider: 6%

VISITORS READING REVIEWS

Financial Services Firm: 14%
Computer Software Company: 12%
Retailer: 11%
Manufacturing Company: 10%

Company Size

REVIEWERS (Apache Spark)

Small Business: 40%
Midsize Enterprise: 19%
Large Enterprise: 40%

VISITORS READING REVIEWS (Apache Spark)

Small Business: 17%
Midsize Enterprise: 12%
Large Enterprise: 71%

REVIEWERS (Google Cloud Dataflow)

Small Business: 27%
Midsize Enterprise: 18%
Large Enterprise: 55%

VISITORS READING REVIEWS (Google Cloud Dataflow)

Small Business: 17%
Midsize Enterprise: 12%
Large Enterprise: 72%
