Apache Spark Overview

Apache Spark is the #1 ranked solution in our list of top Hadoop tools. It is most often compared to Spring Boot.

What is Apache Spark?

Spark provides programmers with an application programming interface centered on a data structure called the resilient distributed dataset (RDD), a read-only multiset of data items distributed over a cluster of machines that is maintained in a fault-tolerant way. It was developed in response to limitations in the MapReduce cluster computing paradigm, which forces a particular linear dataflow structure on distributed programs: MapReduce programs read input data from disk, map a function across the data, reduce the results of the map, and store the reduction results on disk. Spark's RDDs function as a working set for distributed programs that offers a (deliberately) restricted form of distributed shared memory.
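The linear map-then-reduce dataflow described above can be sketched as a toy word count in plain Python (this illustrates the paradigm only; it is not Spark API code, and the sample input is invented for the example):

```python
from functools import reduce

# Toy input standing in for data read from disk.
lines = ["spark makes big data simple", "big data needs big memory"]

# Map step: apply a function across the data, emitting (word, 1) pairs.
mapped = [(word, 1) for line in lines for word in line.split()]

# Reduce step: fold the mapped pairs into per-word counts.
def combine(counts, pair):
    word, n = pair
    counts[word] = counts.get(word, 0) + n
    return counts

word_counts = reduce(combine, mapped, {})
print(word_counts["big"])  # 3
```

In Spark itself, the equivalent pipeline would typically be written with RDD transformations such as `flatMap`, `map`, and `reduceByKey`, with the intermediate RDDs held in cluster memory rather than written back to disk between stages, which is exactly the restriction of MapReduce that RDDs were designed to lift.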

Apache Spark Buyer's Guide

Download the Apache Spark Buyer's Guide including reviews and more. Updated: May 2021

Apache Spark Customers

NASA JPL, UC Berkeley AMPLab, Amazon, eBay, Yahoo!, UC Santa Cruz, TripAdvisor, Taboola, Agile Lab, Art.com, Baidu, Alibaba Taobao, EURECOM, Hitachi Solutions


Portfolio Manager, Enterprise Solutions Architect at Capgemini
Real User
Supports streaming and micro-batch

What is our primary use case?

Streaming telematics data.

How has it helped my organization?

It's a better MapReduce: it supports streaming and micro-batch, and it supports Spark ML and Spark SQL.

What is most valuable?

It supports streaming and micro-batch.

What needs improvement?

Better data lineage support.
Director - Data Management, Governance and Quality at Hilton
Real User
Powerful language but complicated coding

What is our primary use case?

Ingesting billions of rows of data all day.

How has it helped my organization?

Spark on AWS is not that cost-effective as memory is expensive and you cannot customize hardware in AWS. If you want more memory, you have to pay for more CPUs too in AWS.

What is most valuable?

Powerful language.

What needs improvement?

It is like going back to the '80s for the complicated coding that is required to write efficient programs.
User
Features include machine learning, real-time streaming, and data processing. It doesn't provide Spark job scheduling with monitoring capability.

What is our primary use case?

Used for building big data platforms for processing huge volumes of data. Additionally, streaming data is critical.

How has it helped my organization?

It provides a scalable machine learning library so that we can train and predict user behavior for promotion purposes.

What is most valuable?

Machine learning, real time streaming, and data processing are fantastic, as well as the resilient or fault tolerant feature.

What needs improvement?

I would suggest that it support more programming languages, and also provide an internal scheduler to schedule Spark jobs with monitoring capability.

For how long have I used the solution?

Trial/evaluations only.
Manager | Data Science Enthusiast | Management Consultant at a consultancy with 5,001-10,000 employees
Consultant
We can now harness richer data sets and benefit from use cases

How has it helped my organization?

Organisations can now harness richer data sets and benefit from use cases, which add value to their business functions.

What is most valuable?

Distributed in-memory processing. Some of the algorithms are resource-heavy, and executing them requires a lot of RAM and CPU. With Hadoop-related technologies, we can distribute the workload across multiple commodity machines.

What needs improvement?

Include more machine learning algorithms and the ability to handle streaming of data versus micro batch processing.

For how long have I used the solution?

Three to five years.

What do I think about the stability of the solution?

At times when users do not know how to use Spark and request a lot of resources, then the underlying JVMs can crash, which is a…
Big Data and Cloud Solution Consultant at a financial services firm with 10,001+ employees
Vendor
Provides flexibility for application creation with less coding effort

What is most valuable?

DataFrame: Spark SQL gives the leverage to create applications more easily and with less coding effort.

How has it helped my organization?

We developed a tool for data ingestion from HDFS->Raw->L1 layer with data quality checks, putting data to elastic search, performing CDC.

What needs improvement?

Dynamic DataFrame options are not yet available.

For how long have I used the solution?

One and a half years.

What do I think about the stability of the solution?

No.

What do I think about the scalability of the solution?

No.

What other advice do I have?

Spark gives the flexibility for developing custom applications.
Sr. Software Engineer at a tech vendor with 1-10 employees
Real User
Helped us reduce 3TB Google Ngrams in hours instead of days

Pros and Cons

  • "The most valuable feature is the Fault Tolerance and easy binding with other processes like Machine Learning, graph analytics."
  • "More ML based algorithms should be added to it, to make it algorithmic-rich for developers."

What other advice do I have?

This is a very good product for big data analytics, and it integrates well with other components like Machine Learning and graph analytics.
Architect at a healthcare company with 51-200 employees
Real User
Having everything in the same framework has helped us out a lot

What is most valuable?

ETL and streaming capabilities.

How has it helped my organization?

It made Big Data processing more convenient, and a uniform framework adds to efficiency of usage, since the same framework can be used for both batch and stream processing.

What needs improvement?

Stability in terms of API (things were difficult, when transitioning from RDD to DataFrames, then to DataSet).

For how long have I used the solution?

I have used Spark since its inception in March 2015, from Spark 1.1 onwards. Currently, I use 2.2 extensively.

What do I think about the stability of the solution?

Yes, occasionally with different APIs.

What do I think about the scalability of the solution?

No.

How are customer service and technical support?

Since we were using the Open Source…
Big Data Consultant at a tech services company with 501-1,000 employees
Consultant
We are able to solve problems, e.g., reporting on big data, that we were not able to tackle in the past.

What is most valuable?

Good performance, a nice graphical management console, and a long list of ML algorithms.

How has it helped my organization?

We are able to solve problems, e.g., reporting on big data, that we were not able to tackle in the past.

What needs improvement?

Apache Spark provides very good performance, but the tuning phase is still tricky.

For how long have I used the solution?

I've used it for 2 years.

What was my experience with deployment of the solution?

We didn't have an issue with the deployment.

What do I think about the stability of the solution?

In the past we deployed Spark 1.3 to use Spark SQL but unfortunately one of our queries failed because of a bug fixed in following releases. Then we moved to Spark 1.6 but still some queries were…
Chief System Architect at a marketing services firm with 501-1,000 employees
Vendor
Spark gives us the ability to run queries on our MySQL database without putting pressure on the database
Software Consultant at a tech services company with 10,001+ employees
Consultant
It provides large-scale data processing with negligible latency at the price of commodity hardware.

What other advice do I have?

My advice to others would be to use Apache Spark for large-scale data processing, as it provides good performance at low cost, unlike Ab-Initio or Informatica. But the main problem is that there are currently not many people in the market who are certified in Apache Spark.
Core Engine Engineer at a computer software company with 51-200 employees
Real User
It makes web-based queries for plotting data easier. It needs to be simpler to use the machine learning algorithms supported by Octave.

Valuable Features

  • RDDs
  • DataFrames
  • Machine learning libraries

Improvements to My Organization

Faster time to parse and compute data. It makes web-based queries for plotting data easier.

Room for Improvement

It needs to be simpler to use the machine learning algorithms supported by Octave (e.g., polynomial regression, polynomial interpolation).

Use of Solution

I've been using it for one year.

Deployment Issues

There have been no issues with the deployment.

Stability Issues

There have been no issues with the stability.

Scalability Issues

There have been no issues with the scalability.

Customer Service and Technical Support

We still rely on user forums for our answers. We do not use a commercial product yet.

Initial Setup

The initial set-up was easy. I have…
Systems Engineering Lead, Mid-Atlantic at a tech company with 10,001+ employees
Vendor
It allows you to construct event-driven information systems.

What other advice do I have?

I also suggest having a Chief Technologist who has extensive experience in architecting several Big Data solutions. They should be able to communicate in business as well as technology language. Their expertise should range from infrastructure to application development and have command of Hadoop technologies.
Lead Big Data Engineer at a non-profit with 51-200 employees
Vendor
I use it to process large amounts of data in the energy industry.

What other advice do I have?

Get to know how Spark works and what jobs, stages, tasks, and DAGs are; it will help you write Spark applications.
Engineer at a tech vendor with 10,001+ employees
Real User
Spark provides lots of high-level APIs, which reduces duplication of work.

Valuable Features

Streaming data processing

Improvements to My Organization

Previously, we used Storm to handle real-time data, but its performance didn't meet our requirements. Spark Streaming's micro-batch mode helps improve performance. Also, Spark provides lots of high-level APIs, which reduce duplication of work.

Room for Improvement

Better monitoring ability, especially monitoring integration with custom code.

Use of Solution

I've used it for one year.

Stability Issues

We met some standalone deployment issues, which showed that its stability is not that good, so we plan to switch to YARN or Mesos mode.

Customer Service and Technical Support

I have to say it is bad. I can only ask for help in the Google group. However, it is run in the…
CEO at a tech consulting company with 51-200 employees
Consultant
It's enabled interactive self-service access to data​.

What is most valuable?

There are several valuable features:

  • Interactive data access (low latency)
  • Batch ETL-style processing
  • Schema-free data models
  • Algorithms

How has it helped my organization?

We have 1000x improvement in performance over other techniques. It's enabled interactive self-service access to data.

What needs improvement?

Better integration of BI tools would be a much appreciated improvement.

For how long have I used the solution?

I've used it for about 14 months.

What was my experience with deployment of the solution?

I haven't had any issues with deployment.

What do I think about the stability of the solution?

It's been stable for us.

What do I think about the scalability of the solution?

It's scaled without issue.

How are customer service and technical support?

Data Scientist at a tech vendor with 10,001+ employees
Vendor
It allows the loading and investigation of very large data sets, has MLlib for machine learning, Spark Streaming, and both the new and old DataFrame APIs.

What other advice do I have?

Learn Scala as this will greatly reduce the pain in starting off with Spark.
Software Developer (Product Engineering) at a computer software company with 501-1,000 employees
Vendor
We have been using Spark to do a lot of batch and stream processing of inbound data from Apache Kafka. Scaling Spark on YARN is still an issue but we are getting acceptable performance.

What other advice do I have?

Have Scala developers at hand. Base Java competency will not be enough during optimization rounds.