Apache Spark Overview
What is Apache Spark?
Spark provides programmers with an application programming interface centered on a data structure called the resilient distributed dataset (RDD), a read-only multiset of data items distributed over a cluster of machines and maintained in a fault-tolerant way. It was developed in response to limitations in the MapReduce cluster computing paradigm, which forces a particular linear dataflow structure on distributed programs: MapReduce programs read input data from disk, map a function across the data, reduce the results of the map, and store the reduction results on disk. Spark's RDDs function as a working set for distributed programs that offers a (deliberately) restricted form of distributed shared memory.
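The difference between the linear map-then-reduce dataflow and a reusable in-memory working set can be sketched in a few lines. The following is a toy, single-process illustration in plain Python. It is not Spark's API, and the input lines and the helper names `map_phase` and `reduce_phase` are hypothetical, chosen only for this example.

```python
# Toy illustration only: plain Python, one process, no cluster, no Spark API.
# Hypothetical input; on a real cluster these records would be partitioned
# across machines.
lines = [
    "spark keeps data in memory",
    "mapreduce writes data to disk",
    "spark reuses data",
]


def map_phase(records):
    """Map step: emit a (word, 1) pair for every word in every record."""
    return [(word, 1) for record in records for word in record.split()]


def reduce_phase(pairs):
    """Reduce step: sum the counts for each distinct word."""
    counts = {}
    for word, n in pairs:
        counts[word] = counts.get(word, 0) + n
    return counts


# MapReduce-style linear dataflow: read -> map -> reduce -> store.
# A second query would have to start again from the input.
word_counts = reduce_phase(map_phase(lines))

# Working-set style: compute the intermediate pairs once, keep them in
# memory, and answer several queries from the same data.
working_set = map_phase(lines)
total_words = sum(n for _, n in working_set)
distinct_words = len({word for word, _ in working_set})

print(word_counts["data"], total_words, distinct_words)
```

In this sketch `working_set` plays the role the description assigns to an RDD: an intermediate collection that is not modified after creation and can serve more than one computation. The distribution and fault tolerance that define a real RDD are not modelled here.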
Apache Spark Buyer's Guide
Download the Apache Spark Buyer's Guide including reviews and more. Updated: May 2021
Apache Spark Customers
NASA JPL, UC Berkeley AMPLab, Amazon, eBay, Yahoo!, UC Santa Cruz, TripAdvisor, Taboola, Agile Lab, Art.com, Baidu, Alibaba Taobao, EURECOM, Hitachi Solutions
Portfolio Manager, Enterprise Solutions Architect at Capgemini
Apr 11, 2019
Supports streaming and micro-batch
What is our primary use case? Streaming telematics data.
How has it helped my organization? It's a better MR, supports streaming and micro-batch, and supports Spark ML and Spark SQL.
What is most valuable? It supports streaming and micro-batch.
What needs improvement? Better data lineage support.
Director - Data Management, Governance and Quality at Hilton
Mar 19, 2019
Powerful language but complicated coding
What is our primary use case? Ingesting billions of rows of data all day.
How has it helped my organization? Spark on AWS is not that cost-effective as memory is expensive and you cannot customize hardware in AWS. If you want more memory, you have to pay for more CPUs too in AWS.
What is most valuable? Powerful language.
What needs improvement? It is like going back to the '80s for the complicated coding that is required to write efficient programs.
Jul 11, 2018
Features include machine learning, real-time streaming, and data processing. It doesn't enable Spark job scheduling with monitoring capability.
What is our primary use case? Used for building big data platforms for processing huge volumes of data. Additionally, streaming data is critical.
How has it helped my organization? It provides a scalable machine learning library so that we can train and predict user behavior for promotion purposes.
What is most valuable? Machine learning, real-time streaming, and data processing are fantastic, as well as the resilient or fault-tolerant feature.
What needs improvement? I would suggest that it support more programming languages and provide an internal scheduler to schedule Spark jobs with monitoring capability.
For how long have I used the solution? Trial/evaluations only.
Manager | Data Science Enthusiast | Management Consultant at a consultancy with 5,001-10,000 employees
Dec 10, 2017
We can now harness richer data sets and benefit from use cases
How has it helped my organization? Organisations can now harness richer data sets and benefit from use cases, which add value to their business functions.
What is most valuable? Distributed in-memory processing. Some of the algorithms are resource heavy, and executing them requires a lot of RAM and CPU. With Hadoop-related technologies, we can distribute the workload across multiple commodity machines.
What needs improvement? Include more machine learning algorithms and the ability to handle streaming of data versus micro-batch processing.
For how long have I used the solution? Three to five years.
What do I think about the stability of the solution? At times when users do not know how to use Spark and request a lot of resources, then the underlying JVMs can crash, which is a…
Big Data and Cloud Solution Consultant at a financial services firm with 10,001+ employees
Oct 2, 2017
Provides flexibility for application creation with less coding effort
What is most valuable? DataFrame: Spark SQL gives the leverage to create applications more easily and with less coding effort.
How has it helped my organization? We developed a tool for data ingestion from HDFS->Raw->L1 layer with data quality checks, putting data into Elasticsearch, and performing CDC.
What needs improvement? Dynamic DataFrame options are not yet available.
For how long have I used the solution? One and a half years.
What do I think about the stability of the solution? No.
What do I think about the scalability of the solution? No.
What other advice do I have? Spark gives the flexibility for developing custom applications.
Sr. Software Engineer at a tech vendor with 1-10 employees
Oct 1, 2017
Helped us reduce 3TB Google Ngrams in hours instead of days
Pros and Cons
- "The most valuable feature is the Fault Tolerance and easy binding with other processes like Machine Learning, graph analytics."
- "More ML based algorithms should be added to it, to make it algorithmic-rich for developers."
What other advice do I have? This is a very good product for big data analytics, and it integrates well with other parts like machine learning and graph analytics.
Architect at a healthcare company with 51-200 employees
Sep 27, 2017
Having everything in the same framework has helped us out a lot
What is most valuable? ETL and streaming capabilities.
How has it helped my organization? It made big data processing more convenient, and a uniform framework adds to efficiency of usage since the same framework can be used for batch and stream processing.
What needs improvement? Stability in terms of API (things were difficult when transitioning from RDD to DataFrames, then to Dataset).
For how long have I used the solution? I have used Spark since its inception in March 2015, from Spark 1.1 onwards. Currently, I use 2.2 extensively.
What do I think about the stability of the solution? Yes, occasionally with different APIs.
What do I think about the scalability of the solution? No.
How are customer service and technical support? Since we were using the Open Source…
Big Data Consultant at a tech services company with 501-1,000 employees
Aug 25, 2017
We are able to solve problems, e.g., reporting on big data, that we were not able to tackle in the past.
What is most valuable? The good performance. The nice graphical management console. The long list of ML algorithms.
How has it helped my organization? We are able to solve problems, e.g., reporting on big data, that we were not able to tackle in the past.
What needs improvement? Apache Spark provides very good performance, but the tuning phase is still tricky.
For how long have I used the solution? I've used it for 2 years.
What was my experience with deployment of the solution? We didn't have an issue with the deployment.
What do I think about the stability of the solution? In the past we deployed Spark 1.3 to use Spark SQL, but unfortunately one of our queries failed because of a bug fixed in following releases. Then we moved to Spark 1.6 but still some queries were…
Software Consultant at a tech services company with 10,001+ employees
Mar 27, 2016
It provides large-scale data processing with negligible latency at the cost of commodity hardware.
What other advice do I have? My advice to others would be just to use Apache Spark for large-scale data processing, as it provides good performance at low cost, unlike Ab-Initio or Informatica. But the main problem is that there are currently not many people in the market certified in Apache Spark.
Core Engine Engineer at a computer software company with 51-200 employees
Jan 21, 2016
It makes web-based queries for plotting data easier. It needs to be simpler to use the machine learning algorithms supported by Octave.
Valuable Features: RDDs, DataFrames, machine learning libraries.
Improvements to My Organization: Faster time to parse and compute data. It makes web-based queries for plotting data easier.
Room for Improvement: It needs to be simpler to use the machine learning algorithms supported by Octave (for example, polynomial regression and polynomial interpolation).
Use of Solution: I've been using it for one year.
Deployment Issues: There have been no issues with the deployment.
Stability Issues: There have been no issues with the stability.
Scalability Issues: There have been no issues with the scalability.
Customer Service and Technical Support: We still rely on user forums for answers. We do not use a commercial product yet.
Initial Setup: The initial set-up was easy. I have…
Systems Engineering Lead, Mid-Atlantic at a tech company with 10,001+ employees
Jan 21, 2016
It allows you to construct event-driven information systems.
What other advice do I have? I also suggest having a Chief Technologist who has extensive experience in architecting several Big Data solutions. They should be able to communicate in business as well as technology language. Their expertise should range from infrastructure to application development, and they should have command of Hadoop technologies.
Lead Big Data Engineer at a non-profit with 51-200 employees
Jan 20, 2016
I use it to process large amounts of data in the energy industry.
What other advice do I have? Get to know how Spark works (jobs, stages, tasks, the DAG, etc.); it will help you write Spark applications.
Jan 18, 2016
Spark provides lots of high-level APIs, which reduces duplication of work.
Valuable Features: Streaming data processing.
Improvements to My Organization: In the previous version, we used Storm to handle real-time data, but its performance didn't meet the requirement. Spark Streaming's micro-batch mode helps improve performance. Also, Spark provides lots of high-level APIs, which reduces duplication of work.
Room for Improvement: Better monitoring ability, especially monitoring integration with customer code.
Use of Solution: I've used it for one year.
Stability Issues: We met some standalone deployment issues, which showed that its stability is not that good. So we plan to switch to YARN or Mesos mode.
Customer Service and Technical Support: I have to say it is bad. I can only ask for help in the Google group. However, it is run in the…
CEO at a tech consulting company with 51-200 employees
Jan 17, 2016
It's enabled interactive self-service access to data.
What is most valuable? There are several valuable features: interactive data access (low latency), batch ETL-style processing, schema-free data models, and algorithms.
How has it helped my organization? We have 1000x improvement in performance over other techniques. It's enabled interactive self-service access to data.
What needs improvement? Better integration of BI tools would be a much appreciated improvement.
For how long have I used the solution? I've used it for about 14 months.
What was my experience with deployment of the solution? I haven't had any issues with deployment.
What do I think about the stability of the solution? It's been stable for us.
What do I think about the scalability of the solution? It's scaled without issue.
How are customer service and…
Data Scientist at a tech vendor with 10,001+ employees
Jan 17, 2016
It allows the loading and investigation of very large data sets, has MLlib for machine learning, Spark Streaming, and both the new and old DataFrame API.
What other advice do I have? Learn Scala as this will greatly reduce the pain in starting off with Spark.
Software Developer (Product Engineering) at a computer software company with 501-1,000 employees
Jan 13, 2016
We have been using Spark to do a lot of batch and stream processing of inbound data from Apache Kafka. Scaling Spark on YARN is still an issue but we are getting acceptable performance.
What other advice do I have? Have Scala developers at hand. Base Java competency will not be enough during optimization rounds.