Apache Spark Overview

Apache Spark is the #1 ranked solution in our list of top Hadoop tools. It is most often compared to Spring Boot (see: Apache Spark vs Spring Boot).

What is Apache Spark?

Spark provides programmers with an application programming interface centered on a data structure called the resilient distributed dataset (RDD), a read-only multiset of data items distributed over a cluster of machines that is maintained in a fault-tolerant way. It was developed in response to limitations in the MapReduce cluster computing paradigm, which forces a particular linear dataflow structure on distributed programs: MapReduce programs read input data from disk, map a function across the data, reduce the results of the map, and store the reduction results on disk. Spark's RDDs function as a working set for distributed programs that offers a (deliberately) restricted form of distributed shared memory.
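
As a rough illustration of the RDD programming model described above, here is a minimal Scala sketch (the input path and the word-count workload are placeholders): transformations such as map and reduceByKey only record lineage, and a lost partition can be recomputed from that lineage rather than re-read from disk.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddSketch {
  def main(args: Array[String]): Unit = {
    // Local master for illustration; a cluster master URL would be used in production.
    val conf = new SparkConf().setAppName("rdd-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Build an RDD from a text file (path is a placeholder).
    val lines = sc.textFile("hdfs:///data/events.txt")

    // Transformations are lazy: they only record lineage, nothing runs yet.
    val words  = lines.flatMap(_.split("\\s+"))
    val counts = words.map(w => (w, 1L)).reduceByKey(_ + _)

    // The action triggers execution; lost partitions are recomputed from lineage.
    counts.take(10).foreach(println)

    sc.stop()
  }
}
```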

Apache Spark Customers

NASA JPL, UC Berkeley AMPLab, Amazon, eBay, Yahoo!, UC Santa Cruz, TripAdvisor, Taboola, Agile Lab, Art.com, Baidu, Alibaba Taobao, EURECOM, Hitachi Solutions

Archived Apache Spark Reviews (more than two years old)

AD
Senior Consultant & Training at a tech services company with 51-200 employees
Consultant
Easy to use and is capable of processing large amounts of data

Pros and Cons

  • "The most valuable feature of this solution is its capacity for processing large amounts of data."
  • "When you first start using this solution, it is common to run into memory errors when you are dealing with large amounts of data."

What is our primary use case?

We use this solution for information gathering and processing. 

I use it myself when I am developing on my laptop.

I am currently using an on-premises deployment model. However, in a few weeks, I will be using the EMR version on the cloud.

What is most valuable?

The most valuable feature of this solution is its capacity for processing large amounts of data.

This solution makes it easy to do a lot of things. It's easy to read data, process it, save it, etc.

What needs improvement?

When you first start using this solution, it is common to run into memory errors when you are dealing with large amounts of data. Once you are experienced, it is easier and more stable.
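
As a rough sketch of the kind of tuning that helps with these out-of-memory situations, a job can be given more executor memory and more, smaller partitions so each task holds less data at once. The settings, paths, and values below are illustrative placeholders, not recommendations from the reviewer.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical settings; appropriate values depend on cluster size and data volume.
val spark = SparkSession.builder()
  .appName("memory-tuning-sketch")
  .config("spark.executor.memory", "8g")          // more heap per executor (normally passed at submit time)
  .config("spark.sql.shuffle.partitions", "400")  // smaller partitions per shuffle task
  .getOrCreate()

// Repartitioning a large input spreads it over more, smaller tasks.
val df = spark.read.parquet("hdfs:///data/large_table").repartition(400)
df.groupBy("key").count().write.parquet("hdfs:///data/counts")
```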

When you are trying to do something outside of the normal requirements in a typical project, it is difficult to find somebody with experience.

For how long have I used the solution?

I have been using this solution for between two and three years.

What do I think about the stability of the solution?

This solution is difficult for users who are just beginning, and they experience out-of-memory errors when dealing with large amounts of data.

How are customer service and technical support?

I have not been in contact with technical support. I find all of the answers that I need in the forums.

What other advice do I have?

The work that we are doing with this solution is quite common and is very easy to do.

My advice for anybody who is implementing this solution is to look at their needs and then look at the community. Normally, there are a lot of people who have already done what you need. So, even without experience, it is quite simple to do a lot of things.

I would rate this solution a nine out of ten.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
Karthikeyan R
Principal Architect at a financial services firm with 1,001-5,000 employees
Real User
Fast performance and has an easy initial setup

Pros and Cons

  • "I found the solution stable. We haven't had any problems with it."
  • "It needs a new interface and a better way to get some data. In terms of writing our scripts, some processes could be faster."

What is our primary use case?

We use the solution for analytics.

How has it helped my organization?

I'm not sure how it has improved my organization but I believe that it's a good product.

What is most valuable?

The fast performance is the most valuable aspect of the solution.

What needs improvement?

The search could be improved. Usually, we use other tools to search for specific things and then use Spark to get the details; if there were a way to search for small things directly, that would be better.

It needs a new interface and a better way to get some data. In terms of writing our scripts, some processes could be faster.

In the next release, if they can add more analytics, that would be useful. For example, for data, built data, if there was one port where you put the high one then you can pull any other close to you, and then maybe a log for the right script. 

For how long have I used the solution?

I've been using the solution for two years.

What do I think about the stability of the solution?

I found the solution stable. We haven't had any problems with it.

How are customer service and technical support?

Usually, we can fix any issues. If we have problems, we google a little bit to find the issue. 

Which solution did I use previously and why did I switch?

I was using some other systems and we moved to Spark later. We faced performance and other issues with the other solution.

How was the initial setup?

The initial setup was easy. We keep on getting data from different sources so we will keep on porting in little bits. It's not done in a single sitting, so I can't really say how long it takes.

What other advice do I have?

I would recommend the solution. I would rate it an eight or nine out of 10.

For some areas, I would give it a ten, but I cannot use some parts. If you are going to use it for a consumer, then I would recommend it and you should go ahead. It doesn't work for me, as I have different clients and different engagements.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
LC
Snr Security Engineer at a tech vendor with 201-500 employees
Real User
Provides security analytics and has good scalability

What is our primary use case?

We primarily use the solution for security analytics.

What is most valuable?

The scalability has been the most valuable aspect of the solution.

What needs improvement?

The management tools could use improvement. Some of the debugging tools need some work as well. They need to be more descriptive. 

For how long have I used the solution?

I've been using the solution for three years.

What do I think about the stability of the solution?

The 2.3 version is quite stable. All of our customers use it, there are around 100,000+ users, and it runs 24/7.

What do I think about the scalability of the solution?

The scalability is very good.

How are customer service and technical support?

You actually buy Cloudera along with it. You don't really get any support except when you need support.

Which solution did I use previously and why did I switch?

In previous companies, we used MySQL platform and solutions like ArcSight and Splunk. We switched for scalability. MySQL wasn't going to scale, and we don't use Splunk at this company.

How was the initial setup?

The initial setup was complex. It is a complex tool, and a lot depends on how you will use it. There is a lot to set up; you need to put a lot of scripts into it, nearly 60. When you set it up in the cloud, it takes about a day. If you set it up on-premises, on hardware, it takes about a week.

What other advice do I have?

I would rate this solution eight out of 10. 

Disclosure: I am a real user, and this review is based on my own experience and opinions.
RW
Portfolio Manager, Enterprise Solutions Architect at Capgemini
Real User
Supports streaming and micro-batch

What is our primary use case?

Streaming telematics data.

How has it helped my organization?

It's a better MapReduce: it supports streaming and micro-batch, and it supports Spark ML and Spark SQL.

What is most valuable?

It supports streaming and micro-batch.
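
As a minimal sketch of the micro-batch streaming the reviewer mentions, Spark's Structured Streaming API can process a telematics feed in fixed-interval micro-batches; the source path, schema, and trigger interval below are assumed for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("micro-batch-sketch").getOrCreate()

// Read a stream of JSON files as they arrive (path and schema are placeholders).
val events = spark.readStream
  .schema("device STRING, reading DOUBLE, ts TIMESTAMP")  // file streams need an explicit schema
  .json("s3a://example-bucket/telematics/")

// Aggregate per device; Spark processes the stream as a series of micro-batches.
val perDevice = events.groupBy("device").count()

perDevice.writeStream
  .outputMode("complete")
  .format("console")
  .trigger(Trigger.ProcessingTime("10 seconds"))  // micro-batch interval
  .start()
  .awaitTermination()
```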

What needs improvement?

Better data lineage support.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
SP
Director - Data Management, Governance and Quality at Hilton
Real User
Powerful language but complicated coding

What is our primary use case?

Ingesting billions of rows of data all day.

How has it helped my organization?

Spark on AWS is not that cost-effective as memory is expensive and you cannot customize hardware in AWS. If you want more memory, you have to pay for more CPUs too in AWS.

What is most valuable?

Powerful language.

What needs improvement?

It is like going back to the '80s for the complicated coding that is required to write efficient programs.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
reviewer894894
User
User
Features include machine learning, real-time streaming, and data processing. It doesn't enable Spark job scheduling with monitoring capability.

What is our primary use case?

Used for building big data platforms for processing huge volumes of data. Additionally, streaming data is critical.

How has it helped my organization?

It provides a scalable machine learning library so that we can train and predict user behavior for promotion purposes.
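
A minimal sketch of how such a model might be trained with Spark ML is shown below; the input path, column names (visits, purchases, days_since_last_order, responded, user_id), and the choice of logistic regression are illustrative assumptions, not details from the review.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("mllib-sketch").getOrCreate()

// Hypothetical training data: per-user activity features and a label
// indicating whether the user responded to a past promotion.
val users = spark.read.parquet("hdfs:///data/user_activity")

val assembler = new VectorAssembler()
  .setInputCols(Array("visits", "purchases", "days_since_last_order"))
  .setOutputCol("features")

val lr = new LogisticRegression().setLabelCol("responded").setFeaturesCol("features")

// Fit the pipeline on historical data, then score users.
val model = new Pipeline().setStages(Array(assembler, lr)).fit(users)
model.transform(users).select("user_id", "probability", "prediction").show(5)
```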

What is most valuable?

Machine learning, real time streaming, and data processing are fantastic, as well as the resilient or fault tolerant feature.

What needs improvement?

I would suggest that it support more programming languages, and also provide an internal scheduler to schedule Spark jobs with monitoring capability.

For how long have I used the solution?

Trial/evaluations only.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
it_user786777
Manager | Data Science Enthusiast | Management Consultant at a consultancy with 5,001-10,000 employees
Consultant
We can now harness richer data sets and benefit from use cases

How has it helped my organization?

Organisations can now harness richer data sets and benefit from use cases, which add value to their business functions.

What is most valuable?

Distributed in-memory processing. Some of the algorithms are resource-heavy, and executing them requires a lot of RAM and CPU. With Hadoop-related technologies, we can distribute the workload across multiple commodity machines.

What needs improvement?

Include more machine learning algorithms and the ability to handle streaming of data versus micro batch processing.

For how long have I used the solution?

Three to five years.

What do I think about the stability of the solution?

At times, when users do not know how to use Spark and request a lot of resources, the underlying JVMs can crash, which is a big worry.

What do I think about the scalability of the solution?

No issues.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
it_user746943
Big Data and Cloud Solution Consultant at a financial services firm with 10,001+ employees
Vendor
Provides flexibility for application creation with less coding effort

What is most valuable?

DataFrame: Spark SQL gives the leverage to create applications more easily and with less coding effort.
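
To illustrate the point about DataFrames and Spark SQL reducing coding effort, here is a minimal sketch in which the same aggregation is expressed once through the DataFrame API and once as SQL over a temporary view; the path and column names are placeholders.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("dataframe-sketch").getOrCreate()

// Load a table into a DataFrame (path is a placeholder).
val orders = spark.read.parquet("hdfs:///warehouse/orders")

// The same query can be written with the DataFrame API...
val byCustomer = orders.groupBy("customer_id").agg(sum("amount").as("total"))

// ...or as plain SQL over a temporary view.
orders.createOrReplaceTempView("orders")
val bySql = spark.sql("SELECT customer_id, SUM(amount) AS total FROM orders GROUP BY customer_id")

byCustomer.show(5)
```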

How has it helped my organization?

We developed a tool for data ingestion from the HDFS->Raw->L1 layer with data quality checks, putting data into Elasticsearch, and performing CDC.

What needs improvement?

Dynamic DataFrame options are not yet available.

For how long have I used the solution?

One and a half years.

What do I think about the stability of the solution?

No.

What do I think about the scalability of the solution?

No.

What other advice do I have?

Spark gives the flexibility for developing custom applications.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
it_user746673
Sr. Software Engineer at a tech vendor with 1-10 employees
Real User
Helped us reduce 3TB Google Ngrams in hours instead of days

Pros and Cons

  • "The most valuable feature is the Fault Tolerance and easy binding with other processes like Machine Learning, graph analytics."
  • "More ML based algorithms should be added to it, to make it algorithmic-rich for developers."

What is most valuable?

The most valuable feature is the Fault Tolerance and easy binding with other processes like Machine Learning, graph analytics. The community is growing and hence executing ML in a distributed fashion is quite good.

How has it helped my organization?

Previously we were using Hadoop MapReduce to reduce the Google Ngrams (3TB), which took us approximately five days on our cluster. After using Spark, we were able to accomplish this task within hours.
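
A minimal sketch of this kind of large-scale reduction is shown below; it assumes a tab-separated ngram/year/match_count layout and placeholder paths, so it is only an approximation of the actual job.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ngrams-sketch").getOrCreate()
import spark.implicits._

// Assumes lines of "ngram<TAB>year<TAB>match_count ..."; adjust to the real file layout.
val totals = spark.sparkContext
  .textFile("hdfs:///data/googlebooks-ngrams/*")
  .map(_.split("\t"))
  .filter(_.length >= 3)
  .map(f => (f(0), f(2).toLong))
  .reduceByKey(_ + _)            // total occurrences per ngram across all years

totals.toDF("ngram", "total").write.parquet("hdfs:///data/ngram_totals")
```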

What needs improvement?

This product is already improving as the community is developing it rapidly. More ML based algorithms should be added to it, to make it algorithmic-rich for developers.

For how long have I used the solution?

Two and a half years.

What do I think about the stability of the solution?

No, I did not encounter any problems with the stability. It is also quite backwards compatible.

What do I think about the scalability of the solution?

No, not as of now; it is quite scalable. Using simple scripts, you can add as many workers as you want.

What other advice do I have?

This is a very good product for big data analytics and integrates well with other components like machine learning and graph analytics.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
it_user326142
Architect at a healthcare company with 51-200 employees
Real User
Having everything in the same framework has helped us out a lot

What is most valuable?

ETL and streaming capabilities.

How has it helped my organization?

Made Big Data processing more convenient and a uniform framework adds to efficiency of usage since the same framework can be used for batch and stream processing.
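
As a small illustration of using the same framework for batch and stream processing, the sketch below defines one transformation and applies it to both a batch read and a streaming read; the paths, schema, and column names are assumptions for the example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("unified-sketch").getOrCreate()

// One transformation, defined once...
def enrich(df: DataFrame): DataFrame =
  df.filter(col("amount") > 0).withColumn("amount_usd", col("amount") * col("fx_rate"))

// ...applied to a batch source,
enrich(spark.read.parquet("hdfs:///data/transactions"))
  .write.parquet("hdfs:///data/transactions_clean")

// ...and to a streaming source with exactly the same code.
enrich(
  spark.readStream
    .schema("amount DOUBLE, fx_rate DOUBLE, ts TIMESTAMP")
    .parquet("hdfs:///data/incoming/")
).writeStream
  .format("parquet")
  .option("checkpointLocation", "hdfs:///chk/transactions")
  .option("path", "hdfs:///data/transactions_clean_stream")
  .start()
```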

What needs improvement?

Stability in terms of API (things were difficult, when transitioning from RDD to DataFrames, then to DataSet).

For how long have I used the solution?

I have used Spark since its inception in March 2015, from Spark 1.1 onwards.

Currently, I use 2.2 extensively.

What do I think about the stability of the solution?

Yes, occasionally with different APIs.

What do I think about the scalability of the solution?

No.

How are customer service and technical support?

Since we were using the open-source version of Apache Spark, without the Databricks support, we never used technical support from Databricks.

Which solution did I use previously and why did I switch?

Yes, we used Hive, Pig, and Storm. Having everything in the same framework has helped us out a lot.

Which other solutions did I evaluate?

Yes, we considered other big data products in the Big Data Ecosystem.

What other advice do I have?

Go for it.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
it_user372393
Big Data Consultant at a tech services company with 501-1,000 employees
Consultant
We are able to solve problems, e.g., reporting on big data, that we were not able to tackle in the past.

What is most valuable?

The good performance. The nice graphical management console. The long list of ML algorithms.

How has it helped my organization?

We are able to solve problems, e.g., reporting on big data, that we were not able to tackle in the past.

What needs improvement?

Apache Spark provides very good performance. The tuning phase is still tricky.

For how long have I used the solution?

I've used it for 2 years.

What was my experience with deployment of the solution?

We didn't have an issue with the deployment.

What do I think about the stability of the solution?

In the past we deployed Spark 1.3 to use Spark SQL, but unfortunately one of our queries failed because of a bug fixed in later releases. Then we moved to Spark 1.6, but some queries were still failing when run against huge datasets. Now we are using version 2.1: it is more stable, it delivers better performance, and the SQL/ML parts are richer than before.

What do I think about the scalability of the solution?

I've had no issues with the scalability.

How are customer service and technical support?

Customer Service:

I've never had to use customer service.

Technical Support:

I've never had to use technical support.

How was the initial setup?

The initial setup is quite complex because you have to set up many different configuration parameters that are deployment-specific. It is not trivial to arrive at the correct configuration with so many variables involved.
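
As a flavour of the deployment-specific parameters involved, the sketch below sets a few of the common ones programmatically; every value is a placeholder that would have to be sized for the actual cluster and workload.

```scala
import org.apache.spark.SparkConf

// Illustrative deployment-specific settings; none of these values are recommendations.
val conf = new SparkConf()
  .setAppName("reporting-job")
  .set("spark.executor.instances", "10")       // how many executors to request
  .set("spark.executor.cores", "4")            // cores per executor
  .set("spark.executor.memory", "8g")          // heap per executor
  .set("spark.sql.shuffle.partitions", "200")  // parallelism of shuffles
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
```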

What about the implementation team?

In-house team. The setup itself is not a problem when you just have to test the system. The challenging part is discovering the optimal configuration needed to obtain a production system that delivers good performance.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
it_user371832
Chief System Architect at a marketing services firm with 501-1,000 employees
Vendor
Spark gives us the ability to run queries on MySQL database without pressurising our database

What is most valuable?

With Spark SQL we now have the capability to analyse very large quantities of data located in S3 on Amazon at very low cost compared to the other solutions we checked.

We also use our own Spark cluster to aggregate data in near real time and save the results to a MySQL database.

We've started new projects using the machine learning library ML.
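
A minimal sketch of this pattern, reading raw data from S3 with Spark and writing an aggregate to MySQL over JDBC, is shown below; the bucket, columns, and connection details are placeholders.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("s3-to-mysql-sketch").getOrCreate()

// Query raw data sitting in S3 (bucket and columns are placeholders).
val events = spark.read.parquet("s3a://example-bucket/events/")
val hourly = events
  .groupBy(window(col("ts"), "1 hour").as("w"), col("campaign"))
  .agg(count(lit(1)).as("hits"))
  .select(col("w.start").as("hour"), col("campaign"), col("hits"))

// Save the aggregate to MySQL over JDBC (connection details are placeholders).
hourly.write
  .format("jdbc")
  .option("url", "jdbc:mysql://db.example.com:3306/analytics")
  .option("dbtable", "hourly_hits")
  .option("user", "report")
  .option("password", "secret")
  .mode("append")
  .save()
```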

How has it helped my organization?

Until Spark, we didn't have the ability to analyse this quantity of data; we're talking about two TB/hour. We're now able to produce a lot of reports, and are also able to develop machine-learning-based analysis to optimize our business.

We have central access to every piece of data in the company, including finance, business, debug data, etc., and the ability to join all this data together.

What needs improvement?

Spark is actually very good for batch analysis, much better than Hadoop; it's much simpler, much quicker, etc. However, it lacks the ability to perform real-time querying like Vertica or Redshift.

Also, it is more difficult for an end user to work with Spark than with a normal database, even compared with analytic databases like Vertica or Redshift.

For how long have I used the solution?

We have now been using Spark Streaming and Spark SQL for almost two years.

What was my experience with deployment of the solution?

We're working on AWS, so we need to have a managed environment. We chose to go with a solution based on Chef to deploy and configure the Spark clusters. Tip: if you don't have any DevOps, you can use the EC2 script (provided by the Spark distribution) to deploy a cluster on Amazon. We've tested it and it works perfectly.

What do I think about the stability of the solution?

Spark Streaming is difficult to stabilize, as you're always dependent on your stream flow. If you start to fall behind on the consumer, you have a serious problem. We encountered a lot of stability issues configuring it as expected.

What do I think about the scalability of the solution?

It's linked to stability; in our case, it takes time to evaluate the correct size of the cluster you need. It's very important to always add monitoring to your jobs so you can understand what the problem is. We use Datadog as our monitoring platform.

Which solution did I use previously and why did I switch?

Yes, to do this job we used a MySQL database. We switched because MySQL is not a scalable solution and we had reached its limits.

How was the initial setup?

Setting up a Spark cluster can be difficult; it depends on your clustering strategy. There are at least four options:

  • EC2 script: works only on Amazon AWS
  • Standalone: manual configuration (hard)
  • YARN: to leverage your already existing Hadoop environment
  • Mesos: to use alongside your other Mesos-ready applications

What about the implementation team?

We use Databricks for online ad hoc DB queries. It works on AWS as a managed service and handles cluster creation, configuration, and monitoring for you.

It gives a notebook-oriented user interface to query any data source using Spark: databases, Parquet, CSV, Avro, etc.

Which other solutions did I evaluate?

Yes, we started to evaluate analytics databases: Vertica, Exasol, and others. For all of them, the price was an issue given the quantity of data we want to manipulate.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
it_user365304
Software Consultant at a tech services company with 10,001+ employees
Consultant
It provides large-scale data processing with negligible latency at the cost of commodity hardware.

Valuable Features:

The most important feature of Apache Spark is that it provides large-scale data processing with negligible latency at the cost of commodity hardware. The Spark framework is a blessing compared to Hadoop, as the latter does not allow fast processing of data, which is accomplished by Spark's in-memory data processing.

Improvements to My Organization:

Apache Spark is a framework which allows an organization to perform business and data analytics at a very low cost compared to Ab Initio or Informatica. Thus, by using Apache Spark in place of those tools, an organization can achieve a huge reduction in cost without compromising on data security or other data-related issues, provided it is controlled by an expert Scala programmer. Apache Spark also does not bear Hadoop's overhead of high latency. My organization benefits from all of these points as well.

Room for Improvement:

The question of improvement always comes to developers' minds. As with the most common needs of developers, if a user-friendly GUI with a 'drag & drop' feature could be attached to this framework, it would be easier to use.

Another thing to mention: there is always room for improvement in terms of memory usage. If, in the future, it becomes possible to use less memory for processing, that would obviously be better.

Deployment Issues:

We've had no issues with deployment.

Stability Issues:

See above regarding memory usage.

Scalability Issues:

We've had no issues with scalability.

Other Advice:

My advice to others is to use Apache Spark for large-scale data processing, as it provides good performance at low cost, unlike Ab Initio or Informatica. The main problem is that there are not many people in the market certified in Apache Spark.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
it_user374028
Core Engine Engineer at a computer software company with 51-200 employees
Real User
It makes web-based queries for plotting data easier. It needs to be simpler to use the machine learning algorithms supported by Octave.

Valuable Features

  • RDDs
  • DataFrames
  • Machine learning libraries

Improvements to My Organization

Faster time to parse and compute data. It makes web-based queries for plotting data easier.

Room for Improvement

It needs to be simpler to use the machine learning algorithms supported by Octave (for example, polynomial regression and polynomial interpolation).
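
For reference, Spark's own ML library can approximate this kind of workflow by combining polynomial feature expansion with linear regression; the sketch below uses placeholder data and column names and is not taken from the review.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{PolynomialExpansion, VectorAssembler}
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("poly-regression-sketch").getOrCreate()

// Hypothetical data set with a single numeric feature "x" and a target "y".
val data = spark.read.parquet("hdfs:///data/samples")

val assemble = new VectorAssembler().setInputCols(Array("x")).setOutputCol("raw")
val expand   = new PolynomialExpansion().setInputCol("raw").setOutputCol("features").setDegree(3)
val lr       = new LinearRegression().setLabelCol("y").setFeaturesCol("features")

// Expanding the feature vector to degree 3 lets a linear model fit a cubic curve.
val model = new Pipeline().setStages(Array(assemble, expand, lr)).fit(data)
model.transform(data).select("x", "y", "prediction").show(5)
```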

Use of Solution

I've been using it for one year.

Deployment Issues

There have been no issues with the deployment.

Stability Issues

There have been no issues with the stability.

Scalability Issues

There have been no issues with the scalability.

Customer Service and Technical Support

We still rely on user forums for our answers. We do not use a commercial product yet.

Initial Setup

The initial set-up was easy. I have not explored using this on AWS clusters.

Implementation Team

We did an in-house implementation and development for our regression tool.

ROI

The ROI will be an in-house product to do machine learning analytics on data obtained from customers.

Other Solutions Considered

We did not evaluate any other products.

Other Advice

It's easy to use and has a learning curve.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
it_user374040
Systems Engineering Lead, Mid-Atlantic at a tech company with 10,001+ employees
Vendor
It allows you to construct event-driven information systems.

Valuable Features

Spark Streaming, which allows you to construct event-driven information systems and respond to the events in near-real time.

Improvements to My Organization

Apache Spark's ability to perform batch processing at intervals of one second or less is the most transformative and least pervasive change for any data processing application. The ingested data can also be validated and verified for quality early in the data pipeline.
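
A minimal sketch of a one-second batch interval with early validation, using the DStream API, is shown below; the socket source, host, and record format are placeholders.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// A one-second batch interval: each micro-batch of events is processed shortly after it arrives.
val conf = new SparkConf().setAppName("one-second-batches")
val ssc  = new StreamingContext(conf, Seconds(1))

// Hypothetical socket source; in practice this would typically be Kafka, Kinesis, etc.
val lines = ssc.socketTextStream("stream-host.example.com", 9999)

// Basic validation early in the pipeline: drop malformed records before further processing.
val valid = lines.filter(_.split(",").length == 5)
valid.count().print()

ssc.start()
ssc.awaitTermination()
```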

Room for Improvement

Apache Spark as a data processing engine has come a long way since its inception. Although you are able to perform complex transformations using Spark libraries, the support for SQL to perform transformations is still limited. You can alleviate some of these limitations by running Spark within the Hadoop ecosystem and by leveraging the fairly evolved HiveQL.

Use of Solution

I've used it for 16 months.

Deployment Issues

The enterprise scale deployment of Apache Spark is slightly involved to derive its full potential of stability, scalability and security. However, some Hadoop vendors like Cloudera have integrated Spark data processing engine into their Hadoop platforms and have made it easier to deploy, scale and secure.

Customer Service and Technical Support

This is an open source technology and is dependent on community support. The Apache Spark community is vibrant and it is easy to find answers to questions. The enterprises can also get commercial support from Hadoop vendors such as Cloudera. I recommend enterprises to inspect Hadoop vendors’ commitment to open source as well as their ability to curate Apache Spark technology into the rest of the ecosystem before signing up for a commercial support or subscription.

Initial Setup

The initial setup is straightforward as long as you have picked the right Hadoop distribution.

Implementation Team

I recommend engaging an experienced Hadoop vendor during the planning and initial implementation phases of the project. You will be able to avoid any potential pitfalls or reduce overall project time by having a Hadoop expert guiding you during the initial stages of the project.

Other Solutions Considered

I evaluated some other technologies such as Samza but community backing for Apache Spark stood out.

Other Advice

I also suggest having a Chief Technologist who has extensive experience in architecting several Big Data solutions. They should be able to communicate in business as well as technology language. Their expertise should range from infrastructure to application development and have command of Hadoop technologies.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
it_user373173
Lead Big Data Engineer at a non-profit with 51-200 employees
Vendor
​I use it to process large amount of data in the energy industry.

What is most valuable?

Spark is relatively easy to deploy, with rich features for handling big data. Spark Core, Spark SQL, and Spark MLlib are used most in our applications.

How has it helped my organization?

I use Spark to process large amounts of data in the energy industry.

What needs improvement?

A good tool to analyse Spark application performance. Right now, there are still many parameters to tune in order to get good performance from a Spark application; I would like to see auto-tuning of parameters.

For how long have I used the solution?

I've been using Spark for seven months.

What was my experience with deployment of the solution?

There were no issues with the deployment.

What do I think about the stability of the solution?

I ran into Spark application performance issues. For instance, Spark JDBC write performance needs to be improved.
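
As an illustration of the knobs that typically affect JDBC write throughput, the sketch below repartitions the output and raises the JDBC batch size; the connection details and values are placeholders, not a fix confirmed by the reviewer.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-write-sketch").getOrCreate()
val df = spark.read.parquet("hdfs:///data/results")

// Two knobs that commonly affect JDBC write throughput (values are illustrative only):
// the number of partitions writing in parallel, and the rows sent per insert batch.
df.repartition(8).write
  .format("jdbc")
  .option("url", "jdbc:mysql://db.example.com:3306/analytics")
  .option("dbtable", "results")
  .option("user", "etl")
  .option("password", "secret")
  .option("batchsize", "10000")   // rows per JDBC batch round trip
  .mode("append")
  .save()
```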

What do I think about the scalability of the solution?

There were no issues with the scalability.

How are customer service and technical support?

Customer Service:

I use Apache open source. Everything is on our own.

Technical Support:

I use Apache open source. Everything is on our own.

Which solution did I use previously and why did I switch?

I evaluated a Hadoop-based solution and chose Spark due to its fast processing and ease of use.

How was the initial setup?

The initial setup is not complex. The online documents are pretty good.

What about the implementation team?

I implemented it in-house.

What other advice do I have?

Get to know how Spark works (what jobs, stages, tasks, DAGs, etc. are), and it will help you write Spark applications.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
ITCS user
Engineer at a tech vendor with 10,001+ employees
Real User
Spark provides lots of high-level APIs, which reduces duplication of work.

Valuable Features

Streaming data processing

Improvements to My Organization

In the previous version, we used Storm to handle real-time data; however, its performance didn't meet the requirement. Spark Streaming's micro-batch mode helps improve performance. Also, Spark provides lots of high-level APIs, which reduces duplication of work.

Room for Improvement

Better monitoring ability, especially monitoring integration with customer code.

Use of Solution

I've used it for one year.

Stability Issues

We ran into some standalone deployment issues, which showed that its stability is not that good, so we plan to switch to YARN or Mesos mode.

Customer Service and Technical Support

I have to say it is bad. I can only ask for help in the Google group. However, it is run in a developer-for-developer style, and there are almost no people from Databricks. I also use the Cassandra Spark connector, and DataStax has at least one dedicated person to help the community.

Initial Setup

Not that straightforward in terms of standalone deployment; there are some tricks which are not mentioned in the docs.

Implementation Team

We did it in-house.

Pricing, Setup Cost and Licensing

So far, we have no plan to switch to a commercial license.

Other Advice

I love Spark over other solutions.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
it_user371334
CEO at a tech consulting company with 51-200 employees
Consultant
It's enabled interactive self-service access to data​.

What is most valuable?

There are several valuable features.

  • Interactive data access (low latency)
  • Batch ETL-style processing
  • Schema-free data models
  • Algorithms

How has it helped my organization?

We have 1000x improvement in performance over other techniques. It's enabled interactive self-service access to data.

What needs improvement?

Better integration with BI tools would be a much-appreciated improvement.

For how long have I used the solution?

I've used it for about 14 months.

What was my experience with deployment of the solution?

I haven't had any issues with deployment.

What do I think about the stability of the solution?

It's been stable for us.

What do I think about the scalability of the solution?

It's scaled without issue.

How are customer service and technical support?

Customer Service:

Customer service is excellent.

Technical Support:

Technical support is excellent.

Which solution did I use previously and why did I switch?

Yes, we previously used Oracle, from which we ported our data.

How was the initial setup?

The initial setup was simple.

What about the implementation team?

We implemented it with our in-house team.

What other advice do I have?

Be sure to use the Apache versions and avoid vendor-specific extensions.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
it_user371325
Data Scientist at a tech vendor with 10,001+ employees
Vendor
It allows the loading and investigation of very large data sets, has MLlib for machine learning, Spark Streaming, and both the new and old DataFrame APIs.

What is most valuable?

It allows the loading and investigation of very large data sets, has MLlib for machine learning, Spark Streaming, and both the new and old DataFrame APIs.

How has it helped my organization?

We're able to perform data discovery on large datasets without too much difficulty.

What needs improvement?

It needs better documentation as well as examples for all the Spark libraries. That would be very helpful in maximizing its capabilities and results.

For how long have I used the solution?

I've used it for over nine months now.

What was my experience with deployment of the solution?

I haven't encountered any issues with deployment.

What do I think about the stability of the solution?

There have been no stability issues.

What do I think about the scalability of the solution?

I haven't had any scalability issues. It scales better than Python and R.

How are customer service and technical support?

Customer Service:

I haven't had to use customer service.

Technical Support:

I haven't had to use technical support.

Which solution did I use previously and why did I switch?

I previously used Python and R, but neither of these scaled particularly well.

How was the initial setup?

The initial setup was complex. It was not easy getting the correct version and dependencies set up.

What about the implementation team?

I implemented it in-house on my own!

What was our ROI?

It's open-source, so ROI is inapplicable.

What other advice do I have?

Learn Scala as this will greatly reduce the pain in starting off with Spark.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
it_user365301
Software Developer (Product Engineering) at a computer software company with 501-1,000 employees
Vendor
We have been using Spark to do a lot of batch and stream processing of inbound data from Apache Kafka. Scaling Spark on YARN is still an issue but we are getting acceptable performance.

Valuable Features:

Spark Streaming, Spark SQL, and MLlib, in that order.

Improvements to My Organization:

We have been using Spark to do a lot of batch and stream processing of inbound data from Apache Kafka. Scaling Spark on YARN is still an issue but we are getting acceptable performance.

Room for Improvement:

As I said, scalability is still an issue, and so is stability. Spark on YARN still doesn't seem to have a programmatic submission API, so we have to rely on the spark-submit script to run jobs on YARN. The Scala and Java APIs have performance differences, which will sometimes require coding in Scala.

Other Advice:

Have Scala developers at hand. Base Java competency will not be enough during optimization rounds.

Disclosure: I am a real user, and this review is based on my own experience and opinions.