
Apache Spark Overview

Apache Spark is the #1 ranked solution in our list of top Hadoop tools. It is most often compared to Spring Boot (see Apache Spark vs Spring Boot).

What is Apache Spark?

Spark provides programmers with an application programming interface centered on a data structure called the resilient distributed dataset (RDD), a read-only multiset of data items distributed over a cluster of machines that is maintained in a fault-tolerant way. It was developed in response to limitations in the MapReduce cluster-computing paradigm, which forces a particular linear dataflow structure on distributed programs: MapReduce programs read input data from disk, map a function across the data, reduce the results of the map, and store the reduction results on disk. Spark's RDDs function as a working set for distributed programs that offers a (deliberately) restricted form of distributed shared memory.
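
A minimal PySpark sketch of the RDD model described above (the master URL, data, and functions are illustrative, not taken from this page):

```python
# Build an RDD, transform it, and reduce it; intermediate results stay in
# memory between stages rather than being written back to disk as in MapReduce.
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-sketch")

numbers = sc.parallelize(range(1, 1001))      # distributed, read-only dataset
squares = numbers.map(lambda x: x * x)        # transformation (lazy)
total = squares.reduce(lambda a, b: a + b)    # action (triggers execution)

print(total)
sc.stop()
```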

Apache Spark Customers

NASA JPL, UC Berkeley AMPLab, Amazon, eBay, Yahoo!, UC Santa Cruz, TripAdvisor, Taboola, Agile Lab, Art.com, Baidu, Alibaba Taobao, EURECOM, Hitachi Solutions

Pricing Advice

What users are saying about Apache Spark pricing:
  • "Apache Spark is open-source. You have to pay only when you use any bundled product, such as Cloudera."

Apache Spark Reviews

SA
Technical Consultant at a tech services company with 1-10 employees
Consultant
Top 20
Good streaming features enable data ingestion and analysis within Spark Streaming

Pros and Cons

  • "I feel the streaming is its best feature."
  • "When you want to extract data from your HDFS and other sources then it is kind of tricky because you have to connect with those sources."

What is our primary use case?

We are working with a client that has a wide variety of data residing in other structured databases, as well. The idea is to make a database in Hadoop first, which we are in the process of building right now. One place for all kinds of data. Then we are going to use Spark.

What is most valuable?

I have worked with Hadoop a lot in my career, and you need to do a lot of things just to get to Hello World. But in Spark it is easy. You could say it's an umbrella that lets you do everything under one roof. It also has Spark Streaming. I feel the streaming is its best feature because I have been able to ingest data and run analysis within Spark Streaming.
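
For readers unfamiliar with Spark Streaming, a rough sketch of what a streaming job looks like is below; this is the standard word-count style example, not the reviewer's pipeline, and the host and port are placeholders:

```python
# Structured Streaming sketch: read lines from a socket, compute running word
# counts per micro-batch, and print each batch to the console.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

lines = (
    spark.readStream.format("socket")
    .option("host", "localhost")
    .option("port", 9999)
    .load()
)

words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```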

What needs improvement?

I think for IT people it is good. The whole idea is that Spark works pretty easily, but a lot of people, including me, struggle to set things up properly. If you want to connect Spark with Hadoop, it's not a big thing, but for other things, such as using Sqoop with Spark, you need to do the configuration by hand. I wish there were a solution that does all these configurations, like in Windows where you have the whole solution and it handles the back-end. That kind of solution would help. But still, it can do everything a data scientist needs.

Spark's main objective is to manipulate and calculate. It is playing with the data. So it has to keep doing what it does best and let the visualization tool do what it does best.

Overall, it offers everything that I can imagine right now. 

For how long have I used the solution?

I have been using Apache Spark for a couple of months.

What do I think about the stability of the solution?

In terms of stability, I have not seen any bugs, glitches or crashes. Even if there is, that's fine, because I would probably take care of it and then I'd have progressed further in the process.

What do I think about the scalability of the solution?

I have not tested the scalability yet.

In my company, there are two or three people that are using it for different products. But right now, the client I'm engaged with doesn't know anything about Spark or Hadoop. They are a typical financial company so they do what they do, and they ask us to do everything. They have pretty much outsourced their whole big data initiative to us.

Which solution did I use previously and why did I switch?

I have used MapReduce from Hadoop previously. Otherwise, I haven't used any other big data infrastructure.

In my previous work, not at this company, I was working with some big data, but I was extracting it using a single core of my PC. I realized over time that my system had eight cores, so instead I used all of those cores for multi-core programming. Then I realized that Hadoop and Spark do the same thing, only across different PCs. That was when I was using multi-core programming, and that's the point at which Spark needs to go and search Hadoop and other things.

How was the initial setup?

The initial setup to get it to Hello World is pretty easy, you just have to install it. But when you want to extract data from your HDFS and other sources then it is kind of tricky because you have to connect with those sources. But you can get a lot of help from different sources on the internet. So it's great. A lot of people are doing it.
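
For illustration, connecting to HDFS from Spark usually looks like the sketch below; the namenode host, port, and file path are placeholders that depend on the cluster:

```python
# Read a CSV file straight from HDFS; Parquet, JSON, ORC, and text sources
# accept hdfs:// paths in the same way.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-read-sketch").getOrCreate()

df = (
    spark.read.option("header", "true")
    .csv("hdfs://namenode-host:8020/data/raw/events.csv")  # placeholder path
)

df.printSchema()
df.show(5)
```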

I work with a startup company. You know that in startups you do not have the luxury of different people doing different things, you have to do everything on your own, and it's an opportunity to learn everything. In a typical corporate or big organization you only have restricted SOPs, you have to work within the boundaries. In my organization, I have to set up all the things, configure it, and work on it myself.

What's my experience with pricing, setup cost, and licensing?

I would suggest not to try to do everything at once. Identify the area where you want to solve the problem, start small and expand it incrementally, slowly expand your vision. For example, if I have a problem where I need to do streaming, just focus on the streaming and not on the machine learning that Spark offers. It offers a lot of things but you need to focus on one thing so that you can learn. That is what I have learned from the little experience I have with Spark. You need to focus on your objective and let the tools help you rather than the tools drive the work. That is my advice.

What other advice do I have?

On a scale of 1 to 10, I'd put it at an eight.

To make it a perfect 10, I'd like to see an improved configuration bot. Sometimes it is a nightmare on Linux trying to figure out what happened with the configuration and the back-end. So I think installation and configuration alongside other tools could be improved. We are technical people and can figure it out, but if aspects like that were improved, then less technical people would use it and it would be more adaptable for the end-user.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
Kürşat Kurt
Software Architect at Akbank
Real User
Top 10 · Leaderboard
Provides fast aggregations, AI libraries, and a lot of connectors

Pros and Cons

  • "AI libraries are the most valuable. They provide extensibility and usability. Spark has a lot of connectors, which is a very important and useful feature for AI. You need to connect a lot of points for AI, and you have to get data from those systems. Connectors are very wide in Spark. With a Spark cluster, you can get fast results, especially for AI."
  • "Stream processing needs to be developed more in Spark. I have used Flink previously. Flink is better than Spark at stream processing."

What is our primary use case?

We just finished a central front project called MFY for our in-house fraud team. In this project, we are using Spark along with Cloudera. In front of Spark, we are using Couchbase. 

Spark is mainly used for aggregations and AI (for future usage). It gathers stuff from Couchbase and does the calculations. We are not actively using Spark AI libraries at this time, but we are going to use them.  

This project is for classifying the transactions and finding suspicious activities, especially those suspicious activities that come from internet channels such as internet banking and mobile banking. It tries to find out suspicious activities and executes rules that are being developed or written by our business team. An example of a rule is that if the transaction count or transaction amount is greater than 10 million Turkish Liras and the user device is new, then raise an exception. The system sends an SMS to the user, and the user can choose to continue or not continue with the transaction.
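
As an illustration only, a rule like the one quoted above could be expressed with Spark's DataFrame API roughly as follows; the column names and sample rows are hypothetical, and the review does not describe the team's actual rule engine:

```python
# Flag transactions over 10 million TRY made from a new device.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("fraud-rule-sketch").getOrCreate()

transactions = spark.createDataFrame(
    [("t1", 12_000_000.0, True), ("t2", 500.0, False)],   # hypothetical rows
    ["tx_id", "amount_try", "is_new_device"],
)

suspicious = transactions.filter(
    (F.col("amount_try") > 10_000_000) & F.col("is_new_device")
)
suspicious.show()   # flagged rows would trigger the SMS confirmation step
```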

How has it helped my organization?

Aggregations are very fast in our project since we started to use Spark. We can get results in around 300 milliseconds. Before using Spark, the time was around 700 milliseconds.
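
For context, a generic Spark aggregation looks roughly like the sketch below (this is not the project's actual query); the in-memory execution engine is what keeps repeated aggregations like this fast:

```python
# Count and sum transactions per channel in a single pass.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("agg-sketch").getOrCreate()

tx = spark.createDataFrame(
    [("mobile", 120.0), ("internet", 80.0), ("mobile", 300.0)],  # sample rows
    ["channel", "amount"],
)

summary = tx.groupBy("channel").agg(
    F.count("*").alias("tx_count"),
    F.sum("amount").alias("total_amount"),
)
summary.show()
```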

Before using Spark, we only used Couchbase. We needed fast results for this project because transactions come from various channels, and we need to decide and resolve them at the earliest because users are performing the transaction. If our result or process takes longer, users might stop or cancel their transactions, which means losing money. Therefore, fast results time is very important for us.

What is most valuable?

AI libraries are the most valuable. They provide extensibility and usability. Spark has a lot of connectors, which is a very important and useful feature for AI. You need to connect a lot of points for AI, and you have to get data from those systems. Connectors are very wide in Spark. With a Spark cluster, you can get fast results, especially for AI. 

What needs improvement?

Stream processing needs to be developed more in Spark. I have used Flink previously. Flink is better than Spark at stream processing.

For how long have I used the solution?

I am a Java developer. I have been interested in Spark for around five years. We have been actively using it in our organization for almost a year.

What do I think about the stability of the solution?

It is the most stable platform. Compared to Flink, Spark is good, especially in terms of clusters and architecture. My colleagues who set up these clusters say that Spark is the easiest.

What do I think about the scalability of the solution?

It is scalable, but we don't have the need to scale it. 

It is mainly used for reporting big data in our organization. All teams, especially the VR team, are using Spark for job execution and remote execution. I can say that 70% of users use Spark for reporting, calculations, and real-time operations. We are a very big company, and we have around a thousand people in IT.

We will continue its usage and develop more. We have kind of just started using it. We finished this project just three months ago. We are now trying to find out bottlenecks in our systems, and then we are ready to go.

How are customer service and technical support?

We have not used Apache support. We have only used Cloudera support for this project, and they helped us a lot during the development cycle of this project. 

How was the initial setup?

I don't have any idea about it. We are a big company, and we have another group for setting up Spark.

What other advice do I have?

I would advise planning well before implementing this solution. In enterprise corporations like ours, there are a lot of policies. You should first find out your needs, and after that, you or your team should set it up based on your needs. If your needs change during development because of the business requirements, it will be very difficult. 

If you are clear about your needs, it is easier to set it up. If you know how Spark is used in your project, you have to define firewall rules and cluster needs. When you set up Spark, it should be ready for people's usage, especially for remote job execution. 

I would rate Apache Spark a nine out of ten.

Which deployment model are you using for this solution?

On-premises
Disclosure: I am a real user, and this review is based on my own experience and opinions.
RV
Director at Nihil Solutions
Real User
Top 5 · Leaderboard
Stable and easy to set up with a very good memory processing engine

Pros and Cons

  • "The memory processing engine is the solution's most valuable aspect. It processes everything extremely fast, and it's in the cluster itself. It acts as a memory engine and is very effective in processing data correctly."
  • "The graphical user interface (UI) could be a bit more clear. It's very hard to figure out the execution logs and understand how long it takes to send everything. If an execution is lost, it's not so easy to understand why or where it went. I have to manually drill down on the data processes which takes a lot of time. Maybe there could be like a metrics monitor, or maybe the whole log analysis could be improved to make it easier to understand and navigate."

What is our primary use case?

When we receive data from the messaging queue, we process everything using Apache Spark. Databricks does the processing and sends everything back to the data lake. The machine learning program then does some analysis using the ML prediction algorithm.
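
A hedged sketch of that general pattern is below, assuming Kafka as the messaging queue (the review does not name it) with placeholder broker, topic, and data-lake paths; it also requires the spark-sql-kafka connector package on the classpath:

```python
# Stream raw events from a queue and land them in the data lake as Parquet,
# where downstream ML jobs can pick them up.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("queue-to-lake-sketch").getOrCreate()

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
    .option("subscribe", "events-topic")                # placeholder topic
    .load()
)

query = (
    events.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
    .writeStream.format("parquet")
    .option("path", "/datalake/raw/events")                    # placeholder path
    .option("checkpointLocation", "/datalake/checkpoints/events")
    .start()
)
query.awaitTermination()
```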

What is most valuable?

The memory processing engine is the solution's most valuable aspect. It processes everything extremely fast, and it's in the cluster itself. It acts as a memory engine and is very effective in processing data correctly.

What needs improvement?

There are lots of items coming down the pipeline in the future. I don't know what features are missing. From my point of view, everything looks good.

The graphical user interface (UI) could be a bit more clear. It's very hard to figure out the execution logs and understand how long it takes to send everything. If an execution is lost, it's not so easy to understand why or where it went. I have to manually drill down on the data processes which takes a lot of time. Maybe there could be like a metrics monitor, or maybe the whole log analysis could be improved to make it easier to understand and navigate.

There should be more information shared to the user. The solution already has all the information tracked in the cluster. It just needs to be accessible or searchable.

For how long have I used the solution?

I started using the solution about four years ago. However, it's been on and off since then. I would estimate in total I have about a year and a half of experience using the solution.

What do I think about the stability of the solution?

The stability of the solution is very, very good. It doesn't crash or have glitches. It's quite reliable for us.

What do I think about the scalability of the solution?

The scalability of the solution is very good. If a company has to expand it, they can do so.

Right now, we have about six or seven users that are directly on the product. We're encouraging them to use more data. We do plan to increase usage in the future.

How are customer service and technical support?

I'm a developer, so I don't interact directly with technical support. I can't speak to the quality of their service as I've never directly dealt with them.

Which solution did I use previously and why did I switch?

We did previously use a lot of different mechanisms, however, we needed something that was good at processing data for analytical purposes, and this solution fit the bill. It's a very powerful tool. I haven't seen other tools that could do precisely what this one does.

How was the initial setup?

The initial setup isn't too complex. It's quite straightforward.

We use CI/CD DevOps for deployment. We only use Spark for processing and for the Databricks cluster to spin off and do the job. It's continuously running in the background.

There isn't really any maintenance required per se. We just click the button and it comes up automatically, with the whole cluster and the Spark and everything ready to go.

What's my experience with pricing, setup cost, and licensing?

I'm unsure as to how much the licensing is for the solution. It's not an aspect of the product I deal with directly.

What other advice do I have?

We're customers and also partners with Apache.

While we are on version 2.6, we are considering upgrading to version 3.0.

I'd rate the solution nine out of ten. It works very well for us and suits our purposes almost perfectly.

Which deployment model are you using for this solution?

On-premises
Disclosure: My company has a business relationship with this vendor other than being a customer: Partner
Oscar Estorach
Chief Data-strategist and Director at theworkshop.es
Real User
Top 5 · Leaderboard
Scalable, open-source, and great for transforming data

Pros and Cons

  • "The solution has been very stable."
  • "It's not easy to install."

What is our primary use case?

You can do a lot of things in terms of the transformation of data. You can store and transform and stream data. It's very useful and has many use cases.

What is most valuable?

Overall, it's a very nice tool.

It is great for transforming data and doing micro-streaming or micro-batching.

The product offers an open-source version.

The solution has been very stable.

The scalability is good.

Apache Spark is a huge tool. It has many use cases and is very flexible. You can use it with so many other platforms. 

Spark, as a tool, is easy to work with as you can work with Python, Scala, and Java.

What needs improvement?

If you are developing projects, even ones that you do not need to put into a production scenario, you might still need a cluster of servers, as it requires distributed computing.

It's not easy to install. You are typically dealing with a big data system.

It's not a simple, straightforward architecture. 

For how long have I used the solution?

I've been using the solution for three years.

What do I think about the stability of the solution?

The stability is very good. There are no bugs or glitches and it doesn't crash or freeze. It's a reliable solution. 

What do I think about the scalability of the solution?

We have found the scalability to be good. If your company needs to expand it, it can do so.

We have five people working on the solution currently.

How are customer service and technical support?

There isn't really technical support for open source. You need to do your own studying. There are lots of places to find information. You can find details online, or in books, et cetera. There are even courses you can take that can help you understand Spark.

Which solution did I use previously and why did I switch?

I also use Databricks, which I use in the cloud.

How was the initial setup?

When handling big data systems, the installation is a bit difficult. When you need to deploy the systems, it's better to use services like Databricks.

I am not a professional admin. I am a developer and I design architecture.

You can use it in your standalone system, however, it's not the best way. It would be okay for little branch codes, not for production.

What's my experience with pricing, setup cost, and licensing?

We use the open-source version. It is free to use. However, you do need to have servers. We have three or four; they can be on-premises or in the cloud.

What other advice do I have?

I have the solution installed on my computer and on our servers. You can use it on-premises or as a SaaS.

I'd rate the solution at a nine out of ten. I've been very pleased with its capabilities. 

I would recommend the solution for the people who need to deploy projects with streaming. If you have many different sources or different types of data, and you need to put everything in the same place - like a data lake - Spark, at this moment, has the right tools. It's an important solution for data science, for data detectors. You can put all of the information in one place with Spark.

Which deployment model are you using for this solution?

On-premises
Disclosure: I am a real user, and this review is based on my own experience and opinions.
NitinKumar
Engineering Manager at Sigmoid
Real User
Top 5 · Leaderboard
Easy to code, fast, open-source, very scalable, and great for big data

Pros and Cons

  • "Its scalability and speed are very valuable. You can scale it a lot. It is a great technology for big data. It is definitely better than a lot of earlier warehouse or pipeline solutions, such as Informatica. Spark SQL is very compliant with normal SQL that we have been using over the years. This makes it easy to code in Spark. It is just like using normal SQL. You can use the APIs of Spark or you can directly write SQL code and run it. This is something that I feel is useful in Spark."
  • "Its UI can be better. Maintaining the history server is a little cumbersome, and it should be improved. I had issues while looking at the historical tags, which sometimes created problems. You have to separately create a history server and run it. Such things can be made easier. Instead of separately installing the history server, it can be made a part of the whole setup so that whenever you set it up, it becomes available."

What is our primary use case?

I use it mostly for ETL transformations and data processing. I have used Spark on-premises as well as on the cloud.

What is most valuable?

Its scalability and speed are very valuable. You can scale it a lot. It is a great technology for big data. It is definitely better than a lot of earlier warehouse or pipeline solutions, such as Informatica.

Spark SQL is very compliant with normal SQL that we have been using over the years. This makes it easy to code in Spark. It is just like using normal SQL. You can use the APIs of Spark or you can directly write SQL code and run it. This is something that I feel is useful in Spark.
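
To illustrate that point, the same query can be written either with the DataFrame API or as plain SQL; the table and column names here are made up for the example:

```python
# The two result DataFrames below are equivalent.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sql-vs-api-sketch").getOrCreate()

orders = spark.createDataFrame(
    [(1, "A", 10.0), (2, "B", 25.0), (3, "A", 5.0)],
    ["order_id", "customer", "amount"],
)
orders.createOrReplaceTempView("orders")

api_result = orders.groupBy("customer").agg(F.sum("amount").alias("total"))
sql_result = spark.sql(
    "SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer"
)

api_result.show()
sql_result.show()
```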

What needs improvement?

Its UI can be better. Maintaining the history server is a little cumbersome, and it should be improved. I had issues while looking at the historical tags, which sometimes created problems. You have to separately create a history server and run it. Such things can be made easier. Instead of separately installing the history server, it can be made a part of the whole setup so that whenever you set it up, it becomes available.
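
For reference, the history server is fed by event logs that each application writes; a minimal sketch of the application-side settings is below (the log directory is a placeholder). The server itself is started separately with sbin/start-history-server.sh and pointed at the same directory through spark.history.fs.logDirectory.

```python
# Enable event logging so finished applications show up in the history server.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("history-sketch")
    .config("spark.eventLog.enabled", "true")
    .config("spark.eventLog.dir", "hdfs:///spark-logs")   # placeholder directory
    .getOrCreate()
)
```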

For how long have I used the solution?

I have been using this solution for around five years.

What do I think about the stability of the solution?

There were bugs three to four years ago, which have been resolved. There were a couple of issues related to slowness when we did a lot of transformations using withColumn. I was writing a POC on ETL for moving from Informatica to Spark SQL for the ETL pipeline. It required hundreds of withColumn calls to change column names or add transformations, which made it slow. It happened in versions prior to 1.6, and it seems that this issue has been fixed since.

What do I think about the scalability of the solution?

It is very scalable. You can scale it a lot.

How are customer service and technical support?

I haven't contacted them.

How was the initial setup?

The initial setup was a little complex when I was using open-source Spark. I was doing a POC in the on-premises environment, and the initial setup was a little cumbersome. It required a lot of setup on Unix systems. We also had to do a lot of configuration and install a lot of things.

After I moved to the Cloudera CDH version, it was a little easy. It is a bundled product, so you just install whatever you want and use it.

What's my experience with pricing, setup cost, and licensing?

Apache Spark is open-source. You have to pay only when you use any bundled product, such as Cloudera.

What other advice do I have?

I would definitely recommend Spark. It is a great product. I like Spark a lot, and most of the features have been quite good. Its initial learning curve is a bit high, but as you learn it, it becomes very easy.

I would rate Apache Spark an eight out of ten.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
SS
Co-Founder at a tech vendor with 11-50 employees
Real User
Offers good machine learning, data learning, and Spark Analytics features

What is our primary use case?

We have built a product called "NetBot." We take any form of data (large email data, images, videos, or transactional data), transform the unstructured textual and video data into a structured, transactional form, and create an enterprise-wide smart data grid. That smart data grid is then used by the downstream analytics tools. We also provide machine-learning model building so people can get faster insight into their data.

What is most valuable?

We use all the features. We use it for end-to-end. All of our data analysis and execution happens through Spark.

The features we find most valuable are the: 

  • Machine learning
  • Data learning
  • Spark Analytics.

What needs improvement?

We've had problems using a Python process to try to access something in a large volume of data. It crashes if somebody gives me the wrong code because it cannot handle a large volume of data.
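
One common mitigation, which may or may not match the reviewer's situation: avoid pulling an entire large DataFrame back to the Python driver with collect(), and inspect a sample or iterate partition by partition instead.

```python
# Sketch: keep large results on the cluster instead of collecting them.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("driver-memory-sketch").getOrCreate()

df = spark.range(0, 100_000_000)   # stand-in for a large dataset

# Risky on big data: rows = df.collect()  (can exhaust driver memory)
df.limit(20).show()                # look at a small sample instead

for row in df.toLocalIterator():   # or stream rows back one partition at a time
    break
```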

For how long have I used the solution?

I have been using Apache Spark for more than five years. 

What do I think about the stability of the solution?

We haven't had any issues with stability so far. 

What do I think about the scalability of the solution?

As long as you do it correctly, it is scalable.

Our users mostly consist of data analysts, engineers, data scientists, and DB admins.

Which solution did I use previously and why did I switch?

Before using this solution, we used Apache Storm.

How was the initial setup?

The initial setup is complex. 

What about the implementation team?

We installed it ourselves. 

What other advice do I have?

I would rate it a nine out of ten. 

Which deployment model are you using for this solution?

On-premises
Disclosure: My company has a business relationship with this vendor other than being a customer: Partner
NK
Lead Consultant at a tech services company with 51-200 employees
Consultant
The data storage capacity means we can ingest data into the user database in more efficient ways

Pros and Cons

  • "The main feature that we find valuable is that it is very fast."
  • "We use big data manager but we cannot use it as conditional data so whenever we're trying to fetch the data, it takes a bit of time."

What is most valuable?

The main feature that we find valuable is that it is very fast. In terms of big data, the main feature is that the data is in so many different nodes. It goes through many data nodes so whenever we use the data, it enables us to parse the data from different data nodes. 
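
A small sketch of that parallelism (illustrative only): Spark splits a dataset into partitions spread across the nodes, and each partition is processed in parallel.

```python
# Inspect and change how a dataset is split across the cluster.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-sketch").getOrCreate()

df = spark.range(0, 1_000_000)
print(df.rdd.getNumPartitions())    # how many chunks the data is split into

df_wide = df.repartition(16)        # spread the work across more tasks
print(df_wide.rdd.getNumPartitions())
```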

What needs improvement?

We use big data manager but we cannot use it as conditional data so whenever we're trying to fetch the data, it takes a bit of time. There is some latency in the system and latency in the data caching. The main issue is that we need to design it in a way that data will be available to us very quickly. Right now it takes a long time, and the latest data should be available to us much more quickly.
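
One general technique for cutting repeated fetch latency, offered only as a sketch (the path and column are hypothetical), is to cache a frequently read dataset in cluster memory:

```python
# Cache a dataset so later queries reuse the in-memory copy.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-sketch").getOrCreate()

events = spark.read.parquet("/data/events")   # hypothetical source
events.cache()                                # keep it in cluster memory
events.count()                                # materialize the cache

# Subsequent reads hit memory instead of going back to storage.
events.filter(events.status == "NEW").count()
```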

What do I think about the stability of the solution?

We don't have any problems with stability. 

How are customer service and technical support?

I'm not the one who would contact their support if we needed it. 

How was the initial setup?

The initial setup is straightforward. 

What other advice do I have?

The advice that I would give to someone considering this solution is that the data has key streaming characteristics like velocity, which means how quickly you are going to refer to the data. These things matter when designing the solution, and we need to take them into account.

I would rate Apache Spark an eight out of ten. 

To make it a ten, they should improve the speed. The data storage capacity means we can ingest data into the user database in more efficient ways.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
Mohamed Ghorbel
Director of BigData Offer at IVIDATA
Real User
Top 20
Stable, fast, and easy to use

Pros and Cons

  • "The solution is very stable."
  • "The solution needs to optimize shuffling between workers."

What is our primary use case?

We primarily use the solution to integrate very large data sets from another environment, such as our SQL environment, and draw purposeful data before checking it. We also use the solution for streaming very very large servers. 

What is most valuable?

It is a very fast solution. It's very easy to use. There are many APIs for many languages like Scala, Java, R, and Python. The greatest advantage of Spark is that we can run many kinds of analytics, including SQL analytics, graph analytics, etc.

What needs improvement?

The solution needs to optimize shuffling between workers.
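
As one generic example of reducing shuffle (a standard technique, not a fix the reviewer proposes), the small side of a join can be broadcast so the large table is not shuffled across the network:

```python
# Broadcast join: ship the small table to every executor instead of shuffling
# both sides of the join.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("shuffle-sketch").getOrCreate()

big = spark.range(0, 10_000_000).withColumnRenamed("id", "user_id")
small = spark.createDataFrame([(0, "gold"), (1, "silver")], ["user_id", "tier"])

joined = big.join(broadcast(small), "user_id")
joined.explain()   # the plan shows a BroadcastHashJoin rather than a shuffle
```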

For how long have I used the solution?

I've been using the solution for four or five years.

What do I think about the stability of the solution?

The solution is very stable.

What do I think about the scalability of the solution?

The solution is scalable. My understanding is version 3.0 has renewed scaling capabilities and will be able to do so automatically.

How are customer service and technical support?

Apache is an open-source platform so there is no technical support.

What other advice do I have?

We use both on-premises and public and private cloud deployment models. We're partners with Databricks.

I'm a consultant. Our company works for large enterprises such as banks and energy companies. 17 of our workers use Apache Spark.

With the cloud, there are many companies that integrate Spark. Most projects in big data around the world use Spark, indirectly or directly. 

I'd rate the solution eight out of ten.

Disclosure: I am a real user, and this review is based on my own experience and opinions.