Apache Spark Overview

Apache Spark is the #1 ranked solution in our list of top Hadoop tools. It is most often compared to Spring Boot.

What is Apache Spark?

Spark provides programmers with an application programming interface centered on a data structure called the resilient distributed dataset (RDD), a read-only multiset of data items distributed over a cluster of machines and maintained in a fault-tolerant way. It was developed in response to limitations in the MapReduce cluster computing paradigm, which forces a particular linear dataflow structure on distributed programs: MapReduce programs read input data from disk, map a function across the data, reduce the results of the map, and store the reduction results on disk. Spark's RDDs function as a working set for distributed programs that offers a (deliberately) restricted form of distributed shared memory.
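The linear dataflow described above can be illustrated with a small sketch in plain Python. This is not Spark's API, and it runs on a single machine with no disk I/O; the helper names (`map_phase`, `shuffle`, `reduce_phase`, `word_pairs`) and the sample lines are hypothetical and exist only for illustration. It contrasts a single map-then-reduce pass with reusing an in-memory working set for more than one computation, which is the idea behind RDDs.

```python
from collections import defaultdict
from functools import reduce


def map_phase(records, map_fn):
    """Apply map_fn to every record; map_fn yields (key, value) pairs."""
    for record in records:
        yield from map_fn(record)


def shuffle(pairs):
    """Group values by key, as happens between the map and reduce steps."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reduce_fn):
    """Fold each key's values into a single result."""
    return {key: reduce(reduce_fn, values) for key, values in groups.items()}


def word_pairs(line):
    """Hypothetical map function: emit (word, 1) for each word in a line."""
    for word in line.split():
        yield (word.lower(), 1)


# Hypothetical input; a real job would read this from distributed storage.
lines = ["spark keeps data in memory", "MapReduce writes data to disk"]

# MapReduce style: one linear pass from input to reduced output.
counts = reduce_phase(shuffle(map_phase(lines, word_pairs)), lambda a, b: a + b)

# Working-set style: materialize the mapped data once, then reuse it
# for several computations without re-reading or re-mapping the input.
working_set = list(map_phase(lines, word_pairs))
counts_again = reduce_phase(shuffle(working_set), lambda a, b: a + b)
distinct_words = len(shuffle(working_set))
```

In the second half, `working_set` plays the role of the reusable in-memory dataset: both the word counts and the distinct-word total are derived from it without repeating the map step.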

Apache Spark Buyer's Guide

Download the Apache Spark Buyer's Guide including reviews and more. Updated: July 2021

Apache Spark Customers

NASA JPL, UC Berkeley AMPLab, Amazon, eBay, Yahoo!, UC Santa Cruz, TripAdvisor, Taboola, Agile Lab, Art.com, Baidu, Alibaba Taobao, EURECOM, Hitachi Solutions

Pricing Advice

What users are saying about Apache Spark pricing:
  • "Apache Spark is open-source. You have to pay only when you use any bundled product, such as Cloudera."

SA
Technical Consultant at a tech services company with 1-10 employees
Consultant
Good streaming features enable data entry and analysis within Spark Streaming

What is our primary use case?

We are working with a client that has a wide variety of data residing in other structured databases, as well. The idea is to make a database in Hadoop first, which we are in the process of building right now. One place for all kinds of data. Then we are going to use Spark.

Pros and Cons

  • "I feel the streaming is its best feature."
  • "When you want to extract data from your HDFS and other sources then it is kind of tricky because you have to connect with those sources."

What other advice do I have?

On a scale of 1 to 10, I'd put it at an eight. To make it a perfect 10 I'd like to see an improved configuration bot. Sometimes it is a nightmare on Linux trying to figure out what happened on the configuration and back-end. So I think installation and configuration with some other tools could be improved. We are technical people, so we could figure it out, but if aspects like that were improved, then people who are less technical would use it and it would be more adaptable to the end-user.
Kürşat Kurt
Software Architect at Akbank
Real User
Top 10 Leaderboard
Provides fast aggregations, AI libraries, and a lot of connectors

What is our primary use case?

We just finished a central front project called MFY for our in-house fraud team. In this project, we are using Spark along with Cloudera. In front of Spark, we are using Couchbase. Spark is mainly used for aggregations and AI (for future usage). It gathers stuff from Couchbase and does the calculations. We are not actively using Spark AI libraries at this time, but we are going to use them. This project is for classifying the transactions and finding suspicious activities, especially those suspicious activities that come from internet channels such as internet banking and mobile banking. It…

Pros and Cons

  • "AI libraries are the most valuable. They provide extensibility and usability. Spark has a lot of connectors, which is a very important and useful feature for AI. You need to connect a lot of points for AI, and you have to get data from those systems. Connectors are very wide in Spark. With a Spark cluster, you can get fast results, especially for AI."
  • "Stream processing needs to be developed more in Spark. I have used Flink previously. Flink is better than Spark at stream processing."

What other advice do I have?

I would advise planning well before implementing this solution. In enterprise corporations like ours, there are a lot of policies. You should first find out your needs, and after that, you or your team should set it up based on your needs. If your needs change during development because of the business requirements, it will be very difficult. If you are clear about your needs, it is easier to set it up. If you know how Spark is used in your project, you have to define firewall rules and cluster needs. When you set up Spark, it should be ready for people's usage, especially for remote job…
RV
Director at Nihil Solutions
Real User
Top 5 Leaderboard
Stable and easy to set up with a very good memory processing engine

What is our primary use case?

When we receive data from the messaging queue, we process everything using Apache Spark. Databricks does the processing and sends everything back to the Apache file in the data lake. The machine learning program does some kind of analysis using the ML prediction algorithm.

Pros and Cons

  • "The memory processing engine is the solution's most valuable aspect. It processes everything extremely fast, and it's in the cluster itself. It acts as a memory engine and is very effective in processing data correctly."
  • "The graphical user interface (UI) could be a bit more clear. It's very hard to figure out the execution logs and understand how long it takes to send everything. If an execution is lost, it's not so easy to understand why or where it went. I have to manually drill down on the data processes which takes a lot of time. Maybe there could be like a metrics monitor, or maybe the whole log analysis could be improved to make it easier to understand and navigate."

What other advice do I have?

We're customers and also partners with Apache. While we are on version 2.6, we are considering upgrading to version 3.0. I'd rate the solution nine out of ten. It works very well for us and suits our purposes almost perfectly.
NitinKumar
Engineering Manager at Sigmoid
Real User
Top 5 Leaderboard
Easy to code, fast, open-source, very scalable, and great for big data

What is our primary use case?

I use it mostly for ETL transformations and data processing. I have used Spark on-premises as well as on the cloud.

Pros and Cons

  • "Its scalability and speed are very valuable. You can scale it a lot. It is a great technology for big data. It is definitely better than a lot of earlier warehouse or pipeline solutions, such as Informatica. Spark SQL is very compliant with normal SQL that we have been using over the years. This makes it easy to code in Spark. It is just like using normal SQL. You can use the APIs of Spark or you can directly write SQL code and run it. This is something that I feel is useful in Spark."
  • "Its UI can be better. Maintaining the history server is a little cumbersome, and it should be improved. I had issues while looking at the historical tags, which sometimes created problems. You have to separately create a history server and run it. Such things can be made easier. Instead of separately installing the history server, it can be made a part of the whole setup so that whenever you set it up, it becomes available."

What other advice do I have?

I would definitely recommend Spark. It is a great product. I like Spark a lot, and most of the features have been quite good. Its initial learning curve is a bit high, but as you learn it, it becomes very easy. I would rate Apache Spark an eight out of ten.
AD
Senior Consultant & Training at a tech services company with 51-200 employees
Consultant
Easy to use and is capable of processing large amounts of data

What is our primary use case?

We use this solution for information gathering and processing. I use it myself when I am developing on my laptop. I am currently using an on-premises deployment model. However, in a few weeks, I will be using the EMR version on the cloud.

Pros and Cons

  • "The most valuable feature of this solution is its capacity for processing large amounts of data."
  • "When you first start using this solution, it is common to run into memory errors when you are dealing with large amounts of data."

What other advice do I have?

The work that we are doing with this solution is quite common and is very easy to do. My advice for anybody who is implementing this solution is to look at their needs and then look at the community. Normally, there are a lot of people who have already done what you need. So, even without experience, it is quite simple to do a lot of things. I would rate this solution a nine out of ten.
SS
Co-Founder at a tech vendor with 11-50 employees
Real User
Offers good machine learning, data learning, and Spark Analytics features

What is our primary use case?

We have built a product called "NetBot." We take any form of data (large email data, images, videos, or transactional data), transform unstructured textual data and videos into a structured, transactional form, and create an enterprise-wide smart data grid. That smart data grid is used by downstream analytics tools. We also provide machine-building for people to get faster insight into their data.

What is most valuable?

We use all the features. We use it end-to-end. All of our data analysis and execution happens through Spark. The features we find most valuable are:
  • Machine learning
  • Data learning
  • Spark Analytics

What needs improvement?

We've had problems using a Python process to try to access…
NK
Lead Consultant at a tech services company with 51-200 employees
Consultant
The data storage capacity means we can inject somewhere in the user database in more efficient ways

Pros and Cons

  • "The main feature that we find valuable is that it is very fast."
  • "We use big data manager but we cannot use it as conditional data so whenever we're trying to fetch the data, it takes a bit of time."

What other advice do I have?

The advice that I would give to someone considering this solution is that the quality of data has key streaming capabilities like velocity. This means how quickly you are going to refer to the data. These things matter by designing the solution. We need to take these things out. I would rate Apache Spark an eight out of ten. To make it a ten they should improve the speed. The data storage capacity means we can inject somewhere in the user database in more efficient ways.
Mohamed Ghorbel
Director of BigData Offer at IVIDATA
Real User
Top 20
Stable, fast, and easy to use

What is our primary use case?

We primarily use the solution to integrate very large data sets from another environment, such as our SQL environment, and draw purposeful data before checking it. We also use the solution for streaming very large servers.

Pros and Cons

  • "The solution is very stable."
  • "The solution needs to optimize shuffling between workers."

What other advice do I have?

We use both on-premises and public and private cloud deployment models. We're partners with Databricks. I'm a consultant. Our company works for large enterprises such as banks and energy companies. 17 of our workers use Apache Spark. With the cloud, there are many companies that integrate Spark. Most projects in big data around the world use Spark, indirectly or directly. I'd rate the solution eight out of ten.