Apache Spark Overview
What is Apache Spark?
Spark provides programmers with an application programming interface centered on a data structure called the resilient distributed dataset (RDD), a read-only multiset of data items distributed over a cluster of machines that is maintained in a fault-tolerant way. It was developed in response to limitations in the MapReduce cluster computing paradigm, which forces a particular linear dataflow structure on distributed programs: MapReduce programs read input data from disk, map a function across the data, reduce the results of the map, and store the reduction results on disk. Spark's RDDs function as a working set for distributed programs that offers a (deliberately) restricted form of distributed shared memory.
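The linear dataflow described above can be sketched in plain Python, with no Spark installation required; the data and function here are made up purely for illustration:

```python
from functools import reduce

# MapReduce-style linear dataflow: read -> map -> reduce -> write.
# In MapReduce, each stage round-trips through disk (e.g. HDFS).
records = ["3", "1", "4", "1", "5"]        # stand-in for input read from disk
mapped = [int(r) * 2 for r in records]     # "map" a function across the data
total = reduce(lambda a, b: a + b, mapped) # "reduce" the mapped results
print(total)                               # stand-in for writing results to disk

# Spark's RDD model instead keeps the intermediate working set (here, `mapped`)
# in cluster memory, so iterative algorithms can reuse it without re-reading
# from disk between stages.
```

The key contrast is that MapReduce forces the read-map-reduce-write sequence for every job, while an RDD lets later stages reuse in-memory intermediates.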
Apache Spark Buyer's Guide
Download the Apache Spark Buyer's Guide including reviews and more. Updated: May 2021
Apache Spark Customers
NASA JPL, UC Berkeley AMPLab, Amazon, eBay, Yahoo!, UC Santa Cruz, TripAdvisor, Taboola, Agile Lab, Art.com, Baidu, Alibaba Taobao, EURECOM, Hitachi Solutions
What users are saying about Apache Spark pricing:
- "Apache Spark is open-source. You have to pay only when you use any bundled product, such as Cloudera."
Technical Consultant at a tech services company with 1-10 employees
Dec 25, 2019
Good streaming features enable data ingestion and analysis within Spark Streaming
What is our primary use case?
We are working with a client that has a wide variety of data residing in structured databases as well. The idea is to build a database in Hadoop first, which we are in the process of doing right now: one place for all kinds of data. Then we are going to use Spark.
Pros and Cons
- "I feel the streaming is its best feature."
- "When you want to extract data from your HDFS and other sources then it is kind of tricky because you have to connect with those sources."
What other advice do I have?
On a scale of 1 to 10, I'd put it at an eight. To make it a perfect 10, I'd like to see an improved configuration tool. Sometimes it is a nightmare on Linux trying to figure out what happened in the configuration and back-end, so installation and configuration could be better integrated with other tools. We are technical people and could figure it out, but if aspects like that were improved, then less technical people would use it and it would be more adaptable to the end-user.
Provides fast aggregations, AI libraries, and a lot of connectors
What is our primary use case?
We just finished a central front project called MFY for our in-house fraud team. In this project, we are using Spark along with Cloudera. In front of Spark, we are using Couchbase. Spark is mainly used for aggregations and AI (for future usage). It gathers stuff from Couchbase and does the calculations. We are not actively using Spark AI libraries at this time, but we are going to use them. This project is for classifying transactions and finding suspicious activities, especially those that come from internet channels such as internet banking and mobile banking. It…
Pros and Cons
- "AI libraries are the most valuable. They provide extensibility and usability. Spark has a lot of connectors, which is a very important and useful feature for AI. You need to connect a lot of points for AI, and you have to get data from those systems. Connectors are very wide in Spark. With a Spark cluster, you can get fast results, especially for AI."
- "Stream processing needs to be developed more in Spark. I have used Flink previously. Flink is better than Spark at stream processing."
What other advice do I have?
I would advise planning well before implementing this solution. In enterprise corporations like ours, there are a lot of policies. You should first find out your needs, and after that, you or your team should set it up based on your needs. If your needs change during development because of the business requirements, it will be very difficult. If you are clear about your needs, it is easier to set it up. If you know how Spark is used in your project, you have to define firewall rules and cluster needs. When you set up Spark, it should be ready for people's usage, especially for remote job…
Stable and easy to set up with a very good memory processing engine
What is our primary use case?
When we receive data from the messaging queue, we process everything using Apache Spark. Databricks does the processing and sends everything back to the Apache file in the data lake. The machine learning program does some analysis using the ML prediction algorithm.
Pros and Cons
- "The memory processing engine is the solution's most valuable aspect. It processes everything extremely fast, and it's in the cluster itself. It acts as a memory engine and is very effective in processing data correctly."
- "The graphical user interface (UI) could be a bit more clear. It's very hard to figure out the execution logs and understand how long it takes to send everything. If an execution is lost, it's not so easy to understand why or where it went. I have to manually drill down on the data processes which takes a lot of time. Maybe there could be like a metrics monitor, or maybe the whole log analysis could be improved to make it easier to understand and navigate."
What other advice do I have?
We're customers and also partners with Apache. While we are on version 2.6, we are considering upgrading to version 3.0. I'd rate the solution nine out of ten. It works very well for us and suits our purposes almost perfectly.
Easy to code, fast, open-source, very scalable, and great for big data
What is our primary use case?
I use it mostly for ETL transformations and data processing. I have used Spark on-premises as well as on the cloud.
Pros and Cons
- "Its scalability and speed are very valuable. You can scale it a lot. It is a great technology for big data. It is definitely better than a lot of earlier warehouse or pipeline solutions, such as Informatica. Spark SQL is very compliant with normal SQL that we have been using over the years. This makes it easy to code in Spark. It is just like using normal SQL. You can use the APIs of Spark or you can directly write SQL code and run it. This is something that I feel is useful in Spark."
- "Its UI can be better. Maintaining the history server is a little cumbersome, and it should be improved. I had issues while looking at the historical tags, which sometimes created problems. You have to separately create a history server and run it. Such things can be made easier. Instead of separately installing the history server, it can be made a part of the whole setup so that whenever you set it up, it becomes available."
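The reviewer's point that "you can use the APIs of Spark or you can directly write SQL code" can be illustrated without a Spark cluster. The stdlib sketch below uses SQLite as a stand-in: the same question is answered once declaratively in SQL and once programmatically, mirroring Spark's `spark.sql(...)` versus DataFrame API duality. The table and values are invented for the example:

```python
import sqlite3

# Illustrative data: same rows queried via SQL and via plain Python code.
rows = [("alice", 30), ("bob", 25), ("carol", 41)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO people VALUES (?, ?)", rows)

# "SQL path": a declarative query, analogous to spark.sql("SELECT ...").
sql_result = conn.execute(
    "SELECT name FROM people WHERE age > 28 ORDER BY name"
).fetchall()

# "API path": the same filter expressed programmatically, analogous to the
# DataFrame API (df.filter(...).select(...).orderBy(...)).
api_result = sorted((name,) for name, age in rows if age > 28)

assert sql_result == api_result  # both paths agree
print(sql_result)
```

Because Spark SQL stays close to standard SQL, teams can pick whichever path fits the task and get identical results, which is the ease-of-coding point the review makes.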
What other advice do I have?
I would definitely recommend Spark. It is a great product. I like Spark a lot, and most of the features have been quite good. Its initial learning curve is a bit high, but as you learn it, it becomes very easy. I would rate Apache Spark an eight out of ten.
Principal Architect at a financial services firm with 1,001-5,000 employees
Jul 17, 2019
Fast performance and has an easy initial setup
What is our primary use case?
We use the solution for analytics.
Pros and Cons
- "I found the solution stable. We haven't had any problems with it."
- "It needs a new interface and a better way to get some data. In terms of writing our scripts, some processes could be faster."
What other advice do I have?
I would recommend the solution. I would rate it an eight or nine out of ten. For some areas, I would give it a ten, but I cannot use some parts. If you are going to use it for a consumer, then I would recommend it and you should go ahead. It doesn't work for me as I have different clients and different engagements.
Senior Consultant & Training at a tech services company with 51-200 employees
Easy to use and is capable of processing large amounts of data
What is our primary use case?
We use this solution for information gathering and processing. I use it myself when I am developing on my laptop. I am currently using an on-premises deployment model. However, in a few weeks, I will be using the EMR version on the cloud.
Pros and Cons
- "The most valuable feature of this solution is its capacity for processing large amounts of data."
- "When you first start using this solution, it is common to run into memory errors when you are dealing with large amounts of data."
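The memory errors the reviewer mentions are a general pitfall of materializing a large dataset at once; Spark mitigates this with partitioning, and the same principle shows up in plain Python as lazy (streamed) versus eager (fully materialized) processing. A minimal stdlib sketch, with made-up data:

```python
import sys

# Eager: builds the whole result list in memory at once.
def doubled_list(n):
    return [i * 2 for i in range(n)]

# Lazy: a generator yields one item at a time, in roughly constant memory.
def doubled_stream(n):
    return (i * 2 for i in range(n))

n = 100_000
eager = doubled_list(n)

# Both strategies compute the same answer...
assert sum(doubled_stream(n)) == sum(eager)

# ...but the generator object is far smaller than the materialized list.
assert sys.getsizeof(doubled_stream(n)) < sys.getsizeof(eager)
```

Partition sizing and caching policy in Spark serve the same purpose: keep the working set per node small enough to fit in memory.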
What other advice do I have?
The work that we are doing with this solution is quite common and is very easy to do. My advice for anybody who is implementing this solution is to look at their needs and then look at the community. Normally, there are a lot of people who have already done what you need. So, even without experience, it is quite simple to do a lot of things. I would rate this solution a nine out of ten.
Co-Founder at a tech vendor with 11-50 employees
Jan 29, 2020
Offers good machine learning, data learning, and Spark Analytics features
What is our primary use case?
We have built a product called "NetBot." We take any form of data (large email datasets, images, videos, or transactional data), transform the unstructured textual and video data into structured form, and create an enterprise-wide smart data grid. That smart data grid is then used by the downstream analytics tools. We also provide machine-learning tooling for people to get faster insight into their data.
What is most valuable?
We use all the features. We use it end-to-end. All of our data analysis and execution happens through Spark. The features we find most valuable are machine learning, data learning, and Spark analytics.
What needs improvement?
We've had problems using a Python process to try to access…
Lead Consultant at a tech services company with 51-200 employees
Jan 30, 2020
The data storage capacity means we can ingest into the user database in more efficient ways
Pros and Cons
- "The main feature that we find valuable is that it is very fast."
- "We use big data manager but we cannot use it as conditional data so whenever we're trying to fetch the data, it takes a bit of time."
What other advice do I have?
The advice that I would give to someone considering this solution is to look at the key streaming characteristics of your data, such as velocity, meaning how quickly you are going to access the data. These things matter when designing the solution and need to be taken into account. I would rate Apache Spark an eight out of ten. To make it a ten, they should improve the speed. The data storage capacity means we can ingest into the user database in more efficient ways.