Apache Spark Reviews, Competitors and Pricing

Anshuman Kishore

Director Product Development at Mycom Osi

Apr 1, 2024

Available for free and can be deployed easily

Pros and Cons

"The product's deployment phase is easy."

"At times during the deployment process, the tool goes down, making it look less robust. To take care of the issues in the deployment process, users need to do manual interventions occasionally."

What is our primary use case?

In my company, I use the solution for a use case that involves topology engines and large topology chains.

What is most valuable?

Overall, my company likes the product since it is a good tool.

What needs improvement?

It can be hard to find good Apache Spark developers; people with the right skill set are scarce in the market. This is an area to consider for improvement.

At times during the deployment process, the tool goes down, making it look less robust. To take care of the issues in the deployment process, users need to do manual interventions occasionally. I feel that the use of large datasets can be a cause of concern during the tool's deployment phase, making it an area where improvements are required.

For how long have I used the solution?

I have been using Apache Spark for seven to eight years.

Buyer's Guide

Apache Spark

April 2024

Free Report: Apache Spark Reviews and More

Learn what your peers think about Apache Spark. Get advice and tips from experienced pros sharing their opinions. Updated: April 2024.

DOWNLOAD NOW

768,578 professionals have used our research since 2012.

What do I think about the stability of the solution?

Stability-wise, I rate the solution an eight and a half out of ten.

What do I think about the scalability of the solution?

It is a very scalable solution.

In our company, there are users of Apache Spark, and then there are users of the applications that were developed with it.

Currently, my company does not plan to increase the use of the product.

How was the initial setup?

The product's deployment phase is easy.

The product's deployment phase involved the CI/CD pipeline and Jenkins pipeline.

Earlier, the solution was deployed on an on-premises model. Later on, the solution was deployed on a cloud model.

Initially, deployment took four to five hours or more. Over time, the deployment process became easier.

Around 50 to 100 people in my company are involved in the product's deployment process.

What's my experience with pricing, setup cost, and licensing?

Considering the product version used in my company, I feel that the tool is not costly since the product is available for free.

What other advice do I have?

The tool offers functionality that helps my company deal with data processing in projects on a near real-time basis.

One of the reasons my company chose Apache Spark is that its in-memory processing improves computational efficiency.

At the moment, my company plans to explore data analysis with Apache Spark. My company primarily used the product for data processing and not for data analysis.

If you buy the product with the capabilities of Azure DevOps and use the tool's dashboard, you find the solution to be good. The tool has an in-built UI and other good capabilities.

I feel that the product is fine and easy to use for those who plan to use it in the future. I recommended the tool to others based on the performance and scalability features it offers.

I managed data partitioning and distribution with Apache Spark once in my company.
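
To illustrate what partitioning by key means, here is a minimal sketch in plain Python. It is not Spark code and not the reviewer's implementation: the function names, the CRC32 hash, and the sample records are assumptions chosen for illustration, and Spark's own partitioners differ in detail. The idea it shows is that records with the same key always land in the same partition, so each partition can be processed independently.

```python
import zlib
from collections import defaultdict


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition index using a stable hash (CRC32)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def hash_partition(records, num_partitions):
    """Group (key, value) records into partitions by hashed key."""
    partitions = defaultdict(list)
    for key, value in records:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return dict(partitions)


# Hypothetical sample data: (key, value) pairs.
records = [("node-a", 1), ("node-b", 2), ("node-a", 3), ("node-c", 4)]
partitions = hash_partition(records, num_partitions=3)
```

A stable hash is used instead of Python's built-in `hash()`, which is randomized per process for strings and would assign keys differently on each run.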

The main benefit of the product is that data processing gets done as quickly as possible, thanks to its in-memory processing and performance.

I rate the solution an eight and a half to nine out of ten.

Which deployment model are you using for this solution?

On-premises

Disclosure: I am a real user, and this review is based on my own experience and opinions.

Last updated: Apr 1, 2024

Vineeth Marar

Cloud solution architect at 0

Mar 10, 2024

Offers seamless integration with Azure services and on-premises servers

Pros and Cons

"The solution is scalable."

"The setup I worked on was really complex."

What is our primary use case?

My contribution primarily focused on the networking aspect, ensuring secure and reliable connections between Azure services and on-premises servers. The solution was complex, involving private links, virtual machines, and custom firewall rules to facilitate secure data transmission.

I use Apache Spark, especially for data processing and analytics. My work involves a broad range of technologies, including PostgreSQL, Apache Kafka, Spark, and various Azure services. Previously, my focus was more on networking, cybersecurity, and Azure's data services like SQL and Active Directory.

How has it helped my organization?

We've set up a Spark cluster running in Azure to process real-time data. This setup involves connecting Azure applications to the Spark cluster via Azure Private Link, ensuring secure data flow.

The architecture required detailed network design, including routing through Linux firewalls and ensuring data could be securely transmitted to and from on-premises servers.

While I was heavily involved in the network design aspect, the Spark cluster was primarily used for processing and analyzing data streams for various applications.

Moreover, from my experience, I haven't encountered significant challenges with integrations involving Spark. The crucial factor is having established connectivity.

Whether Spark is operating in Azure or on-premises doesn't significantly affect our operations, thanks to high-bandwidth solutions like ExpressRoute. The main consideration then becomes the cost. As long as we maintain performance standards, I don't see any issues, regardless of the deployment environment.

Ensuring the collection of relevant metrics and logs is critical for assessing performance improvements. The specifics of how these are collected or which tools are used might vary, but the goal is to gather comprehensive data for ongoing monitoring and improvement.

What is most valuable?

What I liked about the solution was its uniqueness. We provided the customer with a solution that hadn't been offered by anyone else before.

It involved multiple components, such as Spark cluster, CMAX, a backend VM, and a Linux VM for mapping the service processes to the backend, which is running on-premises where the Kafka service was running.

It was challenging for people to understand how to send traffic through the private link between all these services. Ensuring the traffic was sent to the correct destination with the correct source header without any operation issues was complex, but we achieved it.

We had multiple instances of fault tolerance and scalability.

What needs improvement?

The setup I worked on was really complex.

For how long have I used the solution?

I have been using it for a year.

What do I think about the stability of the solution?

The solution was definitely stable. There were no unstable services in it. Since most services were in Azure, everything worked better.

Azure's networking products, like ExpressRoute and Private Link service, are very stable. We didn't encounter any issues with the solution.

It took some time to complete, but after that, we haven't had a single support case.

What do I think about the scalability of the solution?

The solution is scalable. We used a load balancer at each tier, with multiple instances of the services running.

It's all scalable and relevant. We didn't have a lot of issues and have been monitoring the traffic flow.

We even projected the requests for the next two to three years and created scalable instances accordingly.

There are many users of Spark in our organization. For example, many customers are using Spark, often in conjunction with requests from third-party vendors. They frequently use Spark plug-ins as well.

Which solution did I use previously and why did I switch?

I've been exploring its capabilities in the OpenAI context, rather than dealing with external databases.

I've also started using Apache Kafka for messaging and event streaming, which is essential since our solutions often integrate with applications running in Azure, including event hubs and service bus for messaging. This experience includes interfacing with various technologies, not just within Microsoft's ecosystem but also with Amazon Web Services.

Learning new technologies is a continuous process, and I've never found it difficult to adapt, especially with something as foundational as Apache Kafka.

How was the initial setup?

The setup I worked on was really complex, not specifically because of Spark but due to the integration with multiple services.

It took us about a week to finalize the solution, as understanding the entire workflow and brainstorming on how to maintain private traffic was intricate.

Regarding the deployment process, it involved thorough planning and testing to ensure minimal latency. We managed to achieve a latency of around 20 to 30 milliseconds, which was pretty good.

What about the implementation team?

For the deployment process, once we have a clear understanding of the workflow, the services to be included, how they should be integrated, the policies, and the configurations to be applied, it becomes easier to structure and incorporate it into the ops pipeline.

We may need to standardize it a bit based on different customer requirements. This standardization allows customers to apply the necessary customizations once it's deployed.

It's a hybrid solution, with about 90% of the services running in the cloud and 10% on-premises.

What's my experience with pricing, setup cost, and licensing?

The licensing costs for Spark would depend on the specific packages and the needs of the project. Costs can vary based on requirements, affordability, and customer expectations.

Licensing costs can vary. For instance, when purchasing a virtual machine, you're asked if you want to take advantage of the hybrid benefit or if you prefer the license costs to be included upfront by the cloud service provider, such as Azure.

If you choose the hybrid benefit, it indicates you already possess a license for the operating system and wish to avoid additional charges for that specific VM in Azure. This approach allows for a reduction in licensing costs, charging only for the service and associated resources.

The licensing arrangements can differ based on the product and service. Some products might require a license purchase upfront, with subsequent charges based only on usage.

The availability of hybrid benefits can also influence licensing costs, especially if you're using third-party services like Palo Alto in a VM from the marketplace. If you have an existing license, your costs could be reduced, but purchasing a new license would include licensing fees in the overall cost.

What other advice do I have?

My advice is to thoroughly understand your own needs and environment before making a decision. Recommendations should be based on product features, quality, accuracy, and stability.

Cost is also a factor, but it should not be the only consideration. Depending on whether the priority is performance and scalability or cost-effectiveness, I would suggest a solution that best meets those needs, whether it's a managed service or a more cost-conscious option.

I would rate Spark as ten out of ten. I haven't had any issues with Spark in my experience.

Disclosure: I am a real user, and this review is based on my own experience and opinions.

Last updated: Mar 10, 2024

AmitMataghare

Associate Director at a consultancy with 10,001+ employees

Apr 29, 2022

High performance, beneficial in-memory support, and useful online community support

Pros and Cons

"One of Apache Spark's most valuable features is its support for in-memory processing, which makes job execution very fast compared to traditional tools."

"Apache Spark could improve the connectors that it supports. There are a lot of databases on the market, including cloud data warehouses such as Redshift, Snowflake, and Synapse. Apache Spark should have connectors to these databases. Connecting to them currently requires a lot of workarounds; the connectors should be built in."

What is our primary use case?

Apache Spark is a data processing engine that you program in languages such as Java or Python. In my most recent deployment, we used Apache Spark to build engineering pipelines to move data from sources into the data lake.

What is most valuable?

One of Apache Spark's most valuable features is its support for in-memory processing, which makes job execution very fast compared to traditional tools.

What needs improvement?

Apache Spark could improve the connectors that it supports. There are a lot of databases on the market, including cloud data warehouses such as Redshift, Snowflake, and Synapse. Apache Spark should have connectors to these databases. Connecting to them currently requires a lot of workarounds; the connectors should be built in.

For how long have I used the solution?

I have been using Apache Spark for approximately five years.

What do I think about the stability of the solution?

Apache Spark is stable.

What do I think about the scalability of the solution?

I have found Apache Spark to be scalable.

How are customer service and support?

Apache Spark is open-source, so no team provides dedicated support. However, you can post your queries on the community forums, and you will usually receive a good response. Because it's open-source, you depend on freelance developers to respond, so you cannot expect a guaranteed response time, but the responses are, on average, pretty good.

How was the initial setup?

If Apache Spark is in the cloud, whether on Amazon, GCP, or Microsoft cloud, setting everything up takes only minutes. However, if you are using an on-premises installation, it might take some time to set up the environment.

What other advice do I have?

I rate Apache Spark an eight out of ten.

Which deployment model are you using for this solution?

Public Cloud

If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?

Amazon Web Services (AWS)

Disclosure: I am a real user, and this review is based on my own experience and opinions.

Atif Tariq

Cloud and Big Data Engineer | Developer at Huawei Cloud Middle East

Nov 29, 2023

A scalable solution that can be used for data computation and building data pipelines

Pros and Cons

"The most valuable feature of Apache Spark is its in-memory processing: it processes data in RAM rather than on disk, which is much faster and more efficient."

"Apache Spark should add some resource management improvements to the algorithms."

What is our primary use case?

Apache Spark is used for data computation, building data pipelines, or building analytics on top of batch data. Apache Spark is used to handle big data efficiently.

What is most valuable?

The most valuable feature of Apache Spark is its in-memory processing: it processes data in RAM rather than on disk, which is much faster and more efficient.

What needs improvement?

Apache Spark should add some resource management improvements to the algorithms. That way, the solution could handle data skew more efficiently in its physical and logical plans when you join different datasets.

For how long have I used the solution?

I have been working with Apache Spark for six to seven years.

What do I think about the stability of the solution?

Apache Spark is a very stable solution. The community is still working on other parts, like performance and removing bottlenecks. However, the solution's stability is very good.

I rate Apache Spark a nine out of ten for stability.

What do I think about the scalability of the solution?

Apache Spark is a scalable solution. More than 50 to 100 users are using the solution in our organization.

How are customer service and support?

Apache Spark's technical support team responds on time.

How would you rate customer service and support?

Positive

How was the initial setup?

The solution’s initial setup is very easy.

What's my experience with pricing, setup cost, and licensing?

Apache Spark is an open-source solution, and there is no cost involved in deploying the solution on-premises.

What other advice do I have?

I would recommend Apache Spark to users doing analytics, data computation, or pipelines.

Overall, I rate Apache Spark ten out of ten.

Which deployment model are you using for this solution?

Hybrid Cloud

Disclosure: I am a real user, and this review is based on my own experience and opinions.

Last updated: Nov 29, 2023

Lokesh Jayanna

Vice President at Goldman Sachs at a computer software company with 10,001+ employees

Nov 26, 2023

Stable product with a valuable SQL tool

Pros and Cons

"The product’s most valuable feature is the SQL tool. It enables us to create a database and publish it."

"At the initial stage, the product provides no container logs to check the activity."

What is our primary use case?

We use the product for extensive data analysis. It helps us analyze a huge amount of data and transfer it to data scientists in our organization.

What is most valuable?

The product’s most valuable feature is the SQL tool. It enables us to create a database and publish it. It is a useful feature for us.

What needs improvement?

At the initial stage, the product provides no container logs to check the activity. It remains inactive for a long time without giving us any information. The containers should start as quickly as those of Jupyter Notebook.

For how long have I used the solution?

We have been using Apache Spark for eight months to one year.

What do I think about the stability of the solution?

It is a stable product. I rate its stability an eight out of ten.

What do I think about the scalability of the solution?

We have 45 Apache Spark users. I rate its scalability a nine out of ten.

How was the initial setup?

The complexity of the initial setup depends on the kind of environment an organization is working with. It requires one executive for deployment. I rate the process an eight out of ten.

What's my experience with pricing, setup cost, and licensing?

The product is expensive, considering the setup. However, from a standalone perspective, it is inexpensive.

What other advice do I have?

I advise others to analyze their data and understand their business requirements before purchasing the product. I rate it an eight out of ten.

Which deployment model are you using for this solution?

On-premises

Disclosure: I am a real user, and this review is based on my own experience and opinions.

Last updated: Nov 26, 2023

UjjwalGupta

Module Lead at Mphasis

Mar 14, 2024

Helps to build ETL pipelines and load data into warehouses

Pros and Cons

"The tool's most valuable feature is its speed and efficiency. It's much faster than other tools and excels in parallel data processing. Unlike languages such as Python or JavaScript, which may struggle with parallel processing, it lets us handle large volumes of data easily and with more power."

"Apache Spark could potentially improve in terms of user-friendliness, particularly for individuals with a SQL background. While it's suitable for those with programming knowledge, making it more accessible to those without extensive programming skills could be beneficial."

What is our primary use case?

We're using Apache Spark primarily to build ETL pipelines. This involves transforming data and loading it into our data warehouse. Additionally, we're working with Delta Lake file formats to manage the contents.
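
The extract, transform, load flow described here can be sketched in a few lines. The example below is a deliberately small stand-in written with the Python standard library only, not Spark or Delta Lake code: the sample CSV, the `orders` table, and the in-memory SQLite database standing in for a warehouse are all assumptions made for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; the second row has an unparseable amount.
RAW_CSV = """order_id,amount,country
1,19.99,de
2,not-a-number,us
3,5.00,US
"""


def extract(text):
    """Read raw CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Cast types, normalize country codes, and drop invalid rows."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows whose amount cannot be parsed
        clean.append((int(row["order_id"]), amount, row["country"].upper()))
    return clean


def load(rows, conn):
    """Write the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER, amount REAL, country TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute(
    "SELECT order_id, amount, country FROM orders ORDER BY order_id"
).fetchall()
```

In a Spark pipeline the same three stages would run as distributed reads, transformations, and writes across a cluster; the sketch only shows the shape of the flow.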

What is most valuable?

The tool's most valuable feature is its speed and efficiency. It's much faster than other tools and excels in parallel data processing. Unlike languages such as Python or JavaScript, which may struggle with parallel processing, it lets us handle large volumes of data easily and with more power.

What needs improvement?

Apache Spark could potentially improve in terms of user-friendliness, particularly for individuals with a SQL background. While it's suitable for those with programming knowledge, making it more accessible to those without extensive programming skills could be beneficial.

For how long have I used the solution?

I have been using the product for six years.

What do I think about the stability of the solution?

Apache Spark is generally considered a stable product that rarely breaks down. Sudden increases in data volume can lead to memory errors, but these can typically be managed with autoscaling clusters. Schema changes or irregularities in streaming data may also pose challenges, but these could be addressed in future software versions.

What do I think about the scalability of the solution?

About 70-80 percent of employees in my company use the product.

How are customer service and support?

We haven't contacted Apache Spark support directly because it's an open-source tool. However, when using it as a product within Databricks, we've contacted Databricks support for assistance.

Which solution did I use previously and why did I switch?

The main reason our company opted for the product is its capability to process large volumes of data. While other options like Snowflake offer some advantages, they may have limitations regarding custom logic or modifications.

How was the initial setup?

The setup and installation of Apache Spark can vary in complexity depending on whether it's done in a standalone or cluster environment. The process is generally more straightforward in a standalone setup, especially if you're familiar with the concepts involved. However, setting up in a cluster environment may require more knowledge about clusters and networking, making it potentially more complex.

What's my experience with pricing, setup cost, and licensing?

The tool is an open-source product. If you're using the open-source Apache Spark, no fees are involved at any time. Charges only come into play when using it with other services like Databricks.

What other advice do I have?

If you're new to Apache Spark, the best way to learn is by using the Databricks Community Edition. It provides a cluster for Apache Spark where you can learn and test. I rate the product an eight out of ten.

If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?

Amazon Web Services (AWS)

Disclosure: I am a real user, and this review is based on my own experience and opinions.

Last updated: Mar 14, 2024

reviewer1759647

Information Technology Business Analyst at an aerospace/defense firm with 10,001+ employees

Jul 25, 2023

A highly scalable and affordable tool that can be used to gather information from different systems

Pros and Cons

"The product is useful for analytics."

"The product could improve the user interface and make it easier for new users."

What is most valuable?

We use it as an ETL tool to gather information from different systems. The product is useful for analytics.

What needs improvement?

The product could improve the user interface and make it easier for new users. It has a steep learning curve.

For how long have I used the solution?

I have been using the product for approximately three to four years. Currently, I am using the latest version.

What do I think about the stability of the solution?

The tool is stable. I rate the stability a ten out of ten.

What do I think about the scalability of the solution?

The tool is very scalable. I rate the scalability a ten out of ten. Approximately 30 users are using Apache Spark in our organization.

How are customer service and support?

We are using the free version of the product. So, we are not using any support.

How would you rate customer service and support?

Positive

How was the initial setup?

The basic installation is easy. However, we are working in the security business and need a very secure installation. It has been quite difficult. I rate the basic installation a ten out of ten. I rate the ease of setup a two or three out of ten for a more secure installation with all the security features. The solution is deployed on-premises in our organization. The deployment process requires a couple of weeks.

What's my experience with pricing, setup cost, and licensing?

We are using the free version of the solution.

What other advice do I have?

I would recommend the product. I think it's a good solution for analytics. Overall, I rate the product an eight out of ten.

Disclosure: I am a real user, and this review is based on my own experience and opinions.

Oscar Estorach

Chief Data-strategist and Director at Theworkshop.es

Aug 19, 2021

Scalable, open-source, and great for transforming data

Pros and Cons

"The solution has been very stable."

"It's not easy to install."

What is our primary use case?

You can do a lot of things in terms of the transformation of data. You can store and transform and stream data. It's very useful and has many use cases.

What is most valuable?

Overall, it's a very nice tool.

It is great for transforming data and doing micro-streaming or micro-batching.
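
As a rough illustration of micro-batching, the sketch below cuts an incoming stream into small fixed-size batches and processes each batch as a unit. It is plain Python, not Spark code: the `micro_batches` helper and the count-based batch size are assumptions made for illustration, and streaming engines typically cut batches by a time-based trigger rather than by record count.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def micro_batches(stream: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive batches of at most batch_size items from a stream."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # stream exhausted
        yield batch


# Hypothetical stream of seven events, processed three at a time.
events = range(7)
totals = [sum(batch) for batch in micro_batches(events, batch_size=3)]
```

Each batch is processed with ordinary batch logic (here, a sum), which is what lets a batch engine handle streaming data with near real-time latency.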

The product offers an open-source version.

The solution has been very stable.

The scalability is good.

Apache Spark is a huge tool. It has many use cases and is very flexible. You can use it with so many other platforms.

Spark, as a tool, is easy to work with as you can work with Python, Scala, and Java.

What needs improvement?

If you are developing projects and need to put them into a production scenario, you will need a cluster of servers, as it requires distributed computing.

It's not easy to install. You are typically dealing with a big data system.

It's not a simple, straightforward architecture.

For how long have I used the solution?

I've been using the solution for three years.

What do I think about the stability of the solution?

The stability is very good. There are no bugs or glitches and it doesn't crash or freeze. It's a reliable solution.

What do I think about the scalability of the solution?

We have found the scalability to be good. If your company needs to expand it, it can do so.

We have five people working on the solution currently.

How are customer service and technical support?

There isn't really technical support for open source. You need to do your own studying. There are lots of places to find information. You can find details online, or in books, et cetera. There are even courses you can take that can help you understand Spark.

Which solution did I use previously and why did I switch?

I also use Databricks, which I use in the cloud.

How was the initial setup?

When handling big data systems, the installation is a bit difficult. When you need to deploy the systems, it's better to use services like Databricks.

I am not a professional admin. I am a developer, and I design architecture.

You can use it on a standalone system; however, it's not the best way. It would be okay for small pieces of code, not for production.

What's my experience with pricing, setup cost, and licensing?

We use the open-source version. It is free to use. However, you do need to have servers. We have three or four. They can be on-premises or in the cloud.

What other advice do I have?

I have the solution installed on my computer and on our servers. You can use it on-premises or as a SaaS.

I'd rate the solution at a nine out of ten. I've been very pleased with its capabilities.

I would recommend the solution for the people who need to deploy projects with streaming. If you have many different sources or different types of data, and you need to put everything in the same place - like a data lake - Spark, at this moment, has the right tools. It's an important solution for data science, for data detectors. You can put all of the information in one place with Spark.

Which deployment model are you using for this solution?

On-premises

Disclosure: I am a real user, and this review is based on my own experience and opinions.