We performed a comparison between Apache Spark and Spark SQL based on real PeerSpot user reviews.
Find out in this report how the two Hadoop solutions compare in terms of features, pricing, service and support, ease of deployment, and ROI.
"I like that it can handle multiple tasks parallelly. I also like the automation feature. JavaScript also helps with the parallel streaming of the library."
"Now, when we're tackling sentiment analysis using NLP technologies, we deal with unstructured data—customer chats, feedback on promotions or demos, and even media like images, audio, and video files. For processing such data, we rely on PySpark. Beneath the surface, Spark functions as a compute engine with in-memory processing capabilities, enhancing performance through features like broadcasting and caching. It's become a crucial tool, widely adopted by 90% of companies for a decade or more."
"The scalability has been the most valuable aspect of the solution."
"The memory processing engine is the solution's most valuable aspect. It processes everything extremely fast, and it's in the cluster itself. It acts as a memory engine and is very effective in processing data correctly."
"There's a lot of functionality."
"With Spark, we parallelize our operations, efficiently accessing both historical and real-time data."
"I appreciate everything about the solution, not just one or two specific features. The solution is highly stable. I rate it a perfect ten. The solution is highly scalable. I rate it a perfect ten. The initial setup was straightforward. I recommend using the solution. Overall, I rate the solution a perfect ten."
"The main feature that we find valuable is that it is very fast."
"Spark SQL's efficiency in managing distributed data and its simplicity in expressing complex operations make it an essential part of our data pipeline."
"Certain data sets that are very large are very difficult to process with Pandas and Python libraries. Spark SQL has helped us a lot with that."
"Data validation and ease of use are the most valuable features."
"Overall the solution is excellent."
"The solution is easy to understand if you have basic knowledge of SQL commands."
"One of Spark SQL's most beautiful features is running parallel queries to go through enormous data."
"This solution is useful to leverage within a distributed ecosystem."
"The team members don't have to learn a new language and can implement complex tasks very easily using only SQL."
"When using Spark, users may need to write their own parallelization logic, which requires additional effort and expertise."
"Technical expertise from an engineer is required to deploy and run high-tech tools, like Informatica, on Apache Spark, making it an area where improvements are required to make the process easier for users."
"At the initial stage, the product provides no container logs to check the activity."
"Spark could be improved by adding support for open-source storage layers other than Delta Lake."
"I would like to see integration with data science platforms to optimize the processing capability for these tasks."
"Apache Spark's GUI and scalability could be improved."
"Apache Spark provides very good performance. The tuning phase is still tricky."
"The migration of data between different versions could be improved."
"In terms of improvement, the only thing that could be enhanced is the stability aspect of Spark SQL."
"Being a new user, I am not able to find out how to partition it correctly. I probably need more information or knowledge. In other database solutions, you can easily optimize all partitions. I haven't found a quicker way to do that in Spark SQL. It would be good if you don't need a partition here, and the system automatically partitions in the best way. They can also provide more educational resources for new users."
"I've experienced some incompatibilities when using the Delta Lake format."
"In the next release, maybe the visualization of some command-line features could be added."
"There are many inconsistencies in syntax for the different querying tasks."
"Anything to improve the GUI would be helpful."
"It takes a bit of time to get used to using this solution versus Pandas as it has a steep learning curve."
"There should be better integration with other solutions."
Apache Spark is ranked 1st in Hadoop with 60 reviews while Spark SQL is ranked 4th in Hadoop with 14 reviews. Apache Spark is rated 8.4, while Spark SQL is rated 7.8. The top reviewer of Apache Spark writes "Reliable, able to expand, and handle large amounts of data well". On the other hand, the top reviewer of Spark SQL writes "Offers the flexibility to handle large-scale data processing". Apache Spark is most compared with Spring Boot, AWS Batch, SAP HANA, Cloudera Distribution for Hadoop and AWS Lambda, whereas Spark SQL is most compared with IBM Db2 Big SQL, HPE Ezmeral Data Fabric, SAP HANA and Netezza Analytics.
We monitor all Hadoop reviews to prevent fraudulent reviews and keep review quality high. We do not post reviews by company employees or direct competitors. We validate each review for authenticity via cross-reference with LinkedIn, and personal follow-up with the reviewer when necessary.