2018-06-27T19:19:00Z

What needs improvement with Apache Spark?


Please share with the community what you think needs improvement with Apache Spark.

What are its weaknesses? What would you like to see changed in a future version?

ITCS user
Guest
16 Answers

Real User

There is still plenty of room for improvement in Apache Spark in terms of integration and speed. The Apache Spark community could use Rust or C++ implementations to improve performance.

2020-06-10T05:44:05Z
Top 20, Leaderboard, Real User

The logging for the observability platform could be better.

2021-03-27T15:39:24Z
Top 5, Leaderboard, Real User

Its UI can be better. Maintaining the history server is a little cumbersome, and it should be improved. I had issues while looking at the historical tags, which sometimes created problems. You have to separately create a history server and run it. Such things can be made easier. Instead of separately installing the history server, it can be made a part of the whole setup so that whenever you set it up, it becomes available.

2021-02-01T12:04:16Z
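
As a rough illustration of the separate setup the answer above describes: applications write event logs to a directory, and the history server, started on its own with `sbin/start-history-server.sh`, reads them back from that directory. The sketch below renders the three `spark-defaults.conf` entries involved. The helper name `history_server_conf` and the log path are hypothetical and chosen only for this example; the reviewer's actual setup is not described.

```python
def history_server_conf(log_dir: str) -> str:
    """Render spark-defaults.conf lines for event logging.

    Applications write event logs to log_dir, and the history server
    reads them from the same place. Hypothetical helper for illustration.
    """
    settings = {
        # Applications record their events so they can be replayed later.
        "spark.eventLog.enabled": "true",
        # Where applications write their event logs.
        "spark.eventLog.dir": log_dir,
        # Where the separately started history server looks for logs.
        "spark.history.fs.logDirectory": log_dir,
    }
    return "\n".join(f"{key} {value}" for key, value in settings.items())


if __name__ == "__main__":
    # Assumed log location; any path reachable by both the applications
    # and the history server would work.
    print(history_server_conf("hdfs:///spark-logs"))
```

Pointing both sides at the same directory is the step that is easy to miss, which may be part of why the reviewer finds the setup cumbersome.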
Top 10, Leaderboard, Real User

Stream processing needs to be developed more in Spark. I have used Flink previously. Flink is better than Spark at stream processing.

2020-10-28T02:27:29Z
Top 5, Leaderboard, Real User

There are lots of items coming down the pipeline in the future. I don't know what features are missing. From my point of view, everything looks good. The graphical user interface (UI) could be a bit clearer. It's very hard to figure out the execution logs and understand how long it takes to send everything. If an execution is lost, it's not so easy to understand why or where it went. I have to manually drill down on the data processes, which takes a lot of time. Perhaps there could be a metrics monitor, or the whole log analysis could be improved to make it easier to understand and navigate. More information should be shared with the user. The solution already has all the information tracked in the cluster. It just needs to be accessible or searchable.

2020-07-23T07:58:35Z
Top 20, Consultant

I would like to see integration with data science platforms to optimize the processing capability for these tasks.

2020-02-02T10:42:14Z
Consultant

We use big data manager but we cannot use it as conditional data, so whenever we're trying to fetch the data, it takes a bit of time. There is some latency in the system and latency in the data caching. The main issue is that we need to design it in a way that data will be available to us very quickly. Right now it takes a long time, and the latest data should be available to us much more quickly.

2020-01-29T11:22:00Z
Real User

We've had problems using a Python process to try to access something in a large volume of data. It crashes if somebody gives me the wrong code because it cannot handle a large volume of data.

2020-01-29T11:22:00Z
Consultant

I think for IT people it is good. The whole idea is that Spark works pretty easily, but a lot of people, including me, struggle to set things up properly. I like contributions, and if you want to connect Spark with Hadoop it's not a big thing, but for other things, such as using Sqoop with Spark, you need to do the configuration by hand. I wish there were a solution that does all these configurations, like in Windows, where you have the whole solution and it handles the back end. I think that kind of solution would help. But still, it can do everything for a data scientist. Spark's main objective is to manipulate and calculate. It is playing with the data. So it has to keep doing what it does best and let the visualization tool do what it does best. Overall, it offers everything that I can imagine right now.

2019-12-23T07:05:00Z
Top 20, Real User

The solution needs to optimize shuffling between workers.

2019-12-09T10:58:00Z
Consultant

When you first start using this solution, it is common to run into memory errors when you are dealing with large amounts of data. Once you are experienced, it is easier and more stable. When you are trying to do something outside of the normal requirements in a typical project, it is difficult to find somebody with experience.

2019-10-13T05:48:00Z
Real User

The management tools could use improvement. Some of the debugging tools need some work as well. They need to be more descriptive.

2019-07-14T10:21:00Z
Real User

The search could be improved. Usually, we are using other tools to search for specific stuff. We use it the way we use other tools, to get the details, but if there were a way to search for little things, that would be better. It needs a new interface and a better way to get some data. In terms of writing our scripts, some processes could be faster. In the next release, if they can add more analytics, that would be useful. For example, for data, built data, if there was one port where you put the high one then you can pull any other close to you, and then maybe a log for the right script.

2019-07-10T12:01:00Z
Real User

Better data lineage support.

2019-04-08T13:04:00Z
Real User

It is like going back to the '80s for the complicated coding that is required to write efficient programs.

2019-03-17T03:12:00Z
User

I would suggest that it support more programming languages and also provide an internal scheduler for Spark jobs, with monitoring capability.

2018-06-27T19:19:00Z
Updated: July 2021.