What is our primary use case?
We are working with a client that has a wide variety of data, some of it residing in other structured databases. The idea is to first build a single repository in Hadoop, which we are in the process of doing right now, so that there is one place for all kinds of data. Then we are going to use Spark.
What is most valuable?
I have worked with Hadoop a lot in my career, and you need to do a lot of things to get it to Hello World. In Spark it is easy. You could say it's an umbrella that lets you do everything under one roof. It also has Spark Streaming. I feel streaming is its best feature, because I have been able to extract data and run the analysis within Spark Streaming.
What needs improvement?
I think for IT people it is good. The whole idea is that Spark works pretty easily, but a lot of people, including me, struggle to set things up properly. I like contributions, and if you want to connect Spark with Hadoop it's not a big thing, but for other things, such as using Sqoop with Spark, you need to do the configuration by hand. I wish there were a solution that handled all these configurations, like in Windows, where you have the whole solution and it takes care of the back end. I think that kind of solution would help. Still, it can do everything for a data scientist.
Spark's main objective is to manipulate and calculate. It is playing with the data. So it has to keep doing what it does best and let the visualization tool do what it does best.
Overall, it offers everything that I can imagine right now.
For how long have I used the solution?
I have been using Apache Spark for a couple of months.
What do I think about the stability of the solution?
In terms of stability, I have not seen any bugs, glitches, or crashes. Even if there were any, that would be fine, because I would probably take care of them and would have progressed further in the process.
What do I think about the scalability of the solution?
I have not tested the scalability yet.
In my company, there are two or three people who are using it for different products. But right now, the client I'm engaged with doesn't know anything about Spark or Hadoop. They are a typical financial company, so they do what they do and ask us to do everything. They have pretty much outsourced their whole big data initiative to us.
Which solution did I use previously and why did I switch?
I have used MapReduce from Hadoop previously. Otherwise, I haven't used any other big data infrastructure.
In my previous work, not at this company, I was working with some big data, but I was extracting it using a single core of my PC. Over time I realized that my system had eight cores, so I started using all of them through multi-core programming. Then I realized that Hadoop and Spark do the same thing, but across different PCs. That was when I used multi-core programming, and that's the point: Spark needs to go and search Hadoop and other sources.
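To illustrate the multi-core idea described above, here is a minimal Python sketch. The word-count task, the function names, and the default of eight workers (chosen to match the eight cores mentioned) are assumptions for illustration only, not the code used at the time. It splits the input into chunks, processes each chunk in a separate worker, and merges the partial results, which is roughly the same split, map, and reduce pattern that Hadoop and Spark apply across machines.

```python
from collections import Counter
from concurrent.futures import ProcessPoolExecutor
from typing import List


def split_into_chunks(items: List[str], n_chunks: int) -> List[List[str]]:
    """Split items into at most n_chunks contiguous, roughly equal parts."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def count_words(lines: List[str]) -> Counter:
    """Map step: count the words in one chunk, independently of the others."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def parallel_word_count(lines: List[str], workers: int = 8,
                        executor_cls=ProcessPoolExecutor) -> Counter:
    """Count words across all lines, using one worker per chunk.

    executor_cls is a hypothetical knob for this sketch: the default uses
    separate processes (one per core); any concurrent.futures executor
    class with the same interface can be passed instead.
    """
    chunks = split_into_chunks(lines, workers)
    total = Counter()
    with executor_cls(max_workers=workers) as pool:
        # Reduce step: merge each worker's partial counts into one result.
        for partial in pool.map(count_words, chunks):
            total.update(partial)
    return total
```

When using the default process pool from a script, the call to `parallel_word_count(lines)` should sit under an `if __name__ == "__main__":` guard, since some platforms re-import the main module in each worker process.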
How was the initial setup?
The initial setup to get it to Hello World is pretty easy; you just have to install it. But when you want to extract data from HDFS and other sources, it gets tricky because you have to connect to those sources. You can get a lot of help from different sources on the internet, though, so it's great. A lot of people are doing it.
I work with a startup company. In a startup you do not have the luxury of different people doing different things. You have to do everything on your own, and it's an opportunity to learn everything. In a typical corporate or big organization you have restrictive SOPs and have to work within those boundaries. In my organization, I have to set everything up, configure it, and work on it myself.
What's my experience with pricing, setup cost, and licensing?
I would suggest not trying to do everything at once. Identify the area where you want to solve a problem, start small, and expand your vision incrementally. For example, if I have a problem where I need to do streaming, I just focus on the streaming and not on the machine learning that Spark offers. It offers a lot of things, but you need to focus on one thing so that you can learn. That is what I have learned from the little experience I have with Spark. You need to focus on your objective and let the tools help you, rather than letting the tools drive the work. That is my advice.
What other advice do I have?
On a scale of 1 to 10, I'd put it at an eight.
To make it a perfect 10, I'd like to see an improved configuration bot. Sometimes it is a nightmare on Linux trying to figure out what happened in the configuration and on the back end. So I think installation and configuration with some other tools could be improved. We are technical people, so we can figure it out, but if aspects like that were improved, then less technical people would use it and it would be more adaptable to the end user.