We are working with a client that has a wide variety of data residing in other structured databases, as well. The idea is to make a database in Hadoop first, which we are in the process of building right now. One place for all kinds of data. Then we are going to use Spark.
Pros and Cons
I feel the streaming is its best feature.
When you want to extract data from your HDFS and other sources then it is kind of tricky because you have to connect with those sources.
What other advice do I have?
On a scale of 1 to 10, I'd put it at an eight. To make it a perfect 10 I'd like to see an improved configuration bot. Sometimes it is a nightmare on Linux trying to figure out what happened on the configuration and back-end. So I think installation and configuration with some other tools. We are technical people, we could figure it out, but if aspects like that were improved then other people who are less technical would use it and it would be more adaptable to the end-user.