What is most valuable?
With Spark SQL, we can now analyse very large quantities of data stored in Amazon S3 at a very low cost compared with the other solutions we checked.
We also use our own Spark cluster to aggregate data in near real time and save the results to a MySQL database.
We've started new projects using Spark's machine learning library, ML.
How has it helped my organization?
Before Spark, we didn't have the ability to analyse this quantity of data; we're talking about two TB/hour. We're now able to produce a lot of reports, and we can also develop machine learning based analyses to optimize our business.
We have central access to every piece of data in the company, including finance, business, debug data, etc., and the ability to join all this data together.
What needs improvement?
Spark is very good for batch analysis, much better than Hadoop: it's much simpler, much quicker, etc. However, it lacks the ability to perform real-time querying the way Vertica or Redshift can.
Also, it is more difficult for an end user to work with Spark than with a normal database, even compared with an analytic database like Vertica or Redshift.
For how long have I used the solution?
We've been using Spark Streaming and Spark SQL for almost two years.
What was my experience with deployment of the solution?
We're working on AWS, so we need a managed environment. We chose a solution based on Chef to deploy and configure the Spark clusters. Tip: if you don't have any DevOps staff, you can use the EC2 script (provided with the Spark distribution) to deploy a cluster on Amazon. We've tested it and it works perfectly.
What do I think about the stability of the solution?
Spark Streaming is difficult to stabilize, as you're always dependent on your stream flow. If the consumer starts to fall behind, you have a serious problem. We encountered a lot of stability issues while configuring it to behave as expected.
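To illustrate why a late consumer is a serious problem, here is a minimal sketch in plain Python (not Spark code). It assumes batches arrive at a fixed interval and are processed one at a time; the function name and the numbers are made up for the example, not taken from our cluster.

```python
def scheduling_delays(batch_interval_s, processing_time_s, num_batches):
    """Return the delay (in seconds) each batch waits before it starts.

    Simplified model (an assumption): one batch is processed at a time,
    and every batch takes the same processing_time_s.
    """
    delays = []
    delay = 0.0
    for _ in range(num_batches):
        delays.append(delay)
        # The next batch arrives batch_interval_s later, but the consumer is
        # only free after the current delay plus the processing time.
        delay = max(0.0, delay + processing_time_s - batch_interval_s)
    return delays


# Processing is slower than the interval: the delay grows without bound.
print(scheduling_delays(10, 12, 5))
# Processing is faster than the interval: the job stays stable.
print(scheduling_delays(10, 8, 5))
```

As soon as the processing time exceeds the batch interval, every batch adds to the backlog, and the job never catches up unless the input slows down or the cluster grows.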
What do I think about the scalability of the solution?
It's linked to stability. In our case, it takes time to evaluate the correct size of the cluster you need. It's very important to always add monitoring to your jobs so you can understand what the problem is. We use Datadog as our monitoring platform.
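As a starting point for sizing, a back-of-the-envelope calculation can help before you begin measuring. The sketch below is only an assumption-laden estimate: the per-core throughput and the headroom factor are hypothetical placeholders that you have to measure on your own jobs, and the function name is invented for this example.

```python
import math


def cores_needed(tb_per_hour, mb_per_s_per_core, headroom=1.5):
    """Estimate how many cores are needed to keep up with an input rate.

    tb_per_hour: input volume in decimal terabytes per hour.
    mb_per_s_per_core: measured throughput of one core (hypothetical here).
    headroom: safety factor for peaks and retries (an assumption).
    """
    mb_per_s = tb_per_hour * 1_000_000 / 3600  # 1 TB = 1,000,000 MB
    return math.ceil(mb_per_s * headroom / mb_per_s_per_core)


# 2 TB/hour is roughly 556 MB/s. With a placeholder of 20 MB/s per core
# and 50% headroom, the estimate is:
print(cores_needed(2.0, 20.0))
```

The result is only a first guess; the monitoring data tells you whether the real number is higher or lower.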
Which solution did I use previously and why did I switch?
Yes. To do this job, we used a MySQL database. We switched because MySQL is not a scalable solution and we reached its limits.
How was the initial setup?
Setting up a Spark cluster can be difficult; it depends on your clustering strategy. There are at least four options:
EC2 script: works only on Amazon AWS
Standalone: manual configuration (hard)
YARN: leverages your existing Hadoop environment
Mesos: for use alongside your other Mesos-ready applications
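Whichever option you choose, your jobs mostly see the difference through the master URL they are submitted to. The helper below is a small illustrative sketch, not part of Spark: the function name and the host names are hypothetical, while the URL formats and the default ports (7077 for standalone, 5050 for Mesos) are the usual Spark conventions. The EC2 script launches a standalone cluster, so it uses the same URL form as the standalone option.

```python
def master_url(mode, host=None, port=None):
    """Build the Spark master URL for a given cluster strategy.

    This is a hypothetical helper for illustration; Spark itself only
    needs the resulting string (for example via spark-submit --master).
    """
    if mode in ("standalone", "ec2"):
        # The EC2 script deploys a standalone cluster.
        return "spark://{}:{}".format(host, port or 7077)
    if mode == "yarn":
        # The cluster location comes from the Hadoop configuration.
        return "yarn"
    if mode == "mesos":
        return "mesos://{}:{}".format(host, port or 5050)
    raise ValueError("unknown cluster mode: {}".format(mode))


print(master_url("standalone", "master.example.com"))
print(master_url("yarn"))
print(master_url("mesos", "mesos.example.com"))
```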
What about the implementation team?
We use Databricks for online ad hoc querying. It works on AWS as a managed service: it manages cluster creation, configuration, and monitoring for you.
It provides a notebook-oriented user interface to query any data source using Spark: databases, Parquet, CSV, Avro, etc.
Which other solutions did I evaluate?
Yes. We started to evaluate analytics databases (Vertica, Exasol, and others). For all of them, the price was an issue given the quantity of data we want to manipulate.