With Spark SQL, we can now analyse very large quantities of data stored in Amazon S3 at a very low cost compared with the other solutions we evaluated.
We also use our own Spark cluster to aggregate data in near real time and save the results to a MySQL database.
We've started new projects using the machine learning library ML.
Improvements to My Organization:
Before Spark, we didn't have the ability to analyse this quantity of data; we're talking about two TB/hour. We're now able to produce a lot of reports, and we can also develop machine-learning-based analyses to optimize our business.
We have central access to every piece of data in the company, including finance, business, debug, etc., and the ability to join all this data together.
Room for Improvement:
Spark is very good for batch analysis, much better than Hadoop: it's much simpler, much quicker, etc. However, it lacks the ability to perform real-time querying the way Vertica or Redshift can.
Also, it is more difficult for an end user to work with Spark than with a normal database, even compared with analytic databases like Vertica or Redshift.
Use of Solution:
We've been using Spark Streaming and Spark SQL for almost two years.
We work on AWS, so we need a managed environment. We chose a solution based on Chef to deploy and configure the Spark clusters. Tip: if you don't have any DevOps resources, you can use the EC2 script (provided with the Spark distribution) to deploy a cluster on Amazon. We've tested it and it works perfectly.
Spark Streaming is difficult to stabilize, as you're always dependent on your stream flow. If the consumer starts to fall behind, you have a serious problem. We encountered a lot of stability issues getting it configured as expected.
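To illustrate why a late consumer is such a problem, here is a minimal sketch in plain Python (not Spark code). It assumes batches arrive at a fixed interval and are processed one at a time, in order; the function name and the numbers are hypothetical. Once the processing time exceeds the batch interval, the wait before each batch starts keeps growing instead of recovering.

```python
def scheduling_delays(batch_interval_s, processing_times_s):
    """Return how long (in seconds) each batch waits before it starts.

    Assumption: batch i becomes available at (i + 1) * batch_interval_s,
    and batches are processed sequentially, in arrival order.
    """
    delays = []
    free_at = 0  # time at which the previous batch finishes
    for i, processing_time in enumerate(processing_times_s):
        ready_at = (i + 1) * batch_interval_s
        start_at = max(ready_at, free_at)
        delays.append(start_at - ready_at)
        free_at = start_at + processing_time
    return delays


# Keeping up: every 10 s batch is processed in 8 s, so nothing waits.
stable = scheduling_delays(10, [8, 8, 8, 8])

# Falling behind: every 10 s batch takes 12 s, so the delay grows batch after batch.
late = scheduling_delays(10, [12, 12, 12, 12])

print(stable)
print(late)
```

In the second case the delay grows by two seconds per batch and never shrinks unless processing gets faster or the input rate drops.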
It's linked to stability: in our case, it takes time to evaluate the correct size of the cluster you need. It's very important to always add monitoring to your jobs so you can understand what the problem is. We use Datadog as our monitoring platform.
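As a rough starting point for sizing, the arithmetic can be sketched as below. This is only a back-of-the-envelope estimate, not how we sized our clusters: the per-core throughput (10 MB/s) and the headroom factor (1.5) are made-up assumptions that you would need to replace with numbers measured on your own jobs, and TB is taken as 10^12 bytes.

```python
import math


def required_cores(tb_per_hour, mb_per_core_per_s, headroom=1.5):
    """Estimate ingest rate (MB/s) and the number of cores needed to keep up.

    mb_per_core_per_s and headroom are assumptions to be measured, not known values.
    """
    bytes_per_s = tb_per_hour * 1e12 / 3600  # decimal terabytes per hour to bytes per second
    mb_per_s = bytes_per_s / 1e6
    cores = math.ceil(mb_per_s * headroom / mb_per_core_per_s)
    return mb_per_s, cores


# Two TB/hour, with a hypothetical 10 MB/s handled per core.
ingest_mb_s, cores = required_cores(2, mb_per_core_per_s=10)
print(round(ingest_mb_s, 1), cores)
```

The estimate only gives a first cluster size to try; monitoring the real jobs is what tells you whether it is correct.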
Yes, we used a MySQL database for this job. We switched because MySQL is not a scalable solution and we reached its limits.
Setting up a Spark cluster can be difficult; it depends on your clustering strategy. There are at least four options:
EC2 script: works only on Amazon AWS
Standalone: manual configuration (hard)
YARN: leverages your existing Hadoop environment
Mesos: for use alongside your other Mesos-ready applications
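Whichever option you pick, the application is pointed at the cluster through a master URL. The small helper below is a sketch of that mapping; the function name and the host names are hypothetical. It assumes the default ports (7077 for a standalone master, 5050 for Mesos) and that the EC2 script launches a standalone cluster. Note that on YARN the master is simply "yarn" from Spark 2.0 onward, while older releases used "yarn-client" or "yarn-cluster".

```python
def master_url(mode, host=None, port=None):
    """Build the Spark master URL for a given deployment option (illustrative helper)."""
    if mode == "yarn":
        # The resource manager address comes from the Hadoop configuration, not the URL.
        return "yarn"
    if mode in ("standalone", "ec2"):
        # Assumption: the EC2 script sets up a standalone-mode cluster.
        return f"spark://{host}:{port or 7077}"
    if mode == "mesos":
        return f"mesos://{host}:{port or 5050}"
    raise ValueError(f"unknown deployment mode: {mode}")


# Hypothetical host names, for illustration only.
standalone = master_url("standalone", host="spark-master.example.com")
mesos = master_url("mesos", host="mesos-master.example.com")
yarn = master_url("yarn")
print(standalone, mesos, yarn)
```

The resulting string is what you would pass to spark-submit with the --master option.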
We use Databricks as an online database for ad hoc queries. It runs on AWS as a managed service and handles cluster creation, configuration, and monitoring for you.
It provides a notebook-oriented user interface to query any data source using Spark: databases, Parquet, CSV, Avro, etc.
Other Solutions Considered:
Yes, we started by evaluating analytics databases (Vertica, Exasol, and others). For all of them, price was an issue given the quantity of data we want to manipulate.
Disclosure: I am a real user, and this review is based on my own experience and opinions.
Mar 30 2016