In one corner we have Hadoop, a massively distributed JVM-based data
processing engine with a Map & Reduce API and a proven track record
in handling huge data sets. In the other corner we have SSIS, a natively
non-distributed ETL engine that is part of the SQL Server family of tools,
with .NET code extensibility and a drag-and-drop UI (for the most
part anyway). Two sweet technologies that probably shouldn’t be compared to
each other, but we’re doing it anyway: pitted head to head on a data
mapping task, to the death (or at least to the recycling of my test VMs)… Now FIGHT!
Recently I have been tasked with building a data processing layer tracking social signals with the following characteristics:
- Input data is flat files. Although the initial amount of data might
not be classified as “Big Data” per se, it certainly had the potential
to grow very quickly. The files were very small JSON documents
(1 KB on average).
- Output data is flat files: delimited files that will be queried through a Hive warehouse layer.
- Data is only mapped, not reduced. This means data is extracted from
the flat files and processed but never aggregated. In any case, SSIS
cannot reduce (or aggregate) data in a scale-out architecture without
a custom intermediary layer (such as temporarily placing the data in a
database).
- Data latency into Hive is of paramount importance.
Both technologies are capable of iterating through a large number of
flat files, extracting information and building an output, and when we
take the Reduce operation out of the equation, we level
the playing field and now both technologies can be scaled out, albeit
Hadoop in a perhaps more friendly manner.
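To make the "map only" shape of the task concrete, here is a minimal sketch of the per-file work in Python. It is neither the SSIS package nor the Hadoop job used in the test. The field names in `FIELDS` are hypothetical, since the real keys in the social-signal files are not given here, and the tab delimiter is likewise an assumption.

```python
import csv
import io
import json

# Hypothetical field names; the real keys in the input files are not given.
FIELDS = ["id", "user", "signal", "timestamp"]


def map_record(raw_json, fields=FIELDS):
    """Extract the chosen fields from one JSON document as a flat row."""
    doc = json.loads(raw_json)
    # Missing keys become empty strings so every row has the same width.
    return [str(doc.get(name, "")) for name in fields]


def map_documents(raw_docs, delimiter="\t"):
    """Map each JSON document to one delimited line; nothing is aggregated."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    for raw in raw_docs:
        writer.writerow(map_record(raw))
    return out.getvalue()


if __name__ == "__main__":
    sample = [
        '{"id": 1, "user": "alice", "signal": "like", "timestamp": "2013-01-01"}',
        '{"id": 2, "user": "bob", "signal": "share"}',
    ]
    print(map_documents(sample), end="")
```

Because each output row depends on a single input document, the work can be split across machines without any shuffle or aggregation step, which is what puts the two engines on a comparable footing here.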
Although these technologies have a wider application and usage that
they might be better suited to, in this experiment I was only interested
in performance figures on this basic task.
To test these technologies against the mapping task, I built two test
setups: one machine for SSIS, with SQL Server to support the SSIS
Catalogue database, and a simple 3-node Hadoop cluster. The technical
specification for each is as follows:
| | Integration Services (SSIS) | Hadoop |
|---|---|---|
| CPU | 4 cores / node | 2 cores / node |
| RAM | 8 GB / node | 3 GB / node |
| Operating system | Windows Server 2012 | |
| Software | SQL Server 2012 | Cloudera CDH 4 |
The specifications for the two test setups are slightly different,
which makes the comparison fairly “unscientific”. Even so, the
overall processing resources available to each should be fairly
comparable, with the Hadoop cluster gaining a slight edge in total
CPU cores and RAM. Besides, we are only looking for a really
considerable difference in the results, large enough to favour one
technology over the other for this business requirement.
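The "slight edge" can be checked with a little arithmetic. The sketch below assumes the per-node figures listed above belong to a single SSIS machine (4 cores, 8 GB) and to three Hadoop nodes (2 cores and 3 GB each); that assignment is my reading of the specification rather than something stated outright.

```python
# Per-node figures as read from the specification above (assumed assignment):
# one SSIS machine, and three Hadoop nodes.
setups = {
    "SSIS": {"nodes": 1, "cores_per_node": 4, "ram_gb_per_node": 8},
    "Hadoop": {"nodes": 3, "cores_per_node": 2, "ram_gb_per_node": 3},
}


def totals(spec):
    """Return (total cores, total RAM in GB) for one test setup."""
    cores = spec["nodes"] * spec["cores_per_node"]
    ram_gb = spec["nodes"] * spec["ram_gb_per_node"]
    return cores, ram_gb


for name, spec in setups.items():
    cores, ram_gb = totals(spec)
    print(f"{name}: {cores} cores, {ram_gb} GB RAM")
```

Under those assumptions the cluster has 6 cores and 9 GB in total against 4 cores and 8 GB for the SSIS machine.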
I ran two test scenarios:
Scenario 1: 33,000 small (1 KB) JSON input files, each containing about 5 to 10 values to extract against a key (mapping).
Scenario 2: 33 input files (every 1,000 files from Scenario 1 concatenated into one).
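For illustration, the concatenation behind Scenario 2 could look something like the sketch below. It is not the script used in the test. The function name, the `batch_NNNNN.jsonl` output naming and the one-document-per-line layout are all assumptions made for the example.

```python
import json
from pathlib import Path


def concatenate_in_batches(input_dir, output_dir, batch_size=1000):
    """Merge small JSON files into larger files of `batch_size` documents each."""
    sources = sorted(Path(input_dir).glob("*.json"))
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for start in range(0, len(sources), batch_size):
        # Hypothetical naming scheme: batch_00000.jsonl, batch_00001.jsonl, ...
        target = out_dir / f"batch_{start // batch_size:05d}.jsonl"
        with target.open("w", encoding="utf-8") as out:
            for src in sources[start:start + batch_size]:
                # Re-serialise so each document occupies exactly one line.
                doc = json.loads(src.read_text(encoding="utf-8"))
                out.write(json.dumps(doc) + "\n")
        written.append(target)
    return written
```

With 33,000 input files and the default `batch_size` of 1,000, this would produce 33 output files, matching the shape of Scenario 2.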
The results of the test were as follows:
[Results table: Scenario 1 (33,000 files) and Scenario 2 (33 files)]
As the results show, a single SSIS instance processed the flat files up to 66X faster than the same job running on the Hadoop cluster.
Learnings from SSIS vs Hadoop Test
There are a few key learnings from this experiment:
- Hadoop has a terrible start-up time when operating on a file: the processing engine could take up to 5 seconds before it actually started processing the file, whereas SSIS takes less than 0.2 seconds. Java has never been a very agile language in my opinion.
- Hadoop is not intended to handle a large number of small files;
instead, try combining smaller files into bigger concatenations.
Sometimes it is considerably faster to add a pre-processing step
that concatenates the files into a smaller number of larger batches.
- Although the number of “Reducers” for a Hadoop job can be easily controlled, it is more difficult to control how many “Mappers” are available for a job across the cluster, and Hadoop does not always adhere to the user-set number of Mappers.
- Although SSIS outperforms Hadoop by an average of 50X on this simple
task, Hadoop scales in a much more user-friendly manner and allows
users to “Reduce”, or aggregate, the data across all nodes for a
particular job, a feature that is not supported by out-of-the-box SSIS.
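A back-of-the-envelope calculation shows why per-file start-up time dominates when the files are this small. The sketch below uses only the figures quoted in the learnings above ("up to 5 seconds" for Hadoop, "less than 0.2 of a second" for SSIS) and assumes, purely for illustration, that files are started one after another with no parallelism. It gives a rough upper bound, not a measured result.

```python
def startup_overhead_seconds(file_count, startup_ms_per_file):
    """Time spent only on starting work on each file, handled one at a time."""
    return file_count * startup_ms_per_file / 1000


HADOOP_STARTUP_MS = 5000  # "up to 5 seconds" per file (upper bound)
SSIS_STARTUP_MS = 200     # "less than 0.2 of a second" per file (upper bound)

for label, files in (("Scenario 1", 33_000), ("Scenario 2", 33)):
    hadoop = startup_overhead_seconds(files, HADOOP_STARTUP_MS)
    ssis = startup_overhead_seconds(files, SSIS_STARTUP_MS)
    print(f"{label}: Hadoop up to {hadoop:,.1f} s, SSIS up to {ssis:,.1f} s")
```

On these assumptions, concatenating 33,000 files into 33 cuts the worst-case start-up overhead by a factor of 1,000 for either engine, which is why the pre-processing step can pay for itself.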
Don’t just jump on new technologies: you need to test them and ensure
they are suitable for your particular business requirement. Hadoop is
a great distributed processing engine when used in the correct context.
It is too easy these days for managers and BI people to bandy around the
term “Hadoop” for everything “Big Data”, from data processing to
warehousing, but you need to take the time to separate the wheat from
the chaff.
HDInsight (Microsoft’s Hadoop distribution, which
runs on Windows and Azure) was another technology that we were
investigating at the time, although its performance was so poor
that it was eliminated from the race fairly quickly.