Apache Hadoop Review

Relatively fast when reading data into other platforms but can't handle queries with insufficient memory


What is most valuable?

Impala. Compared to Hive on MapReduce, Impala on MPP returns SQL query results in a fairly short amount of time, and it is relatively fast when reading data into other platforms like R (for further data analysis) or QlikView (for data visualisation).

How has it helped my organization?

Quick access to data enabled more frequent data-backed decisions.

What needs improvement?

The key shortcoming is its inability to handle queries when there is insufficient memory. This limitation can be bypassed by processing the data in chunks.
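
As a rough illustration of what "processing the data in chunks" can look like, here is a minimal sketch that splits one large aggregate into several range-bounded queries and adds up the partial results. It is only a sketch under assumptions: the table `events`, the columns `day` and `amount`, and the helpers `make_bounds` and `chunked_sum` are all hypothetical, and an in-memory SQLite database stands in for the cluster so the example runs anywhere. A real Impala client would use a different connection object and possibly a different parameter placeholder style.

```python
import sqlite3


def make_bounds(start, stop, step):
    """Split the half-open key range [start, stop) into (lo, hi) chunks."""
    return [(lo, min(lo + step, stop)) for lo in range(start, stop, step)]


def chunked_sum(conn, table, key_col, value_col, chunk_bounds):
    """Sum value_col by running one bounded query per key range.

    Each query only touches rows with lo <= key < hi, so the engine
    works on a slice of the table rather than the whole table at once.
    Table and column names are interpolated directly, so they must come
    from trusted code, never from user input.
    """
    total = 0
    cur = conn.cursor()
    for lo, hi in chunk_bounds:
        cur.execute(
            f"SELECT COALESCE(SUM({value_col}), 0) FROM {table} "
            f"WHERE {key_col} >= ? AND {key_col} < ?",
            (lo, hi),
        )
        total += cur.fetchone()[0]
    return total


# Stand-in data: ten "days", each with one amount (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (day INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(d, d * 10) for d in range(1, 11)],
)

# Process days 1..10 in chunks of three days each.
result = chunked_sum(conn, "events", "day", "amount", make_bounds(1, 11, 3))
print(result)
```

This only works directly for aggregates that can be combined from partial results (sums, counts, minimums, maximums). Something like a median or a distinct count would need a different approach.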

For how long have I used the solution?

Two-plus years.

What do I think about the stability of the solution?

Instability typically stems from insufficient memory, caused either by a single large job being triggered or by multiple concurrent small requests.

What do I think about the scalability of the solution?

No scalability issues. This is a cluster-based setup by default, so scaling is just a matter of adding new data nodes.

How is customer service and technical support?

Not applicable to Cloudera. We have a separate onsite vendor to manage the cluster.

Which solutions did we use previously?

None. Two years ago this was a new team, so there were no legacy systems to speak of.

How was the initial setup?

Complex. The Cloudera stack by itself was insufficient. Integration with other tools like R and QlikView was required, and in-house programs had to be built to create an automated data pipeline.

What's my experience with pricing, setup cost, and licensing?

Not much advice, as pricing and licensing are handled at an enterprise level.

However, do take into consideration that data storage and compute capacity scale differently, and hence purchasing a "boxed" / "all-in-one" solution (software and hardware) might not be the best idea.

Which other solutions did I evaluate?

Oracle Exadata and Teradata.

What other advice do I have?

Try open-source Hadoop first, but be aware of the greater implementation complexity. If open-source Hadoop is "too" complex, then consider a vendor-packaged Hadoop solution like Hortonworks, Cloudera, etc.

Disclosure: I am a real user, and this review is based on my own experience and opinions.