Apache Hadoop Review

Parallel processing allows us to get jobs done, but the platform needs more direct integration of visualization applications

What is our primary use case?

We use it as a data lake for streaming analytical dashboards.

How has it helped my organization?

It has made a big difference. I think the best case is that we are able to drill down to transactional records and build a root-cause analysis, on demand, for various issues that might arise. Because we're able to process in parallel, we don't have to wait for the big data warehouse engine. We break the data down, then build it back up to an answer, and we can have that answer in an hour rather than 10 hours.

What is most valuable?

  • Scalability
  • Parallel processing

There are jobs that cannot be done unless you have massively parallel processing; for instance, processing call-detail records for telecom.
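As a rough illustration of the kind of job described above, here is a minimal single-machine sketch of the map-and-reduce pattern applied to call-detail records. It is not Hadoop code: the record layout, the sample data, and every function name are assumptions made for illustration, and threads stand in for cluster nodes.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Hypothetical call-detail records: (caller_id, duration_seconds).
RECORDS = [
    ("alice", 120),
    ("bob", 45),
    ("alice", 30),
    ("carol", 300),
    ("bob", 15),
    ("alice", 60),
]


def split(records, n_parts):
    """Partition the records into n_parts roughly equal chunks."""
    return [records[i::n_parts] for i in range(n_parts)]


def map_chunk(chunk):
    """Map step: total call duration per caller within one chunk."""
    totals = Counter()
    for caller, seconds in chunk:
        totals[caller] += seconds
    return totals


def reduce_totals(left, right):
    """Reduce step: merge two partial results into a new Counter."""
    merged = Counter(left)
    merged.update(right)
    return merged


def total_duration_by_caller(records, n_parts=3):
    """Process each chunk independently, then combine the partial totals."""
    chunks = split(records, n_parts)
    # Threads stand in for cluster nodes in this sketch.
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return dict(reduce(reduce_totals, partials, Counter()))


totals = total_duration_by_caller(RECORDS)
print(totals)
```

The point of the pattern is that each chunk can be processed without knowing about the others, so adding workers shortens the map step; only the final merge has to see all the partial results.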

What needs improvement?

In general, Hadoop has a lot of different component parts to the platform - things like Hive and HBase - and they're all moving somewhat independently and somewhat in parallel. I think as you look to platforms in the cloud or to walled-garden concepts, like Cloudera or Azure, you see that the third party can make sure all the components work together before they are used for business purposes. That removes a layer of administration, configuration, and technical support.

I would like to see more direct integration of visualization applications.

For how long have I used the solution?

More than five years.

What do I think about the stability of the solution?

In general, stability can be a challenge. It's hard to say what stability means. You're in an environment that's like manufacturing before the production line, where none of the parts fit together exactly as they should. So that can create some instability.

To realize the benefit of these kinds of open-source, big-data environments, you want to use as many different tools as you can get. That brings with it all this overhead of making them work together. It's kind of a blessing and a curse, at the same time: There's a tool for everything.

How are customer service and technical support?

Apache is the open-source foundation that Cloudera and Hortonworks contribute code and some work to. I don't know that there is actually support and structure, per se, for Apache.

We have had premium support at various times with various companies. From the three dominant companies I've worked with - Cloudera, Hortonworks, and MapR - there is a premium support package, but that still only covers their base distribution, not necessarily all the add-ons that sit on top of it. That is really the big challenge: getting everything to work together.

Which solution did I use previously and why did I switch?

There are the older relational database technologies: Netezza, SQL Server, MySQL, Oracle, Teradata. All have some advantages and some disadvantages. Most notably, they are all significantly more expensive in terms of capital expense, rather than operational expense. They are "walled gardens," so to speak: curated, with a distinct set of tools that work with them, and without the bleeding-edge ingenuity that comes with an open-source platform.

Data warehousing is 30 years old, at least. Big data, in its current form, has only been around for four or five years.

How was the initial setup?

I have been responsible, in various capacities, for setup, administration, and building the applications on those environments. Each of the components is relatively straightforward. The complexity comes from making all the different components work together.

What other advice do I have?

Implement for defined use cases. Don't expect it to all just work very easily.

I would rate this platform a seven out of 10. On the one hand, it's the only place you can use certain functions, and on the other hand, it's not going to put any of the other ones out of business. It's really more of a complement. There is no fundamental battle between relational databases and Hadoop.

Disclosure: I am a real user, and this review is based on my own experience and opinions.