Apache Hadoop Review

The Distributed File System stores video, pictures, JSON, XML, and plain text all in the same file system.

What is most valuable?

The Distributed File System, which is the base of Hadoop, has been the most valuable feature with its ability to store video, pictures, JSON, XML, and plain text all in the same file system.

How has it helped my organization?

We use the Hadoop platform internally, mostly for R&D purposes. However, many of the recent projects that our IT consulting firm has taken on have deployed Hadoop to store high-velocity data that varies widely in size and structure, and to process that data together quickly and efficiently.

What needs improvement?

By default, Hadoop stores data with 3x redundancy, and our organization has come to the conclusion that this default results in too much wasted disk space. The user can change the replication factor, but I believe that the Hadoop platform could eventually become more efficient in its redundant data replication. It is an organizational preference and nothing that would keep our organization from using it again, just a small thing I think could be improved.
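To put a number on the disk-space concern, here is a minimal Python sketch of the replication arithmetic. The function names are my own and purely illustrative; it only assumes that every block is stored `replication` times, which is what the replication factor (the `dfs.replication` setting in HDFS, 3 by default) controls.

```python
def raw_tb_needed(logical_tb: float, replication: int = 3) -> float:
    """Raw disk (TB) consumed to hold `logical_tb` of data at a given replication factor."""
    return logical_tb * replication


def usable_tb(raw_tb: float, replication: int = 3) -> float:
    """Logical data (TB) that fits on `raw_tb` of raw disk at a given replication factor."""
    return raw_tb / replication


def overhead_fraction(replication: int = 3) -> float:
    """Share of raw disk spent on redundant copies rather than the first copy."""
    return (replication - 1) / replication


if __name__ == "__main__":
    # Hypothetical figures for illustration only.
    print(raw_tb_needed(10))        # raw TB needed for 10 TB of data at 3x
    print(usable_tb(30, 2))         # logical TB that fit on 30 TB raw at 2x
    print(overhead_fraction(3))     # fraction of disk holding extra copies at 3x
```

At the default factor of 3, two thirds of the raw disk holds redundant copies; dropping to 2 lowers that to one half, at the cost of tolerating fewer simultaneous disk or node failures.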

For how long have I used the solution?

This version was released in January 2016, but I have been working with the Apache Hadoop platform for a few years now.

What was my experience with deployment of the solution?

The only issues we found during deployment were errors originating from between the keyboard and the chair. I have set up roughly 20 Hadoop clusters and almost all of them went off without a hitch, unless I configured something incorrectly during pre-setup.

What do I think about the stability of the solution?

We have not encountered any stability problems with this platform.

What do I think about the scalability of the solution?

We have scaled two of the clusters that we have implemented; one in the cloud, one on-premise. Neither ran into any problems, but I can say with certainty that it is much, much easier to scale in a cloud environment than it is on-premise.

How are customer service and technical support?

Customer Service:

Apache Hadoop is open source, so customer service is not really a strong point, but the documentation provided is extremely helpful, more so than that of some of the Hadoop vendors such as MapR, Cloudera, or Hortonworks.

Technical Support:

Again, it's open source. There are no dedicated tech support teams that we've come across unless you look to vendors such as Hortonworks, Cloudera, or MapR.

Which solution did I use previously and why did I switch?

We started off using Apache Hadoop for our initial Big Data initiative and have stuck with it since.

How was the initial setup?

Initial setup was fairly straightforward, especially when using Apache Ambari as a provisioning tool. (I highly recommend Ambari.)

What about the implementation team?

We are the implementers.

What's my experience with pricing, setup cost, and licensing?

It's open source.

Which other solutions did I evaluate?

We solely looked at Hadoop.

What other advice do I have?

Try, try, and try again. Experiment with MapReduce and YARN. Fine-tune your processes and you will see some insane processing power.

I would also recommend that you have at least a 12-node cluster: two master nodes, eight compute/data nodes, one Hive node (SQL), and one dedicated Ambari node.

For the master nodes, I would recommend 4-8 cores, 32-64 GB RAM, and 8-10 TB HDD; for the data nodes, 4-8 cores, 64 GB RAM, and 16-20 TB RAID 10 HDD; the Hive node should be around 4 cores, 32-64 GB RAM, and 5-6 TB RAID 0 HDD; and the dedicated Ambari server should be 2-4 cores, 8-12 GB RAM, and 1-2 TB HDD storage.
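As a rough sanity check on that layout, the recommendation can be written down and totaled in a few lines of Python. This is only a sketch: the `NodeSpec` class and helper names are made up for illustration, and it assumes the 16-20 TB per data node is the capacity available to HDFS after RAID 10, which the figures above do not spell out.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class NodeSpec:
    role: str
    count: int
    cores: Tuple[int, int]    # (min, max) cores per node
    ram_gb: Tuple[int, int]   # (min, max) GB RAM per node
    disk_tb: Tuple[int, int]  # (min, max) TB disk per node


# The recommended 12-node layout, transcribed from the review text.
CLUSTER: List[NodeSpec] = [
    NodeSpec("master", 2, (4, 8), (32, 64), (8, 10)),
    NodeSpec("data", 8, (4, 8), (64, 64), (16, 20)),
    NodeSpec("hive", 1, (4, 4), (32, 64), (5, 6)),
    NodeSpec("ambari", 1, (2, 4), (8, 12), (1, 2)),
]


def total_nodes(cluster: List[NodeSpec]) -> int:
    return sum(n.count for n in cluster)


def total_range(cluster: List[NodeSpec], field: str) -> Tuple[int, int]:
    """Sum a (min, max) per-node field across all nodes in the cluster."""
    low = sum(n.count * getattr(n, field)[0] for n in cluster)
    high = sum(n.count * getattr(n, field)[1] for n in cluster)
    return low, high


def hdfs_capacity_tb(cluster: List[NodeSpec], replication: int = 3) -> Tuple[float, float]:
    """Logical HDFS capacity: data-node disk only, divided by the replication factor."""
    data_nodes = [n for n in cluster if n.role == "data"]
    low, high = total_range(data_nodes, "disk_tb")
    return low / replication, high / replication


if __name__ == "__main__":
    print("nodes:", total_nodes(CLUSTER))
    print("cores:", total_range(CLUSTER, "cores"))
    print("RAM (GB):", total_range(CLUSTER, "ram_gb"))
    print("HDFS capacity (TB) at 3x:", hdfs_capacity_tb(CLUSTER))
```

Under those assumptions the eight data nodes provide 128-160 TB of disk, which at the default 3x replication leaves roughly 43-53 TB for actual data.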

Disclosure: I am a real user, and this review is based on my own experience and opinions.
Sarit Seal

Great job. Did you guys check out Cloudera, Hortonworks, or any of the other commercial distributions, or did you just download Hadoop and get started?
All the clients I have worked with align with a distribution. I agree with your comment on stability. I have set up YARN myself for my pet Spark projects and have never faced a problem that I could not resolve.

Colt Rodgers

We have since partnered with Hortonworks and are currently researching the Cloudera and MapR spaces as well. Though our strong suit is Hortonworks, we do have a good implementation team for any of the distributions.