Cloudera Distribution for Hadoop Review

Features like Hive, Pig, Impala, Flume and Spark are valuable to us.


Valuable Features

Cloudera Manager is the most valuable feature for its ease of use, its feature set, and how easy it makes upgrading and installing components. CM can also be used to set up high availability within minutes. Other features like Hive, Pig, Impala, Flume and Spark are also valuable.

Improvements to My Organization

It's improved our storage, and the availability of analytics tools such as Hive, Pig, Impala, and Spark helps us tremendously.

Room for Improvement

I'd like to see improvements to Impala. It also needs tighter integration with Spark, data warehouses, storage systems, and the cloud. Additionally, I'd want more UIs for the ecosystem's components, preferably centralized in a single gateway.

Use of Solution

I've used it for 3.5 years.

Deployment Issues

For experimental and production clusters alike, use Cloudera Manager right from the beginning. RPM installation is good for learning.

Stability Issues

It has compatibility issues if installed on specialized hardware such as EMC Isilon, or if the NodeManagers and DataNodes are not co-located. For production, draw up a detailed plan for managing a local repo for installation and upgrades. Never install from the internet for production clusters.

Customer Service and Technical Support

Most of the clusters are for experimentation and don't require support. For production clusters, implementations go through major vendors, who handle support themselves.

Initial Setup

It depends on the mode of installation. Cloudera Manager is always more straightforward and manageable. Avoid RPM installation as much as possible. Work out plans with your system admin for upgrades and for commissioning and decommissioning nodes. Investigate the consequences of running HBase and Hadoop in the same cluster versus separate clusters: what is the impact on system administration, cost, upgrades, data migration, resources, and so on?

The complexity kicks in when configuring parameters. Find out what the use cases are: is the workload disk I/O bound or compute bound, is the data mostly structured or is it unstructured data for text analytics, and so on.

Implementation Team

We use both a vendor team and an in-house team, depending on the cluster size and use cases. Some customers may require a certain number of certified personnel, which is something to think about when choosing a partner.

Other Advice

Be prepared for a fast-changing landscape in how Hadoop works under the hood and how it is used. Each major release has usually involved changes to the file system and data structures, so consider how they would impact data migration. Ask questions like whether to upgrade or create a new cluster. Plan for training and skill upgrades.

Disclosure: My company has a business relationship with this vendor other than being a customer: We're a system integration partner.