Apache Hadoop Overview

Apache Hadoop is the #3 ranked solution in our list of top Data Warehouse tools, and it is most often compared to Snowflake.

What is Apache Hadoop?
The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
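The "simple programming models" mentioned above refer chiefly to MapReduce. As an illustrative sketch (not drawn from any review below), here is a word-count mapper and reducer in the style used with Hadoop Streaming, with a small local driver standing in for the cluster's sort-and-shuffle phase:

```python
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    # Emit (word, 1) for every word, as a Hadoop Streaming mapper
    # would for each input line on stdin.
    for line in lines:
        for word in line.strip().split():
            yield word.lower(), 1

def reducer(pairs):
    # Hadoop sorts mapper output by key before the reduce phase;
    # here sorted() + groupby simulate that, then we sum the counts
    # for each distinct word.
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    sample = ["the quick brown fox", "the lazy dog", "The fox"]
    for word, count in reducer(mapper(sample)):
        print(f"{word}\t{count}")
```

On a real cluster the mapper and reducer run as separate processes on many nodes, and HDFS supplies the input splits; the logic per record is exactly this simple, which is the point of the model.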

Apache Hadoop Customers
Amazon, Adobe, eBay, Facebook, Google, Hulu, IBM, LinkedIn, Microsoft, Spotify, AOL, Twitter, University of Maryland, Yahoo!, Cornell University Web Lab
Analytics Platform Manager at a consultancy with 10,001+ employees
Real User
Parallel processing allows us to get jobs done, but the platform needs more direct integration of visualization applications

What is our primary use case?

We use it as a data lake for streaming analytical dashboards.

Pros and Cons

  • "Two valuable features are its scalability and parallel processing. There are jobs that cannot be done unless you have massively parallel processing."
  • "I would like to see more direct integration of visualization applications."

What other advice do I have?

Implement for defined use cases. Don't expect it to all just work very easily. I would rate this platform a seven out of 10. On the one hand, it's the only place you can use certain functions, and on the other hand, it's not going to put any of the other ones out of business. It's really more of a complement. There is no fundamental battle between relational databases and Hadoop.
AM
CEO
Real User
We are able to ingest huge volumes/varieties of data, but it needs a data visualization tool and enhanced Ambari for management

What is our primary use case?

Big Data analytics, customer incubation. We host our Big Data analytics "lab" on Amazon EC2. Customers are new to Big Data analytics so we do proofs of concept for them in this lab. Customers bring historical, structured data, or IoT data, or a blend of both. We ingest data from these sources into the Hadoop environment, build the analytics solution on top, and prove the value and define the roadmap for customers.

Pros and Cons

  • "Initially, with an RDBMS alone, we had a lot of work spread across a few servers running on-premise and in the cloud for PoCs and incubation. By using Hadoop with its ecosystem components and tools, and managing it on Amazon EC2, we have created a Big Data "lab" that centralizes all our work and solutions into a single repository. This has cut down the time spent on maintenance and development and, especially, on data-processing challenges."
  • "Since both Apache Hadoop and Amazon EC2 are elastic in nature, we can scale and expand on demand for a specific PoC, and scale down when it's done."
  • "Most valuable features are HDFS and Kafka: Ingestion of huge volumes and variety of unstructured/semi-structured data is feasible, and it helps us to quickly onboard a new Big Data analytics prospect."
  • "Based on our needs, we would like to see a tool for data visualization and enhanced Ambari for management, plus a pre-built IoT hub/model. These would reduce our efforts and the time needed to prove to a customer that this will help them."
  • "General installation/dependency issues were there, but were not a major, complex issue. While migrating data from MySQL to Hive, things are a little challenging, but we were able to get through that with support from forums and a little trial and error."
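The MySQL-to-Hive migration this reviewer describes is commonly done with Apache Sqoop. A hypothetical invocation might look like the following (the connection string, credentials file, and table and database names are placeholders for illustration, not details from the review):

```
sqoop import \
  --connect jdbc:mysql://mysql-host/sourcedb \
  --username etl_user --password-file /user/etl/.mysql.pwd \
  --table orders \
  --hive-import --hive-database analytics \
  --num-mappers 4
```

Sqoop runs the import as parallel map tasks (here four), which is where many of the "little challenging" issues the reviewer mentions tend to surface: split columns, type mappings, and Hive schema mismatches.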

What other advice do I have?

Our general suggestion to any customer is not to blindly look and compare different options. Rather, list the exact business needs - current and future - and then prepare a matrix to see product capabilities and evaluate costs and other compliance factors for that specific enterprise.
Software Architect at a tech services company with 10,001+ employees
Consultant
Gives us high throughput and low latency for KPI visualization

What is our primary use case?

Data aggregation for KPIs. The sources of data come in all forms so the data is unstructured. We needed high storage and aggregation of data, in the background.

Pros and Cons

  • "High throughput and low latency. We start with data mashing on Hive and finally use this for KPI visualization."
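The "data mashing on Hive" feeding KPI visualization typically boils down to SQL aggregations over raw events in HDFS. A hypothetical HiveQL query (the table and column names are invented for illustration) might compute a dashboard KPI like this:

```sql
-- Daily error-rate KPI per service, aggregated from raw events.
SELECT
  service_name,
  to_date(event_time) AS event_day,
  COUNT(*) AS total_events,
  SUM(CASE WHEN status = 'ERROR' THEN 1 ELSE 0 END) / COUNT(*) AS error_rate
FROM raw_events
GROUP BY service_name, to_date(event_time);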

What other advice do I have?

I rate it an eight out of 10. It's huge, complex, and slow, but it does what it's meant for.
Database/Middleware Consultant (Currently at U.S. Department of Labor) at a tech services company with 51-200 employees
Consultant
There are no licensing costs involved, hence money is saved on software infrastructure

What is our primary use case?

  • Content management solution
  • Unified Data solution
  • Apache Hadoop running on Linux

What is most valuable?

  • Data ingestion: It has rapid speed if Apache Accumulo is used.
  • Data security
  • Inexpensive

What needs improvement?

It needs better user interface (UI) functionality.

For how long have I used the solution?

Three to five years.

What's my experience with pricing, setup cost, and licensing?

There are no licensing costs involved, hence money is saved on the software infrastructure.
RC
Senior Associate at a financial services firm with 10,001+ employees
Real User
Relatively fast when reading data into other platforms but can't handle queries with insufficient memory

Pros and Cons

  • "As compared to Hive on MapReduce, Impala on MPP returns results of SQL queries in a fairly short amount of time, and is relatively fast when reading data into other platforms like R."
  • "The key shortcoming is its inability to handle queries when there is insufficient memory. This limitation can be bypassed by processing the data in chunks."
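The chunking workaround the reviewer describes can be sketched in pure Python (the chunk size and the sum aggregation are illustrative assumptions, not the reviewer's actual workload): instead of materializing an entire result set in memory, process it in fixed-size chunks and combine the partial results.

```python
from itertools import islice

def chunked(iterable, size):
    # Yield successive lists of at most `size` items, so only one
    # chunk is held in memory at a time.
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def chunked_sum(rows, size=100_000):
    # Combine partial aggregates instead of loading all rows at once,
    # mirroring the memory workaround described in the review.
    total = 0
    for chunk in chunked(rows, size):
        total += sum(chunk)
    return total

if __name__ == "__main__":
    print(chunked_sum(range(1_000_000), size=1000))
```

The same pattern applies when pulling query results into R or Python: fetch in batches, aggregate each batch, and merge, so peak memory is bounded by the chunk size rather than the result size.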

What other advice do I have?

Try open-source Hadoop first, but be aware of its greater implementation complexity. If open-source Hadoop proves too complex, consider a vendor-packaged Hadoop distribution such as Hortonworks or Cloudera.
Big Data Engineer at a tech vendor with 5,001-10,000 employees
Vendor
HDFS allows you to store large data sets optimally. After switching to big data pipelines, our query performance improved a hundred times.

What is most valuable?

HDFS allows you to store large data sets optimally.
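Storing and inspecting large data sets in HDFS is done through the `hdfs dfs` command-line interface. A brief sketch (the paths and file name are hypothetical, not from the review):

```
# Create a directory in HDFS and load a large local dataset into it
hdfs dfs -mkdir -p /data/events
hdfs dfs -put events-2021-05.json /data/events/

# List the files with human-readable sizes
hdfs dfs -ls -h /data/events
```

HDFS splits each file into large blocks (128 MB by default) replicated across data nodes, which is what makes storage of very large files both optimal and fault-tolerant.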

How has it helped my organization?

After switching to big data pipelines, our query performance improved a hundred times.

What needs improvement?

Rolling restarts of data nodes could be further optimized. I/O operations could also be tuned for better performance.

For how long have I used the solution?

I have used Hadoop for over three years.

What do I think about the stability of the solution?

We once had a stability issue due to a complete shutdown of a cluster. Bringing the cluster back up took a lot of time because the services had to be restarted in a specific order.

What do I think about the scalability of the solution?

We have not had scalability issues.

Infrastructure Engineer at Zirous, Inc.
Real User
Top 20
The Distributed File System stores video, pictures, JSON, XML, and plain text all in the same file system.

What other advice do I have?

Try, try, and try again. Experiment with MapReduce and YARN. Fine-tune your processes and you will see some insane processing-power results. I would also recommend at least a 12-node cluster: two master nodes, eight compute/data nodes, one Hive node (SQL), and one dedicated Ambari node. For the master nodes, I would recommend 4-8 cores, 32-64 GB RAM, and 8-10 TB HDD; for the data nodes, 4-8 cores, 64 GB RAM, and 16-20 TB RAID 10 HDD; the Hive node should be around 4 cores, 32-64 GB RAM, and 5-6 TB RAID 0 HDD; and the dedicated Ambari server should have 2-4 cores, 8-12 GB RAM, and 1-2 TB HDD storage.
Senior Hadoop Engineer with 1,001-5,000 employees
Vendor
The heart of Big Data

What other advice do I have?

First, understand your business requirements; second, evaluate your traditional RDBMS's scalability and capability; and finally, if you have reached the tip of the iceberg (RDBMS), then yes, you definitely need an island (Hadoop) for your business. A feasibility check is important and efficient for any business before taking any crucial step. I would also say: "Don't always flow with the stream of a river, because sometimes it will lead you to a waterfall; always research and analyze before you take a ride."