Apache Hadoop Overview

Apache Hadoop is the #6 ranked solution in our list of top Data Warehouse tools. It is most often compared to Microsoft Azure Synapse Analytics.

What is Apache Hadoop?
The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
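As a rough illustration of the "simple programming models" mentioned above, the sketch below mimics the map, shuffle, and reduce phases of a MapReduce-style word count in plain Python on a single machine. It does not use Hadoop's actual API, and the function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical, chosen only for this example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over an in-memory list of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data big clusters", "data on clusters"]
    print(word_count(sample))
```

In a real cluster, the map and reduce steps would run in parallel on many machines, each working on the portion of the data stored locally; here everything runs in one process purely to show the shape of the model.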
Buyer's Guide

Download the Data Warehouse Buyer's Guide including reviews and more. Updated: October 2021

Apache Hadoop Customers
Amazon, Adobe, eBay, Facebook, Google, Hulu, IBM, LinkedIn, Microsoft, Spotify, AOL, Twitter, University of Maryland, Yahoo!, Cornell University Web Lab
Apache Hadoop Reviews

JP
Vice President - Finance & IT at a consumer goods company with 1-10 employees
Real User
Great micro-partitions, helpful technical support and quite stable

Pros and Cons

  • "The solution is easy to expand. We haven't seen any issues with it in that sense. We've added 10 servers, and we've added two nodes. We've been expanding ever since we started using it, as we started out so small. Companies that need to scale shouldn't have a problem doing so."
  • "The solution needs a better tutorial. There are only documents available currently. There's a lot of YouTube videos available. However, in terms of learning, we didn't have great success trying to learn that way. There needs to be better self-paced learning."

What is our primary use case?

As an example of a use case, when I was a contractor for Cisco, we were processing mobile network data and the volume was too big. RDBMS was not supporting anything. We started using the Hadoop framework to improve the process and get the results faster.

What is most valuable?

The data is stored in micro-partitions, which makes the processes very fast compared to other RDBMS systems. Apache Spark processes data in memory, and it's much better than MapReduce.

Micro-partitions and the HDFS are both excellent features.

What needs improvement?

I'm not sure if I have any ideas as to how to improve the product.

Every year, the solution comes out with new features. Spark is one new feature, for example. If they could continue to release new helpful features, it will continue to increase the value of the solution.

The solution could always improve performance. This is a consistent requirement. Whenever you run it, there is always room for improvement in terms of performance.

The solution needs a better tutorial. There are only documents available currently. There's a lot of YouTube videos available. However, in terms of learning, we didn't have great success trying to learn that way. There needs to be better self-paced learning.

We would prefer it if users didn't just get pushed through to certification-based learning, as certifications are expensive. Maybe if they could arrange it so that the certification was at a lesser cost. The certification cost is currently around $2,500 or thereabout. 

For how long have I used the solution?

I've been using the solution for four years.

What do I think about the stability of the solution?

We haven't had too many problems with stability. For the POC we used a small amount of data and we started with 10 nodes. We're gradually increasing it now to 40 nodes. We haven't seen any issues after the small teething period in the beginning. The configuration issues and the performance issues have subsided. Once we learned how to stack everything, it has been much better.

What do I think about the scalability of the solution?

The solution is easy to expand. We haven't seen any issues with it in that sense. We've added 10 servers, and we've added two nodes. We've been expanding ever since we started using it, as we started out so small. Companies that need to scale shouldn't have a problem doing so.

We are supporting a multitenancy model and we get the data on supporting the users. I would say, per organization, we have eight to 10 users and probably have a total of around 40 users across the board.

How are customer service and technical support?

We started on the solution as a POC. Once we got into production, we had some minor issues. We get great support. They share advice and helped us tweak some things in terms of the configurations. We've been satisfied with the level of service we've been provided.

Which solution did I use previously and why did I switch?

We have only ever used Apache Hadoop, or a version of it. When we looked for the commercial tier, there was Cloudera and Hortonworks. We started with Hortonworks because at that time we felt it was cost-effective. However, Cloudera has since merged with Hortonworks, and now it's all basically the same solution.

How was the initial setup?

The initial setup was a little complex the first time around. We were new to the system, and we didn't have any expertise at that time. Once we got some support and insights into how to work everything properly, it went more smoothly.

First, we started with a POC (proof of concept). It took a couple of days to understand and configure everything. When we went to production, deployment took a couple of hours, and we put into practice everything we had learned from the POC.

There's definitely a learning curve. It's stable for us now. 

We have a team of developers doing multiple tasks on the solution, and a few of them are taking care of Hadoop, so we do have a few people handling maintenance.

What about the implementation team?

As we were new to the solution, we found we needed some outside assistance to guide us. However, that was for the POC. In the end, I did it myself. 

What other advice do I have?

We're just a customer. We don't have a business relationship with Hadoop. 

My day-to-day job is data modeling and architecting.

Originally we used it as an open-source solution. We downloaded it, then we went for a commercial version of it.

In terms of advice, I'd tell other potential users that whether the solution is right for them depends on a few items. If the data volume is too big, it's IoT data, or the stream of data is too much, this solution can handle it and I would definitely recommend Apache Hadoop. 

Recently, in the last 18 months, I've been working with Snowflake on a data lake project, and I am really impressed with it. I got a certification, and we started using Snowflake for our data lake environment.

I'd rate the solution eight out of ten.

Which deployment model are you using for this solution?

On-premises
Disclosure: I am a real user, and this review is based on my own experience and opinions.
GA
Founder & CTO at a tech services company with 1-10 employees
Real User
Processes large data sets across clusters of computers

Pros and Cons

  • "Hadoop is designed to be scalable, so I don't think that it has limitations in regards to scalability."
  • "From the Apache perspective or the open-source community, they need to add more capabilities to make life easier from a configuration and deployment perspective."

What is our primary use case?

We mainly use Apache Hadoop for real-time streaming and integration, using Spark Streaming and the ecosystem of Spark technologies inside Hadoop.

What is most valuable?

I actually like most of the capabilities, but I think Spark has added reposit capabilities on top of the Hadoop ecosystem. The Spark area includes the capabilities that I like the most with Hadoop. 

What needs improvement?

I don't have any concerns because each part of Hadoop has its use cases. To date, I haven't implemented a huge product or project using Hadoop, but on the level of POCs, it's fine. 

The community of Hadoop is now a cluster, I think there is room for improvement in the ecosystem.

From the Apache perspective or the open-source community, they need to add more capabilities to make life easier from a configuration and deployment perspective.

For how long have I used the solution?

I have been using this solution for roughly five years.

What do I think about the stability of the solution?

I've never experienced any bugs or glitches.

What do I think about the scalability of the solution?

Hadoop is designed to be scalable, so I don't think that it has limitations in regards to scalability.

How was the initial setup?

It's a well-known fact that Hadoop's configuration is pretty hard. 

What other advice do I have?

Usually, people need to study and prepare for a few use cases and compare multiple ecosystems before choosing one. When people think of using a big data solution, Hadoop comes to mind. For certain use cases, Hadoop is comparable with other technologies. For example, when building a sort of real-time data warehouse (an enterprise data hub), people don't think about using Hadoop directly. People often use solutions like Druid for that.

At the end of the day, you need to compare existing technologies against your use cases. You need to study your use case and select the technology inside of Hadoop that will fit it. You may find another ecosystem that solves your problem. Just keep in mind that Hadoop is not the only solution; there are a lot of solutions. It depends on the use case.

Overall, on a scale from one to ten, I would give Hadoop a rating of eight.

Which deployment model are you using for this solution?

Public Cloud

If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?

Microsoft Azure
Disclosure: I am a real user, and this review is based on my own experience and opinions.
YM
CEO at AM-BITS LLC
Real User
Top 20
Good stability and scalability but the visualization isn't good

Pros and Cons

  • "The ability to add multiple nodes without any restriction is the solution's most valuable aspect."
  • "There is a lack of visualization and presentation layers, so you can't take it and implement it like a ready-made solution."

What is our primary use case?

We primarily use the solution for the enterprise data hub and big data warehouse extension.

What is most valuable?

The ability to add multiple nodes without any restriction is the solution's most valuable aspect.

What needs improvement?

What needs improvement depends on the customer and the use case. The classical Hadoop, for example, we consider an old variant. Most now work with flash data.

There is a very wide application for this solution, but in enterprise companies, if you work with classical BI systems, it would be good to include an additional presentation layer for BI solutions.

There is a lack of visualization and presentation layers, so you can't take it and implement it like a ready-made solution.

For how long have I used the solution?

We've been working with the solution for three to four years.

What do I think about the stability of the solution?

The solution is stable. It has very good disaster stability and multi-rack configuration.

What do I think about the scalability of the solution?

It is possible to scale the solution. We work with companies that have hundreds of users.

How was the initial setup?

The initial setup might not be straightforward for our customers, but it's easy enough for us to handle. However, if we don't build a proof of concept for the company first it may take some time and be quite complex. Pilot projects take about three months to deploy and full spec projects take up to a year because we have to work in all requirements in data governance, security, etc.

What's my experience with pricing, setup cost, and licensing?

We originally built on Hortonworks tech which didn't require any licensing, but that is getting discontinued in 2022, so it's been proposed we move to Cloudera which will have licensing costs associated with it.

What other advice do I have?

We use the on-premises deployment model. It's a requirement for the company we work with, which is a bank. Often customers demand we work with on-premises deployment models.

I'd rate the solution seven out of ten. In terms of the ability to build middleware and offer scalability, it would be 10 out of 10 from me. However, if you take into account only the visualization, I'd only rate it at three or four out of ten.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
SS
Technical Lead at a government with 201-500 employees
Real User
Good distributed processing and performance, but very expensive

Pros and Cons

  • "The performance is pretty good."
  • "The solution is very expensive."

What is most valuable?

The distributed processing is excellent. 

On the solution, Spark is very good. 

The performance is pretty good.

What needs improvement?

For the visualization tools, we use Apache Hadoop and it is very slow.

It lacks some query language. We have to use Apache Linux. Even so, the query language still has limitations with just a bit of documentation and many of the visualization tools do not have direct connectivity. They need something like BigQuery which is very fast. We need those to be available in the cloud and scalable.

The solution needs to be powerful and offer better availability for gathering queries.

The solution is very expensive.

For how long have I used the solution?

I've been using the solution for about five years now.

What do I think about the stability of the solution?

The solution is stable and offers good performance. It doesn't crash or freeze. It's not buggy at all.

What do I think about the scalability of the solution?

You can scale the solution if you need to. We find that it's pretty easy to expand it out.

There were about 13-20 people using it at any given time.

How are customer service and technical support?

The technical support was pretty good. It's my understanding that the company was pretty satisfied with the level of support they received. They were knowledgeable and responsive.

Which solution did I use previously and why did I switch?

I've also worked with MySQL and Postgres. Hadoop is more for analytical processing. While the others claim to have a distributor, Hadoop is far better in that regard. It's excellent compared to other options.

How was the initial setup?

The initial setup was pretty straightforward. It was not overly complex for our team.

What's my experience with pricing, setup cost, and licensing?

The solution isn't cheap. It's quite costly.

What other advice do I have?

The solution is perfect for those dealing with a huge amount of data. Still, you need to check to make sure it meets your company's requirements. You need to understand them before actually choosing the technology you'll ultimately use.

Overall, I would rate the solution at a seven out of ten.

Which deployment model are you using for this solution?

Public Cloud
Disclosure: I am a real user, and this review is based on my own experience and opinions.
DD
Partner at a tech services company with 11-50 employees
Real User
Highly elastic and stable, but it needs better security

What is our primary use case?

There are several use cases for Hadoop. Sometimes it's used for data warehousing. Other times, it's analytics. And in some cases, it's used to do transformation. For example, I have one client using it to decompress, compress, or encrypt data on ingestion. So, he used it like an ETL engine.

What is most valuable?

Hadoop is extensible — it's elastic.

What needs improvement?

Hadoop's security could be better.

For how long have I used the solution?

I've been using Hadoop for about eight years. I'm not sure exactly.

What do I think about the stability of the solution?

Performance is one of the reasons people choose Hadoop.

What do I think about the scalability of the solution?

Scalability is one of Hadoop's strong suits.

How are customer service and support?

I've never had to use Hadoop support. 

How was the initial setup?

The complexity of Hadoop's setup depends on the customer and their needs. However, most of my customers wind up using Hadoop as a service, which makes it very easy. It doesn't need much maintenance. My staff maintains multiple systems, so it's not like there would ever be somebody dedicated to one, and Hadoop is not a high-touch platform.

What other advice do I have?

I rate Hadoop seven out of 10. It's very good, but it could always be better. To anyone considering Hadoop, I recommend that you be mindful of what you're trying to achieve.

Which deployment model are you using for this solution?

Public Cloud

If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?

Amazon Web Services (AWS)
Disclosure: My company has a business relationship with this vendor other than being a customer: Implementer
YT
Technical Architect at RBSG Internet Operations
Real User
Good database and highly scalable, with good plug and play analytics tools

Pros and Cons

  • "The most valuable feature is the database."
  • "It would be good to have more advanced analytics tools."

What is our primary use case?

We are primarily dumping all the prior payment transaction data into a Hadoop system and then we use some of the plug and play analytics tools to translate it.

What is most valuable?

The most valuable feature is the database.

What needs improvement?

We're finding vulnerabilities in running it 24/7. We're experiencing some downtime that affects the data.

It would be good to have more advanced analytics tools.

For how long have I used the solution?

I've been using the solution for five years.

What do I think about the scalability of the solution?

The solution is scalable. From a payments perspective, we're using the solution on a large scale.

How are customer service and technical support?

We've never contacted technical support.

Which solution did I use previously and why did I switch?

We didn't previously use a different solution.

How was the initial setup?

The initial setup was complex. There was a lot of data that we had to bring over from various sources and it was quite a long process.

What about the implementation team?

We did have some assistance with the implementation.

What other advice do I have?

We use the on-premises deployment model.

We're more inclined towards an operational data source to fill our customer's needs. Hadoop is good for analytics and some reporting requirements. 

It's a good solution for those needing something for the purposes of management reporting.

I'd rate the solution eight out of ten.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
MukundMishra
Practice Lead (BI/ Data Science) at a tech services company with 11-50 employees
Real User
Good for managing and replicating big data but needs a better user interface

Pros and Cons

  • "It's good for storing historical data and handling analytics on a huge amount of data."
  • "The solution could use a better user interface. It needs a more effective GUI in order to create a better user environment."

What is most valuable?

The solution is perfect for when you have big data. It's good for management and replication.

It's good for storing historical data and handling analytics on a huge amount of data.

What needs improvement?

It could be because the solution is open source, and therefore not funded like bigger companies, but we find the solution runs slow.

The solution isn't as mature as SQL or Oracle and therefore lacks many features.

The solution could use a better user interface. It needs a more effective GUI in order to create a better user environment.

For how long have I used the solution?

I've been using the solution for seven years.

What do I think about the stability of the solution?

The solution is stable.

What other advice do I have?

I've used the solution under cloud, hybrid and on-premises deployment models.

I'd recommend the solution, but it depends on the company's requirements. If you don't have huge amounts of data, you probably don't need Hadoop. If you need a completely private environment, and you have lots of big data, consider Hadoop. You don't even need to invest in the infrastructure as you can just use a cloud deployment.

I'd rate the solution seven out of ten. I'd rate it higher if it had a better user interface.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
AR
Co-Founder at a tech services company with 201-500 employees
Real User
Top 5
Powerful data ingestion and consolidation tools prepare the data for predictive analytics

What is our primary use case?

The primary use is as a data lake. 

How has it helped my organization?

Using this solution has allowed us to consolidate the data. It has made it such that data science-based algorithms can be written for predictive analytics.

What is most valuable?

The most valuable features are powerful tools for ingestion, as data is in multiple systems.

What needs improvement?

It would be helpful to have more information on how to best apply this solution to smaller organizations, with less data, and grow the data lake.

For how long have I used the solution?

I have been using Apache Hadoop for two years.

Disclosure: I am a real user, and this review is based on my own experience and opinions.