The primary use case of this solution is data engineering and data files.
The deployment model we are using is private, on-premises.
Apache Hadoop is the #6 ranked solution in our list of top Data Warehouse tools. It is most often compared to Microsoft Azure Synapse Analytics.
Updated: October 2021
We don't use many of the Hadoop features, like Pig or Sqoop, but what I like most is Ambari. You have to use Ambari; otherwise, it is very difficult to configure.
What comes with the standard setup is what we mostly use, but Ambari is the most important.
Hadoop itself is quite complex, especially if you want it running on a single machine, so to get it set up is a big mission.
It seems that Hadoop is on its way out and Spark is the way to go. You can run Spark on a single machine and it's easier to set up.
In the next release, I would like to see Hive be more responsive for smaller queries, with reduced latency. I don't know whether this is viable, but if it is possible, lower latency on smaller queries for analysis and analytics would be an improvement.
I would like a smaller version that can be run on a local machine. There are installations that do that, but they are quite difficult, so I would say a smaller version that is easy to install and explore would be an improvement.
This solution is stable, but sometimes starting up can be quite a mission. With a full, proper setup it's fine, but it's a lot of work to look after, and to start up and shut down.
This solution is scalable, and I can scale it almost indefinitely.
We have approximately two thousand users, half of the users are using it directly and another thousand using the products and systems running on it. Fifty are data engineers, fifteen direct appliances, and the rest are business users.
There are several forums on the web, and Google search works fine. There is a lot of information available and it often works.
They also have good support in regards to the implementation.
I am satisfied with the support. Generally, there is good support.
We used the more traditional database solutions such as SAP IQ and data marts, but now it's changing more towards data science and big data.
We are a smaller infrastructure, so that's how we are set up.
The initial setup is quite complex if you have to set it up yourself. Ambari makes it much easier, but on the cloud or local machines, it's quite a process.
It took at least a day to set it up.
I did not use a vendor. I implemented it myself on the cloud with my local machine.
There was an evaluation, but it was a decision to implement with Data Lake and Hortonworks data platform.
It's good for what it is meant to do, which is handling a lot of big data, but it's not as good for low-latency applications.
If you have to run quick queries for analytics, it can be frustrating.
It can be useful for what it was intended to be used for.
I would rate this solution a seven out of ten.
We primarily use this product to integrate legacy systems.
It helps us work with older products and more easily create solutions.
The most valuable thing about this program for us is that it is very powerful and very cheap. We're using a lot of the program's modules and features because we're using software and hardware that can be difficult to integrate. For example, we're using supersets and a lot of old products from difficult systems. We love having the various options and features that allow us to work with flexibility.
We are using HDTM circuit boards, and I worry about the future of this product and compatibility with future releases. It's a concern because, for now, we do not have a clear path to upgrade. The Hadoop product is in version three and we'd like to upgrade to the third version. But as far as I know, it's not a simple thing.
There are a lot of features in this product that are open-source. If something isn't included with the distribution we are not limited. We can take things from the internet and integrate them. As far as I know, we are using Presto which isn't included in HDP (Hortonworks Data Platform) and it works fine. Not everything has to be included in the release. If something is outside of HDP and it works, that is good enough for me. We have the flexibility to incorporate it ourselves.
The product is well tested and very stable. We have no problems with the stability of it at all. Really we just install it and forget about fussing with it. We just use the features it offers to be productive.
This is a scalable solution and we like what it does. It is currently serving about 100 users at our organization and it seems like it can handle more easily.
We actually have not used technical support. Everything we needed a solution for we just use Google and it's enough for us. Sometimes we do have issues, but not often. The issues are mainly to do with the terminals because it's a bit complicated to integrate these other systems. We have managed to solve all the problems up till now.
We had a very old version of Hadoop which was already installed by another company and we upgraded it. We didn't really switch; we just upgraded what was here.
The initial setup wasn't very easy because of the incredible security, but we have managed to get past that. It's fairly simple, in my opinion, once you get past that part. I think, in all, it took about half a year. But it wasn't a new deployment, it was an upgrade, and the bigger challenge was moving the data. We pretty much just supported the existing product and moved to HDP.
We have everything on-premises and we did the deployment and maintenance.
It took four people. We want to increase usage of Hadoop and we are thinking about it very heavily. We're actually in the process of doing it. At the same time, we are integrating things from other systems to Hadoop.
I would give this product a rating of eight out of ten. It would not be a ten out of ten because of some problems we are having with the upgrade to the newer version. It would have been better for us if these problems were not holding us back. I think eight is good enough.
We use this solution for our Enterprise Data Lake.
Using this solution has reduced the overall TCO. It has also improved data processing time for the machine and provides greater insight into our unstructured data.
The most valuable features are the ability to process the machine data at a high speed, and to add structure to our data so that we can generate relevant analytics.
We would like to have more dynamics in merging this machine data with other internal data to make more meaning out of it.
We use it as a data lake for streaming analytical dashboards.
There is a lot of difference. I think the best case is that we are able to drill down to transactional records and really build a root-cause analysis for various issues that might arise, on demand. Because we're able to process in parallel, we don't have to wait for the big data warehouse engine. We process down what the data is and then build it up to an answer, and we can have an answer in an hour rather than 10 hours.
There are jobs that cannot be done unless you have massively parallel processing; for instance, processing call-detail records for telecom.
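As a rough illustration of "process down, then build up to an answer", here is a minimal map-and-reduce sketch over call-detail records. The record layout (caller, seconds), the sample data, and every function name are assumptions made for the example; the threads only stand in for cluster nodes and this is not how Hadoop itself is invoked.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Hypothetical call-detail records: (caller_id, duration_seconds).
RECORDS = [
    ("alice", 60), ("bob", 30), ("alice", 45),
    ("carol", 120), ("bob", 15), ("alice", 5),
]


def split(records, parts):
    """Cut the record list into roughly equal partitions."""
    size = max(1, -(-len(records) // parts))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def map_partition(partition):
    """Map step: total call seconds per caller within one partition."""
    totals = Counter()
    for caller, seconds in partition:
        totals[caller] += seconds
    return totals


def merge(left, right):
    """Reduce step: combine two partial results into one."""
    return left + right


def total_seconds_by_caller(records, parts=3):
    """Process partitions side by side, then merge the partial totals."""
    partitions = split(records, parts)
    with ThreadPoolExecutor(max_workers=parts) as pool:
        partials = list(pool.map(map_partition, partitions))
    return dict(reduce(merge, partials, Counter()))


if __name__ == "__main__":
    print(total_seconds_by_caller(RECORDS))
```

Each partition is summarized independently, which is what lets the work spread across machines; only the small partial totals have to be brought together at the end.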
In general, Hadoop has a lot of different component parts to the platform - things like Hive and HBase - and they're all moving somewhat independently and somewhat in parallel. I think as you look to platforms in the cloud or to walled-garden concepts, like Cloudera or Azure, you see that the third party can make sure all the components work together before they are used for business purposes. That removes a layer of administration, configuration, and technical support.
I would like to see more direct integration of visualization applications.
In general, stability can be a challenge. It's hard to say what stability means. You're in an environment that's before production-line manufacturing, where none of the parts relate together exactly as they should. So that can create some instability.
To realize the benefit of these kinds of open-source, big-data environments, you want to use as many different tools as you can get. That brings with it all this overhead of making them work together. It's kind of a blessing and a curse, at the same time: There's a tool for everything.
Apache is the open-source foundation that Cloudera and Hortonworks contribute code and some work to. I don't know that there is actually support and structure, per se, for Apache.
We have had premium support at various times with various companies. The three dominant companies I've worked with - Cloudera, Hortonworks, and MapR - each offer a premium support package, but it still only covers their base distribution, not necessarily all the add-ons that sit on top of it. That is really the big challenge: getting everything to work together.
There are the older relational database technologies: Netezza, SQL Server, MySQL, Oracle, Teradata. All have some advantages and some disadvantages. Most notably, they are all significantly more expensive in terms of the capital expense, rather than the operational expense. They are "walled-garden," so to speak, that are curated and have a distinct set of tools that work with them, and not the bleeding-edge ingenuity that comes with an open-source platform.
Data warehousing is 30 years old, at least. Big data, in its current form, has only been around for four or five years.
There are capacities in which I have been responsible for setup, administration, and building the applications on those environments. Each of the components is relatively straightforward. The complexity comes from all the different components.
Implement for defined use cases. Don't expect it to all just work very easily.
I would rate this platform a seven out of 10. On the one hand, it's the only place you can use certain functions, and on the other hand, it's not going to put any of the other ones out of business. It's really more of a complement. There is no fundamental battle between relational databases and Hadoop.
Big Data analytics, customer incubation.
We host our Big Data analytics "lab" on Amazon EC2. Customers are new to Big Data analytics so we do proofs of concept for them in this lab. Customers bring historical, structured data, or IoT data, or a blend of both. We ingest data from these sources into the Hadoop environment, build the analytics solution on top, and prove the value and define the roadmap for customers.
Initially, with RDBMS alone, we had a lot of work and few servers running on-premise and on cloud for the PoC and incubation. With the use of Hadoop and ecosystem components and tools, and managing it in Amazon EC2, we have created a Big Data "lab" which helps us to centralize all our work and solutions into a single repository. This has cut down the time in terms of maintenance, development and, especially, data processing challenges.
We were using MySQL and PostgreSQL for these engagements, and scaling and processing were not as easy when compared to Hadoop. Also, customers who are embarking on a big journey with semi-structured information prefer to use Hadoop rather than a RDBMS stack. This gives them clarity on the requirements.
In addition, since both Apache Hadoop and Amazon EC2 are elastic in nature, we can scale and expand on demand for a specific PoC, and scale down when it's done.
Flexibility, ease of data processing, reduced cost and efforts are the three key improvements for us.
HDFS and Kafka: Ingestion of huge volumes and variety of unstructured/semi-structured data is feasible, and it helps us to quickly onboard a new Big Data analytics prospect.
Based on our needs, we would like to see a tool for data visualization and enhanced Ambari for management, plus a pre-built IoT hub/model. These would reduce our efforts and the time needed to prove to a customer that this will help them.
We have a three-node cluster running on cloud by default, and it has been stable so far without any stoppages due to Hadoop or other ecosystem components.
Since this is primarily for customer incubation, there is a need to process huge volumes of data, based on the proof of value engagement. During these processes, we scale the number of instances on demand (using Amazon spot instances), use them for a defined period, and scale down when the PoC is done. This gives us good flexibility and we pay only for usage.
Since this is mostly community driven, we get a lot of input from the forums and our in-house staff who are skilled in doing the job. So far, most of the issues we have had during setup or scaling have primarily been on the infrastructure side and not on the stack. For most of the problems we get answers from the community forums.
We didn't have any major issues except for knowledge, so we hired the right person who had hands-on experience with this stack, and worked with the cloud provider to get the right mechanism for handling the stack.
General installation/dependency issues were there, but were not a major, complex issue. While migrating data from MySQL to Hive, things are a little challenging, but we were able to get through that with support from forums and a little trial and error. In addition, the old PoCs which were migrated had issues in directly connecting to Hive. We had to build some user functions to handle that.
We normally do not suggest any specific distributions. When it comes to cloud, our suggestion would be to choose different types of instances offered by Amazon cloud, as we are technology partners of Amazon for cost savings. For all our PoCs, we stick to the default distribution.
None, as this stack is familiar to us and we were sure it could be used for such engagements without much hassle. Our primary criteria were the ability to migrate our existing RDBMS-based PoC and connectivity via our ETL and visualization tool. On top of that, support for semi-structured data for ELT. All three of these criteria were a fit with this stack.
Our general suggestion to any customer is not to blindly look and compare different options. Rather, list the exact business needs - current and future - and then prepare a matrix to see product capabilities and evaluate costs and other compliance factors for that specific enterprise.
Data aggregation for KPIs. The sources of data come in all forms so the data is unstructured. We needed high storage and aggregation of data, in the background.
We start with data mashing on Hive and finally use this for KPI visualization. This intermediate step not only mashes data in the form that we want through data Cube slicing, but also helps us save states as snapshots for multiple time frames.
Without this, we would have had to plan another data source for only this purpose. Moving this step closer to processing worked better than keeping it at visualization. Although we can't completely avoid using data stores/snapshots at visualization, this step proved to be promising for getting data ready for better analytics and insights.
High throughput and low latency.
At the beginning, MRs on Hive made me think we should get down to Hadoop MRs to have better control of the data. But later, Hive as a platform upgraded very well. I still think a Spark-type layer on top gives you an edge over having only Hive.
I rate it an eight out of 10. It's huge, complex, and slow, but it does what it is meant for.
It needs better user interface (UI) functionalities.
There are no licensing costs involved, hence money is saved on the software infrastructure.
Impala. As compared to Hive on MapReduce, Impala on MPP returns results of SQL queries in a fairly short amount of time, and is relatively fast when reading data into other platforms like R (for further data analysis) or QlikView (for data visualisation).
The quick access to data enabled more frequent data-backed decisions.
The key shortcoming is its inability to handle queries when there is insufficient memory. This limitation can be bypassed by processing the data in chunks.
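The chunking workaround can be sketched roughly as below. This is only an illustration of the idea on the client side: how the chunks are actually produced against the cluster (for example, one query per date partition) is not stated in the review, and the function names are made up for the example.

```python
from itertools import islice


def chunks(rows, size):
    """Yield successive lists of at most `size` rows from any iterable."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def chunked_sum_and_count(rows, size=1000):
    """Aggregate without ever holding more than one chunk in memory."""
    total, count = 0, 0
    for batch in chunks(rows, size):
        total += sum(batch)
        count += len(batch)
    return total, count


if __name__ == "__main__":
    total, count = chunked_sum_and_count(range(1, 11), size=3)
    print(total, count, total / count)
```

The trade-off is that only aggregations that can be combined from partial results (sums, counts, minimums, maximums) split this cleanly; something like an exact median needs more care.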
Typically instability is experienced due to insufficient memory, either due to a large job being triggered or multiple concurrent small requests.
No. This is by default a cluster-based setup and hence scaling is just a matter of adding on new data nodes.
Not applicable to Cloudera. We have a separate onsite vendor to manage the cluster.
No. Two years ago this was a new team and hence there were no legacy systems to speak of.
Complex. Cloudera stack itself was insufficient. Integration with other tools like R and QlikView was required and in-house programs had to be built to create an automated data pipeline.
Not much advice as pricing and licensing is handled at an enterprise level.
However, do take into consideration that data storage and compute capacity scale differently, and hence purchasing a "boxed" / "all-in-one" solution (software and hardware) might not be the best idea.
Yes. Oracle Exadata and Teradata.
Try open-source Hadoop first but be aware of greater implementation complexity. If open-source Hadoop is "too" complex, then consider a vendor packaged Hadoop solution like HortonWorks, Cloudera, etc.
HDFS allows you to store large data sets optimally.
After switching to big data pipelines, our query performance improved a hundred times.
Rolling restarts of data nodes need to be done in a way that can be further optimized. Also, I/O operations can be optimized for more performance.
I have used Hadoop for over three years.
Once we had an issue with stability, due to a complete shutdown of a cluster. Bringing up a cluster took a lot of time because of some order that needed to be followed.
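One way to make a required startup order explicit is to write the dependencies down and let a topological sort derive the sequence. The sketch below is a minimal example; the service names and the dependencies between them are assumptions for illustration, not the order this reviewer's cluster actually required.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each service lists what must be running first.
DEPENDS_ON = {
    "zookeeper": [],
    "hdfs": ["zookeeper"],
    "yarn": ["hdfs"],
    "hive": ["hdfs", "yarn"],
}


def startup_order(depends_on):
    """Return services ordered so that dependencies always come first."""
    return list(TopologicalSorter(depends_on).static_order())


def shutdown_order(depends_on):
    """Shut down in the reverse of the startup order."""
    return list(reversed(startup_order(depends_on)))


if __name__ == "__main__":
    print("start:", startup_order(DEPENDS_ON))
    print("stop: ", shutdown_order(DEPENDS_ON))
```

`graphlib` is in the Python standard library from version 3.9 onward, and it raises `CycleError` if the dependency map contains a loop.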
We have not had scalability issues.
The community is very supportive and provided prompt replies and suggestions to JIRA tickets.
We didn’t have a previous solution. It was a move from RDBMS to big data.
Initial setup of a few nodes was simple, but as we increased the node count it became complex, as we need to maintain rack topology, etc.
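For readers unfamiliar with rack topology: Hadoop can call an external script, configured through the `net.topology.script.file.name` property, passing it host names or IP addresses and reading back one rack path per host. The following is a minimal sketch of such a script; the addresses and rack names are invented, and a real deployment would load the table from a maintained file rather than hard-code it.

```python
import sys

# Hypothetical host-to-rack table; these addresses and racks are examples only.
RACKS = {
    "10.0.1.11": "/dc1/rack1",
    "10.0.1.12": "/dc1/rack1",
    "10.0.2.21": "/dc1/rack2",
}

# Fallback used for any host that is missing from the table.
DEFAULT_RACK = "/default-rack"


def resolve(hosts):
    """Map each host (in order) to its rack path, falling back to the default."""
    return [RACKS.get(host, DEFAULT_RACK) for host in hosts]


if __name__ == "__main__":
    # Hosts arrive as command-line arguments; racks are printed in the same order.
    print(" ".join(resolve(sys.argv[1:])))
```

Keeping this table accurate as nodes are added is the maintenance burden that grows with the node count.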
It’s free and it is open source.
I would suggest using this product. We were able to use this for petabytes of data.
The Distributed File System, which is the base of Hadoop, has been the most valuable feature with its ability to store video, pictures, JSON, XML, and plain text all in the same file system.
We do use the Hadoop platform internally, but mostly it is for R&D purposes. However, many of the recent projects that our IT consulting firm has taken on have deployed Hadoop as a solution to store high-velocity and highly variable data sizes and structures, and be able to process that data together quickly and efficiently.
Hadoop in and of itself stores data with 3x redundancy and our organization has come to the conclusion that the default 3x results in too much wasted disk space. The user has the ability to change the data replication standard, but I believe that the Hadoop platform could eventually become more efficient in their redundant data replication. It is an organizational preference and nothing that would impede our organization from using it again, but just a small thing I think could be improved.
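The disk-space cost of replication is simple arithmetic, sketched here for clarity. The replication factor corresponds to the HDFS `dfs.replication` setting, whose default is 3; the 100 TB figure is just an example, and the function names are made up for the illustration.

```python
def raw_storage_needed(logical_tb, replication=3):
    """Raw disk (TB) consumed when every block is stored `replication` times."""
    if replication < 1:
        raise ValueError("replication factor must be at least 1")
    return logical_tb * replication


def usable_fraction(replication=3):
    """Share of raw disk that holds unique data."""
    if replication < 1:
        raise ValueError("replication factor must be at least 1")
    return 1 / replication


if __name__ == "__main__":
    for factor in (3, 2):
        print(factor, raw_storage_needed(100, factor), round(usable_fraction(factor), 2))
```

Lowering the factor from 3 to 2 cuts the raw disk needed for 100 TB of data from 300 TB to 200 TB, at the price of tolerating one fewer lost copy of each block.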
This version was released in January 2016, but I have been working with the Apache Hadoop platform for a few years now.
The only issues we found during deployment were errors originating from between the keyboard and the chair. I have set up roughly 20 Hadoop Clusters and mostly all of them went off without a hitch, unless I configured something incorrectly on the pre-setup.
We have not encountered any stability problems with this platform.
We have scaled two of the clusters that we have implemented; one in the cloud, one on-premise. Neither ran into any problems, but I can say with certainty that it is much, much easier to scale in a cloud environment than it is on-premise.
Apache Hadoop is open-source and thus customer service is not really a strong point, but the documentation provided is extremely helpful, more so than that of some of the Hadoop vendors such as MapR, Cloudera, or Hortonworks.
Again, it's open source. There are no dedicated tech support teams that we've come across unless you look to vendors such as Hortonworks, Cloudera, or MapR.
We started off using Apache Hadoop for our initial Big Data initiative and have stuck with it since.
Initial setup was decently straightforward, especially when using Apache Ambari as a provisioning tool. (I highly recommend Ambari.)
We are the implementers.
It's open source.
We solely looked at Hadoop.
Try, try, and try again. Experiment with MapReduce and YARN. Fine-tune your processes and you will see some insane processing power.
I would also recommend that you have at least a 12-node cluster: two master nodes, eight compute/data nodes, one Hive node (SQL), and one dedicated Ambari node.
For the master nodes, I would recommend 4-8 Core, 32-64 GB RAM, 8-10 TB HDD; the data nodes, 4-8 Core, 64 GB RAM, 16-20 TB RAID 10 HDD; hive node should be around 4 Core, 32-64 GB RAM, 5-6 TB RAID 0 HDD; and the Ambari dedicated server should be 2-4 Core, 8-12 GB RAM, 1-2 TB HDD storage.
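To see what the sizing above adds up to, the recommendation can be tallied as below. The numbers are taken from the text as quoted; the disk totals are the plain sum of the stated sizes and ignore RAID and HDFS replication overhead, and the class and function names are made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Role:
    name: str
    nodes: int
    ram_gb: tuple   # (low, high) per node
    disk_tb: tuple  # (low, high) per node, as quoted


# Node counts and per-node ranges from the recommendation above.
CLUSTER = [
    Role("master", 2, (32, 64), (8, 10)),
    Role("data", 8, (64, 64), (16, 20)),
    Role("hive", 1, (32, 64), (5, 6)),
    Role("ambari", 1, (8, 12), (1, 2)),
]


def total_nodes(cluster):
    """Count every node across all roles."""
    return sum(role.nodes for role in cluster)


def total_range(cluster, field):
    """Sum a (low, high) per-node range across the whole cluster."""
    low = sum(role.nodes * getattr(role, field)[0] for role in cluster)
    high = sum(role.nodes * getattr(role, field)[1] for role in cluster)
    return low, high


if __name__ == "__main__":
    print("nodes:", total_nodes(CLUSTER))
    print("RAM (GB):", total_range(CLUSTER, "ram_gb"))
    print("disk (TB):", total_range(CLUSTER, "disk_tb"))
```

On these figures the cluster comes to 12 nodes with roughly 616 to 716 GB of RAM and 150 to 188 TB of quoted disk.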
With the increase in data size for the business, this horizontal scalable appliance has answered every business question in terms of storage and processing. Hadoop ecosystem has not only provided a reliable distributed aggregation system but has also allowed room for analytics which has resulted in great data insights.
The Apache team is doing a great job and releasing Hadoop versions well ahead of what we can think about. Every room for improvement is fixed as soon as a version is released by ASF. Currently, Apache Oozie 4.0.1 has some compatibility issues with Hadoop 2.5.2.
Not at all.
We did when we started initially with Hadoop 1.x, which didn't have HA, but now we don't have any stability issues.
Hadoop is known for its scalability. Yahoo stores approx. 455 PB in their Hadoop cluster.
It depends on the Hadoop distributor. I would rate Hortonworks 9/10.
I would rate Hortonworks 9/10.
We previously used Netezza. We switched because our business required a highly scalable appliance like Hadoop.
It's a bit complex in terms of build around for commodities, but soon it will ease up as the product matures.
We used a vendor team who were 9/10.
Valuable storage and processing with a lower cost than previously.
The best pricing and licensing depends on the flavor you choose, but remember that it is only worthwhile if you have a very large data set that cannot be handled by a traditional RDBMS.
First, understand your business requirement; second, evaluate the scalability and capability of a traditional RDBMS; and finally, if you have reached the tip of the iceberg (RDBMS), then yes, you definitely need an island (Hadoop) for your business. Feasibility checks are important for any business before it takes any crucial step. I would also say “Don’t always flow with the stream of a river, because sometimes it will lead you to a waterfall, so always research and analyze before you take a ride.”