Pentaho Data Integration and Analytics Reviews, Competitors and Pricing

Anton Abrarov

Project Leader at a mining and metals company with 10,001+ employees

Jun 8, 2022

Download

Fastens the data flow processes and has a user-friendly interface

Pros and Cons

"It has a really friendly user interface, which is its main feature. The process of automating or combining SQL code with some databases and doing the automation is great and really convenient."

"As far as I remember, not all connectors worked very well. They can add more connectors and more drivers to the process to integrate with more flows."

What is our primary use case?

The company where I was working previously was using this product. We were using it for ETL process management. It was like a data flow automatization.

In terms of deployment, we were using an on-premise model because we had sensitive data, and there were some restrictions related to information security.

How has it helped my organization?

Our data flow processes became faster with this solution.

What is most valuable?

It has a really friendly user interface, which is its main feature. The process of automating or combining SQL code with some databases and doing the automation is great and really convenient.

What needs improvement?

As far as I remember, not all connectors worked very well. They can add more connectors and more drivers to the process to integrate with more flows.

The last time I saw this product, the onboarding instructions were not clear. If the process of onboarding this product is made more clear, it will take the product to the next level. There is a possibility that the onboarding process has already improved, and I haven't seen it.

Buyer's Guide

Pentaho Data Integration and Analytics

April 2024

Free Report: Pentaho Data Integration and Analytics Reviews and More

Learn what your peers think about Pentaho Data Integration and Analytics. Get advice and tips from experienced pros sharing their opinions. Updated: April 2024.

DOWNLOAD NOW

768,886 professionals have used our research since 2012.

For how long have I used the solution?

I have used this solution for two or three years.

What do I think about the stability of the solution?

I would rate it an eight out of ten in terms of stability.

What do I think about the scalability of the solution?

We didn't have to scale too much. So, I can't evaluate it properly in terms of scalability.

In terms of its users, only our team was using it. There were approximately 20 users. It was not for the whole company.

How are customer service and support?

We didn't use too much customer support. We were using the open-source resources through Google Search. So, we were just using text search. There were some helpful forums where we were able to find the answers to our questions.

Which solution did I use previously and why did I switch?

I didn't use any other solution previously. This was the only one.

How was the initial setup?

I wasn't a part of its deployment. In terms of maintenance, as far as I know, it didn't require much maintenance.

What was our ROI?

We absolutely saw an ROI. It was hard to calculate, but we felt it in terms of
the speed of our processes. After using this product, we could do some of the things much faster than before.

What's my experience with pricing, setup cost, and licensing?

I mostly used the open-source version. I didn't work with a license.

Which other solutions did I evaluate?

I did not evaluate other options.

What other advice do I have?

I would recommend using this product for data engineering and Extract, Transform, and Load (ETL) processes.

I would rate it an eight out of ten.

Which deployment model are you using for this solution?

On-premises

Disclosure: I am a real user, and this review is based on my own experience and opinions.

Rodrigo Vazquez

CDE & BI Delivery Manager at a tech services company with 501-1,000 employees

May 10, 2022

Download

Connects to different databases, origins of data, files, and SFTP

Pros and Cons

"I can create faster instructions than writing with SQL or code. Also, I am able to do some background control of the data process with this tool. Therefore, I use it as an ELT tool. I have a station area where I can work with all the information that I have in my production databases, then I can work with the data that I created."

"I work with different databases. I would like to work with more connectors to new databases, e.g., DynamoDB and MariaDB, and new cloud solutions, e.g., AWS, Azure, and GCP. If they had these connectors, that would be great. They could improve by building new connectors. If you have native connections to different databases, then you can make instructions more efficient and in a more natural way. You don't have to write any scripts to use that connector."

What is our primary use case?

I just use it as an ETL. It is a tool that helps me work with data so I can solve any of my production problems. I work with a lot of databases. Therefore, I use this tool to keep information organized.

I work with a virtual private cloud (VPC) and VPN. If I work in the cloud, I use VPC. If I work on-premises, I work with VPNs.

How has it helped my organization?

I can create faster instructions than writing with SQL or code. Also, I am able to do some background control of the data process with this tool. Therefore, I use it as an ELT tool. I have a station area where I can work with all the information that I have in my production databases, then I can work with the data that I created.

Right now, I am working in the business intelligence area. However, we use BI in all our companies. So, it is not only in one area. So, I create different data parts for different business units, e.g., HR, IT, sales, and marketing.

What is most valuable?

A valuable feature is the number of connectors that I have. So, I can connect to different databases, origins of data, files, and SFTP. With SQL and NoSQL databases, I can connect, put it in my instructions, send it to my staging area, and create the format. Thus, I can format all my data in just one process.

What needs improvement?

I work with different databases. I would like to work with more connectors to new databases, e.g., DynamoDB and MariaDB, and new cloud solutions, e.g., AWS, Azure, and GCP. If they had these connectors, that would be great. They could improve by building new connectors. If you have native connections to different databases, then you can make instructions more efficient and in a more natural way. You don't have to write any scripts to use that connector.

Hitachi can make a lot of improvements in the tool, e.g., in performance or latency or putting more emphasis on cloud solutions or NoSQL databases.

For how long have I used the solution?

I have more than 15 years of experience working with it.

What do I think about the stability of the solution?

The stability depends on the version. At the beginning, it was more focused on stability. As of now, some things have been deprecated. I really don't know why. However, I have been pretty happy with the tool. It is a very good tool. Obviously, there are better tools, but Pentaho is fast and pretty easy to use.

What do I think about the scalability of the solution?

It is scalable.

How are customer service and support?

Their support team will receive a ticket on any failures that you might have. We have a log file that lets us review our errors, both in Windows and Unix. So, we are able to check both operating systems.

If you don't pay any license, you are not allowed to use their support at all. While I have used it a couple of times, that was more than 10 years ago. Now, I just go to their community and any Pentaho forums. I don't use the support.

Which solution did I use previously and why did I switch?

I have used a lot of ETL data integrators, such as DataStage, Informatica, Talend, Matillion, Python, and even SQL. MicroStrategy, Qlik, and Tableau have instructional features, and I try to use a lot of tools to do instructions.

How was the initial setup?

I have built the solution. It does not change for cloud or on-premise developments.

You create in your development environments, then you move to test. After that, you do the volume and integrity testing, then you go to UAT. Finally, you move to production. It does depend on the customer. You can thoroughly create the entire product structure as well as all the files that you need. Once you put it in production, it should work. You should have the same structure in development, test, and production.

What was our ROI?

It is free. I don't spend money on it.

It will reduce a lot of the time that you work with data.

What's my experience with pricing, setup cost, and licensing?

I use it because it is free. I download from their page for free. I don't have to pay for a license. With other tools, I have to pay for the licenses. That is why I use Pentaho.

I used to work with the complete suite of Pentaho, not only Data Integration. I used to build some solutions from scratch. I used to work with the Community version and Enterprise versions. With the Enterprise version, it is more than building cubes. I am building a BI solution that I can explore. Every time that I use Pentaho Data Integration, I never spend any money because it comes free with the tool. If you pay for the Enterprise license, Pentaho Data Integration is included. If you don't pay for it and use the Community version, Data Integration is included for free.

Which other solutions did I evaluate?

I used to work with a reseller of Pentaho. That is why I started working with it. Also, I did some training for Pentaho at the company that I used to work for in Argentina, where we were a Platinum reseller.

Pentaho is easy to use. You don't need to install anything. You can just open the script and start working on it. That is why I chose it. With Informatica, you need to do a server installation, but some companies might not allow some installation in their production or normal environment.

I feel pretty comfortable using the solution. I have tried to use other tools, but I always come back to Pentaho because it is easier.

Pentaho is open source. While Informatica is a very good tool, it is pretty expensive. That is one of the biggest cons for the data team because you don't want to pay money for tools that just only help you to work.

What other advice do I have?

I would rate this solution as eight out of 10. One of the best things about the solution is that it is free.

I used to sell Pentaho. It has a lot of pros and cons. From my side, there are more pros than cons. There isn't one tool that can do everything that you need, but this tool is one of those tools that helps you to complete your tasks and it is pretty integrable with other tools. So, you can switch Pentaho on and off from different tools and operating systems. You can use it in Unix, Linux, Windows, and Mac.

If you know how to develop different things and are very good at Java, you can create your own connectors. You can create a lot of things.

It is a very good tool if you need to work with data. There isn't a database that you can't manage with this tool. You can work with it and manage all the data that you want to manage.

Which deployment model are you using for this solution?

Hybrid Cloud

Disclosure: I am a real user, and this review is based on my own experience and opinions.

Buyer's Guide

Pentaho Data Integration and Analytics

April 2024

Free Report: Pentaho Data Integration and Analytics Reviews and More

Learn what your peers think about Pentaho Data Integration and Analytics. Get advice and tips from experienced pros sharing their opinions. Updated: April 2024.

DOWNLOAD NOW

768,886 professionals have used our research since 2012.

ABDULGAFFAR

Assistant General Manager at DTDC Express Limited

Jan 8, 2021

Download

Scales well with data and processes, but the cost should be lower and real-time processing capabilities improved

Pros and Cons

"The amount of data that it loads and processes is good."

"I would like to see improvements made for real-time data processing."

What is our primary use case?

We are using just the simple features of this product.

We're using it as a data warehouse and then for building dimensions.

What needs improvement?

The shortcoming in version 7 is that we are unable to connect to Google Cloud Storage (GCS), where I can write the results from Pentaho. I'm able to connect to S3 using Pentaho 8, but when using it for GCS, I'm unable to connect. With people moving from on-premises deployments to the cloud, be it S3, Azure, or Google, we need a plugin where we can interact with these cloud vendors.

I would like to see improvements made for real-time data processing. It is something that I will be looking out for.

For how long have I used the solution?

We have been using Pentaho Data Integration for three years.

What do I think about the stability of the solution?

For all of the features that we have been using, it is a stable product.

What do I think about the scalability of the solution?

In terms of data loading and processes, the scalability is good.

We have a team of four people who are using it for analytics.

How are customer service and technical support?

As we are using the Community Version, we have not been in contact with technical support. Instead, we rely on forums and websites when we need to resolve a problem.

Which solution did I use previously and why did I switch?

In the past, I have worked with Talend, as well as SAP BO Data Services (BODS). However, that was with another company. This organization started with Pentaho and we are still using it.

How was the initial setup?

It is a straightforward setup process. It took between three and four hours to complete.

What's my experience with pricing, setup cost, and licensing?

We are using the Community Version, which is available free of charge.

The price of the regular version is not reasonable and it should be lower.

What other advice do I have?

My advice for anybody who is researching this product is that if they want to do batch processing, then this is a good choice. The amount of data that it loads and processes is good.

Based on the features that I have used and my experience, I would rate this solution a seven out of ten.

Which deployment model are you using for this solution?

On-premises

Disclosure: I am a real user, and this review is based on my own experience and opinions.

Nirmal KosuruWorks at Cognizio

Report as inappropriate

May 11, 2021

Yes the integration tool should be made available as Professional or Community / Standard / Enterprise Editions and Pricing should be made accordingly on the industry by industry basis or cases by case. And also there should be Transparency in the pricing and availability of community edition as the case was earlier when Pentaho management realeased it into market.

Ridwan Saeful Rohman

Data Engineering Associate Manager at Zalora Group

Nov 22, 2022

Download

Good abstraction and useful drag-and-drop functionality but can't handle very large data amounts

Pros and Cons

"The abstraction is quite good."

"If you develop it on MacBook, it'll be quite a hassle."

What is our primary use case?

I still use this tool on a daily basis. Comparing it to my experience with other ETL tools, the system that I created using this tool was quite simple. It is just as simple as extracting the data from MySQL, exporting it on the CSV, putting it on the S3, then pushing it into Redshift.

The PDI Kettle Job and Kettle Transformation are bundled by the shell script, then scheduled, and orchestrated by Jenkins.

We still use this tool due to the fact that there are a lot of old systems that still use it. The new solution that we use is mostly on Airflow. We are still in the transition phase. To be clear, Airflow is a data orchestration tool that mainly uses Python. Everything from the ETL, all the way to the scheduling and the monitoring of any issues. It's in one system and entirely on Airflow.

How has it helped my organization?

In my current company, it does not have any major impact. We use it for old and simple ETLs only.

In terms of setting the ETL tools that we put on the panel, it's quite useful. However, this kind of functionality that we currently put on the solution can be easily switched by other tools that exist on the market. It's time to change it entirely to Airflow. We'll likely change in the next six months.

What is most valuable?

This solution offers tools that are drag and drop. The script is quite minimal. Even if you do not come from IT or your background is not in software engineering, it is possible. It is quite intuitive. You can drag and drop many functions.

The abstraction is quite good.

Also, if you're familiar with the product itself, they have transformational abstractions and job abstractions. Here, we can create a smaller transformation in the Kettle transformation, and then the bigger ones on the Kettle job. For someone who has familiarity with Python or someone who has no scripting background at all, the product is useful.

For larger data, we are using Spark.

The solution enables us to create pipelines with minimal manual, or custom coding efforts. Even if you have no advanced experience in scripting, it is possible to create ETL tools. I have a recent graduate coming from a management major who has no experience with SQL. I trained him for three months, and within that time he became quite fluent, with no prior experience using ETL tools.

Whether or not it's important to handle the creation of pipelines with minimal coding depends on the team. If I change the solution to Airflow, then I will need more time to teach them to become fluent in the ETL tool. By using these kinds of abstractions in the product, I can compress the training time to just three months. With Airflow, it will take longer than six months to get new users to the same point.

We use the solution's ability to develop and deploy data pipeline templates and reuse them.

The old system was created by someone prior to me in my organization and we still use it. It was developed by him a long time ago. We also use the solution for some ad hoc reporting.

The ability to develop and deploy data pipeline templates once and reuse them is really important to us. There are some requests to create the pipelines. I create them and then deploy them on our server. It then has to be as robust as when we do the scheduling so that it does not fail.

We like the automation. I cannot imagine how the data teams will work if everything was done on an ad hoc basis. Everything should be automated. Using my organization as an example, I can with confidence say that 95% of our data distributions are automated and only 5% ad hoc. With this solution, we query the data manually. We process the data on the spreadsheets manually and then distribute it to the organization. It’s important to be robust and be able to automate.

So far, we can deploy the solution easily on the cloud, which is on AWS. I haven't really tried it on another server. We deploy it on our AWS EC2, however, we develop it on our local computer, which consists of people who use Windows. There are some people who also use MacBooks.

I personally have used it on both. I have to develop both on Windows and MacBook. I can say that Windows is easier to navigate. On the MacBook, the display becomes quite messed up if you are enabling the dark mode.

The solution did reduce our ETL development time if you compare it to the scripting. However, this will really depend on your experience.

What needs improvement?

Five years ago, when I had less experience with scripting, I would have definitely used this product over Airflow, as it will be easier for me with the abstraction being quite intuitive. Five years ago, I would choose the product over the other tools using pure scripting as it would reduce most of my time in terms of developing ETL tools. This isn't the case anymore as I have more familiarity with scripting.

When I first joined my organization, I was still using Windows. It is quite straightforward to develop the ETL system on Windows. However, when I changed my laptop to MacBook, it was quite a hassle. When we tried to open the application, we had to open the terminal first, go to the solution's directory, and then run the executable file. The display also becomes quite messed up when we enable dark mode on MacBook.

Therefore, if you develop it on MacBook, it'll be quite a hassle, however, when you develop it on Windows, it's not really different from other ETL tools on the market, like SQL Server Integration Services, Informatica, et cetera.

For how long have I used the solution?

I have been using this tool since I moved to my current company, which is about one year ago.

What do I think about the stability of the solution?

The performance is good. I have not done a test on the bleeding edge of the product. We only do simple jobs. In terms of data, we extract it and then exported it from MySQL to the CSV. There were only millions of data points, not billions of data points. So far, it has met our expectations. It's quite good for a smaller number of data points.

What do I think about the scalability of the solution?

I'm not sure that the product could keep up with the data growth. It can be useful for millions of data points. However, I haven't explored the option of billions of data points. I think there are better solutions that are on the market. It's also applied to the other drag-and-drop ETL tools too like SQL Server Integration Service, Informatica, etc.

How are customer service and support?

We don't really use technical support. The current version that we are using is no longer supported by their representatives. We didn't update it yet to the newer version.

How would you rate customer service and support?

Neutral

Which solution did I use previously and why did I switch?

We're moving to Airflow. The reason for the switch was mostly due to a problem when we are debugging. If you're familiar with the SQLs for integration services, the ETL tools from Microsoft and the debugging function are quite intuitive. You can exactly spot which transformation has failed or which transformation has an error. However, in the solution, from what my colleagues told me, it is hard to do that. When there is an error, we cannot directly spot where the error is coming from.

Airflow is quite customized and it's not as rigid as this product. We can deploy the simple ETL tools all the way to the machine learning systems on Airflow. Airflow mainly uses Python, which our team is quite familiar with. This solution is still handled by only two people out of 27 people on our team. Not enough people know it.

How was the initial setup?

There are no separations between the deployment and other teams. Each of our teams acts like an individual contributor. We handle the implementation process all the way from face-to-face business meetings, setting timelines, developing the tools, and defining the requirements, to the production deployment.

The initial setup is straightforward. Currently, the use of versioning control in our organization is quite loose. We are not using any versioning control software. The way we deploy it is just as simple as putting the Kettle transformation file into our EC2 server and rewriting the old file, that's it.

What's my experience with pricing, setup cost, and licensing?

I'm not really sure what the price for the product is. I don't handle the purchasing or the commissioning.

What other advice do I have?

We put it on our AWS EC2 server, however, when we developed it, it was put on our local server. We deploy it onto our EC2 server. We bundle it on our shell scripts and the shell scripts are run by Jenkins.

I'd rate the solution a seven out of ten.

Disclosure: I am a real user, and this review is based on my own experience and opinions.

reviewer1872000

Senior Data Analyst at a tech services company with 51-200 employees

Jun 8, 2022

Download

We're able to query large data sets without affecting performance

Pros and Cons

"One of the most valuable features is the ability to create many API integrations. I'm always working with advertising agents and using Facebook and Instagram to do campaigns. We use Pentaho to get the results from these campaigns and to create dashboards to analyze the results."

"Parallel execution could be better in Pentaho. It's very simple but I don't think it works well."

What is our primary use case?

I use it for ETL. We receive data from our clients and we join the most important information and do many segmentations to help with communication between our product and our clients.

How has it helped my organization?

Before we used Pentaho, our processes were in Microsoft Excel and the updates from databases had to be done manually. Now all our routines are done automatically and we have more time to do other jobs. It saves us four or five hours daily.

In terms of ETL development time, it depends on the complexity of the job, but if the job is simple it saves two or three hours.

What is most valuable?

One of the most valuable features is the ability to create many API integrations. I'm always working with advertising agents and using Facebook and Instagram to do campaigns. We use Pentaho to get the results from these campaigns and to create dashboards to analyze the results.

I'm working with large data sets. One of the clients I'm working with is a large credit card company and the database from this client is very large. Pentaho allows me to query large data sets without affecting its performance.

I use Pentaho with Jenkins to schedule the jobs. I'm using the jobs and transformations in Pentaho to create many links.

I always find ways to have minimal code and create the processes with many parameters. I am able to reuse processes that I have created before.

Creating jobs and putting them into production, as well as the visibility that Pentaho gives, are both very simple.

What needs improvement?

Parallel execution could be better in Pentaho. It's very simple but I don't think it works well.

For how long have I used the solution?

I've been working with Pentaho for four or five years.

What do I think about the stability of the solution?

The stability is good.

What do I think about the scalability of the solution?

It's scalable.

How are customer service and support?

I find help on the forums.

Which solution did I use previously and why did I switch?

I used SQL Server Integration Services, but I have much more experience with Pentaho. I have also worked with Apache NiFi but it is more focused on single data processes but I'm always working with batch processes and large data sets.

How was the initial setup?

The first deployment was very complex because we didn't have experience with the solution, but the next deployment was simpler.

We create jobs weekly in Pentaho. The development time takes, on average, one week and the deployment takes just one day or so.

We just put it on Git and pull a server and schedule the execution.

We use it on-premises while the infrastructure is Amazon and Azure.

What other advice do I have?

I always recommend Pentaho for working with automated processes and to do API integrations.

Which deployment model are you using for this solution?

On-premises

Disclosure: I am a real user, and this review is based on my own experience and opinions.

Aqeel UR Rehman

BI Analyst at Vroozi

May 19, 2022

Download

Simple to use, supports custom transformations, and the open-source version can be used free of charge

Pros and Cons

"This solution allows us to create pipelines using a minimal amount of custom coding."

"I have been facing some difficulties when working with large datasets. It seems that when there is a large amount of data, I experience memory errors."

What is our primary use case?

I have used this ETL tool for working with data in projects across several different domains. My use cases include tasks such as transforming data that has been taken from an API like PayPal, extracting data from different sources such as Magenta or other databases, and transforming all of the information.

Once the transformation is complete, we load the data into data warehouses such as Amazon Redshift.

How has it helped my organization?

There are a lot of different benefits we receive from using this solution. For example, we can easily accept data from an API and create JSON files. The integration is also very good.

I have created many data pipelines and after they are created, they can be reused on different levels.

What is most valuable?

The best feature is that it's simple to use. There are simple data transformation steps available, such as trimming data or performing different types of replacement.

This solution allows us to create pipelines using a minimal amount of custom coding. Anyone in the company can do so, and it's just a simple step. If any coding is required then we can use JavaScript.

What needs improvement?

I have been facing some difficulties when working with large datasets. It seems that when there is a large amount of data, I experience memory errors. If there is a large amount of data then there is definitely a lag.

I would like to see a cloud-based deployment because it will allow us to easily handle a large amount of data.

For how long have I used the solution?

I have been working with Hitachi Lumada Data Integration for almost three years, across two different organizations.

What do I think about the stability of the solution?

There is definitely some lag but with a little improvement, it will be a good fit.

What do I think about the scalability of the solution?

This is a good product for an enterprise-level company.

We use this solution for all of our data integration jobs. It handles the transformation. As our business grows and the demand for data integration increases, our usage of this tool will also increase.

Between versions, they have added a lot of plugins.

How are customer service and support?

The technical support does not reply in a timely manner. I have filled out the support request form, one or two times, asking about different things, but I have not received a reply.

The support they have in place does not work very well. I would rate them one or two out of ten.

How would you rate customer service and support?

Negative

Which solution did I use previously and why did I switch?

In this business, they initially began with this product and did not use another one beforehand. I have also worked on the cloud-level integration tool.

How was the initial setup?

The initial setup and deployment are straightforward.

I have deployed it on different servers and on average, it takes an hour to complete. I have not read any documentation regarding installation. With my experience, we were able to set everything up.

What's my experience with pricing, setup cost, and licensing?

I primarily work on the Community Version, which is available to use free of charge. I have asked for pricing information but have not yet received a response.

What other advice do I have?

We are currently using version 8.3 but version 9 is available. More features to support big data are available in the newest release.

My advice for anybody who is considering this product is if they're looking for any kind of custom transformation, or they're gleaning data from multiple sources and sending it to multiple destinations, I definitely recommend this tool.

Overall, this is a good product and I recommend it.

I would rate this solution an eight out of ten.

Which deployment model are you using for this solution?

On-premises

Disclosure: I am a real user, and this review is based on my own experience and opinions.

Renan Guedert

Business Intelligence Specialist at a recruiting/HR firm with 11-50 employees

Apr 20, 2022

Download

Creates a good, visual pipeline that is easy to understand, but doesn't handle big data well

Pros and Cons

"Sometimes, it took a whole team about two weeks to get all the data to prepare and present it. After the optimization of the data, it took about one to two hours to do the whole process. Therefore, it has helped a lot when you talk about money, because it doesn't take a whole team to do it, just one person to do one project at a time and run it when you want to run it. So, it has helped a lot on that side."

"A big problem after deploying something that we do in Lumada is with Git. You get a binary file to do a code review. So, if you need to do a review, you have to take pictures of the screen to show each step. That is the biggest bug if you are using Git."

What is our primary use case?

It was our principle to make the whole ETL and data warehousing on our projects. We created a whole step for collecting all the raw data from APIs and other databases from flat files, like Excel files, CSV files, and JSON files, to do the whole transformation and data preparation, then model the data and put it in SQL Server and integration services.

For business intelligence projects, it is sometimes pretty good, when you are extracting something from the API, to have a step to transform the JSON file from the API to an SQL table.

We use it heavily as a virtual machine running on Windows. We have also installed the open-source version on the desktop.

How has it helped my organization?

Lumada provides us with a single, end-to-end data management experience from ingestion to insights. This single data management experience is pretty good because then you don't have every analyst doing their own stuff. When you have one unique tool to do that, you can keep improving as well as have good practices and a solid process to do the projects.

What is most valuable?

It has many resourceful things. It has a variety of the things that you can do. It is also pretty open, since you can put in a Python script or JavaScript for everything. If you don't have the native tool on the application, you can build your own using scripts. You can build your other steps and jobs on the application. The liberty of the application has been pretty good.

Lumada enables us to create pipelines with minimal manual coding efforts, which is the most important thing. When creating a pipeline, you can see which steps are failing in the process. You can keep up the process and debug, if you have problems. So, it creates a good, visual pipeline that makes it easy to understand what you are doing during the entire process.

What needs improvement?

There is no straight-line explanation about bugs and errors that happen on the software. I must search heavily on the Internet, some YouTube videos, and other forums to know what is happening. The proper site of Hitachi and Lumada doesn't have the best explanation about bugs, errors, and the functions. I must search for other sources to understand what is happening. Usually, it is some guy in India or Russia who knows the answer.

A big problem after deploying something that we do in Lumada is with Git. You get a binary file to do a code review. So, if you need to do a review, you have to take pictures of the screen to show each step. That is the biggest bug if you are using Git.

After you create a data pipeline, if you could make a JSON file or something with another language, we could simplify the steps for creating what we are doing. Or, a simple flat file of text could be even better than that, but generated by their own platform so people can look and see what is happening. You shouldn't need to download the whole project in your own Pentaho, I would like to just look at the code and see if there is something wrong.

When I use it for open-source applications, it doesn't handle big data too well. Therefore, we have to use other kinds of technologies to manage that.

I would like it more accessible for Macs. Previously, I always used Linux, but some companies that I worked for before used MacBooks. It would be good if I could use Pentaho in that too since I need to use other tools or create a virtual machine to use Pentaho. So, it would be pretty good if the solution had a friendly version for Macs or Linux-based programs, like Ubuntu.

For how long have I used the solution?

I have been using it for six years, but more heavily over the last two years.

How are customer service and support?

I don't bring issues to Hitachi since Lumada is open source in some kind of way.

Once, when I had a problem with connections because of the software, I saw the issue in the forums on the Internet because there was some type of bug happening.

Which solution did I use previously and why did I switch?

At my first company, we used just Lumada. At my second company, we used a lot of QlikView, SQL, Python, and Lumada. At my third company, we used Python and SQL much more. I used Lumada just once at that company. At my new company, I don't use it at all. I just use Azure Data Factory and SQL.

With Pentaho, we finally have data pipelines. We didn't have solid data pipelines before. After the data pipelines became very solid, the team who created them became very popular in the company.

How was the initial setup?

To set up the things, we used a virtual machine. It was a version where we can download it and unlock a machine too. You can do Ctrl-C and Ctrl-V with Pentaho because all you need to have is the newest version of Java. So, it was pretty smooth to do the setup. It took an hour maximum to deploy.

What was our ROI?

Sometimes, it took a whole team about two weeks to get all the data to prepare and present it. After the optimization of the data, it took about one to two hours to do the whole process. Therefore, it has helped a lot when you talk about money, because it doesn't take a whole team to do it, just one person to do one project at a time and run it when you want to run it. So, it has helped a lot on that side.

The solution reduced our ETL development time by a lot because a whole project used to take about a month to get done previously. After having Lumada, it took just a week. For a big company in Brazil, it saves a team at least $10,000 a month.

Which other solutions did I evaluate?

I just use the ETL tool. For data visualization, we are using Power BI. For data storage, we use SQL Server, Azure, or Google BigQuery.

We are just using the open-source application for ETL. We have never looked into other tools of Hitachi because they are paid.

I know other companies who are using Alteryx, which has a friendlier user interface, but they have fewer tools and are more difficult to utilize. My wife uses Alteryx, and I find it is not as good after I used Lumada because they have more solutions and it's open-source. Though, Alteryx has more security and better support.

What other advice do I have?

For someone who wants simple solutions, open-source tools are very perfect for someone who isn't a programmer or knowledgeable about technology. In one week, you can try to understand this solution and do your first project. In my opinion, it is the best tool for people starting out.

Lumada is a great tool. I would rate it as a straight seven out of 10. It gets the work done. The open-source version doesn't work well with big data sources, but there is a lot of flexibility and liberty to do everything you want and need. If the open-source version worked better with big data, then I would give it a straight eight since there is always room for improvement. Sometimes when debugging, some errors can be pretty difficult. It is a tool in principle, when you are starting business intelligence and data engineering, to understand everything that is going on.

Disclosure: PeerSpot contacted the reviewer to collect the review and to validate authenticity. The reviewer was referred by the vendor, but the review is not subject to editing or approval by the vendor.

José Orlando Maia

Data Engineer at a tech services company with 201-500 employees

Apr 20, 2022

Download

We can parallelize the extraction from various servers simultaneously, accelerating our extraction

Pros and Cons

"The area where Lumada has helped us is in the commercial area. There are many extractions to compose reports about our sales team performance and production steps. Since we are using Lumada to gather data from each industry in each country. We can get data from Argentina, Chile, Brazil, and Colombia at the same time. We can then concentrate and consolidate it in only one place, like our data warehouse. This improves our production performance and need for information about the industry, production data, and commercial data."

"Lumada could have more native connectors with other vendors, such as Google BigQuery, Microsoft OneDrive, Jira systems, and Facebook or Instagram. We would like to gather data from modern platforms using Lumada, which is a better approach. As a comparison, if you open Power BI to retrieve data, then you can get data from many vendors with cloud-native connectors, such as Azure, AWS, Google BigQuery, and Athena Redshift. Lumada should have more native connectors to help us and facilitate our job in gathering information from these new modern infrastructures and tools."

What is our primary use case?

My primary use case is to provide integration with my source systems, such as ERP systems and SAP systems, and web-based systems, having them primarily integrate with my data warehouse. For this process, I use ETL to treat and gather all the information from my first system, then consolidate it in my data warehouse.

How has it helped my organization?

We needed to gather data from many servers at my company. We had probably 10 or 12 equivalent databases spread around the world, i.e., Brazil, Paraguay, or Chile, and had an instance in each country. So, this server is Microsoft SQL Server-based. We are using Lumada to get the data from these international databases. We can parallelize the extraction from various servers at the same time because we have the same structure, schemas, and tables in each of these SQL Server-based servers. This provides a good value for us, as we can extract data at the same time in parallel, which accelerates our extraction.

In one integration process, I can retrieve data from 10 or 12 servers at the same time in one transformation. In the past, using SQL Server or other manual tools, we needed to have 10 or 12 different processes, one per server. Using Lumada in parallel accelerates our extraction. The tools that Lumada provides enable us to transform the data during this process, integrating the data in our data warehouse with good performance.

Because Lumada uses Java virtual machines, we can deploy and operate in whatever operational system that we want. We can deploy on Linux, even when we had a Linux version from Lumada and a Windows version from Lumada.

It is simple to deploy my ETLs because Lumada has the Pentaho Server version. I installed the desktop version so we can deploy our transformations in the repository. We install our own Lumada on a server, then we have a web interface to schedule our ETLs. We are also able to reschedule our ETLs. We can schedule the hour that we want to run our ETL processes and transformations. We can schedule how many times we want to process the data. We can save all our transformations in a repository located in a Pentaho Server. Since we have a repository, we can save many versions of our transformation, such as 1.0, 1.1, and 1.2, in the repository. I can save four or five versions of a transformation. I can ask Lumada to run only the last version that I saved in the database.

Lumada offers a web interface to follow these transformations. We can check the logs to see if the transformations were successfully completed, we had a network query, or some database log issues. Using Lumada, there is a feature where we can get logs at the execution time. We can also be notified by email if transformations occurred successfully or failed. We have a file for each process that we schedule on Pentaho Server.

The area where Lumada has helped us is in the commercial area. There are many extractions to compose reports about our sales team performance and production steps. Since we are using Lumada to gather data from each industry in each country. We can get data from Argentina, Chile, Brazil, and Colombia at the same time. We can then concentrate and consolidate it in only one place, like our data warehouse. This improves our production performance and need for information about the industry, production data, and commercial data.

What is most valuable?

The features that I use the most are Microsoft Excel table input, S3 CSV Input, and CSV input. Today, the features that are more valuable to me are the table input, then the CSV input. These both are very important. We extract data from the table system for our transactional databases, which are commonly used. We also use the CSV input to get data from AWS S3 and our data lake.

In Lumada, we can parallelize the steps. The performance to query the databases for me is good, especially for transactional databases. Because Lumada uses Java, we can adjust the amount of memory that we want to use to do transformations. So, it is accessible. It's possible to set up the amount of memory that we want to use in the Java VM, which is good. Therefore, Lumada is good, especially with transactional database extraction. It has good performance, not higher performance, but good performance as we query data, and it is possible to parallelize the query. For example, if we have three or four servers to get the data, then we can retrieve the data at the same time, in parallel, in these databases. This is good because we don't need to wait while one of the extractions finishes.

Using Lumada, we don't need to do many manual transformations because we have a native company for many of our transformations. Thus, Lumada is a low-code tool to gather data from SQL, Python, or other transformation tools.

What needs improvement?

Lumada could have more native connectors with other vendors, such as Google BigQuery, Microsoft OneDrive, Jira systems, and Facebook or Instagram. We would like to gather data from modern platforms using Lumada, which is a better approach. As a comparison, if you open Power BI to retrieve data, then you can get data from many vendors with cloud-native connectors, such as Azure, AWS, Google BigQuery, and Athena Redshift. Lumada should have more native connectors to help us and facilitate our job in gathering information from these new modern infrastructures and tools.

For how long have I used the solution?

I have been using Lumada Data Integration for at least four years. I started using it in 2018.

How are customer service and support?

Because we are using the free version of Lumada, we have used only the support on the communities and forums on the Internet.

Lumada does have a paid version, where Hitachi support is specialized in Lumada support.

How was the initial setup?

It is simple to deploy Lumada because we can deploy our transformation in three to five simple steps, saving our transformation in a repository.

I open the Pentaho Server web-based version, then I find the transformation that I deployed. I can schedule this transformation at the hour or recurrence in which I want to run the transformation. It is easy. Because at the end of the process, I can save my transformation and Lumada generates the XML file. We can send this XML file to any user of Lumada, who can open up this model and get the transformation that I developed. As a deployment process, it is straightforward, simple, and not complex.

What was our ROI?

Using Lumada compared to using SQL manually, ETL development time is half the time it took using a basic manual transformation.

What's my experience with pricing, setup cost, and licensing?

There are more types of connectors, but you need to pay.

You need to go through the paid version to have Hitachi Lumada specialized support. However, if you are using the free version, then you will have only the community support. You will depend on the releases from Hitachi to solve some problem or questions that you have, such as bug fixes. You will need to wait for the newest versions or releases to solve these types of problems.

Which other solutions did I evaluate?

I also use Talend Data Integration. For me, Lumada is straightforward and makes it simpler to have transformations as drag and drops. Comparing Talend and Lumada, I think Lumada is easier to use, more than Talend. The comprehension needed for these tools is less with Lumada with than Talend. I can learn Lumada in a day and proceed with my transformations, using some tutorials, since Lumada is easier to use. Whereas, Talend is a more complex solution with more complex transformations.

In Talend's open version, i.e., free version, you won't have a Talend server to deploy models. Thus, you deploy Talend models on the server. If you want to schedule some transformation, then you need to use the operational system where you have infrastructure to run transformations and deploy them. For example, in Talend, we deployed a data model in Talend, but we needed to use Windows Scheduler to also schedule the packets in Talend to process the data in the free version of Talend. Whereas, in the free version of Lumada, we already had it based on the web server. Therefore, we can run our transformations and deploy them on the server. We can schedule in a web interface, which guides us with scheduling the data and checking our logs to see how many transformations we have at a time. This is the biggest difference between Talend and Lumada.

What other advice do I have?

I don't use many templates. I use the solution based on a case-by-case basis.

Considering that Lumada is a free tool, I would rate it as nine out of 10 for the free version.

Which deployment model are you using for this solution?

On-premises

Disclosure: I am a real user, and this review is based on my own experience and opinions.