Pentaho Data Integration and Analytics Room for Improvement

DP
Enterprise Data Architect at a manufacturing company with 201-500 employees

Some of the scheduling features of Lumada drive me buggy. The one issue that always drives me up the wall is the Daylight Saving Time change. It doesn't take that into account elegantly. Every time it changes, I have to do something. It's not a big deal, but it's annoying. That's the one issue, but I see the limitation, and it might not be easily solvable.
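The DST behavior described here can be reproduced with a small sketch: a scheduler that freezes its trigger as a fixed UTC instant drifts by an hour in local terms once the clocks change. This is a generic Python illustration of the failure mode, not Lumada's actual scheduler logic; the dates assume the 2024 US transition.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

ny = ZoneInfo("America/New_York")

# A naive scheduler resolves "run at 02:30 New York time" to a fixed UTC
# instant once, before the clocks change (US spring-forward was 2024-03-10).
before = datetime(2024, 3, 9, 2, 30, tzinfo=ny)
fixed_utc_trigger = before.astimezone(timezone.utc) + timedelta(days=2)

# What the user actually expects two days later, after the transition:
expected = datetime(2024, 3, 11, 2, 30, tzinfo=ny).astimezone(timezone.utc)

# The frozen trigger now fires an hour off in local terms -- the manual
# correction the reviewer has to make after every transition.
drift = fixed_utc_trigger - expected
print(drift)
```

Schedulers that re-resolve the local wall-clock time against the IANA zone on every run avoid this drift.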

Jacopo Zaccariotto - PeerSpot reviewer
Head of Data Engineering at InfoCert

It's difficult to use custom code. Implementing a pipeline with pre-built blocks is straightforward, but it's harder to insert custom code inside the pre-built blocks. The web interface is rusty, and the biggest problem with Pentaho is debugging and troubleshooting. It isn't easy to build the pipeline incrementally. At least in our case, it's hard to find a way to execute step by step in the debugging mode.

Repository management is also a shortcoming, but I'm not sure if that's just a limitation of the free version. I'm not sure if Pentaho can use an external repository. It's a flat-file repository inside a virtual machine. Back in the day, we would want to deploy this repository on a database.

Pentaho's data management covers ingestion and insights, but I'm not sure if it's end-to-end management—at least not in the free version we are using—because some of the intermediate steps are missing, like data cataloging and data governance features. This is the weak spot of our Pentaho version.

Ryan Ferdon - PeerSpot reviewer
Senior Data Engineer at Burgiss

If you're working with a larger data set, I'm not so sure it would be the best solution. The larger things got, the slower it was.

It was kind of buggy sometimes. And when we ran the flow, it didn't go from a perceived start to end, node by node. Everything kicked off at once. That meant there were times when it would get ahead of itself and a job would fail. That was not because the job was wrong, but because Pentaho decided to go at everything at once, and something would process before it was supposed to. There were nodes you could add to make sure that, before this node kicks off, all these others have processed, but it was a bit tedious. 
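The ordering workaround described above (blocking nodes that wait for upstream steps) is, in effect, running the flow in dependency order. A minimal Python sketch of that idea, with hypothetical step names:

```python
from graphlib import TopologicalSorter

# Hypothetical flow: one extract feeds two transforms; the load must wait
# for both. Each entry maps a step to the steps it depends on.
deps = {
    "transform_a": {"extract"},
    "transform_b": {"extract"},
    "load": {"transform_a", "transform_b"},
}

# static_order() yields a sequence in which every step appears only after
# all of its dependencies -- the guarantee the reviewer had to wire up
# manually with blocking nodes.
order = list(TopologicalSorter(deps).static_order())
print(order)
```

An engine that derives this order automatically never lets a job "get ahead of itself" the way the reviewer describes.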

There were also caching issues, and we had to write code to clear the cache every time we opened the program, because the cache would fill up and it wouldn't run. I don't know how hard that would be for them to fix, or if it was fixed in version 10.
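The cache workaround mentioned here usually amounts to a small launcher wrapper that wipes the cache directory before starting the tool. A sketch under stated assumptions — the cache location and launcher name are hypothetical and vary by install:

```python
import shutil
import subprocess
from pathlib import Path

def launch_with_clean_cache(cache_dir, launcher="spoon.sh", run=subprocess.run):
    """Delete the stale cache directory, then start the designer.

    Both cache_dir and launcher are placeholders for whatever a given
    install actually uses.
    """
    cache_dir = Path(cache_dir)
    if cache_dir.exists():
        shutil.rmtree(cache_dir)  # remove the cache that blocks startup
    return run([launcher], check=False)
```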

Also, the UI is a bit outdated, but I'm more of a fan of function over how something looks.

One other thing that would have helped with Pentaho was documentation and support on the internet: how to do things, how to set up. I think there are some sites on how to install it, and Pentaho does have a help repository, but it wasn't always the most useful.

Buyer's Guide
Pentaho Data Integration and Analytics
April 2024
Learn what your peers think about Pentaho Data Integration and Analytics. Get advice and tips from experienced pros sharing their opinions. Updated: April 2024.
768,578 professionals have used our research since 2012.
PR
Senior Engineer at a comms service provider with 501-1,000 employees

Although it is a low-code solution with a graphical interface, the error messages you get are often of the type that a developer would be happy with: a big stack of red text and Java errors displayed on the screen. That wall of red can intimidate less technical people. Other graphical tools that are focused on the power-user level provide a much more user-friendly experience in dealing with exceptions and guiding the user to where they've made the mistake.

Some of the components have a great many options. Guidance embedded in the interface about when to use certain options would be good, so that people know what a setting does and when they should use it. It is quite light on that aspect.

Dale Bloom - PeerSpot reviewer
Credit Risk Analytics Manager at MarketAxess

I haven't been able to explore all the functionality of the Enterprise edition because it hasn't been integrated into our server. We're still building out the server, app server, and repository to support it.

In the Community edition, it would be nice to have more modules that allow you to code directly within the application. It could have R or Python completely integrated into it, but this could also be because I'm using an older version.

RicardoDíaz - PeerSpot reviewer
COO / CTO at a tech services company with 11-50 employees

Their client support is very bad. It should be improved. There is also not much information on Hitachi forums or Hitachi web pages. It is very complicated.

In terms of the flexibility to deploy in any environment, such as on-premise or in the cloud, we can do the cloud deployment only through virtual machines. We might also be able to work on different environments through Docker or Kubernetes, but we don't have an Azure app or an AWS app for easy deployment to the cloud. We can only do it through virtual machines, which is a problem, but we can manage it. We also work with Databricks because it works with Spark. We can work with clustered servers, and we can easily do the deployment in the cloud. With a right-click, we can deploy Databricks through the app on AWS or Azure cloud.

VK
Solution Integration Consultant II at a tech vendor with 201-500 employees

It could be better integrated with programming languages, like Python and R. Right now, if I want to run Python code in one of my ETLs, it is a bit difficult to do. It would be great if we had some modules where we could code directly in Python. We don't really have a way to run Python code natively.
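When a tool has no native Python step, a common workaround is to shell out: the pipeline pipes rows to a standalone script over stdin and reads the transformed rows back on stdout. A minimal sketch of such a script — the "name" column is a made-up example:

```python
import csv
import sys

def transform(rows):
    """Row-wise logic goes here; this example upper-cases a 'name' column."""
    for row in rows:
        row["name"] = row["name"].upper()
        yield row

def main(infile=sys.stdin, outfile=sys.stdout):
    reader = csv.DictReader(infile)
    writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames)
    writer.writeheader()
    writer.writerows(transform(reader))

if __name__ == "__main__":
    main()
```

The ETL then invokes the script as an external process, which is clumsier than a native Python module would be, but it works today.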

TJ
Manager, Systems Development at a manufacturing company with 5,001-10,000 employees

Its basic functionality doesn't need a whole lot of change. There could be some improvement in the consistency of the behavior of different transformation steps. The software did start as open-source and a lot of the fundamental, everyday transformation steps that you use when building ETL jobs were developed by different people. It is not a seamless paradigm. A table input step has a different way of thinking than a data merge step.

Anton Abrarov - PeerSpot reviewer
Project Leader at a mining and metals company with 10,001+ employees

As far as I remember, not all connectors worked very well. They can add more connectors and more drivers to the process to integrate with more flows.

The last time I saw this product, the onboarding instructions were not clear. If the process of onboarding this product is made more clear, it will take the product to the next level. There is a possibility that the onboarding process has already improved, and I haven't seen it. 

RV
CDE & BI Delivery Manager at a tech services company with 501-1,000 employees

I work with different databases. I would like to see more connectors to new databases, e.g., DynamoDB and MariaDB, and new cloud solutions, e.g., AWS, Azure, and GCP. If they had these connectors, that would be great. They could improve by building new connectors. If you have native connections to different databases, you can build your integrations more efficiently and in a more natural way. You don't have to write any scripts to use the connector.

Hitachi can make a lot of improvements in the tool, e.g., in performance or latency or putting more emphasis on cloud solutions or NoSQL databases. 

AG
Assistant General Manager at DTDC Express Limited

The shortcoming in version 7 is that we are unable to connect to Google Cloud Storage (GCS), where I can write the results from Pentaho. I'm able to connect to S3 using Pentaho 8, but when using it for GCS, I'm unable to connect. With people moving from on-premises deployments to the cloud, be it S3, Azure, or Google, we need a plugin where we can interact with these cloud vendors.

I would like to see improvements made for real-time data processing. It is something that I will be looking out for.

Ridwan Saeful Rohman - PeerSpot reviewer
Data Engineering Associate Manager at Zalora Group

Five years ago, when I had less experience with scripting, I would definitely have used this product over Airflow, as it would have been easier for me, with the abstraction being quite intuitive. Back then, I would have chosen it over pure-scripting tools because it would have saved me most of my ETL development time. That isn't the case anymore, as I have more familiarity with scripting.

When I first joined my organization, I was still using Windows. It is quite straightforward to develop the ETL system on Windows. However, when I changed my laptop to a MacBook, it was quite a hassle. When we tried to open the application, we had to open the terminal first, go to the solution's directory, and then run the executable file. The display also becomes quite messed up when we enable dark mode on the MacBook.

Therefore, if you develop on a MacBook, it'll be quite a hassle. However, when you develop on Windows, it's not really different from other ETL tools on the market, like SQL Server Integration Services, Informatica, et cetera.

RK
Senior Data Analyst at a tech services company with 51-200 employees

Parallel execution could be better in Pentaho. It's very simple but I don't think it works well.

Aqeel UR Rehman - PeerSpot reviewer
BI Analyst at Vroozi

I have been facing some difficulties when working with large datasets. It seems that when there is a large amount of data, I experience memory errors. If there is a large amount of data then there is definitely a lag.

I would like to see a cloud-based deployment because it will allow us to easily handle a large amount of data.

Renan Guedert - PeerSpot reviewer
Business Intelligence Specialist at a recruiting/HR firm with 11-50 employees

There is no straightforward explanation of the bugs and errors that happen in the software. I must search heavily on the Internet, some YouTube videos, and other forums to know what is happening. Hitachi's and Lumada's own sites don't have the best explanations of bugs, errors, and functions. I must search other sources to understand what is happening. Usually, it is some guy in India or Russia who knows the answer.

A big problem after deploying something that we do in Lumada is with Git. You get a binary file to do a code review. So, if you need to do a review, you have to take pictures of the screen to show each step. That is the biggest bug if you are using Git.

After you create a data pipeline, if the platform could generate a JSON file, or even better a simple flat text file, describing the steps, people could look at it and see what is happening. You shouldn't need to download the whole project into your own Pentaho; I would like to just look at the code and see if there is something wrong.
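Where a transformation is stored as a file-based XML export (.ktr), a review-friendly flat summary can be generated from it. A sketch, assuming the usual `<step><name>`/`<type>` layout; it is a stopgap, not the native text format the reviewer is asking for:

```python
import xml.etree.ElementTree as ET

def summarize_transformation(ktr_xml):
    """Flatten a transformation's steps into diff- and review-friendly lines."""
    root = ET.fromstring(ktr_xml)
    lines = []
    for step in root.iter("step"):
        kind = step.findtext("type", default="?")
        name = step.findtext("name", default="?")
        lines.append(f"{kind}: {name}")
    return "\n".join(lines)
```

Committing such a summary next to the pipeline gives reviewers something to diff in Git instead of screenshots.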

When I use it for open-source applications, it doesn't handle big data too well. Therefore, we have to use other kinds of technologies to manage that.

I would like it to be more accessible for Macs. Previously, I always used Linux, but some companies that I worked for used MacBooks, and I had to use other tools or create a virtual machine to run Pentaho. So, it would be pretty good if the solution had a friendly version for Macs, or for Linux-based systems like Ubuntu.

José Orlando Maia - PeerSpot reviewer
Data Engineer at a tech services company with 201-500 employees

Lumada could have more native connectors to other vendors, such as Google BigQuery, Microsoft OneDrive, Jira systems, and Facebook or Instagram. We would like to gather data from modern platforms using Lumada. As a comparison, if you open Power BI to retrieve data, you can get data from many vendors with cloud-native connectors, such as Azure, AWS, Google BigQuery, Athena, and Redshift. Lumada should have more native connectors to help us and facilitate our job in gathering information from these modern infrastructures and tools.

Michel Philippenko - PeerSpot reviewer
Project Manager at a computer software company with 51-200 employees

I was not happy with the Pentaho Report Designer because of the way it was set up. There was a zone and, under it, another zone, and under that another one, and under that another one. There were a lot of levels and places inside the report, and it was a little bit complicated. You had to search all these different places using a mouse, clicking everywhere. The interface does not enable you to find things and manage all that. I don't know if other tools are better for end-users when it comes to the graphical interface, but this was a bit complicated. In the end, we were able to do everything with Pentaho.

And when you want to improve the appearance of your report, Pentaho Report Designer has complicated menus. It is not very user-friendly. The result is beautiful, but it takes time.

Also, each report is coded in a binary file, so you cannot read it. Maybe that's what the community or the developers want, but it is inconvenient because when you want to search for information, you need to open the graphical interface and click everywhere. You cannot search with a text search tool because the reports are coded in binary. When you have a lot of reports and you want to find where a precise part of one of your reports is, you cannot do it easily.

The way you specify parameters in Pentaho Report Designer is a little bit complex. There are two interfaces. The job creators use PDI, which provides the ETL interface, and it's okay; creating the jobs for extract/transform/load is simpler than in other solutions. But there is another interface for the end-users of Pentaho, and you have to understand how the two relate to each other, so it's a little bit complex. You have to go into XML files, which is not so simple.

Also, using the solution overall is a little bit difficult. You need to be an engineer and somebody with a technical background. It's not absolutely easy, it's a technical tool. I didn't immediately understand it and had to search for information and to think about it.

RE
Data Architect at a consumer goods company with 1,001-5,000 employees

I would like to see improvement when it comes to integrating structured data with text data or anything that is unstructured. Sometimes we get all kinds of different files that we need to integrate into the warehouse. 

By using some of the Python scripts that we have, we are able to extract all this text data into JSON. Then, from JSON, we are able to create external tables in the cloud whereby, at any one time, somebody has access to this data on the S3 drive.
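The extract-to-JSON step described above can be as small as the sketch below: delimited text lines parsed into JSON-lines records that an external table over S3 can read. The delimiter and field names are assumptions for illustration:

```python
import json

def lines_to_json_records(lines, fields=("id", "label", "value"), sep="|"):
    """Turn delimited text lines into newline-delimited JSON records."""
    records = []
    for line in lines:
        parts = line.strip().split(sep)
        # zip pairs each assumed field name with its value from the line
        records.append(json.dumps(dict(zip(fields, parts))))
    return "\n".join(records)
```

The resulting JSON-lines output is the format most cloud external-table engines accept directly.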

NA
Systems Analyst at a university with 5,001-10,000 employees

The transition to the web-based solution has taken a little longer and been more tedious than we would like and it's taken away development efforts towards the reporting side of the tool. They have a reporting tool called Pentaho Business Analytics that does all the report creation based on the data integration tool. There are a lot of features in that product that are missing because they've allocated a lot of their resources to fixing the data integration, to make it more web-based. We would like them to focus more on the user interface for the reporting.

The reporting definitely needs improvement. There are a lot of general, basic features that it doesn't have. A simple feature you would expect a reporting tool to have is the ability to search the repository for a report. It doesn't even have that capability. That's been a feature that we've been asking for since the beginning and it hasn't been implemented yet. We have between 500 and 800 reports in our system now. We've had to maintain an external spreadsheet with IDs to identify the location of all of those reports, instead of having that built into the system. It's been frustrating for us that they can't just build a simple search feature into the product to search for report names. It needs to be more in line with other reporting tools, like Tableau. Tableau has a lot more features and functions.

Because the reporting is lacking, only the deans and above are using it. It could be used more, and we'd like it to be used more.

Also, while the solution provides us with a single, end-to-end data management experience from ingestion to insights, it doesn't give us a full history of where the data is coming from. If we change a field, we can't trace it through from the reporting to the ETL field. Unfortunately, it's a manual process for us. Hitachi has a new product to do that, and it searches all the fields, documents, and files just to get your pipeline mapped, but we haven't bought that product yet.

KM
Data Architect at a tech services company with 1,001-5,000 employees

I would like to see support for some additional cloud sources, such as Azure and Snowflake.

ES
System Engineer at a tech services company with 11-50 employees

I would like to see better support from one version to the next, and all the more so if there are third-party elements that you are using. That's one of the differences between the Community Edition and the Enterprise Edition. 

In addition to better integration with third-party tools, what we have seen is that some of the tools just break from one version to the next and aren't supported anymore in the Community Edition. What is behind that is not really clear to us, but the result is that we can't migrate, or we have to migrate to other parts. That's the most inconvenient part of the tool.

We need to test to see if all our third-party plugins are still available in a new version. That's one of the reasons we decided we would move from the tool to the completely open-source version for the ETL part. That's one of the results of the migration hassle we have had every time.

The support for the Enterprise Edition is okay, but what they have done in the last three or four years is move more and more things to that edition. The result is that they are breaking the Community Edition. That's what our impression is.

The Enterprise Edition is okay, and there is a clear path for it. You will not use a lot of external plugins with it because, with every new version, a lot of the most popular plugins are transferred to the Enterprise Edition. But the Community Edition is almost not supported anymore. You shouldn't start in the Community Edition because, really early on, you will have to move to the Enterprise Edition. Before, you could live with and use the Community Edition for a longer time.

SK
Lead, Data and BI Architect at a financial services firm with 201-500 employees

The documentation is very basic.

The testing and quality could really improve. Every time that there is a major release, we are very nervous about what is going to get broken. We have had a lot of experience with that, as even the latest one was broken. Some basic things get broken. That doesn't look good for Hitachi at all. If there is one place I would advise them to spend some money and do some effort, it is with the quality. It is not that hard to start putting in some unit tests so basic things don't get broken when they do a new release. That just looks horrible, especially for an organization like Hitachi.

DG
Director of Software Engineering at a healthcare company with 10,001+ employees

The performance could be improved. If they could have analytics perform well on large volumes, that would be a big deal for our products.  

VM
Technical Manager at a computer software company with 51-200 employees

I don't think they market it that well. We can make suggestions for improvements, but they don't seem to take the feedback on board. This contrasts with Informatica, who are really helpful and seem to listen more to their customers' feedback. I would also really like to see improved data capture. At the moment, the emphasis seems to be on data processing. I would like to see a real-time data integration tool that provides instant reporting whenever the data changes. I'm still at an early stage with Pentaho Data Integration, but it can't really handle what I describe as "extreme data processing," i.e., when there is a huge amount of data to process. That is one area where Pentaho is still lacking.

TG
Analytics Team Leader at HealtheLink

Since Hitachi took over, I don't feel that the documentation is as good within the solution. It used to have very good help built right in. There's good documentation when you go to the site but the help function within the solution hasn't been as good since Hitachi took over.

it_user164838 - PeerSpot reviewer
CEO with 51-200 employees

There are some steps that should perform better, like the JSON input but, because of the flexibility, we at Inflow override it by using scripting steps. Of course, it's ideal to use the steps that come with the software, but being able to write your own step is powerful. Also, it would be nice to have the drivers for the data sources shipped with Pentaho Kettle, instead of having to look for the right ones on the Internet.

it_user373128 - PeerSpot reviewer
Data Architect & ETL Lead at a financial services firm with 1,001-5,000 employees

Since there have already been newer versions, maybe some of these features are already fixed now. The most troublesome missing feature was the capability to produce crosstab reports with formatting capabilities in the BI Reporting product. The one annoyance that troubled us a lot was the fact that every step in a transformation that needed data created its own data connection. With some data sources, like Greenplum, this was a problem, because they have a limit on the available number of connections.
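The per-step connection issue generalizes: without sharing, N steps open N database sessions, which collides with hard caps on engines like Greenplum. A generic pooling sketch — the factory stands in for a real driver's connect():

```python
class SharedConnectionPool:
    """Hand every step one of a few shared connections instead of its own.

    `factory` is a stand-in for a real driver's connect(); illustration only.
    """

    def __init__(self, factory, max_size=2):
        self.factory = factory
        self.max_size = max_size
        self.idle = []       # connections released by finished steps
        self.created = 0     # real sessions opened so far

    def acquire(self):
        if self.idle:
            return self.idle.pop()          # reuse before creating
        if self.created < self.max_size:
            self.created += 1
            return self.factory()           # open a new session, within cap
        raise RuntimeError("pool exhausted; step must wait, not connect anew")

    def release(self, conn):
        self.idle.append(conn)
```

Capping `created` is exactly what keeps a many-step transformation under the server's connection limit.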

it_user414117 - PeerSpot reviewer
Senior Data Engineer at a tech company with 501-1,000 employees

Its performance can be improved so it will work better with Big Data. Also, sometimes it can be very buggy which keeps away some potential users.

it_user382572 - PeerSpot reviewer
Pentaho Consultant at a comms service provider with 10,001+ employees

In the community version the scheduling tool is not good, and we had to build it ourselves.

it_user376926 - PeerSpot reviewer
Data Developer at a tech services company with 10,001+ employees

They could improve the logging generator. Sometimes the error description is so generic that it is not possible to detect the problem.

OM
IT-Services Manager & Solution Architect at Stratis

The solution needs better, higher-quality documentation, similar to AWS. Right now, we find that although documentation exists, it's not easy to find the answers we seek.

I have tried some cloud services with the ETL, so perhaps that would be good to add.

The product needs more plugins. Right now, it just has a standard database connection and there are other solutions there that can have straightforward connections for Oracle, MySQL, and stuff like that. However, more plugins would make it a much better product.

it_user402600 - PeerSpot reviewer
Senior Consultant at a financial services firm with 10,001+ employees

Support for common Hadoop utilities can be expanded, such as bulk load with composite row keys for HBase, and include drivers for Impala out-of-the-box. A richer interface to Hive could also be beneficial as we currently have to go through a raw connection and execute SQL scripts, for which some syntax is not respected.

As of version 6, there are also some new issues introduced that pose a bit of an annoyance:

1) On Kettle's ramp-up: log4j errors.

2) IBM WebSphere MQ Producer: variable substitution for the URL does not work; you have to hardcode it.

3) shared.xml for DB connections: variable substitution for connection properties does not work; you have to hardcode things like the Kerberos principal for a Hive/Impala connection.

it_user396720 - PeerSpot reviewer
Graduate Teaching Assistant with 1,001-5,000 employees

I would like to see the data visualization tool combined with BI so I can see how data is progressing through various stages. I do think that they are working on this already. I also found, in my case, that the statistical data input wasn't working (.sas7bdat input wasn't working).

VD
Specialist in Relational Databases and Nosql at a computer software company with 5,001-10,000 employees

I'm currently looking at a new competitor that has some interesting features this solution doesn't have. It has a checkpoint-and-restart capability that is not present in the Pentaho Data Integration approach: its system can keep track of the last executions and store their state, which gives you the ability to resume from the point where the previous run ended. It's very interesting. It would be nice if Pentaho had this type of feature.

Often you are required to install plugins. If you need access to, in my case, Neo4j databases, you do need a plugin for it.

it_user391695 - PeerSpot reviewer
Business Intelligence Consultant at Sanmargar Team

A big advantage, but also a problem, is that it is open source. Almost anyone can develop their own Pentaho code and release it. As a result, Pentaho is a little messy: some parts of it are super new, and some look like they were developed at the very beginning. I think the developers should stop inventing new parts for a while and take the time to clean the code and optimize the older parts. Some old plugins, after a long time, still don't work properly enough.

it_user426030 - PeerSpot reviewer
Global Consultant - Big Data, BI, Analytics, DWH & MDM at a tech consulting company with 1,001-5,000 employees

Pentaho Dashboard Designer: it needs improvement in its dashboard features. The CTools are available and help to fill the gaps, but they require developer involvement. A full-fledged dashboard designer that performs all the functions we currently do in CDE/CDF would be a great improvement for Pentaho.

Build process: an inbuilt build process would make it easier to migrate between DEV, QA, UAT, and PROD; currently, this is mostly performed manually.

Data profiling: including data profiling as part of PDI would be a great improvement to the platform and would help customers save a lot of data-quality effort and cost.

it_user384984 - PeerSpot reviewer
Sr BI Administrator at a healthcare company with 1,001-5,000 employees

PDI excels at the development part. Administration and monitoring are pretty weak and basic. But I must say I have been spoiled by the great capabilities that PowerCenter offers out-of-the-box. The Pentaho development team seems to rely very heavily on Linux/Unix for the admin part. Debugging could be enhanced with better feedback.

it_user172275 - PeerSpot reviewer
Consultant at a comms service provider with 11-50 employees

One thing that I don't like, just a little, is the backward compatibility. I have used Pentaho since version 4, and version 6 does not work with the whole ETL design. So backward compatibility is a problem.

it_user254223 - PeerSpot reviewer
Project Manager - Business Intelligence at www.datademy.es

There is no data quality or MDM solution in the Pentaho DI suite.

it_user384993 - PeerSpot reviewer
Datawarehouse Administrator at a tech services company with 501-1,000 employees

The User Console (aka the workspace) and the development of dashboards need improvement. They work, but they require some programming skills, which means continuous application management by the IT department.

it_user415695 - PeerSpot reviewer
Project Lead at a tech services company with 10,001+ employees

I have used multiple versions of this product. The initial version we were on was v3.2, and we had multiple issues with it, but currently we don't find any issue that is a blocker. In general, it would be good if we could get better performance from this product.

it_user426117 - PeerSpot reviewer
DWH Specialist at a healthcare company with 1,001-5,000 employees

The product itself is great, the biggest downside in my opinion is that it is hard to find (hire) people with expertise. Our experience with Pentaho software is that few people have the required expertise. Hiring additional resources for projects can be tough.

Our solution is to train our own people. It's definitely not hard to learn; basically, anyone with SQL knowledge and experience in another tool can learn to use Pentaho Data Integration very easily, but you might end up training them yourself.


it_user8199 - PeerSpot reviewer
BI developer - (Jaspersoft/Pentaho/Pentaho C-Tools/Kettle/Talend/Data warehouse) at a tech services company with 501-1,000 employees
  • Searching repository for reports or dashboards
  • Repository UI
  • Loading of percentage reports and dashboards
it_user392367 - PeerSpot reviewer
Research Assistant at a university with 1,001-5,000 employees

I would like to have more languages/scripts supported in user-defined classes. Right now, the options are very limited. I know that if I want to do core programming I can always import my classes/jars, but it would be really nice to have more programming-language functionality and support in UD classes/operators. Besides that, different parallel algorithms/skeletons would be great. For example, it could suggest which parallel algorithm I should use on a particular operator or a set of operators. It would be really cool to have such functionality.

it_user375219 - PeerSpot reviewer
Consultant at a tech vendor with 501-1,000 employees

The rule executor step can be improved. It has one limitation: we cannot give it a dynamic file name.

it_user369171 - PeerSpot reviewer
Brazil IT Coordinator at a transportation company with 1,001-5,000 employees

I would like to see improvements with AS400 DB2. I journaled the tables/instance, and the data migration is too slow compared with other databases.

it_user386202 - PeerSpot reviewer
Business Intelligence Supervisor at a manufacturing company with 501-1,000 employees

An easier upgrade process for the community tools would be nice. They also need to update the ad-hoc reports tool, as the one available is outdated. To get around this, we are using Excel as the output for some reports.
