Pentaho Review

It does a lot of what we need, but off-the-shelf solutions often can’t do exactly what you need


Being in the business of online-to-offline ad attribution and advertising analytics, we need tools to help us analyze billions of records to discover interesting insights for our clients. One of the tools we use is Pentaho, an open source business intelligence platform that allows us to manage, transform, and explore our data. It offers some nice GUI tools, can be quickly set up on top of existing data, and has the advantage of being on our home team.

But for all the benefits of Pentaho, making it work for us has required tweaking and, in some cases, replacing Pentaho with other solutions. Don’t take this the wrong way: we like Pentaho, and it does a lot of what we need. But at the edges, any off-the-shelf solution is unlikely to do exactly what you need.

Perhaps the biggest problem we faced was getting queries against our cubes to run quickly. Because Pentaho is built around Mondrian, and Mondrian is a ROLAP engine, every query against our cubes requires building dozens of SQL queries that join tables with billions of rows. In some cases a Mondrian query could take hours to run. Our fix has been to make extensive use of summary tables, i.e. summarizing counts of the raw data at the levels we know our cubes will need. Because the summarization is done once, in advance, for all queries, queries that used to take hours now run in seconds. At worst, a Mondrian query takes a couple of minutes to complete when we ask for something really complicated.
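To illustrate the idea, here is a minimal sketch of such a pre-aggregation step in Python with SQLite; the table and column names (impressions, impressions_daily, and so on) are made up for illustration and are not our actual schema.

    import sqlite3

    # Minimal sketch of pre-aggregating a large fact table into a summary table.
    # Table and column names are illustrative only, not a real schema.
    conn = sqlite3.connect("warehouse.db")

    # Build the summary once, ahead of user queries, at the grain the cube needs:
    # one row per campaign per day instead of one row per raw impression.
    conn.executescript("""
        DROP TABLE IF EXISTS impressions_daily;
        CREATE TABLE impressions_daily AS
        SELECT campaign_id,
               DATE(event_time) AS event_date,
               COUNT(*)         AS impression_count
        FROM impressions
        GROUP BY campaign_id, DATE(event_time);
    """)
    conn.commit()

    # Cube-style queries then hit the small summary table instead of billions of raw rows.
    for row in conn.execute(
        "SELECT event_date, SUM(impression_count) FROM impressions_daily GROUP BY event_date"
    ):
        print(row)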

Early on, we tried to extend our internal use of Pentaho to our clients by using Action Sequences, also known as xactions after the .xaction file extension. Our primary use of xactions was to create simple interfaces for getting the results of Mondrian queries, which could then be displayed to clients in our Rails web application. But in addition to the sometimes slow Mondrian queries (in the world of client-facing solutions, even 15 seconds is extremely slow), xactions introduce considerable latency as they start up and execute, adding as much as 5 seconds on top of the time it takes to run the query itself.

Ultimately we couldn’t make xactions fast enough to deliver data to the client interface, so we instead took the approach we use today: we first discover what is useful in Pentaho internally, then build solutions that query directly against our RDBMS to deliver results to clients quickly. To be fair to Mondrian, some of these solutions still require us to summarize data ahead of user requests to get the speed we want; the data is just that big and the queries are just that complex.

We’ve also made extensive use of Pentaho Data Integration, also known as Kettle. One of Kettle’s nicest features is Spoon, a GUI editor for writing Kettle jobs and transforms. Spoon made it easy for us to set up ETL processes in Kettle and to take advantage of Kettle’s ability to spread load across processing resources. The tradeoff, as we soon learned, is that Spoon produces XML descriptions of jobs and transforms that are difficult to work on concurrently, a major problem for us since we use distributed version control. Additionally, Kettle offers no good, general way of reusing code short of writing custom Kettle steps in Java, which makes maintaining our large collection of jobs and transforms difficult. On the whole, Kettle was great for getting things up and running quickly, but over time we have found its rapid-development advantages outweighed by the advantages of using a general-purpose programming language for our ETL. As a result we are slowly transitioning to writing our ETL in Ruby, but only on an as-needed basis, since our existing Kettle code works well.
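To give a sense of what a general-purpose language buys us, here is a rough sketch of a reusable transform step; it is written in Python purely for illustration (our actual target is Ruby), and the file and field names are hypothetical. The point is that shared logic becomes an ordinary importable function that lives happily in version control, rather than something copied into each Kettle transform or wrapped as a custom Java step.

    import csv
    from typing import Iterable, Iterator

    def normalize_phone(raw: str) -> str:
        # Shared transform logic: keep only the digits of a phone number.
        return "".join(ch for ch in raw if ch.isdigit())

    def transform(rows: Iterable[dict]) -> Iterator[dict]:
        # One reusable ETL step: dict rows in, cleaned dict rows out.
        for row in rows:
            if "phone" in row:
                row["phone"] = normalize_phone(row["phone"] or "")
            yield row

    if __name__ == "__main__":
        # Hypothetical file names, for illustration only.
        with open("leads_raw.csv", newline="") as src, \
             open("leads_clean.csv", "w", newline="") as dst:
            reader = csv.DictReader(src)
            writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
            writer.writeheader()
            writer.writerows(transform(reader))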

As we move forward, we may find additional places where Pentaho does not fully meet our needs and we must find other solutions to our unique problems. But on the whole, Pentaho has proven to be a great starting platform for getting our analytics up and running and has allowed us to iteratively build out our technologies without needing to develop custom solutions from scratch for everything we do. And, I expect, Pentaho will long have a place at our company as an internal tool for initial development of services we will offer to our clients.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
7 Comments
BI Expert with 51-200 employees (Vendor)

Hi Gordon,

Thank you for the great review. I am currently choosing between Pentaho Kettle and a general programming language to build our ETL in Hadoop. After initial investigation, I am also concerned about the flexibility and maintainability of an off-the-shelf ETL solution for building large and complex ETL processes down the road. We are building an ETL PoC with Python. So far it seems much more flexible for what we need to do; however, it takes a lot more time to get things up and running and requires more programming expertise.

Your post is very helpful.

Thank you!

11 January 14
Owner with 51-200 employees (Vendor)

We frequently call Python scripts from within Kettle. So we have the best of both worlds.
Kettle's ETL scripts are no harder to maintain than source code. They are just text files that can easily be maintained within a version control system. The benefit is the visual nature of the ETL written in Kettle.
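As a rough sketch (not our production setup; the invocation details and field names are assumptions), a standalone Python script invoked from a Kettle job, for example through a shell job entry, can simply read rows on stdin and write rows on stdout:

    #!/usr/bin/env python3
    # Sketch of a standalone script that a Kettle job might invoke externally.
    # The CSV exchange format and the field names are illustrative assumptions.
    import csv
    import sys

    def main() -> int:
        reader = csv.DictReader(sys.stdin)
        writer = csv.DictWriter(sys.stdout, fieldnames=["campaign_id", "clicks"])
        writer.writeheader()
        for row in reader:
            # Example transformation: pass through only rows with at least one click.
            if int(row.get("clicks") or 0) > 0:
                writer.writerow({"campaign_id": row["campaign_id"],
                                 "clicks": row["clicks"]})
        return 0

    if __name__ == "__main__":
        sys.exit(main())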

14 February 14
Consultant

Does the ETL portion of Pentaho support some kind of function building? As a simple example, if you are building a solution to run against both Oracle and SQL Server, could you build a function that runs code based on the technology it is running against, so that you would not have to maintain two separate solutions?

03 March 14
Owner with 51-200 employees (Vendor)

Do you mean something like an ORM layer that would allow your code to be database agnostic?
Pentaho ETL could run custom Java code (implemented as a user-defined Step) that has access to an ORM layer.
We've done something similar when we needed to access OpenERP (before the OpenERP steps became available in the new Pentaho 5.0).

03 March 14
Consultant

Thanks for your reply.
ORM? Kinda sorta. In some ETL products, like Oracle Data Integrator, you can define code that runs based on the technology you are running against. So, for example, you could define a function called anyTrim() that runs TRIM(field) against Oracle sources and LTRIM(RTRIM(field)) against SQL Server sources. If you then have to run against DB2, you can just augment the function with that technology’s code, and at run time the ETL runs the appropriate code for the technology. This is very handy for date functions, since each database seems to want to handle dates with its own set of non-standard functions, and very handy for keeping one set of implementation code that can hit any technology we want.
I don’t want to have to go to Java though. Ick.
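In rough pseudocode terms, this is the kind of dispatch I mean; the sketch below is Python with made-up dialect templates, just to show one logical function choosing the right SQL per technology:

    # Sketch of technology-based dispatch: one logical trim function,
    # per-database SQL templates. Dialect keys and the DB2 variant are
    # illustrative assumptions.
    TRIM_TEMPLATES = {
        "oracle":    "TRIM({col})",
        "sqlserver": "LTRIM(RTRIM({col}))",
        "db2":       "STRIP({col})",
    }

    def any_trim(col: str, technology: str) -> str:
        # Return the trim expression appropriate for the target database.
        if technology not in TRIM_TEMPLATES:
            raise ValueError("No trim template defined for " + technology)
        return TRIM_TEMPLATES[technology].format(col=col)

    # The same logical query rendered for two technologies.
    for tech in ("oracle", "sqlserver"):
        print("-- " + tech)
        print("SELECT " + any_trim("customer_name", tech) + " FROM customers")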

03 March 14
Owner with 51-200 employees (Vendor)

I see what you mean now.
One of the ways you can do this is by specifying a user-defined function in the Modified Javascript Value step.
Pentaho uses JavaScript as the scripting language there.
You can define the function once within a tab that is marked as Start only (executed only once instead of once per data stream row).

Of course, to get better performance for more complex user-defined functions, the custom Java step would be preferable, or you could define anyTrim() as a UDF within Oracle and MS SQL Server so that you can use it straight in a query.

04 March 14
User at a financial services firm (Real User)

Have you looked into using Talend? It has a great user interface, very similar to Kettle, and the paid version has version control that works very well, plus the ability to run "joblets", which are basically reusable pieces of code. The free version also has version control, although it's pretty clumsy, has no joblets, and is difficult to get working with GitHub.

22 April 14