Being in the business of online-to-offline ad attribution and advertising analytics, we need tools to help us analyze billions of records to discover interesting insights for our clients. One of the tools we use is Pentaho, an open source business intelligence platform that allows us to manage, transform, and explore our data. It offers some nice GUI tools, can be quickly set up on top of existing data, and has the advantage of being on our home team.
But for all the benefits of Pentaho, making it work for us has required tweaking and in some cases replacing Pentaho with other solutions. Don’t take this the wrong way: we like Pentaho, and it does a lot of what we need. But at the edges, any off-the-shelf solution often can’t do exactly what you need.
Perhaps the biggest problem we faced was getting queries against our cubes to run quickly. Because Pentaho is built around Mondrian, and Mondrian is a ROLAP, every query against our cubes requires building dozens of queries that join tables with billions of rows. In some cases this meant that Mondrian queries could require hours to run. Our fix has been to make extensive use of summary tables, i.e. summarizing counts of raw data at levels we know our cubes will need to execute queries. This has allowed us to take queries that ran in hours to run in seconds by doing the summarization for all queries once in advance. At worst our Mondrian queries can take a couple minutes to complete if we ask for really complicated things.
Early on, we tried to extend our internal use of Pentaho to our clients by using Action Sequences, also known as xactions after the Action Sequence file extension. Our primary use of xactions was to create simple interfaces for getting the results of Mondrian queries that could then be displayed to clients in our Rails web application. But in addition to sometimes slow Mondrian queries (in the world of client-facing solutions, even 15 seconds is extremely slow), xactions introduce considerable latency as they start up and execute, adding as much as 5 seconds on top of the time it takes to execute the query.
Ultimately we couldn’t make xactions fast enough to deliver data to the client interface, so we instead took the approach we use today. We first discover what is useful in Pentaho internally, then build solutions that query directly against our RDBMS to quickly deliver results to clients. Although, to be fair to Mondiran, some of these solutions require us to summarize data in advance of user requests to get the speed we want because that data is just that big and the queries are just that complex.
We’ve also made extensive use of Pentaho Data Integration, also known as Kettle. One of the nice features about Kettle is Spoon, a GUI editor for writing Kettle jobs and transforms. Spoon made it easy for us to set up ETL processes in Kettle and take advantage of Kettle’s ability to easily spread load across processing resources. The tradeoff, as we soon learned, was that Spoon makes the XML descriptions of Kettle jobs and transforms difficult to work on concurrently, a major problem for us since we use distributed version control. Additionally, Kettle files don’t have a really good, general way of reusing code short of writing custom Kettle steps in Java, so it makes maintaining our large collection of Kettle jobs and transforms difficult. On the whole, Kettle was great for getting things up and running quickly, but over time we find its rapid development advantages are outweighed by the advantages of using a general programming language for our ETL. The result is that we are slowly transitioning to writing ETL in Ruby, but only transitioning 0n an as-needed basis since our existing Kettle code works well.
As we move forward, we may find additional places where Pentaho does not fully meet our needs and we must find other solutions to our unique problems. But on the whole, Pentaho has proven to be a great starting platform for getting our analytics up and running and has allowed us to iteratively build out our technologies without needing to develop custom solutions from scratch for everything we do. And, I expect, Pentaho will long have a place at our company as an internal tool for initial development of services we will offer to our clients.