Pentaho BI Suite Review: Pentaho Analytics – Part 4 of 6


Introduction

This is the fourth of a six-part review of the Pentaho BI suite. In each part of the review, we will take a look at the components that make up the BI suite, according to how they would be used in the real world.

In this fourth part, we'll be discussing the Pentaho Analytics tools and facilities, which provide the ability to view and “slice and dice” data across multiple dimensions. This feature set is the one most associated with the term “Business Intelligence” because of its usefulness in aiding cross-data-domain decision-making. Any decent BI suite has at least one facility with which users can perform data analysis.

One important note, specific to Pentaho: the Analytics toolset is where the real advantage of the Enterprise Edition (EE) over the Community Edition (CE) starts to show through, beyond the much more polished UI.

In the Pentaho BI Suite, we have these analytics tools:

  1. Saiku Analytics (in EE this is called “Analysis Report”) – A tool built into the Pentaho User Console (PUC) that utilizes the available analysis models. Do not confuse this with Saiku Reporting.
  2. Pentaho Model Data Source – In part three of the review, we discussed using this facility to create data models for ad-hoc reporting. Its second use is to create an OLAP “cube” for use with the Saiku Analytics tool. Once this is set up by the data personnel, data owners can use it to generate analytic reports.
  3. Schema Workbench – A separate program that allows for handcrafting OLAP cube schemas. Proficiency with the MDX query language is not necessary but can come in handy in certain situations.

As usual, we'll discuss each of these components individually. The screenshots below are sanitized, and no real data is represented. A fictitious company called “DonutWorld” is used to illustrate and relate the concepts.

Saiku Analytics (Analysis Report in EE)

One of the benefits of having a Data Warehouse is being able to model existing data in a structure that is conducive to analysis. If we try to feed a tool such as this from a heavily normalized transactional database, we are inviting two problems:

1. We will be forced to do complex joins, which manifest themselves as performance hits and as difficulty when business rules change.

2. We lose the ability to apply non-transactional business rules to the data at a layer close to the rule maintainers (typically those who work closely with the business decision-makers).

Therefore, to use this tool effectively, we need to think in terms of what questions need to be answered, then work our way backwards, employing data personnel to create a suitable model for those questions. Coincidentally, this process of modeling data for reporting is a big part of building a Data Warehouse.

Learning experience: Those who are familiar with MS Excel (or LibreOffice) Pivot Tables will be at home with this tool. Basically, as the model allows, we design the view or report by assigning dimensions to columns and rows, and then assigning measures to define what kind of numbers we expect to see. We discuss below what 'dimension' and 'measure' mean in this context, but for an in-depth treatment, we recommend consulting your data personnel.
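
Under the hood, a design like this translates into an MDX query against the cube. As a rough, hypothetical sketch (the cube and member names are made up for DonutWorld):

    -- Hypothetical MDX for a pivot-style view: years across the
    -- columns, sales regions down the rows, sliced by one measure.
    SELECT
      {[Time].[Year].Members} ON COLUMNS,
      {[Region].[Region].Members} ON ROWS
    FROM [DonutSales]
    WHERE ([Measures].[Units Sold])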

Usage experience: The EE version of this tool has a clearer interface as far as where to drop dimensions and measures, but the CE version is usable once we are accustomed to how it works. Another point for the EE version (version 5.0) is the ability to generate total sums in both the row and column directions, and a much more usable Excel export.

Figure 1. The EE version of the Analysis Report (Saiku Analytics in CE)

Pentaho Model Data Source

The Data Source facility is accessible from within the PUC. As described in Part 3 of this review, once you have logged in, look for a section on the screen that allows you to create or manage existing data sources.

Here we are focusing on using this feature to set up “cubes” instead of “models.” This is something your data personnel should be familiar with, guided by the business questions that need answering.

Unlike the “model,” a “cube” is not flat; rather, it consists of multiple dimensions that determine how the measures are aggregated. From these “cubes,” non-technical users can create reports, designing them just as they would Pivot Tables. The most useful aspect of this tool is that it abstracts the construction of an OLAP cube schema down to its core concepts. For example, given a fact table, this tool will try to generate an OLAP cube schema. For the most part, it does a good job, in the sense that the cube is immediately usable to generate Analysis Reports.
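
As an illustration, a fact table that this facility could turn into a cube might look like the following; the table and column names are hypothetical DonutWorld examples:

    -- Hypothetical star-schema fact table: the foreign keys point to
    -- dimension tables, and the numeric columns become candidate measures.
    CREATE TABLE bi_fact_donut_sales (
        date_id     INTEGER       NOT NULL,  -- references bi_dim_date
        store_id    INTEGER       NOT NULL,  -- references bi_dim_store
        product_id  INTEGER       NOT NULL,  -- references bi_dim_product
        units_sold  INTEGER       NOT NULL,  -- measure (e.g., sum)
        sale_amount DECIMAL(12,2) NOT NULL   -- measure (e.g., sum, average)
    );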

This tool also hides the distinction between Hierarchies and Levels of dimensions. For the most part, you can do a lot with just one Level anyway, so this is easier to grasp for beginners in OLAP schema design.

Learning experience: The data personnel must be 1) familiar with the BI table structures, or at the very least able to pinpoint which tables are facts and which are dimensions; and 2) comfortable with designing OLAP dimensions and measures. Data owners must be familiar with the structure and usage of the data. The combined efforts of these two roles are the building blocks of a workflow/process.

Usage experience: Using the workflow/process defined above, an organization will generate a collection of OLAP cubes that can be used to analyze the business data with increasing accuracy and usefulness. The most important consideration from the business standpoint is that all of this takes time to materialize. The incorrect attitude here would be to expect instant results, which will not transpire unless the dataset is overly simplistic.

Figure 2. Creating a model out of a SQL query
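
The SQL behind such a model typically flattens the relevant tables into a single wide result set. A hypothetical DonutWorld example, reusing the fact table sketched earlier:

    -- Hypothetical model query: join the fact table to its dimensions
    -- so the result reads like one flat, report-friendly table.
    SELECT d.year, d.month, s.region, p.product_name,
           f.units_sold, f.sale_amount
    FROM bi_fact_donut_sales f
    JOIN bi_dim_date d    ON d.date_id    = f.date_id
    JOIN bi_dim_store s   ON s.store_id   = f.store_id
    JOIN bi_dim_product p ON p.product_id = f.product_id;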

NOTE: Again, this is where the maturity level of the Data Warehouse is tested. For example, a DW with sufficient maturity will notify the data personnel of any data model changes, which will trigger an update of the OLAP cube, which in turn may or may not affect the created reports and dashboards.

If the DW is designed correctly, there should be quite a few fact tables that can readily be used in the OLAP cube.

Schema Workbench

The Schema Workbench is for those who need to create a custom OLAP schema that cannot be generated via the Data Source facility in the PUC. Usually this involves complicated measure definitions, multi-Hierarchy or multi-Level dimensions, or the need to evaluate and optimize MDX queries.
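
What the Workbench ultimately produces is a Mondrian schema: an XML document describing the cube. A minimal sketch, again using the hypothetical DonutWorld table and column names from above:

    <!-- Minimal sketch of a Mondrian cube schema; all names are
         hypothetical. The Workbench edits documents like this one. -->
    <Schema name="DonutWorld">
      <Cube name="DonutSales">
        <Table name="bi_fact_donut_sales"/>
        <Dimension name="Time" foreignKey="date_id">
          <Hierarchy hasAll="true" primaryKey="date_id">
            <Table name="bi_dim_date"/>
            <Level name="Year" column="year" type="Numeric" uniqueMembers="true"/>
            <Level name="Month" column="month" uniqueMembers="false"/>
          </Hierarchy>
        </Dimension>
        <Measure name="Units Sold" column="units_sold" aggregator="sum"/>
        <Measure name="Sale Amount" column="sale_amount" aggregator="sum"
                 formatString="#,##0.00"/>
      </Cube>
    </Schema>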

NOTE: In the 5.0 version of the PUC, we can import existing MDX queries into the Data Source Model, making them available to the Analysis Report (or the Saiku ad-hoc report in the CE version). As can be seen in the screenshot below, the program is quite complex, with numerous features for handcrafting an OLAP cube schema.

Once a schema is validated in the Workbench, we need to publish it. Using the password defined in pentaho-solutions/system/publisher_config.xml, the Workbench will prompt for the location of the cube within the BI Server and its displayed name. From that point, it will be available in the drop-down list at the top left of the Saiku Analytics tool.
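
For reference, the publisher password file is a small XML document; the password element must be set to a non-empty value before publishing will work (placeholder value shown):

    <!-- pentaho-solutions/system/publisher_config.xml -->
    <publisher-config>
        <publisher-password>your-publish-password</publisher-password>
    </publisher-config>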

Figure 3. A Saiku report in progress

OLAP Cube Schema Considerations

Start by defining the fact table (bi_convection in the example above), then define the dimensions and measures.

We have been talking about these concepts of dimension and measure. Let's briefly define them:

  1. A dimension is a way to view existing business data. For instance, a single figure such as a sales number can be viewed from multiple perspectives: per sales region, per salesperson or department, or chronologically. Using aggregation functions such as sum, average, min/max, standard deviation, etc., we can come up with different numbers that show the data in a manner we can draw conclusions from.
  2. A measure is a number or count of business data that can provide an indication of how the business is doing. For a shoe manufacturing company, the number of shoes sold is obviously one very important measure; another would be the average price of sold shoes. Combined with dimensions, we can use the measures to make business decisions, as sketched below.
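
To make the pairing concrete, here is a hypothetical SQL rendering of one measure (units sold) viewed along two dimensions (region and year), reusing the DonutWorld tables sketched earlier:

    -- One measure aggregated along two dimensions; an OLAP cube
    -- answers this kind of question without hand-written SQL.
    SELECT s.region, d.year, SUM(f.units_sold) AS total_units
    FROM bi_fact_donut_sales f
    JOIN bi_dim_store s ON s.store_id = f.store_id
    JOIN bi_dim_date d  ON d.date_id  = f.date_id
    GROUP BY s.region, d.year;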

In the Schema Workbench, as you assign existing BI table fields to the proper dimensions, the Workbench validates that the fields are accessible over the existing database connection, then creates a view of the measures using a user-configurable way of aggregating the numbers.

In the creation of an OLAP cube schema, there is a special dimension that enables us to see data chronologically. Due to its universal nature, this dimension is a good one to start with. The time dimension is typically served by a special BI table that contains a flat list of rows with time and date information at the needed granularity (some businesses require seconds; others days, or even weeks or months).
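
A hypothetical, day-grained time dimension table might look like this, with one pre-populated row per calendar day:

    -- Hypothetical day-grained time dimension: one row per day,
    -- carrying the attributes the cube's Levels will map onto.
    CREATE TABLE bi_dim_date (
        date_id     INTEGER PRIMARY KEY,
        full_date   DATE    NOT NULL,
        year        INTEGER NOT NULL,
        quarter     INTEGER NOT NULL,  -- 1..4
        month       INTEGER NOT NULL,  -- 1..12
        day_of_week INTEGER NOT NULL   -- 1 = Monday .. 7 = Sunday
    );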

TIP: Measures can be defined using the “case when” SQL construct, which opens up a whole other level of flexibility.
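
For example, a hypothetical “weekend units” measure could be derived this way (shown here as a plain query; in a cube, the CASE expression would go into the measure's column definition):

    -- Hypothetical measure: count only units sold on weekends,
    -- using CASE WHEN inside the aggregation.
    SELECT SUM(CASE WHEN d.day_of_week IN (6, 7)
                    THEN f.units_sold ELSE 0 END) AS weekend_units
    FROM bi_fact_donut_sales f
    JOIN bi_dim_date d ON d.date_id = f.date_id;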

When should we use MDX vs SQL?

The MDX query language, with powerful concepts like ParallelPeriod, is suitable for generating tabular data containing aggregated data that is useful for comparison purposes.
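
For instance, a hypothetical comparison of each month's sales against the same month a year earlier:

    -- Hypothetical MDX: a calculated member fetches the value from
    -- the parallel period one year back for side-by-side comparison.
    WITH MEMBER [Measures].[Units Sold LY] AS
        ([Measures].[Units Sold],
         ParallelPeriod([Time].[Year], 1, [Time].CurrentMember))
    SELECT
        {[Measures].[Units Sold], [Measures].[Units Sold LY]} ON COLUMNS,
        {[Time].[Month].Members} ON ROWS
    FROM [DonutSales]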

True to its intended purpose, MDX allows for querying data that is presented in a multi-dimensional fashion, while SQL is easier to grasp and has a wider base of users and experts in any industry.

In reality, we use these two languages at different levels; the key is to be comfortable with both and to discover the cases where one makes more sense than the other.

NOTE: The Mondrian engine is capable, but without judicious use of database indexing, query performance can easily crawl into minutes instead of seconds. This is where data personnel with database tuning experience would be extremely helpful.
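
As a starting point (hypothetical names again), indexing the fact table's foreign key columns is usually the first thing to try, since the cube's dimension joins hit them on every query:

    -- Hypothetical first-pass tuning: index the fact table's
    -- foreign keys used by the dimension joins.
    CREATE INDEX idx_donut_sales_date    ON bi_fact_donut_sales (date_id);
    CREATE INDEX idx_donut_sales_store   ON bi_fact_donut_sales (store_id);
    CREATE INDEX idx_donut_sales_product ON bi_fact_donut_sales (product_id);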

Summary

The analytics tools in the Pentaho BI Suite are quite comprehensive, and certainly better than some of the competing tools out there. The analytic reports are made available on the Pentaho User Console (PUC), where users log in and initiate report generation. There are three facilities available:

The Analysis Report (or Saiku Analytics in CE version) is a good tool for building reports that look into an existing OLAP cube and do the “slicing and dicing” of data.

The Data Source facility can also be used to create OLAP cubes from existing BI tables in the DW. A good use of this facility is to build a collection of OLAP cubes to answer business questions.

The Schema Workbench is a standalone tool which allows for handcrafting custom OLAP cube schemas. This tool is handy for complicated measure definitions and multilevel dimensions. It is also a good MDX query builder and evaluator.

Next, in part five, we will discuss the Pentaho Dashboard design tools.

Disclosure: I am a real user, and this review is based on my own experience and opinions.