Microsoft BI Review
My 30 tips for building a Microsoft BI solution, Part II: Tips 6-10


# 6: Use a framework for your Integration Services solution(s) because data is evil

I know how it is. You may have started your ETL project using the SQL Server import / export wizard or you may have done a point integration of a couple of tables through data tools. You might even have built an entire solution from the ground up and been pretty sure that you thought of everything. You most likely have not. Data is a tricky thing. So tricky in fact that I over the years have built up an almost paranoid distrust against it. The only sure thing I can say is that it will change (both intentionally and unintentionally) over time and your meticulously crafted solution will fail. Best case scenario is that it simply will stop working. Worst case scenario is that this error / these errors have not caused a failure technically but have done faulty insert / update / delete operations against your data warehouse for months. This is not discovered until you have a very angry business manager on the line who has been doing erroneous reporting up the corporate chain for months. This is the most likely scenario. A good framework should have functionality for recording data lineage (what has changed) and the ability to gracefully handle technical errors. It won’t prevent these kinds of errors from happening but it will help you recover from them a lot faster. For inspiration read The Data Warehouse ETL Toolkit.

#7: Use a framework for your Integration Services solution(s) to maintain control and boost productivity

Integration Services is a powerful ETL tool that can handle almost any data integration challenge you throw at it. To achieve this it has to be very flexible. Like many of Microsoft’s products its very developer oriented. The issue with this is that there are as many ways of solving a problem as there are Business Intelligence consultants on a project. By implementing a SSIS framework (and sticking with it!) you ensure that the solution handles similar problems in similar ways. So when the lead developer gets hit by that bus you can put another consultant on the project who only needs to be trained on the framework to be productive. A framework will also boost productivity. The up-front effort of coding it, setting it up and forcing your team to use it is dwarfed by the benefits of templates, code reuse and shared functionality. Again, read The Data Warehouse ETL Toolkit for inspiration.

#8: Test and retest your calculations.

Come into the habit of testing your MDX and DAX calculations as soon as possible. Ideally this should happen as soon as you finish a calculation, scope statement, etc. Both MDX and DAX get complicated really fast and unless you are a Chris Webb you will loose track pretty quickly of dependencies and why numbers turn out as they do. Test your statements in isolation and the solution as a whole and verify that everything works correctly. Also these things can have a severe performance impact so remember to clear the analysis services cache and do before and after testing (even if you have cache warmer). Note that clearing the cache means different things to tabular and dimensional as outlined here.

#9: Partition your data and align it from the ground up.

Note that you need the enterprise version of SQL Server for most of this. If you have large data sets you should design your solution from the ground up to utilize partitioning. You will see dramatic performance benefits from aligning your partitions all the way from your SSIS process to your Analysis Services cubes / tabular models. Alignment means that if you partition your relational fact table by month and year, you should do the same for your analysis services measure group / tabular table. Your SSIS solution should also be partition-aware to maximize its throughput by exploiting your partitioning scheme.

#10: Avoid using the built-in Excel provider in Integration Services.

I feel a bit sorry for the Excel provider. It knows that people seeing it will think “Obviously I can integrate Excel data with my SSIS solution, its a MS product and MS knows that much of our data is in Excel”. The problem is that Excel files are inherently unstructured. So for all but the simplest Excel workbooks the provider will struggle to figure out what data to read. Work around this by either exporting your Excel data to flat files or look at some third party providers.

Disclosure: I am a real user, and this review is based on my own experience and opinions.

1 Comment

it_user4014ConsultantPOPULAR

Hi Peter !

Let's discuss from point 6 to 10 in here;

#6: I totally agree with you, never trust or undermine the fact that data will be coming in the format as suggested by the ETL Team. There is always a possibility of wrong data types, bad data, switched data, all kind of data to be appear as source data, so as a ETL developer you need to make sure you put data validation checks for each and every case you have in mind. Still you might miss out some cases. The good thing about MS SQL Server 2012 is now they have provided the TRY_CAST function which can be used to avoid casting errors. A craftily designed framework would be handy to have where ETL developers need to know about the framework, so invest on building a framework which can be used across multiple ETL projects. I strongly agreed with your point that data is evil and sometimes is such hard to load single files which have all kind of these bad data validation errors.

#7: Definitely, by having a framework you can save time by not spending your time on writing same piece of code again and again. While designing your ETL, please beware of the data types which you are using, for some people there is slight difference between Float, Decimal & Numeric data type but if you have been writing ETL solutions you know what kind of a mess it would create if you don\t pick up the right data type, same for Date & DateTime data types.

#8: MDX calculation needs to be tested again and again which is called regression testing. All these years i have been building end to end BI solutions, which involves writing complex ETL's, it is like impossible for QA agents to identify the problem in calculations, so while you assign someone task of verifying MDX calculation or just verifying the BI Dashboard output, make sure he has enough knowledge of Data Analysis. He would be proficient enough to query the database and be able to browse the Cube and also perform cross Data Verification. As a BI Consultant I invest much time in training my QA agents to be able to perform this regression testing.

#9: Partition is always a good practice when you are sure that data influx might going to be run into billions of rows. But if you are designing a BI Solution for an organization which might not have this big amount of data under Analysis then you may avoid partitioning.

#10: Strongly recommended, built in Excel provider is going to make you crazy really soon by having it own data type sensing ability, although you can try to turn it off by setting the property of Type Guess = 0, but there are so many problem with excel provider it always sense the data types for each source column.

One thing I need to mention, is carefully designed ETL with customized logging process can save you tons of time while analyzing the cause of data failure. And it's always good to have the ETL logging process which can be shared with your client as well.

Regards,
Hasham Niaz

27 July 13
Guest
Why do you like it?

Sign Up with Email