Data profiling and data quality control
Improvements to My Organization:
I worked on a project to enable impact analysis for changes made in a source system. Using the data profiling capability of Information Analyzer (IA), we redirected the profiling results into an external database container and built several BI reports on top of it to monitor data quality.
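A monitoring report of this kind could be backed by a query like the one sketched below. This is only an illustration, not the schema or code used in the project: the `profile_results` table, its columns, the sample rows, and the function names are all hypothetical, and an in-memory SQLite database stands in for the external database.

```python
import sqlite3

# Hypothetical layout: the real table and column names depend on how the
# profiling results were exported from IA into the external database.
SCHEMA = """
CREATE TABLE profile_results (
    table_name   TEXT,
    column_name  TEXT,
    row_count    INTEGER,
    null_count   INTEGER
)
"""


def columns_over_null_threshold(conn, max_null_ratio):
    """Return (table, column, null_ratio) rows whose null ratio exceeds the threshold."""
    return conn.execute(
        """
        SELECT table_name,
               column_name,
               1.0 * null_count / row_count AS null_ratio
        FROM profile_results
        WHERE row_count > 0
          AND 1.0 * null_count / row_count > ?
        ORDER BY null_ratio DESC
        """,
        (max_null_ratio,),
    ).fetchall()


def build_demo_db():
    """Create an in-memory database with invented sample rows for the demo."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany(
        "INSERT INTO profile_results VALUES (?, ?, ?, ?)",
        [
            ("CUSTOMER", "EMAIL", 1000, 250),
            ("CUSTOMER", "ID", 1000, 0),
            ("ORDERS", "SHIP_DATE", 500, 25),
        ],
    )
    return conn


if __name__ == "__main__":
    conn = build_demo_db()
    # Flag every column where more than 10% of the values are null.
    for table, column, ratio in columns_over_null_threshold(conn, 0.10):
        print(f"{table}.{column}: {ratio:.1%} null")
```

Keeping the threshold as a parameter lets the same query serve several reports with different tolerances.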
Room for Improvement:
The product could be improved in the following areas:
- Make the IA database more visible, so that analysis results and the output of data rule executables can easily be read by external programs.
- Enhance the integration between DataStage and Information Analyzer. The Data Rule stage, introduced in v8.7, has made IA data rules much easier to use. I would like to see a similar stage or solution for data profiling.
- Support more CLIs, or extend the existing IAAdmin.sh and IAJob.sh scripts, to give end users more flexibility.
- Provide more log detail when running data profiling or data rules directly from the console. As of v9.1, troubleshooting is still painful: developers have to look in multiple places, including the underlying DataStage job log and the log files on both the server and the client side.
Use of Solution:
Yes. When I tried to deploy some data rules to production, the package had to be placed on the server. It would be better if IA supported packages located on the machine where the client software is installed.
Yes. When running a data rule against tables containing large volumes of data, the IA database does not perform well.
The setup itself is not complex, because IA is one of the Information Server components. However, the post-installation configuration is not simple for end users: they have to set up the IA database manually from the provided scripts, and then configure the engine and IA database connections from the console.
Information Analyzer is still a good product. It gives end users automated data profiling, and it can also perform data quality analysis through rules written in business-oriented language. As a component of the Information Server suite, IA also benefits from the suite's strengths, such as shared metadata and a powerful parallel engine. Although some features need to be enhanced, I would still recommend IA as a product for data profiling and data quality control.