Documentation has excellent potential: the tool can generate data lineage and a complete set of rich documentation, with output in multiple formats.
The tool supports multiple target update methods.
Support for multiple design methodologies, including Data Vault, is impressive, although we didn't test that feature.
Improvements to My Organization:
It did not pass our usability, reliability, sustainability, or performance tests, so there was no potential benefit other than minor documentation advantages over our current automated documentation solution.
Room for Improvement:
In my opinion, the tool requires a different architecture to become a practical solution.
- Repository only supports a single target database
- No support for change data capture or delta detection!
- Must itself be used to stage data; otherwise it requires a redundant copy of staged data
- Relies 100% upon the target database server, with no ability to scale ETL out to a separate server
- Relies 100% upon generated procedural SQL (T-SQL, PL/SQL, etc.), so it inherits the performance limitations of the enforced use of cursors (see the sketch after this list)
- Relies 100% on the database engine to do all the work, with no ability to offload any of it
- External tools such as SSIS can be used to acquire data into the staging area, but they run in the background, so you lose some of the visibility those tools normally provide. You also lose the multitude of add-ons, such as adapters, that some ETL tools offer.
- It cannot actually acquire data from sources other than flat files. It supports linked servers (which many shops don't allow) or calling an ETL tool such as SSIS in the background for data movement. In that case you're still relying on an ETL tool for the ELT portion, so you're actually using two different tools - WhereScape and SSIS, etc. - instead of just one. Realize that instead of supporting just an ETL tool, you now have to support WhereScape and an ETL tool.
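To make the cursor point concrete, here is a minimal sketch of the two patterns. The table and column names are invented for illustration, and this is not actual WhereScape output; the point is that row-by-row cursor code denies the engine the chance to run the work as one set-based operation.

  -- Row-by-row pattern typical of generated procedural code (illustrative names):
  DECLARE @key INT, @amt MONEY;
  DECLARE cur CURSOR FOR SELECT order_key, amount FROM stage_orders;
  OPEN cur;
  FETCH NEXT FROM cur INTO @key, @amt;
  WHILE @@FETCH_STATUS = 0
  BEGIN
      UPDATE fact_orders SET amount = @amt WHERE order_key = @key;  -- one statement per row
      FETCH NEXT FROM cur INTO @key, @amt;
  END
  CLOSE cur;
  DEALLOCATE cur;

  -- The equivalent set-based statement the optimizer can handle as a single operation:
  UPDATE f
  SET    f.amount = s.amount
  FROM   fact_orders AS f
  JOIN   stage_orders AS s ON s.order_key = f.order_key;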
Specific list of areas for improvement:
* Performance (this is the area that needs the most improvement)
* Efficient storage utilization
* Automatic record count validations (data movements don't validate counts, which is a 101-level basic for preventing data leaks; see the sketch after this list)
* Scheduler reliability and transparency
* Ability to separate data acquisition/staging processing from target schema population
* Support for change data capture
* Support for an in-line transformational procedure language that extends beyond database procedures
* Documentation lineage accuracy and completeness
* Support for sharing metadata repository across multiple targets
* Improvements in staging architecture requiring fewer copies of source data
* Supporting transparency into post-gens
* Providing encapsulated pre- and post-events for custom code that don't get wiped out when you regenerate supported code from the tool
* Better support for task orchestration in a top-down way that can be easily understood.
* Support for direct parameter passing from job schedulers.
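For reference, the record count validation I'd expect after every data movement is only a few lines of T-SQL. This is a minimal sketch with invented table names and an assumed load_batch_id column, not something the tool produces:

  DECLARE @batch_id INT = 42;  -- hypothetical batch identifier
  DECLARE @src BIGINT = (SELECT COUNT_BIG(*) FROM stage_orders WHERE load_batch_id = @batch_id);
  DECLARE @tgt BIGINT = (SELECT COUNT_BIG(*) FROM fact_orders  WHERE load_batch_id = @batch_id);
  IF @src <> @tgt
      -- Fail the job loudly instead of silently dropping rows:
      RAISERROR('Batch %d: staged %I64d rows but loaded %I64d - possible data leak.',
                16, 1, @batch_id, @src, @tgt);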
Use of Solution:
A six-week POC doing thorough analysis, using specific test cases and comparing against an existing production application built with a more traditional ETL tool.
Current version as of June 2015
The tool could not meet minimum performance or productivity requirements and showed a significant lack of architectural integrity, so it was never deployed past the POC. A current five-hour process turned into seven days, even after significant performance tuning. The storage space required grossly exceeded that of the existing process the tool was supposed to replace. Even the on-prem training class was awkward, as getting the tool working for all the students was an ongoing problem.
The scheduler often hangs with nothing to inform the user why it isn't working. Sometimes this is simply because a developer started a scheduler from their own PC; that scheduler integrates with the repository, but its presence isn't obvious when looking at the shared scheduler. Other issues seemed to be resolved only by regenerating the process from scratch and hoping for a different result (which there often is), especially if the scheduler hangs. It is very easy to choose an incorrect option when generating processes, only to end up with something that just hangs with no error message. Then you're knee-deep in database procedural language trying to debug the tool's generated code.
The sheer resources it consumes can easily overwhelm a server and cause the database to freeze by exceeding resource constraints. The tool generates an unexpected and seemingly unpredictable amount of staging data (both in and out of the database, depending on the data movement option chosen), consuming very large amounts of disk space with redundant copies of data. This tool will not promote friendly relations with your system administrators.
The product was deemed unscalable in terms of data storage and performance.
Customer service was very good, although this was a POC, so I can't speak to production support after the product is purchased.
We currently use a traditional ETL tool that meets our requirements. The test scenario involved around 250 GB of source data with a resulting 50 GB (compressed) star-schema target. Our ETL tool can talk to all our sources, including multiple cloud sources, using SOAP and REST web services, whereas this tool is just a database procedure generator with no adapter capability. It relies on ETL tools to do that work, so it doesn't eliminate them; you end up having to support not one tool but two - in licensing, support, and expertise - which was expected to cause a significant increase in total support costs.
The tool relies on ODBC connections, which complicates security in an Active Directory environment compared with using OLE DB.
It requires a developer to either access the server directly or set up file shares. When developing from a PC while working with a scheduler on a server, it was often confusing to know which files needed to be available from the PC versus from the server. The tool requires a great deal of attention to disk usage and location, especially when using native database load utilities, as it first unloads all the data into temporary files during the load, which can easily consume your server's available space in unpredictable ways. Getting the environment functioning just for the training class was unexpectedly difficult.
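To illustrate why the disk consumption is hard to predict: a native load of the kind described works from a complete flat-file extract that must already sit on the server's disk, so each load temporarily needs roughly the full data volume twice. A minimal sketch with an invented path and table name:

  -- The entire extract is unloaded to a flat file first, then bulk-loaded:
  BULK INSERT stage_orders
  FROM 'D:\staging\stage_orders.dat'
  WITH (FIELDTERMINATOR = '|', ROWTERMINATOR = '\n', TABLOCK);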
The metadata repository is unique to a single target database and cannot be shared with any other target, which prevents use beyond a silo data mart. This means data lineage is siloed in a separate repository for each target.
The software vendor was the lead implementer.
The ROI was expected to be very negative to the point of failure.
Cost and Licensing Advice:
Factor in the price of specialized consultants who know this product. They're hard to find and expensive.
Other Solutions Considered:
The existing traditional mainstream ETL tool and its predecessor have been functioning without problems, but the concept of a self-documenting tool providing a common framework and shared metadata repository was a compelling alternative. This tool was already in use in our shop in a limited way, which raised the question of whether it could serve an expanded role.
Advice: Before choosing any tool, do your homework: compare features against competitors and test the tool in a real-life volume scenario. Software is never what it's advertised to be, from any vendor. Don't take others' word for it, especially if they have a vested interest in you using the tool. I know of testimonials about this tool and others that are based on wishful thinking, not reality. Caveat emptor, my friends. So the real question is how far off it is, and whether it has show-stoppers that will prevent its use in your environment. Fail to take this precaution at your and your organization's peril.
The compelling selling point for this and similar "data mart in a box" tools is that they provide pre-built frameworks and documentation. The opportunity is that staff with little data warehousing experience can get up to speed easily without having to build logging, orchestration, and error-handling frameworks, or even know much about data warehouse design.
Unfortunately, this tool's basic architecture has severe flaws that likely make it impractical for most real-world marts or data warehouses. It has significant cons in the following areas:
- Metadata repository is specific to a single target database (cannot be shared across multiple targets)
- Unreliable lineage
- Lacks capability to interface with application APIs
- Lacks out-of-box CDC capabilities (see the delta-detection sketch after this list)
- Poor/outdated GUI (looks like early 1990s)
- Lack of available resources in the marketplace who can operate this tool
- Lack of customer base and information available
- Primitive logging and error handling
- Forced duplication of staging data multiple times
- Inability to separate source data capture from targets (tightly couples sources with targets)
- Stores all objects in a single target database, causing a number of issues with backup/recovery, fragmentation, and limited organization options for objects
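On the CDC point: even without vendor support, basic delta detection is not exotic. Here is a minimal hand-written sketch (invented names; HASHBYTES/CONCAT assume SQL Server 2012 or later) that stages only new or changed rows instead of a full copy:

  INSERT INTO stage_orders_delta (order_key, amount, change_hash)
  SELECT s.order_key, s.amount,
         HASHBYTES('SHA2_256', CONCAT(s.order_key, '|', s.amount))
  FROM   source_orders AS s
  LEFT JOIN target_orders AS t ON t.order_key = s.order_key
  WHERE  t.order_key IS NULL  -- new row
     OR  t.change_hash <> HASHBYTES('SHA2_256', CONCAT(s.order_key, '|', s.amount));  -- changed row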
Disclosure: I am a real user, and this review is based on my own experience and opinions.