Archive for September, 2006

ETL Startegy to store data validation rules

Wednesday, September 6th, 2006

Every time there is movement of data the results have to be tested against the expected results. For every ETL process, test conditions for testing data are defined before/during design and development phase itself.  Some that are missed can be added later on.

Various test conditions are used to validate data when the ETL process is migrated from DEV-to->QA-to->PRD. These test conditions are can exists in the developer’s/tester’s mind /documented in word or excel. With time the test conditions either lost ignored or scattered all around to be really useful.

In production if the ETL process runs successfully without error is a good thing. But it does not really mean anything. You still need rules to validate data processed by ETL. At this point you need data validation rules again!

A better ETL strategy is to store the ETL business rules in a RULES table by target table, source system. These rules can be in SQL text. This will create a repository of all the rules in a single location which can be called by any ETL process/ auditor at any phase of the project life cycle.

There is also no need to re-write /rethink rules. Any or all of these rules can be made optional, tolerances can be defined, called immediately after the process is run or data can be audited at leisure.

This Data validation /auditing system will basically contain
A table that contains the rules,
A process to call is dynamically and
A table to store the results from the execution of the rules

Benefits:

Rules can be added dynamically with no cange to code.

Rules are stored permanantly.

Tolerance level can be changed with ever changing the code

Biz rules can be added or validated by business experts without worring about the ETL code.

NOTE: This post is applicable to all etl tools or databases like Informatica, DataStage, Syncsort DMExpress, Sunopsis  or Oracle, Sybase, SQL Server Integration Services (SSIS)/DTS, Ab Initio,   MS SQL Server, RDB, etc.

Hard references to database objects, Synonyms and ETL

Wednesday, September 6th, 2006

1. Any time hard references are made to a database object it is an invitation to trouble. Especially in ETL where the object (table /view /procedure) can belong to schema.

2. Oracle has –> Database server name –> Schema –> Object

3. SCHEMA_NAME.TABLE_NAME is a bad way of referencing objects from the ETL architecture point of view as well as from data warehouse architecture point of view. The reason being any hard coding of the objects will cause trouble when they are moved from one schema to another.

4. An alternative could be to log in with the same SCHEMA_NAME where the objects exists.  But this will create a security hazard.  As ETL processes in will be in the same schema as that of the data warehouse structures and processes.

5. Also refereeing a object directly can also be issue because if the table name is changed or the column name is changes it will result in immediate code change.

6. Much argument can made that it’s not really an issue but just a little work or little more work. But what most people don’t understand is it is not how much work but when this work can come up. Most likely when you are migrating/ production run, or when the team is really at short of time.

7. Hard coded references also reduce flexibility… what if ETL wants to read data from some other schema to test an ETL process? What then?

Ok! What’s the solution then… 

hardcodedref.JPG

 

A better way would be to have a dedicated schema for ETL user. And all the objects tables /views /procedures can be references by public synonyms. This way any object can be replaced by a new reference, temporary object, view or private synonym without; ever changing the code. It’s a much better way there are many more advantages but of course there will be more maintenance.

NOTE: This post is applicable to all etl tools or databases like Informatica, DataStage, Syncsort DMExpress, Sunopsis  or Oracle, Sybase, SQL Server Integration Services (SSIS)/DTS, Ab Initio,   MS SQL Server, RDB, etc.