Archive for the ‘ETL Testing’ Category

ETL Startegy to store data validation rules

Wednesday, September 6th, 2006

Every time there is movement of data the results have to be tested against the expected results. For every ETL process, test conditions for testing data are defined before/during design and development phase itself.  Some that are missed can be added later on.

Various test conditions are used to validate data when the ETL process is migrated from DEV-to->QA-to->PRD. These test conditions are can exists in the developer’s/tester’s mind /documented in word or excel. With time the test conditions either lost ignored or scattered all around to be really useful.

In production if the ETL process runs successfully without error is a good thing. But it does not really mean anything. You still need rules to validate data processed by ETL. At this point you need data validation rules again!

A better ETL strategy is to store the ETL business rules in a RULES table by target table, source system. These rules can be in SQL text. This will create a repository of all the rules in a single location which can be called by any ETL process/ auditor at any phase of the project life cycle.

There is also no need to re-write /rethink rules. Any or all of these rules can be made optional, tolerances can be defined, called immediately after the process is run or data can be audited at leisure.

This Data validation /auditing system will basically contain
A table that contains the rules,
A process to call is dynamically and
A table to store the results from the execution of the rules

Benefits:

Rules can be added dynamically with no cange to code.

Rules are stored permanantly.

Tolerance level can be changed with ever changing the code

Biz rules can be added or validated by business experts without worring about the ETL code.

NOTE: This post is applicable to all etl tools or databases like Informatica, DataStage, Syncsort DMExpress, Sunopsis  or Oracle, Sybase, SQL Server Integration Services (SSIS)/DTS, Ab Initio,   MS SQL Server, RDB, etc.

Loading & testing fact/transactional/balances (data), which is valid between dates!

Tuesday, July 25th, 2006
This is going to be a very interesting topic for ETL & Data modelers who  design processes/tables to load fact or transactional data which keeps on changing between dates.   ex: prices of shares, Company ratings, etc.  

variable_bond_interest.JPG

The table above shows an entity in the source system that contains time variant values but they don’t change daily. The values are valid over a period of time; then they change.

 

variable_bond_interest_fct1.JPG
 

1 .What the table structure should be used in the data warehouse? 

Maybe Ralph Kimball or Bill Inmon can come with better data model!  :-) But for ETL  developers or ETL leads the decision is already made so lets look for a solution.

2. What should be the ETL design to load such a structure?

Design A

  • There is one to one relationship between the source row and the target row.
  • There is a CURRENT_FLAG attribute, that means every time the ETL process get a new value it has add a new row with current flag and go to the previous row and retire it. Now this step is a very costly ETL step it will slow down the ETL process.
  • From the report writer issue this model is a major challange to use. Because what if the report wants a rate which is not current. Imagine the complex query.

Design B

  • In this design the sanpshot of the source table is taken every day.
  • The ETL is very easy. But can you imagine the size of fact table when the source which has more than 1 million rows in the source table. (1 million x 365 days = ? rows per year). And what if the change in values are in hours or minutes?
  • But you have a very happy user who can write SQL reports very easily.

Design C

  • Can there be a comprimise. How about using from date (time) – to date (time)!  The report write can simply provide a date (time) and the straight SQL can return a value/row that was valid at that moment.
  • However the ETL is indeed complex as the A model. Because while the current row will be from current date to- infinity. The previous row has to be retired to from date to todays date -1.
  • This kind of ETL coding also creates lots of testing issues as you want to make sure that for nay given date and time only one instance of the row exists (for the primary key).

Which design is better, I have used all depending on the situtation.

3. What should be the unit test plan?

 There are various cases where the ETL can miss and when planning for test cases and your plan should be to precisely test those. Here are some examples of test plans 

a. There should be only one value for a given date/date time 

b. During the initial load when the data is available for multiple days the process should go sequential and create snapshots/ranges correctly. 

c. At any given time there should be only one current row .

d. etc   

NOTE: This post is applicable to all etl tools or databases like Informatica, DataStage, Syncsort DMExpress, Sunopsis  or Oracle, Sybase, SQL Server Integration Services (SSIS)/DTS, Ab Initio,   MS SQL Server, RDB, etc.