ETL delta logic & de-normalization of data model.

Tuesday, June 6th, 2006

It is normal practice in data warehousing to de-normalize (or, as it once got auto-corrected, "demoralize") the data model for performance. I am not going to discuss the benefits vs. issues of de-normalization, because by the time the model reaches the ETL guy its fate is already decided.

Let’s look at the model on the source side, which is perfectly normalized.

      
Now let’s look at the de-normalized model on the target side.

 
Next, let’s think about the delta logic for loading the dim_employee table. Ideally you would only check for changes in the employee table: if any rows have changed since the last load date/time, get those rows from ref_employee, do the lookups to get the department and the designation, and load the result into the target table.
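As a rough sketch of this employee-only delta, the snippet below runs the query against an in-memory SQLite database. Only ref_employee, dept_name, emp_designation_desc and update_dttm are names taken from this post; the parent table names (ref_department, ref_designation), the key columns, the sample rows and the last load time are assumptions made up for illustration.

```python
import sqlite3

# Hypothetical source schema and sample data (assumed for illustration).
SCHEMA_AND_DATA = """
CREATE TABLE ref_department (
    dept_id INTEGER PRIMARY KEY, dept_name TEXT, update_dttm TEXT);
CREATE TABLE ref_designation (
    designation_id INTEGER PRIMARY KEY, emp_designation_desc TEXT,
    update_dttm TEXT);
CREATE TABLE ref_employee (
    emp_id INTEGER PRIMARY KEY, emp_name TEXT,
    dept_id INTEGER, designation_id INTEGER, update_dttm TEXT);

INSERT INTO ref_department VALUES
    (10, 'Sales',   '2006-01-01 00:00:00'),
    (20, 'Finance', '2006-01-01 00:00:00');
INSERT INTO ref_designation VALUES
    (1, 'Analyst', '2006-01-01 00:00:00'),
    (2, 'Manager', '2006-01-01 00:00:00');
INSERT INTO ref_employee VALUES
    (1, 'Ann', 10, 1, '2006-01-01 00:00:00'),
    (2, 'Bob', 20, 2, '2006-06-03 00:00:00');
"""

# Delta driven by the employee table only: pick employees changed since
# the last load, then look up department and designation for those rows.
NAIVE_DELTA_SQL = """
SELECT e.emp_id, e.emp_name, d.dept_name, g.emp_designation_desc
FROM ref_employee e
LEFT JOIN ref_department  d ON d.dept_id = e.dept_id
LEFT JOIN ref_designation g ON g.designation_id = e.designation_id
WHERE e.update_dttm > :last_load
ORDER BY e.emp_id
"""

def naive_delta(conn, last_load):
    """Return rows to load into dim_employee using employee changes only."""
    return conn.execute(NAIVE_DELTA_SQL, {"last_load": last_load}).fetchall()

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA_AND_DATA)

# Timestamps are stored as 'YYYY-MM-DD HH:MM:SS' text, so plain string
# comparison orders them correctly.
LAST_LOAD = "2006-06-01 00:00:00"
print(naive_delta(conn, LAST_LOAD))
```

With this sample data only Bob, whose employee row changed after the assumed last load, is picked up.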
 
The issue with this delta logic is that it does not consider the effect of de-normalizing the employee table on the target side. Look carefully at the two de-normalized attributes, dept_name and emp_designation_desc: the ETL process will miss any changes in the parent tables, so only new or updated employees will get the new definition of department and designation. Any employee that has not been updated on the source side will still carry the old dept_name and emp_designation_desc. This is wrong.

The reason it is wrong is that the ETL delta logic only picks up a row from the employee table when that row changes, and ignores changes in the dept and designation tables. The truth of the matter is: “For any de-normalized target table, the data (affected rows) should be re-captured from the source any time there is a change in the driving/core table, as well as when there is a change in any parent table to which the driving table refers.” In this case, if there is a change in the department or designation table, all the affected rows in the employee table should be re-processed.

It might seem very simple, but ETL developers/designers/modelers always miss this point. Also, once the load is developed, the omission is very difficult to catch.

The next question is how you would catch the affected rows. Well, there are ways to write SQL that combines the three tables (in this case), treats them as one single entity, and then pulls rows based on any update_dttm greater than the last ETL run. Figure out the SQL…
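One possible shape for such a query, offered as a sketch rather than the definitive answer: join the driving table to its parents and keep a row when any of the three update_dttm values is later than the last run. As before, the parent table names (ref_department, ref_designation), the key columns, the sample rows and the last load time are assumptions for illustration; it also assumes every source table carries its own update_dttm.

```python
import sqlite3

# Hypothetical source schema and sample data (assumed for illustration).
# Department 10 was renamed after the last load; only Bob's own row changed.
SCHEMA_AND_DATA = """
CREATE TABLE ref_department (
    dept_id INTEGER PRIMARY KEY, dept_name TEXT, update_dttm TEXT);
CREATE TABLE ref_designation (
    designation_id INTEGER PRIMARY KEY, emp_designation_desc TEXT,
    update_dttm TEXT);
CREATE TABLE ref_employee (
    emp_id INTEGER PRIMARY KEY, emp_name TEXT,
    dept_id INTEGER, designation_id INTEGER, update_dttm TEXT);

INSERT INTO ref_department VALUES
    (10, 'Global Sales', '2006-06-05 00:00:00'),
    (20, 'Finance',      '2006-01-01 00:00:00');
INSERT INTO ref_designation VALUES
    (1, 'Analyst', '2006-01-01 00:00:00'),
    (2, 'Manager', '2006-01-01 00:00:00');
INSERT INTO ref_employee VALUES
    (1, 'Ann', 10, 1, '2006-01-01 00:00:00'),
    (2, 'Bob', 20, 2, '2006-06-03 00:00:00'),
    (3, 'Cy',  10, 2, '2006-01-01 00:00:00'),
    (4, 'Dee', 20, 1, '2006-01-01 00:00:00');
"""

# Treat employee + department + designation as one entity: a row is part
# of the delta if ANY of the three source rows changed since the last run.
COMBINED_DELTA_SQL = """
SELECT e.emp_id, e.emp_name, d.dept_name, g.emp_designation_desc
FROM ref_employee e
LEFT JOIN ref_department  d ON d.dept_id = e.dept_id
LEFT JOIN ref_designation g ON g.designation_id = e.designation_id
WHERE e.update_dttm > :last_load
   OR d.update_dttm > :last_load
   OR g.update_dttm > :last_load
ORDER BY e.emp_id
"""

def combined_delta(conn, last_load):
    """Return dim_employee rows affected by a change in any source table."""
    return conn.execute(COMBINED_DELTA_SQL, {"last_load": last_load}).fetchall()

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA_AND_DATA)

LAST_LOAD = "2006-06-01 00:00:00"
for row in combined_delta(conn, LAST_LOAD):
    print(row)
```

With this sample data, Ann and Cy are re-captured because their department was renamed, Bob because his own row changed, and Dee is left alone. The trade-off is that a change to a single parent row re-processes every employee that refers to it, which is exactly what the de-normalized target needs.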