For every rule there is an exception; for each exception there are more exceptions…

Sunday, May 13th, 2007

To implement an ETL process there are many steps to follow. One such step is creating a mapping document, which describes the data mapping between the source systems and the target, along with the rules of data transformation.

For example: the table/column map between source and target, rules to identify unique rows, not-null attributes, unique values, ranges of attribute values, transformation rules, etc.
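
As an illustration of how such rules might be captured (the table names, column names and values below are hypothetical, not from any real system), a fragment of a mapping document can be written down as data, so that it can later drive automated checks:

# Hypothetical fragment of a source-to-target mapping document,
# expressed as Python data so it can later drive automated checks.
CUSTOMER_MAPPING = {
    "source_table": "SRC.CUSTOMERS",
    "target_table": "DW.DIM_CUSTOMER",
    "unique_key": ["customer_id"],                # rule to identify unique rows
    "not_null": ["customer_id", "status"],        # attributes that must be populated
    "allowed_values": {"status": ["ACTIVE", "INACTIVE", "CLOSED"]},
    "ranges": {"credit_limit": (0, 1_000_000)},   # expected range of an attribute
    "transformations": {
        "full_name": "TRIM(first_name) || ' ' || TRIM(last_name)",
    },
}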


Without going into further details of the document, let's analyze the very next step. It seems obvious and natural to start development of the ETL process. The ETL developer is all fired up, comes up with a design document and starts coding; in a few days' time the code is ready for data loading.
But unexpectedly (?) the code starts having issues every few days. Issues are found and fixed. And then it fails again. What's happening? The analysis was done properly; the rules were chalked out and implemented according to the mapping document. So why do issues keep popping up? Was something missed?
Maybe not! Isn't it normal to have more issues early in the life of a process?
 
Maybe yes! You have surely missed 'source system data profiling'. The business analyst has told you the rules for how the data is structured in the source system and how it is supposed to behave; but he/she has not told you the 'ifs and buts', the EXCEPTIONS to those rules.
 
To be realistic, it is not possible for anyone to just recite all the rules and exceptions to you like a parrot. You have to collaborate and dig out the truth. The choice is yours: profile the data in the source system and try to break every rule the analyst has given you, or wait for the process to go live and then wake up every night as the load fails. And if you are lucky, you get to deal with an unhappy user every morning when you reach the office.
 
Make the right choice; don't skip 'source system data profiling' before actually writing a single line of code. Question every rule. Try to find exceptions to the rules. Suppose the source has 20 tables, each table has on average 30 columns, and each column holds on average 100,000 values; multiply tables by columns by data values and you get 60 million reasons why your assumptions may be wrong. It's like unit testing the source data even before loading it. There is a reason why machines alone cannot do your job; there is a reason why IT jobs pay more.
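
To make 'unit testing source data' concrete, here is a minimal sketch in Python using pandas, assuming one source table has been extracted to a CSV file; the file name, column names and allowed values are illustrative assumptions, not part of any real mapping:

import pandas as pd

# Hypothetical extract of one source table; all names are illustrative.
df = pd.read_csv("customers.csv")

# Rule: customer_id uniquely identifies a row. Exception hunt: duplicate keys.
dupes = df[df.duplicated(subset=["customer_id"], keep=False)]
print(f"duplicate keys: {len(dupes)} rows")

# Rule: status is never null and takes only known values.
print(f"null statuses: {df['status'].isna().sum()}")
print("unexpected statuses:",
      set(df["status"].dropna()) - {"ACTIVE", "INACTIVE", "CLOSED"})

# Rule: order_date is always a parseable date within a sane range.
dates = pd.to_datetime(df["order_date"], errors="coerce")
print(f"unparseable dates: {(dates.isna() & df['order_date'].notna()).sum()}")
print(f"date range: {dates.min()} .. {dates.max()}")

Every non-zero count that comes out of a script like this is one of the analyst's rules with an exception attached, found before the first line of ETL code instead of in a 3 a.m. load failure.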
 
Remember, ‘for every rule there is an exception; for each exception there are more exceptions…’