For every rule there is an exception; for each exception there are more exceptions…

Implementing an ETL process involves many steps. One of them is creating a mapping document, which describes the data mapping between the source systems and the target, along with the rules of data transformation.

For example: the table/column map between source and target, rules to identify unique rows, not-null attributes, unique values, attribute ranges, transformation rules, etc.
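One mapping-document entry can be sketched as plain data, with the documented rules attached to it. The table, column, and rule names below are made up for illustration, not taken from any real system:

```python
# One source-to-target mapping entry, expressed as data.
# All names here are illustrative assumptions, not from a real mapping document.
mapping = {
    "source": {"table": "crm_customers", "column": "cust_email"},
    "target": {"table": "dim_customer", "column": "email_address"},
    "rules": {
        "unique_key": False,      # not part of the row-identifying key
        "not_null": True,         # analyst says every row has an email
        "allowed_range": None,    # ranges apply to numeric/date columns
        "transformation": "lower-case and trim whitespace",
    },
}

def apply_transformation(value):
    """Apply the documented transformation rule to one source value."""
    return value.strip().lower()

print(apply_transformation("  John.Doe@Example.COM "))  # john.doe@example.com
```

Keeping the rules in a machine-readable form like this also makes it possible to check them against the source data later, which is exactly what data profiling does.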


Without going into further details of the document, let's analyze the very next step. It seems obvious and natural to start development of the ETL process. The ETL developer is all fired up, comes up with a design document, and starts developing; in a few days the code is ready for data loading.
But unexpectedly (?), the code starts having issues every few days. Issues are found and fixed. And then it fails again. What's happening? The analysis was done properly; rules were chalked out and implemented according to the mapping document. So why are issues popping up? Was something missed?
Maybe not! Isn't it normal to have more issues in the initial lifetime of a process?
 
Maybe yes! You have surely missed 'source system data profiling'. The business analyst has told you the rules: how the data is structured in the source system and how it is supposed to behave. But he or she has not told you the 'ifs and buts' of those rules, called EXCEPTIONS.
 
Realistically, it is not possible for anyone to simply recite every rule and exception to you like a parrot. You have to collaborate and dig out the truth. The choice is yours: profile the data in the source system and try to break every rule the analyst has stated, or wait for the process to go live and wake up every night as the load fails. If you are lucky, you merely deal with an unhappy user every morning you walk into the office.
 
Make the right choice: don't skip source system data profiling before writing a single line of code. Question every rule. Try to find exceptions to the rules. Say there are 20 tables, each with 30 columns on average, and each column with roughly 100k values. The product of tables × columns × values gives you the number of chances for your assumptions to be wrong. It's like unit testing the source data before you even load it. There is a reason machines alone cannot do your job; there is a reason IT jobs pay more.
 
Remember, ‘for every rule there is an exception; for each exception there are more exceptions…’

One Response to “For every rule there is an exception; for each exception there are more exceptions…”

  1. Anthony close says:

    ETL Question – does the ETL have the capability to fail a process if an expected column that was mapped is not present? In other words: I load a source into the system for the ETL to pick up, I have mapped 4 columns, but the newly loaded source has changed and only 3 of the 4 columns are present. Does the ETL have a feature to kill the extraction process, or would it continue extracting? Thanks
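Whether an ETL tool fails or continues depends on the tool, but the fail-fast behavior the question describes can be sketched generically: validate the incoming header against the mapped columns before extracting anything. The column names and CSV layout below are assumptions for illustration only:

```python
import csv, io

# Columns the mapping document expects in the source (illustrative names).
EXPECTED_COLUMNS = {"cust_id", "cust_name", "cust_email", "cust_phone"}

def validate_header(file_obj):
    """Raise ValueError (aborting the load) if any mapped column is missing."""
    header = set(next(csv.reader(file_obj)))
    missing = EXPECTED_COLUMNS - header
    if missing:
        raise ValueError(f"Aborting extraction; missing columns: {sorted(missing)}")
    return True

# Simulate a source file that arrives with only 3 of the 4 mapped columns:
source = io.StringIO("cust_id,cust_name,cust_email\n1,Ann,a@x.com\n")
try:
    validate_header(source)
except ValueError as e:
    print(e)
```

Running the check before extraction, rather than during it, means a changed source schema kills the process immediately instead of loading partial or misaligned data.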
