Archive for the ‘Data Quality’ Category

For every rule there is an exception; for each exception there are more exceptions…

Sunday, May 13th, 2007

Implementing an ETL process involves many steps. One of them is creating a mapping document, which describes the data mapping between the source systems and the target, along with the rules of data transformation.

E.g., the table/column map between source and target, rules to identify unique rows, not-null attributes, unique values, valid ranges for attributes, transformation rules, etc.
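As a minimal sketch of how one such entry could be captured (the table, column, and rule names below are made up for illustration; real mapping documents are usually spreadsheets):

    # One hypothetical mapping-document entry expressed as data.
    mapping_entry = {
        "source": "CRM.CUSTOMER",            # source table (illustrative)
        "target": "DW.DIM_CUSTOMER",         # target table (illustrative)
        "unique_key": ["customer_id"],       # rule to identify unique rows
        "not_null": ["customer_id", "email"],
        "ranges": {"birth_date": ("1900-01-01", "2007-12-31")},
        "transformations": {
            "full_name": "TRIM(first_name) || ' ' || TRIM(last_name)",
        },
    }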


Without going into further details of the document, let’s analyze the very next step. It seems obvious and natural to start development of the ETL process. The ETL developer is all fired up, comes up with a design document, and starts developing; in a few days the code is ready for data loading.
But unexpectedly (?) the code starts having issues every few days. Issues are found and fixed. And then it fails again. What’s happening? The analysis was done properly; rules were chalked out and implemented according to the mapping document. So why are issues popping up? Was something missed?
Maybe not! Isn’t it normal to have more issues in the initial lifetime of a process?
 
Maybe yes! You have surely missed ‘Source System Data Profiling’. The business analyst has told you the rules for how the data is structured in the source system and how it is supposed to behave; but he/she has not told you the ‘ifs and buts’, the EXCEPTIONS to those rules.
 
To be realistic, nobody can simply recite all the rules and their exceptions to you like a parrot. You have to collaborate and dig out the truth. The choice is yours: profile the data in the source system and try to break every rule the analyst gave you, or wait for the process to go live and wake up every night as the load fails. If you are lucky, you merely face an unhappy user every morning when you get to the office.
 
Make the right choice; don’t skip ‘source system data profiling’ before actually writing a single line of code. Question every rule. Try to find exceptions to the rules. Suppose the source has at least 20 tables, a table has on average 30 columns, and each column holds on average 100k values. The product of tables * columns * data values (20 * 30 * 100,000 = 60 million) gives you the number of chances for your assumptions to be wrong. It’s like unit testing the source data without even loading it. There is a reason why machines alone cannot do your job; there is a reason why IT jobs pay more.
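As a hedged sketch of what this profiling can look like (the connection, table, and column names are hypothetical; adapt the queries to your own schema), each documented rule becomes a query that counts its exceptions:

    # Profile the source by counting violations of each documented rule.
    # sqlite3 stands in for whatever DB-API driver your source requires.
    import sqlite3

    conn = sqlite3.connect("source_copy.db")  # hypothetical source snapshot

    checks = {
        "CUSTOMER_ID is unique":
            "SELECT COUNT(*) FROM (SELECT customer_id FROM customer "
            "GROUP BY customer_id HAVING COUNT(*) > 1)",
        "EMAIL is never null":
            "SELECT COUNT(*) FROM customer WHERE email IS NULL",
        "BIRTH_DATE within 1900-2007":
            "SELECT COUNT(*) FROM customer WHERE birth_date "
            "NOT BETWEEN '1900-01-01' AND '2007-12-31'",
    }

    for rule, sql in checks.items():
        violations = conn.execute(sql).fetchone()[0]
        print(rule, "-> OK" if violations == 0 else f"-> {violations} EXCEPTIONS")

Every non-zero count is an exception the mapping document never mentioned, and a load failure you just avoided.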
 
Remember, ‘for every rule there is an exception; for each exception there are more exceptions…’

Multiple executions of an ETL process against the same set of data

Thursday, June 15th, 2006

Every ETL designer, developer & tester should always ask this question… “What will happen if I run the ETL process multiple times against the same data set?”

Answer 1: I get the same result set.
Answer 2: I get multiple result sets.

If you go back to the original article, ‘What is ETL & What ETL is not!’, you will immediately come to the conclusion that Answer 2 is incorrect, as ETL is not allowed to create data.

Why would the process run more than once against the same set of data? For many reasons, the most common being an operator’s mistake, an accidental kickoff, an old data file remaining in the directory, a staging table loaded more than once, an intentional rerun of the ETL process after correcting some data in the source data set, etc. Without going into further details, I would advise ETL folks to always build into the process ways to keep reruns from doing harm, using one or a combination of the following methods (a sketch of the first two follows the list)…
1. Identify the primary key (logical/physical) and use update-else-insert logic.
2. Delete the target data set before processing again (based on the logical/physical primary key).
3. Prevent multiple runs by flagging processed dates.
4. Mark processed records with a processed flag after commit.
5. Prevent multiple loads into the staging area.
6. Identify duplicate records in the staging area before the data gets processed.
7. more…
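As a hedged sketch of methods 1 and 2 (the dim_customer table and its columns are hypothetical, and this is not a drop-in implementation), either approach turns a second run against the same data into a harmless overwrite:

    # Method 1: update-else-insert on the logical primary key, so a rerun
    # overwrites existing rows instead of duplicating them.
    def load_rows(conn, rows):
        for row in rows:
            updated = conn.execute(
                "UPDATE dim_customer SET full_name = ?, email = ? "
                "WHERE customer_id = ?",
                (row["full_name"], row["email"], row["customer_id"]),
            ).rowcount
            if updated == 0:
                conn.execute(
                    "INSERT INTO dim_customer (customer_id, full_name, email) "
                    "VALUES (?, ?, ?)",
                    (row["customer_id"], row["full_name"], row["email"]),
                )
        conn.commit()

    # Method 2: delete the batch's keys from the target first, then insert.
    def reload_rows(conn, rows):
        conn.executemany(
            "DELETE FROM dim_customer WHERE customer_id = ?",
            [(r["customer_id"],) for r in rows],
        )
        conn.executemany(
            "INSERT INTO dim_customer (customer_id, full_name, email) "
            "VALUES (?, ?, ?)",
            [(r["customer_id"], r["full_name"], r["email"]) for r in rows],
        )
        conn.commit()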

So do this experiment in the development or test environment: run the ETL process more than once and check the result! If you get Answer 2 (copies of rows, with no way to distinguish or retire the old ones), the designer or the developer is wrong; and if the process has passed QA or testing, then the tester is wrong too.

Bottom line:
A test case that checks multiple runs is a must in the life cycle of an ETL process.
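A minimal sketch of such a test, reusing the load_rows function from the earlier sketch against an in-memory SQLite database (all names are hypothetical):

    import sqlite3

    def test_rerun_is_idempotent():
        conn = sqlite3.connect(":memory:")
        conn.execute(
            "CREATE TABLE dim_customer ("
            "customer_id INTEGER PRIMARY KEY, full_name TEXT, email TEXT)")
        batch = [{"customer_id": 1, "full_name": "A B", "email": "a@b.c"}]
        load_rows(conn, batch)   # first run
        first = conn.execute("SELECT COUNT(*) FROM dim_customer").fetchone()[0]
        load_rows(conn, batch)   # same data set, second run
        second = conn.execute("SELECT COUNT(*) FROM dim_customer").fetchone()[0]
        assert first == second, "rerun created duplicate rows"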