Multiple executions ETL process against same set of data.

Every ETL designer, developer & tester should always ask this question…”What will happen, if I run the ETL process multiple times, against the same data set?”

Answer: 1. I get the same result set.
Answer: 2. I get multiple result set.

If you go back to the original article on What is ETL & What ETL is not! You will immediately come to the conclusion that Answer 2 is incorrect, as ETL is not allowed to create data.

Why will the process run more than once against the same set of data? Many reasons, example most common being operators mistake, accidental kickoff, old set of data file remaining in the directory, staging table loaded more than once, intentional rerun of ETL process after correction of some data in source data set, etc. Without going into further details, I would advise ETL folks to always include in your process ways to prevent it from happening by one or more combinations of following methods…
1. Identify the primary key (logical/physical) and put update else insert logic.
2. Deleting the target data set before processing again (based on logical/physical primary key)
3. Preventing occurrences of multiple runs by flagging processed dates
4. Marking processed records with processed flags after commit
5. Prevention of multiple loads in the staging area
6. identifying duplicate records in stage area before the data gets processed
7. more…

So do these experiments in the development or test environment run the ETL process more than once, check the result! If you get the result 2 (copies of rows, with no way to distinguish or retiring the old rows)
The designer or the developer is wrong & if the process as passed the QA or testing then the tester is wrong.

Bottom line:
A test case to check multiple runs is must in life cycle of an ETL process.

2 Responses to “Multiple executions ETL process against same set of data.”

  1. Nehru says:

    This article is pretty good however I wold like to know some detailed like different approaches for ETL testing pros and cons for the same. We are looking for automation testing for ETL . Its really appreciate if you have multiple approacehs and suggestions.

    Thanks & Regards
    Nehru
    ETL Lead

  2. Suddi says:

    correctly said, checking for duplicates data in the target table is the essential part of testing DWH / ETL. But from your 7 points i have other thought:
    1. There will be performance issues if we go ahead with this solution as all the existing records will go with Update logic whether it is updated or not.
    2. Again this approach fails when the target data set is huge.
    3,4,5,6 This is correct approach as per my understanding.
    For more DWH testing related stuff Please go through teh below link:

    http://xtremthink.blogspot.com/search?updated-max=2011-08-24T11%3A20%3A00-07%3A00&max-results=1

Leave a Reply