Archive for the ‘Data Staging’ Category

Data staging table / area design.

Wednesday, October 11th, 2006

This could be a long topic of discussion. Following are the main issues I would like to discuss about staging table / database design.

1. Why is a staging area needed?

Unlike OLTP systems, which create their own data through a user interface, data warehouses source their data from other systems. There is physical data movement from the source database to the data warehouse database. The staging area is primarily designed to serve as an intermediate resting place for data before it is processed and integrated into the target data warehouse. This staging area serves many purposes above and beyond that primary function:
a. The data closely matches the source; it has undergone no transformation, or only minor format changes.
b. The staging area in a relational database can be read / scanned / queried using SQL without logging into the source system or reading files (text/xml/binary).
c. It is a prime location for validating source data quality and for auditing and tracking down data issues.
d. The staging area acts as a repository for historical data if it is not truncated.
e. Etc.

2. How does the staging area differ from other areas of the data warehouse?

a. Normally, tables in a relational database are not standalone; each table has a relationship with at least one other table. The staging area greatly differs in this respect. Staging tables are standalone and batch oriented. They are staged with the expectation that a process in the next phase of the load will identify their relationships with other tables and establish those relationships during that load.
 

3. What should the staging table look like?

[Figure: Staging Table Format]

a. The key shown is a meaningless surrogate key, but it has been added for a reason: many times the data coming from a source has no unique identifier, or the unique identifier is a composite key. In such cases, when a data issue is found with a row, it is very difficult to identify or even refer to that particular row. When a unique row number is assigned to each row in the staging table, referencing it becomes easy.

b. Various dates have been added to the table; please refer to the date discussion here.

c. The data types have been kept as strings because this ensures that a row with a bad format or wrong data type will at least be populated in the staging table for further analysis or follow-up.

d. A source system column has been added as a data reference so that the next process step can use this value and behave dynamically based on the source system. It also supports table reuse, data partitioning, etc.

e. Note that the table name carries the source system as a qualifying prefix; this distinguishes the table from the same entity sourced from another system, for example the customer table from another system called MKT.

f. Other columns can be added, for example a processed flag to indicate whether the row has been processed by the downstream application; this also gives the downstream process incremental restart ability. An exception flag can likewise be added to indicate that an exception or error was raised while processing the row and that the row was therefore not processed.
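To make this concrete, here is a minimal sketch of such a staging table using Python's built-in sqlite3 module. The table and column names (MKT_CUSTOMER, ROW_NUM, SRC_SYSTEM, the flag columns, and so on) are illustrative assumptions, not a prescribed standard; adapt them to your own naming conventions and database.

import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the real staging database

conn.execute("""
    CREATE TABLE MKT_CUSTOMER (
        ROW_NUM        INTEGER PRIMARY KEY,   -- meaningless surrogate row number (point a)
        CUSTOMER_ID    TEXT,                  -- source columns kept as strings (point c)
        CUSTOMER_NAME  TEXT,
        CUSTOMER_EMAIL TEXT,
        SRC_SYSTEM     TEXT,                  -- which source system supplied the row (point d)
        SRC_EXTRACT_DT TEXT,                  -- dates added to the table (point b)
        STG_LOAD_DT    TEXT,
        PROCESSED_FLAG TEXT DEFAULT 'N',      -- set once the downstream process picks up the row (point f)
        EXCEPTION_FLAG TEXT DEFAULT 'N'       -- set if processing the row raised an error (point f)
    )
""")
conn.commit()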

4. Which design to choose?

a. Should the table be truncated and loaded?
b. Should the table be append only?
c. Should the default data type be left as an alphanumeric string (VARCHAR)?
d. Should constraints be enforced?
e. Should there be a primary key?

The answer normally depends on the situation, but if you are not sure (or don't want to think about it), the design suggested here should more than satisfy your requirements. A short sketch of the first two options follows.
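The sketch below contrasts truncate-and-load with append-only loading, reusing the illustrative MKT_CUSTOMER table and sqlite3 connection from the earlier sketch (both are assumptions for demonstration, not a prescription).

def load_batch(conn, rows):
    # rows: iterable of (customer_id, name, email, src_system, src_extract_dt) tuples.
    conn.executemany(
        "INSERT INTO MKT_CUSTOMER "
        "(CUSTOMER_ID, CUSTOMER_NAME, CUSTOMER_EMAIL, SRC_SYSTEM, SRC_EXTRACT_DT, STG_LOAD_DT) "
        "VALUES (?, ?, ?, ?, ?, date('now'))",
        rows,
    )
    conn.commit()

def truncate_and_load(conn, rows):
    # Option (a): the staging table holds only the current batch.
    conn.execute("DELETE FROM MKT_CUSTOMER")
    load_batch(conn, rows)

def append_only_load(conn, rows):
    # Option (b): every batch is kept, so the staging area doubles as history
    # (see point d in section 1).
    load_batch(conn, rows)

Truncate-and-load keeps the staging area small, while append-only trades space for history and auditability.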

NOTE: This post applies regardless of the ETL tool or database, whether Informatica, DataStage, Syncsort DMExpress, Sunopsis, SQL Server Integration Services (SSIS)/DTS, or Ab Initio on the tool side, or Oracle, Sybase, MS SQL Server, RDB, etc. on the database side.

Multiple executions of an ETL process against the same set of data.

Thursday, June 15th, 2006

Every ETL designer, developer & tester should always ask this question: “What will happen if I run the ETL process multiple times against the same data set?”

Answer 1: I get the same result set.
Answer 2: I get multiple result sets.

If you go back to the original article on What is ETL & What ETL is not!, you will immediately come to the conclusion that Answer 2 is incorrect, as ETL is not allowed to create data.

Why would the process run more than once against the same set of data? For many reasons, the most common being an operator's mistake, an accidental kickoff, an old data file left in the directory, a staging table loaded more than once, an intentional rerun of the ETL process after correcting some data in the source data set, and so on. Without going into further detail, I would advise ETL folks to always build into the process ways to prevent reruns from creating duplicates, using one or a combination of the following methods (a short sketch follows the list)…
1. Identify the primary key (logical or physical) and use update-else-insert logic.
2. Delete the target data set before processing again (based on the logical or physical primary key).
3. Prevent multiple runs by flagging processed dates.
4. Mark processed records with processed flags after commit.
5. Prevent multiple loads into the staging area.
6. Identify duplicate records in the staging area before the data gets processed.
7. More…
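Here is a minimal sketch of methods 1 and 2, again using Python's sqlite3 and an assumed target table DW_CUSTOMER keyed on CUSTOMER_ID (both names are illustrative, not from any particular warehouse).

def upsert_customer(conn, customer_id, name, email):
    # Method 1: update if the logical primary key already exists, otherwise insert,
    # so a second run against the same data overwrites rather than duplicates.
    updated = conn.execute(
        "UPDATE DW_CUSTOMER SET CUSTOMER_NAME = ?, CUSTOMER_EMAIL = ? WHERE CUSTOMER_ID = ?",
        (name, email, customer_id),
    ).rowcount
    if updated == 0:
        conn.execute(
            "INSERT INTO DW_CUSTOMER (CUSTOMER_ID, CUSTOMER_NAME, CUSTOMER_EMAIL) VALUES (?, ?, ?)",
            (customer_id, name, email),
        )

def delete_then_insert(conn, batch):
    # Method 2: delete the target rows for the keys in this batch, then reload them,
    # so rerunning the batch leaves exactly one copy of each row.
    for customer_id, name, email in batch:
        conn.execute("DELETE FROM DW_CUSTOMER WHERE CUSTOMER_ID = ?", (customer_id,))
        conn.execute(
            "INSERT INTO DW_CUSTOMER (CUSTOMER_ID, CUSTOMER_NAME, CUSTOMER_EMAIL) VALUES (?, ?, ?)",
            (customer_id, name, email),
        )
    conn.commit()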

So run this experiment in the development or test environment: run the ETL process more than once and check the result! If you get Answer 2 (copies of rows, with no way to distinguish them or retire the old ones), the designer or developer is wrong, and if the process has passed QA or testing, then the tester is wrong as well.

Bottom line:
A test case that checks multiple runs is a must in the life cycle of an ETL process.
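As a minimal sketch, such a test case could look like the following, assuming a hypothetical run_etl_load function and the illustrative DW_CUSTOMER target table from the earlier sketch.

def test_rerun_does_not_create_duplicates(conn, run_etl_load, source_batch):
    # Run the same load twice against the same source data and verify that
    # the second run does not grow the target (i.e., we got Answer 1, not Answer 2).
    run_etl_load(conn, source_batch)
    first_count = conn.execute("SELECT COUNT(*) FROM DW_CUSTOMER").fetchone()[0]

    run_etl_load(conn, source_batch)  # second run against the exact same data set
    second_count = conn.execute("SELECT COUNT(*) FROM DW_CUSTOMER").fetchone()[0]

    assert second_count == first_count, "rerun created duplicate rows"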