Data staging table / area design.

This could be long topic of discussion.  Following are the main issues I would like to discuss on staging table /database design.

1. Why staging area is needed?

Unlike OLTP systems that create their own data through an user interface data warehouses source their data from other systems. There is physical data movement from source database to data warehouse database. Staging area is primarily designed to serve as intermediate resting place for data before it is processed and integrated into the target data warehouse. This staging are serves many purpose above and beyond the primary function
a. The data is most consistent with the source. It is devoid of any transformation or has only minor format changes.
b. The staging area in a relation database can be read/ scanned/ queried using SQL without the need of logging into the source system or reading files (text/xml/binary).
c. It is a prime location for validating data quality from source or auditing and tracking down data issues.
d. Staging area acts as a repository for historical data if not truncated
e. Etc.

2. What is the difference between staging area as compared to other areas of data warehouse?

a. Normally tables in any relational database are relational. Normally tables are not stand alone. Tables have relationship with at least one or more tables. But the staging area greatly differs in this aspect. The tables are random in nature. They are more batch oriented. They are staged in the hope that in the next phase of load there will be a process that will identify the relationship with other tables and during such load a relationship will be established.
 

3. What should the staging table look like?

Staging Table Format

a. The key shown is a meaningless surrogate key but still it has been added the reason being; as may times the data coming from a source has no unique identifier or some times the unique identifier is a composite key; in such cases when data issue is found with any of the row it is very difficult to identify the particular row or even mention it. When a unique row num is assigned to each row in the staging table it becomes really easy to reference it.

b. Various dates have added to the table; please refer date discussion here.

c. The data type has been kept as string because this data type ensures that a bad format or wrong data type row will be at least populated in the stage table for further analysis or follow-up.

d. Source system column has bee added to keep a data reference so that next process step can use this value and can have dynamic behavior during process based on the source system. Also it supports reuse of table, data partitioning etc.

e. Note the table has source as table qualifier as prefix this distinguishes the table from other source system. Example customer from another system called MKT.

d. Other columns can be added example processed flag to indicate if the row has been processed by the down stream application. It also provides incremental restart abilities for the down stream process.  Also exception flag can be added to the table to indicate that while processing the table an exception or error was raised hence the row is not processed.

4. Which design to choose?

a. Should the table be truncated and loaded?
b. Should the table will be append only?
c. Should default data type to be left as alpha numeric string (VARCHAR)?
d. Should constraints be enforced?
e. Should there be a primary Key?

It is normally based on the situation but if not sure or you don’t want to think then design suggested here should more than suffice your requirements.

NOTE: This post is applicable to all etl tools or databases like Informatica, DataStage, Syncsort DMExpress, Sunopsis  or Oracle, Sybase, SQL Server Integration Services (SSIS)/DTS, Ab Initio,   MS SQL Server, RDB, etc. 

One Response to “Data staging table / area design.”


Leave a Reply

Your email address will not be published. Required fields are marked *