Archive for the ‘Data Architecture’ Category

Pitfalls of type II dimension

Tuesday, September 15th, 2009

The Type II dimension was popularized by Ralph Kimball. It has become so popular that in almost any data warehouse interview, the interviewer will ask you to explain the concept. Chances are that if you don’t know it, they will laugh at your ignorance and reject you.
Here’s your chance to laugh at them.
If you read this article, you will probably end up knowing more than they do. This is not because you will find the definition of a Type II dimension here, but for an entirely different reason.

To be continued…

Difference between Reference Data and Master Data

Thursday, June 19th, 2008

It is not unusual for people to use ‘Reference Data’ and ‘Master Data’ interchangeably without understanding the differences.
Let’s try to understand the differences using the example of a sales transaction.

A sales transaction contains information such as:
Store,
Products Sold,
Sales Person,
Store Name,
Sales Date,
Customer,
Price,
Quantity,
etc.

Attributes from the above example can be separated into two types: factual (transactional) and dimensional information.
Price and Quantity are measurable attributes of a transaction.
Store, Products Sold, Sales Person, Store Name, Sales Date, and Customer are dimensional attributes of a transaction.

We can see that the dimensional data is already embedded in the transaction, and with these dimensional attributes we can successfully complete the transaction. Dimensional data that directly participates in a transaction is master data.

But is the list of dimensional attributes in the transaction complete? 

Asking a few analytical questions can help us discover the answer.
     - What is the male-to-female ratio of customers making purchases at the store?
     - What types of products are customers buying? Ex: Electronics, Computers, Toys
     - What type of store is it? Ex: Web store, Brick & Mortar, Telesales, Catalog Sales

The above questions cannot be answered by the attributes in the transaction. This dimensional data is missing from the transaction. Dimensional data that does not directly participate in the transaction, but describes an attribute of a dimension, is reference data.
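The distinction can be sketched in a few lines of Python. This is only an illustration under assumptions: every identifier and value below (customer IDs, store IDs, the lookup dictionaries) is hypothetical sample data, not something from a real system. The transactions carry only master-data keys and measures, so the gender question can be answered only by joining to the reference attributes.

```python
# Hypothetical sample data; every identifier and value here is illustrative.

# Master data: dimensional entities that take part in the transaction itself.
customers = {"C1": "Ann", "C2": "Bob", "C3": "Cid"}
stores = {"S1": "Downtown"}

# Reference data (in this post's sense): attributes of those dimensions
# that never appear on the transaction.
customer_gender = {"C1": "F", "C2": "M", "C3": "F"}
store_type = {"S1": "Brick & Mortar"}

# Each transaction holds master-data keys plus the measures.
sales = [
    {"store": "S1", "customer": "C1", "price": 10.0, "quantity": 2},
    {"store": "S1", "customer": "C2", "price": 25.0, "quantity": 1},
    {"store": "S1", "customer": "C1", "price": 5.0, "quantity": 4},
    {"store": "S1", "customer": "C3", "price": 8.0, "quantity": 1},
]


def gender_counts(transactions):
    """Count distinct purchasing customers by gender via the reference lookup."""
    buyers = {t["customer"] for t in transactions}
    counts = {}
    for customer_id in buyers:
        gender = customer_gender[customer_id]
        counts[gender] = counts.get(gender, 0) + 1
    return counts


def revenue_by_store_type(transactions):
    """Total price * quantity per store type, again via a reference lookup."""
    totals = {}
    for t in transactions:
        kind = store_type[t["store"]]
        totals[kind] = totals.get(kind, 0.0) + t["price"] * t["quantity"]
    return totals


print(gender_counts(sales))
print(revenue_by_store_type(sales))
```

Neither function could be written against the `sales` list alone, which is the point of the paragraph above.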

Why is it important for an ETL person to understand the differences? ‘Reference Data Management’ (RDM) was once the popular term; then, in the last few years, a new term appeared: ‘Master Data Management’ (MDM). These terms mean different things, and the differences have significant implications for how the data is managed. But that will be a topic for a future post! I hope this article helps clear up at least some of the confusion.

 

 

Type II dimension: What is expiry date? Does it come from Source or is it ETL date?

Tuesday, February 6th, 2007

I assume everyone is aware of the Type II dimension concept. In a Type II dimension, when we get an updated row we retire the old row and insert a new one. But what do you populate in the from_date and to_date attributes: the ETL create/update datetime, or the date when that particular row was retired in the source system?

The question can be asked in a different way. When an employee moves from one department to another, and you populate the Type II dimension EMPLOYEE table in the data warehouse, do you use the actual date of the transfer or the ETL date on which you populate the table?

What are you doing in your ETL code?
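To make the two options concrete, here is a minimal Python sketch of one possible approach, not necessarily what any particular team does: keep the source's transfer date in from_date/to_date and store the ETL date in a separate audit column. The `EmployeeRow` class, the `etl_load_date` column, and both helper functions are hypothetical names introduced for illustration, and the sketch assumes closed date ranges where the retired row ends the day before the transfer.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class EmployeeRow:
    employee_id: int
    department: str
    from_date: date            # business date taken from the source system
    to_date: Optional[date]    # None means the row is still current
    etl_load_date: date        # audit column: when the ETL wrote the row


def apply_transfer(rows: List[EmployeeRow], employee_id: int,
                   new_department: str, transfer_date: date,
                   etl_date: date) -> None:
    """Retire the current row and insert a new one, Type II style."""
    for row in rows:
        if row.employee_id == employee_id and row.to_date is None:
            # Assumption: the old row is valid through the day before the transfer.
            row.to_date = transfer_date - timedelta(days=1)
    rows.append(EmployeeRow(employee_id, new_department,
                            transfer_date, None, etl_date))


def department_on(rows: List[EmployeeRow], employee_id: int,
                  day: date) -> Optional[str]:
    """Return the department that was valid for the employee on the given day."""
    for row in rows:
        if (row.employee_id == employee_id and row.from_date <= day
                and (row.to_date is None or day <= row.to_date)):
            return row.department
    return None


rows = [EmployeeRow(1, "Sales", date(2006, 1, 1), None, date(2006, 1, 2))]
# The transfer took effect on Feb 1 but the ETL only ran on Feb 6.
apply_transfer(rows, 1, "Finance", date(2007, 2, 1), date(2007, 2, 6))
print(department_on(rows, 1, date(2007, 2, 3)))
```

With the source date in the range columns, a query for Feb 3 reflects the transfer even though the warehouse learned about it later. Had the ETL date been used instead, the same query would still return the old department.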

Hard references to database objects, Synonyms and ETL

Wednesday, September 6th, 2006

1. Any time a hard reference is made to a database object, it is an invitation to trouble, especially in ETL, where the object (table/view/procedure) can belong to any schema.

2. Oracle has the hierarchy: Database server name –> Schema –> Object

3. SCHEMA_NAME.TABLE_NAME is a bad way of referencing objects, from both the ETL architecture and the data warehouse architecture point of view. The reason is that any hard-coded object reference will cause trouble when the object is moved from one schema to another.

4. An alternative could be to log in as the same SCHEMA_NAME in which the objects exist. But this creates a security hazard, as the ETL processes would then be in the same schema as the data warehouse structures and processes.

5. Referring to an object directly can also be an issue, because if a table name or column name is changed it forces an immediate code change.

6. It can be argued that this is not really an issue, just a little more work. But what most people don’t understand is that the problem is not how much work it is, but when the work comes up: most likely when you are migrating, during a production run, or when the team is really short of time.

7. Hard coded references also reduce flexibility… what if ETL wants to read data from some other schema to test an ETL process? What then?

Ok! What’s the solution then… 

hardcodedref.JPG

 

A better way would be to have a dedicated schema for the ETL user, with all the objects (tables/views/procedures) referenced through public synonyms. This way any object can be replaced by a new reference, temporary object, view, or private synonym without ever changing the code. It is a much better way and has many more advantages, but of course there will be more maintenance.
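The indirection can be sketched with Python's built-in sqlite3 module. This is an analogy under assumptions, not Oracle itself: SQLite has no synonyms, so attached in-memory databases stand in for schemas and a temporary view stands in for the synonym. The schema names `dw` and `dw_test`, the `customer` table, and the `point_synonym` helper are all hypothetical.

```python
import sqlite3

# Autocommit mode, so ATTACH is never issued inside an open transaction.
conn = sqlite3.connect(":memory:", isolation_level=None)

# Two attached in-memory databases stand in for two schemas.
conn.execute("ATTACH DATABASE ':memory:' AS dw")
conn.execute("ATTACH DATABASE ':memory:' AS dw_test")
conn.execute("CREATE TABLE dw.customer (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE dw_test.customer (id INTEGER, name TEXT)")
conn.execute("INSERT INTO dw.customer VALUES (1, 'production row')")
conn.execute("INSERT INTO dw_test.customer VALUES (1, 'test row')")


def point_synonym(name, schema):
    """Repoint the unqualified name at schema.name (a stand-in for a synonym).

    Names are interpolated directly, so pass only trusted identifiers.
    """
    conn.execute(f"DROP VIEW IF EXISTS temp.{name}")
    conn.execute(f"CREATE TEMP VIEW {name} AS SELECT * FROM {schema}.{name}")


# The ETL statement never mentions a schema.
ETL_QUERY = "SELECT name FROM customer WHERE id = 1"


def run_etl_query():
    return conn.execute(ETL_QUERY).fetchone()[0]


point_synonym("customer", "dw")
print(run_etl_query())

# Redirect the same ETL code to the test schema without touching ETL_QUERY.
point_synonym("customer", "dw_test")
print(run_etl_query())
```

In Oracle the equivalent move would be to drop and recreate the synonym, while the ETL code stays exactly as it was.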

NOTE: This post is applicable to all ETL tools and databases, such as Informatica, DataStage, Syncsort DMExpress, Sunopsis, Oracle, Sybase, SQL Server Integration Services (SSIS)/DTS, Ab Initio, MS SQL Server, RDB, etc.

Loading & testing fact/transactional/balances (data), which is valid between dates!

Tuesday, July 25th, 2006
This is going to be a very interesting topic for ETL developers and data modelers who design processes/tables to load fact or transactional data that keeps changing between dates, e.g. share prices, company ratings, etc.

variable_bond_interest.JPG

The table above shows an entity in the source system that contains time variant values but they don’t change daily. The values are valid over a period of time; then they change.

 

variable_bond_interest_fct1.JPG
 

1. What table structure should be used in the data warehouse?

Maybe Ralph Kimball or Bill Inmon can come up with a better data model! :-) But for ETL developers or ETL leads the decision is already made, so let’s look for a solution.

2. What should be the ETL design to load such a structure?

Design A

  • There is a one-to-one relationship between the source row and the target row.
  • There is a CURRENT_FLAG attribute, which means that every time the ETL process gets a new value it has to add a new row with the current flag set, then go back to the previous row and retire it. This is a very costly ETL step and will slow down the ETL process.
  • From the report writer’s point of view this model is a major challenge to use. What if the report needs a rate that is not current? Imagine the complex query.

Design B

  • In this design a snapshot of the source table is taken every day.
  • The ETL is very easy. But can you imagine the size of the fact table when the source table has more than 1 million rows? (1 million x 365 days = 365 million rows per year.) And what if the values change every few hours or minutes?
  • But you have a very happy user who can write SQL reports very easily.

Design C

  • Can there be a compromise? How about using from date (time) – to date (time)? The report writer can simply provide a date (time), and straight SQL can return the value/row that was valid at that moment.
  • However, the ETL is just as complex as in the A model: while the current row runs from the current date to infinity, the previous row has to be retired so that it runs from its from date to today’s date - 1.
  • This kind of ETL coding also creates lots of testing issues, as you want to make sure that for any given date and time only one instance of the row exists (for the primary key).
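Design C can be sketched with Python's built-in sqlite3 module. Treat this as a rough illustration under assumptions: the table name `bond_rate_fct`, its columns, and the `HIGH_DATE` sentinel standing in for "infinity" are hypothetical, dates are stored as ISO strings so that they compare correctly as text, and new values are assumed to arrive in chronological order.

```python
import sqlite3
from datetime import date, timedelta

HIGH_DATE = "9999-12-31"  # assumed stand-in for "infinity"

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE bond_rate_fct (
        bond_id   TEXT NOT NULL,
        rate      REAL NOT NULL,
        from_date TEXT NOT NULL,
        to_date   TEXT NOT NULL,
        PRIMARY KEY (bond_id, from_date)
    )""")


def load_rate(bond_id, rate, effective):
    """Retire the open row for bond_id, then insert the new open-ended row."""
    day_before = (effective - timedelta(days=1)).isoformat()
    # Costly step: find the previous current row and close its range.
    conn.execute(
        "UPDATE bond_rate_fct SET to_date = ? "
        "WHERE bond_id = ? AND to_date = ?",
        (day_before, bond_id, HIGH_DATE))
    conn.execute(
        "INSERT INTO bond_rate_fct VALUES (?, ?, ?, ?)",
        (bond_id, rate, effective.isoformat(), HIGH_DATE))


def rate_on(bond_id, day):
    """Report-side query: the rate that was valid on the given day, or None."""
    row = conn.execute(
        "SELECT rate FROM bond_rate_fct "
        "WHERE bond_id = ? AND ? BETWEEN from_date AND to_date",
        (bond_id, day.isoformat())).fetchone()
    return row[0] if row else None


load_rate("B1", 5.0, date(2006, 1, 1))
load_rate("B1", 5.5, date(2006, 4, 1))
print(rate_on("B1", date(2006, 3, 31)))
print(rate_on("B1", date(2006, 4, 1)))
```

The report query is a single BETWEEN predicate, which is the appeal of this design. The cost sits in `load_rate`, which has to update the previous row on every change.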

Which design is better? I have used all three, depending on the situation.

3. What should be the unit test plan?

 There are various cases the ETL can miss, and your test plan should aim to test precisely those. Here are some examples of test cases:

a. There should be only one value for a given date/date time 

b. During the initial load, when data is available for multiple days, the process should run sequentially and create the snapshots/ranges correctly.

c. At any given time there should be only one current row.

d. etc   
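Test cases a and c can be written as queries that return the offending rows, so an empty result means the check passed. The sketch below uses Python's sqlite3 module and assumes a hypothetical `bond_rate_fct` table with closed from_date/to_date ranges stored as ISO date strings and a `HIGH_DATE` sentinel marking the current row. It also assumes every key is expected to have exactly one current row, which may not hold if instruments can be retired outright.

```python
import sqlite3

HIGH_DATE = "9999-12-31"  # assumed marker for the current (open-ended) row


def make_table(rows):
    """Build an in-memory copy of the hypothetical fact table for testing."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE bond_rate_fct "
        "(bond_id TEXT, rate REAL, from_date TEXT, to_date TEXT)")
    conn.executemany("INSERT INTO bond_rate_fct VALUES (?, ?, ?, ?)", rows)
    return conn


def overlapping_ranges(conn):
    """Test a: pairs of rows for the same key whose date ranges overlap."""
    return conn.execute("""
        SELECT a.bond_id, a.from_date, b.from_date
        FROM bond_rate_fct a
        JOIN bond_rate_fct b
          ON a.bond_id = b.bond_id
         AND a.rowid < b.rowid
         AND a.from_date <= b.to_date
         AND b.from_date <= a.to_date
    """).fetchall()


def keys_without_single_current_row(conn):
    """Test c: keys whose count of open-ended rows is not exactly one."""
    return conn.execute("""
        SELECT bond_id, SUM(to_date = ?) AS current_rows
        FROM bond_rate_fct
        GROUP BY bond_id
        HAVING SUM(to_date = ?) <> 1
    """, (HIGH_DATE, HIGH_DATE)).fetchall()


good = [("B1", 5.0, "2006-01-01", "2006-03-31"),
        ("B1", 5.5, "2006-04-01", HIGH_DATE)]
bad = [("B1", 5.0, "2006-01-01", "2006-04-15"),   # not retired correctly
       ("B1", 5.5, "2006-04-01", HIGH_DATE),
       ("B1", 5.7, "2006-06-01", HIGH_DATE)]      # second current row

print(overlapping_ranges(make_table(good)))
print(keys_without_single_current_row(make_table(bad)))
```

Test case b could reuse the same two checks after replaying several days of source data through the load process in order.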
