Archive for the ‘Data Modeling’ Category

Type II Dimension or a Fake Fact Table!

Tuesday, January 10th, 2012

You are the sales manager of the North-East region from 2010 to Oct 23, 2011, and did total sales of 1 million. The table that stores your information looks like this:
Dim Sales Person
Sales Person Key =99
Sales Person Name = Tony Peterson
Sales Person Designation = Sales Manager
Sales Region = North-East
Phone = 203 203 9999
Date Of Birth = 9/9/1977

Sales fact
Transaction Date Key
Sales Person Key
Product Key
Product Qty
Product $Amount

As of Oct 23, 2011, management asks for the sales amount by sales person and region. The result would look like: Tony Peterson, North-East, and 1 Million.
From Oct 24th, 2011 there is a change: you have been transferred to the South-East region. So some IT guy updated the Sales Region in the Sales Person table to South-East.
Sales Person
Sales Person Key =99
Sales Person Name = Tony Peterson
Sales Person Designation = Sales Manager
Sales Region = South-East
Phone = 203 203 9999
Date Of Birth = 9/9/1977

As you do your sales over the next few months, you did sales of 0.4 million in the South-East. In Jan 2012 management again asked for the sales figures by employee and region. What will the report look like?
Tony Peterson, South-East, and 1.4 Million.
Is the report correct? The answer is obvious: the total sales figure for the South-East is incorrect, as it shows 1.4 million. You only did 0.4 million in the South-East, not 1.4. It is incorrect because somebody updated the dimension record with the current region.

Now that we know the problem and its exact cause, we need to look for a solution. By this time most of you will say: this is no big deal, it's a classic case of a slowly changing dimension. Let me create a Type II dimension so I can preserve the history of the transfer.
Dim Sales Person II (record 1 => record 2)
Sales Person Key = 99 => 100
Sales Person SSN = 000-00-9999 => 000-00-9999
Sales Person Name = Tony Peterson => Tony Peterson
Sales Person Designation = Sales Manager => Sales Manager
Sales Region = North-East => South-East
Phone = 203 203 9999 => 203 203 9999
Date Of Birth = 9/9/1977 => 9/9/1977

Now, since there are two records for the same sales person, you can join them with the sales facts and separate the regions. The correct figures are as follows:
Tony Peterson, North-East, and 1.0 Million
Tony Peterson, South-East, and 0.4 Million
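
As a minimal sketch of that join (the table and column names below are hypothetical, based on the structures shown above), the aggregation now splits cleanly by region because the North-East fact rows carry Sales Person Key 99 and the South-East rows carry key 100:

SELECT d.Sales_Person_Name,
       d.Sales_Region,
       SUM(f.Product_Amount) AS Total_Sales
FROM   Sales_Fact f
JOIN   Dim_Sales_Person d
  ON   f.Sales_Person_Key = d.Sales_Person_Key
GROUP BY d.Sales_Person_Name, d.Sales_Region;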

Observe the Type II dimension carefully. You will note that
– The key identifier (PK) for the sales person keeps changing. Note it changed from 99 to 100.
– Instead of having one row per salesman, it now has two, or even more over time.
– The Sales Region is a time-variant fact about the sales person.
– We are indeed storing time-variant facts in a Type II dimension.
So a Type II dimension is really more of a fact table. Is storing time-variant facts there a good design at all? Why not store the facts in a fact table? Read the other articles on the pitfalls of the Type II dimension.

Pitfalls of type II dimension

Tuesday, September 15th, 2009

The Type II dimension has been popularized by R. Kimball. It has become so popular that in any interview related to data warehousing, the interviewer will surely ask you to explain the concept. And chances are that if you don't know it, they will laugh at your ignorance and reject you.
Here's your chance to laugh at them.
If you read this article, you will probably end up knowing something more than them. This is not because you will find the definition of a Type II dimension here, but for an entirely different reason.

To be continued…

To clearly explain the pitfalls of the Type II dimension, let's take an example. In the example there are three tables: DIM_Instrument, FACT_Trade, and FACT_Settlement. Each table contains data as shown below:

[Image: DIM_Instrument table]

In the DIM_Instrument table, the property of the instrument IBM changes from X to Y. So, to maintain a Type II dimension, a new entry is added to the table: the status of the current entry is updated to obsolete (denoted by 'O') and an end date is written to its ToDT column, while in the new entry the ToDT column is NULL and the status is current (denoted by 'C').
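
A minimal sketch of that maintenance step, assuming the column names implied by the description (InstrumentKey, Instrument, Property, Status, FromDT, ToDT):

-- Retire the current record for IBM
UPDATE DIM_Instrument
SET    Status = 'O',
       ToDT   = DATE '2011-05-01'
WHERE  Instrument = 'IBM'
AND    Status = 'C';

-- Insert the new version with the changed property and the next surrogate key
INSERT INTO DIM_Instrument
       (InstrumentKey, Instrument, Property, Status, FromDT, ToDT)
VALUES (2, 'IBM', 'Y', 'C', DATE '2011-05-01', NULL);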

[Image: FACT_Trade table]

The trade table contains information about just one trade, executed on April 29th, 2011, which means it was recorded against InstrumentKey '1', since InstrumentKey '2' did not exist on April 29th, 2011.

[Image: FACT_Settlement table]

Generally, it takes three days for a trade to settle. So the trade that was executed on 29th April 2011 got settled only on 2-May-2011. During this period the property of the instrument 'IBM' changed (on May 1st, 2011), so a new entry was made in the DIM_Instrument table for IBM, which incremented the InstrumentKey to 2.

Now the settlement row for the same trade carries a different instrument key, which causes the issues/concerns described in the following scenarios.

Scenario 1:
Get data for each trade and settlement for the current instrument, using the following query:

SELECT T.InstrumentKey, T.TradeDate AS Sec_Date
FROM   DIM_Instrument I, FACT_Trade T
WHERE  T.InstrumentKey = I.InstrumentKey
AND    I.Status = 'C'
UNION ALL
SELECT S.InstrumentKey, S.SettlementDt AS Sec_Date
FROM   DIM_Instrument I, FACT_Settlement S
WHERE  S.InstrumentKey = I.InstrumentKey
AND    I.Status = 'C';

[Image: Scenario 1 query output]

The output does not show any data from the FACT_Trade table, because the current record in DIM_Instrument has InstrumentKey '2' and the FACT_Trade table does not contain InstrumentKey '2'.

This happens because the InstrumentKey changed between trade execution and trade settlement. Even though the settlement is made against the trade processed on April 29th, 2011, with the change in InstrumentKey in between there is no way to retrieve it.

Scenario 2:

Rather than querying data for the current instrument, get data for all instrument records. You can do so by using the following query:

SELECT T.InstrumentKey, T.TradeDate AS Sec_Date
FROM   DIM_Instrument I, FACT_Trade T
WHERE  T.InstrumentKey = I.InstrumentKey
UNION ALL
SELECT S.InstrumentKey, S.SettlementDt AS Sec_Date
FROM   DIM_Instrument I, FACT_Settlement S
WHERE  S.InstrumentKey = I.InstrumentKey;

The above query returns the following output:
[Image: Scenario 2 query output]

If you analyze the output closely, it returns trade information for InstrumentKey '1' but no settlement information for InstrumentKey '1'. Similarly, for InstrumentKey '2' there is no trade information, only settlement information. This output is best viewed as follows:
[Image: Scenario 2 output, pivoted by instrument key]

For a given instrument key, where the trade exists the settlement does not, and vice versa.

Scenario 3:
Maintain a relationship between the two fact tables through a common key (i.e., TradeID) and join the tables on that key. In this case, the TradeID uniquely identifies the trade and settlement data that refer to the same trade.
But this scenario results in a fact-to-fact relationship, which is not recommended for the following reasons:

  • Fact tables being the largest tables adversely impact the performance of the join.
  • It is not a practical scenario when multiple fact tables need to be joined because doing so complicates the design and is not always possible to achieve.
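
A minimal sketch of such a fact-to-fact join, assuming both fact tables carry the TradeID:

SELECT T.TradeID,
       T.InstrumentKey AS Trade_InstrumentKey,
       T.TradeDate,
       S.InstrumentKey AS Settlement_InstrumentKey,
       S.SettlementDt
FROM   FACT_Trade T
JOIN   FACT_Settlement S
  ON   T.TradeID = S.TradeID;

This pairs each trade with its settlement regardless of the surrogate-key change, but only at the cost of joining the two largest tables directly.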

All the above scenarios highlight various pitfalls of Type II dimension. All the scenarios and examples used throughout this article are purely for demonstrating the highlighted pitfalls.

Loading & testing fact/transactional/balances (data), which is valid between dates!

Tuesday, July 25th, 2006

This is going to be a very interesting topic for ETL and data modelers who design processes/tables to load fact or transactional data that keeps changing between dates, e.g., share prices, company ratings, etc.

[Image: variable_bond_interest]

The table above shows an entity in the source system that contains time-variant values, but they don't change daily. The values are valid over a period of time; then they change.
[Image: variable_bond_interest_fct1]

1. What table structure should be used in the data warehouse?

Maybe Ralph Kimball or Bill Inmon can come up with a better data model! :-) But for ETL developers or ETL leads the decision is already made, so let's look for a solution.

2. What should be the ETL design to load such a structure?

Design A

  • There is a one-to-one relationship between the source row and the target row.
  • There is a CURRENT_FLAG attribute, which means every time the ETL process gets a new value it has to add a new row with the current flag, then go back to the previous row and retire it. This is a very costly ETL step and it will slow down the ETL process.
  • From the report writer's point of view this model is a major challenge to use, because what if the report wants a rate which is not current? Imagine the complex query.

Design B

  • In this design a snapshot of the source table is taken every day.
  • The ETL is very easy. But can you imagine the size of the fact table when the source table has more than 1 million rows? (1 million x 365 days = 365 million rows per year.) And what if the values change every hour or minute?
  • But you have a very happy user who can write SQL reports very easily.

Design C

  • Can there be a compromise? How about using a from date (time) and a to date (time)? The report writer can simply provide a date (time) and straight SQL can return the value/row that was valid at that moment (see the sketch after this list).
  • However, the ETL is as complex as in Design A, because while the current row runs from the current date to infinity, the previous row has to be retired from its from date to today's date minus 1.
  • This kind of ETL coding also creates lots of testing issues, as you want to make sure that for any given date and time only one instance of the row exists (for the primary key).
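
A minimal sketch of the Design C lookup, assuming a hypothetical BOND_INTEREST_FCT table with FROM_DT/TO_DT validity columns and a far-future TO_DT (e.g. 9999-12-31) on the current row:

-- Return the value that was valid at a given point in time
SELECT bond_id, interest_rate
FROM   bond_interest_fct
WHERE  bond_id = :bond_id
AND    :as_of_date BETWEEN from_dt AND to_dt;

-- Retire the previous row when a new value arrives
UPDATE bond_interest_fct
SET    to_dt = :new_from_dt - 1
WHERE  bond_id = :bond_id
AND    to_dt = DATE '9999-12-31';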

Which design is better? I have used all of them, depending on the situation.

3. What should be the unit test plan?

There are various cases where the ETL can go wrong, and when planning test cases your plan should be to test precisely those. Here are some examples of test cases:

a. There should be only one value for a given date/date time

b. During the initial load when the data is available for multiple days the process should go sequential and create snapshots/ranges correctly.

c. At any given time there should be only one current row.

d. etc
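
As a minimal sketch, assuming the hypothetical BOND_INTEREST_FCT table from the Design C example above, cases (a) and (c) can be written as queries that must return zero rows:

-- (a) Overlapping validity ranges: two different rows valid for the same bond on the same date
SELECT a.bond_id
FROM   bond_interest_fct a, bond_interest_fct b
WHERE  a.bond_id  = b.bond_id
AND    a.from_dt  < b.from_dt
AND    b.from_dt <= a.to_dt;

-- (c) More than one current (open-ended) row per bond
SELECT bond_id
FROM   bond_interest_fct
WHERE  to_dt = DATE '9999-12-31'
GROUP BY bond_id
HAVING COUNT(*) > 1;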

NOTE: This post is applicable to all etl tools or databases like Informatica, DataStage, Syncsort DMExpress, Sunopsis or Oracle, Sybase, SQL Server Integration Services (SSIS)/DTS, Ab Initio, MS SQL Server, RDB, etc.

Aggregate aware queries

Monday, June 19th, 2006

Imagine a scenario: a table having 20 million transactions over 10 years, and a user writing a query to get data aggregated at the monthly level. The best guess for the time required to return the result set, even after extensive database tuning, can be in multiples of 10 minutes or even an hour.

What is the solution?
There are three solutions based on the same concept of pre-aggregation, but the way each is used is quite different.
1. Write different queries in advance to hit different (pre-aggregated) tables.
2. Have a reporting tool which is aggregate aware (client side). The reporting tool itself decides which tables to hit and rewrites its own queries.
3. Have a database that supports transparent query rewrite. The database itself takes a simple query and rewrites it against aggregate tables.

Here’s an Oracle version of query re-write based on point 3 in this article.
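
As a minimal sketch of option 3 in Oracle (the table and column names below are hypothetical), a materialized view created with ENABLE QUERY REWRITE lets the optimizer transparently redirect a detail-level query to the aggregate:

CREATE MATERIALIZED VIEW mv_monthly_sales
BUILD IMMEDIATE
REFRESH ON DEMAND
ENABLE QUERY REWRITE
AS
SELECT TRUNC(txn_date, 'MM') AS txn_month,
       SUM(amount)           AS total_amount,
       COUNT(*)              AS txn_count
FROM   fact_transaction
GROUP BY TRUNC(txn_date, 'MM');

-- The user still writes the simple query against the detail table;
-- with query rewrite enabled, Oracle can answer it from mv_monthly_sales.
SELECT TRUNC(txn_date, 'MM') AS txn_month, SUM(amount) AS total_amount
FROM   fact_transaction
GROUP BY TRUNC(txn_date, 'MM');

Depending on the Oracle version, the QUERY_REWRITE_ENABLED parameter and the QUERY REWRITE privilege may also need to be in place.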

 More details to come………… 

Types of data elements and entities (tables) for ETL

Saturday, June 3rd, 2006

It is important for an ETL developer to understand the types of tables and data in order to intelligently design ETL processes. Once the common types of objects are understood, reusable ETL templates can be developed, regardless of business logic. This will greatly improve the efficiency of an ETL developer.

1. Reference data

2. Dimensional data (master data)

3. Transactional data

4. Transactions

5. Balances

6. Summary/Aggregations

7. Snapshots

8. Staging

9. Outriggers/mini dimensions

10. Log tables

11. Meta data tables

12. Security tables

13. Configuration tables