Type II dimension has been popularized by R Kimball. it has become so popular that in any interview related to data warehouse, the interviewer will surely ask to explain the concepts. And chances are that if you don’t know; they will laugh at your ignorance and reject you.
Here’s your chance to laugh at them..
If you read this article, probably you will end up knowing something more than them. This is not because you will find the definition of type II dimension, but for an entirely different reason.
To be continued…
To clearly explain the pitfalls of Type II dimension, let’s take an example. In the example, there are three tables, that is, DIM_Instrument, FACT_Trade, and FACT_Settlement. Each of the tables contains data as shown below:
DIM_Instrument table
In the Dim_Instrument table, the property of instrument IBM changes from X to Y. So to maintain type II dimensions, a new entry is added in the table by updating the status of current entry to obsolete (denoted by ‘O’) and adding date in the TODT column as well. In the new entry, the ToDT column is NULL and the status is current (denoted by ‘C’).
FACT_Trade table
The trade table contains information about just one trade and that was executed on April 29th, 2011, which means it was processed for InstrumentKey ‘1’ as Instrumentkey ‘2’ did not exist on April 29th, 2011.
FACT_Settlement table
Generally, it takes three days for a trade to settle. So the trade that was executed on 29th April 2011, got settled only on 2-MAY-2011. During this period the property of instrument ‘IBM’ changed on May 1st 2011, a new entry was made in the DIM_Instrument table for Instrument IBM, which incremented the instrumentKey to 2.
Now in the settlement table, the instrument key against the same trade is different, which can cause the issues/concerns that are described using various scenarios.
Scenario 1:
Get data for each trade and settlement for the current instrument, using the following query:
SELECT T.INSTRUMENTKEY, T.TradeDate as Sec_Date
From
DIM_INSTRUMENT I, FACT_TRADE T
WHERE
T.InstrumentKey=I.InstrumentKey
and I.Status=’C’
UNION ALL
SELECT S.INSTRUMENTKEY, S.SettlementDt as Sec_Date
From
DIM_INSTRUMENT I, FACT_SETTLEMENT S
WHERE
S.InstrumentKey=I.InstrumentKey
and I.Status=’C’;
The output does not show any data from Fact_Trade table. Because the current record in Dim_instrument is with InstrumentKey as ‘2’ and the Fact_Trade table does not contain InstrumentKey ‘2’.
It happens because the instrumentkey is changed between trade execution and trade settlement. Though, the settlement is done against the trade processed on April 29th, 2011 but with the change in InstrumentKey in between, there is no way to retrieve it.
Scenario 2:
Rather than querying data for current instrument, get data for all the instrument records. You can do so by using the following query:
SELECT T.INSTRUMENTKEY, T.TradeDate as Sec_Date
From
DIM_INSTRUMENT I, FACT_TRADE T
WHERE
T.InstrumentKey=I.InstrumentKey
UNION ALL
SELECT S.INSTRUMENTKEY, S.SettlementDt as Sec_Date
From
DIM_INSTRUMENT I, FACT_SETTLEMENT S
WHERE
S.InstrumentKey=I.InstrumentKey
The above query returns the following output:
If you analyze the output closely, it returns the trade information for InstrumentKey ‘1’ but no settlement information for InstrumentKey ‘1’. Similarly for InstrumentKey ‘2’ there is no trade information but the settlement information. This output can be best viewed as the following:
For an instrument when trade exists, settlement does not exist and vice versa.
Scenario 3:
Maintain a relationship between two fact tables by maintaining a common key (i.e. TradeID) and join both the tables based on this common key. In this case, the TradeID uniquely identifies the trade and settlement data that refer to the same trade.
But this scenario results in fact-to-fact relationship, which is not recommended because of the following reasons:
- Fact tables being the largest tables adversely impact the performance of the join.
- It is not a practical scenario when multiple fact tables need to be joined because doing so complicates the design and is not always possible to achieve.
All the above scenarios highlight various pitfalls of Type II dimension. All the scenarios and examples used throughout this article are purely for demonstrating the highlighted pitfalls.
not able to find complete article.
I am looking forward to develop a migration tool….i.e converting jobs from informatica to datastage! Just want to know whther such kind of tools are available in market?
Pls complete this incomplete article.. 🙁
What are your thoughts after reading the complete article?
The problem mentioned is very real life. Even I faced this issue and I have a woork around for that. It is should be taken care in the data model. Add one PARENT_INSTRUMENT_KEY attribute in the DIM_INSTRUMENT table to track that both 1 and 2 are under the PARENT_INSTRUMENT_KEY. For eg. 1, A1; 2, A1… N, AN.
Friend can you explain what is Type II dimension? When we talk about type II dimension it is a concept woven around the with the help of tools. As Sourav mentioned it can be corrected at data model.
Problem u mentioned is clearly data model issue not concepts. What do u say?
Sai, I’m glad you mentioned this. If we look carefully, this article tried to bring into notice the pitfalls of Type II dimensions. As mentioned earlier, type II requires the primary key change for the same identity to maintain the history. Once the primary key changes, we can very well imagine what kind of results it can produce.
As far as the solution is concerned, it can be implemented the way you want. It should not be assumed that there is no solution to the issues reported here. In fact, the solution for this has to be implemented at the Data Model level.
Sometimes the problem and the solution are so closely woven that we prefer not to look at them separately.
Hope it clarifies…
hi,
what is the dif b/w the etl developer and informatica deveoper
Damu,
ETL is a concept to move , copy or transform data from one system to another for various reasons. Informatica is a tool that implements the concept. There are many other tools like Informatica. E.g. Data Stage, Abinito, etc. You can also implement ETL without tools using PLSQL, Java, Perl…and so on.
—Sandesh
ETLGuru.com
The scenarios in this article are data modeling pitfalls.
In your example, fact tables Fact_Trade and Fact_Settlement do have clear and unumbigious relationship – Trade MUST be settled and Settlement fact cannot exist without trade. Fact_settlement is, for all practical purposes, a late arriving fact. And since it provides a measure for earlier occuring fact – trade, it should be utilizing dimensional attributes with effective dates matching the date of the record in Fact_Trade. In your example: Dim_Instrument columns FromDT and ToDT. That is the reason for effective dates columns in the type 2 dimension.
Further, both fact tables represent different parts of the same business process and have the same grain. As you correctly mentioned in scenario 3: “It is not a practical scenario when multiple fact tables need to be joined because doing so complicates the design and is not always possible to achieve.†Here is a design simplification: combine both fact tables into one. Add measures like Trade (default to 1) and Settlement (default to 0). That way, you can consistently reference the same dimensional key, easily count number of trades that have not been settled yet, and answer a lot of other questions without joining two fact tables.
Conclusion: Type 2 dimensions are solid and reliable way of tracking historical changes. If you have pitfalls in your star schema, start examining your data model.
Just a thought – We could do a lookup on DIM_Instrument to populate InstrumentKey in table FACT_Settlement using TradeDate instead of SettlmentDate from source tables. This will populate correct InstrumentKey in FACT_Settlement and resolve the issue. For that we will have to join source tables for Trade and Settlement correctly and pull required fields.
Please provide your experts comments on same.
Thanks!!!