Archive for the ‘ETL Basics’ Category

Difference between Reference Data and Master Data

Thursday, June 19th, 2008

It is not unusual for people to use ‘Reference Data’ and ‘Master Data’ interchangeably without understanding the differences.
Let’s try to understand the differences with the example of a sales transaction.

A sales transaction contains information like:
Store,
Products Sold,
Sales Person,
Store Name,
Sales Date,
Customer,
Price,
Quantity,
etc.

Attributes from the above example can be separated into two types: factual (transactional) information and dimensional information.
Price and Quantity are measurable attributes of a transaction.
Store, Products Sold, Sales Person, Store Name, Sales Date, and Customer are dimensional attributes of a transaction.

We can see that the dimensional data is already embedded in the transaction, and with these dimensional attributes we can successfully complete the transaction. Dimensional data that directly participates in a transaction is master data.

But is the list of dimensional attributes in the transaction complete? 

Asking a few analytical questions can help us discover the answer.
     -What is the male to female ratio of customers making purchases at the store?
     -What type of products are customers buying? Ex: Electronics, Computers, Toys
     -What type of store is it?  Ex: Web store, Brick & Mortar, Telesales, Catalog Sales

The above questions cannot be answered by the attributes in the transaction; this dimensional data is missing from the transaction. The dimensional data that does not directly participate in the transaction, but consists of attributes of the dimension, is reference data.
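A rough sketch in Python (the field names and values below are invented for illustration, not taken from any particular system) can make the split concrete: the transaction carries the master data it needs to complete, while the reference data sits on the dimension and is only needed for analysis.

```python
# Illustrative only: master data participates directly in the transaction,
# reference data describes the dimension but never appears in the transaction.

# Master data: the customer as the transaction refers to it.
customer_master = {"customer_id": 1001, "customer_name": "John Doe"}

# Reference data: dimension attributes the transaction never carries,
# but that analytical questions (male/female ratio, store type) need.
customer_reference = {"customer_id": 1001, "gender": "M", "customer_type": "Retail"}

# The transaction completes with master data alone.
sales_transaction = {
    "store_id": 42,
    "product_id": "SKU-9001",
    "sales_person_id": 7,
    "customer_id": customer_master["customer_id"],
    "sales_date": "2008-06-19",
    "price": 19.99,
    "quantity": 3,
}

# Answering "what is the male to female ratio?" means joining the transaction
# to the reference data through the customer dimension.
is_male = customer_reference["gender"] == "M"
print(is_male)  # True
```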

Why is it important for an ETL person to understand the differences? Well, ‘Reference Data Management’ (RDM) was the popular term for a while, and then suddenly in the last few years there is this new term ‘Master Data Management’ (MDM). These terms mean different things, and the difference has significant implications for how the data is managed. But that will be a topic of discussion for some future post! I hope this article will help clear at least some of the confusion.

 

 

Lazy links to ETL basics.

Thursday, June 15th, 2006

OK, check this on Wikipedia:

http://web.archive.org/web/20070523031241/http://en.wikipedia.org/wiki/ETL

I will add more.

What ETL is not?

Wednesday, May 10th, 2006

ETL should not be confused with a data creation process. It never creates new data. If a list of a hundred employees is being loaded, one more employee cannot be added to the list to make it a hundred and one. Nor, if the last name of a customer is absent, can an arbitrary last name be substituted.

Data warehouses are not OLTP systems. Duplicating calculations in both the source system and the data warehouse should not be attempted, as the numbers will be very difficult to match during QA. Also, the process in the source system can change in the future, which will result in asynchronous data.

ETL cannot change the meaning of data. For example, for sex the source system uses ‘M’ and ‘F’, while the data warehouse uses the flags ‘1’ and ‘2’ respectively. This is OK because it does not change the business meaning of the data; it only changes the representation of the data.
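A tiny Python sketch of that kind of representation-only change (the mapping values come straight from the example above; the function name is just for illustration):

```python
# Change of representation, not of meaning: 'M'/'F' in the source system
# become the warehouse flags '1'/'2'.
SEX_FLAG_MAP = {"M": "1", "F": "2"}

def to_warehouse_sex_flag(source_value: str) -> str:
    """Map the source system's sex code to the warehouse representation."""
    return SEX_FLAG_MAP[source_value]

print(to_warehouse_sex_flag("M"))  # '1'
print(to_warehouse_sex_flag("F"))  # '2'
```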

What Is ETL?

Monday, April 24th, 2006

ETL is the automated and auditable data acquisition process from source systems that involves one or more sub-processes of data extraction, data transportation, data transformation, data consolidation, data integration, data loading and data cleaning.

Discussion:

A Source System can be any application or data store that creates or stores data and acts as a data source to other systems. If data is created in such a source system, it can just as well be called a Data Factory.

Automation is critical, or the very purpose of ETL will be defeated. ETL is no good if its processes need to be manually scheduled, executed or monitored.

Data quality is a critical factor for the success of data warehousing projects. If data is of inadequate quality, then the knowledge workers who query the data warehouse and the decision makers who receive the information cannot trust the results. In order to obtain clean and reliable data, it is imperative to focus on data quality. While many data warehouse projects do take data quality into consideration, it is often treated as a delayed afterthought. Even QA after ETL is not good enough; the quality process needs to be incorporated into the ETL process itself. Data quality has to be maintained for individual records, or even small bits of information, to ensure the accuracy of the complete database. One rotten tomato can ruin the whole basket. It has to be ensured that the data is consistently updated to represent the source system. The accuracy of each individual element of data is important, and so is the integrity of the data across various pieces of information.
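One possible sketch of building the quality gate into the ETL flow itself (the field names and rules below are hypothetical): records are checked as they pass through and diverted to a reject area instead of quietly poisoning the load.

```python
# Hypothetical quality gate inside the ETL flow; field names are illustrative.
def record_problems(record: dict) -> list:
    """Return a list of quality problems; an empty list means the record is clean."""
    problems = []
    if not record.get("customer_id"):
        problems.append("missing customer_id")
    if record.get("quantity", 0) <= 0:
        problems.append("non-positive quantity")
    if record.get("price") is None:
        problems.append("missing price")
    return problems

incoming = [
    {"customer_id": 1001, "quantity": 3, "price": 19.99},
    {"customer_id": None, "quantity": 1, "price": 5.00},  # the rotten tomato
]

clean, rejected = [], []
for rec in incoming:
    (rejected if record_problems(rec) else clean).append(rec)

print(len(clean), "clean,", len(rejected), "rejected")  # 1 clean, 1 rejected
```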

Extraction is the first major step in the physical implementation of ETL. Extraction initiates or triggers the further downstream processes.

Needless to say, once data is extracted it has to be hauled and transported to the target, because the physical location of the source system might be different from that of the target warehouse. Many aspects like security, encryption, bandwidth, and transportation time also have to be considered, which will be discussed in upcoming chapters.

Transformation is the series of tasks that prepares the data for loading into the warehouse. Once data is secured, you have to worry about its format or structure, because it will not be in the format needed for the target; for example, the grain level or the data types might be different. Data cannot be used as it is. Some rules and functions need to be applied to transform the data. There are many types of transformation that occur on the incoming data, which will be discussed later on.
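For instance, two very common transformations are casting data types and changing the grain. A minimal Python sketch (the row layout is invented for illustration):

```python
# Hypothetical sketch of simple transformations: casting data types and
# rolling line-level rows up to a daily grain. Field names are illustrative.
from collections import defaultdict
from datetime import date

source_rows = [
    {"sales_date": "2006-04-24", "store": "01", "amount": "19.99"},
    {"sales_date": "2006-04-24", "store": "01", "amount": "5.01"},
]

# Data type transformation: strings from the source become typed values.
typed_rows = [
    {
        "sales_date": date.fromisoformat(r["sales_date"]),
        "store": int(r["store"]),
        "amount": float(r["amount"]),
    }
    for r in source_rows
]

# Grain transformation: line items rolled up to one row per store per day.
daily_totals = defaultdict(float)
for r in typed_rows:
    daily_totals[(r["store"], r["sales_date"])] += r["amount"]

print(dict(daily_totals))  # {(1, datetime.date(2006, 4, 24)): 25.0}
```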

One of the purposes of ETL is to consolidate the data into a central repository, or to bring it to one logical or physical place. Data can be consolidated from similar systems, different subject areas, etc.

ETL must support data integration for data coming from multiple sources and data arriving at different times. This has to be a seamless operation. It will avoid overwriting existing data, creating duplicate data, or, even worse, simply being unable to load the data into the target.

The loading part of the process is critical to integration and consolidation. The loading process decides the modality of how the data is added to the warehouse, or simply rejected. Methods like insertion, updating or deletion are executed at this step. What happens to the existing data? Should the old data be deleted because of the new information? Or should the data be archived? Should the data be treated as additional to the existing data?
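A bare-bones Python sketch of that decision (the keys and rows are made up), where each incoming row either updates an existing row on its business key or is added as new data:

```python
# Illustrative load step: insert new rows, update existing ones on the
# business key, and leave everything else untouched.
warehouse = {  # keyed by a business key; contents are illustrative
    "CUST-1001": {"name": "John Doe", "city": "Stamford"},
}

incoming = [
    {"key": "CUST-1001", "name": "John Doe", "city": "Norwalk"},   # changed row
    {"key": "CUST-1002", "name": "Jane Roe", "city": "Hartford"},  # new row
]

for row in incoming:
    key = row.pop("key")
    if key in warehouse:
        warehouse[key].update(row)   # update existing data instead of duplicating it
    else:
        warehouse[key] = row         # add as new data

print(warehouse)
```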

The incoming data might not be perfect. New York City might be shown in India, or the Taj Mahal in France. Data has to be cleaned before it goes into the database.
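One simple way to picture such a cleaning rule is a hypothetical Python check against a trusted reference list of city/country pairs:

```python
# Hypothetical cleaning rule: city/country pairs are checked against a
# trusted reference list before the row reaches the database.
VALID_CITY_COUNTRY = {("New York City", "USA"), ("Agra", "India")}

def location_is_plausible(city: str, country: str) -> bool:
    """Return True if the city/country combination is plausible."""
    return (city, country) in VALID_CITY_COUNTRY

print(location_is_plausible("New York City", "USA"))    # True
print(location_is_plausible("New York City", "India"))  # False -> needs cleaning
```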

Data can be loaded with lots of care, but does that mean the data loaded in the warehouse is correct? What is the confidence level in the data? Only a data auditing process can establish the confidence level. This auditing process normally happens after the loading of the data.
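As an illustration of the idea (the counts below are invented), a very basic audit after the load simply reconciles what was extracted against what was loaded and rejected:

```python
# Hypothetical post-load audit: reconcile what the source sent with what the
# warehouse actually holds to establish a confidence level.
source_row_count = 1_000      # e.g. count of rows in the extract
loaded_row_count = 998        # e.g. count of rows in the target table
rejected_row_count = 2        # rows diverted to an error/reject area

if source_row_count == loaded_row_count + rejected_row_count:
    print("Audit passed: every extracted row is accounted for.")
else:
    print("Audit failed: rows are missing or duplicated somewhere in the flow.")
```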

Data Derivation is the process of adding derived data values from reference or transactional data using simple string or mathematical functions. Example: Income – Expense = Profit; NY = New York. Ideally, though, no business rules should be involved, and complex or time-variant calculations should be avoided.
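Both examples translate into trivial code. A hypothetical Python sketch:

```python
# Simple derivations: arithmetic and string lookups on data already present,
# with no business rules involved.
STATE_CODES = {"NY": "New York", "CT": "Connecticut"}

def derive_profit(income: float, expense: float) -> float:
    return income - expense            # Income - Expense = Profit

print(derive_profit(100.0, 60.0))      # 40.0
print(STATE_CODES["NY"])               # New York
```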

Data enrichment is adding or combining external data values and rules to enrich the information already existing in the data. Example: if we can get a list that provides the relationship between Zip Code, City and State, then an address field with Zip code 06905 can be safely enriched by doing a lookup on this table: Zip Code 06905 –> City Stamford –> State CT.
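A minimal Python sketch of that lookup (the reference table here holds just the one example row):

```python
# Data enrichment: an external Zip-Code-to-City/State list fills in address
# fields the source record does not carry.
ZIP_LOOKUP = {"06905": {"city": "Stamford", "state": "CT"}}

address = {"zip_code": "06905"}                           # incoming record
address.update(ZIP_LOOKUP.get(address["zip_code"], {}))   # enrich via lookup

print(address)  # {'zip_code': '06905', 'city': 'Stamford', 'state': 'CT'}
```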