Archive for the ‘ETL Interface Architecture (TM)’ Category

How ETL-IA is beneficial to you?

Wednesday, July 20th, 2011

As the previous ETL-IA blog post mentions that an addition of a source, adds an additional ETL process. It means that the time and effort that it takes to build the data warehouse is directly proportional to the number of sources. The more the number of sources, the more the warehouse system would cost.

An enterprise data warehouse contains a unified data model to obtain integrated data. Therefore, a unified entity is defined for various business entities. For example, there will be tables defined for various business entities such as employee, customer, and sales transaction. Furthermore, regardless of the source systems or the number of source systems, the attributes of business entity remain same. Therefore, in an ideal scenario in a unified data model, a unified process should populate such unified tables. Can this ideal scenario be translated into a meaningful solution in an imperfect world? Can the time and money spent on a project be independent of the number of sources?

Effort and Cost chart in Traditional ETL v/s ETL-IA
Can we achieve something that is represented by ETL-IA represented by Red line? The ETL-IA approach indicates that the cost and effort of the project does not linearly increase with the increase in sources. It happens because the process that ultimately loads into data warehouse is reusable. Doesn’t matter how many sources you add, the cost and effort becomes non-linear after the initial development cost. The above chart displays how the ETL-IA is better to the traditional ETL development methodology.
If we look back and analyze the traditional code, what is the reusability of the code from source to source; may be no reusability or very limited.

Traditional ETL Methodology

Traditional Approach

Another Traditional ETL Methodology

Traditional Approach

Problems with Traditional Approaches
There are many problems with both the above ETL methodologies. The first methodology does not offer any reusability. It requires a separate process every time a new source is added, which is very costly solution to implement. Unfortunately, most of the enterprises are spending $s on this methodology.
And the problem with the second approach is that it contains the logic for the entire source at one place. If something happens, the whole system is at risk. And also, the troubleshooting becomes more difficult in such a design or lack of design.

Then what is the solution?
The ETL-IA specifies that the target is in unified format and designed for data integration. Then why not having only one process irrespective of the number of sources and their type? ETL-IA designs the target process that is source independent. The whole solution is based on this theory, which is further implemented using the object oriented programming concepts such as interface and controller objects.
After the implementation of ETL-IA, the ETL process facilitates the addition of new source as plug and play device. Thus, the more the number of sources added to the system, the more the $s savings.

Stay tuned for more information on ETL-IA.

ETL Interface Architecture (ETL-IA) tm

Saturday, December 23rd, 2006

An enterprise data warehouse has by definition has more then one data source. In simple language a table employee in the EDW will have multiple sources like HR dept, Sales dept, and so on. Also if the company is a group of companies then same department can repeat across various companies.

Let’s say the DIM_EMPLOYEE table was designed by the data modeler and then he asks the ETL team to develop an ETL process to populate it. Very easy, write an ETL process to read from the source and load it into the target table.
Now after few days they ask that they have identified another source, should the ETL team start all over again and repeat the whole process? Is this efficient? Or is there a better way?

Well, welcome to my world of “ETL Interface Architecture ™ (ETL-IA)”

With this ETL-IA addition of 3X sources does not result in 3X work. Instead additional work added is just to handle the uniqueness of the each source. Thus saving time and money.

The question is how?

I will be discussing this in details on the website. A detailed information on implementation of process is the core of the book ETLGuru.com: ETL Strategies and Solutions for Data Warehouse.

NOTE: This post is applicable to all etl tools or databases like Informatica, DataStage, Syncsort DMExpress, Sunopsis or Oracle, Sybase, SQL Server Integration Services (SSIS)/DTS, Ab Initio, MS SQL Server, RDB, Teradata, etc.

Please click here to leave comments or Questions

Data integration basics-Dimension Confirmation.

Thursday, May 25th, 2006

The purpose of this topic is to establish the basics for design of ETL processes. With out the understanding an ETL process for data integration cannot be designed or developed. 

A database can have multiple sources. Multiple sources may contain a data set of entirely different subject areas, but some data set will intersect. Example Sales data and salary data will have employee as the common set. 

Data Set Intersection.JPG 

Between two or more sources, the subjects, entities or even attributes can be common. So can we integrate the data easily? Mathematically it seems very easy, but the real world is not just about numbers or exact same string values. Every thing can be similar or same but may be not represented in the exact same manner in all the sources. The differences in representation of same information and facts between two or more sources can create some of the most interesting challenges in Data Integration.

Data integration: -The first step in data integration is identification of common elements
1. Identify the common entities.
Example, Employee dimension can come from Sales system, Payroll System, etc. Products can come from manufacturing, sales, purchase etc. Once the common entity is identified its definition should be standardized. Example, does employee include fulltime employees as well as temporary workers?
 
2. Identify the common attributes.
What are the attributes that are common to employee, 1st name, 2nd name, last name, date of joining, etc? Each attribute should be defined.

3. Identify the common values
Same information can be represented in different forms in multiple source system. Example, male sex, can be represented as ‘M’ or ‘1′ or ‘male’ or some thing else by each source system. A common representation must be decided (example, ‘Male’). Also if necessary a finite set of values should be established. Example the employee sex = (‘Male’,’ Female’) and it will not allow more than these two values.

The second step is the identification of Data Steward who will own the responsibility and ownership for particular set data elements.

The third step is to design an ETL process to integrate the data into the target.  This is the most important area in the implementation of an ETL process for data integration. This topic will be discussed in more detailed under its own heading.

The final fourth step is to establish a process of maintenance, review & reporting of such elements.