Type II Dimension or a Fake Fact Table!

You are sales manager of North-East region in 2010 to Oct 23 2011 and did total sales of 1 million. The table that stores your information looks like…. Dim Sales Person Sales Person Key =99 Sales Person Name = Tony Peterson Sales Person Designation = Sales Manager Sales Region = North-East Phone = 203 203 […]

How ETL-IA is beneficial to you?

As the previous ETL-IA blog post mentions that an addition of a source, adds an additional ETL process. It means that the time and effort that it takes to build the data warehouse is directly proportional to the number of sources. The more the number of sources, the more the warehouse system would cost. An […]

Testing in ETL, Data-Centric Projects

Testing is an investigation process that is conducted to check the quality of the product. Product can either be an applications or data. The quality of the data can only be determined by checking data against some existing standards by following a set of processes. By doing so, you find out the symptoms in form […]

Pitfalls of type II dimension

Type II dimension has been popularized by R Kimball. it has become so popular that in any interview related to data warehouse, the interviewer will surely ask to explain the concepts. And chances are that if you don’t know; they will laugh at your ignorance and reject you. Here’s your chance to laugh at them.. […]

Difference between Reference Data and Master Data

It is not unusual for people to use ‘Reference Data’ and ‘Master Data’ interchangeably without understanding the differences. Lets try to understand the differences with an example of sales transaction. A sales transaction contains information like…. Store, Products Sold, Sales Person, Store Name, Sales Date, Customer, Price, Quantity, etc. Attributes from the above example can be separated […]

ETLGuru.com: ETL Strategies and Solutions for Data Warehouse -Contents of ETL Book

Section A ,The Beginning 01 ETL The Basics 02 ETL Strategy Section B, Analysis 03 Target Systems Analysis 04 Source Systems Analysis 05 Source Target Mapping- Part I 06 Understanding Data Quality 07 Data Profiling Section C, Develop Part I 08 Understanding Data Patterns for ETL 09 Simple ETL Development Section D,  ETL Architecture & Design 10 ETL & Data Integration 11 ETL-IA (Interface Architecture) 12 ETL-IA Implementation 13 Designing […]

Slower Development Databases and Servers

It is a normal trend in it to buy the most powerful machine for production usage. Example you production database box will have 16 CPUs your QA 8 and your Development box 4 or 6. Similar ratio is maintained on ETL servers, Application servers, and Hard Disk performance. Logic being the production environment is critical […]

Informatica FAQ’s Additional

What are the different types of Lookups? The different types of lookups are static Lookup, dynamic lookup, static persistent Lookup  What is Test Load in informatica? If we want to test the execution of the mapping without loading any data in the target database then we check the option of test load. Why is parameterization […]

Find me the attribute!……….. What attribute?

An attribute with a context is a meaningless attribute. (Even if it has a definition associated with it). One of the interesting phases in ETL is the data mapping exercise. By the time data-mapping phase starts the target model and source model is already defined. But sometimes a report or excel file is pulled out […]

For every rule there is an exception; for each exception there are more exceptions…

To implement an ETL process there are many steps that are followed. One such step is creating a mapping document. This mapping document describes the data mapping between the source systems and the target and the rules of data transformation. Ex. Table / column map between source and target, rules to identify unique rows, not […]

Type II dimension: What is expiry date? Does it come from Source or is it ETL date?

I assume everyone is aware of type II dimension concept.  In type II dimension when we get an updated row we retire old row and insert new one. But what do you populate  in the from_date —to_date attributes. ETL create/update datetime or when that particular row was retired in the source system? The question can be […]

ETL Strategy for Reversing wrong file/data load.

We all know ‘garbage in, garbage out’. But what happens when the data provider (source system) realizes that they have provided incorrect data to the data warehouse? And now they want to correct the mistake by providing you new data! By then ETL has already loaded the data. The data is inserted, updated & deleted! […]

ETL Interface Architecture (ETL-IA) tm

An enterprise data warehouse has by definition has more then one data source. In simple language a table employee in the EDW will have multiple sources like HR dept, Sales dept, and so on. Also if the company is a group of companies then same department can repeat across various companies. Let’s say the DIM_EMPLOYEE […]

Reference data Management & Reference Data Hub

This is not a new concept, but it is fast becoming a specialized area in Data Management. Example in a company employee list & details are managed in central repository and distributed it all over the organization. Similar thing can happen to customer list/details, stock list/ details etc. What is the relationship of such repositories […]

Readers welcome to write at ETLGuru.com

Readers, I will be focusing on ETL BOOK, any reader wishes to write /contribute to the site are welcome. Let me know if anyone is interested. I will review, edit and publish. However this does not mean that I will not write here. In fact I  think that the website & the ETL book will be […]

Informatica interview questions & FAQs

What is a source qualifier? What is a surrogate key? What is difference between Mapplet and reusable transformation? What is DTM session? What is a Mapplet? What is a look up function? What is default transformation for the look up function? What is difference between a connected look up and unconnected look up? What is […]

ETLGuru.com: ETL Strategies and Solutions for Data Warehouse……..ETL book by Sandesh Gawande

Ok! I am finally planning to publish ETL book with detailed information on all aspects related to ETL and Data Integration. The ETL Book will be used by Data Warehouse Managers, Data Warehouse Architects & ETL Leads. It will also contain solutions for ETL developers. The Book will be independent of ETL tools like Informatica, DataStage, etc. The good […]

Oracle recycle bin flashback table feature

When you do ‘Select * from tab;’ some times, you will be surprised to see some tables with garbage names. Welcome to the world of Oracle Recycle Bin Feature. Because of this feature, Oracle saves the dropped table in recycle bins until you clear it. 1. Empty recycle bin use the command: PURGE RECYCLEBIN; 2. […]

Data staging table / area design.

This could be long topic of discussion.  Following are the main issues I would like to discuss on staging table /database design. 1. Why staging area is needed? Unlike OLTP systems that create their own data through an user interface data warehouses source their data from other systems. There is physical data movement from source […]

ETL Startegy to store data validation rules

Every time there is movement of data the results have to be tested against the expected results. For every ETL process, test conditions for testing data are defined before/during design and development phase itself.  Some that are missed can be added later on. Various test conditions are used to validate data when the ETL process is migrated from […]

Hard references to database objects, Synonyms and ETL

1. Any time hard references are made to a database object it is an invitation to trouble. Especially in ETL where the object (table /view /procedure) can belong to schema. 2. Oracle has –> Database server name –> Schema –> Object 3. SCHEMA_NAME.TABLE_NAME is a bad way of referencing objects from the ETL architecture point […]

Loading & testing fact/transactional/balances (data), which is valid between dates!

This is going to be a very interesting topic for ETL & Data modelers who design processes/tables to load fact or transactional data which keeps on changing between dates. ex: prices of shares, Company ratings, etc. The table above shows an entity in the source system that contains time variant values but they don’t change […]

Introduction to Error and exception management.

ETL is all about transportation, transformation and organizing of data. Of anytime something moves (as a matter of fact even if you are perfectly stationary and items around moves) accidents are bound to happen. So any ETL specialist believes that their code is perfect and nothing can happen obviously lives in a fool’s paradise. The […]

Aggregate aware queries

Imagine a scenario table having 20 million transactions for 10 years and a user writing a query to get data aggregated at monthly level. The best guess for the time required to return the result set, even after extensive database tuning can be in multiples of 10 minutes or even an hour. What is the […]

Lazy links to ETL basics.

Ok check this on Wikipedia http://web.archive.org/web/20070523031241/http://en.wikipedia.org/wiki/ETL I will add more..

Talk to me!

Dear Guest Readers, This is the third month of ETL Guru. I am getting visitors from all over the world (USA, India, France, Germany, Australia, Canada). Based on my understanding I will continue to write about data integration. But it will be only from my point of view. Also I never know if the reader […]

Multiple executions ETL process against same set of data.

Every ETL designer, developer & tester should always ask this question…”What will happen, if I run the ETL process multiple times, against the same data set?” Answer: 1. I get the same result set. Answer: 2. I get multiple result set. If you go back to the original article on What is ETL & What […]

ETL delta logic & de-normalization of data model.

It is a normal practice in data warehouse to de normalizes (Or once auto corrected as demoralize) as the data model for performance. I am not going to discuss the benefits vs. issues with de-normalization.  As by the time it comes to the ETL guy the fate of the model is already decided. Let’s look […]

Types data elements and entities (Tables) for ETL.

It is important for an ETL developer to understand the types of tables and data, to intelligently design ETL processes. Once the common types objects are understood, reusable templates for ETL can be developed, regardless of business logic. This will greatly improve the efficiency of an ETL developer. 1. Reference data 2. Dimensional data (master […]

Data integration basics-Dimension Confirmation.

The purpose of this topic is to establish the basics for design of ETL processes. With out the understanding an ETL process for data integration cannot be designed or developed.  A database can have multiple sources. Multiple sources may contain a data set of entirely different subject areas, but some data set will intersect. Example […]

ETL and the importance of dates.

ETL is all about data movement, batch processes and so on. The biggest criterias for data movement are based on dates. Dates like ETL_CREATE_DTTM, ETL_UPDATE_DTTM, SRC_SYS_CREATE_DTTM, SRC_SYS_UPDATE_DTTM, BUSINESS_CREATE_DATE, BUSINESS_UPDATE_DATE Proper understanding of dates is one of the most essential requirements, while designing, developing, testing, and scheduling ETL process. Lets discuss each of these in details.. […]

ETL Data file processing questions, you must ask!

Many times file processing is taken quite lightly these days. The reason is either ignorance or sheer carelessness. Another reason being most of them being from RDBMS background they are just not aware of the intricacies of file management. So wakeup…. Pre process Before processing the file have you checked if the contents in header/trailer […]

What ETL performance (processing speed) is good enough?

Every ETL developer / Data Warehouse Manager wants to know if the ETL processing speed  (Usually measured in NUMBER OF ROWS PROCESSED PER SECOND on target side)is good enough? What is the industry standard? I can give a politically correct answer like… Well it depends…  But I would rather be blunt and say that usually […]

What ETL is not?

ETL should not be confused with a data creation process. It never creates new data. If a list of hundred employees is being loaded, one more employee cannot be added to the list and make it hundred and one. Or if last name of customer is absent an arbitrary last name cannot be substituted. Data […]

Why Informatica sequences & Sybase/SQL server identity columns, should not be used?

Informatica provides the sequence object to create surrogate keys during load. These object is also sharable within various mappings hence can be used in parallel. But I will recommend never using it.  Here’s the Reason why…. 1. MIGRATION ISSUE:  The sequence is stored in the Informatica repository. That means it is disconnected from the target […]

Simulating Oracle Sequences in Sybase & SQL Server

Programmatic control is lost when identity columns are used in Sybase and SQL Server. I do not recommend using Identity columns to create surrogate keys during ETL process. There are many more reasons for that. Oracle has the sequence feature which is used extensively by Oracle programmers. I have no clue why other vendors are […]

What Is ETL?

ETL is the automated and auditable data acquisition process from source system that involves one or more sub processes of data extraction, data transportation, data transformation, data consolidation, data integration, data loading and data cleaning. Disscussion: Source System can be any application or data store that creates or stores data and acts as a data […]

Slow Running ETL process (read side) & Table Statistics (Oracle).

Sometimes an ETL process runs considerably slow speed. During test for the small result set it might fly but when a million rows are applied the performance takes a nosedive. There can be many reasons for slow ETL process.  The process can be slow because read, transformation, load. Lets eliminate the transformation and load for […]

Shell Scripts for Oracle

Here’s a lazy post for shell scripts. Links In future I will add some specalised scripts for ETL.