Difference between Reference Data and Master Data

It is not unusual for people to use ‘Reference Data’ and ‘Master Data’ interchangeably without understanding the differences. Lets try to understand the differences with an example of sales transaction. A sales transaction contains information like…. Store, Products Sold, Sales Person, Store Name, Sales Date, Customer, Price, Quantity, etc. Attributes from the above example can be separated […]

For every rule there is an exception; for each exception there are more exceptions…

To implement an ETL process there are many steps that are followed. One such step is creating a mapping document. This mapping document describes the data mapping between the source systems and the target and the rules of data transformation. Ex. Table / column map between source and target, rules to identify unique rows, not […]

Type II dimension: What is expiry date? Does it come from Source or is it ETL date?

I assume everyone is aware of type II dimension concept.  In type II dimension when we get an updated row we retire old row and insert new one. But what do you populate  in the from_date —to_date attributes. ETL create/update datetime or when that particular row was retired in the source system? The question can be […]

ETL Strategy for Reversing wrong file/data load.

We all know ‘garbage in, garbage out’. But what happens when the data provider (source system) realizes that they have provided incorrect data to the data warehouse? And now they want to correct the mistake by providing you new data! By then ETL has already loaded the data. The data is inserted, updated & deleted! […]

ETL Interface Architecture (ETL-IA) tm

An enterprise data warehouse has by definition has more then one data source. In simple language a table employee in the EDW will have multiple sources like HR dept, Sales dept, and so on. Also if the company is a group of companies then same department can repeat across various companies. Let’s say the DIM_EMPLOYEE […]

Reference data Management & Reference Data Hub

This is not a new concept, but it is fast becoming a specialized area in Data Management. Example in a company employee list & details are managed in central repository and distributed it all over the organization. Similar thing can happen to customer list/details, stock list/ details etc. What is the relationship of such repositories […]

Readers welcome to write at ETLGuru.com

Readers, I will be focusing on ETL BOOK, any reader wishes to write /contribute to the site are welcome. Let me know if anyone is interested. I will review, edit and publish. However this does not mean that I will not write here. In fact I  think that the website & the ETL book will be […]

Informatica interview questions & FAQs

What is a source qualifier? What is a surrogate key? What is difference between Mapplet and reusable transformation? What is DTM session? What is a Mapplet? What is a look up function? What is default transformation for the look up function? What is difference between a connected look up and unconnected look up? What is […]

ETLGuru.com: ETL Strategies and Solutions for Data Warehouse……..ETL book by Sandesh Gawande

Ok! I am finally planning to publish ETL book with detailed information on all aspects related to ETL and Data Integration. The ETL Book will be used by Data Warehouse Managers, Data Warehouse Architects & ETL Leads. It will also contain solutions for ETL developers. The Book will be independent of ETL tools like Informatica, DataStage, etc. The good […]

Oracle recycle bin flashback table feature

When you do ‘Select * from tab;’ some times, you will be surprised to see some tables with garbage names. Welcome to the world of Oracle Recycle Bin Feature. Because of this feature, Oracle saves the dropped table in recycle bins until you clear it. 1. Empty recycle bin use the command: PURGE RECYCLEBIN; 2. […]

Data staging table / area design.

This could be long topic of discussion.  Following are the main issues I would like to discuss on staging table /database design. 1. Why staging area is needed? Unlike OLTP systems that create their own data through an user interface data warehouses source their data from other systems. There is physical data movement from source […]

ETL Startegy to store data validation rules

Every time there is movement of data the results have to be tested against the expected results. For every ETL process, test conditions for testing data are defined before/during design and development phase itself.  Some that are missed can be added later on. Various test conditions are used to validate data when the ETL process is migrated from […]

Hard references to database objects, Synonyms and ETL

1. Any time hard references are made to a database object it is an invitation to trouble. Especially in ETL where the object (table /view /procedure) can belong to schema. 2. Oracle has –> Database server name –> Schema –> Object 3. SCHEMA_NAME.TABLE_NAME is a bad way of referencing objects from the ETL architecture point […]

Loading & testing fact/transactional/balances (data), which is valid between dates!

This is going to be a very interesting topic for ETL & Data modelers who design processes/tables to load fact or transactional data which keeps on changing between dates. ex: prices of shares, Company ratings, etc. The table above shows an entity in the source system that contains time variant values but they don’t change […]

Aggregate aware queries

Imagine a scenario table having 20 million transactions for 10 years and a user writing a query to get data aggregated at monthly level. The best guess for the time required to return the result set, even after extensive database tuning can be in multiples of 10 minutes or even an hour. What is the […]

Talk to me!

Dear Guest Readers, This is the third month of ETL Guru. I am getting visitors from all over the world (USA, India, France, Germany, Australia, Canada). Based on my understanding I will continue to write about data integration. But it will be only from my point of view. Also I never know if the reader […]

Multiple executions ETL process against same set of data.

Every ETL designer, developer & tester should always ask this question…”What will happen, if I run the ETL process multiple times, against the same data set?” Answer: 1. I get the same result set. Answer: 2. I get multiple result set. If you go back to the original article on What is ETL & What […]

ETL delta logic & de-normalization of data model.

It is a normal practice in data warehouse to de normalizes (Or once auto corrected as demoralize) as the data model for performance. I am not going to discuss the benefits vs. issues with de-normalization.  As by the time it comes to the ETL guy the fate of the model is already decided. Let’s look […]

Types data elements and entities (Tables) for ETL.

It is important for an ETL developer to understand the types of tables and data, to intelligently design ETL processes. Once the common types objects are understood, reusable templates for ETL can be developed, regardless of business logic. This will greatly improve the efficiency of an ETL developer. 1. Reference data 2. Dimensional data (master […]

Data integration basics-Dimension Confirmation.

The purpose of this topic is to establish the basics for design of ETL processes. With out the understanding an ETL process for data integration cannot be designed or developed.  A database can have multiple sources. Multiple sources may contain a data set of entirely different subject areas, but some data set will intersect. Example […]

ETL and the importance of dates.

ETL is all about data movement, batch processes and so on. The biggest criterias for data movement are based on dates. Dates like ETL_CREATE_DTTM, ETL_UPDATE_DTTM, SRC_SYS_CREATE_DTTM, SRC_SYS_UPDATE_DTTM, BUSINESS_CREATE_DATE, BUSINESS_UPDATE_DATE Proper understanding of dates is one of the most essential requirements, while designing, developing, testing, and scheduling ETL process. Lets discuss each of these in details.. […]

What ETL performance (processing speed) is good enough?

Every ETL developer / Data Warehouse Manager wants to know if the ETL processing speed  (Usually measured in NUMBER OF ROWS PROCESSED PER SECOND on target side)is good enough? What is the industry standard? I can give a politically correct answer like… Well it depends…  But I would rather be blunt and say that usually […]

What ETL is not?

ETL should not be confused with a data creation process. It never creates new data. If a list of hundred employees is being loaded, one more employee cannot be added to the list and make it hundred and one. Or if last name of customer is absent an arbitrary last name cannot be substituted. Data […]

Simulating Oracle Sequences in Sybase & SQL Server

Programmatic control is lost when identity columns are used in Sybase and SQL Server. I do not recommend using Identity columns to create surrogate keys during ETL process. There are many more reasons for that. Oracle has the sequence feature which is used extensively by Oracle programmers. I have no clue why other vendors are […]

Slow Running ETL process (read side) & Table Statistics (Oracle).

Sometimes an ETL process runs considerably slow speed. During test for the small result set it might fly but when a million rows are applied the performance takes a nosedive. There can be many reasons for slow ETL process.  The process can be slow because read, transformation, load. Lets eliminate the transformation and load for […]