Archive for the ‘Uncategorized’ Category

Difference between Reference Data and Master Data

Thursday, June 19th, 2008

It is not unusual for people to use ‘Reference Data’ and ‘Master Data’ interchangeably without understanding the differences.
Let's try to understand the difference with the example of a sales transaction.

A sales transaction contains information such as:
Store,
Products Sold,
Sales Person,
Store Name,
Sales Date,
Customer,
Price,
Quantity,
etc.

Attributes from the above example can be separated into two types: factual (transactional) and dimensional information.
Price and Quantity are measurable attributes of the transaction.
Store, Products Sold, Sales Person, Store Name, Sales Date, and Customer are dimensional attributes of the transaction.

We can see that this dimensional data is already embedded in the transaction, and with these dimensional attributes we can successfully complete the transaction. Dimensional data that directly participates in a transaction is master data.

But is the list of dimensional attributes in the transaction complete? 

Asking a few analytical questions can help us discover the answer.
     -What is the male-to-female ratio of customers purchasing at the store?
     -What type of products are customers buying? Ex: Electronics, Computers, Toys
     -What type of store is it?  Ex: Web store, Brick & Mortar, Telesales, Catalog Sales

The above questions cannot be answered by the attributes in the transaction; this dimensional data is missing from the transaction. Dimensional data that does not directly participate in the transaction, but describes attributes of a dimension, is reference data.
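As a rough illustration (the table and column names below are hypothetical), master data sits in dimension tables that the transaction references directly, while reference data classifies those dimensions:

-- Master data: the transaction references it directly
CREATE TABLE dim_store (
    store_key       INTEGER       NOT NULL PRIMARY KEY,
    store_name      VARCHAR(100)  NOT NULL,
    store_type_cd   VARCHAR(10)   -- classified by the reference data below
);

-- Reference data: describes/classifies the dimension, not the transaction
CREATE TABLE ref_store_type (
    store_type_cd   VARCHAR(10)   NOT NULL PRIMARY KEY,   -- e.g. 'WEB', 'B&M', 'CATALOG'
    store_type_desc VARCHAR(100)  NOT NULL
);

-- The transaction (fact) row carries only master data keys and measures
CREATE TABLE fact_sales (
    store_key   INTEGER        NOT NULL REFERENCES dim_store (store_key),
    sales_date  DATE           NOT NULL,
    price       DECIMAL(12,2),
    quantity    INTEGER
);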

Why is it important for an ETL person to understand the difference? For a while 'Reference Data Management' (RDM) was the popular term; then, over the last few years, the new term 'Master Data Management' (MDM) appeared. These terms mean different things, and the difference has significant implications for how the data is managed. But that will be a topic for a future post! I hope this article helps clear at least some of the confusion.

 

 

For every rule there is an exception; for each exception there are more exceptions…

Sunday, May 13th, 2007

To implement an ETL process there are many steps to follow. One such step is creating a mapping document. This mapping document describes the data mapping between the source systems and the target, and the rules of data transformation.

Ex. Table/column map between source and target, rules to identify unique rows, not-null attributes, unique values, ranges of attributes, transformation rules, etc.


Without going into further details of the document, let's analyze the very next step. It seems obvious and natural to start development of the ETL process. The ETL developer is all fired up, comes up with a design document, starts developing, and in a few days the code is ready for data loading.
But unexpectedly (?) the code starts having issues every few days. Issues are found and fixed. And then it fails again. What's happening? Analysis was done properly; rules were chalked out and implemented according to the mapping document. So why are issues popping up? Was something missed?
Maybe not! Isn't it normal to have more issues in the initial lifetime of a process?
 
Maybe yes! You have surely missed 'Source System Data Profiling'. The business analyst has told you the rules for how the data is structured in the source system and how it is supposed to behave; but he/she has not told you the 'ifs and buts', called EXCEPTIONS, for those rules.
 
To be realistic, it is not possible for anyone to simply recite all the rules and exceptions to you like a parrot. You have to collaborate and dig out the truth. The choice is yours: do data profiling on the source system and try to break every rule the analyst gave you, or wait for the process to go live and then wake up every night as the load fails. If you are lucky, you will merely deal with an unhappy user every morning you walk into the office.
 
Make the right choice; don't skip 'source system data profiling' before actually writing a single line of code. Question every rule. Try to find the exceptions to the rules. Say there are at least 20 tables, each table has on average 30 columns, and each column has on average 100k values; the matrix of tables * columns * data values gives you the number of ways your assumptions may be wrong. It is like unit testing the source data even before loading it. There is a reason why machines alone cannot do your job; there is a reason why IT jobs pay more.
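A few simple profiling queries (the table and column names here are only placeholders) are often enough to expose the first exceptions to the documented rules:

-- 1. Is the 'unique' key really unique?
SELECT order_id, COUNT(*)
FROM   src_orders
GROUP  BY order_id
HAVING COUNT(*) > 1;

-- 2. Are the 'not null' columns really populated?
SELECT COUNT(*) FROM src_orders WHERE customer_id IS NULL;

-- 3. Do the values stay within the documented list and range?
SELECT DISTINCT order_status FROM src_orders;
SELECT MIN(order_date), MAX(order_date) FROM src_orders;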
 
Remember, ‘for every rule there is an exception; for each exception there are more exceptions…’

Type II dimension: What is expiry date? Does it come from Source or is it ETL date?

Tuesday, February 6th, 2007

I assume everyone is aware of the type II dimension concept. In a type II dimension, when we get an updated row we retire the old row and insert a new one. But what do you populate in the from_date and to_date attributes: the ETL create/update datetime, or the date when that particular row was retired in the source system?

The question can be asked in a different way… When an employee changes from one department to another, and you are populating the type II EMPLOYEE dimension table in the data warehouse, do you use the actual date of the transfer or the ETL date on which you are populating the table?
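To make the question concrete, here is a hypothetical sketch of the two options for an employee (emp_id 101) who transferred departments on 10-Jan-2007 but whose change was only picked up by the ETL run of 15-Jan-2007:

-- Hypothetical type II retirement of the old row; which to_date is correct?

-- Option A: use the source/business date of the transfer
UPDATE dim_employee
SET    to_date = DATE '2007-01-09', current_flag = 'N'
WHERE  emp_id = 101 AND current_flag = 'Y';

-- Option B: use the ETL load date
UPDATE dim_employee
SET    to_date = DATE '2007-01-14', current_flag = 'N'
WHERE  emp_id = 101 AND current_flag = 'Y';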

What are you doing in your ETL code?

ETL Strategy for Reversing a wrong file/data load.

Friday, January 19th, 2007

We all know 'garbage in, garbage out'. But what happens when the data provider (source system) realizes that they have provided incorrect data to the data warehouse? And now they want to correct the mistake by providing you new data! By then the ETL has already loaded the data. The data has been inserted, updated and deleted! Simple inserts are easy to undo, but what about data that has already been integrated through updates? How do you recover from such a situation? Can you actually recover? What should the ETL strategy be?

Has your Data warehouse Architect/Manager/ETL Lead come up with an ETL Strategy?

NOTE: This post is applicable to all etl tools or databases like Informatica, DataStage, Syncsort DMExpress, Sunopsis or Oracle, Sybase, SQL Server Integration Services (SSIS)/DTS, Ab Initio, MS SQL Server, RDB, Teradata, etc.

ETL Interface Architecture (ETL-IA) tm

Saturday, December 23rd, 2006

An enterprise data warehouse by definition has more than one data source. In simple language, a table such as EMPLOYEE in the EDW will have multiple sources: the HR department, the Sales department, and so on. Also, if the company is a group of companies, then the same department can repeat across the various companies.

Let's say the DIM_EMPLOYEE table was designed by the data modeler, who then asks the ETL team to develop an ETL process to populate it. Very easy: write an ETL process to read from the source and load the target table.
Now, after a few days, they say they have identified another source. Should the ETL team start all over again and repeat the whole process? Is this efficient? Or is there a better way?

Well, welcome to my world of “ETL Interface Architecture ™ (ETL-IA)”

With ETL-IA, the addition of 3X sources does not result in 3X work. Instead, the additional work is just what is needed to handle the uniqueness of each source, thus saving time and money.

The question is how?

I will be discussing this in detail on the website. Detailed information on the implementation of the process is the core of the book ETLGuru.com: ETL Strategies and Solutions for Data Warehouse.
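In the meantime, here is a very rough sketch of the general idea behind an interface layer (a generic pattern, not necessarily the ETL-IA itself; all names are hypothetical): each source gets only a thin, source-specific extract into a common interface table, and a single shared process loads the target from there.

-- Thin, source-specific extracts into one common interface structure
INSERT INTO int_employee (src_sys_cd, emp_src_id, emp_name, dept_name)
SELECT 'HR', emp_no, emp_nm, dept_nm FROM hr_employee;

INSERT INTO int_employee (src_sys_cd, emp_src_id, emp_name, dept_name)
SELECT 'SALES', rep_id, rep_name, region_nm FROM sales_rep;

-- One shared process loads DIM_EMPLOYEE from INT_EMPLOYEE for every source,
-- so adding a third source only adds one more small extract step above.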

NOTE: This post is applicable to all etl tools or databases like Informatica, DataStage, Syncsort DMExpress, Sunopsis or Oracle, Sybase, SQL Server Integration Services (SSIS)/DTS, Ab Initio, MS SQL Server, RDB, Teradata, etc.

Please click here to leave comments or Questions

Reference data Management & Reference Data Hub

Friday, December 22nd, 2006

This is not a new concept, but it is fast becoming a specialized area in Data Management. For example, in a company the employee list and details are managed in a central repository and distributed all over the organization. Something similar can happen with customer lists/details, stock lists/details, etc.

What is the relationship of such repositories with data warehouse? Are they part of data warehouse? How can they be integrated into the data warehouse? And many more questions.
 
I will add more to this article. But meanwhile, here is a link to a practical example.

Readers welcome to write at ETLGuru.com

Tuesday, December 5th, 2006

Readers,

I will be focusing on the ETL BOOK; any reader who wishes to write for or contribute to the site is welcome. Let me know if anyone is interested. I will review, edit and publish.

However, this does not mean that I will not write here. In fact I think that the website and the ETL book will be complementary.

Sandesh

Informatica interview questions & FAQs

Tuesday, November 21st, 2006
What is a source qualifier?
What is a surrogate key?
What is difference between Mapplet and reusable transformation?
What is DTM session?
What is a Mapplet?
What is a look up function? What is default transformation for the look up function?
What is difference between a connected look up and unconnected look up?
What is update strategy and what are the options for update strategy?
What is subject area?
What is the difference between truncate and delete statements?
What kind of Update strategies are normally used (Type 1, 2 & 3) & what are the differences?
What is the exact syntax of an update strategy?
What are bitmap indexes and how and why are they used?
What is bulk bind? How does it improve performance?
What are the different ways to filter rows using Informatica transformations?
What is referential Integrity error? How do you rectify it?
What is DTM process?
What is target load order?
What exactly is a shortcut and how do you use it?
What is a shared folder?
What are the different transformations where you can use a SQL override?
What is the difference between a Bulk and Normal mode and where exactly is it defined?
What is the difference between Local & Global repository?
What are data driven sessions?
What are the common errors while running an Informatica session?
What are worklets and what is their use?
What is change data capture?
What exactly is tracing level?
What is the difference between constraints based load ordering and target load plan?
What is a deployment group and what is its use?
When and how a partition is defined using Informatica?
How do you improve performance in an Update strategy?
How do you validate all the mappings in the repository at once?
How can you join two or more tables without using the source qualifier override SQL or a Joiner transformation?
How can you define a transformation? What are different types of transformations in Informatica?
How many repositories can be created in Informatica?
How many minimum groups can be defined in a Router transformation?
How do you define partitions in Informatica?
How can you improve performance in an Aggregator transformation?
How does Informatica know that the input is sorted?
How many worklets can be defined within a workflow?
How do you define a parameter file? Give an example of its use.
If you join two or more tables, pull about two columns from each table into the source qualifier, then pull just one column from the source qualifier into an Expression transformation, and then do a 'generate SQL' in the source qualifier, how many columns will show up in the generated SQL?
In a Type 1 mapping with one source and one target table what is the minimum number of update strategy transformations to be used?
At what levels can you define parameter files and what is the order?
In a session log file where can you find the reader and the writer details?
For joining three heterogeneous tables how many joiner transformations are required?
Can you look up a flat file using Informatica?
While running a session what default files are created?
Describe the use of Materialized views and how are they different from a normal view.

Contributed by Mukherjee, Saibal (ETL Consultant)

Many readers are asking “Where are the answers?” Well, it will take some time before I get around to writing them… But there is no reason to get upset… The Informatica help files should have all of these answers! :-)

ETLGuru.com: ETL Strategies and Solutions for Data Warehouse……..ETL book by Sandesh Gawande

Friday, November 3rd, 2006

Ok! I am finally planning to publish an ETL book with detailed information on all aspects of ETL and Data Integration. The ETL book will be useful to Data Warehouse Managers, Data Warehouse Architects and ETL Leads. It will also contain solutions for ETL developers. The book will be independent of ETL tools like Informatica, DataStage, etc.

The good news is that readers will not have to wait for the completion of all chapters: I will make individual chapters available as they are completed. Once all the chapters are complete, readers can return the individual chapters and get the complete ETL book free, or pay the difference. The ETL forum will have a thread for each chapter so that readers can discuss, recommend and suggest. This will make it a complete ETL package.

Comments are welcome on the chapters, the title of the book, what you would like to see in the contents, etc. I will also offer a money-back guarantee if you dislike the book. As for competing ETL books, I would suggest you buy one (The Data Warehouse ETL Toolkit) only if the bookstore near you permits returns, so that you can compare for yourself and decide.

Please wait for link to purchase ETLGuru.com Book /Chapters… Coming soon… First few chapters before Nov 30 2006.  Dec 30 2006.

Oracle recycle bin flashback table feature

Saturday, October 14th, 2006
When you run 'SELECT * FROM tab;' you will sometimes be surprised to see tables with garbage names. Welcome to the world of the Oracle recycle bin feature. Because of this feature, Oracle keeps dropped tables in the recycle bin until you clear it.
1. To empty the recycle bin use the command:
PURGE RECYCLEBIN;
2. To drop a table without storing it in the recycle bin use:
DROP TABLE employee PURGE;
3. To restore a table from the recycle bin use:
FLASHBACK TABLE employee TO BEFORE DROP;
So don't forget to clean your database/schema once in a while.
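Incidentally, to see what is currently sitting in the recycle bin before you purge it, you can query the USER_RECYCLEBIN view (SHOW RECYCLEBIN also works in SQL*Plus):

SELECT object_name, original_name, type, droptime
FROM   user_recyclebin;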

Data staging table / area design.

Wednesday, October 11th, 2006

This could be a long topic of discussion. Following are the main issues I would like to discuss regarding staging table/database design.

1. Why is a staging area needed?

Unlike OLTP systems, which create their own data through a user interface, data warehouses source their data from other systems. There is physical data movement from the source database to the data warehouse database. The staging area is primarily designed to serve as an intermediate resting place for data before it is processed and integrated into the target data warehouse. The staging area serves many purposes above and beyond that primary function:
a. The data is most consistent with the source. It is devoid of any transformation, or has only minor format changes.
b. A staging area in a relational database can be read/scanned/queried using SQL, without the need to log into the source system or read files (text/xml/binary).
c. It is a prime location for validating data quality from the source, or for auditing and tracking down data issues.
d. The staging area acts as a repository for historical data, if it is not truncated.
e. Etc.

2. How is the staging area different from the other areas of the data warehouse?

a. Normally, tables in a relational database are not standalone; they have relationships with at least one or more other tables. But the staging area greatly differs in this aspect. Its tables are random in nature and more batch oriented. They are staged in the hope that in the next phase of the load there will be a process that identifies the relationships with other tables, and during that load the relationships will be established.
 

3. What should the staging table look like?

Staging Table Format

a. The key shown is a meaningless surrogate key, but it has still been added. The reason: many times the data coming from a source has no unique identifier, or the unique identifier is a composite key; in such cases, when a data issue is found with a row, it is very difficult to identify or even refer to that particular row. When a unique row number is assigned to each row in the staging table, it becomes really easy to reference it.

b. Various dates have been added to the table; please refer to the date discussion here.

c. The data types have been kept as strings, because this ensures that a row with a bad format or wrong data type will at least be populated in the staging table for further analysis or follow-up.

d. A source system column has been added to keep a reference to the data's origin, so that the next process step can use this value and behave dynamically based on the source system. It also supports reuse of the table, data partitioning, etc.

e. Note that the table name is prefixed with the source as a qualifier; this distinguishes the table from the same entity arriving from another source system, for example CUSTOMER from another system called MKT.

f. Other columns can be added, for example a processed flag to indicate whether the row has been processed by the downstream application; it also provides incremental restart ability for the downstream process. An exception flag can also be added to indicate that an exception or error was raised while processing the row, and the row was therefore not processed. (A sketch of such a table follows below.)
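Putting the points above together, a staging table along these lines would cover most situations (names are illustrative and the exact data types depend on the database):

-- Illustrative staging table for a CUSTOMER feed from a source system called SRC
CREATE TABLE src_stg_customer (
    stg_row_num          INTEGER      NOT NULL,    -- meaningless surrogate row number (a)
    src_sys_cd           VARCHAR(20)  NOT NULL,    -- source system reference (d)
    customer_id          VARCHAR(100),             -- business columns kept as strings (c)
    customer_name        VARCHAR(255),
    customer_birth_date  VARCHAR(50),
    src_sys_update_dttm  VARCHAR(50),
    etl_create_dttm      DATE         NOT NULL,    -- load dates (b)
    processed_flag       CHAR(1)      DEFAULT 'N', -- downstream restartability (f)
    exception_flag       CHAR(1)      DEFAULT 'N'  -- row raised an error downstream (f)
);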

4. Which design to choose?

a. Should the table be truncated and loaded?
b. Should the table be append only?
c. Should the default data type be left as an alphanumeric string (VARCHAR)?
d. Should constraints be enforced?
e. Should there be a primary key?

It normally depends on the situation, but if you are not sure, or simply don't want to think about it, the design suggested here should more than meet your requirements.

NOTE: This post is applicable to all etl tools or databases like Informatica, DataStage, Syncsort DMExpress, Sunopsis  or Oracle, Sybase, SQL Server Integration Services (SSIS)/DTS, Ab Initio,   MS SQL Server, RDB, etc. 

ETL Strategy to store data validation rules

Wednesday, September 6th, 2006

Every time there is movement of data, the results have to be tested against the expected results. For every ETL process, test conditions for validating data are defined before or during the design and development phase itself. Some that are missed can be added later on.

Various test conditions are used to validate data when the ETL process is migrated from DEV to QA to PRD. These test conditions may exist only in the developer's/tester's mind, or be documented in Word or Excel. With time the test conditions get lost, ignored or scattered all around, making them not really useful.

If the ETL process runs successfully without error in production, that is a good thing. But it does not really mean much. You still need rules to validate the data processed by the ETL. At this point you need the data validation rules again!

A better ETL strategy is to store the ETL business rules in a RULES table, keyed by target table and source system. These rules can be stored as SQL text. This creates a repository of all the rules in a single location, which can be called by any ETL process or auditor at any phase of the project life cycle.

There is also no need to re-write or rethink the rules. Any or all of these rules can be made optional, tolerances can be defined, and they can be called immediately after the process runs, or the data can be audited at leisure.

This data validation/auditing system will basically contain:
A table that contains the rules,
A process to call them dynamically, and
A table to store the results of executing the rules. (A sketch follows below.)
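A minimal sketch of such a repository (the table and column names are only a suggestion):

-- A repository of data validation rules...
CREATE TABLE etl_validation_rule (
    rule_id        INTEGER        NOT NULL,
    target_table   VARCHAR(100)   NOT NULL,
    src_sys_cd     VARCHAR(20)    NOT NULL,
    rule_sql       VARCHAR(4000)  NOT NULL,  -- SQL text that returns the violation count
    tolerance_cnt  INTEGER        DEFAULT 0, -- how many violations are acceptable
    active_flag    CHAR(1)        DEFAULT 'Y'
);

-- ...and of the results of each execution of those rules
CREATE TABLE etl_validation_result (
    rule_id        INTEGER   NOT NULL,
    run_dttm       DATE      NOT NULL,
    violation_cnt  INTEGER   NOT NULL,
    passed_flag    CHAR(1)   NOT NULL
);

-- Example rule_sql value: count of orphan customer keys in the sales fact
-- 'SELECT COUNT(*) FROM fact_sales f
--  WHERE NOT EXISTS (SELECT 1 FROM dim_customer c WHERE c.customer_key = f.customer_key)'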

Benefits:

Rules can be added dynamically with no change to the code.

Rules are stored permanently.

Tolerance levels can be changed without ever changing the code.

Business rules can be added or validated by business experts without worrying about the ETL code.

NOTE: This post is applicable to all etl tools or databases like Informatica, DataStage, Syncsort DMExpress, Sunopsis  or Oracle, Sybase, SQL Server Integration Services (SSIS)/DTS, Ab Initio,   MS SQL Server, RDB, etc.

Hard references to database objects, Synonyms and ETL

Wednesday, September 6th, 2006

1. Any time hard references are made to a database object, it is an invitation to trouble, especially in ETL, where the object (table/view/procedure) can belong to any schema.

2. Oracle's hierarchy is: Database server name -> Schema -> Object.

3. SCHEMA_NAME.TABLE_NAME is a bad way of referencing objects, from the ETL architecture point of view as well as from the data warehouse architecture point of view. The reason is that any hard coding of object references will cause trouble when the objects are moved from one schema to another.

4. An alternative could be to log in with the same SCHEMA_NAME where the objects exist. But this creates a security hazard, as the ETL processes would then live in the same schema as the data warehouse structures and processes.

5. Also, referencing an object directly can be an issue, because if the table name or a column name changes, it results in an immediate code change.

6. Much argument can be made that it's not really an issue, just a little more work. But what most people don't understand is that it is not how much work, but when this work comes up: most likely when you are migrating, during a production run, or when the team is really short of time.

7. Hard-coded references also reduce flexibility… what if the ETL needs to read data from some other schema to test an ETL process? What then?

Ok! What’s the solution then… 

hardcodedref.JPG

 

A better way is to have a dedicated schema for the ETL user, with all the objects (tables/views/procedures) referenced through public synonyms. This way any object can be replaced by a new reference, a temporary object, a view or a private synonym without ever changing the code. It is a much better way, and there are many more advantages, but of course there will be a little more maintenance.
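For example (schema and object names are placeholders), the warehouse objects stay in the DWH schema while the ETL user only ever sees synonyms:

-- Objects are owned by the warehouse schema; the ETL user reaches them via synonyms
CREATE PUBLIC SYNONYM dim_employee FOR dwh.dim_employee;

-- ETL code contains no schema name at all
SELECT employee_key, employee_name FROM dim_employee;

-- To point the ETL at a test copy, only the synonym (or a private synonym for that
-- user) is redefined; the ETL code itself never changes.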

NOTE: This post is applicable to all etl tools or databases like Informatica, DataStage, Syncsort DMExpress, Sunopsis  or Oracle, Sybase, SQL Server Integration Services (SSIS)/DTS, Ab Initio,   MS SQL Server, RDB, etc.

Loading & testing fact/transaction/balance data which is valid between dates!

Tuesday, July 25th, 2006
This is going to be a very interesting topic for ETL developers and data modelers who design processes/tables to load fact or transactional data that keeps changing between dates, e.g. prices of shares, company ratings, etc.

variable_bond_interest.JPG

The table above shows an entity in the source system that contains time-variant values, but they don't change daily. The values are valid over a period of time; then they change.

variable_bond_interest_fct1.JPG

1. What table structure should be used in the data warehouse?

Maybe Ralph Kimball or Bill Inmon could come up with a better data model! :-) But for ETL developers or ETL leads the decision is already made, so let's look for a solution.

2. What should be the ETL design to load such a structure?

Design A

  • There is a one-to-one relationship between the source row and the target row.
  • There is a CURRENT_FLAG attribute, which means that every time the ETL process gets a new value it has to add a new row with the current flag, then go back to the previous row and retire it. This is a very costly ETL step and it will slow down the ETL process.
  • From the report writer's point of view this model is a major challenge to use, because what if the report wants a rate which is not current? Imagine the complex query.

Design B

  • In this design a snapshot of the source table is taken every day.
  • The ETL is very easy. But can you imagine the size of the fact table when the source table has more than 1 million rows? (1 million x 365 days = ? rows per year.) And what if the values change every hour or minute?
  • But you have a very happy user who can write SQL reports very easily.

Design C

  • Can there be a compromise? How about using from date (time) – to date (time)? The report writer can simply provide a date (time) and a straight SQL query can return the value/row that was valid at that moment. (A sketch follows after this list.)
  • However the ETL is as complex as in Design A, because while the current row runs from the current date to infinity, the previous row has to be retired from its from date to today's date minus 1.
  • This kind of ETL coding also creates lots of testing issues, as you want to make sure that for any given date and time only one instance of the row exists (for the primary key).
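A minimal sketch of the Design C retire-and-insert step (table, column and bind-variable names are hypothetical):

-- Retire the currently open row: it is now valid only up to the day before the new value
UPDATE fact_bond_interest
SET    to_date = :new_from_date - 1
WHERE  bond_id = :bond_id
AND    to_date = DATE '9999-12-31';   -- the open-ended 'infinity' date

-- Insert the new value, open-ended until the next change arrives
INSERT INTO fact_bond_interest (bond_id, interest_rate, from_date, to_date)
VALUES (:bond_id, :new_rate, :new_from_date, DATE '9999-12-31');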

Which design is better? I have used all of them, depending on the situation.

3. What should be the unit test plan?

There are various cases where the ETL can go wrong, and when planning your test cases your plan should be to test precisely those. Here are some examples of test plans (a sample check follows the list):

a. There should be only one value for a given date/date time

b. During the initial load when the data is available for multiple days the process should go sequential and create snapshots/ranges correctly.

c. At any given time there should be only one current row.

d. etc
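For example, check (a) can be automated with a query like the following (names are hypothetical); it should return no rows:

-- Validity ranges that overlap for the same bond; this query should return nothing
SELECT a.bond_id, a.from_date, a.to_date, b.from_date, b.to_date
FROM   fact_bond_interest a
JOIN   fact_bond_interest b
  ON   b.bond_id = a.bond_id
 AND   b.from_date > a.from_date       -- look only at the later row of each pair
 AND   b.from_date <= a.to_date;       -- ...which starts before the earlier one ends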

NOTE: This post is applicable to all etl tools or databases like Informatica, DataStage, Syncsort DMExpress, Sunopsis or Oracle, Sybase, SQL Server Integration Services (SSIS)/DTS, Ab Initio, MS SQL Server, RDB, etc.

Aggregate aware queries

Monday, June 19th, 2006

Imagine a scenario: a table holding 20 million transactions for 10 years, and a user writing a query to get data aggregated at the monthly level. The best guess for the time required to return the result set, even after extensive database tuning, can be in multiples of 10 minutes, or even an hour.

What is the solution?
There are three solutions based on the same concept of pre-aggregation, but their usage is quite different.
1. Write different queries in advance to hit different tables.
2. Have a reporting tool which is aggregate aware (client side). The reporting tool itself decides which tables to hit and rewrites its own queries.
3. Have a database that supports transparent query rewrite. The database itself takes a simple query and rewrites it against aggregate tables.

Here's an Oracle version of query rewrite, based on point 3 above.
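For instance (object names are illustrative), an aggregate materialized view with query rewrite enabled might look like this:

CREATE MATERIALIZED VIEW mv_sales_monthly
ENABLE QUERY REWRITE
AS
SELECT product_id,
       TRUNC(sales_date, 'MM') AS sales_month,
       SUM(sales_amount)       AS total_sales
FROM   fact_sales
GROUP  BY product_id, TRUNC(sales_date, 'MM');

-- With query rewrite enabled, a query that aggregates FACT_SALES by month can be
-- transparently rewritten by the optimizer to read MV_SALES_MONTHLY instead.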

 More details to come………… 

Talk to me!

Thursday, June 15th, 2006

Dear Guest Readers,

This is the third month of ETL Guru. I am getting visitors from all over the world (USA, India, France, Germany, Australia, Canada). Based on my understanding I will continue to write about data integration, but it will be only from my point of view. I also never know whether readers are really interested in these articles, or whether you really understand them.

Hence, if you have any questions, suggestions or requests, please take the initiative and respond to the articles (you can comment on each article where it says 'no comments' or 'comments'),

OR put a comment in the guest book (– Click here),

OR go to the ETL forum and raise questions/discuss there (– Click here).

Sandesh

NOTE: ETL Guru is applicable to all etl tools or databases like Informatica, DataStage, Syncsort DMExpress, Sunopsis or Oracle, Sybase, SQL Server Integration Services (SSIS)/DTS, Ab Initio, MS SQL Server, RDB, etc

Multiple executions of an ETL process against the same set of data.

Thursday, June 15th, 2006

Every ETL designer, developer and tester should always ask this question: "What will happen if I run the ETL process multiple times against the same data set?"

Answer 1: I get the same result set.
Answer 2: I get multiple result sets.

If you go back to the original article on What is ETL & What ETL is not, you will immediately come to the conclusion that Answer 2 is incorrect, as ETL is not allowed to create data.

Why would the process run more than once against the same set of data? Many reasons, the most common being an operator's mistake, an accidental kickoff, an old data file remaining in the directory, a staging table loaded more than once, an intentional rerun of the ETL process after correction of some data in the source data set, etc. Without going into further details, I would advise ETL folks to always build into your process ways to prevent this from causing damage, by one or more combinations of the following methods (a small sketch of method 2 follows the list)…
1. Identify the primary key (logical/physical) and use update-else-insert logic.
2. Delete the target data set before processing again (based on the logical/physical primary key).
3. Prevent multiple runs by flagging processed dates.
4. Mark processed records with a processed flag after commit.
5. Prevent multiple loads into the staging area.
6. Identify duplicate records in the staging area before the data gets processed.
7. more…
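As a small illustration of method 2 (all names are hypothetical), the process first removes whatever it may already have loaded for the incoming logical key, then inserts again, so a second run produces the same result as the first:

-- Re-runnable load: delete whatever this batch may have loaded earlier, then insert
DELETE FROM fact_sales
WHERE  business_date = :business_date;     -- logical key of the incoming data set

INSERT INTO fact_sales (business_date, store_key, product_key, quantity, price)
SELECT business_date, store_key, product_key, quantity, price
FROM   stg_sales
WHERE  business_date = :business_date;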

So do this experiment in the development or test environment: run the ETL process more than once and check the result! If you get result 2 (copies of rows, with no way to distinguish them or retire the old rows),
the designer or the developer is wrong; and if the process has passed QA or testing, then the tester is wrong as well.

Bottom line:
A test case that checks multiple runs is a must in the life cycle of an ETL process.

ETL delta logic & de-normalization of the data model.

Tuesday, June 6th, 2006

It is a normal practice in data warehousing to de-normalize the data model for performance (autocorrect once turned "de-normalize" into "demoralize"). I am not going to discuss the benefits of de-normalization versus its issues, as by the time it reaches the ETL guy the fate of the model has already been decided.

Let's look at the model on the source side, which is perfectly normalized.

Now let's look at the de-normalized model on the target side.
Next, let's think about the delta logic for loading the dim_employee table. Ideally you would only check for changes in the employee table: if there are any changes after the last load date time, get those rows from ref_employee, do the lookups to get the department and the designation, and load them into the target table.
 
The issue with this delta logic is that it does not consider the effect of the de-normalization of the employee table on the target side. If you look carefully at the two de-normalized attributes, dept_name and emp_designation_desc, the ETL process will miss any changes in the parent tables: only new or updated employees will get the new definition of department and designation, while any employee that has not been updated on the source side will still carry the old dept_name and emp_designation_desc. This is wrong.

The reason it is wrong is that the ETL delta logic only picked up a row from the employee table when that row changed, and ignored changes in the department and designation tables. The truth of the matter is: "For any de-normalized target table, the affected rows should be re-captured from the source any time there is a change in the driving/core table as well as when there is a change in any parent table that the driving table refers to." In this case, even if there is a change only in the department or designation table, all the affected rows in the employee table should be re-processed.

It might seem very simple, but ETL developers/designers/modelers always miss this point. And once the process is developed, the problem is very difficult to catch.

The next question is how you would catch the affected rows. Well, there are ways to write SQL that combines the three tables (in this case), treats them as one single entity, and pulls rows based on any update_dttm greater than the last ETL run. Figure out the SQL…
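One possible shape of that SQL (the department and designation table/column names are assumed for illustration):

-- Pick up an employee row when the employee OR its department OR its designation changed
SELECT e.emp_id, e.emp_name, d.dept_name, g.emp_designation_desc
FROM   ref_employee    e
JOIN   ref_department  d ON d.dept_id        = e.dept_id
JOIN   ref_designation g ON g.designation_id = e.designation_id
WHERE  e.update_dttm > :last_etl_run_dttm
   OR  d.update_dttm > :last_etl_run_dttm
   OR  g.update_dttm > :last_etl_run_dttm;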

Types of data elements and entities (tables) for ETL.

Saturday, June 3rd, 2006

It is important for an ETL developer to understand the types of tables and data in order to intelligently design ETL processes. Once the common types of objects are understood, reusable templates for ETL can be developed, regardless of business logic. This will greatly improve the efficiency of an ETL developer.

1. Reference data

2. Dimensional data (master data)

3. Transactional data

4. Transactions

5. Balances

6. Summary/Aggregations

7. Snapshots

8. Staging

9. Outriggers/mini dimensions

10. Log tables

11. Meta data tables

12. Security tables

13. Configuration tables

 

Data integration basics-Dimension Confirmation.

Thursday, May 25th, 2006

The purpose of this topic is to establish the basics for the design of ETL processes. Without this understanding, an ETL process for data integration cannot be designed or developed.

A data warehouse can have multiple sources. The sources may contain data sets of entirely different subject areas, but some data sets will intersect. For example, sales data and salary data will have employees as the common set.

Data Set Intersection.JPG 

Between two or more sources, the subjects, entities or even attributes can be common. So can we integrate the data easily? Mathematically it seems very easy, but the real world is not just about numbers or exactly matching string values. Everything can be similar or the same, yet not be represented in exactly the same manner in all the sources. The differences in the representation of the same information and facts between two or more sources create some of the most interesting challenges in data integration.

Data integration: the first step in data integration is the identification of common elements.
1. Identify the common entities.
For example, the Employee dimension can come from the Sales system, the Payroll system, etc.; Products can come from manufacturing, sales, purchasing, etc. Once a common entity is identified, its definition should be standardized. For example, does employee include full-time employees as well as temporary workers?
 
2. Identify the common attributes.
What are the attributes that are common to employee: first name, second name, last name, date of joining, etc.? Each attribute should be defined.

3. Identify the common values.
The same information can be represented in different forms in multiple source systems. For example, male sex can be represented as 'M' or '1' or 'male' or something else by each source system. A common representation must be decided (for example, 'Male'). Also, if necessary, a finite set of values should be established; for example, employee sex = ('Male', 'Female') and no more than these two values will be allowed.
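A small, hedged example of enforcing such a common representation during the load (the source codes shown are hypothetical):

-- Map each source system's sex codes to the agreed standard value
SELECT emp_id,
       CASE UPPER(TRIM(sex_cd))
            WHEN 'M'    THEN 'Male'
            WHEN '1'    THEN 'Male'
            WHEN 'MALE' THEN 'Male'
            WHEN 'F'    THEN 'Female'
            WHEN '2'    THEN 'Female'
            ELSE 'Unknown'            -- or route the row to an exception report
       END AS sex
FROM   src_employee;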

The second step is the identification of a Data Steward who will take responsibility and ownership for a particular set of data elements.

The third step is to design an ETL process to integrate the data into the target. This is the most important area in the implementation of an ETL process for data integration, and it will be discussed in more detail under its own heading.

The fourth and final step is to establish a process for the maintenance, review & reporting of these elements.

ETL and the importance of dates.

Wednesday, May 24th, 2006

ETL is all about data movement, batch processes and so on. The biggest criteria for data movement are based on dates. Dates like:

ETL_CREATE_DTTM,

ETL_UPDATE_DTTM,

SRC_SYS_CREATE_DTTM,

SRC_SYS_UPDATE_DTTM,

BUSINESS_CREATE_DATE,

BUSINESS_UPDATE_DATE

A proper understanding of dates is one of the most essential requirements while designing, developing, testing and scheduling ETL processes.

Let's discuss each of these in detail.

ETL Create Date Time: This is the date on which the row was loaded into the data warehouse. It has nothing to do with the business or business logic. It simply documents the exact date and time the row was populated in the database by the ETL process. Once this timestamp is added, it is never updated for the life of the row: it records the first time the row arrived in the database. It is also important to note that during a batch load, the date and time at the beginning of the process should be stored in a variable and the same exact date and time should be used throughout one instance of the batch process. Why? Think about it; more details will be offered soon.

ETL Update Date Time: This is the same as ETL Create Date Time, except that, while it may be populated during inserts, unlike the create timestamp above it is updated every time any value in the row is updated. It tells you the last time the database row was updated.

Source System Create Date Time: This is the date and time at which the incoming row was created in the source system. Sometimes a combination of rows from different tables, or summarization of rows, will make it difficult to find the create date, but a best guess can be used. Sometimes the source may not have the create date time at all; in that case the system date on the source may be used.

Source System Update Date Time: This is similar to the source create timestamp, but applies to the update date and time.

Now let's focus on one of the least understood and most misunderstood date concepts. It sometimes takes so much time to explain this at a work site that it prompted me to write this article.

Business Date: This is the official business date to which the data/row belongs. It has nothing to do with the system date or clock time on the source application, or with when the row was loaded into the data warehouse.

Example 1: If a sales order was booked on 1/1/2006 but was not entered into the system until the system date became 1/4/2006, the business date is still 1/1/2006.

Example 2: Trades are happening during the day of 22/12/2006 and the batch process starts within the application at 8 pm; even after the system clock passes midnight into the next day, 23/12/2006, the data still belongs to the 22nd and not the 23rd. This is normally controlled by a function that holds the business date, which is only changed when the triggering element instructs it to move to the new business date. This also helps because data can be processed for any historical date just by controlling the business date in the system.

Uses of these six types of dates? Think….

Delta processing, Batch processing, Back dated processing, Auditing, etc.
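For instance (column and variable names below are only illustrative), a delta extract and the ETL audit columns typically come together like this:

-- Delta processing: pull only rows changed in the source since the last successful run
SELECT order_id, order_amount, src_sys_update_dttm
FROM   src_orders
WHERE  src_sys_update_dttm > :last_successful_run_dttm;

-- One timestamp captured at the start of the batch (:batch_start_dttm) is stamped on
-- every row of that run, so all rows of a single load can be identified later
INSERT INTO dwh_orders (order_id, order_amount, business_date,
                        etl_create_dttm, etl_update_dttm)
VALUES (:order_id, :order_amount, :business_date,
        :batch_start_dttm, :batch_start_dttm);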

What ETL performance (processing speed) is good enough?

Sunday, May 14th, 2006

Every ETL developer / Data Warehouse Manager wants to know whether the ETL processing speed (usually measured as the NUMBER OF ROWS PROCESSED PER SECOND on the target side) is good enough. What is the industry standard? I could give a politically correct answer like… well, it depends…

But I would rather be blunt and say that usually 1,000 rows/sec for INSERTs and 300 rows/sec for UPDATEs should be good enough. If you don't have special requirements, then any average around these numbers should make you comfortable.
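As a rough sanity check of what those numbers mean: at 1,000 inserted rows per second, a 10-million-row load finishes in about 10,000,000 / 1,000 = 10,000 seconds, i.e. just under 3 hours, while the same volume of updates at 300 rows per second would need roughly 9 hours. Numbers like these help you decide whether your batch window makes you a "special requirements" case.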

What ETL is not?

Wednesday, May 10th, 2006

ETL should not be confused with a data creation process. It never creates new data. If a list of one hundred employees is being loaded, one more employee cannot be added to make it a hundred and one. If the last name of a customer is missing, an arbitrary last name cannot be substituted.

Data warehouses are not OLTP systems. Duplication of calculations in the source system and the data warehouse should not be attempted, as the numbers will be very difficult to match during QA. Also, in the future the process in the source system can change, which will result in asynchronous data.

ETL cannot change the meaning of data. For example, sex is stored as 'M' and 'F' in the source system while '1' and '2' are used in the data warehouse, respectively. This is OK because it does not change the business meaning of the data; it only changes the representation of the data.

Simulating Oracle Sequences in Sybase & SQL Server

Wednesday, April 26th, 2006

Programmatic control is lost when identity columns are used in Sybase and SQL Server, which is one of many reasons I do not recommend using identity columns to create surrogate keys during an ETL process. Oracle has the sequence feature, which is used extensively by Oracle programmers; I have no clue why other vendors do not provide the same. The custom code below has been used extensively by me and thoroughly tested: I ran multiple processes simultaneously to check for deadlocks and also made sure that the procedure returns different sequences to different client processes.

Notes: –

1. The table should have ‘ROW LEVEL LOCKING’

2. The sequence generator process is stateless (See more details in Object Oriented Programming)

3. Create one row for each target table in the sequence master table. Do not try to use one sequence for multiple tables. It will work but probably is not a good idea.

Step 1: -Create a table with following structure.

CREATE TABLE sequence_master (
sequence_nm varchar (55) NOT NULL ,
sequence_num integer NOT NULL
)
GO

Step 2: -Create a stored procedure that will return the next sequence.

CREATE PROCEDURE p_get_next_sequence
@sequence_name varchar(55)
AS
BEGIN
DECLARE @sequence_num INTEGER
-- A value of -1 is returned if no row exists for the requested sequence name
SET @sequence_num = -1

-- Increment and read the counter in a single atomic UPDATE (row-level lock)
UPDATE sequence_master
SET @sequence_num = sequence_num = sequence_num + 1
WHERE sequence_nm = @sequence_name

RETURN @sequence_num
END
GO
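A quick usage sketch (the sequence name below is just an example): seed one row per target table, then have the ETL process fetch keys through the procedure's return value.

-- Seed one counter row per target table (done once)
INSERT INTO sequence_master (sequence_nm, sequence_num)
VALUES ('DIM_EMPLOYEE', 0)
GO

-- Fetch the next surrogate key from within the ETL process
DECLARE @next_key INTEGER
EXEC @next_key = p_get_next_sequence 'DIM_EMPLOYEE'
SELECT @next_key      -- 1 on the first call, 2 on the next, and so on
GO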

 

Slow Running ETL process (read side) & Table Statistics (Oracle).

Sunday, April 23rd, 2006

Sometimes an ETL process runs at a considerably slow speed. During tests with a small result set it might fly, but when a million rows are applied the performance takes a nosedive. There can be many reasons for a slow ETL process; it can be slow because of the read, the transformation or the load. Let's eliminate the transformation and the load for the sake of discussion.

Some reasons for an ETL process being slow on the read side are: 1. No indexes on the joins and/or the 'where clause'. 2. A badly written query. 3. The source has not been analyzed. Of these three, let's rule out 1 & 2.
In the past most databases used the RULE-based optimizer (set in the INIT.ORA file), but for new development, and especially for data warehouses, the 'CHOOSE' optimizer setting is preferred. With the 'CHOOSE' option the query uses the COST-based optimizer if statistics are available for the tables in the query.

There are two methods of gathering statistics: 1. the DBMS_STATS package, 2. the ANALYZE command. Oracle does not recommend the ANALYZE command going forward, for various reasons (it is a command, it cannot analyze external tables, it gathers stats that are not essential, it produces inaccurate stats on partitioned tables and indexes, in the future ANALYZE will not support the cost-based optimizer, there is no monitoring for stale statistics, etc.).

DBMS_STATS.GATHER_SCHEMA_STATS (ownname =>’DWH’, options =>’GATHER AUTO’);
This Package will gather all necessary statistics automatically (there is no need to write a process to check for tables that have stale or no stats). Oracle implicitly determines which objects need new statistics, and determines how to gather those statistics. So once it is put in the ETL flow you can sleep at home and everybody will be happy.

Bottom Line:
After a major data load, the tables should be analyzed with an automated process. Also, please do not use ANALYZE, as in the future it will not collect the statistics needed for the CBO. The code can be found here.

This package should be used as part of the ETL workflow to make sure that, as batch processes run, the benefit of fresh statistics is available to the next process in the queue.

ETL programmers can also call this package at the beginning of an ETL process/procedure, so that if the tables required by the process have not been analyzed they will be analyzed automatically. And if the procedure is going to modify a lot of data in a table, the package can be called again just before the end of the procedure.
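For a single large table right after its load, a call along these lines (the owner and table names are placeholders) can be embedded near the end of the procedure:

BEGIN
   DBMS_STATS.GATHER_TABLE_STATS(ownname => 'DWH',
                                 tabname => 'FACT_SALES',
                                 cascade => TRUE);  -- also refresh the index statistics
END;
/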