Archive for the ‘ETL Strategy’ Category

Slower Development Databases and Servers

Friday, November 2nd, 2007

It is a normal trend in it to buy the most powerful machine for production usage. Example you production database box will have 16 CPUs your QA 8 and your Development box 4 or 6. Similar ratio is maintained on ETL servers, Application servers, and Hard Disk performance.

Logic being the production environment is critical for end user experience.
Agreed! That the response time is critical for end users; however that does not mean you buy slower machine for your development. Imagine the time wasted while a resource is sitting in front of the machine, waiting for the query to return with data. A developer usually spends all his time working in the development database. So imagine a development box slower by 50% then production. That means a developer is loosing 50% of his time waiting for his result. It’s not just about his time but also about his focus and concentration that he will loose because his experience is bad with the system. It will be not an exaggeration to say that he will loose 4 hours from an 8 hours day.

In US total cost on an ETL developer is about $100 an hour. So the company is loosing $400 per developer per day. You do the math for 5 to 10 developers working for 1 year.

I am not saying that Production and Development systems must be same, but I firmly believe the difference should be more in the reliability and availability of the servers rather then on the performance difference between the servers.

I thought of this article while waiting for a query to return from a development box. It took exactly half the time as the production server and I remembered that the management is already discussing of upgrading the development server to improve the performance. 

Type II dimension: What is expiry date? Does it come from Source or is it ETL date?

Tuesday, February 6th, 2007

I assume everyone is aware of type II dimension concept.  In type II dimension when we get an updated row we retire old row and insert new one. But what do you populate  in the from_date —to_date attributes. ETL create/update datetime or when that particular row was retired in the source system?

The question can be asked in a different way… When a employee changes from one department  to another department; While populating in the data warehouse type II dimension EMPLOYEE table… Do you use the actual date of his transfer or the ETL date when you are populating the table?

What are you doing in your ETL code?

ETL Strategy for Reversing wrong file/data load.

Friday, January 19th, 2007

We all know ‘garbage in, garbage out’. But what happens when the data provider (source system) realizes that they have provided incorrect data to the data warehouse? And now they want to correct the mistake by providing you new data! By then ETL has already loaded the data. The data is inserted, updated & deleted! Simple inserts is easy, but what about integrated data because of updates? How do you recover from such situation? Can you actually recover? What should be the ETL strategy?

Has your Data warehouse Architect/Manager/ETL Lead come up with an ETL Strategy?

NOTE: This post is applicable to all etl tools or databases like Informatica, DataStage, Syncsort DMExpress, Sunopsis or Oracle, Sybase, SQL Server Integration Services (SSIS)/DTS, Ab Initio, MS SQL Server, RDB, Teradata, etc.

Loading & testing fact/transactional/balances (data), which is valid between dates!

Tuesday, July 25th, 2006
This is going to be a very interesting topic for ETL & Data modelers who design processes/tables to load fact or transactional data which keeps on changing between dates. ex: prices of shares, Company ratings, etc.

variable_bond_interest.JPG

The table above shows an entity in the source system that contains time variant values but they don’t change daily. The values are valid over a period of time; then they change. variable_bond_interest_fct1.JPG

1 .What the table structure should be used in the data warehouse?

Maybe Ralph Kimball or Bill Inmon can come with better data model! :-) But for ETL developers or ETL leads the decision is already made so lets look for a solution.

2. What should be the ETL design to load such a structure?

Design A

  • There is one to one relationship between the source row and the target row.
  • There is a CURRENT_FLAG attribute, that means every time the ETL process get a new value it has add a new row with current flag and go to the previous row and retire it. Now this step is a very costly ETL step it will slow down the ETL process.
  • From the report writer issue this model is a major challange to use. Because what if the report wants a rate which is not current. Imagine the complex query.

Design B

  • In this design the sanpshot of the source table is taken every day.
  • The ETL is very easy. But can you imagine the size of fact table when the source which has more than 1 million rows in the source table. (1 million x 365 days = ? rows per year). And what if the change in values are in hours or minutes?
  • But you have a very happy user who can write SQL reports very easily.

Design C

  • Can there be a comprimise. How about using from date (time) – to date (time)! The report write can simply provide a date (time) and the straight SQL can return a value/row that was valid at that moment.
  • However the ETL is indeed complex as the A model. Because while the current row will be from current date to- infinity. The previous row has to be retired to from date to todays date -1.
  • This kind of ETL coding also creates lots of testing issues as you want to make sure that for nay given date and time only one instance of the row exists (for the primary key).

Which design is better, I have used all depending on the situtation.

3. What should be the unit test plan?

There are various cases where the ETL can miss and when planning for test cases and your plan should be to precisely test those. Here are some examples of test plans

a. There should be only one value for a given date/date time

b. During the initial load when the data is available for multiple days the process should go sequential and create snapshots/ranges correctly.

c. At any given time there should be only one current row.

d. etc

NOTE: This post is applicable to all etl tools or databases like Informatica, DataStage, Syncsort DMExpress, Sunopsis or Oracle, Sybase, SQL Server Integration Services (SSIS)/DTS, Ab Initio, MS SQL Server, RDB, etc.

Introduction to Error and exception management.

Monday, July 3rd, 2006

ETL is all about transportation, transformation and organizing of data. Of anytime something moves (as a matter of fact even if you are perfectly stationary and items around moves) accidents are bound to happen. So any ETL specialist believes that their code is perfect and nothing can happen obviously lives in a fool’s paradise.

The next obvious thing is to design to manage accidents, like making a safer car or a factory. And as an ETL specialist if you don’t do it you are no different then others. As in any country there are laws for accidents and accident due to criminal negligence. Later being the worst.

How many times I have seen people putting ETL code into production without actually designing processes to prevent, manage or report accidents. Writing code is one thing writing production worthy code is another. Do ask yourself or your developers, “Is the code production worthy?”

Next

ERRORS:

A programmatic error that causes the the program to fail or makes the program run for uncontrolled time frame.
EXCEPTIONS: A program/code written to handle expected or unexpected errors gracefully so that the program continues run with logging the error and bypassing the erroneous conditions or even logging the error and gracefully exiting with error message.

More detailed description will come with topic…. ‘Unhandled exceptions results in Errors’.

Note: The topic on error and exceptions is relevant to Informatica, Data Stage, Abinitio, Oracle warehouse builder, PLSQL, SQLLDR, Transact SQL or any ETL other tools.

ETL and the importance of dates.

Wednesday, May 24th, 2006

ETL is all about data movement, batch processes and so on. The biggest criterias for data movement are based on dates. Dates like

ETL_CREATE_DTTM,

ETL_UPDATE_DTTM,

SRC_SYS_CREATE_DTTM,

SRC_SYS_UPDATE_DTTM,

BUSINESS_CREATE_DATE,

BUSINESS_UPDATE_DATE

Proper understanding of dates is one of the most essential requirements, while designing, developing, testing, and scheduling ETL process.

Lets discuss each of these in details..

ETL Create Date Time: -This is the date on which the row is loaded in the data warehouse. It has nothing to do with the business or business logic. It simply documents the exact date and time, the row as populated in the database by the ETL process. Once time stam is added it is never updated for the life of the database. It defines the first time the row arrived in the database. Also it is important to note that during a batch load the date and time at the begining of the process should be stored in the variable & same exact date and time should be used for one instance of the batch process. Why… think about it, more details will be offered soon.

ETL Update Date Time: – It is same as ETL Create Date Time except that, it may be populated during inserts, but unlike create timestamp above, it is updated every time any value in the row is updated. It give the information, as to the last time the database row was updated.

Source System Create Date Time: – This is the date time on which the incoming row was created in the source system. Some times due to combination of rows from different tables, or summerization of rows will make it difficult to find out the create date. But a best guess can be used. Also some times the source may not have the create date time at all, in that case the system date on the source may be used.

Source System Update Date Time: -This is similar to source create timestamp. But only applies to the update date time.

Now lets focus on some of the least understood & most misunderstood date concept on dates. Some times it takes so much time to explain this at worksite that it prompted me to write this article.

Business Date: -This is the official business date to which the the data/row belongs. This has nothing to do with the system date or clock time on the source application; or when it was loaded in the data warehouse.

Example 1: If a sales order was booked on the 1/1/2006 and was not entered into the system until system date becomes 1/4/2006 still the business date will be 1/1/2006.

Example 2: Trade are happening during the day 22/12/2006 and the batch process starts within the application at 8 pm and even after the stytem clock passess midnight into next day 23/12/2006; still the data belongs to 22nd and not to 23rd. This is normaly controlled by a function that holds the business date and is only changed when the triggering element is instructs to change the business date to new one. This also helps because the data can be processed for any historical date just by controlling the business date in the system.

Uses of these six types of dates? Think….

Delta processing, Batch processing, Back dated processing, Auditing, etc.