Archive for May, 2006

Data integration basics: Dimension Conformation.

Thursday, May 25th, 2006

The purpose of this topic is to establish the basics for the design of ETL processes. Without this understanding, an ETL process for data integration cannot be designed or developed.

A database can have multiple sources. These sources may cover entirely different subject areas, but some of their data sets will intersect. For example, sales data and salary data will both have employee as the common set.

[Figure: Data set intersection between sources]

Between two or more sources, the subjects, entities or even attributes can be common. So can we integrate the data easily? Mathematically it seems very easy, but the real world is not just about numbers or exactly matching string values. Everything can be similar or even the same, yet not be represented in exactly the same manner in all the sources. The differences in how the same information and facts are represented across two or more sources create some of the most interesting challenges in data integration.

Data integration: The first step in data integration is the identification of common elements.
1. Identify the common entities.
For example, the Employee dimension can come from the sales system, the payroll system, etc., and Products can come from manufacturing, sales, purchasing, etc. Once a common entity is identified, its definition should be standardized. For example, does Employee include full-time employees as well as temporary workers?
 
2. Identify the common attributes.
What are the attributes that are common to Employee: first name, second name, last name, date of joining, etc.? Each attribute should be defined.

3. Identify the common values.
The same information can be represented in different forms in multiple source systems. For example, employee sex can be represented as ‘M’ or ‘1’ or ‘male’ or something else by each source system. A common representation must be decided on (for example, ‘Male’). If necessary, a finite set of values should also be established, e.g. employee sex = (‘Male’, ‘Female’), with no values allowed beyond these two.
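
As an illustration only (the post does not prescribe any particular implementation), a small lookup can conform source-specific codes to the agreed standard and reject anything outside the allowed set; the names and codes below are assumptions.

    ALLOWED_VALUES = {"Male", "Female"}          # the agreed finite set of values
    SOURCE_TO_CONFORMED = {                      # keys are lower-cased source codes
        "m": "Male", "1": "Male", "male": "Male",
        "f": "Female", "2": "Female", "female": "Female",
    }

    def conform_sex(source_value):
        """Map a source-system code to the standard representation, or fail loudly."""
        conformed = SOURCE_TO_CONFORMED.get(str(source_value).strip().lower())
        if conformed not in ALLOWED_VALUES:
            raise ValueError("Unmapped sex code from source: %r" % (source_value,))
        return conformed

    # conform_sex("M") -> "Male", conform_sex(1) -> "Male", conform_sex("x") raises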

The second step is the identification of a Data Steward who will take responsibility and ownership for a particular set of data elements.

The third step is to design an ETL process to integrate the data into the target. This is the most important area in the implementation of an ETL process for data integration. This topic will be discussed in more detail under its own heading.

The fourth and final step is to establish a process for the maintenance, review & reporting of these elements.

ETL and the importance of dates.

Wednesday, May 24th, 2006

ETL is all about data movement, batch processes and so on. The biggest criteria for data movement are based on dates, such as:

ETL_CREATE_DTTM,

ETL_UPDATE_DTTM,

SRC_SYS_CREATE_DTTM,

SRC_SYS_UPDATE_DTTM,

BUSINESS_CREATE_DATE,

BUSINESS_UPDATE_DATE

A proper understanding of dates is one of the most essential requirements while designing, developing, testing and scheduling ETL processes.

Let's discuss each of these in detail.

ETL Create Date Time: This is the date on which the row is loaded into the data warehouse. It has nothing to do with the business or business logic; it simply documents the exact date and time the row was populated in the database by the ETL process. Once this timestamp is added it is never updated for the life of the database: it defines the first time the row arrived in the database. It is also important to note that during a batch load, the date and time at the beginning of the process should be stored in a variable, and that same exact date and time should be used throughout one instance of the batch process. Why… think about it; more details will be offered soon.
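
A minimal sketch of that idea, assuming the rows are handled as simple dictionaries and using the column names listed above: the timestamp is captured once at the start of the batch instance and reused for every row that run touches.

    from datetime import datetime

    # Captured once, at the beginning of this batch instance, and reused everywhere.
    BATCH_LOAD_DTTM = datetime.now()

    def stamp_inserted_row(row):
        """Rows inserted by this run get both ETL timestamps set to the batch time."""
        row["ETL_CREATE_DTTM"] = BATCH_LOAD_DTTM   # never changed again for the life of the row
        row["ETL_UPDATE_DTTM"] = BATCH_LOAD_DTTM
        return row

    def stamp_updated_row(row):
        """Rows updated by this run only refresh the update timestamp."""
        row["ETL_UPDATE_DTTM"] = BATCH_LOAD_DTTM
        return row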

ETL Update Date Time: This is the same as the ETL Create Date Time except that, while it may be populated during inserts, unlike the create timestamp above it is updated every time any value in the row is updated. It gives the last time the database row was updated.

Source System Create Date Time: This is the date and time at which the incoming row was created in the source system. Sometimes a combination of rows from different tables, or a summarization of rows, makes it difficult to find the create date, but a best guess can be used. Also, sometimes the source may not have a create date time at all; in that case the system date on the source may be used.

Source System Update Date Time: This is similar to the source create timestamp, but applies to the update date and time.
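
A tiny sketch of the best-guess fallback described above; the column name comes from the list earlier, while the dictionary-style row and the fallback order are assumptions.

    from datetime import datetime

    def best_guess_src_create_dttm(row):
        """Use the source-supplied create timestamp if present, else fall back to the clock."""
        return row.get("SRC_SYS_CREATE_DTTM") or datetime.now()   # fallback choice is an assumption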

Now let's focus on one of the least understood and most misunderstood concepts about dates. Sometimes it takes so much time to explain this at a worksite that it prompted me to write this article.

Business Date: This is the official business date to which the data/row belongs. It has nothing to do with the system date or clock time on the source application, or with when the data was loaded into the data warehouse.

Example 1: If a sales order was booked on 1/1/2006 but was not entered into the system until the system date became 1/4/2006, the business date is still 1/1/2006.

Example 2: Trades are happening during the day on 22/12/2006 and the batch process starts within the application at 8 pm; even after the system clock passes midnight into the next day, 23/12/2006, the data still belongs to the 22nd and not to the 23rd. This is normally controlled by a function that holds the business date and changes it only when the triggering element instructs it to move to the new business date. This also helps because data can be processed for any historical date just by controlling the business date in the system (see the sketch below).
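
A minimal sketch of such a business-date holder, assuming a small control object (or table) stores the current business date and only an explicit instruction rolls it forward; all names here are hypothetical.

    from datetime import date, timedelta

    class BusinessDateControl:
        """Holds the official business date independently of the system clock."""

        def __init__(self, initial_business_date):
            self._business_date = initial_business_date

        def current(self):
            # Every row processed right now carries this date, whatever the wall clock says.
            return self._business_date

        def roll(self, new_business_date=None):
            # Called only by the triggering element (end-of-day job, operator, reprocessing run).
            self._business_date = new_business_date or self._business_date + timedelta(days=1)
            return self._business_date

    # Trades booked after midnight still carry 22/12/2006 until roll() is invoked.
    control = BusinessDateControl(date(2006, 12, 22))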

What are the uses of these six types of dates? Think…

Delta processing, batch processing, back-dated processing, auditing, etc.; a delta-processing sketch follows.
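
For example, delta processing typically leans on these columns: pull only the rows whose source update timestamp is newer than the last successful batch run. A sketch, with a hypothetical source table name and a DB-API style bind variable.

    def build_delta_query():
        """Select only rows changed since the previous successful batch timestamp."""
        return (
            "SELECT * FROM source_orders "                    # table name is an assumption
            "WHERE SRC_SYS_UPDATE_DTTM > :last_batch_dttm"    # previous run's ETL timestamp
        )

    # cursor.execute(build_delta_query(), {"last_batch_dttm": last_successful_run})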

ETL data file processing questions you must ask!

Wednesday, May 17th, 2006

Many times file processing is taken quite lightly these days. The reason is either ignorance or sheer carelessness. Another reason is that, coming from an RDBMS background, many developers are simply not aware of the intricacies of file management.

So wake up…

Pre process
Before processing the file, have you checked whether the contents of the header/trailer records actually match the file content? (See the sketch after this list.)

Did the checksum of the file match that of the file on the remote server?

Does the file contain delta rows, or repeat/all rows?

Does the file contain flags for update, delete and insert?

Does the file transfer protocol support restart capabilities?

Have you asked for data file envelope information (e.g. source name, date) in the contents of the data file or in the file name?

Have you asked for a header or trailer record? If yes, is it in the data file itself or in another file?

How do you know that the file is ready to be pulled?

How do you know there are no binary characters in the ASCII file?

If the file has to be pulled, what is the remote server name, IP address, user id and password?

If a file in the same format is received from multiple sources, how do you distinguish files from the various source systems? (By content, by name or by location?)

If the same file is processed on multiple schedules, how do you distinguish between, say, a daily file, a monthly file or a yearly file?

If the file is being pulled from a remote server, will a backup of the file be maintained on the remote server?

Is the remote file server identified by IP address rather than by name? In that case, how does the system handle failover & a change of server?

Is a timestamp appended to the file name?

What are the contact details of the main and backup person?

What file transport protocol should be used?

What is the delimiter of the file?

What is the expected size of file?

What is the file type: binary, ASCII, EBCDIC or something else?

What is the format of the file: multi-line, XML, etc.?

What is the frequency of the file: continuous, daily, weekly, bimonthly, etc.?

What is the location of the file?

What is the name of the file?

What is the source of the file?

What kind of access is provided: ftp, secured, proprietary?

Who is the main contact person and backup contact for the file?

Will the file be overwritten daily in the source (remote) server?

Will the file be pushed or pulled?
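
As one way to act on the header/trailer and checksum questions above, here is a minimal sketch; the trailer layout (last line carrying the expected row count as its final pipe-delimited field) and the use of MD5 are assumptions, not a standard.

    import hashlib

    def file_checksum(path):
        """MD5 of the local file, to compare with the value reported for the remote copy."""
        digest = hashlib.md5()
        with open(path, "rb") as handle:
            for chunk in iter(lambda: handle.read(8192), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def trailer_matches(path):
        """Assume the last line is a trailer whose final field is the expected data row count."""
        with open(path, "r") as handle:
            lines = handle.read().splitlines()
        expected_rows = int(lines[-1].split("|")[-1])   # trailer format is an assumption
        actual_rows = len(lines) - 2                    # minus the header and trailer records
        return expected_rows == actual_rows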

In process

How do you know that the file being processed is not an old file?

How will the file be moved during processing?

What is the archive policy?

How should the process deal with deletes?

Can you avoid complete processing of the file by taking a delta when compared to the previous day's file?

Is the file compressed after load?

How do you know that all the file movements actually happened and did not fail at any step?

Did you check the exit code after every step? (See the sketch after this list.)

Did you convert the file columns into the correct format?

What will you do if some of the columns are not in correct format?

Will you skip the row, replace the value or stop the process?

After how many failures will you terminate the process?

What happens if the load fails part-way, leaving a half load?

How will you identify the half loaded records?

How will you deal with the remaining records? Will they be corrected manually?

Will you rerun the whole file to fix the problem?

What will happen if during the rerun the file is not completely processed?
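
A minimal sketch of the exit-code question above, assuming the file movement steps are operating-system commands driven from a wrapper script; the commands and paths are placeholders.

    import subprocess
    import sys

    # Run each file-movement step and stop at the first failure instead of continuing blindly.
    steps = [
        ["cp", "/landing/orders.dat", "/staging/orders.dat"],   # placeholder paths
        ["gzip", "/staging/orders.dat"],                        # e.g. compress after the move
    ]

    for step in steps:
        result = subprocess.run(step)
        if result.returncode != 0:
            sys.exit("Step %s failed with exit code %d" % (" ".join(step), result.returncode))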

Post process

Did the row count in the table match that of the file? (See the sketch after this list.)

Does the Checksum match?

Did a column shift occur in the file?

Have you stored the result of the load in a table?
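
One way to close the loop on the row-count question, as a sketch only: compare the number of data records in the file with the number of rows the load reported, and keep the outcome so it can be written to an audit table (the header/trailer assumptions are mine).

    def reconcile_load(file_path, rows_loaded, has_header=True, has_trailer=True):
        """Compare the file's data record count against the rows reported loaded."""
        with open(file_path, "r") as handle:
            record_count = sum(1 for _ in handle)
        record_count -= int(has_header) + int(has_trailer)   # strip envelope records
        # In a real process this dictionary would be inserted into a load-audit table.
        return {"file_rows": record_count, "table_rows": rows_loaded,
                "matched": record_count == rows_loaded}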

 

What ETL performance (processing speed) is good enough?

Sunday, May 14th, 2006

Every ETL developer / Data Warehouse Manager wants to know whether their ETL processing speed (usually measured in NUMBER OF ROWS PROCESSED PER SECOND on the target side) is good enough. What is the industry standard? I could give a politically correct answer like… well, it depends…

But I would rather be blunt and say that usually 1,000 rows/sec for INSERTs and 300 rows/sec for UPDATEs should be good enough. If you don't have special requirements, then any average around these numbers should make you comfortable.
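
The metric itself is simply rows divided by elapsed seconds; a trivial sketch of measuring it around a load call, where the load function is a placeholder for whatever writes to the target.

    import time

    def rows_per_second(load_function, rows):
        """Measure target-side throughput for a single load call."""
        start = time.monotonic()
        rows_processed = load_function(rows)    # placeholder: returns the number of rows written
        elapsed = time.monotonic() - start
        return rows_processed / elapsed if elapsed > 0 else float("inf")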

What ETL is not?

Wednesday, May 10th, 2006

ETL should not be confused with a data creation process; it never creates new data. If a list of a hundred employees is being loaded, one more employee cannot be added to make it a hundred and one. Nor, if the last name of a customer is absent, can an arbitrary last name be substituted.

Data warehouses are not OLTP systems. Duplicating calculations in both the source system & the data warehouse should not be attempted, as the numbers will be very difficult to match during QA. Also, the process in the source system can change in the future, which will leave the two sets of data out of sync.

ETL cannot change the meaning of data. For example, the sex codes ‘M’ and ‘F’ in the source system may be stored as ‘1’ and ‘2’ respectively in the data warehouse. This is OK because it does not change the business meaning of the data; it only changes the representation of the data.

Why Informatica sequences & Sybase/SQL Server identity columns should not be used?

Sunday, May 7th, 2006

Informatica provides a sequence object to create surrogate keys during the load. This object is also sharable across various mappings and hence can be used in parallel. But I would recommend never using it.
Here are the reasons why…

1. MIGRATION ISSUE: The sequence value is stored in the Informatica repository, which means it is disconnected from the target database. So during migration of code between environments the sequence has to be reset manually. A bigger problem arises when tables with data are brought from the production environment to the QA environment: the mappings cannot be run immediately because the sequences are out of sync.

2. Sequences belong to the target schema and the database, not to the processes; the table object should define and hold the values for its attributes, not something external to the system (a sketch of this approach follows the list).

3. At first it might seem that Sybase/SQL Server does not support sequences, but check out the post & that will solve the problem.

4. Logically it seems that Informatica calls to an Oracle procedure would slow down the ETL process, but in the real world I have found this not to be true. Additionally, this procedure is only called when new reference data / a new dimension row is added, and reference data is not as volatile as transactions, so the adverse effect is nullified.

5. If identity columns are used instead, they cause more problems because you lose programmatic control over them. For example, any Type II dimension changes become a nightmare to manage.
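
To illustrate the alternative the post argues for (letting the target database own surrogate key generation rather than the tool), here is a sketch that uses a database-side sequence during the load; the sequence name, table, columns and DB-API cursor are assumptions, and the NEXTVAL syntax shown is Oracle's.

    def insert_dimension_row(cursor, employee_id, first_name, last_name):
        """Let a database sequence (Oracle syntax assumed) supply the surrogate key."""
        cursor.execute(
            "INSERT INTO dim_employee (employee_key, employee_id, first_name, last_name) "
            "VALUES (employee_key_seq.NEXTVAL, :employee_id, :first_name, :last_name)",
            {"employee_id": employee_id, "first_name": first_name, "last_name": last_name},
        )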