Archive for April, 2006

Simulating Oracle Sequences in Sybase & SQL Server

Wednesday, April 26th, 2006

Programmatic control is lost when identity columns are used in Sybase and SQL Server, which is one of many reasons I do not recommend using identity columns to create surrogate keys during the ETL process. Oracle has the sequence feature, which Oracle programmers use extensively; I have no idea why other vendors do not provide the same. I have used the custom code below extensively and tested it thoroughly: I ran multiple processes simultaneously to check for deadlocks and made sure that the procedure returns different sequence numbers to different client processes.

Notes:

1. The table should have ‘ROW LEVEL LOCKING’ (see the sketch after Step 1).

2. The sequence generator process is stateless (See more details in Object Oriented Programming)

3. Create one row for each target table in the sequence master table. Do not try to use one sequence for multiple tables; it will work, but it is probably not a good idea.

Step 1: Create a table with the following structure.

CREATE TABLE sequence_master (
    sequence_nm  varchar(55) NOT NULL,  -- one row per target table
    sequence_num integer     NOT NULL   -- last sequence number handed out
)
GO
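A minimal sketch of notes 1 and 3 above, assuming Sybase ASE syntax for row-level locking and hypothetical target-table names (SQL Server locks at the row level by default, so the ALTER TABLE step applies only to Sybase):

-- Sybase ASE: switch the table to datarows (row-level) locking
ALTER TABLE sequence_master LOCK DATAROWS
GO

-- One seed row per target table, each sequence starting at 0
INSERT INTO sequence_master (sequence_nm, sequence_num) VALUES ('customer_dim', 0)
INSERT INTO sequence_master (sequence_nm, sequence_num) VALUES ('sales_fact', 0)
GO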

Step 2: Create a stored procedure that returns the next sequence number.

CREATE PROCEDURE p_get_next_sequence
    @sequence_name varchar(55)
AS
BEGIN
    DECLARE @sequence_num INTEGER

    -- Returns -1 if no row for this sequence exists in the table.
    SET @sequence_num = -1

    -- Increment and capture the new value in a single atomic UPDATE
    -- so that concurrent callers never receive the same number.
    UPDATE sequence_master
    SET @sequence_num = sequence_num = sequence_num + 1
    WHERE sequence_nm = @sequence_name

    RETURN @sequence_num
END
GO
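A quick usage sketch, assuming the seed rows above exist; the sequence name is just an example:

DECLARE @next_id INTEGER

-- Capture the procedure's return value; -1 means no row exists for that sequence name
EXEC @next_id = p_get_next_sequence 'customer_dim'

SELECT @next_id
GO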

 

What Is ETL?

Monday, April 24th, 2006

ETL is the automated and auditable data acquisition process from a source system that involves one or more sub-processes of data extraction, data transportation, data transformation, data consolidation, data integration, data loading and data cleaning.

Discussion:

A source system can be any application or data store that creates or stores data and acts as a data source to other systems. If data is created in such a source system, it can just as well be called a data factory.

Automation is critical, or the very purpose of ETL is defeated. ETL is of little use if its processes need to be scheduled, executed, or monitored manually.

Data quality is a critical factor for the success of data warehousing projects. If the data is of inadequate quality, then the knowledge workers who query the data warehouse and the decision makers who receive the information cannot trust the results. In order to obtain clean and reliable data, it is imperative to focus on data quality. While many data warehouse projects do take data quality into consideration, it is often treated as an afterthought. Even QA after ETL is not good enough; the quality process needs to be incorporated into the ETL process itself. Data quality has to be maintained for individual records, or even small bits of information, to ensure the accuracy of the complete database. One rotten tomato can ruin the whole basket. It has to be ensured that the data is consistently updated to represent the source system. The accuracy of each individual data element is important, and so is the integrity of the data across related pieces of information.

Extraction is the first major step in the physical implementation of ETL. Extraction initiates or triggers the downstream processes.

Needless to say, once the data is extracted it has to be hauled and transported to the target, because the physical location of the source system might be different from that of the target warehouse. Many aspects like security, encryption, bandwidth, and transportation time also have to be considered, which will be discussed in the upcoming chapters.

Transformation is the series of tasks that prepares the data for loading into the warehouse. Once the data is secured, you have to worry about its format and structure, because it will not be in the format needed for the target; for example, the grain level or the data types might be different. The data cannot be used as is; rules and functions need to be applied to transform it. There are many types of transformation that occur on the incoming data, which will be discussed later on.
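As a small illustration of a grain and data-type change, here is a hedged sketch in plain SQL; the staging table and column names are hypothetical:

-- Change the grain from individual transactions to one row per customer per day,
-- and cast the amount from a character staging column to a numeric type.
SELECT customer_id,
       txn_date,
       SUM(CAST(amount_txt AS DECIMAL(12,2))) AS daily_amount
FROM stg_transactions
GROUP BY customer_id, txn_date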

One of the purposes of ETL is to consolidate the data in a central repository, that is, to bring it to one logical or physical place. Data can be consolidated from similar systems, different subject areas, and so on.

ETL must support data integration for data coming from multiple sources and data arriving at different times. This has to be a seamless operation; otherwise we risk overwriting existing data, creating duplicate data or, even worse, simply being unable to load the data into the target.

The loading part of the process is critical to integration and consolidation. The loading process decides how the data is added to the warehouse, or whether it is simply rejected. Operations like inserting, updating, or deleting are executed at this step. What happens to the existing data? Should the old data be deleted because of the new information? Should it be archived? Or should the new data simply be added to the existing data?
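One common pattern for that decision is update-else-insert; a minimal sketch in standard SQL, with hypothetical table names (many dialects also offer MERGE for this):

-- Update rows that already exist in the target...
UPDATE dim_customer
SET customer_name = (SELECT s.customer_name
                     FROM stg_customer s
                     WHERE s.customer_id = dim_customer.customer_id)
WHERE EXISTS (SELECT 1 FROM stg_customer s
              WHERE s.customer_id = dim_customer.customer_id)

-- ...then insert the rows that do not exist in the target yet.
INSERT INTO dim_customer (customer_id, customer_name)
SELECT s.customer_id, s.customer_name
FROM stg_customer s
WHERE NOT EXISTS (SELECT 1 FROM dim_customer d
                  WHERE d.customer_id = s.customer_id)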

The incoming data might not be perfect. New York City might be shown in India, or the Taj Mahal in France. The data has to be cleaned before it goes into the database.

Data can be loaded with lots of care, but does that mean the data loaded into the warehouse is correct? What is the confidence level in the data? Only a data auditing process can establish that confidence level, and this auditing normally happens after the data is loaded.

Data derivation is the process of adding data values derived from reference or transactional data using simple string or mathematical functions, for example income – expense = profit, or NY = New York. Ideally no business rules should be involved, and complex or time-variant calculations should be avoided.
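A tiny sketch of those two derivations, with hypothetical column names:

-- Derive profit from two existing measures and expand a state code
SELECT income,
       expense,
       income - expense AS profit,
       CASE WHEN state_cd = 'NY' THEN 'New York' ELSE state_cd END AS state_nm
FROM stg_monthly_results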

Data enrichment is adding or combining external data values and rules to enrich the information already existing in the data. For example, if we can get a list that relates ZIP code, city, and state, then an address field with ZIP code 06905 can safely be enriched by doing a lookup on this table: ZIP code 06905 –> city Stamford –> state CT.
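A hedged sketch of that lookup in standard SQL, assuming a hypothetical zip_lookup reference table (older dialects may prefer an UPDATE ... FROM form instead of scalar subqueries):

-- Enrich addresses that carry only a ZIP code with the matching city and state
UPDATE address
SET city  = (SELECT z.city  FROM zip_lookup z WHERE z.zip_code = address.zip_code),
    state = (SELECT z.state FROM zip_lookup z WHERE z.zip_code = address.zip_code)
WHERE EXISTS (SELECT 1 FROM zip_lookup z WHERE z.zip_code = address.zip_code)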

 

Slow Running ETL process (read side) & Table Statistics (Oracle).

Sunday, April 23rd, 2006

Sometimes an ETL process runs at a considerably slow speed. During testing with a small result set it might fly, but when a million rows are applied the performance takes a nosedive. There can be many reasons for a slow ETL process: it can be slow on the read, the transformation, or the load. Let’s eliminate the transformation and the load for the sake of this discussion.

Here are some reasons for an ETL process to be slow on the read side: 1. no indexes on the join and/or ‘where clause’ columns; 2. a badly written query; 3. the source tables not being analyzed. Out of these three, let’s rule out 1 and 2.
In the past most databases had the RULE-based optimizer set in the INIT.ORA file, but for new development, and especially for data warehouses, the ‘CHOOSE’ optimizer is preferred. With the ‘CHOOSE’ option a query uses the COST-based optimizer if statistics are available for the tables in the query.
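To check whether the source tables have statistics at all, and how fresh they are, a quick query against the Oracle data dictionary helps; the ‘DWH’ schema is just an example:

-- Tables with a NULL last_analyzed date have never been analyzed
SELECT table_name, num_rows, last_analyzed
FROM all_tables
WHERE owner = 'DWH'
ORDER BY last_analyzed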

There are two methods to gather statistics: 1. the DBMS_STATS package, and 2. the ANALYZE command. Oracle does not recommend the ANALYZE command going forward for various reasons (it is a command rather than a package, it cannot analyze external tables, it gathers statistics that are not essential, it produces inaccurate statistics on partitioned tables and indexes, in the future it will not support the cost-based optimizer, it has no monitoring for stale statistics, etc.).

DBMS_STATS.GATHER_SCHEMA_STATS (ownname =>’DWH’, options =>’GATHER AUTO’);
This package call gathers all necessary statistics automatically (there is no need to write a process to check for tables that have stale or missing statistics). Oracle implicitly determines which objects need new statistics and how to gather them. So once it is put into the ETL flow, you can sleep at home and everybody will be happy.

Bottom Line:
After a major data load the tables should be analyzed with an automated process. Also, please do not use ANALYZE, as in the future it will not collect the statistics needed for the CBO (cost-based optimizer). The code can be found here.

This package should be used as part of the ETL workflow to make sure that, as batch processes run, the benefit of fresh statistics is available to the next process in the queue.

ETL programmers can also call this package at the beginning of an ETL process/procedure, so that if the tables required by the process are not analyzed they will be analyzed automatically. And if the procedure is going to modify a lot of data in a table, the package can be called again just before the end of the procedure, as sketched below.
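A minimal PL/SQL sketch of that pattern; the procedure, schema, and table names are hypothetical:

CREATE OR REPLACE PROCEDURE load_sales_fact AS
BEGIN
  -- Make sure the tables this process reads have statistics before the big query runs
  DBMS_STATS.GATHER_SCHEMA_STATS(ownname => 'DWH', options => 'GATHER AUTO');

  -- ... the actual extract / transform / load work goes here ...

  -- The load modified a lot of rows, so refresh this table's statistics
  -- before the next process in the queue reads it.
  DBMS_STATS.GATHER_TABLE_STATS(ownname => 'DWH', tabname => 'SALES_FACT', cascade => TRUE);
END;
/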

Shell Scripts for Oracle

Sunday, April 23rd, 2006

Here’s a lazy post for shell scripts: Links. In the future I will add some specialized scripts for ETL.