Archive for the ‘ETL Performance Tuning’ Category

Slower Development Databases and Servers

Friday, November 2nd, 2007

It is a normal trend in IT to buy the most powerful machine for production usage. For example, your production database box will have 16 CPUs, your QA box 8, and your development box 4 or 6. A similar ratio is maintained on ETL servers, application servers, and hard disk performance.

The logic being that the production environment is critical for the end-user experience.
Agreed! Response time is critical for end users; however, that does not mean you should buy slower machines for development. Imagine the time wasted while a resource sits in front of the machine, waiting for a query to return data. A developer usually spends all his time working in the development database. So imagine a development box that is 50% slower than production. That means a developer is losing 50% of his time waiting for his results. It's not just his time but also the focus and concentration he loses because his experience with the system is bad. It would not be an exaggeration to say that he loses 4 hours out of an 8-hour day.

In the US, the total cost of an ETL developer is about $100 an hour, so the company is losing $400 per developer per day. You do the math for 5 to 10 developers working for a year.
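To spell out that math: assuming roughly 250 working days a year, 4 lost hours × $100 × 250 days comes to about $100,000 per developer per year, or somewhere between $500,000 and $1,000,000 for a team of 5 to 10 developers.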

I am not saying that production and development systems must be the same, but I firmly believe the difference should be in the reliability and availability of the servers rather than in their performance.

I thought of this article while waiting for a query to return from a development box. It took exactly twice as long as it does on the production server, and I remembered that management is already discussing upgrading the development server to improve its performance.

Aggregate aware queries

Monday, June 19th, 2006

Imagine a transaction table holding 20 million transactions spanning 10 years, and a user writing a query to get data aggregated at the monthly level. The best guess for the time required to return the result set, even after extensive database tuning, can be in multiples of 10 minutes or even an hour.

What is the solution?
There are three solutions based on the same concept of pre-aggregation, but the way each is used is quite different:
1. Write different queries in advance to hit different (pre-aggregated) tables.
2. Have a reporting tool that is aggregate aware (client side). The reporting tool itself decides which tables to hit and rewrites its own queries.
3. Have a database that supports transparent query rewrite. The database itself takes a simple query and rewrites it against aggregate tables.

Here’s an Oracle version of query re-write based on point 3 in this article.
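As a rough illustration, here is a minimal sketch of what such a rewrite-enabled aggregate can look like in Oracle (the sales_fact table, its columns, and the materialized view name are hypothetical):

-- Monthly pre-aggregate over a large detail table, registered for query rewrite
CREATE MATERIALIZED VIEW mv_sales_monthly
  BUILD IMMEDIATE
  REFRESH COMPLETE ON DEMAND
  ENABLE QUERY REWRITE
AS
SELECT TRUNC(sale_date, 'MM') AS sale_month,
       product_id,
       SUM(sale_amount)       AS total_amount,
       COUNT(*)               AS txn_count
FROM   sales_fact
GROUP  BY TRUNC(sale_date, 'MM'), product_id;

-- Query rewrite must be enabled for the session (or instance).
ALTER SESSION SET query_rewrite_enabled = TRUE;

-- The user still writes the simple query against the detail table;
-- the optimizer transparently rewrites it to read mv_sales_monthly instead.
SELECT TRUNC(sale_date, 'MM') AS sale_month,
       SUM(sale_amount)       AS total_amount
FROM   sales_fact
GROUP  BY TRUNC(sale_date, 'MM');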

More details to come…

What ETL performance (processing speed) is good enough?

Sunday, May 14th, 2006

Every ETL developer / data warehouse manager wants to know: is the ETL processing speed (usually measured in number of rows processed per second on the target side) good enough? What is the industry standard? I could give a politically correct answer like… well, it depends…

But I would rather be blunt and say that usually 1,000 rows/sec for INSERTs and 300 rows/sec for UPDATEs should be good enough. If you don't have special requirements, then any average around these numbers should make you comfortable.
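If you want to check where your own jobs stand, the arithmetic is simply rows processed divided by elapsed seconds. A minimal sketch of that calculation, assuming a hypothetical etl_load_audit table with row counts and DATE start/end columns for each run:

-- Rows per second per job; (end_time - start_time) on DATE columns is in days
SELECT job_name,
       rows_inserted,
       rows_updated,
       ROUND(rows_inserted / NULLIF((end_time - start_time) * 86400, 0)) AS insert_rows_per_sec,
       ROUND(rows_updated  / NULLIF((end_time - start_time) * 86400, 0)) AS update_rows_per_sec
FROM   etl_load_audit
ORDER  BY end_time DESC;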

Slow Running ETL process (read side) & Table Statistics (Oracle).

Sunday, April 23rd, 2006

Sometimes an ETL process runs at a considerably slow speed. During testing with a small result set it might fly, but when a million rows are applied the performance takes a nosedive. There can be many reasons for a slow ETL process: it can be slow because of the read, the transformation, or the load. Let's eliminate the transformation and the load for the sake of this discussion.

For an ETL process to be slow on the read side, the usual reasons are: 1. No indexes on the join and/or WHERE-clause columns. 2. A badly written query. 3. Source tables not analyzed (no optimizer statistics). Out of these three, let's rule out 1 & 2 and look at 3.
In the past most databases were set to the RULE-based optimizer in the INIT.ORA file, but for new development, and especially for data warehouses, the 'CHOOSE' optimizer mode is preferred. With the 'CHOOSE' option a query uses the cost-based optimizer if statistics are available for the tables in the query.
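Before blaming the query itself, it is worth checking whether those statistics exist at all. A minimal sketch of such a check (the 'DWH' schema name and the 7-day staleness threshold are only examples):

-- List tables with missing or old optimizer statistics
SELECT table_name, num_rows, last_analyzed
FROM   all_tables
WHERE  owner = 'DWH'
AND    (last_analyzed IS NULL OR last_analyzed < SYSDATE - 7)
ORDER  BY last_analyzed NULLS FIRST;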

There are two methods to gather statistics: 1. the DBMS_STATS package, and 2. the ANALYZE command. Oracle no longer recommends the ANALYZE command, for various reasons (it is a command rather than a package, it cannot analyze external tables, it gathers statistics that are not essential, it produces inaccurate statistics on partitioned tables and indexes, in the future it will not collect the statistics needed by the cost-based optimizer, it offers no monitoring for stale statistics, etc.).

DBMS_STATS.GATHER_SCHEMA_STATS(ownname => 'DWH', options => 'GATHER AUTO');
This package gathers all necessary statistics automatically (there is no need to write a process to check for tables that have stale or missing stats). Oracle implicitly determines which objects need new statistics and how to gather them. So once it is put in the ETL flow you can sleep at home and everybody will be happy.

Bottom Line:
After a major data load the tables should be analyzed by an automated process. Also, please do not use ANALYZE, as in the future it will not collect the statistics needed by the CBO. The code can be found here.

This package should be used as part of the ETL workflow, so that as batch processes run, the benefit of fresh statistics is available to the next process in the queue.

ETL programmers can also call this package at the beginning of an ETL process/procedure, so that if the tables required by the process have not been analyzed they are analyzed automatically. And if the procedure is going to modify a lot of data in a table, the package can be called again just before the end of the procedure; a sketch of this pattern follows below.
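A minimal sketch of that pattern (the procedure, schema, and table names are purely illustrative, and the real read/transform/load logic is omitted):

-- Gather statistics before and after the heavy work of a load procedure
CREATE OR REPLACE PROCEDURE load_sales_fact AS
BEGIN
  -- Make sure the optimizer has statistics on the source/staging table
  -- before the heavy read and join work starts.
  DBMS_STATS.GATHER_TABLE_STATS(ownname => 'DWH',
                                tabname => 'SALES_STG',
                                cascade => TRUE);

  -- ... read, transform, and load steps go here ...

  -- The load changed a large share of SALES_FACT, so refresh its statistics
  -- before the next process in the batch queue reads it.
  DBMS_STATS.GATHER_TABLE_STATS(ownname => 'DWH',
                                tabname => 'SALES_FACT',
                                cascade => TRUE);
END load_sales_fact;
/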