Archive for the ‘Database Tuning’ Category

Aggregate aware queries

Monday, June 19th, 2006

Imagine a transaction table holding 20 million rows spanning 10 years, and a user writing a query to aggregate that data at the monthly level. Even after extensive database tuning, the time required to return the result set can easily run into multiples of 10 minutes, or even an hour.

What is the solution?
There are three solutions, all based on the same concept of pre-aggregation, but they are used quite differently.
1. Write different queries in advance to hit different pre-aggregated tables.
2. Use a reporting tool that is aggregate aware (client side). The reporting tool itself decides which tables to hit and rewrites its own queries.
3. Use a database that supports transparent query rewrite. The database itself takes a simple query and rewrites it against the aggregate tables.

Here’s an Oracle version of query rewrite, based on point 3, in this article.
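For illustration, here is a minimal sketch of how transparent query rewrite can be set up in Oracle with a materialized view; the SALES table, its columns, and the SALES_MONTHLY_MV name are hypothetical.

-- Hypothetical aggregate: monthly totals over the SALES transaction table.
-- ENABLE QUERY REWRITE lets the optimizer answer matching queries from this
-- materialized view instead of scanning 20 million detail rows.
CREATE MATERIALIZED VIEW sales_monthly_mv
  BUILD IMMEDIATE
  REFRESH COMPLETE ON DEMAND
  ENABLE QUERY REWRITE
AS
SELECT TRUNC(sale_date, 'MM') AS sale_month,
       product_id,
       SUM(amount)            AS total_amount,
       COUNT(*)               AS txn_count
FROM   sales
GROUP  BY TRUNC(sale_date, 'MM'), product_id;

-- Query rewrite must be enabled for the session (or the instance).
ALTER SESSION SET query_rewrite_enabled = TRUE;

-- The user still writes the simple query against the detail table; the
-- optimizer transparently rewrites it against SALES_MONTHLY_MV.
SELECT TRUNC(sale_date, 'MM') AS sale_month,
       SUM(amount)            AS total_amount
FROM   sales
GROUP  BY TRUNC(sale_date, 'MM');

The same idea scales to several aggregate tables at different grains (daily, monthly, yearly), with the optimizer picking the smallest one that can answer the query.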

More details to come…

What ETL performance (processing speed) is good enough?

Sunday, May 14th, 2006

Every ETL developer / data warehouse manager wants to know whether their ETL processing speed (usually measured in number of rows processed per second on the target side) is good enough. What is the industry standard? I could give a politically correct answer like… well, it depends…

But I would rather be blunt and say that, as a rule of thumb, about 1,000 rows/sec for INSERTs and 300 rows/sec for UPDATEs should be good enough. If you don’t have special requirements, then an average around these numbers should make you comfortable.
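As a rough sanity check of what those rates mean in wall-clock time, here is a back-of-the-envelope calculation; the 20 million insert and 2 million update row counts are made-up examples.

-- Hours needed at the rule-of-thumb rates above (example row counts only).
SELECT ROUND(20000000 / 1000 / 3600, 1) AS insert_hours,  -- ~5.6 hours
       ROUND(2000000  /  300 / 3600, 1) AS update_hours   -- ~1.9 hours
FROM   dual;

If your batch window cannot absorb numbers like these, that is where the special requirements begin and the rule of thumb stops applying.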

Slow Running ETL process (read side) & Table Statistics (Oracle).

Sunday, April 23rd, 2006

Sometimes an ETL process runs at a considerably slow speed. During testing with a small result set it might fly, but when a million rows are applied the performance takes a nosedive. There can be many reasons for a slow ETL process: it can be slow on the read, the transformation, or the load. Let’s set aside transformation and load for the sake of this discussion.

Here are some common reasons for an ETL process being slow on the read side: 1. No indexes on the join and/or WHERE clause columns. 2. A badly written query. 3. Source tables not analyzed. Out of these three, let’s rule out 1 and 2 and focus on 3.
In the past most databases ran with the RULE-based optimizer set in the INIT.ORA file, but for new development, and especially for data warehouses, the CHOOSE setting is preferred. With CHOOSE, a query uses the cost-based optimizer if statistics are available for the tables in the query.
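To see which optimizer mode is currently in effect, something along these lines can be used; note that SHOW PARAMETER is a SQL*Plus command, and that CHOOSE applies to older releases (in Oracle 10g and later it is deprecated in favour of ALL_ROWS).

-- Check the optimizer mode for the instance/session.
SHOW PARAMETER optimizer_mode

SELECT name, value
FROM   v$parameter
WHERE  name = 'optimizer_mode';

-- On older releases (8i/9i) the cost-based optimizer kicks in under CHOOSE
-- only when statistics exist; the mode can also be set per session:
ALTER SESSION SET optimizer_mode = CHOOSE;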

There are two methods to gather statistics: 1. the DBMS_STATS package, and 2. the ANALYZE command. Oracle does not recommend the ANALYZE command going forward, for various reasons (it is a command rather than a package, it cannot analyze external tables, it gathers statistics that are not essential, it produces inaccurate statistics on partitioned tables and indexes, in the future it will not collect the statistics needed by the cost-based optimizer, it offers no monitoring for stale statistics, etc.).
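For comparison, here is roughly what the two approaches look like for a single table; the DWH schema and SALES table names are placeholders.

-- Old approach: the ANALYZE command (not recommended going forward).
ANALYZE TABLE dwh.sales COMPUTE STATISTICS;

-- Preferred approach: the DBMS_STATS package.
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname          => 'DWH',
    tabname          => 'SALES',
    estimate_percent => DBMS_STATS.AUTO_SAMPLE_SIZE,
    cascade          => TRUE);  -- also gather statistics on the table's indexes
END;
/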

DBMS_STATS.GATHER_SCHEMA_STATS(ownname => 'DWH', options => 'GATHER AUTO');
This call gathers all necessary statistics automatically (there is no need to write a process to check for tables with stale or missing stats). Oracle implicitly determines which objects need new statistics and how to gather them. So once it is put in the ETL flow you can sleep at home and everybody will be happy.

Bottom Line:
After a major data load, the tables should be analyzed by an automated process. Also, please do not use ANALYZE, as in the future it will not collect the statistics needed by the CBO (cost-based optimizer). The code can be found here.

This package should be used as part of the ETL workflow, to make sure that as batch processes run, the benefit of fresh statistics is available to the next process in the queue.

ETL programmers can also call this package at the beginning of an ETL process/procedure, so that if the tables required by the process have not been analyzed, they get analyzed automatically. And if the procedure is going to modify a lot of data in a table, the package can be called again just before the end of the procedure; a rough sketch of this pattern follows.
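As an illustration of that pattern, here is a minimal sketch of an ETL procedure that gathers statistics on its way in and again after it has changed a lot of data; the LOAD_DWH_SALES procedure, the STG_SALES/DWH_SALES tables, and the DWH schema are all hypothetical.

CREATE OR REPLACE PROCEDURE load_dwh_sales AS
BEGIN
  -- Gather stats up front in case the staging table was never analyzed, so
  -- this procedure's own read queries get cost-based plans.
  DBMS_STATS.GATHER_TABLE_STATS(ownname => 'DWH', tabname => 'STG_SALES');

  -- The actual load (placeholder for the real transformation logic).
  INSERT INTO dwh_sales (sale_date, product_id, amount)
  SELECT sale_date, product_id, amount
  FROM   stg_sales;
  COMMIT;

  -- The load changed a lot of rows, so refresh the statistics before the
  -- next process in the batch queue reads this table.
  DBMS_STATS.GATHER_TABLE_STATS(ownname => 'DWH', tabname => 'DWH_SALES');
END load_dwh_sales;
/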