Archive for the ‘File Management’ Category

ETL Strategy for Reversing wrong file/data load.

Friday, January 19th, 2007

We all know ‘garbage in, garbage out’. But what happens when the data provider (source system) realizes that they have provided incorrect data to the data warehouse? And now they want to correct the mistake by providing you new data! By then ETL has already loaded the data. The data is inserted, updated & deleted! Simple inserts is easy, but what about integrated data because of updates? How do you recover from such situation? Can you actually recover? What should be the ETL strategy?

Has your Data warehouse Architect/Manager/ETL Lead come up with an ETL Strategy?

NOTE: This post is applicable to all etl tools or databases like Informatica, DataStage, Syncsort DMExpress, Sunopsis or Oracle, Sybase, SQL Server Integration Services (SSIS)/DTS, Ab Initio, MS SQL Server, RDB, Teradata, etc.

ETL Data file processing questions, you must ask!

Wednesday, May 17th, 2006

Many times file processing is taken quite lightly these days. The reason is either ignorance or sheer carelessness. Another reason being most of them being from RDBMS background they are just not aware of the intricacies of file management.

So wakeup….

Pre process
Before processing the file have you checked if the contents in header/trailer records actually match the file content?

Did the checksum of the file match to file on remote server?

Does the file contain delta, repeat/all rows?

Does the file contain delta, repeat/all rows? Are there flags for update, delete insert?
Does the file contain flags for update, delete insert?
Does the file transfer protocol supports support restart capabilities?

Have you asked data file envelope ( ex source name, date)information in the contents of data file or in the file name?

Have you asked for a header or trailer record? If yes is it in the data file itself or another file?

How do you know that the file is ready to be pulled?

How do you know there are no binary characters in the ASCII file?

If the file has to be pulled; What is the remote server name, IP address, user id and password?

If the file in same format is received from multiple sources how do you distinguish files from various source systems? (Content or by name or by location)

If the same file is processed in multiple schedule how do you distinguish between say a daily file or monthly file or yearly file?

If the file is being pulled from remote server then, will a backup of the file will be maintained on the remote server?

Is the remote file server being identified by IP address rather thana name? In that case, how does the system handle failover & change of server?

Is timestamp appended to the file name?

What are the contact details of the main and backup person?

What file transport protocol should be used?

What is the delimiter of file?

What is the expected size of file?

What is the file type binary or ASCII or ABCEDIC or something else?

What is the format of file multi line/XML/etc?

What is the format of the file?

What is the frequency of the file? Continues / daily / weekly bimonthly etc?

What is the location of the file?

What is the name of file?

What is the name of file? Is timestamp appended to the file name?

What is the source of the file?

What kind of access it provided ftp/secured/proprietary?

Who is the main contact person and backup contact for the file?

Will the file be overwritten daily in the source (remote) server?

Will the file be pushed or pulled?

In process

How do you know that the file that is being process is not a old file?

How will the file moved during processing?

What is the archive policy?

How should the process deal with deletes?

Can you avoid complete processing of file by taking a delta when compare to the previous days file?

Is the file compressed after load?

How do you know that all the file movement actually happened and not failed in any steps?

Did you check for the exit code after every step?

Did you convert the file columns in correct format?

What will you do if some of the columns are not in correct format?

Will you skip the row, replace the value or stop the process?

After how many fails will you terminate the process?

What happens if the load fails in between half load?

How will you identify the half loaded records?

How will you deal with the remaining records corrected manually?

Will you rerun the whole file to fix the problem?

What will happen if during the rerun the file is not completely processed?

Post process

Did the rows table match to that of file?

Does the Checksum match?

Did column shift occurred in the file?

Have you stored the result of the load into a table?