ETL is the automated and auditable process of acquiring data from source systems. It involves one or more of the following sub-processes: data extraction, data transportation, data transformation, data consolidation, data integration, data loading, and data cleaning.
Discussion:
A Source System can be any application or data store that creates or stores data and acts as a data source for other systems. If the data is actually created in such a source system, it can also be called a Data Factory.
Automation is critical; without it, the very purpose of ETL is defeated. ETL is of little value if its processes have to be manually scheduled, executed, or monitored.
Data quality is a critical factor for the success of data warehousing projects. If data is of inadequate quality, then the knowledge workers who query the data warehouse and the decision makers who receive the information cannot trust the results. In order to obtain clean and reliable data, it is imperative to focus on data quality. While many data warehouse projects do take data quality into consideration, it is often treated as an afterthought. Even QA after ETL is not good enough; the quality process needs to be incorporated into the ETL process itself. Data quality has to be maintained for individual records, even for small bits of information, to ensure the accuracy of the complete database. One rotten tomato can ruin the whole basket. It has to be ensured that the data is consistently updated to represent the source system. The accuracy of each individual data element is important, and so is the integrity of the data across the various pieces of information.
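As a rough sketch of what building quality checks into the ETL flow itself might look like, consider the Python fragment below; the field names and rules are purely illustrative assumptions, not taken from any particular system:

    # Minimal sketch: row-level quality rules applied inside the ETL flow,
    # so bad records are caught before loading rather than in a later QA pass.
    def validate_record(record):
        """Return a list of quality violations for a single record."""
        errors = []
        if not record.get("customer_id"):
            errors.append("missing customer_id")
        if record.get("order_total", 0) < 0:
            errors.append("negative order_total")
        if len(record.get("zip_code", "")) != 5:
            errors.append("zip_code is not 5 digits")
        return errors

    def split_clean_and_rejects(records):
        clean, rejects = [], []
        for rec in records:
            errors = validate_record(rec)
            if errors:
                rejects.append((rec, errors))   # route to a reject/review area
            else:
                clean.append(rec)               # only clean rows continue downstream
        return clean, rejects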
Extraction is the first major step in the physical implementation of ETL. Extraction initiates, or triggers, the downstream processes.
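A minimal sketch of an extraction step is shown below; the orders table, the last_updated column, and the use of Python's standard sqlite3 module as a stand-in for a real source system are all assumptions made for illustration:

    import sqlite3

    # Illustrative extraction step: pull only the rows changed since the last run
    # and hand them to the downstream processes.
    def extract_changed_rows(source_db, last_run_timestamp):
        conn = sqlite3.connect(source_db)
        try:
            cursor = conn.execute(
                "SELECT * FROM orders WHERE last_updated > ?",
                (last_run_timestamp,),
            )
            return cursor.fetchall()
        finally:
            conn.close()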
Needless to say, once data is extracted it has to be hauled and transported to the target, because the physical location of the source system may be different from that of the target warehouse. Many aspects such as security, encryption, bandwidth, and transportation time also have to be considered; these will be discussed in the upcoming chapters.
Transformation is the series of tasks that prepares the data for loading into the warehouse. Once the data is secured, you have to worry about its format and structure, because it will not be in the format needed for the target; for example, the grain level or the data types might be different. The data cannot be used as it is; rules and functions need to be applied to transform it. There are many types of transformation that can occur on the incoming data, which will be discussed later on.
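A minimal sketch of such transformations follows, assuming hypothetical field names, an amount stored as text that needs a data-type conversion, and a change of grain from daily rows to monthly totals:

    from collections import defaultdict
    from datetime import datetime

    # Illustrative transformations: reshape source rows into the format the
    # target warehouse expects. Field names and formats are assumptions.
    def transform(rows):
        monthly_totals = defaultdict(float)            # change of grain: day -> month
        for row in rows:
            sale_date = datetime.strptime(row["sale_date"], "%m/%d/%Y")
            amount = float(row["amount"])              # source stores amounts as text
            month_key = sale_date.strftime("%Y-%m")
            monthly_totals[month_key] += amount
        return [{"month": m, "total_sales": t}
                for m, t in sorted(monthly_totals.items())]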
One of the purposes of ETL is to consolidate the data in a central repository, or to bring it together in one logical or physical place. Data can be consolidated from similar systems, from different subject areas, and so on.
ETL must support data integration for data coming from multiple sources and data arriving at different times. This has to be a seamless operation, so that existing data is not overwritten, duplicate data is not created, and, worst of all, data does not simply fail to load into the target.
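A rough sketch of keyed integration that avoids duplicates and accidental overwrites might look like the following; the customer_id key and updated_at timestamp are assumed field names:

    # Illustrative integration step: merge batches arriving from multiple sources
    # at different times, keyed on a natural key so nothing is duplicated and
    # older data never overwrites newer data.
    def integrate(existing, incoming_batches):
        for batch in incoming_batches:
            for rec in batch:
                key = rec["customer_id"]
                current = existing.get(key)
                # keep whichever version is newer; never blindly overwrite
                if current is None or rec["updated_at"] > current["updated_at"]:
                    existing[key] = rec
        return existing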
The loading part of the process is critical to integration and consolidation. The loading process decides the modality of how data is added to the warehouse or simply rejected. Methods such as inserting, updating, or deleting are executed at this step. What happens to the existing data? Should the old data be deleted because of new information? Or should it be archived? Should the new data simply be treated as additions to the existing data?
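One common answer to these questions is an insert-or-update load with soft deletes for retired rows. The sketch below illustrates the idea; the table, the columns, and the use of SQLite upsert syntax are stand-ins for illustration, not a prescription for any particular warehouse:

    import sqlite3

    # Illustrative load step: insert new keys, update existing ones, and mark
    # deletions instead of physically removing history. Assumes customer_id is
    # the primary key of the hypothetical dim_customer table.
    def load(warehouse_db, records, deleted_keys):
        conn = sqlite3.connect(warehouse_db)
        try:
            for rec in records:
                conn.execute(
                    """INSERT INTO dim_customer (customer_id, name, city)
                       VALUES (?, ?, ?)
                       ON CONFLICT(customer_id) DO UPDATE
                       SET name = excluded.name, city = excluded.city""",
                    (rec["customer_id"], rec["name"], rec["city"]),
                )
            for key in deleted_keys:
                # archive by flagging inactive rather than deleting history
                conn.execute(
                    "UPDATE dim_customer SET is_active = 0 WHERE customer_id = ?",
                    (key,),
                )
            conn.commit()
        finally:
            conn.close()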
The incoming data might not be perfect: New York City might show up in India, or the Taj Mahal in France. The data has to be cleaned before it goes into the database.
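A cleaning rule of that kind can be as simple as a lookup against reference data, as in the sketch below; the mapping shown is a tiny hypothetical stand-in for real reference data:

    # Illustrative cleaning rule: correct records whose location and country
    # disagree with the reference data.
    LOCATION_TO_COUNTRY = {"New York City": "USA", "Taj Mahal": "India"}

    def clean_location(record):
        expected = LOCATION_TO_COUNTRY.get(record.get("location"))
        if expected and record.get("country") != expected:
            record["country"] = expected      # corrected from reference data
            record["was_cleaned"] = True      # flag the fix for later auditing
        return record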
Data can be loaded with great care, but does that mean the data loaded into the warehouse is correct? What is the confidence level in the data? Only a data auditing process can establish that confidence level, and this auditing normally happens after the data has been loaded.
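A simple audit reconciles control totals captured at extraction time against what actually landed in the warehouse; in the sketch below, the counts, totals, and field names are assumptions:

    # Illustrative post-load audit: compare control totals computed at extraction
    # time against what was actually loaded, and report a confidence level.
    def audit(source_row_count, source_amount_total, loaded_rows):
        loaded_count = len(loaded_rows)
        loaded_total = sum(r["amount"] for r in loaded_rows)
        checks = {
            "row_count_matches": loaded_count == source_row_count,
            "amount_total_matches": abs(loaded_total - source_amount_total) < 0.01,
        }
        confidence = "high" if all(checks.values()) else "low"
        return checks, confidence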
Data Derivation is the process of adding data values derived from reference or transactional data using simple string or mathematical functions. For example: Income – Expense = Profit, or NY = New York. Ideally, no business rules should be involved, and complex or time-variant calculations should be avoided.
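The two examples translate directly into simple functions, as in the sketch below; the field names and the abbreviation list are assumed for illustration:

    # Illustrative derivations: simple arithmetic and string functions only,
    # with no business rules or time-variant logic.
    STATE_ABBREVIATIONS = {"NY": "New York", "CT": "Connecticut"}

    def derive_profit(record):
        return record["income"] - record["expense"]    # Income - Expense = Profit

    def derive_state_name(abbreviation):
        return STATE_ABBREVIATIONS.get(abbreviation, abbreviation)   # NY -> New York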
Data enrichment is the adding or combining of external data values and rules to enrich the information already existing in the data. For example, if we can get a list that provides the relationship between Zip Code, City, and State, then an address field containing Zip Code 06905 can be safely enriched by doing a lookup on that table: Zip Code 06905 –> City Stamford –> State CT.
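A minimal sketch of that lookup is shown below; the zip code table is a tiny hypothetical stand-in for a purchased or public reference list:

    # Illustrative enrichment: fill in city and state from an external
    # zip code reference list.
    ZIP_LOOKUP = {"06905": ("Stamford", "CT")}

    def enrich_address(record):
        city_state = ZIP_LOOKUP.get(record.get("zip_code"))
        if city_state:
            record.setdefault("city", city_state[0])      # 06905 -> Stamford
            record.setdefault("state", city_state[1])     # 06905 -> CT
        return record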