Archive for the ‘ETL Testing’ Category

Testing in ETL, Data-Centric Projects

Thursday, May 19th, 2011

Testing is an investigation process conducted to check the quality of a product. The product can be either an application or data. The quality of data can only be determined by checking it against existing standards through a defined set of processes. In doing so, you uncover symptoms, in the form of invalid or incorrect data, that were caused by erroneous processes. Data-centric testing therefore results in high quality data by getting those erroneous processes fixed.

This article highlights the types of data-centric testing approaches and explains what each approach uncovers.

Types of Testing

  • Application Testing: The focus of this article is data-centric testing, so application testing is not discussed here.
  • Data-Centric Testing: Data-centric testing revolves around testing the quality of the data. Its objective is to ensure that valid and correct data is in the system. The following situations typically create the need for data-centric testing:
    • ETL Processes/Data Movement: When ETL processes extract data from a source database and transform and load it into a target database.
    • System Migration/Upgrade: When you migrate your database from one platform to another, or upgrade an existing system on which the database is currently running.

Data-Centric Testing
Data-centric testing validates data using the following approaches; a SQL sketch illustrating each type of check appears after the list:

  • Technical Testing: Technical testing ensures that data is moved, copied, or loaded from the source system to the target system correctly and completely. It is performed by comparing the target data against the source data. The following checks can be performed under technical testing:
    • Checksum Comparison: Data-centric testing uses checksums to discover errors. A checksum can be computed on the source and target databases in a number of ways, such as counting the rows in a table or summing the values of a column. The checksum calculated on the source database is then compared against the checksum calculated on the target database.
      For example, a row count compares the number of rows in the target database with the number of corresponding rows in the source database. Or the target database may contain annual summaries of the monthly salaries held in the source database, in which case the target should contain, for each year, the sum of the monthly salaries paid within that year.
    • Domain Comparison: The domain list in the target database is compared against the corresponding domain list in the source database. For example, the source system has 100 employees and the target system also has 100 employees, but that does not guarantee the employees in the two lists are the same until they are compared. In domain comparison, the employee names in the target list are compared against the employee names in the source list.
    • Multi-Value Comparison: Similar to domain comparison, multi-value comparison compares whole records, or the critical columns of a record, against the corresponding values in the source system.
      For example, domain comparison may report that all the employees match and checksum comparison may report that the salary totals match, yet a record can still be invalid because the data movement assigned the wrong employee name to a salary entry. Such issues are not discovered by checksum or domain comparison. Multi-value comparison finds them by comparing the key attributes of each entity in the source and target databases.
  • Business Testing: Business testing ensures that the data fulfills the requirements of the business. Even when data has been moved, copied, or loaded completely and accurately and technical testing reports no issues, the system can still contain invalid data. To ensure high quality data, the data is evaluated against the business rules.

    For example, the value of certain fields, such as salary and commission, cannot be less than zero. These types of errors can occur because of data manipulation between the source and target systems. Unless you test the data for such errors, its quality is not guaranteed.
    Every business has a different list of requirements. An exhaustive list of business rules against which the data is compared ensures high quality data.

  • Reconciliation: Reconciliation ensures that the data in the target system is in agreement with the overall system requirements. Following are a couple of examples of how reconciliation helps achieve high quality data:
    • Internal reconciliation: The data is compared within the system against a corresponding data set. For example, shipped quantities should always be less than or equal to ordered quantities; if shipments ever exceed orders, the data is invalid.
    • External reconciliation: Data in the system is compared against its counterpart in other systems. For example, the number of employees in a module or an application can never exceed the number of employees in the HR Employee database, because the HR Employee database is the master that keeps a record of all employees. If the employee count anywhere in the system exceeds that of the HR Employee database, the data is invalid.
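To make these checks concrete, here is a minimal SQL sketch of each approach. The table and column names (SRC_EMPLOYEE, TGT_EMPLOYEE, TGT_ORDERS, TGT_SHIPMENTS, HR_EMPLOYEE, and so on) are purely illustrative assumptions, not taken from any particular system. The comparison queries should return zero rows when the data is valid, while the checksum and head-count queries return totals that are compared across the two systems.

  -- Checksum comparison: row counts and column totals on source vs. target should match.
  SELECT COUNT(*) AS row_cnt, SUM(SALARY) AS salary_total FROM SRC_EMPLOYEE;
  SELECT COUNT(*) AS row_cnt, SUM(SALARY) AS salary_total FROM TGT_EMPLOYEE;

  -- Domain comparison: employees present in the source but missing from the target.
  SELECT EMP_ID, EMP_NAME
  FROM   SRC_EMPLOYEE
  WHERE  EMP_ID NOT IN (SELECT EMP_ID FROM TGT_EMPLOYEE);

  -- Multi-value comparison: same key, but critical columns disagree.
  SELECT s.EMP_ID, s.EMP_NAME AS src_name, t.EMP_NAME AS tgt_name,
         s.SALARY AS src_salary, t.SALARY AS tgt_salary
  FROM   SRC_EMPLOYEE s
  JOIN   TGT_EMPLOYEE t ON t.EMP_ID = s.EMP_ID
  WHERE  s.EMP_NAME <> t.EMP_NAME OR s.SALARY <> t.SALARY;

  -- Business rule: salary and commission can never be negative.
  SELECT EMP_ID, SALARY, COMMISSION
  FROM   TGT_EMPLOYEE
  WHERE  SALARY < 0 OR COMMISSION < 0;

  -- Internal reconciliation: shipped quantity can never exceed ordered quantity.
  SELECT o.ORDER_ID, o.ORDERED_QTY, s.SHIPPED_QTY
  FROM   TGT_ORDERS o
  JOIN   TGT_SHIPMENTS s ON s.ORDER_ID = o.ORDER_ID
  WHERE  s.SHIPPED_QTY > o.ORDERED_QTY;

  -- External reconciliation: module head count can never exceed the HR master count.
  SELECT COUNT(*) AS module_head_count FROM TGT_EMPLOYEE;
  SELECT COUNT(*) AS hr_head_count     FROM HR_EMPLOYEE;

In practice each of these queries would be parameterized and run as part of the load, but even run by hand they cover the technical, business, and reconciliation checks described above.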

My company developed the ICE± product, which is designed to accomplish all the types of testing mentioned in this article. For more information, visit www.iCEdq.com

ETL Strategy to store data validation rules

Wednesday, September 6th, 2006

Every time data is moved, the results have to be tested against the expected results. For every ETL process, test conditions for validating the data are defined before or during the design and development phase. Conditions that are missed can be added later on.

Various test conditions are used to validate data when the ETL process is migrated from DEV to QA to PRD. These test conditions may exist only in the developer's or tester's mind, or be documented in Word or Excel. With time, the test conditions get lost, ignored, or scattered around too much to be really useful.

If the ETL process runs successfully in production without error, that is a good thing, but by itself it does not really mean much. You still need rules to validate the data processed by the ETL. At this point you need the data validation rules again!

A better ETL strategy is to store the ETL business rules in a RULES table, keyed by target table and source system. The rules can be stored as SQL text. This creates a repository of all the rules in a single location, which can be called by any ETL process or auditor at any phase of the project life cycle.

There is also no need to rewrite or rethink the rules. Any or all of them can be made optional, tolerances can be defined, and they can be called immediately after a process runs, or the data can be audited at leisure.

This data validation/auditing system basically contains:
  • a table that contains the rules,
  • a process to call them dynamically, and
  • a table to store the results of the rule executions.
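As a rough sketch of what such a repository can look like (the table and column names here are illustrative assumptions, and the exact data types will vary by database):

  -- Repository of validation rules, keyed by target table and source system.
  CREATE TABLE ETL_VALIDATION_RULE (
      RULE_ID        INTEGER       PRIMARY KEY,
      TARGET_TABLE   VARCHAR(128)  NOT NULL,
      SOURCE_SYSTEM  VARCHAR(128)  NOT NULL,
      RULE_DESC      VARCHAR(512),
      RULE_SQL       VARCHAR(4000) NOT NULL,   -- SQL text that returns a measure or offending rows
      TOLERANCE      NUMERIC DEFAULT 0,        -- allowed deviation before the rule is considered failed
      IS_ACTIVE      CHAR(1) DEFAULT 'Y'       -- rules can be switched on/off without code changes
  );

  -- One row per rule per execution, so audits can be reviewed later.
  CREATE TABLE ETL_VALIDATION_RESULT (
      RESULT_ID      INTEGER   PRIMARY KEY,
      RULE_ID        INTEGER   NOT NULL,
      RUN_DATE       TIMESTAMP NOT NULL,
      MEASURED_VALUE NUMERIC,
      PASSED_FLAG    CHAR(1)
  );

  -- Example rule: negative salaries are never allowed in the target table.
  INSERT INTO ETL_VALIDATION_RULE
      (RULE_ID, TARGET_TABLE, SOURCE_SYSTEM, RULE_DESC, RULE_SQL, TOLERANCE, IS_ACTIVE)
  VALUES
      (1, 'TGT_EMPLOYEE', 'HR', 'Salary must not be negative',
       'SELECT COUNT(*) FROM TGT_EMPLOYEE WHERE SALARY < 0', 0, 'Y');

A small driver, whether a stored procedure or the ETL tool itself, can then loop over the active rules, execute RULE_SQL dynamically, compare the measured value against TOLERANCE, and write a row into ETL_VALIDATION_RESULT, either immediately after the load or later as an audit.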

Benefits:

  • Rules can be added dynamically with no change to the code.
  • Rules are stored permanently.
  • Tolerance levels can be changed without changing the code.
  • Business rules can be added or validated by business experts without worrying about the ETL code.

NOTE: This post is applicable to all ETL tools (Informatica, DataStage, Syncsort DMExpress, Sunopsis, SQL Server Integration Services (SSIS)/DTS, Ab Initio, etc.) and databases (Oracle, Sybase, MS SQL Server, RDB, etc.).

Loading & testing fact/transactional/balance data that is valid between dates!

Tuesday, July 25th, 2006

This is going to be a very interesting topic for ETL & data modelers who design processes and tables to load fact or transactional data that keeps changing between dates, e.g. prices of shares, company ratings, etc.

[Image: variable_bond_interest.JPG]

The table above shows an entity in the source system that contains time-variant values, but they do not change daily. The values are valid over a period of time; then they change.

[Image: variable_bond_interest_fct1.JPG]

1. What table structure should be used in the data warehouse?

Maybe Ralph Kimball or Bill Inmon could come up with a better data model! :-) But for ETL developers or ETL leads the decision is already made, so let's look for a solution.

2. What should be the ETL design to load such a structure?

Design A

  • There is a one-to-one relationship between the source row and the target row.
  • There is a CURRENT_FLAG attribute: every time the ETL process gets a new value, it has to add a new row flagged as current, then go back to the previous row and retire it (see the sketch after this list). This is a very costly ETL step and it slows the ETL process down.
  • From the report writer's point of view, this model is a major challenge to use. What if the report wants a rate that is not current? Imagine the complex query.
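As a rough sketch, the maintenance of the current flag for one incoming value looks like this (the BOND_INTEREST_FCT table, its columns, and the bind variables are illustrative assumptions):

  -- Retire the previously current row for this bond...
  UPDATE BOND_INTEREST_FCT
  SET    CURRENT_FLAG = 'N'
  WHERE  BOND_ID = :bond_id
  AND    CURRENT_FLAG = 'Y';

  -- ...then insert the newly arrived value as the current row.
  INSERT INTO BOND_INTEREST_FCT (BOND_ID, INTEREST_RATE, EFFECTIVE_DATE, CURRENT_FLAG)
  VALUES (:bond_id, :new_rate, :effective_date, 'Y');

The extra update against an already-loaded row, repeated for every change, is exactly the costly step mentioned above.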

Design B

  • In this design a snapshot of the source table is taken every day.
  • The ETL is very easy, but can you imagine the size of the fact table when the source table has more than 1 million rows? (1 million rows x 365 days = 365 million rows per year.) And what if the values change every hour or minute?
  • But you have a very happy user who can write SQL reports very easily.

Design C

  • Can there be a compromise? How about using a from date (time) and a to date (time)? The report writer can simply provide a date (time), and a straight SQL query can return the value/row that was valid at that moment (see the sketch after this list).
  • However, the ETL is indeed as complex as in design A: while the current row runs from the current date to infinity, the previous row has to be retired so that it runs from its from date to today's date minus one.
  • This kind of ETL coding also creates lots of testing issues, as you want to make sure that for any given date and time only one instance of the row exists (for the primary key).
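A rough sketch of design C, again with illustrative table, column, and bind variable names (date arithmetic and the open-ended 9999-12-31 convention vary by database):

  -- Fact table in which each value is valid over a date range.
  CREATE TABLE BOND_INTEREST_FCT (
      BOND_ID         INTEGER NOT NULL,
      INTEREST_RATE   NUMERIC NOT NULL,
      VALID_FROM_DATE DATE    NOT NULL,
      VALID_TO_DATE   DATE    NOT NULL,   -- 9999-12-31 marks the open-ended current row
      PRIMARY KEY (BOND_ID, VALID_FROM_DATE)
  );

  -- ETL: retire the open-ended current row one day before the new value starts...
  UPDATE BOND_INTEREST_FCT
  SET    VALID_TO_DATE = :new_from_date - 1
  WHERE  BOND_ID = :bond_id
  AND    VALID_TO_DATE = DATE '9999-12-31';

  -- ...and insert the new value as the open-ended current row.
  INSERT INTO BOND_INTEREST_FCT (BOND_ID, INTEREST_RATE, VALID_FROM_DATE, VALID_TO_DATE)
  VALUES (:bond_id, :new_rate, :new_from_date, DATE '9999-12-31');

  -- Reporting: the value that was valid at any given moment is a straight lookup.
  SELECT INTEREST_RATE
  FROM   BOND_INTEREST_FCT
  WHERE  BOND_ID = :bond_id
  AND    :as_of_date BETWEEN VALID_FROM_DATE AND VALID_TO_DATE;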

Which design is better? I have used all of them, depending on the situation.

3. What should be the unit test plan?

There are various cases where the ETL can miss, and when planning your test cases you should aim to precisely target those. Here are some examples of test conditions; SQL versions of the key checks follow the list.

a. There should be only one value for a given date/date time.

b. During the initial load, when data is available for multiple days, the process should run sequentially and create the snapshots/ranges correctly.

c. At any given time there should be only one current row.

d. etc.
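The key test cases above translate directly into SQL checks; against the design C structure sketched earlier, they could look like this (each query should return zero rows when the data is correct):

  -- (a) A given date must map to at most one row per bond: ranges must not overlap.
  SELECT a.BOND_ID, a.VALID_FROM_DATE AS range_a_start, b.VALID_FROM_DATE AS range_b_start
  FROM   BOND_INTEREST_FCT a
  JOIN   BOND_INTEREST_FCT b
    ON   b.BOND_ID = a.BOND_ID
   AND   b.VALID_FROM_DATE > a.VALID_FROM_DATE
   AND   b.VALID_FROM_DATE <= a.VALID_TO_DATE;

  -- (c) At any given time there must be only one current (open-ended) row per bond.
  SELECT BOND_ID, COUNT(*) AS current_rows
  FROM   BOND_INTEREST_FCT
  WHERE  VALID_TO_DATE = DATE '9999-12-31'
  GROUP BY BOND_ID
  HAVING COUNT(*) > 1;

Test case (b) is best covered by running the initial, multi-day load against a small controlled sample and then running these same two queries on the result.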

NOTE: This post is applicable to all ETL tools (Informatica, DataStage, Syncsort DMExpress, Sunopsis, SQL Server Integration Services (SSIS)/DTS, Ab Initio, etc.) and databases (Oracle, Sybase, MS SQL Server, RDB, etc.).