ETL Testing Automation: An Effective & Efficient Approach
Turning data into useful information is vital process to get right business insights. But even before that data must be procured, stored and tested in right manner. Any discrepancy in data stems from errors in ETL process, in turn making it hard to derive insights.
What is ETL Testing?
ETL stands for extract – transform – load and this ETL process is used to demonstrate how the data will be moved or loaded to Target system from Source system. ETL testing ensures whole process is error free.
ETL Testing of the modules is performed in few steps:
- Understanding business requirements and data: This can be done using the Business requirement document, Data profiling scripts, mapping documents and analysing the existing data.
- Test planning
- Test case and testing data preparation
- Test case execution
- Testing closure
For one of our client which is a fintech company and one of the leader in SME lending, I executed my first ETL Testing project
The ETL was a standard history data maintenance process. Our client had the entire data in Salesforce. To execute their daily analytical functions, they needed in-house data. Because hitting Salesforce API multiple times was not very cost-friendly. So, they approached us.
Our team built a daily execution job which would replicate the Salesforce database into MySQL – 1st tier. The complexity was that, Salesforce would dynamically add/remove columns. However, a RDBMS is generally not meant for it. So, we had the 1st tier which was the exact replica of Salesforce system and then the 2nd tier was the Historical Data Maintenance Database which used to hold all the deleted/updated/newly added columns as well as records.
In this article, I would like to share my approach that I used for ETL testing.
The ETL testing cycle is depicted in the following diagram:
Following are some standard test scenarios considered while testing all the ETL modules.
- Change in data types: There are fair chances of changing the data types of a column while loading into the Tier-2 system.
- Number and name of the columns in Tier-1 as well as Tier-2 systems.
- Change in data type length
- Addition of the column in Tier-2 system
- Validations on date fields
- Comparing # of records in Tier-1 and Tier-2 system
- Checking null values
Above test scenarios are commonly used while ETL testing. But in our use case, we had to maintain the historical data for only a few columns. We maintained the historical values using Slowly Changing Dimensions-Type 2. We marked latest records for each ID as “Yes” under “Latest Indicator” column and remaining records were marked as “No” along with their effective start date and end date. Following are few test scenarios which can be used for such use case or whenever any use case involves implementation of SCDs.
- Finding distinct values in Latest Indicator column: After executing the query, the result should return at most two values as ‘Yes’ and/or ‘No’.
- Count of ‘Yes’ in latest indicator column should match with the # of unique records in that table.
- Check for the unique identifier whose count is 1. For such records, latest indicator field should always be ‘Yes’.
- Last record for every unique identifier should have maximum last modified date (As SCD is applied on Last modified date) and for that unique identifier, the Latest indicator should be marked as ‘Yes’.
- Checking the position of columns each time – Newly added column should be added at the end.
- If your implementation includes columns like effective start date and end date, then effective start date should always be less than effective end date.
Our data migration process involved more than 850 tables. So, it was essential for us to automate this entire process. As a part of automation, we created stored procedure using cursors so that the same stored procedure will be executed on all tables at a time. All the test scenarios were added to this stored procedure and the end results were captured inside the result table. This helped us perform the testing efficiently and effectively.
With our automated approach, the estimated manual testing efforts were improved by 90%.