ETL Testing Automation: An Effective and Efficient Approach

ETL Testing Automation

Turning raw data into useful information is vital for deriving the right business insights. But before that, data must be procured, stored, and tested correctly. Discrepancies introduced during the ETL process propagate into downstream analytics, making it hard to derive reliable insights.

What is ETL Testing?

ETL stands for extract–transform–load: the process of moving data from a Source system into a Target system, transforming it along the way. ETL testing ensures that this entire process is error-free.

ETL Testing of the modules is performed in a few steps:

  • Understanding business requirements and data: This can be done using the business requirement document, data profiling scripts, mapping documents, and analysis of the existing data.
  • Test planning
  • Test case and test data preparation
  • Test case execution
  • Test closure

I executed my first ETL Testing project for one of our clients, which is a fintech company and one of the leaders in SME lending.

The ETL was a standard historical data maintenance process. Our client had all of their data in Salesforce, but to execute their daily analytical functions, they needed the data in-house. They approached us because hitting the Salesforce API repeatedly was not cost-effective.

Our team built a daily job that replicated the Salesforce database into MySQL, the 1st tier. The complexity was that Salesforce would dynamically add and remove columns, something an RDBMS is generally not designed to handle. So the 1st tier was a replica of the Salesforce system, and the 2nd tier was the Historical Data Maintenance Database, which held all the deleted, updated, and newly added columns and records.

In this article, I would like to share the approach that I used for ETL testing.

The ETL testing cycle is depicted in the following diagram:

Some standard test scenarios must be considered while testing all the ETL modules.

  • Change in data types: There is a fair chance that a column's data type changes while loading into the Tier-2 system.
  • Number and names of the columns in the Tier-1 and Tier-2 systems.
  • Change in data type length.
  • Addition of columns in the Tier-2 system.
  • Validations on date fields.
  • Comparing the number of records in the Tier-1 and Tier-2 systems.
  • Checking for null values.
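The standard scenarios above can be automated as simple queries against the two tiers. The following is a minimal sketch using an in-memory SQLite database as a stand-in for the two MySQL tiers; the table and column names (`tier1_accounts`, `tier2_accounts`, `name`) are hypothetical.

```python
import sqlite3

# In-memory database standing in for the Tier-1 and Tier-2 systems.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE tier1_accounts (id INTEGER, name TEXT)")
cur.execute("CREATE TABLE tier2_accounts (id INTEGER, name TEXT)")
cur.executemany("INSERT INTO tier1_accounts VALUES (?, ?)",
                [(1, "Acme"), (2, "Globex"), (3, None)])
cur.executemany("INSERT INTO tier2_accounts VALUES (?, ?)",
                [(1, "Acme"), (2, "Globex"), (3, None)])

# Scenario: record counts in Tier-1 and Tier-2 should match.
t1_count = cur.execute("SELECT COUNT(*) FROM tier1_accounts").fetchone()[0]
t2_count = cur.execute("SELECT COUNT(*) FROM tier2_accounts").fetchone()[0]
assert t1_count == t2_count, f"count mismatch: {t1_count} vs {t2_count}"

# Scenario: number and names of columns should match between the tiers.
t1_cols = [row[1] for row in cur.execute("PRAGMA table_info(tier1_accounts)")]
t2_cols = [row[1] for row in cur.execute("PRAGMA table_info(tier2_accounts)")]
assert t1_cols == t2_cols, "column definitions differ"

# Scenario: flag NULL values in columns that must not be empty.
null_names = cur.execute(
    "SELECT COUNT(*) FROM tier2_accounts WHERE name IS NULL").fetchone()[0]
print(f"rows with NULL name in Tier-2: {null_names}")
```

The same pattern extends to data type and length checks by comparing the full `table_info` metadata rather than only the column names.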

The above scenarios apply to ETL testing in general. In our use case, however, we had to maintain historical data for only a few columns. We maintained the historical values using Slowly Changing Dimensions Type 2 (SCD2): the latest record for each ID was marked “Yes” in the “Latest Indicator” column, and the remaining records were marked “No”, along with their effective start and end dates.

Following are a few test scenarios that can be used for such use cases or whenever any use case involves the implementation of SCDs.

  • Finding distinct values in the Latest Indicator column: The query should return at most two values, ‘Yes’ and/or ‘No.’
  • The count of ‘Yes’ in the Latest Indicator column should match the number of unique records in the table.
  • Unique identifiers that occur only once should always have the Latest Indicator set to ‘Yes.’
  • The last record for every unique identifier should have the maximum last modified date (since SCD is applied on the last modified date), and that record’s Latest Indicator should be ‘Yes.’
  • Checking the position of columns each time: Newly added columns should be appended at the end.
  • If the implementation includes effective start date and end date columns, the effective start date should always be earlier than the effective end date.
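The SCD2 scenarios above translate directly into assertions over the history table. This sketch again uses SQLite for illustration; the table name `account_history` and its columns are hypothetical stand-ins for the real Tier-2 schema.

```python
import sqlite3

# Hypothetical Tier-2 history table with SCD Type 2 columns.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE account_history (
    account_id INTEGER, balance REAL, latest_indicator TEXT,
    eff_start_date TEXT, eff_end_date TEXT)""")
cur.executemany("INSERT INTO account_history VALUES (?, ?, ?, ?, ?)", [
    (1, 100.0, "No",  "2021-01-01", "2021-06-01"),
    (1, 150.0, "Yes", "2021-06-01", "9999-12-31"),
    (2, 200.0, "Yes", "2021-03-01", "9999-12-31"),
])

# Scenario: the Latest Indicator column holds at most 'Yes' and 'No'.
distinct = {r[0] for r in cur.execute(
    "SELECT DISTINCT latest_indicator FROM account_history")}
assert distinct <= {"Yes", "No"}

# Scenario: count of 'Yes' rows equals the number of unique identifiers.
yes_count = cur.execute("SELECT COUNT(*) FROM account_history "
                        "WHERE latest_indicator = 'Yes'").fetchone()[0]
unique_ids = cur.execute("SELECT COUNT(DISTINCT account_id) "
                         "FROM account_history").fetchone()[0]
assert yes_count == unique_ids

# Scenario: identifiers that occur exactly once must be marked 'Yes'.
singles_not_yes = cur.execute("""
    SELECT COUNT(*) FROM account_history
    WHERE account_id IN (
        SELECT account_id FROM account_history
        GROUP BY account_id HAVING COUNT(*) = 1)
    AND latest_indicator <> 'Yes'""").fetchone()[0]
assert singles_not_yes == 0

# Scenario: effective start date must precede the effective end date.
bad_ranges = cur.execute("SELECT COUNT(*) FROM account_history "
                         "WHERE eff_start_date >= eff_end_date").fetchone()[0]
assert bad_ranges == 0
```

Each query is deliberately self-contained, so a failed assertion points directly at the violated scenario.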

Our data migration process involved more than 850 tables, so it was essential to automate the testing. We created a stored procedure that used cursors so the same procedure could be executed against every table in one run. All the test scenarios were added to this stored procedure, and the results were captured in a result table. This helped us perform the testing efficiently and effectively.
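The original automation was a MySQL stored procedure driven by cursors; the following is an illustrative Python equivalent of the same idea: discover every history table, run the same checks against each, and record the outcome in a result table. All table and column names here are made up, and SQLite's `sqlite_master` stands in for MySQL's `information_schema`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Two hypothetical history tables with the same shape.
for name in ("orders_history", "accounts_history"):
    cur.execute(f"CREATE TABLE {name} (id INTEGER, latest_indicator TEXT)")
    cur.executemany(f"INSERT INTO {name} VALUES (?, ?)",
                    [(1, "Yes"), (1, "No"), (2, "Yes")])

# Result table capturing one row per (table, check) outcome.
cur.execute("""CREATE TABLE test_results (
    table_name TEXT, check_name TEXT, passed INTEGER)""")

# Discover every history table and apply the same checks to each,
# mirroring the cursor loop in the stored procedure.
tables = [r[0] for r in cur.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' "
    "AND name LIKE '%_history'")]
for table in tables:
    yes = cur.execute(f"SELECT COUNT(*) FROM {table} "
                      "WHERE latest_indicator = 'Yes'").fetchone()[0]
    uniq = cur.execute(
        f"SELECT COUNT(DISTINCT id) FROM {table}").fetchone()[0]
    cur.execute("INSERT INTO test_results VALUES (?, ?, ?)",
                (table, "yes_equals_unique_ids", int(yes == uniq)))

for row in cur.execute("SELECT * FROM test_results"):
    print(row)
```

Because the checks are data-driven over the table catalog, adding a new table to the migration requires no change to the test harness, which is what made a single procedure workable across 850+ tables.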

Our automated approach reduced the estimated manual testing effort by 90%.