First step of customer delight journey – Data Deduplication

First step of customer delight journey – Data Deduplication
Data Deduplication Using SSIS
Sunny Grover
Posted by on March 7, 2017 in Blog

We see our customers as invited guests to a party, and we are the hosts. It’s our job every day to make every important aspect of the customer experience a little bit better.

– Jeff Bezos

Duplicate Customer data commonly can be a cause for a host of issues. In certain circumstances, there’s even a chance of duplicated data leading to unsatisfied customers.

Let’s take a call center as an example. An outbound caller may phone out to a customer, offering them a particular product. The customer declines, and the caller logs this in the CRM system. However, the business has two separate records for the same customer, perhaps due to a misspelled name. Another sales person phones out to the same customer using the different record resulting in bad customer experience by having the same offer made twice.

It’s easy to see just how much of a problem this is, especially when it comes to customer data duplication. So how can companies effectively overcome the issue?

Data deduplication

Data deduplication refers to the elimination of redundant data. In the deduplication process, duplicate data is deleted or linked together, leaving only one copy of the data to be stored. Fuzzy matching is one of the ways to achieve data deduplication.

What is Fuzzy Matching?

Fuzzy matching, sometimes known as probabilistic record linkage, uses fuzzy logic in order to determine whether there is any similarity between elements of data. What makes fuzzy matching different to traditional database searching is that results for a query are returned based on likely relevance, rather than an exact match. Search words and spellings do not necessarily have to match database records exactly in order to yield results. Data matches may include alternate spellings of a search term or data that may have been laid out in a different format. Fuzzy matching is able to calculate the probability that different records and search terms actually relate to the same thing.

The idea behind data deduplication using Fuzzy Matching is to find records that are approximately the same, rather than an entirely accurate match. For example, a good customer deduplication algorithm should be able to identify that all below customer records belong to the same customer –

Data Deduplication

While this may sound like a complex effort, it is actually easy thanks to the Fuzzy matching available in multiple technologies. Having worked on many customer analytics implementations, I know that deciding the right approach for deduplication is a key to successful customer strategy. In this series of blogs, I will write about various strategies to achieve customer deduplication.

Fuzzy Lookup/Grouping functionality is built into Microsoft SQL Server Integration Services (SSIS).

Fuzzy Lookup allows SSIS to inspect a set of data and compare one or more fields in the dataset. Rather than comparing the field data exactly, Fuzzy Lookup will match strings based on their Levenshtein Distance- giving more accurate results while overcoming misspellings, typos, abbreviations, nicknames, etc.

Typically, customer data can be deduplicated based on their First Name, Last Name, Date of birth, Street address, Phone, SSN and City.

Configuring Fuzzy Lookup is a pretty straight-forward process, just select the reference table, the fields you want to compare and set the similarity threshold(roughly a percentage of matching). This process can involve a little bit of trial and error while you fine-tune the Fuzzy Lookup to identify the records that are potential duplicates without letting through any false positives.

Fuzzy Lookup evaluates the customer records and compares them based on selected fields. Potentially duplicate records then assigned with similarity scores.

After Fuzzy Lookup, a step to check the similarity scores and group count determines which records are potential dupes. The potential duplicates are then exported to an Excel file for review.

When the process is complete, you can view the results in Excel file. Here you can see the unique record number assigned to the record, Master record number, percentage of match to the potential duplicate record and percentage of match for each. The highlighted values in the screenshot show some of the values that were compared and demonstrate how Fuzzy Lookup can identify potential duplicates despite misspellings, nicknames, and partial matches.

Data Deduplication

In my next blog, I will demonstrate use of Big Data technologies to perform customer deduplication.

“Information is not knowledge” Albert Einstein had said once. Being a genius he could conclude the ultimate truth. But we can strive to make information into knowledge by ensuring quality of data!