Is it a match or a duplicate?
When trying to match similar records across data sets, you’ll run into two similar, but different, concepts: matching and deduplication. Put simply, deduplicating means finding (and, technically, eliminating) records that effectively represent the same thing more than once within a single file or database. Matching, on the other hand, means finding records that effectively share the same key across different data sets.
As an example, consider the following two data sets. One is a company’s customer database. The other contains purchased demographic data. (FYI, I have obtained permission from this company to show their data.)
A quick glance at the customer database indicates that record one, Steve Jones, and record nine, also Steve Jones, are exact duplicates. Similarly, you can match the two Steve Jones records with the Steve Jones record in the purchased demographic data.
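To make the distinction concrete, here is a minimal Python sketch of both operations. It assumes records are simple dictionaries and that "the same record" means an identical name and email after trimming whitespace and ignoring case. The rows, email addresses, ids, and function names are hypothetical stand-ins, not the company's actual data and not how any particular product works.

```python
from collections import defaultdict


def normalize_key(record):
    """Build a comparison key from name and email, ignoring case and outer whitespace."""
    return (
        record["first"].strip().lower(),
        record["last"].strip().lower(),
        record["email"].strip().lower(),
    )


def find_duplicates(records):
    """Deduplication: group record ids within ONE data set that share a key."""
    groups = defaultdict(list)
    for rec in records:
        groups[normalize_key(rec)].append(rec["id"])
    return [ids for ids in groups.values() if len(ids) > 1]


def find_matches(left, right):
    """Matching: pair record ids ACROSS two data sets that share a key."""
    index = defaultdict(list)
    for rec in right:
        index[normalize_key(rec)].append(rec["id"])
    return [
        (rec["id"], right_id)
        for rec in left
        for right_id in index.get(normalize_key(rec), [])
    ]


# Hypothetical rows; the email addresses and the "D7" id are made up for illustration.
customers = [
    {"id": 1, "first": "Steve", "last": "Jones", "email": "sjones@example.com"},
    {"id": 2, "first": "Maria", "last": "Lopez", "email": "mlopez@example.com"},
    {"id": 9, "first": "Steve", "last": "Jones", "email": "SJones@example.com "},
]
demographics = [
    {"id": "D7", "first": "Steve", "last": "Jones", "email": "sjones@example.com"},
]

duplicate_groups = find_duplicates(customers)    # within the customer file
matches = find_matches(customers, demographics)  # across the two files
```

Both functions rely on the same key. The only difference is whether you compare a data set with itself or with another one.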
A closer look at the data, however, shows a couple of complications that your matching algorithms really need to consider to be effective. (Actually, the list of potential complications in data is endless. Let’s talk about it!)
What Constitutes a Match / Duplicate?
The first, rather obvious question is, exactly what constitutes a match? Look at our customer data again:
As we noted, record one and record nine are most likely duplicates. But what about record three? It has no email address. Does the Steve Jones on record three represent the same person as the Steve Jones on records one and nine? Your matching solution should be able to negotiate questions like this.
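One possible way to handle a missing field is to return a third answer, "needs review," rather than forcing a yes or no. The sketch below is only an illustration under stated assumptions: the `compare` function, its three labels, and the rule that a missing email triggers a review are all hypothetical policy choices, and the email addresses are invented.

```python
def compare(a, b):
    """Classify a pair of records as 'duplicate', 'review', or 'different'.

    Assumed policy (hypothetical): names must agree; if either email is
    missing we cannot decide automatically, so the pair goes to review.
    """
    name_a = (a["first"].strip().lower(), a["last"].strip().lower())
    name_b = (b["first"].strip().lower(), b["last"].strip().lower())
    if name_a != name_b:
        return "different"

    # Treat None and empty strings the same way: as a missing email.
    email_a = (a.get("email") or "").strip().lower()
    email_b = (b.get("email") or "").strip().lower()
    if not email_a or not email_b:
        return "review"

    return "duplicate" if email_a == email_b else "different"


# Hypothetical rows mirroring records one, three, and nine; emails are made up.
record_one = {"first": "Steve", "last": "Jones", "email": "sjones@example.com"}
record_three = {"first": "Steve", "last": "Jones", "email": None}
record_nine = {"first": "Steve", "last": "Jones", "email": "sjones@example.com"}

one_vs_nine = compare(record_one, record_nine)
one_vs_three = compare(record_one, record_three)
```

Whether a "review" pair goes to a person, to a secondary rule (say, comparing addresses), or is left alone is a business decision that the code cannot make for you.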
Alternatively, what about Stephen Jones on record four? He’s got the same last name and the same email address as the Steve Jones on records one and nine. Is he the same person? Probably. One way to handle situations like this is with a robust synonyms file. Make sure your matching solution can handle synonyms.
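As a rough sketch of the synonyms idea, the Python below maps nicknames to a canonical first name before comparing records. The `NAME_SYNONYMS` table, the function names, and the email addresses are hypothetical; a real synonyms file would be far larger and would typically be loaded from disk rather than hard-coded.

```python
# Hypothetical, tiny synonyms table: nickname -> canonical first name.
NAME_SYNONYMS = {
    "steve": "stephen",
    "bob": "robert",
    "liz": "elizabeth",
}


def canonical_first_name(name):
    """Lower-case the name and replace a known nickname with its canonical form."""
    cleaned = name.strip().lower()
    return NAME_SYNONYMS.get(cleaned, cleaned)


def same_person(a, b):
    """Assumed rule: canonical first name, last name, and email must all agree."""
    return (
        canonical_first_name(a["first"]) == canonical_first_name(b["first"])
        and a["last"].strip().lower() == b["last"].strip().lower()
        and a["email"].strip().lower() == b["email"].strip().lower()
    )


# Hypothetical rows mirroring records one and four; the email is made up.
steve = {"first": "Steve", "last": "Jones", "email": "sjones@example.com"}
stephen = {"first": "Stephen", "last": "Jones", "email": "sjones@example.com"}

steve_is_stephen = same_person(steve, stephen)
```

Without the synonyms lookup, "Steve" and "Stephen" would compare as different first names and the pair would be missed.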
Do You Have a Matching / Deduplication Problem?
We’ll talk about more possible matching issues in future posts. For now, if you have a matching or deduplication need, check out our matching product, Golden Record. Then, let’s talk! You can reach me at Benjamin.Taub@Dataspace.com.
Thanks for reading!