Blocking Strategies for Data Matching in Talend Data Integration/Data Quality and Talend MDM

Introduction

The correct selection of a blocking strategy, and in particular the choice of blocking keys, is crucial to the success of any data matching project, whether you are using Talend DQ or Talend MDM. This article outlines the principles that should guide that selection.

 

Data Matching Overview

Data matching is based on the idea of sorting. It is very difficult to match items within one large 'pile' or 'block'; it is far more efficient to first sort the data into smaller blocks of similar attributes. This is akin to the sock-matching problem: it is much easier and quicker to pair socks if you first sort them into a number of piles, or blocks, of say, similar colors, and then look for identical socks within each block. The same principle applies to all data matching problems: data sets should first be sorted into blocks of similar or identical attributes. What matters is that the blocks are all of roughly similar size, or as close as can be, and that you end up with neither too many nor too few of them.
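To make this concrete, here is a minimal sketch in plain Java (purely illustrative, not Talend code; the Person record and its fields are invented for the example). Records are grouped into blocks on a single attribute, and pairwise comparison then happens only within each block:

    import java.util.*;

    public class BlockingSketch {
        // A toy record: just a name and a date of birth.
        record Person(String firstName, String lastName, String dob) {}

        public static void main(String[] args) {
            List<Person> people = List.of(
                new Person("John", "Smith", "1970-01-01"),
                new Person("Jon",  "Smith", "1970-01-01"),
                new Person("Jane", "Doe",   "1985-05-05"));

            // Block on an attribute that is unlikely to change: the surname.
            Map<String, List<Person>> blocks = new HashMap<>();
            for (Person p : people) {
                blocks.computeIfAbsent(p.lastName().toUpperCase(),
                                       k -> new ArrayList<>()).add(p);
            }

            // Pairwise comparison now happens only within each block:
            // the two Smiths are compared; neither is compared with Doe.
            for (List<Person> block : blocks.values()) {
                for (int i = 0; i < block.size(); i++) {
                    for (int j = i + 1; j < block.size(); j++) {
                        System.out.println("compare " + block.get(i)
                                           + " <-> " + block.get(j));
                    }
                }
            }
        }
    }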

 

So, how do you choose those blocks, and how many do you need? One principle of data matching states that you should always match on attributes that are unlikely to change. It is far better to match a person on their name, which rarely changes, than on their address, which is likely to change over time. All things, whether persons or objects, have attributes that are less likely to change than others. For an object such as a cup, the attributes that identify that particular cup could be size, shape, color, and so on. For a person, they are likely to be first name, surname, sex, date of birth, and social security number; these are much less likely to change than address or phone number. The first step in selecting blocking keys is therefore to select the attributes you will block on.

 

In the case of a person, there are likely to be around half a dozen attributes to choose from, and the same number is probably realistic for objects too. This is a good number: not too many and not too few. You can see how this automatically reduces the total number of comparisons that need to be made, by dividing the data into a realistic number of blocks that can each be handled separately.

 

Matching and Blocking in Talend Components

In Talend DI and Talend MDM, there are two different but related ways to set up blocking and matching. The sections below will deal with each of these separately.

 

Blocking and Matching in Talend DI/DQ

The following DI Components are used to do data matching:

  • tRecordMatching
  • tFirstnameMatch
  • tFuzzyJoin
  • tMatchGroup
  • tFuzzyMatch
  • tFuzzyUniqRow
  • tGenKey

Each of these components is fully described in the Talend Component Reference Guide. This article is concerned with the tRecordMatching component.

 

This component joins two tables by matching on several columns, using a wide variety of comparison algorithms. It compares columns from the main flow with reference columns from the lookup flow and, according to the matching strategy you define, produces three possible outputs: the matched data, the possible matches, and the rejected or 'non-matched' data. In other words: records that match, records that don't, and records you are uncertain about. The confidence levels of your matching allow you to define matching and non-matching thresholds. When deciding your matching strategy, the user-defined matching scores are critical, as they determine which of the three outcomes applies: above the matching threshold, records are matched; below the non-matching threshold, records are not matched; those in between require manual intervention. Ideally, you want to keep that last group to a minimum.
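As a rough sketch of that three-way split, in plain Java (not the component's actual internals; the threshold values 0.85 and 0.45 are invented for illustration, since the real ones are whatever you configure):

    // Route a computed match score (assumed normalized to 0..1) to one of
    // the three outputs, based on two user-defined thresholds.
    enum MatchOutcome { MATCH, POSSIBLE_MATCH, NON_MATCH }

    class ThresholdRouter {
        static final double MATCH_THRESHOLD = 0.85;     // illustrative value
        static final double NON_MATCH_THRESHOLD = 0.45; // illustrative value

        static MatchOutcome route(double score) {
            if (score >= MATCH_THRESHOLD) return MatchOutcome.MATCH;
            if (score < NON_MATCH_THRESHOLD) return MatchOutcome.NON_MATCH;
            return MatchOutcome.POSSIBLE_MATCH; // needs manual review
        }
    }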

 

As an overview, the tRecordMatching component is configured thus:

[Image: pic 5.png — tRecordMatching component configuration]

The 'Blocking Selection' columns are, in effect, the blocks. A matching function can be selected within each block, and a 'Weight' is assigned to each matching function.

 

Matching is done in the way described in the theory above. Records are matched within the various blocks, and each match is rated with a weight: a measure of importance, or of confidence, in that match. The choice of matching algorithm within each block varies and depends on the field being matched. As described above, you should try to choose fields whose values are unlikely to change. The simple example above blocks on Firstname and Lastname. For matching, we have selected ID Number, Date of Birth, and Sex. Slightly more weight is assigned to ID Number than to Date of Birth, as we have more confidence in that value, and a low weight is assigned to Sex, as this can easily be mixed up or missed (in this simple dataset). In this example, we therefore put people with the same names into the same blocks, and then match within those blocks. Note that this is just an example: the strategy you adopt depends heavily on the dataset being used, and the quality of the data is hugely important.

 

At the simplest level, the matching works by assigning a similarity probability to each attribute and multiplying it by that attribute's weight, building up a total score for each candidate match.
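Sketched in plain Java (the similarity and weight values mirror the example above and are invented; the real per-attribute similarities come from the algorithms you select per column):

    import java.util.Map;

    class WeightedScoreSketch {
        // Combine per-attribute similarities (0..1) with their weights into
        // one normalized score: sum(similarity * weight) / sum(weight).
        static double totalScore(Map<String, Double> similarities,
                                 Map<String, Double> weights) {
            double weighted = 0, weightSum = 0;
            for (var e : similarities.entrySet()) {
                double w = weights.getOrDefault(e.getKey(), 0.0);
                weighted += e.getValue() * w;
                weightSum += w;
            }
            return weightSum == 0 ? 0 : weighted / weightSum;
        }

        public static void main(String[] args) {
            // Mirrors the example above: ID Number weighted highest, Sex lowest.
            var sims = Map.of("idNumber", 1.0, "dob", 0.9, "sex", 0.0);
            var weights = Map.of("idNumber", 5.0, "dob", 4.0, "sex", 1.0);
            System.out.printf("score = %.2f%n", totalScore(sims, weights)); // 0.86
        }
    }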

 

As discussed, there are only three possible outcomes from any matching Job:

  1. Records that match
  2. Records that don’t match
  3. Records we are uncertain about (and usually need manual checking)

The weighted scores, as you have seen, are compared against the thresholds you set for matches, non-matches, and uncertain or possible matches. This is done in a Talend Job, shown below, in which persons are matched against a list of existing customers to see if they already exist. After matching, the three different results are sent down three different paths:

[Image: pic 4.png — a Talend Job routing matched, possible-match, and non-matched records down three paths]

 

From experience, for data containing personal information on people, the best attributes to match on are (listed in descending order of importance):

  • First name
  • Surname
  • Personal Identifier of some sort, for example NI Number or SSN
  • Date of Birth
  • Sex
  • Last part of the address
  • Phone number/email address

The matching strategy you use depends heavily on the data. A fundamental principle is that data used for matching should be of high quality: as standardized and complete as possible, with no missing fields if possible. Non-standardized data with missing fields will result in very low quality matches. The old adage of 'Garbage In = Garbage Out' applies.
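A minimal sketch of that kind of standardization, in plain Java (the exact cleanup rules depend on your data; these are illustrative):

    class StandardizeSketch {
        // Normalize a name before it is used for blocking or matching:
        // trim, strip punctuation, collapse whitespace, upper-case.
        static String standardizeName(String raw) {
            if (raw == null) return "";
            return raw.trim()
                      .replaceAll("\\p{Punct}", "")
                      .replaceAll("\\s+", " ")
                      .toUpperCase();
        }

        public static void main(String[] args) {
            System.out.println(standardizeName("  o'brien,  JOHN ")); // OBRIEN JOHN
        }
    }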

 

Again, the most important thing to remember is that the matching fields you choose should be attributes in the data that are unlikely to change, and therefore uniquely identify the object in question.

 

Blocking and Matching in Talend MDM

In Talend MDM, matching is done using a built-in entity resolution algorithm of the 'Swoosh' family, specifically the 'T-Swoosh' variant.

 

Blocking is achieved using blocking keys, which you define. Data is matched against reference data on the MDM server: it is first standardized, then blocked using the blocking keys, and finally matched using the T-Swoosh algorithm. Again, there are three possible outcomes; in MDM they are called Matches, Unmatched, and Suspects. The diagram below illustrates the MDM matching process.

 

[Image: pic 2.png — the MDM matching process]

Blocking keys in MDM are defined for each entity that needs probabilistic matching. To do this, you create a new element in the MDM Data Model to store the blocking key value. Note: it should be made non-visible in the MDM Web UI, since you don't want anyone changing it.

 

The next step is to create a trigger that automatically computes a new blocking key whenever an entity instance is created or edited. In the example below, a blocking key is generated for the person "John Smith" based on his name and phone number.

[Image: pic 3.png — a trigger generating a blocking key for the person "John Smith"]

The generated key uses the first letter of his first name, the first three letters of his surname, and the first part of his phone number. The blocking key thus generated is "JSMI0141".
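In plain Java, a sketch of the key such a trigger computes might look as follows (assuming non-empty fields, and assuming that "first part of the phone number" means its first four digits, which matches the example; the phone number itself is invented):

    class BlockingKeySketch {
        // First letter of the first name + first three letters of the
        // surname + (assumed) first four digits of the phone number.
        // Assumes non-empty inputs; a real trigger must guard for nulls.
        static String blockingKey(String firstName, String surname, String phone) {
            String digits = phone.replaceAll("\\D", ""); // keep digits only
            return (firstName.substring(0, 1)
                    + surname.substring(0, Math.min(3, surname.length()))
                    + digits.substring(0, Math.min(4, digits.length())))
                   .toUpperCase();
        }

        public static void main(String[] args) {
            System.out.println(blockingKey("John", "Smith", "0141 496 0000")); // JSMI0141
        }
    }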

 

Now, of course, there is a big trade-off here. Blocking keys must create groups small enough to reduce the number of comparisons that MDM needs to make, but the groups must be big enough not to drop 'true positives'. In a nutshell, it is all about accuracy versus performance.
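The performance side of that trade-off is easy to quantify: without blocking, n records require n(n-1)/2 comparisons, while with blocking you only pay that cost within each block. A quick worked sketch (the record and block counts are invented):

    class ComparisonCountSketch {
        // Number of unordered pairs among n records: n * (n - 1) / 2.
        static long pairs(long n) { return n * (n - 1) / 2; }

        public static void main(String[] args) {
            long n = 100_000;
            // Without blocking: every record against every other.
            System.out.println(pairs(n));            // 4999950000

            // With 1,000 evenly sized blocks of 100 records each,
            // the cost drops by roughly a factor of 1,000.
            long blocks = 1_000;
            System.out.println(blocks * pairs(n / blocks)); // 4950000
        }
    }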

 

The choice of blocking key depends heavily on the fields in the data. As already discussed, you should ideally use attributes that are unlikely to change. Name is a good choice, but phone number (as in the example above) is not always the best one: people change their phone numbers fairly often. A better choice would be to include date of birth or sex, if those fields exist. It is all about what is in the data, and selecting the best fields for matching.

 

Another consideration is to avoid fields that can be null. This will seriously affect your matching, as nullable fields produce keys that are incomplete. Also, do not choose fields that are primary keys in your database.
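If a key field can be null, the key silently shrinks and unrelated records fall into the same block. A small illustrative sketch of an explicit fallback (purely an assumption about how one might handle it; not a Talend API):

    class NullSafeKeySketch {
        // A null or blank field silently shrinks the key, lumping unrelated
        // records together; an explicit fallback at least makes that visible.
        static String part(String value, int len, String fallback) {
            if (value == null || value.isBlank()) return fallback;
            return value.substring(0, Math.min(len, value.length())).toUpperCase();
        }

        public static void main(String[] args) {
            // "###" marks a missing surname instead of silently dropping it.
            System.out.println(part(null, 3, "###") + part("John", 1, "#")); // ###J
        }
    }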

 

In conclusion, there are a number of considerations to weigh when selecting a blocking strategy for your data matching. This article has outlined the most important of them.
