One Star

[resolved] What is the result of mixing Levenshtein and exact in tMatchGroup?

I need to de duplicate addresses. I want to have exact match on a number of fields (like postal code and house number) but I would like to use Levenshtein on street name (to catch the simplest typos). How should i configure tMatchGroup for this?
I'm guessing at same confidence weight for all fields but what should match interval be? I do not want to group any receords with differences in the "exact" fields, but will match interval of 0.99 stop Levenshtein from working? Any way to specify distance? - I know I could experiment, but it would be nice to know for sure.
Thanks!
Niels
1 ACCEPTED SOLUTION

Accepted Solutions
Employee

Re: [resolved] What is the result of mixing Levenshtein and exact in tMatchGroup?

Hi,
when you have fields that you want with an exact match, I advise you to put them in the blocking key definition. This will partition your data and lower the number of comparisons (hence speed-up the job).
Now, you have several possibilities and you will need a few tries before you are satisfied with your matching settings.
First, do you want the postal code and house number to enter in the distance computation? I would say no if we put them in the blocking key definition, but even in the blocking key definition, you could want to use them for distance computation. But let start without comparing them (it will speed-up the computations).
Then you have the street name field to compare: I would try q-grams which may be more adapted on longer strings than Levenshtein.
The distance computed by this component is between 0 and 1. So, the "match interval" is in fact a "match threshold". If you have an address field only, then 0.85 may be a good value.
You can later decide that records with a distance between 0.85 and 0.95 are not similar enough and remove them (with a tFilterRow for example) or review them manually in the stewardship console.
The weights are chosen by you and put on the fields that identify most a record. A weight is a factor that sets more importance on some fields.
the record distance is the weighted sum of each field distance.
2 REPLIES
Employee

Re: [resolved] What is the result of mixing Levenshtein and exact in tMatchGroup?

Hi,
when you have fields that you want with an exact match, I advise you to put them in the blocking key definition. This will partition your data and lower the number of comparisons (hence speed-up the job).
Now, you have several possibilities and you will need a few tries before you are satisfied with your matching settings.
First, do you want the postal code and house number to enter in the distance computation? I would say no if we put them in the blocking key definition, but even in the blocking key definition, you could want to use them for distance computation. But let start without comparing them (it will speed-up the computations).
Then you have the street name field to compare: I would try q-grams which may be more adapted on longer strings than Levenshtein.
The distance computed by this component is between 0 and 1. So, the "match interval" is in fact a "match threshold". If you have an address field only, then 0.85 may be a good value.
You can later decide that records with a distance between 0.85 and 0.95 are not similar enough and remove them (with a tFilterRow for example) or review them manually in the stewardship console.
The weights are chosen by you and put on the fields that identify most a record. A weight is a factor that sets more importance on some fields.
the record distance is the weighted sum of each field distance.
One Star

Re: [resolved] What is the result of mixing Levenshtein and exact in tMatchGroup?

Hi
Brilliant! I only had postal code as blocking. And super with q-grams.
I really appreciate such a swift answer. Thanks!
Niels