I am trying to use the Match Analysis in Talend Open Studio for Data Quality to find duplicate customers. It is basic person data with name, organization, email, and street address. I've scrubbed the data as best I can, but the data quality is poor, and a number of fields have nulls.
There are so many parameters to vary in this process that it is a little overwhelming. Which fields should the blocking key use, and how long should the key be? Which fields should go into the matching rules, and which matching functions should I use? What individual thresholds and confidence weights should the various fields get, and what should the overall match threshold and confidence be set to?
I know I'm not the first person to use the tRecordMatching algorithm in Talend for customer data matching, especially since the data is based on basic, common fields like name and address. I hate to reinvent the wheel, and I know there are people with more statistical knowledge than I have about what the settings should be.
Is there a library of matching rules/settings for customer data for Open Studio for Data Quality that is available for viewing or download?
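To make the question concrete, here is roughly the pipeline I understand I'm tuning, as a plain-Python sketch. This is not Talend's implementation — the field names mirror my schema, the blocking key (first three letters of the surname), the weights, and the 0.8 threshold are all guesses I would have to tune, and I use the stdlib SequenceMatcher as a stand-in for whatever matching function the studio applies:

```python
from difflib import SequenceMatcher

# Toy customer records; None models the nulls in my data.
records = [
    {"id": 1, "name": "John Smith", "email": "jsmith@acme.com", "street": "12 Oak St"},
    {"id": 2, "name": "Jon Smith",  "email": None,              "street": "12 Oak Street"},
    {"id": 3, "name": "Mary Jones", "email": "mj@foo.org",      "street": "99 Elm Ave"},
]

def block_key(rec, length=3):
    # Blocking key: first N characters of the surname, uppercased.
    surname = (rec["name"] or "").split()[-1] if rec["name"] else ""
    return surname[:length].upper()

def field_sim(a, b):
    # Similarity in [0, 1]; a null on either side contributes no evidence.
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Per-field confidence weights -- placeholder values, not recommendations.
WEIGHTS = {"name": 0.5, "email": 0.3, "street": 0.2}
MATCH_THRESHOLD = 0.8  # overall score above which a pair counts as a match

def score(r1, r2):
    # Renormalize weights over fields that are non-null on both sides,
    # so a missing email does not sink an otherwise strong match.
    active = [f for f in WEIGHTS if r1[f] and r2[f]]
    if not active:
        return 0.0
    total_w = sum(WEIGHTS[f] for f in active)
    return sum(WEIGHTS[f] * field_sim(r1[f], r2[f]) for f in active) / total_w

# Compare only pairs that share a block key, then apply the threshold.
matches = []
for i, r1 in enumerate(records):
    for r2 in records[i + 1:]:
        if block_key(r1) == block_key(r2) and score(r1, r2) >= MATCH_THRESHOLD:
            matches.append((r1["id"], r2["id"]))

print(matches)  # records 1 and 2 pair up; Mary Jones falls in a different block
```

Even in this toy version you can see how the blocking key length, the per-field weights, the null handling, and the overall threshold all interact, which is exactly why I'm hoping someone has published sensible defaults for common customer fields.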
With the Talend Open Studio for Data Quality product, have a look at the tFuzzyMatch component, which compares a column from the main flow with a reference column from the lookup flow and outputs the main flow data along with the distance.
There is also the tRecordMatching component, which can join two tables by doing a fuzzy match on several columns using a wide variety of comparison algorithms. However, this component only appears in the Palette of Talend Studio if you have subscribed to one of the Talend Platform products; it is not included in the open-source edition.
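If you only need the open-source tooling, the core idea behind tFuzzyMatch (compare a main-flow column against a lookup column and report an edit distance) is easy to reproduce. Below is a stdlib-only sketch using the classic Levenshtein distance; the sample names and the "pick the closest lookup row" policy are my own illustration, not Talend's behavior:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance (insert/delete/substitute),
    # keeping only the previous row of the DP table.
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

main_flow = ["Jon Smith", "Mary Jones"]
lookup = ["John Smith", "Marie Jones", "Bob Brown"]

# For each main-flow row, emit the closest lookup value and its distance.
for value in main_flow:
    best = min(lookup, key=lambda ref: levenshtein(value.lower(), ref.lower()))
    print(value, "->", best, "distance", levenshtein(value.lower(), best.lower()))
```

A small distance (1 or 2 edits on a short string) is a strong duplicate signal, while the same distance on a very long address means little, so in practice you would normalize the distance by string length before applying a threshold.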