I am trying to use the Match Analysis in Talend Open Studio for Data Quality to find duplicate customers. It is basic person data with name, organization, email, and street address. I've scrubbed the data as best I can, but the data quality is poor, and a number of fields contain nulls.
There are so many parameters to vary in this process that it is a little overwhelming. Which fields should make up the block key, and how long should it be? Which fields should be used in the matching rules, and which matching functions should I use? What should the individual thresholds and confidence weights be for the various fields, and what should the overall match threshold and confidence be set to?
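To make the question concrete, here is roughly how I understand these parameters interacting. This is only a minimal Python sketch under my own assumptions, not Talend's actual implementation: the names `blocking_key`, `field_similarity`, and `record_score`, the weights, and the 0.85 threshold are all made-up placeholders, and `difflib`'s ratio simply stands in for whichever matching function would be chosen.

```python
from difflib import SequenceMatcher


def blocking_key(record, length=3):
    # Hypothetical block key: first `length` letters of the last name, upper-cased.
    name = record.get("last_name") or ""
    return name[:length].upper()


def field_similarity(a, b):
    # Return None when either value is null/empty so the field can be skipped.
    if not a or not b:
        return None
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_score(r1, r2, weights):
    # Weighted average of per-field similarities, ignoring fields with nulls.
    total = 0.0
    weight_sum = 0.0
    for field, weight in weights.items():
        sim = field_similarity(r1.get(field), r2.get(field))
        if sim is None:
            continue
        total += weight * sim
        weight_sum += weight
    return total / weight_sum if weight_sum else 0.0


# Placeholder settings; choosing these values is exactly what I am asking about.
WEIGHTS = {"last_name": 2.0, "email": 3.0, "street": 1.0}
MATCH_THRESHOLD = 0.85

a = {"last_name": "Smith", "email": "j.smith@example.com", "street": None}
b = {"last_name": "Smyth", "email": "j.smith@example.com", "street": "12 Main St"}

same_block = blocking_key(a, length=2) == blocking_key(b, length=2)
score = record_score(a, b, WEIGHTS)
is_match = same_block and score >= MATCH_THRESHOLD
print(same_block, round(score, 2), is_match)
```

Even in this toy version the settings interact: with a three-letter block key these two records land in different blocks ("SMI" vs "SMY") and would never be compared, and skipping the null street field changes how much the remaining weights count.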
I know I'm not the first person to use the tRecordMatching algorithm in Talend for customer data matching, especially since the data consists of basic, common fields like name and address. I hate to reinvent the wheel, and I know there are people with more statistical knowledge than I have about what the settings should be.
Is there a library of matching rules/settings for customer data for Open Studio for Data Quality that is available for viewing or download?