Fuzzy Match Talend Open Studio high performance tips
Hi, I recently had some experience dealing with large data sets and fuzzy processing, so I wanted to share it with the community.

Using the default fuzzy match component (tFuzzyMatch), which is quite optimized by the way, on a large number of records (half a million or more) can take days of processing. tFuzzyMatch performs an exhaustive fuzzy lookup, meaning that it calculates the similarity of each main record to every record in the lookup table. As far as I am aware, Talend doesn't offer any sort of fuzzy indexing algorithm that might reduce the processing time (at the cost of some accuracy, of course).

Back to the exhaustive fuzzy lookup: if you match 0.5 million records against themselves, the fuzzy matching operation will be performed two hundred fifty billion times (500 000^2 = 250 000 000 000). I wasn't satisfied with tFuzzyMatch's Levenshtein distance algorithm and used a custom one, which took even more time to process.

Tips for decreasing processing time:

1) Optimize all custom Java code (eliminate heavy stuff like regular expressions etc.).
2) Prepare all main and lookup data outside of your fuzzy job. Have it as separate jobs, i.e. join the data in a separate job into a new table which will then be used in the fuzzy process.
3) Minimize the number of components. Have only the main input, the lookup input, the fuzzy match (or a tMap for custom stuff), a filtering component (you only need records with a certain similarity) and the output components.
4) Minimize the number of columns you carry as input and output. Keep only the IDs and the columns that you are going to use in the fuzzy process; you can join all the other columns back after the fuzzy job completes.
5) Divide the main data between separate jobs. If you have 500k records in total, you can have two jobs, each with 250k main records and the full 500k lookup. After the two jobs are done you only need to union the results.
6) To engage your processor fully, you can execute two jobs simultaneously on one machine.
7) For huge data sets you can export the jobs as standalone Java apps and run them on different machines (use CSV input and output to minimize database installation hassles). After they are all done, union the results into a single table by importing all the CSV result files.
8) You can concatenate two columns into one before starting the job. This way you will have only one column for the fuzzy match (like first_name+last_name).
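
To make tips 5, 6 and 8 a bit more concrete, here is a minimal plain Java sketch of the idea. It is not Talend-generated code and not how tFuzzyMatch is implemented internally. The class and method names (FuzzyPartitionSketch, matchChunk, matchInParallel) are made up for illustration, the sample names are invented, and the Levenshtein function is just the textbook two-row version. It splits the main rows into chunks, runs each chunk against the full lookup list in its own thread, and unions the results at the end.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class FuzzyPartitionSketch {

    // Textbook Levenshtein distance using two rolling rows of the DP table.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Exhaustive lookup: every row of one main chunk against every lookup row.
    static List<String> matchChunk(List<String> mainChunk, List<String> lookup, int maxDistance) {
        List<String> hits = new ArrayList<>();
        for (String m : mainChunk) {
            for (String l : lookup) {
                if (levenshtein(m, l) <= maxDistance) hits.add(m + " ~ " + l);
            }
        }
        return hits;
    }

    // Tips 5 and 6: split main into `parts` chunks, match them in parallel, union the results.
    static List<String> matchInParallel(List<String> main, List<String> lookup, int maxDistance, int parts) {
        ExecutorService pool = Executors.newFixedThreadPool(parts);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            int chunkSize = (main.size() + parts - 1) / parts;
            for (int start = 0; start < main.size(); start += chunkSize) {
                List<String> chunk = main.subList(start, Math.min(start + chunkSize, main.size()));
                futures.add(pool.submit(() -> matchChunk(chunk, lookup, maxDistance)));
            }
            List<String> all = new ArrayList<>();
            for (Future<List<String>> f : futures) {
                try {
                    all.addAll(f.get());
                } catch (InterruptedException | ExecutionException e) {
                    throw new IllegalStateException(e);
                }
            }
            return all;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Cost of the exhaustive lookup from the post: 500 000 x 500 000 comparisons.
        System.out.println("Exhaustive comparisons: " + (500_000L * 500_000L));

        // Tip 8: first_name and last_name are already concatenated into one column.
        List<String> main = Arrays.asList("john smith", "jane doe", "mark twain", "ann lee");
        List<String> lookup = Arrays.asList("jon smith", "jane do", "marc twain", "bob ray");
        for (String hit : matchInParallel(main, lookup, 1, 2)) System.out.println(hit);
    }
}
```

Each chunk still does the full exhaustive lookup, so the total number of comparisons does not change. The gain only comes from spreading them over several cores (or, with exported jobs, several machines).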
Re: Fuzzy Match Talend Open Studio high performance tips
Bump for an old post... Has there been any progress on a fuzzy match that uses more than one field and returns more than one value? Daisy-chaining multiple tFuzzyMatch components makes the job unnecessarily complex in my eyes.