Fuzzy match & dynamic join

Four Stars

Fuzzy match & dynamic join

Hello, 
I need help with my project. In my project I have 2 tables: 

CUSTOMERS table (id, name, surname, street, street number, zip code, phone, email) about 100 000 records
USERS table (id, name + surname, street, street number, postal code) about 1 million records

I need to pair with the CUSTOMERS table in the USERS table. 
When a given customer found in the table of USERS , I set him a particular attribute. Problem is, even the street and names are not completely identical in both tables. 
Within the pairing needs to do a fuzzy match, follow these steps: 

postal code must match 
street must have a levenshtein distance value max 3 
The name must have a levenshtein distance value max 5 

It is possible to do Fuzzy match on multiple columns? as adding a condition when equal psc && fuzzy match on the street is max 3 and then come on fuzzy match on the name of the CUSTOMERS ? 
Respectively, I tried to make the columns in both tables as the concatenation psc + street + street number + surname + number. Problem is the 100-000 records seeking one record in a table of size 1 million record. It's awfully slow, it would take about 5 days. 
 
Is there any way to dynamically take all the records from the CUSTOMERS table with the value of a PSC, and to Him do dynamic table inner join USERS with the same psc, and from then on the table to do those little fuzzy join, and iteratively continue like this for all possible values listed psc in the CUSTOMERS table
If you could, can you explain how to do this dynamically, respectively ja stock and set the component?
I use Talend Open Studio for Data Integration and PostgreSQL 9.3
Thanks for your help, I am at a loss.
Four Stars

Re: Fuzzy match & dynamic join

Hi pantolik,
I don't thin that all of your 1 M records needs a fuzzy match... In order to optimize the process what you can do 
- First you have simple inner join using tMap for all of your records in join...
Above step may give you some x records and 1M-x records are rejected due to non match... keep it in separate table of buffer...
- Next step to use this table and again join using your fuzzy logic.
In order to improve the joining performance, I would recommend to create proper indexes on database table
use file system in tMap for storing the lookup data than the memory...
Try this, certainly performance of your job would be enhanced... once done we will try some other techniques as well.
Vaibhav
Four Stars

Re: Fuzzy match & dynamic join

Hi, I did it as follows, but not to restrict users, only around 8%. It would not somehow make iterative sent CUSTOMER into Fuzzy Match based on its postal_code and dynamically on CUSTOMER postal code made a inner join with USER postal code and use it like lookup?
Four Stars

Re: Fuzzy match & dynamic join

Are you getting desired results?... Think of options given in earlier post.
Thanks
Vaibhav
Four Stars

Re: Fuzzy match & dynamic join

There was little acceleration, but not very significant.
Maybe I did not fully understand how you previously thought.
Can you please explain how to connect components for example, at least in text style:
compA ---actionA---> compB ---actionB---> compC 
compD ---actionC---^