Four Stars

How to remove duplicates from two excel tables?

I am fairly new to Talend. My use case: I have two Excel tables with employee data. The columns are name, email, street and phone number. I need to find the employees common to both tables, matching on phone number or street, and write them to a third Excel sheet. I can do this with tUniqRow and tUnite. However, the phone number could be in the format +1 8x9-201-1xx5 in one table and 8x9-201-1xx5 in the other, and the street field could be Main street in one table and Main st in the other. How can I deal with that? Should I use a tMap or a tregex? And how should I filter the data? Thank you very much!

2 REPLIES
Twelve Stars TRF

Re: How to remove duplicates from two excel tables?

Hi,

 

You should look into the tFuzzyMatch component, which helps with deduplication using the Levenshtein, Metaphone or Double Metaphone algorithms.

It will probably help you solve this kind of use case.
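
For reference, Levenshtein distance counts the single-character insertions, deletions and substitutions needed to turn one string into another. Below is a minimal Java sketch of the idea. It is not Talend's own code, and the class and method names are made up for illustration.

```java
// Hypothetical helper for illustration only; not part of Talend.
final class EditDistance {

    // Classic dynamic-programming Levenshtein distance using two rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];

        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }

        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // delete the first i characters of a
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insert = curr[j - 1] + 1;
                int delete = prev[j] + 1;
                int substitute = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insert, delete), substitute);
            }
            // Reuse the two arrays for the next row.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

With this sketch, "Main street" and "Main st" are 4 edits apart, so a fuzzy match only pairs them if the allowed distance is at least 4.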

 

Let us know.


TRF
Four Stars

Re: How to remove duplicates from two excel tables?

Hi TRF, thanks for the reply. I checked out the tFuzzyMatch component and was able to remove some duplicates using Levenshtein. However, my use case is slightly different. If I have two Excel tables with employee details and the phone number is given as (234)-123-4567 in one table and 2341234567 in the other, I need a component that can compare both tables and decide that the two rows are the same employee, based on a regex or some other kind of logic. Is there anything like that available in Talend? Thanks
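
One possible approach, sketched here rather than tested in a job: normalize both fields to a comparable key in each input flow, then join or deduplicate on the keys instead of the raw values. The class and method names below are hypothetical. The sketch assumes 10-digit North American numbers, and its street suffix list is only illustrative.

```java
import java.util.Locale;

// Hypothetical helper for illustration only; not a Talend component.
final class MatchKeys {

    // Keep digits only, then keep the last 10 so that a leading
    // country code such as "+1" does not prevent a match.
    // Assumption: every number is a 10-digit North American number.
    static String phoneKey(String raw) {
        if (raw == null) {
            return "";
        }
        String digits = raw.replaceAll("\\D", "");
        return digits.length() > 10
                ? digits.substring(digits.length() - 10)
                : digits;
    }

    // Lower-case, drop dots and commas, collapse whitespace, and map a
    // few common suffixes to one spelling. The list is not exhaustive.
    static String streetKey(String raw) {
        if (raw == null) {
            return "";
        }
        String s = raw.trim()
                .toLowerCase(Locale.ROOT)
                .replaceAll("[.,]", "")
                .replaceAll("\\s+", " ");
        s = s.replaceAll("\\bstreet\\b", "st");
        s = s.replaceAll("\\bavenue\\b", "ave");
        s = s.replaceAll("\\broad\\b", "rd");
        return s;
    }
}
```

Code like this could live in a user routine and be called from a tMap expression to build the key columns, for example MatchKeys.phoneKey(row1.phone), where row1.phone is an assumed column name. Both (234)-123-4567 and 2341234567 then produce the key 2341234567, and both Main street and Main st produce main st.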