Four Stars

find duplicates and similarity in a row

Hello,

 

I want to find similar and duplicate values in a row, for example:

 

hello                                             
hello word
test
test of something
tqstx of something

I need a result like: [hello, hello word] [test, test of something, tqstx of something].

 

I tried tFuzzyMatch with different options, but it either produces too many wrong matches or misses similarities that should be found.

 

Does anyone have any ideas? Thank you.

 

 

3 REPLIES
Twelve Stars

Re: find duplicates and similarity in a row

Can you explain exactly what you require? In your example it is quite straightforward until you get to ....

test
test of something
tqstx of something

 [test, test of something, tqstx of something]

 

....example. Given the logic shown, "test" could also end up matching "end of the world" (because of the "of") which could also end up matching "hello world" (because of the "world").
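To make the chaining risk concrete, here is a minimal sketch in Python (not Talend code). It assumes a deliberately naive rule, "two values belong together if they share any word", and follows those links transitively. The function name `group_by_shared_word` and the sample rows are made up for illustration only.

```python
from itertools import combinations


def group_by_shared_word(values):
    """Group values that share at least one word, following links transitively."""
    parent = list(range(len(values)))

    def find(i):
        # Walk up to the root of the group, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    tokens = [set(v.lower().split()) for v in values]
    for i, j in combinations(range(len(values)), 2):
        if tokens[i] & tokens[j]:  # at least one word in common
            parent[find(i)] = find(j)

    groups = {}
    for i, value in enumerate(values):
        groups.setdefault(find(i), []).append(value)
    return list(groups.values())


# Hypothetical rows: each neighbour shares one word with the next.
rows = ["test", "test of something", "end of the world", "hello world", "hello"]
groups = group_by_shared_word(rows)
print(groups)
```

With these rows everything collapses into a single group, because "test" links to "test of something", which links to "end of the world" through "of", and so on. Any rule that is loose enough to accept the grouping in the example needs some way to stop this kind of chain.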

 

What you appear to want is a combination of character matching, fuzzy matching AND context awareness. I think you might need to find the exact matches first, then the fuzzy matches, and then get a human to assess the ones that are not clear-cut.
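As a rough illustration of that three-step route, here is a sketch in Python using the standard library's `difflib.SequenceMatcher`. It is not what tFuzzyMatch does internally. The two thresholds (`AUTO_MATCH`, `REVIEW`) and the function name `triage` are assumptions picked for the example and would need tuning on real data.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical thresholds, chosen only for this illustration.
AUTO_MATCH = 0.90  # at or above this, accept the fuzzy match automatically
REVIEW = 0.60      # between REVIEW and AUTO_MATCH, send to a human


def triage(values):
    """Split every pair of values into exact, fuzzy, and needs-review lists."""
    exact, fuzzy, review = [], [], []
    for a, b in combinations(values, 2):
        if a == b:
            exact.append((a, b))
            continue
        score = SequenceMatcher(None, a, b).ratio()
        if score >= AUTO_MATCH:
            fuzzy.append((a, b))
        elif score >= REVIEW:
            review.append((a, b))
        # Anything below REVIEW is treated as "not a match" and dropped.
    return exact, fuzzy, review


values = [
    "claudis", "claudis",
    "hautes jailleres", "hautes jallieres",
    "richard", "richard massais",
]
exact, fuzzy, review = triage(values)
print("exact: ", exact)
print("fuzzy: ", fuzzy)
print("review:", review)
```

The point of the middle band is that a pair like "richard" / "richard massais" is neither clearly the same nor clearly different on character similarity alone, so it goes to a person rather than being decided by the threshold.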

Rilhia Solutions
Four Stars

Re: find duplicates and similarity in a row

Thanks for the answer. Here is a real example:
ROW:

 

 

basses jailleres basses jallieres carrefour de cersay claudis claudis foucrelle saint pierre fouquerelle fresnaie fresnaie fresnaie fresnaie saint pierre gare vraire grand beaulieu grand clos hautes jailleres hautes jallieres petites peaux petites vignes petits champs pin berlot pinberlot pont daviette pont de preuil pont de preuil richard richard massais saint hilaire saint hilaire
EGLISE
EGLISE MASSAIS

Correct matches:

 

basses jailleres | "basses jallieres"
claudis | claudis
foucrelle saint pierre | fouquerelle
fresnaie | fresnaie | fresnaie | "fresnaie saint pierre"
grand beaulieu | "grand clos"
hautes jailleres | "hautes jallieres"
pin berlot | pinberlot
pin berlot | pinberlot
richard | "richard massais"
richard | "richard massais"
saint hilaire | "saint hilaire"
saint hilaire | "saint hilaire"

Incorrect matches:

carrefour de cersay | "gare vraire"
grand beaulieu | "grand clos"
petites peaux | "petites vignes" | "petits champs"
petites peaux | "petites vignes" | "petits champs"
petites peaux | "petites vignes" | "petits champs"
pont daviette | "pont de preuil" | "pont de preuil"
pont daviette | "pont de preuil" | "pont de preuil"
pont daviette | "pont de preuil" | "pont de preuil"

Matches not found:

EGLISE
EGLISE MASSAIS

I need at most 5% of matches not found and at most 10% of incorrect matches.

Maybe using a regex would be more appropriate. Is it possible to compare one row with all the others using a regex in Talend?

Let me know if you have a better idea.

Twelve Stars

Re: find duplicates and similarity in a row

This is a notoriously hard problem to solve, I'm afraid. I do not have "THE" answer, as I don't believe there is a perfect one. There are some easy matches to make, and then there are some that make no sense to me, for example:

 

foucrelle saint pierre | fouquerelle

Also, how do you break up the words? For example, how does "Richard" match "Richard Massais"? Why doesn't it just match "Richard", or match "Richard Massais saint"? Why isn't "pont de" a legitimate match with "pont de"? It looks like you are using your world knowledge to make these decisions. By "world knowledge" I mean your knowledge of context, naming conventions, sentence structure, people, places and so on: just general life experience. You know that Richard is a name and that Massais is likely a surname. Therefore it would seem logical to you that Richard would match Richard Massais. A computer doesn't have that ability (unless you are willing to train a neural network or something similar, which you can do with Talend).

 

I think you may need to come up with some heuristics to help with this. For example, how do you group the words? Will there be punctuation or some sort of separator? Will there be quotes surrounding the groups of words? Do you need to have some sort of grouping dictionary? Once a word (or sentence) is matched, can it be matched again on its own or as part of a combination of words? 
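To show what such heuristics could look like once they are written down, here is a small Python sketch. It assumes the values have already been split into separate strings, and the two rules (ignore case and spaces, then treat a value as matching a longer one that starts with the same whole words) are hypothetical examples, not rules taken from your data. The names `normalize`, `is_word_prefix` and `rule_match` are made up for the illustration.

```python
def normalize(value):
    """Lower-case the value and remove all whitespace."""
    return "".join(value.lower().split())


def is_word_prefix(shorter, longer):
    """True if `longer` starts with all the words of `shorter` and has more words."""
    s, l = shorter.lower().split(), longer.lower().split()
    return len(s) < len(l) and l[:len(s)] == s


def rule_match(a, b):
    """Return the name of the first rule that matches, or None."""
    if normalize(a) == normalize(b):
        return "same after normalization"
    if is_word_prefix(a, b) or is_word_prefix(b, a):
        return "word prefix"
    return None


for a, b in [
    ("pin berlot", "pinberlot"),
    ("EGLISE", "EGLISE MASSAIS"),
    ("pont daviette", "pont de preuil"),
]:
    print(a, "|", b, "->", rule_match(a, b))
```

Explicit rules like these are easy to explain and to audit, but each one encodes a decision you have to make yourself. The word-prefix rule, for instance, would also accept "pont de" as a match for "pont de preuil", which may or may not be what you want.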

I think there need to be more rules, or much more intelligent (trained) software, to solve this as it stands.

 

Rilhia Solutions