I want to find similarities and duplicates in a row, for example:
hello hello word test test of something tqstx of something
I need a result like: [hello, hello word] [test, test of something, tqstx of something].
I tried tFuzzyMatch with different options, but either there are too many errors or it misses similarities that should be found.
Does anyone have any ideas? Thank you.
Can you explain exactly what you require? In your example it is quite straightforward until you get to ....
test test of something tqstx of something
[test, test of something, tqstx of something]
....example. Given the logic shown, "test" could also end up matching "end of the world" (because of the "of") which could also end up matching "hello world" (because of the "world").
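To make the chaining problem concrete, here is a minimal sketch in plain Python (not a Talend component). The function name `chained_groups` and the rule "any shared word links two names" are assumptions made purely for illustration; they are not how tFuzzyMatch works.

```python
def chained_groups(names):
    """Group names transitively: sharing any single word links two names."""
    groups = []  # list of (set of tokens, list of member names)
    for name in names:
        tokens = set(name.lower().split())
        merged_tokens, merged_names = set(tokens), [name]
        remaining = []
        for g_tokens, g_names in groups:
            if g_tokens & tokens:
                # The new name shares a word with this group: merge them.
                merged_tokens |= g_tokens
                merged_names = g_names + merged_names
            else:
                remaining.append((g_tokens, g_names))
        groups = remaining + [(merged_tokens, merged_names)]
    return [members for _, members in groups]


names = ["test", "test of something", "end of the world", "hello world", "claudis"]
print(chained_groups(names))
# [['test', 'test of something', 'end of the world', 'hello world'], ['claudis']]
```

"test" ends up in the same group as "hello world" even though the two have nothing in common, only because "of" and "world" link them through intermediate entries.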
What you appear to want is a combination of character matching, fuzzy matching AND context awareness. I think you might need to go down a route of finding exact matches, fuzzy matches and then getting a human to assess those which are not pretty clear cut.
Thanks for the answer, here is a real example:
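The exact / fuzzy / human-review split could be sketched roughly as follows. This is plain Python using the standard `difflib` module, not a Talend job; the function name `triage` and the thresholds 0.9 and 0.6 are assumptions that would need tuning on real data.

```python
from difflib import SequenceMatcher


def triage(a, b, accept=0.9, review=0.6):
    """Classify a pair as 'exact', 'fuzzy', 'review' or 'no match'.

    The thresholds are illustrative assumptions, not recommended values.
    """
    a, b = a.strip().lower(), b.strip().lower()
    if a == b:
        return "exact"
    score = SequenceMatcher(None, a, b).ratio()
    if score >= accept:
        return "fuzzy"   # close enough to accept automatically
    if score >= review:
        return "review"  # ambiguous: hand over to a human
    return "no match"


print(triage("claudis", "claudis"))                    # exact
print(triage("basses jailleres", "basses jallieres"))  # fuzzy
print(triage("richard", "richard massais"))            # review
```

With these thresholds, only the middle band goes to a person, so the manual effort is limited to the pairs that are genuinely unclear.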
basses jailleres basses jallieres carrefour de cersay claudis claudis foucrelle saint pierre fouquerelle fresnaie fresnaie fresnaie fresnaie saint pierre gare vraire grand beaulieu grand clos hautes jailleres hautes jallieres petites peaux petites vignes petits champs pin berlot pinberlot pont daviette pont de preuil pont de preuil richard richard massais saint hilaire saint hilaire
Correct matches:
basses jailleres | "basses jallieres"
claudis | claudis
foucrelle saint pierre | fouquerelle
fresnaie | fresnaie | fresnaie | "fresnaie saint pierre"
grand beaulieu | "grand clos"
hautes jailleres | "hautes jallieres"
pin berlot | pinberlot
pin berlot | pinberlot
richard | "richard massais"
richard | "richard massais"
saint hilaire | "saint hilaire"
saint hilaire | "saint hilaire"
Incorrect matches:
carrefour de cersay | "gare vraire"
grand beaulieu | "grand clos"
petites peaux | "petites vignes" | "petits champs"
petites peaux | "petites vignes" | "petits champs"
petites peaux | "petites vignes" | "petits champs"
pont daviette | "pont de preuil" | "pont de preuil"
pont daviette | "pont de preuil" | "pont de preuil"
pont daviette | "pont de preuil" | "pont de preuil"
Matches not found:
EGLISE EGLISE MASSAIS
I need at most 5% not found and at most 10% errors.
Maybe using a regex would be more appropriate. Is it possible to compare a row with all the others using a regex in Talend?
Let me know if you have a better idea.
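On the regex idea: a regex cannot measure similarity by itself, but it can normalise the values before each row is compared with all the others. A minimal sketch in plain Python (not Talend; the names `key` and `same_key_pairs` are hypothetical), assuming that case and spacing differences should be ignored:

```python
import re
from itertools import combinations


def key(name):
    """Lowercase and drop everything except a-z and 0-9.

    Assumption: the data is unaccented; accented letters would be dropped too.
    """
    return re.sub(r"[^a-z0-9]", "", name.lower())


def same_key_pairs(names):
    """Compare every row with every other row and keep pairs with equal keys."""
    return [(a, b) for a, b in combinations(names, 2) if key(a) == key(b)]


rows = ["pin berlot", "pinberlot", "EGLISE", "eglise", "grand clos"]
print(same_key_pairs(rows))
# [('pin berlot', 'pinberlot'), ('EGLISE', 'eglise')]
```

This only catches values that become identical after normalisation; pairs such as "basses jailleres" and "basses jallieres" still need a fuzzy comparison on top.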
This is a notoriously hard problem to solve I'm afraid. I do not have "THE" answer as I don't believe there is a perfect answer. There are some easy matches to make, and then there are some that make no sense to me, for example.....
foucrelle saint pierre | fouquerelle
Also, how do you break up the words? For example, how does "Richard" match "Richard Massais"? Why doesn't it just match "Richard" or match "Richard Massais saint"? Why isn't "pont de" a legitimate match with "pont de"? It looks like you are using your world knowledge to make these decisions. By "world knowledge" I am talking about your knowledge of context, naming conventions, sentence structure, people, places, etc....just general life experience. You know that Richard is a name and that Massais is likely a surname. Therefore it would seem logical to you that Richard would match Richard Massais. A computer doesn't have that ability (unless you are willing to train a neural network or something....which you can do with Talend).
I think you may need to come up with some heuristics to help with this. For example, how do you group the words? Will there be punctuation or some sort of separator? Will there be quotes surrounding the groups of words? Do you need to have some sort of grouping dictionary? Once a word (or sentence) is matched, can it be matched again on its own or as part of a combination of words?
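As one possible set of answers to those questions, here is a minimal sketch in plain Python (not a Talend component). It assumes one name per input row, assigns each name to at most one group, and treats two names as matching when every word of the shorter name has a similar word in the longer one. The function names and the 0.6 threshold are assumptions for illustration only.

```python
from difflib import SequenceMatcher


def contained(short, long_, threshold=0.6):
    """True if every word of `short` has a similar word in `long_`."""
    long_tokens = long_.lower().split()
    return all(
        any(SequenceMatcher(None, s, t).ratio() >= threshold for t in long_tokens)
        for s in short.lower().split()
    )


def matches(a, b, threshold=0.6):
    # Compare the name with fewer words against the one with more words.
    short, long_ = sorted((a, b), key=lambda s: len(s.split()))
    return contained(short, long_, threshold)


def group(names, threshold=0.6):
    """Greedy grouping: each name joins the first group whose first member
    it matches, and is never considered again afterwards."""
    groups = []
    for name in names:
        for g in groups:
            if matches(g[0], name, threshold):
                g.append(name)
                break
        else:
            groups.append([name])
    return groups


rows = ["hello", "hello word", "test", "test of something", "tqstx of something"]
print(group(rows))
# [['hello', 'hello word'], ['test', 'test of something', 'tqstx of something']]
```

Comparing only against the first member of each group avoids the chaining effect, and under this rule "pont daviette" no longer matches "pont de preuil" because "daviette" has no similar word on the other side. It is still only a heuristic: a common word standing on its own (for example a row containing just "saint") would attract every name that contains it.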
I think there needs to be more rules or much more intelligent (trained) software to solve this as it stands.