tRecordMatching blocking definition excludes everything

One Star

I have a source whose street names need cleaning. I have a reference source with valid combinations of Danish postal codes and street names. A street name can of course occur in multiple postal codes. In a sample input I have 80 rows looking like this:
1000;2200;Rantzausgade
First number is an ID, second is postal code and the last is street name.
In a lookup table (database) I have rows like this:
61; 1057;Holbergsgade
Again, the first number is an ID (unrelated to the ID in the sample).
In a first attempt I specify only the street names as Input key and Lookup key in the Key definition (exact match, to be replaced with something fuzzy later).
As it happens with this particular sample, I get 80 rows in the Matches output and 0 rows in the Possible Matches output. The matches include a few rows where the postal codes actually differ (I have added the lookup postal code to the output), but most of them have matching postal codes.
Now I add the postal code from the input and the postal code from the lookup to the blocking definition, and I get 0 rows in both Matches and Possible Matches. I expected to get matches only when both street name and postal code agree. I supposed the blocking definition would work like in tMatchGroup ...
What am I doing wrong? Thanks!
PS I originally posted this under Open Studio,
Employee

Re: tRecordMatching blocking definition excludes everything

Hi xnmo,
Your expectations are correct. You should get some matches if the postal code and street names are the same.
Instead of choosing "exact match", try another algorithm such as q-gram, Levenshtein or Jaro-Winkler.
If you still don't get any matches or possible matches, I would lower the matching threshold in the advanced settings of the component in order to try to get a few matches and understand what happens.
Maybe you need to trim your postal codes or street names before doing the match.
If nothing works, are you sure that there are matches? Can you show one example and provide a job that reproduces the problem?
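To make the expected behaviour concrete, here is a minimal Java sketch of blocking followed by an exact key match. It is an illustration of the general technique only, not tRecordMatching's actual implementation; the `Row` class, the `match` method and the lookup row with ID 62 are made up for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BlockingSketch {
    // Hypothetical row type for illustration; not a Talend class.
    static final class Row {
        final int id;
        final int postalCode;
        final String street;

        Row(int id, int postalCode, String street) {
            this.id = id;
            this.postalCode = postalCode;
            this.street = street;
        }
    }

    /** Returns {inputId, lookupId} pairs that share a postal code (the block) and a street name (the key). */
    static List<int[]> match(List<Row> input, List<Row> lookup) {
        // Group the lookup rows by blocking key so each input row is
        // compared only with lookup rows from the same postal code.
        Map<Integer, List<Row>> blocks = new HashMap<>();
        for (Row r : lookup) {
            blocks.computeIfAbsent(r.postalCode, k -> new ArrayList<>()).add(r);
        }
        List<int[]> matches = new ArrayList<>();
        for (Row in : input) {
            List<Row> candidates = blocks.getOrDefault(in.postalCode, new ArrayList<>());
            for (Row candidate : candidates) {
                if (in.street.equals(candidate.street)) { // exact match on the key
                    matches.add(new int[] {in.id, candidate.id});
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<Row> input = new ArrayList<>();
        input.add(new Row(1000, 2200, "Rantzausgade"));
        List<Row> lookup = new ArrayList<>();
        lookup.add(new Row(61, 1057, "Holbergsgade"));
        lookup.add(new Row(62, 2200, "Rantzausgade")); // invented lookup row
        for (int[] m : match(input, lookup)) {
            System.out.println(m[0] + " -> " + m[1]);
        }
    }
}
```

With this logic, a block that never lines up (for example because the two postal code values are not considered equal) produces no candidates at all, which would look exactly like 0 rows in both outputs.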
One Star

Re: tRecordMatching blocking definition excludes everything

Hi Sebastiao,
I tried various settings without any success. I also trimmed without success. It shouldn't matter though, since both the database and the sample file have an Integer field. I tried exporting a 1000-line sample from the database and switched the input to this sample file. And that works: I was lucky enough to get 9 matches.
To me it looks like there is a failure in comparing an integer field from the sample file with an integer field from the database. But it works with integers from two files. Could I be doing something wrong around comparing int with Integer (nah, would give compile errors)? Any ideas how I could dig deeper into this?
Regards Niels
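One way a type mismatch could slip past the compiler is sketched below. It assumes, purely for illustration, that the blocking keys are compared as generic objects and that the database driver hands back the column as a different numeric type than the file reader does; neither assumption is confirmed in this thread. The class and method names are hypothetical.

```java
public class BoxedKeyPitfall {
    // Hypothetical comparison of two blocking keys held as plain Objects.
    static boolean sameBlock(Object inputKey, Object lookupKey) {
        return inputKey.equals(lookupKey);
    }

    public static void main(String[] args) {
        Object fromFile = Integer.valueOf(2200);
        Object fromDbAsInteger = Integer.valueOf(2200);
        Object fromDbAsLong = Long.valueOf(2200L); // assumed: driver returns a wider type

        // Same numeric value and same boxed type: equal.
        System.out.println(sameBlock(fromFile, fromDbAsInteger));
        // Same numeric value but different boxed types: Integer.equals(Long) is false.
        System.out.println(sameBlock(fromFile, fromDbAsLong));
    }
}
```

Because both keys are typed as `Object`, this compiles without warnings, yet every block comparison between the two sources would fail.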
Employee

Re: tRecordMatching blocking definition excludes everything

Hi Niels,
If it's related to the database type, I would try converting the integer to a string in a tMap (or tJavaRow) placed before the input of the tRecordMatching component.
In the meantime, I suggest that you raise an issue in our bugtracker http://www.talendforge.org/bugs/
Please include the version of the Talend product, the database from which you get the lookup data, and the database schema.
If you can give us a sample job and data that reproduces the problem, then it will make our work easier.
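The string conversion suggested above could look roughly like the following helper. This is a sketch only: `KeyNormalizer` and `toBlockingKey` are hypothetical names, and in a real job the same expression would be applied to the postal code column on both the input and the lookup flow (for example in a tMap expression or a tJavaRow) so that both sides carry a String column.

```java
public class KeyNormalizer {
    /**
     * Turns a postal code of any numeric or textual type into a canonical
     * String so that both flows compare like with like.
     */
    static String toBlockingKey(Object value) {
        if (value == null) {
            return ""; // treat missing values as an empty key
        }
        if (value instanceof Number) {
            // Integer, Long, Short, BigDecimal etc. all collapse to the same digits.
            return String.valueOf(((Number) value).longValue());
        }
        return value.toString().trim(); // strip stray whitespace from text fields
    }

    public static void main(String[] args) {
        System.out.println(toBlockingKey(Integer.valueOf(2200)));
        System.out.println(toBlockingKey(Long.valueOf(2200L)));
        System.out.println(toBlockingKey(" 2200 "));
    }
}
```

Normalising on both sides keeps the blocking step independent of whatever numeric type each source happens to deliver.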
One Star

Re: tRecordMatching blocking definition excludes everything

Hi Sebastiao,
I have raised the issue with sample jobs and sample data. It has also occurred to me that I could try putting the input data in a MySQL table instead of reading it from a file, which is what I need to do for the "real" job anyway.
Thanks for helping!
Niels
Employee

Re: tRecordMatching blocking definition excludes everything

Thanks for the issue, we'll have a look at it.
The link is http://jira.talendforge.org/browse/TDQ-3609