HWY
Five Stars

Stuck on how to process the result of a 'fuzzy' lookup

Hi All,

 

I am pretty new to the Talend product suite, but I have some knowledge and experience with other data integration tools.

 

In my first tryout I want to develop a simple service which polls a specific folder and, for each file dropped into that folder, executes a 'fuzzy' lookup of the filename against an Oracle table of tags. The 'fuzzy' lookup can have 3 results:

  1. The filename matches none of the tags from the Oracle table (NO_DATA_FOUND)
  2. The filename matches exactly 1 of the tags (OK)
  3. The filename matches more than 1 of the tags (TOO_MANY_ROWS)

I want to store the outcome of the 'fuzzy' match in my database using an Oracle insert.

To build this job, I have used a tWaitForFile and a tOracleInput step and combined them in a tMap step. In the tMap step I have calculated the fuzzy match by adding an extra column to the output stream, calculated as "StringHandling.INDEX(row1.FILENAME,row2.B006_TAG)". This works fine and each line of the tOracleInput is padded with the extra column LookupResult indicating whether that tag exists in the filename or not.
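For reference, the per-row check in the tMap boils down to a simple substring search. Below is a plain-Java sketch of the same logic; the filename and tag values are made up for illustration, and only the column names come from my job.

// Plain-Java sketch of the per-row match logic used in the tMap.
// StringHandling.INDEX(row1.FILENAME, row2.B006_TAG) returns the position of the
// tag within the filename, or -1 when the tag is not found; String.indexOf does the same.
public class FuzzyMatchSketch {

    // Returns the match position, or -1 when the tag does not occur in the filename.
    static int lookupResult(String filename, String tag) {
        return filename.indexOf(tag);
    }

    public static void main(String[] args) {
        String filename = "invoice_B006_20190101.csv"; // made-up example values
        System.out.println(lookupResult(filename, "B006"));  // >= 0: tag exists in the filename
        System.out.println(lookupResult(filename, "X999"));  // -1: no match
    }
}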

 

Based on the new column "LookupResult" I want to determine the overall result (NO_DATA_FOUND, OK or TOO_MANY_ROWS), but after numerous attempts with tAggregateRow and tFilterRow I have run out of options....

Can somebody give me a hint on how to proceed with this? See attached my job and the tMap step.

 

Thnx & regards

6 REPLIES
Twelve Stars

Re: Stuck on how to process the result of a 'fuzzy' lookup

What I'm reading here is that the fuzzy matching is doing what you require, but you are unsure how to group the data to identify whether no match, one match or multiple matches have been found. Is this correct? If so, can you explain how your row is padded "with the extra column LookupResult indicating whether that tag exists in the filename or not"?

 

I am assuming that below may be an example of how this is occurring....

 

Col1,   Col2,   Col3,      LookupResult
A1,      B1,      C1,         Match1
A2,      B2,      C2,         Match3
A2,      B2,      C2,         Match9
A1,      B1,      C1,         Match2
A3,      B3,      C3,         Match4
A4,      B4,      C4,         Match6
A5,      B5,      C5,         <NULL>

.....and if the above represents an abstracted version of what you are getting, would below be the right result....

Col1,   Col2,   Col3,      LookupResult       GroupingResult
A1,      B1,      C1,         Match1, Match2  More than 1 match
A2,      B2,      C2,         Match3, Match9  More than 1 match
A3,      B3,      C3,         Match4          One match
A4,      B4,      C4,         Match6          One match
A5,      B5,      C5,         <NULL>          No match

If so, what you need to do is use the tAggregateRow after you have done your lookup (where you should get the data in the first example set I've shown), then group by Col1 (in the above example) and select "First" for the aggregate options for everything else BUT LookupResult, where you would use "Count". This will allow you to connect the output data to a tMap where you can filter by the column you are using for "Count". You will then have single rows where you can identify your Single match, multi match or no match status.
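If it helps to see that grouping logic outside of the components, here is a small plain-Java sketch of what the "group by key, count LookupResult, classify" step amounts to. It is only an illustration under my assumptions above: the class and field names are made up, and counting non-null LookupResult values stands in for the tAggregateRow "Count" option.

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of the "group by key, count matches, classify" step.
// The Row record and field names are made up; they mirror the example data above.
public class GroupingSketch {

    record Row(String col1, String lookupResult) {}

    // Map the per-key match count onto the three outcomes from the original post.
    static String classify(long matchCount) {
        if (matchCount == 0) return "NO_DATA_FOUND"; // no tag found
        if (matchCount == 1) return "OK";            // exactly one tag matched
        return "TOO_MANY_ROWS";                      // more than one tag matched
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(
                new Row("A1", "Match1"), new Row("A1", "Match2"),
                new Row("A3", "Match4"),
                new Row("A5", null));

        // Group by the key column (Col1 in the example) and count non-null matches.
        Map<String, Long> counts = new LinkedHashMap<>();
        for (Row r : rows) {
            counts.merge(r.col1(), r.lookupResult() == null ? 0L : 1L, Long::sum);
        }

        counts.forEach((key, count) -> System.out.println(key + " -> " + classify(count)));
        // Prints: A1 -> TOO_MANY_ROWS, A3 -> OK, A5 -> NO_DATA_FOUND
    }
}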

 

There are other requirements I am sure you will have with this which may need a bit of extra thinking about (possibly because of the data loss due to the aggregation), but the above should get you part of the way....so long as my many assumptions are correct.

Rilhia Solutions
HWY
Five Stars

Re: Stuck on how to process the result of a 'fuzzy' lookup

Hi Rhall_2_0

 

Thnx for your help. I tried to implement this, but the tAggregateRow component does not produce any output rows.

 

See my attached screenshots.

 

Do you have a clue?

Twelve Stars

Re: Stuck on how to process the result of a 'fuzzy' lookup

Have you set an output schema (the schema button at the top of the component tab)? You will need to do this. Also, you will need to set the key column in the "Group by" section. For example, in my example data that would be column Col1.

Rilhia Solutions
HWY
Five Stars

Re: Stuck on how to process the result of a 'fuzzy' lookup

Yep; see my attached screenshots.

 

I will experiment with the tAggregateRow component some more.

 

HWY
Five Stars

Re: Stuck on how to process the result of a 'fuzzy' lookup

I think the tWaitForFile component is causing the problem. This component runs indefinitely.

 

The tAggregateRow needs all rows before it can perform the aggregation, but it never knows when all rows have been collected, because the tWaitForFile can emit a new row at any moment.

 

Is my conclusion correct?

Twelve Stars

Re: Stuck on how to process the result of a 'fuzzy' lookup (Accepted Solution)

Ok, I think I misunderstood your requirement originally. 

 

First of all, let's look at the process of analysing one file at a time. Remove your tWaitForFile and test the job with a hard-coded filename using a tFixedFlowInput. You will need to sort your tAggregateRow out; for that, I suspect you will need to set a group column. This should be easy to figure out with a hard-coded filename, though. Your suspicion about the tWaitForFile causing a problem with the tAggregateRow was correct.

 

Once you have the solution for each file, you will need to enable it for polling. This is a bit of a problem if you want this "job" to be running 24/7. Talend jobs are considered to be "batch" processes. By that I mean they run and then stop; they do not run constantly. For that sort of process you will need to use Talend ESB and a Route. This is a completely different kettle of fish, but you can still use your solution above within a Route.
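For orientation only: Talend ESB Routes are built on Apache Camel, so the polling side conceptually looks like the sketch below. The folder path, delay and endpoint names are assumptions for illustration; in the Studio you would build the Route graphically (typically a cFile endpoint handing each file off to your job, e.g. via a cTalendJob step) rather than writing the Java DSL by hand.

import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.main.Main;

// Minimal Apache Camel (3.x) sketch of a file-polling route.
// The directory, delay and endpoint names are illustrative assumptions only.
public class FilePollingRouteSketch {

    public static void main(String[] args) throws Exception {
        Main main = new Main();
        main.configure().addRoutesBuilder(new RouteBuilder() {
            @Override
            public void configure() {
                // Poll the drop folder every 5 seconds; each new file becomes one exchange.
                from("file:/data/incoming?delay=5000&noop=true")
                    .log("Picked up ${header.CamelFileName}")
                    // The per-file fuzzy-lookup job would be invoked at this point,
                    // e.g. via a cTalendJob step when building the Route in the Studio.
                    .to("log:fuzzy-lookup");
            }
        });
        main.run();
    }
}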

 

Before we go down that path, though, it probably makes sense for you to run the hard-coded (single file) test and come back with what you actually need regarding the polling.

Rilhia Solutions