Iterator over all input row combinations (not memory buffered)

One Star mac

I am new to Talend and am looking for a way to iterate over all row combinations of
input data (e.g. rows of a CSV file or records in an XML file) in order to perform a rather
complex matching (using the tMap component) and then remove the duplicates identified
this way. Because of the large amount of data, I would like to iterate over the input multiple times
rather than buffer it in memory (possibly improving performance with caching).
I have something like the following in mind: two pointers on the input data, one iterating
from the beginning to the end, the second always starting from the position of the first pointer
(nothing really special...).
For example, with only four rows the iterator would produce the combinations
(which could then be used as input for a tMap component):
- row1, row2 (inc. pointer1, reset pointer2)
- row1, row3 (inc. pointer2)
- row1, row4 (inc. pointer2)
- row2, row3 (inc. pointer1, reset pointer2)
- row2, row4 (inc. pointer2)
- row3, row4 (inc. pointer1, reset pointer2)
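The two-pointer scheme listed above can be sketched in plain Java. This is only an illustration of the iteration order, not a Talend component: the class and method names are hypothetical, and the rows are held in a list for brevity, whereas a file-based version would re-read the input in the inner loop instead.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Talend): enumerates every unordered
// pair of rows exactly once using two indices.
final class RowPairs {

    // Returns the pairs (rows[i], rows[j]) for all i < j.
    static <T> List<List<T>> allPairs(List<T> rows) {
        List<List<T>> pairs = new ArrayList<>();
        for (int i = 0; i < rows.size() - 1; i++) {      // pointer 1
            for (int j = i + 1; j < rows.size(); j++) {  // pointer 2 restarts just after pointer 1
                pairs.add(List.of(rows.get(i), rows.get(j)));
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        for (List<String> pair : allPairs(List.of("row1", "row2", "row3", "row4"))) {
            System.out.println(pair.get(0) + ", " + pair.get(1));
        }
    }
}
```

For n rows this visits n(n-1)/2 pairs, i.e. 6 pairs for the four rows above.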
Questions:
- Does such a component already exist, or can it be easily constructed, e.g. out of tLoop?
- Am I overlooking some basic functionality that would make the job much easier?
Thanks for your patience,
mac
Community Manager

Re: Iterator over all input row combinations (not memory buffered)

Hello Mac
Can you provide some sample data to explain your request? What is your input data, and what output do you expect?
Best regards
shong
----------------------------------------------------------
Talend | Data Agility for Modern Business
One Star mac

Re: Iterator over all input row combinations (not memory buffered)

Hi Shong
I actually want to compare/process every row combination. Here is a simple example.
Let's say I have an XML file with construction pieces (here only 4):
<pieces>
  <piece>
    <id>1</id>
    <color>rouge</color>
    <height>200</height>
    <width>349</width>
  </piece>
  <piece>
    <id>2</id>
    <color>azul</color>
    <height>243</height>
    <width>299</width>
  </piece>
  <piece>
    <id>3</id>
    <color>rot</color>
    <height>1205</height>
    <width>340</width>
  </piece>
  <piece>
    <id>4</id>
    <color>bleu</color>
    <height>200</height>
    <width>39</width>
  </piece>
</pieces>
In every iteration, the iterator would give me two records to process. After iterating over all elements, I will have been able to compare each piece with every other piece. As I described before, the iterator would give me:
piece ID 1 and 2
piece ID 1 and 3
piece ID 1 and 4
piece ID 2 and 3
piece ID 2 and 4
piece ID 3 and 4
The idea behind all this is that with every iteration I can process two records and filter them out, correct them, or do whatever I want.
In this case, I could for instance translate the colors from different languages to English (e.g. with a lookup table), or decide that height or width are considered equal if they differ by no more than a certain deviation. In this way, I can do data cleansing by defining an equality function for filtering out equal pieces.
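Such an equality function might look like the following plain-Java sketch. Everything here is an assumption for illustration: the `Piece` class, the color table (which covers only the four sample colors), and the `tolerance` parameter are hypothetical and would be replaced by the real schema and rules.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical matcher for the <piece> records above (not Talend code).
final class PieceMatcher {

    // Minimal stand-in for one <piece> record.
    static final class Piece {
        final String id;
        final String color;
        final int height;
        final int width;

        Piece(String id, String color, int height, int width) {
            this.id = id;
            this.color = color;
            this.height = height;
            this.width = width;
        }
    }

    // Assumed lookup table; it only covers the colors in the sample data.
    static final Map<String, String> COLOR_TO_ENGLISH = Map.of(
            "rouge", "red",
            "rot", "red",
            "azul", "blue",
            "bleu", "blue");

    // Translates a color to English; unknown colors are passed through in lower case.
    static String normalizeColor(String color) {
        String key = color.toLowerCase(Locale.ROOT);
        return COLOR_TO_ENGLISH.getOrDefault(key, key);
    }

    // Two pieces count as equal when their translated colors match and
    // height and width each differ by at most `tolerance`.
    static boolean similar(Piece a, Piece b, int tolerance) {
        return normalizeColor(a.color).equals(normalizeColor(b.color))
                && Math.abs(a.height - b.height) <= tolerance
                && Math.abs(a.width - b.width) <= tolerance;
    }

    public static void main(String[] args) {
        Piece p1 = new Piece("1", "rouge", 200, 349);
        Piece p3 = new Piece("3", "rot", 1205, 340);
        System.out.println(similar(p1, p3, 10));
    }
}
```

With the sample data, pieces 1 and 3 share a translated color, but their heights differ by 1005, so they are not reported as equal at a tolerance of 10.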
Thanks for your support!
mac
Community Manager

Re: Iterator over all input row combinations (not memory buffered)

Hello Mac
In Talend, the 'iterate' link will fit your need. Please see my screenshots.
My forum6435.xml:

<?xml version="1.0" encoding="ISO-8859-15"?>
<root>
  <pieces>
    <piece>
      <id>1</id>
      <color>rouge</color>
      <height>200</height>
      <width>349</width>
    </piece>
    <piece>
      <id>2</id>
      <color>azul</color>
      <height>243</height>
      <width>299</width>
    </piece>
    <piece>
      <id>3</id>
      <color>rot</color>
      <height>1205</height>
      <width>340</width>
    </piece>
    <piece>
      <id>4</id>
      <color>bleu</color>
      <height>200</height>
      <width>39</width>
    </piece>
  </pieces>
</root>

I hope this makes the usage of 'iterate' clear.
Let me know if you have any questions!
Best regards
shong
One Star mac

Re: Iterator over all input row combinations (not memory buffered)

Hi Shong
Great, this looks like what I am searching for, thank you!
One small follow-up question:
My example was unfortunately a little imprecise with respect to the ID. In the example
it is an ascending value, but in reality it could be anything, so the ID cannot be used
as a filter condition. It would have to be some sort of row counter (of the input component, e.g.
tFileInputXML), which I couldn't find as a property there. I also checked whether the schema could be
extended by some sort of "auto increment" value, but I didn't find any information on that
either. Do you suggest working with tFileRowCount, or is there a direct way?
Thanks again.
One Star mac

Re: Iterator over all input row combinations (not memory buffered)

I don't know if this is the right way to go or if there is a more elegant solution, but I could
get the requested behavior by adding the expression:
tos_count_tFileInputXML_2>tos_count_tFileInputXML_1
in the advanced section (Basic Settings) of the tFilterRow (see screen 3 in the above post).
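The effect of that filter expression can be illustrated outside Talend with a short Java sketch. It assumes the two counters track the row position of the outer and inner read of the same input; the names `outerCount`, `innerCount` and the `source` supplier (standing in for re-opening the file on each pass) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical illustration of filtering a full cross product by row counters.
final class CounterFilter {

    // Each call to source.get() stands in for a fresh pass over the input.
    static List<String> pairsByCounter(Supplier<Iterable<String>> source) {
        List<String> kept = new ArrayList<>();
        int outerCount = 0;
        for (String outer : source.get()) {
            outerCount++;
            int innerCount = 0;
            for (String inner : source.get()) {   // inner pass restarts for every outer row
                innerCount++;
                if (innerCount > outerCount) {    // same idea as counter_2 > counter_1
                    kept.add(outer + "," + inner);
                }
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> rows = List.of("row1", "row2", "row3", "row4");
        pairsByCounter(() -> rows).forEach(System.out::println);
    }
}
```

Only the relative order of the two counters matters here, so the result is the same whether counting starts at 0 or 1. Note that the inner pass still reads every row and merely discards the unwanted combinations.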
Community Manager

Re: Iterator over all input row combinations (not memory buffered)

Hello
"Do you suggest working with the tFileRowCount or is there a direct way?"

Yes, add a new column, id, which holds a sequence number for each row. Please see the screenshots.
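As a rough illustration of what such a sequence column does, here is a plain-Java sketch of a named counter. The class below is hypothetical. Talend ships a routine of a similar shape, `Numeric.sequence(name, start, step)`, which is commonly used in a tMap expression for this purpose, but since the screenshots are not available it is only an assumption that this is what was shown.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical named sequence generator (a stand-in, not the Talend routine).
final class Sequences {

    private static final Map<String, Integer> NEXT = new HashMap<>();

    // Returns the current value of the named sequence, then advances it by `step`.
    // `start` is only used the first time a given name is seen.
    static synchronized int next(String name, int start, int step) {
        int value = NEXT.getOrDefault(name, start);
        NEXT.put(name, value + step);
        return value;
    }

    public static void main(String[] args) {
        String[] colors = {"rouge", "azul", "rot", "bleu"};
        for (String color : colors) {
            // Each row gets the next number, independent of its own id field.
            System.out.println(next("rowId", 1, 1) + " " + color);
        }
    }
}
```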
Best regards

shong
One Star mac

Re: Iterator over all input row combinations (not memory buffered)

Hi Shong
Thanks! Perfect help, and your solution helps me get to know Talend better and better.
However, I must bother you again with a problem related to the above. Although I consulted the User and Developer manuals,
I can't see how to bring the outputs (currently written out with a tLogRow) together so that I can implement (comparison)
logic based on both inputs.
As far as I understand Talend, a tMap would be a good way to do this, but it cannot handle two main inputs. The
other examples I find always use a Lookup input, and mostly in combination with database inputs.
Am I missing some important basic concept in Talend, or am I simply overlooking something? (blush)
Community Manager

Re: Iterator over all input row combinations (not memory buffered)

Hello
"currently written out with a tLogRow"

tLogRow is used to print the result on the console; it is just for debugging purposes. In a real job, you will output the result to a file or database, so you can use tFileOutputxxx/txxxSQLOutput to replace tLogRow.
"but this cannot handle two input mains. I find
other examples always with a Lookup input"

In fact, a lookup flow can be treated much like a main flow. tMap is a very useful and powerful component; it can merge, filter, or otherwise process multiple input flows.
Best regards

shong
One Star mac

Re: Iterator over all input row combinations (not memory buffered)

But why can't I then feed the output rows directly into the tMap (instead of the tLogRow; yes, I understood that these components are only for debugging/logging purposes)?
I can attach one main to the tMap but not a second one. When I look at other examples, I always see that the second feed is a "Lookup" and not a "Main".
In the documentation I find a note about "Lookup" but not how to handle or obtain one...
Community Manager

Re: Iterator over all input row combinations (not memory buffered)

Hello Mac
"But why can't I then feed in the output rows directly into the tMap"

tMap is an intermediary component; it can join, merge, filter, or otherwise process data, and finally the result is written to an output component such as tMysqlOutput or tFileOutputDelimited.
"I always see that a second feed is a "Lookup" and not a "Main"."

A tMap has exactly one main flow; all other input flows must be lookup flows.
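Conceptually, the main/lookup distinction resembles the following plain-Java sketch: the main rows are processed one by one, and each is matched against a lookup table. This is an assumption-laden illustration, not tMap's actual implementation; the class name, the row layout (`[id, color]` for main rows, `[foreignColor, englishColor]` for lookup rows), and the inner-join behavior are all chosen for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of one main flow joined against one lookup flow.
final class MainLookupJoin {

    // mainRows:   [id, color]
    // lookupRows: [foreignColor, englishColor]
    // Returns "id:englishColor" for every main row whose color is found in the lookup.
    static List<String> join(List<String[]> mainRows, List<String[]> lookupRows) {
        // Load the lookup flow into a map keyed by the join column.
        Map<String, String> lookup = new HashMap<>();
        for (String[] row : lookupRows) {
            lookup.put(row[0], row[1]);
        }

        // Process the main flow row by row, matching each row against the lookup.
        List<String> out = new ArrayList<>();
        for (String[] row : mainRows) {
            String english = lookup.get(row[1]);
            if (english != null) {               // inner join: unmatched main rows are dropped
                out.add(row[0] + ":" + english);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> pieces = List.of(
                new String[] {"1", "rouge"},
                new String[] {"2", "azul"});
        List<String[]> colors = List.of(
                new String[] {"rouge", "red"},
                new String[] {"azul", "blue"});
        join(pieces, colors).forEach(System.out::println);
    }
}
```

The asymmetry is the point: the main flow drives the processing, while the lookup flow only answers queries, which is why only one input can be the main.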
Best regards

shong
One Star mac

Re: Iterator over all input row combinations (not memory buffered)

Hi Shong
Exactly... this points to my basic question :). How can I convert a "main flow" to a "lookup flow"?
(The question may sound ridiculous, sorry, but it only shows how basic my knowledge is... ;-))
Community Manager

Re: Iterator over all input row combinations (not memory buffered)

Hello
Right-click on the 'main' flow and select the 'set this connection as lookup' option.
Best regards

shong
One Star mac

Re: Iterator over all input row combinations (not memory buffered)

Hello Shong
You say to right-click the main flow and select "set this connection as lookup".
Unfortunately I cannot find that menu, or am I looking in completely the wrong spot?
For illustration I attached my screen, where I right-clicked the main output flow of tMap_2, although it should
be the output of tMap_1, which I would then like to feed into tMap_3.
The attached screenshot in your last post shows an example with a database. Is this maybe limited to that scenario?
Thanks a lot, I really appreciate your patient support.
One Star mac

Re: Iterator over all input row combinations (not memory buffered)

Hi Shong or anyone having patience with me ;-)
I am sure there is a simple explanation for why I cannot feed the above two feeds into the tMap.
From my limited point of view it is because one data stream has to be a lookup, but how
do I get this? I even checked out some webinars, and there they do similar things but can
always input the two records into a tMap.
Am I simply missing a concept of Talend, e.g. some issue with synchronicity (a guarantee that
both exist in the tMap)? I know, I am starting to come up with some adventurous
ideas, but I am quite helpless at the moment.
Community Manager

Re: Iterator over all input row combinations (not memory buffered)

Hello Mac
The lookup option only appears when more than one flow is linked to the tMap: one is the main flow and the others are lookup flows. You should right-click on an input flow of the tMap, not on its output flows.
Best regards

shong