One Star

How to merge two replicated streams(tMap, tReplicate, tUnite)

Issue description: I am not able to merge replicated data streams. In this particular example I replicate the data with a tReplicate/tMap component, process different parts/columns in each branch, and then try to merge the branches again so that I have a fully processed result set which can be loaded into the target system.
References :
http://jira.talendforge.org/browse/TDI-13416
https://jira.talendforge.org/browse/TDI-22174
https://jira.talendforge.org/browse/TDI-22232
Note: The processing could also have been done sequentially, but I am trying to process the records in parallel in order to make better use of CPU time.
Scenario 1: If I have a 10 GB file, it is not a good idea to store all of the replicated data on disk. If I replicate the stream into 3 parts, I would get around 30 GB of data (the processed data might be smaller, perhaps 20 GB or so, but it would always be more than 10 GB in order to keep the data complete).
1) Storing that data will require disk space.
2) There will be permanent I/O latency.
3) Housekeeping of that data has to be maintained.
Ref. Image : Scenario1.PNG

Scenario 2: I have also tried doing the same thing by assigning the same set of input variables to three different output flows in tMap, but the result is the same and I could not join the data.
Ref. Image : Scenario2.PNG

I have been told that this is not a bug but a known limitation. Can anyone please suggest how to process such data and how to merge those data sets? Any alternative would be appreciated, whether that is inserting some component in between or some coding to merge the data sets. For joining purposes I have also created an ID column.
7 REPLIES
One Star

Re: How to merge two replicated streams(tMap, tReplicate, tUnite)

join with a tUnite
One Star

Re: How to merge two replicated streams(tMap, tReplicate, tUnite)

Thanks Janhess,
However, it doesn't work for jobs which create loops. As far as I know there are only two viable options: 1) switch the job flow to sequential mode, or 2) store the replicated data somewhere and read it again to process it.
However, both options put a lot of burden on the tool and introduce unavoidable latency as the data volume grows.
So I am looking for a more elegant solution. It does not seem like good design to process only 1 column out of 7 at a time in 10 GB of data and waste CPU time, or to store 10 GB of data on disk multiple times.
Can anyone explain why Talend behaves this way and what design leads to this situation? Then maybe I can think about some alternatives.
Any suggestion is most welcome.
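For illustration, option 2 (store the replicated data, then read it back and join) could look roughly like the following in plain Java. This is only a sketch of the idea and not Talend-generated code: the class name `BufferedRejoinSketch`, the three-column row layout (ID, name, city) and the two stand-in transformations are all assumptions made up for the example. It buffers each branch's result in memory keyed by the ID column, which is exactly the storage cost described above, only in RAM rather than on disk.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BufferedRejoinSketch {

    // Hypothetical branch A: some processing of column 1.
    static String processName(String name) {
        return name.toUpperCase();
    }

    // Hypothetical branch B: some processing of column 2.
    static String processCity(String city) {
        return city.trim();
    }

    // Each input row is {id, name, city}. Every branch result is buffered
    // under the row's ID, and the buffers are joined once the input is read.
    static List<String[]> rejoin(List<String[]> input) {
        Map<String, String> branchA = new LinkedHashMap<>();
        Map<String, String> branchB = new LinkedHashMap<>();

        // "Replicate" step: every row feeds both branches.
        for (String[] row : input) {
            branchA.put(row[0], processName(row[1]));
            branchB.put(row[0], processCity(row[2]));
        }

        // "Join" step: look up the other branch's result by ID.
        List<String[]> merged = new ArrayList<>();
        for (Map.Entry<String, String> e : branchA.entrySet()) {
            String id = e.getKey();
            merged.add(new String[] { id, e.getValue(), branchB.get(id) });
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String[]> input = new ArrayList<>();
        input.add(new String[] { "1", "alice", " Paris " });
        input.add(new String[] { "2", "bob", "Rome" });
        for (String[] row : rejoin(input)) {
            System.out.println(String.join("|", row));
        }
    }
}
```

The buffers grow with the input, so for a 10 GB file this only moves the storage problem from disk to memory.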
One Star

Re: How to merge two replicated streams(tMap, tReplicate, tUnite)

Could you get round it with tDenormalize combined with a map rule if necessary?
Seven Stars

Re: How to merge two replicated streams(tMap, tReplicate, tUnite)

You need to remember that Talend actually creates a Java program; essentially that is the reason for the limitation you've encountered.
tUnite in essence says: for the following components, take all the data provided by each of the inputs in turn, i.e. all of A, then all of B, then all of C. It cannot take row 1 from A, then row 1 from B, then row 1 from C, then row 2 from A, then row 2 from B, etc., because of the nature of the programming loops used for each flow.
However, tMap with multiple outputs or tReplicate do send row 1 to A, then row 1 to B, then row 1 to C, then row 2 to A, then row 2 to B, etc. This is why you cannot split and then rejoin flows.
You should also note from the above that splitting a flow does NOT mean that the subsequent split flows will be processed in parallel; they will still be processed sequentially. The only way to introduce parallel processing would be to store the three data sets and then process those in parallel.
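The two row orders described above can be modelled in a few lines of plain Java. This is a simplified sketch and not the code Talend generates: the class and method names (`FlowOrderSketch`, `replicateOrder`, `uniteOrder`) are invented for the example. It only records the order in which rows reach each branch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FlowOrderSketch {

    // tReplicate / multi-output tMap as described: one loop over the input,
    // and each row is handed to every branch before the next row is read.
    static List<String> replicateOrder(List<String> rows, List<String> branches) {
        List<String> events = new ArrayList<>();
        for (String row : rows) {
            for (String branch : branches) {
                events.add(row + "->" + branch);
            }
        }
        return events;
    }

    // tUnite as described: each input is consumed completely, in turn.
    static List<String> uniteOrder(List<String> rows, List<String> branches) {
        List<String> events = new ArrayList<>();
        for (String branch : branches) {
            for (String row : rows) {
                events.add(branch + ":" + row);
            }
        }
        return events;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("row1", "row2");
        List<String> branches = Arrays.asList("A", "B", "C");
        // prints [row1->A, row1->B, row1->C, row2->A, row2->B, row2->C]
        System.out.println(replicateOrder(rows, branches));
        // prints [A:row1, A:row2, B:row1, B:row2, C:row1, C:row2]
        System.out.println(uniteOrder(rows, branches));
    }
}
```

The loops are nested the opposite way round in the two methods, which is the mismatch that prevents a split flow from being rejoined directly.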
Your screenprints don't show what you're trying to do in the tExtractRegexFields. Perhaps if you gave an example we might be able to suggest an alternative approach.
One Star

Re: How to merge two replicated streams(tMap, tReplicate, tUnite)

Thanks Alevy,
That was a great piece of information, though I am still skeptical about the parallel processing part. If that is the case, then I think the loop logic needs to be reviewed for tMap and tReplicate, and the best approach implemented in both.
--
Regards,
Vinod
Seven Stars

Re: How to merge two replicated streams(tMap, tReplicate, tUnite)

Again, it's not possible to change the behaviour of tMap and tReplicate because they pass each incoming row through to the following components one at a time. To pass the whole set through to A and then to B etc. would require it to be cached in memory or re-read from the source. The current setup is the most efficient programmatically and in execution.
Unfortunately, that fact combined with the reverse being best practice for tUnite is exactly why you can't split and then rejoin flows.
To prove that it's not processed in parallel, just look at the generated code, which clearly creates output row A and processes all components thereafter, then creates row B and processes all components thereafter, etc. To process in parallel, a new thread would have to be created for each set of processing.
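As a rough illustration of the difference, the plain-Java sketch below runs the same per-branch work once on the calling thread and once with one thread per branch over a stored data set. It is not Talend-generated code: the class `BranchThreadsSketch`, its methods and the placeholder `work` function are assumptions made for the example. Each method returns the name of the thread that handled each branch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BranchThreadsSketch {

    // Placeholder for whatever a branch does with the stored rows.
    static int work(String branch, List<String> storedRows) {
        int total = 0;
        for (String row : storedRows) {
            total += (branch + row).length();
        }
        return total;
    }

    // Sequential: every branch runs on the calling thread, one after another.
    static List<String> runSequentially(List<String> branches, List<String> storedRows) {
        List<String> threadNames = new ArrayList<>();
        for (String branch : branches) {
            work(branch, storedRows);
            threadNames.add(Thread.currentThread().getName());
        }
        return threadNames;
    }

    // Parallel: an explicit thread per branch, each reading the stored rows.
    static List<String> runInParallel(List<String> branches, List<String> storedRows) {
        ExecutorService pool = Executors.newFixedThreadPool(branches.size());
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String branch : branches) {
                futures.add(pool.submit(() -> {
                    work(branch, storedRows);
                    return Thread.currentThread().getName();
                }));
            }
            List<String> threadNames = new ArrayList<>();
            for (Future<String> f : futures) {
                threadNames.add(f.get()); // wait for that branch to finish
            }
            return threadNames;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> branches = Arrays.asList("A", "B", "C");
        List<String> rows = Arrays.asList("r1", "r2");
        System.out.println(runSequentially(branches, rows));
        System.out.println(runInParallel(branches, rows));
    }
}
```

In the sequential case every entry is the caller's own thread name, while in the threaded case the entries are pool thread names.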
One Star

Re: How to merge two replicated streams(tMap, tReplicate, tUnite)

Thanks alevy,
I think I must park this topic here; I have noted down the limitation and will try to work on it. Until then I will continue as suggested.
--
Regards,
Vinod