One Star

parallel Iterate link

I'm just starting to use the parallel Iterate link and I've seen some strange effects, so I decided to do some testing to better understand what's going on.
For me, parallel (multi-threaded) execution is a very important feature because I have to deal with a data volume that takes several days to process, and the only chance I see at the moment to get acceptable response times is to parallelize some tasks.
For testing, I created two tables in Oracle, and I'm using the first table to filter data in the second table (an extremely simplified case of what I was planning to do with the real data). When I run the job without multi-threading, I get a nice list of counts for every value pair.
When I set multi threaded = 2 on the Iterate link, I suddenly get a few lines twice while some others are missing.
When I set the parallelism to 20, I get a significant number of rows twice (and more go missing).
Unless I'm missing something about how I should build the job, the multi-threaded option is mixing up my data and the job is now returning wrong results.
Does anybody else have experience with multi-threaded execution in Talend?
5 REPLIES
Community Manager

Re: parallel Iterate link

Hello Vaiko
I replied to you in 3485.
Best regards
shong
One Star

Re: parallel Iterate link

Hi shong.
Thanks for your reply, but 3485 is unrelated to this.
3485 is about tRunJob not running in parallel. This post is about something else: parallel execution works for Iterate, and all 20 iterations run in parallel, but they produce wrong results.
3485 is a minor issue: if the jobs don't run in parallel, my job takes about 8 times as long as necessary (though when a job that should run every 24 hours takes more than a day, it is not so minor anymore, btw).
This post is about my job producing wrong results, and if I don't notice that (which is not always easy when you work with millions of rows), I'm in deep trouble.
One Star

Re: parallel Iterate link

Looking at the effects I get when using parallel Iterate links, I have the feeling that global variable handling does not work well with parallel execution.
Iterate links usually use global variables to pass parameters down to a sub job, because there is often no row link pointing to the sub job. Do the different sub jobs running in parallel each have a separate global variable list? If they share the same list, I will always get parameter mix-ups whenever a new thread starts. If they don't share the global variable list, then starting the threads is fine, but I need to be aware that I cannot communicate back to the rest of the job through global variables.
It is really not clear to me how parallel Iterate links can work. Can someone enlighten me?
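To make the question concrete, here is a minimal plain-Java sketch of the two possibilities. It is not Talend's generated code and says nothing about which model Talend actually uses; the class name, the key "current_param" and both methods are made up for illustration. A latch forces the worst case, where every worker reads its parameter only after all iterations have been started.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical model of "iterate + global variables"; not Talend code.
public class SharedGlobalMapSketch {

    // Case 1: all parallel sub jobs read from ONE shared variable list.
    static List<String> runWithSharedMap(List<String> params) throws Exception {
        Map<String, Object> globalMap = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, params.size()));
        CountDownLatch allStarted = new CountDownLatch(1);
        List<Future<String>> futures = new ArrayList<>();
        for (String p : params) {
            // Each iteration overwrites the same key before starting its worker.
            globalMap.put("current_param", p);
            futures.add(pool.submit(() -> {
                allStarted.await(); // worker reads late (forced worst case)
                return (String) globalMap.get("current_param");
            }));
        }
        allStarted.countDown();
        List<String> seen = new ArrayList<>();
        for (Future<String> f : futures) {
            seen.add(f.get());
        }
        pool.shutdown();
        return seen;
    }

    // Case 2: each sub job gets its own snapshot of the variable list.
    static List<String> runWithPrivateCopy(List<String> params) throws Exception {
        Map<String, Object> globalMap = new HashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, params.size()));
        CountDownLatch allStarted = new CountDownLatch(1);
        List<Future<String>> futures = new ArrayList<>();
        for (String p : params) {
            globalMap.put("current_param", p);
            // Copy taken at start time; later iterations cannot change it.
            Map<String, Object> localMap = new HashMap<>(globalMap);
            futures.add(pool.submit(() -> {
                allStarted.await();
                return (String) localMap.get("current_param");
            }));
        }
        allStarted.countDown();
        List<String> seen = new ArrayList<>();
        for (Future<String> f : futures) {
            seen.add(f.get());
        }
        pool.shutdown();
        return seen;
    }
}
```

With the parameters a, b, c, the shared version hands "c" to every worker (duplicates plus missing values, much like what I see), while the private-copy version hands out a, b, c as intended. The price of the second model is exactly the one mentioned above: anything a worker writes into its own copy is invisible to the rest of the job.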
One Star

Re: parallel Iterate link

In Parallel Iterative Algorithms: From Sequential to Grid Computing, Bahi, Contassot-Vivier, and Couturier bring mathematical formalism to the study of parallel iterative solution techniques, creating a book that will be useful to those with a strong maths background who are making the transition into parallel scientific computing. ... a great fit as a part of a graduate-level course on scientific computing in the math department, or for those already in scientific computing seeking to understand the key mathematical foundations of the analysis of iterative techniques.
One Star

Re: parallel Iterate link

Has anyone solved this yet? I'm having the same issue. As far as I can see, this could be solved with a FIFO queue component... if one existed.
Something like this:
Read data -> push to FIFO -> Iterate(x times parallel) -> pop from FIFO -> use data.
/Martin
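Outside of Talend components, the flow sketched above could look roughly like this in plain Java. This is only an illustration under assumptions: the class and method names are invented, the "use data" step is a stand-in (upper-casing a string), and all rows are pushed to the queue before the workers start, so an empty queue means the work is done.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch of: read data -> push to FIFO -> N parallel workers pop -> use data.
public class FifoFanOutSketch {

    static List<String> process(List<String> rows, int workers) throws Exception {
        // Push all rows to a thread-safe FIFO up front (rows must not contain null).
        Queue<String> fifo = new ConcurrentLinkedQueue<>(rows);
        Queue<String> results = new ConcurrentLinkedQueue<>();

        ExecutorService pool = Executors.newFixedThreadPool(workers);
        List<Future<?>> running = new ArrayList<>();
        for (int i = 0; i < workers; i++) {
            running.add(pool.submit(() -> {
                String row;
                // poll() removes the head atomically, so each row goes to exactly one worker.
                while ((row = fifo.poll()) != null) {
                    results.add(row.toUpperCase()); // stand-in for "use data"
                }
            }));
        }
        for (Future<?> f : running) {
            f.get(); // wait for every worker and surface any exception
        }
        pool.shutdown();

        // Completion order is not deterministic, so sort for a stable result.
        List<String> sorted = new ArrayList<>(results);
        Collections.sort(sorted);
        return sorted;
    }
}
```

Because each worker takes its input from the queue instead of reading a shared variable, no row should be handed out twice or skipped, whatever the number of workers.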