Parallel execution takes more time than the non-parallel execution


Hi Team,

 

I followed the instructions at the link below to build a job that executes in a loop:

https://help.talend.com/reader/mjoDghHoMPI0yuyZ83a13Q/iL2h45sTpz~InS1_0iOj5w

 

If I disable the tSleep component and select the Parallel Execution check box, the job takes longer to complete than it does without parallel execution.

I have attached a document with all the details about the job design and execution results.

Can you please let me know how to create a job with tLoop (iterate) and parallel execution enabled, so that parallel execution takes less time than non-parallel execution?

 

Thanks.

 

Employee

Re: Parallel execution takes more time than the non-parallel execution

Hi,

 

Parallel execution depends on many parameters, such as the number of available threads, the memory available to the Talend job, and CPU utilization.

Could you please describe your use case for parallel execution? I would recommend starting with the tParallelize component for parallel processing in Talend. Please also increase the Xms and Xmx parameters of the job. You will see the difference in performance when you run a job with a longer processing time.

 

Warm Regards,
Nikhil Thampi

Please appreciate our Talend community members by giving Kudos for sharing their time for your query. If your query is answered, please mark the topic as resolved


Re: Parallel execution takes more time than the non-parallel execution

Hi Nikhil,

 

Thanks for the response.

 

My use case is to process thousands of files located in a main directory and its sub directories.

Once we have read each document in the main directory and its sub directories, we send it to the Apache Solr API for indexing.

To improve performance, I thought of using Parallel Execution.

 

The tParallelize component is not available by default in TOS 7.1.1.

Is there any known issue with using the Parallel Execution option?

 

My system configuration is 16 GB RAM and i7-4790 processor.

 

The attachment has all the details about the job design and execution results.

Kindly let me know your thoughts on how we can use parallel execution to improve performance for the above use case.

 

Thanks.

 

 

Employee

Re: Parallel execution takes more time than the non-parallel execution

Hi,

 

Since you are using the free version of Talend, the parallelism options are limited. Could you please try starting multiple instances of the same job in parallel through a scheduler, passing the directory name as a parameter? That way, each instance of the job processes a specific directory, such as DirA or DirB. Please make sure you have enough memory when triggering multiple job instances.
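
As a rough illustration of the "one job instance per directory" idea, here is a minimal Java sketch that starts one operating-system process per directory and waits for all of them. It assumes the job has been exported as a launch script that accepts `--context_param name=value`; the script name `./ReadFiles3_run.sh` and the context variable name `directory` are hypothetical placeholders, not taken from this thread (on Windows the script would be a `.bat` file).

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class JobLauncher {
    // Builds the command line for one job instance.
    // "directory" is a hypothetical context variable name.
    static List<String> buildCommand(String script, String directory) {
        List<String> cmd = new ArrayList<>();
        cmd.add(script);
        cmd.add("--context_param");
        cmd.add("directory=" + directory);
        return cmd;
    }

    // Starts one process per directory, then waits for all of them.
    // Returns the number of instances that exited with a non-zero code.
    static int launchAll(String script, List<String> directories)
            throws IOException, InterruptedException {
        List<Process> running = new ArrayList<>();
        for (String dir : directories) {
            running.add(new ProcessBuilder(buildCommand(script, dir)).inheritIO().start());
        }
        int failures = 0;
        for (Process p : running) {
            if (p.waitFor() != 0) {
                failures++;
            }
        }
        return failures;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical script path and directory names.
        int failures = launchAll("./ReadFiles3_run.sh", Arrays.asList("DirA", "DirB"));
        System.out.println("failed instances: " + failures);
    }
}
```

Each instance is a separate JVM, so the memory settings of the job apply per instance; that is why the available memory limits how many instances can run at once.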

 

Warm Regards,
Nikhil Thampi



Re: Parallel execution takes more time than the non-parallel execution

Hello,

 

You can also use the iterate option. To do this, put the processing logic in a sub job and assign the directories or files in the main job. The main job iterates over each folder or file. You can increase the number of parallel iterations based on your resources. Do not forget to select the independent sub job option when calling the sub job. Your job will look like this:

 

tFileList ---iterate--> tRunJob
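
For readers who think in code, the pattern above corresponds roughly to listing files and handing each one to a fixed-size worker pool. The following is a plain Java sketch of that idea, not Talend-generated code; `processFile` is a hypothetical stand-in for the sub job and only returns the file name.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class ParallelFileIterator {
    // Hypothetical stand-in for the sub job: returns the file name only.
    static String processFile(Path file) {
        return file.getFileName().toString();
    }

    // Lists regular files under root (sub directories included) and
    // processes them on a pool of `threads` worker threads.
    static List<String> processAll(Path root, int threads)
            throws IOException, InterruptedException, ExecutionException {
        List<Path> files;
        try (Stream<Path> walk = Files.walk(root)) {
            files = walk.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        }
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (Path f : files) {
                Callable<String> task = () -> processFile(f);
                futures.add(pool.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                results.add(future.get()); // blocks until that task is done
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        processAll(root, 2).forEach(System.out::println);
    }
}
```

The pool size plays the role of the number of parallel iterations: a larger value only helps while CPU, memory, and disk can keep up.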

 

 

Thanks & Regards,
Chandu.

Re: Parallel execution takes more time than the non-parallel execution

Hi Chandu,

 

Thanks for the response.

 

Here are my findings:

The job was designed as you suggested, with the components below:

tFileList ---iterate--> tRunJob

 

  • tFileList
  • Parallel Iterate link -  to enable/disable parallelism
  • tRunJob
  • tFixedFlowInput
  • tLogRow
  • Independent sub job option enabled when calling the sub job

Test data: 2008 documents distributed across 8 main directories, each with 100 sub directories.

 

Test Results:

  • Without Parallelism - 3703 milliseconds
  • With Parallelism      - 3469 milliseconds


After selecting the independent sub job option when calling the sub job, it took 78665 milliseconds.

Kindly let me know if any other configuration or tuning needs to be done.

 

Thanks.


Re: Parallel execution takes more time than the non-parallel execution

Hello,

 

When using the independent sub job option, the system must have enough resources, because each sub job is executed as a separate instance. Try reducing the number of parallel iterations.
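
To put the timings reported earlier in this thread into perspective, here is a small worked calculation in Java. It assumes one sub job call per document and treats the 3469 ms parallel run as the baseline; both are assumptions for illustration, and the result is only a rough estimate of the per-call cost of starting a separate instance.

```java
class SubJobOverhead {
    // Extra wall-clock time per item when each call carries a fixed startup cost.
    static double extraMillisPerItem(long totalWithOverheadMs, long baselineMs, int items) {
        return (double) (totalWithOverheadMs - baselineMs) / items;
    }

    public static void main(String[] args) {
        // 78665 ms with the independent sub job option, 3469 ms without it,
        // 2008 documents (figures reported in the thread).
        double extra = extraMillisPerItem(78665, 3469, 2008);
        System.out.printf("extra time per document: %.1f ms%n", extra);
    }
}
```

Under these assumptions the extra cost comes to roughly 37 ms per document, which is far more than the work done per document in this test. That is consistent with startup cost, rather than the processing itself, dominating the run.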

 

 

Thanks & Regards,
Chandu.

Re: Parallel execution takes more time than the non-parallel execution

Hi Chandu,

 

My system configuration is 16 GB RAM and i7-4790 CPU.

 

I set the number of parallel executions to 2, and the job then took 148814 milliseconds.

I am not running any other major processes besides Talend Open Studio 7.1.1.

 

Thanks.


Issue with Enabling Parallel execution for Iterate - Talend Data Integration 6.5.1

Hi Team.

 

Issue - Parallel execution (with the number of parallel executions set to 2) took 338 milliseconds, whereas non-parallel execution took 49 seconds to get the document names of 15 PDF files (just the file names, without reading the file content) in a directory.

Use Case - Get the document names of the files in a directory in parallel


Talend Tool - Talend Data Integration 6.5.1

 

Main Job Design Details:

  • tFileList gets the list of all files in a directory.
  • An Iterate link connects tFileList to tRunJob, with Enable parallel execution selected so that files are processed in parallel. The number of parallel executions was set to 2.
  • The file name is passed to the sub job as a context parameter:
    (String)globalMap.get("tFileList_1_CURRENT_FILEPATH")

Sub Job Details:

  • The file name is read from the context parameter:
    context.FileName
  • tFixedFlowInput generates a fixed flow from internal variables.
  • tLogRow prints the PDF file name to the console.
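
The globalMap expression in the main job is an ordinary map lookup followed by a cast. The stand-alone Java sketch below imitates it with a hand-filled map (in a real job tFileList populates the entry) and shows one way to derive the bare file name from the full path, since CURRENT_FILEPATH holds the full path. The helper `fileNameOf` is hypothetical and not part of Talend.

```java
import java.util.HashMap;
import java.util.Map;

class GlobalMapLookup {
    // Returns the last path segment, accepting both '\' and '/' separators.
    static String fileNameOf(String path) {
        int cut = Math.max(path.lastIndexOf('\\'), path.lastIndexOf('/'));
        return path.substring(cut + 1);
    }

    public static void main(String[] args) {
        // Stand-in for Talend's globalMap; filled by hand for this sketch.
        Map<String, Object> globalMap = new HashMap<>();
        globalMap.put("tFileList_1_CURRENT_FILEPATH",
                "D:\\Data\\GeoDatafy\\Lakshmi\\25092019\\IndexTest\\Temp\\28997.pdf");

        // Same expression as in the main job: lookup plus cast.
        String filePath = (String) globalMap.get("tFileList_1_CURRENT_FILEPATH");
        System.out.println(filePath);             // full path
        System.out.println(fileNameOf(filePath)); // file name only
    }
}
```

This explains why the console output in the test results shows full paths rather than bare file names.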


My System Configuration:

i7-4790 processor with 16 GB RAM

Number of CPU cores: 4

 

Test Data:

I tried to get the file names of 15 PDF files stored in a directory in parallel.


Test Results:

Parallel execution (with the number of parallel executions set to 2) took 338 milliseconds, whereas non-parallel execution took 49 seconds to get the names of the 15 PDF files.


Test Results Details:

Test results without parallel execution enabled:


Starting job ReadFiles3 at 14:16 11/10/2019.

[statistics] connecting to socket on port 3857
[statistics] connected
.-----------------------------------------------------------.
| tLogRow_1 |
|=---------------------------------------------------------=|
|FileName |
|=---------------------------------------------------------=|
|D:\Data\GeoDatafy\Lakshmi\25092019\IndexTest\Temp\28997.pdf|
'-----------------------------------------------------------'
.-----------------------------------------------------------.
| tLogRow_1 |
|=---------------------------------------------------------=|
|FileName |
|=---------------------------------------------------------=|
|D:\Data\GeoDatafy\Lakshmi\25092019\IndexTest\Temp\34684.pdf|
'-----------------------------------------------------------'

.-----------------------------------------------------------.
| tLogRow_1 |
|=---------------------------------------------------------=|
|FileName |
|=---------------------------------------------------------=|
|D:\Data\GeoDatafy\Lakshmi\25092019\IndexTest\Temp\34686.pdf|
'-----------------------------------------------------------'

.-----------------------------------------------------------.
| tLogRow_1 |
|=---------------------------------------------------------=|
|FileName |
|=---------------------------------------------------------=|
|D:\Data\GeoDatafy\Lakshmi\25092019\IndexTest\Temp\35495.pdf|
'-----------------------------------------------------------'
.-----------------------------------------------------------.
| tLogRow_1 |
|=---------------------------------------------------------=|
|FileName |
|=---------------------------------------------------------=|
|D:\Data\GeoDatafy\Lakshmi\25092019\IndexTest\Temp\36491.pdf|
'-----------------------------------------------------------'

.-----------------------------------------------------------.
| tLogRow_1 |
|=---------------------------------------------------------=|
|FileName |
|=---------------------------------------------------------=|
|D:\Data\GeoDatafy\Lakshmi\25092019\IndexTest\Temp\36651.pdf|
'-----------------------------------------------------------'

.-----------------------------------------------------------.
| tLogRow_1 |
|=---------------------------------------------------------=|
|FileName |
|=---------------------------------------------------------=|
|D:\Data\GeoDatafy\Lakshmi\25092019\IndexTest\Temp\37561.pdf|
'-----------------------------------------------------------'
.-----------------------------------------------------------.
| tLogRow_1 |
|=---------------------------------------------------------=|
|FileName |
|=---------------------------------------------------------=|
|D:\Data\GeoDatafy\Lakshmi\25092019\IndexTest\Temp\37905.pdf|
'-----------------------------------------------------------'

.-----------------------------------------------------------.
| tLogRow_1 |
|=---------------------------------------------------------=|
|FileName |
|=---------------------------------------------------------=|
|D:\Data\GeoDatafy\Lakshmi\25092019\IndexTest\Temp\39296.pdf|
'-----------------------------------------------------------'

.-----------------------------------------------------------.
| tLogRow_1 |
|=---------------------------------------------------------=|
|FileName |
|=---------------------------------------------------------=|
|D:\Data\GeoDatafy\Lakshmi\25092019\IndexTest\Temp\40822.pdf|
'-----------------------------------------------------------'
.-----------------------------------------------------------.
| tLogRow_1 |
|=---------------------------------------------------------=|
|FileName |
|=---------------------------------------------------------=|
|D:\Data\GeoDatafy\Lakshmi\25092019\IndexTest\Temp\56645.pdf|
'-----------------------------------------------------------'

.-----------------------------------------------------------.
| tLogRow_1 |
|=---------------------------------------------------------=|
|FileName |
|=---------------------------------------------------------=|
|D:\Data\GeoDatafy\Lakshmi\25092019\IndexTest\Temp\56647.pdf|
'-----------------------------------------------------------'

.-----------------------------------------------------------.
| tLogRow_1 |
|=---------------------------------------------------------=|
|FileName |
|=---------------------------------------------------------=|
|D:\Data\GeoDatafy\Lakshmi\25092019\IndexTest\Temp\57195.pdf|
'-----------------------------------------------------------'
.-----------------------------------------------------------.
| tLogRow_1 |
|=---------------------------------------------------------=|
|FileName |
|=---------------------------------------------------------=|
|D:\Data\GeoDatafy\Lakshmi\25092019\IndexTest\Temp\57197.pdf|
'-----------------------------------------------------------'

.---------------------------------------------------------------.
| tLogRow_1 |
|=-------------------------------------------------------------=|
|FileName |
|=-------------------------------------------------------------=|
|D:\Data\GeoDatafy\Lakshmi\25092019\IndexTest\Temp\Geodatafy.pdf|
'---------------------------------------------------------------'

49 milliseconds
[statistics] disconnected
Job ReadFiles3 ended at 14:16 11/10/2019. [exit code=0]

 

 

Test results with parallel execution enabled:

 

Starting job ReadFiles3 at 14:15 11/10/2019.

[statistics] connecting to socket on port 4069
[statistics] connected
.-----------------------------------------------------------.
| tLogRow_1 |
|=---------------------------------------------------------=|
|FileName |
|=---------------------------------------------------------=|
|D:\Data\GeoDatafy\Lakshmi\25092019\IndexTest\Temp\28997.pdf|
'-----------------------------------------------------------'

.-----------------------------------------------------------.
| tLogRow_1 |
|=---------------------------------------------------------=|
|FileName |
|=---------------------------------------------------------=|
|D:\Data\GeoDatafy\Lakshmi\25092019\IndexTest\Temp\34684.pdf|
'-----------------------------------------------------------'

.-----------------------------------------------------------.
| tLogRow_1 |
|=---------------------------------------------------------=|
|FileName |
|=---------------------------------------------------------=|
|D:\Data\GeoDatafy\Lakshmi\25092019\IndexTest\Temp\35495.pdf|
'-----------------------------------------------------------'

.-----------------------------------------------------------.
| tLogRow_1 |
|=---------------------------------------------------------=|
|FileName |
|=---------------------------------------------------------=|
|D:\Data\GeoDatafy\Lakshmi\25092019\IndexTest\Temp\34686.pdf|
'-----------------------------------------------------------'

.-----------------------------------------------------------.
| tLogRow_1 |
|=---------------------------------------------------------=|
|FileName |
|=---------------------------------------------------------=|
|D:\Data\GeoDatafy\Lakshmi\25092019\IndexTest\Temp\36651.pdf|
'-----------------------------------------------------------'

.-----------------------------------------------------------.
| tLogRow_1 |
|=---------------------------------------------------------=|
|FileName |
|=---------------------------------------------------------=|
|D:\Data\GeoDatafy\Lakshmi\25092019\IndexTest\Temp\36491.pdf|
'-----------------------------------------------------------'
.-----------------------------------------------------------.
| tLogRow_1 |
|=---------------------------------------------------------=|
|FileName |
|=---------------------------------------------------------=|
|D:\Data\GeoDatafy\Lakshmi\25092019\IndexTest\Temp\37905.pdf|
'-----------------------------------------------------------'

.-----------------------------------------------------------.
| tLogRow_1 |
|=---------------------------------------------------------=|
|FileName |
|=---------------------------------------------------------=|
|D:\Data\GeoDatafy\Lakshmi\25092019\IndexTest\Temp\37561.pdf|
'-----------------------------------------------------------'

.-----------------------------------------------------------.
| tLogRow_1 |
|=---------------------------------------------------------=|
|FileName |
|=---------------------------------------------------------=|
|D:\Data\GeoDatafy\Lakshmi\25092019\IndexTest\Temp\39296.pdf|
'-----------------------------------------------------------'

.-----------------------------------------------------------.
| tLogRow_1 |
|=---------------------------------------------------------=|
|FileName |
|=---------------------------------------------------------=|
|D:\Data\GeoDatafy\Lakshmi\25092019\IndexTest\Temp\40822.pdf|
'-----------------------------------------------------------'
.-----------------------------------------------------------.
| tLogRow_1 |
|=---------------------------------------------------------=|
|FileName |
|=---------------------------------------------------------=|
|D:\Data\GeoDatafy\Lakshmi\25092019\IndexTest\Temp\56645.pdf|
'-----------------------------------------------------------'

.-----------------------------------------------------------.
| tLogRow_1 |
|=---------------------------------------------------------=|
|FileName |
|=---------------------------------------------------------=|
|D:\Data\GeoDatafy\Lakshmi\25092019\IndexTest\Temp\56647.pdf|
'-----------------------------------------------------------'

.-----------------------------------------------------------.
| tLogRow_1 |
|=---------------------------------------------------------=|
|FileName |
|=---------------------------------------------------------=|
|D:\Data\GeoDatafy\Lakshmi\25092019\IndexTest\Temp\57197.pdf|
'-----------------------------------------------------------'
.---------------------------------------------------------------.
| tLogRow_1 |
|=-------------------------------------------------------------=|
|FileName |
|=-------------------------------------------------------------=|
|D:\Data\GeoDatafy\Lakshmi\25092019\IndexTest\Temp\Geodatafy.pdf|
'---------------------------------------------------------------'

.-----------------------------------------------------------.
| tLogRow_1 |
|=---------------------------------------------------------=|
|FileName |
|=---------------------------------------------------------=|
|D:\Data\GeoDatafy\Lakshmi\25092019\IndexTest\Temp\57195.pdf|
'-----------------------------------------------------------'
338 milliseconds
[statistics] disconnected
Job ReadFiles3 ended at 14:15 11/10/2019. [exit code=0]


Can you please let us know what the issue could be when we enable parallel execution?

 

Thanks in advance for your help!

Employee

Re: Issue with Enabling Parallel execution for Iterate - Talend Data Integration 6.5.1

Hi,

 

You are saying that parallel execution takes milliseconds, whereas serial execution takes around 50 seconds. So could you please tell me what issue you are facing now? I am slightly confused here.

 

Warm Regards,
Nikhil Thampi



Re: Issue with Enabling Parallel execution for Iterate - Talend Data Integration 6.5.1

Hi Nikhil,

 

Apologies for the typo. Here is the corrected issue:

Issue - Parallel execution (with the number of parallel executions set to 2) took 338 milliseconds, whereas serial execution took 49 milliseconds to get the document names of 15 PDF files (just the file names, without reading the file content) in a directory.

Could you please let me know why parallel execution takes more time than serial execution?

 

Thanks for your response.

Employee

Re: Issue with Enabling Parallel execution for Iterate - Talend Data Integration 6.5.1

Hi,

 

The first check I would do is to increase the memory parameters and compare the performance results.

 

Warm Regards,
Nikhil Thampi



Re: Issue with Enabling Parallel execution for Iterate - Talend Data Integration 6.5.1

Hi Nikhil,

 

I increased the memory for the main job as well as the sub job, as follows:

-Xms1024M

-Xmx1792M
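
One way to confirm that the JVM arguments above actually reached the running job is to print the heap limits from inside the JVM. The sketch below uses only standard `java.lang.Runtime` calls; running it from a tJava component inside the job is an assumption about where you would place it, and the value reported by `maxMemory()` is typically close to, but not always exactly, the -Xmx setting.

```java
class HeapCheck {
    // Converts a byte count to whole mebibytes.
    static long bytesToMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // Upper bound the heap may grow to (governed by -Xmx).
        System.out.println("max heap (MiB): " + bytesToMiB(rt.maxMemory()));
        // Heap currently reserved by the JVM (starts near -Xms).
        System.out.println("current heap (MiB): " + bytesToMiB(rt.totalMemory()));
    }
}
```

For example, `java -Xms1024M -Xmx1792M HeapCheck.java` runs the check with the same settings on Java 11 or later.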

 

Parallel execution took 338 milliseconds, whereas serial execution took 46 milliseconds.

 

[Attachment: sub job settings.png]

[Attachment: Main Job Settings.png]

 

Thanks.

Employee

Re: Issue with Enabling Parallel execution for Iterate - Talend Data Integration 6.5.1

Hi,

 

You will see a bigger difference in performance for longer-running jobs.
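
The effect can be reproduced outside Talend with a minimal Java sketch that runs the same 15 tasks serially and on a 2-thread pool, once with no work per task and once with 50 ms of simulated work. The task count and pool size mirror the test in this thread; the 50 ms figure and the `Thread.sleep` stand-in for real work are assumptions for illustration. No timings are quoted here because they vary by machine.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class OverheadDemo {
    // Simulated unit of work: sleeps for workMillis, then returns its id.
    static int work(int id, long workMillis) throws InterruptedException {
        if (workMillis > 0) {
            Thread.sleep(workMillis);
        }
        return id;
    }

    static List<Integer> runSerial(int tasks, long workMillis) throws Exception {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < tasks; i++) {
            out.add(work(i, workMillis));
        }
        return out;
    }

    static List<Integer> runParallel(int tasks, long workMillis, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (int i = 0; i < tasks; i++) {
                final int id = i;
                Callable<Integer> task = () -> work(id, workMillis);
                futures.add(pool.submit(task));
            }
            List<Integer> out = new ArrayList<>();
            for (Future<Integer> f : futures) {
                out.add(f.get()); // results are collected in submission order
            }
            return out;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        for (long workMillis : new long[] {0, 50}) {
            long t0 = System.nanoTime();
            runSerial(15, workMillis);
            long t1 = System.nanoTime();
            runParallel(15, workMillis, 2);
            long t2 = System.nanoTime();
            System.out.printf("work=%d ms: serial=%d ms, parallel=%d ms%n",
                    workMillis, (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
        }
    }
}
```

With no work per task, the parallel run pays for creating threads and handing off tasks and gains nothing in return. Once each task does meaningful work, that fixed cost is spread over a much longer run and the second thread starts to pay off.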

 

Since your job executes in milliseconds, I would pick either of them; at the end of the day, it is a batch process.

My view is that, since you are using the open source version of Talend, your options will be limited. If you are building a mission-critical application that requires a lot of benchmarking, I would opt for the enterprise version of Talend, which has more parallel processing features.

 

Warm Regards,
Nikhil Thampi



Re: Issue with Enabling Parallel execution for Iterate - Talend Data Integration 6.5.1

Hi Nikhil,

 

As I mentioned in my previous post, I am using the 30-day trial version of Talend Data Integration 6.5.1 to evaluate the parallel execution feature for the iterate loop.

I downloaded the trial version of Talend Data Integration 6.5.1 from the link below:

https://www.talend.com/download/

Can you please provide a working sample job design (if you have one) that enables parallel execution for the iterate loop?

 

Thanks,

Narayana.

Employee

Re: Issue with Enabling Parallel execution for Iterate - Talend Data Integration 6.5.1

Hi,

 

I would recommend that you go through the tParallelize component documentation along with its sample scenarios.

 

https://help.talend.com/reader/hCrOzogIwKfuR3mPf~LydA/J2Gx1RZDoO6xyhQalZJqUQ

 

Warm Regards,
Nikhil Thampi



Re: Issue with Enabling Parallel execution for Iterate - Talend Data Integration 6.5.1

Hi Nikhil,

 

Thanks for the response. I will explore the tParallelize component and update you.

Do you see any issue with enabling parallel execution for the iterate loop, or would you advise against using it?

 

Thanks,

Narayana.
