Four Stars

Can passing contexts from Parent Job to Child Job slow down my job performance

I am parsing an XML file in the parent job. There, I extract a portion of the whole node structure from the XML and put it in a String variable (let's say A). I pass A from the parent job to the child job using the tRunJob context parameters, and parse that node structure in the child job.

It is slowing down my job as per my initial analysis.

 

With tRunJob in my job design:

Performance is 15 rows/s

 

Without tRunJob in my job design (whole child job design placed in the parent job):

Performance is 240 rows/s

 

Q: Is there anything I can do to improve performance using tRunJob with context passing? 

Q: Is it bad to use too many tRunJob components in job design?

Q: Is it bad to pass large String data from Parent Job to Child Job using context parameters?

 

 

 

16 REPLIES
Four Stars

Re: Can passing contexts from Parent Job to Child Job slow down my job performance

Same issue here. I am running the same functionality with and without tRunJob, and the speed is reduced by 13%.

localhost pub_no insert.png
localhost pub_no with tRunJob.png

When I use tRunJob, the speed drops from 165 to 143 (13%).

 

EDIT

Attaching tRunJob Config

Screen Shot 2018-04-12 at 3.51.05 PM.png

 

 

Fifteen Stars

Re: Can passing contexts from Parent Job to Child Job slow down my job performance

Have you tried building your job and running it from the command line? Running in the Studio is slightly different from running a job in its compiled state. From my experience you will notice that the job runs a lot faster when compiled. There will be some latency from starting a child job, but it shouldn't be anywhere near what you have described. I would *guess* that what you are seeing is largely down to the Studio.

Rilhia Solutions
Four Stars

Re: Can passing contexts from Parent Job to Child Job slow down my job performance

I already tried running after compiling from command line and there was an improvement but not much.

My biggest slow down is context passing from Parent to Child job.

Fifteen Stars

Re: Can passing contexts from Parent Job to Child Job slow down my job performance

You really need to post a screenshot of your job design and give us an idea of how big the XML you are passing is. Your tRunJob config would also be useful. I would expect a slight slowdown because you are adding a child job start and end for each row of data you are processing, but what you have described is not in line with what I would expect. It could be that a slight tweak of your job will solve this.

Rilhia Solutions
Four Stars

Re: Can passing contexts from Parent Job to Child Job slow down my job performance

The XML file sizes range from 500 MB to 8 GB.

The job design screenshot has already been provided by @linto_cheeran, please refer to that.

@linto_cheeran, please provide the tRunJob configuration screenshot too.

Four Stars

Re: Can passing contexts from Parent Job to Child Job slow down my job performance

I tested reading the string from context.

Screen Shot 2018-04-12 at 4.00.23 PM.png

Reading the string from context is the slowest thing in the entire child job.

Fifteen Stars

Re: Can passing contexts from Parent Job to Child Job slow down my job performance

Hang on, @linto_cheeran said the performance fell by 13%. That isn't that bad considering there were over 6000 child job initializations. When a job starts, it runs processes to load context variables, initialise the AMC logging functionality, etc. There will be a performance hit. The trade-off is between readability and performance. If you add extra code, it is always going to take longer.

I was more interested in the example where it went from 240 rows a second to 15. That sort of degradation in performance is well beyond what I'd expect.

Rilhia Solutions
Four Stars

Re: Can passing contexts from Parent Job to Child Job slow down my job performance

I have created a simple job that prints an XML fragment. In normal execution it processes 1038 rows per second.

iteration without child.png

The same job with a child job and argument passing via context drops to 75 rows per second (a 93% reduction).

iteration with child job & context.png
iteration with child job & context subjob.png

Next I tested only the tRunJob without passing any argument (just printing a constant string). The processing speed is again 75 rows per second (a 93% reduction).

Screen Shot 2018-04-13 at 11.42.46 AM.png
Screen Shot 2018-04-13 at 11.43.02 AM.png

So my conclusion is that using tRunJob for iteration reduces performance by at least 90%.

Fifteen Stars

Re: Can passing contexts from Parent Job to Child Job slow down my job performance

Given your example above, that is what I would expect. You are essentially comparing just printing some text 6000 times with initializing a new job to print a line of text and then closing that job, 6000 times. It will slow things down dramatically in that scenario. Adding a child job is never going to speed things up. It is really meant to simplify your jobs and enable you to reuse bits and pieces in different areas. 

Rilhia Solutions
Four Stars

Re: Can passing contexts from Parent Job to Child Job slow down my job performance

While it is right to expect a drop in performance when using tRunJob (for better organization), would you expect a drop of 90%? That doesn't seem right. An additional overhead of 1-5% would be understandable.

We hit the maximum size limit for a Java method when we tried to keep all the functionality in the main job itself.

In this case the subjob is not doing much (to keep the example simple), but in the real world it would process the fragment and store it in a DB.

Do you see any other way to organize a large job (i.e. without using tRunJob)? Please suggest.

Fifteen Stars

Re: Can passing contexts from Parent Job to Child Job slow down my job performance

Look at the other example, where something was actually being carried out in the job. There was apparently a drop of 13%. If you are comparing...

1) Acquire text
2) Print text

...6000 times with...

1) Acquire text
2) Send it to a child job
2.1) Instantiate the child job
2.2) Synchronise contexts
2.3) Deal with other built-in startup code (including logging, error handling, component globalMap interaction, etc.)
2.4) Generate a row with your context variable data
2.5) Print that data
2.6) Shut down the child job
2.7) Pass control back to the parent job

...6000 times, do you see where the example of a 90% slowdown is somewhat unfair?
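To make the two patterns concrete, here is a rough Java sketch that only counts start-ups. It is not Talend-generated code: `ChildJobSketch`, `runPerRow` and `runOnce` are made-up names, and the constructor is an assumed stand-in for the instantiate, synchronise and start-up steps listed above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a child job; not Talend-generated code.
class ChildJobSketch {
    static int startups = 0; // how many times the "child job" was started

    private final List<String> printed = new ArrayList<>();

    ChildJobSketch() {
        // Stands in for steps 2.1 to 2.3: instantiate, synchronise contexts, start-up code.
        startups++;
    }

    void process(String fragment) {
        // Stands in for steps 2.4 and 2.5: the useful work.
        printed.add(fragment);
    }

    int processedCount() {
        return printed.size();
    }

    // Pattern A: one child start per row.
    static int runPerRow(List<String> rows) {
        int processed = 0;
        for (String row : rows) {
            ChildJobSketch child = new ChildJobSketch();
            child.process(row);
            processed += child.processedCount();
        }
        return processed;
    }

    // Pattern B: one child start for the whole batch.
    static int runOnce(List<String> rows) {
        ChildJobSketch child = new ChildJobSketch();
        for (String row : rows) {
            child.process(row);
        }
        return child.processedCount();
    }

    public static void main(String[] args) {
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < 6000; i++) {
            rows.add("<node id=\"" + i + "\"/>");
        }

        startups = 0;
        int a = runPerRow(rows);
        System.out.println("per row: processed=" + a + ", startups=" + startups);

        startups = 0;
        int b = runOnce(rows);
        System.out.println("once:    processed=" + b + ", startups=" + startups);
    }
}
```

Both patterns process the same rows; the only difference is how many times the start-up cost is paid.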

Rilhia Solutions
Fifteen Stars

Re: Can passing contexts from Parent Job to Child Job slow down my job performance

To reduce the size of your job and make it faster there are a number of things you can do. I can't give you detailed answers to this, but I can make suggestions.

 

1) Extract your data from the XML file and load it into a flattened structure in a DB. Do this in one job. 

2) Read in the data from the database (this will be much quicker than iterating through child jobs) and process the whole recordset in a child job.

3) Build a "Main Job" that will call the job at step 1 and then call the job at step 2

 

This is very high level, but essentially you are trying to remove the 6000 iterations of running your child job.

 

Rilhia Solutions
Four Stars

Re: Can passing contexts from Parent Job to Child Job slow down my job performance

Are there any other components available in Talend to do this without any performance loss?

Fifteen Stars

Re: Can passing contexts from Parent Job to Child Job slow down my job performance

How do you mean? There are dozens of ways to do this. I simply suggested how I would do it above. The tRunJobs are really useful, but you need to make sure you are not using them inefficiently. Running a tRunJob 6000 times is not ideal, but if you could get the tRunJob to run once and carry out the processing inside the tRunJob 6000 times, that will be much faster. However, with this usually comes the issue of passing a large amount of data in and out of child jobs. You *can* load your data into an in-memory collection (an ArrayList for example) and pass that to the tRunJob as an Object, cast it back to an ArrayList when inside the child job, then process the data. Then when you want the data out you can return it using a tBufferOutput component. That would be the way to achieve my suggestion above without using a database.
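As a rough illustration of the Object-context idea, here is a minimal Java sketch. It is not Talend-generated code: `ChildContext`, `fragments` and `runChild` are hypothetical names. In a real job the cast would presumably live in custom Java code (a tJava component, for example) reading an Object-typed context variable, and the returned list here only stands in for the rows you would send to tBufferOutput.

```java
import java.util.ArrayList;
import java.util.List;

class ObjectContextSketch {
    // Hypothetical stand-in for a child job's context with one Object-typed variable.
    static class ChildContext {
        Object fragments;
    }

    // Stand-in for the child job: cast the Object back to a List, then process every entry.
    static List<String> runChild(ChildContext context) {
        @SuppressWarnings("unchecked")
        List<String> fragments = (List<String>) context.fragments;

        List<String> output = new ArrayList<>(); // plays the role of the tBufferOutput rows
        for (String fragment : fragments) {
            output.add(fragment.trim());
        }
        return output;
    }

    public static void main(String[] args) {
        // Parent side: collect all fragments first, then start the child once.
        List<String> collected = new ArrayList<>();
        collected.add("  <node id=\"1\"/>  ");
        collected.add("  <node id=\"2\"/>  ");

        ChildContext context = new ChildContext();
        context.fragments = collected; // passed as a plain Object

        for (String row : runChild(context)) {
            System.out.println(row);
        }
    }
}
```

The cast is unchecked, so the parent and child have to agree on the element type by convention.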

I tend to find that people want to force one tool to do everything, when maybe another tool is better. For example, I have running shoes for when I go running. They are great. But if I use them for tennis, changing direction quickly becomes a little dangerous. I *can* use them, but it is much better to use tennis shoes. You can certainly achieve the processing of your data in memory using just Talend components, and you can almost certainly get the job to run a lot faster using just Talend components as well. But if you have a database to hand, why not make use of the power of the DB to join/filter/etc? I'm not sure that your requirements NEED a database, but it is a good example that I see regularly. Using a database with Talend is not cheating and does not highlight a flaw in Talend (or any other integration tool); it is just using the power of a related tool to enhance what Talend can do.

Rilhia Solutions
Four Stars

Re: Can passing contexts from Parent Job to Child Job slow down my job performance

From the above, the costs seem to be as follows:

TotalCost = ParentJob(SetupTeardownCost + ProcessingCost) + 6000 * ChildJob(SetupTeardownCost + ProcessingCost)

The subJob setup-teardown costs are significant, so a subJob should be used only if its processing work (cost) is much larger than this cost. Within the setup-teardown cost, I am wondering whether there is a variable component (maybe context synchronization?) that makes the cost more pronounced. If there were no context sharing, or only very light context sharing, between parent and child job, would the subJob overhead be much smaller?
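A small Java sketch of this cost model, using the 1038 and 75 rows per second figures reported earlier in the thread. The class and method names are made up for illustration, and it assumes the whole per-row difference between the two runs is child setup-teardown cost, which the thread does not confirm.

```java
class CostModelSketch {
    // Milliseconds spent per row at a given throughput.
    static double msPerRow(double rowsPerSecond) {
        return 1000.0 / rowsPerSecond;
    }

    // Extra milliseconds per row when each row goes through a child job,
    // assuming the whole difference is child setup-teardown cost.
    static double childOverheadMs(double rateWithoutChild, double rateWithChild) {
        return msPerRow(rateWithChild) - msPerRow(rateWithoutChild);
    }

    // TotalCost = parent(setup + processing) + childRuns * child(setup + processing)
    static double totalCostMs(double parentSetupMs, double parentProcessingMs,
                              int childRuns, double childSetupMs, double childProcessingMs) {
        return parentSetupMs + parentProcessingMs
                + childRuns * (childSetupMs + childProcessingMs);
    }

    public static void main(String[] args) {
        double overhead = childOverheadMs(1038.0, 75.0);
        System.out.printf("overhead per child run: %.2f ms%n", overhead);

        // Parent costs and child processing cost are set to 0 because the thread gives no figures for them.
        double totalMs = totalCostMs(0, 0, 6000, overhead, 0);
        System.out.printf("overhead across 6000 child runs: %.1f s%n", totalMs / 1000.0);
    }
}
```

With those two rates the overhead works out to roughly 12 ms per child run, so the child's own processing would need to take well over that per row before the start-up cost stops dominating.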

 

Are there any best practices as far as Context sharing is concerned that affects performance?

Fifteen Stars

Re: Can passing contexts from Parent Job to Child Job slow down my job performance

There will be an overhead for context synchronisation since there is casting involved. You can see this if you look at the code. In order to see the cost of this, you can create a scenario where you are sharing the values that cause your current job to take so long and then recreate that exact scenario, but sharing fewer context variables. I do not know the precise cost, but would imagine that it will be reasonably significant. 

I am not aware of anyone writing up any Context Sharing best practices if I am honest. I tend to share all of my context variables across all of my child jobs, but I rarely (unless I have to) use context variables for anything other than configuration when I have a database to work with. There are times when I do put my data into a Java collection and send the data into a child job in an Object context, but that is usually when I do not have a database to play with. In that situation I expect a little slowdown, but also make sure that that happens as few times as possible.

Rilhia Solutions