Append XML performance degrades exponentially

Four Stars


Hi all,
I have a problem very similar to this one:
I'm also doing the trick where you have to set the "Append the source XML file" option in tAdvancedOutputXML in order to get 2 loop elements in one XML file.
Like that poster, my performance is very poor. I set an expression filter to limit the results and found that performance degrades exponentially as the number of rows/nodes in a file increases. For instance, when I limit it to ~350 nodes, I'm getting 117 rows/sec. When I remove the filter expression and it goes to 5K rows, performance degrades to 0.9 rows/sec. The bad news is that in production there should be around 17K rows. I imagine performance will drop another order of magnitude or more.
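As a rough check on those figures, here is a minimal sketch that fits a power law (time ~ rows^k) through the two measurements quoted above. The class and method names are made up for illustration, the row counts are approximate, and two data points cannot distinguish a power law from a true exponential, so treat the result as an estimate only. With these numbers the fitted exponent comes out a little under 3, i.e. much worse than linear.

```java
// Rough scaling check using only the two figures quoted in the post.
// ScalingEstimate is a hypothetical helper, not part of Talend.
class ScalingEstimate {

    // Elapsed seconds implied by a row count and an observed throughput.
    static double elapsedSeconds(double rows, double rowsPerSecond) {
        return rows / rowsPerSecond;
    }

    // Exponent k in "time ~ rows^k", fitted through two measurements.
    static double exponent(double rows1, double rate1, double rows2, double rate2) {
        double t1 = elapsedSeconds(rows1, rate1);
        double t2 = elapsedSeconds(rows2, rate2);
        return Math.log(t2 / t1) / Math.log(rows2 / rows1);
    }

    public static void main(String[] args) {
        double k = exponent(350, 117, 5000, 0.9);
        System.out.printf("fitted exponent: %.2f%n", k);

        // Extrapolation to 17K rows, ASSUMING the same power law keeps holding.
        double secondsAt5k = elapsedSeconds(5000, 0.9);
        double secondsAt17k = secondsAt5k * Math.pow(17000.0 / 5000.0, k);
        System.out.printf("estimated hours at 17K rows: %.1f%n", secondsAt17k / 3600.0);
    }
}
```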
I'm using TOS 4.2 on a Core i7 quad with 6GB RAM and an SSD. When I don't have the append option checked, I'm processing several hundred rows/sec. I should also note that while it's running slowly, the JVM it launches is only using 12% of the CPU (single hardware thread) and ~100MB memory. So it's not starved for resources.
This leads me to believe the XML append operation is very inefficient. I'm guessing it's not doing anything in parallel, since it seems to be maxing out half a core. It could allocate more memory, but it doesn't. Is there a way to improve this? A different way to build this type of job? Some settings for XML that I missed?
Thanks much for your help!
One Star

Re: Append XML performance degrades exponentially

Yes. The XML append operation is inefficient, especially with large XML files.
Try to optimize your job as described in this related topic.
Besides that, modify the parameters in "TalendOpenStudio-win32-x86.ini" to increase the JVM memory.

Does it work? I'll wait for your feedback.
Four Stars

Re: Append XML performance degrades exponentially

Thanks for your response, Pedro. Previously my settings were:
I changed them to your settings, but there is no change in performance. It starts off at 0.9 rows/s and gradually climbs to 1.3 rows/s. I see the JVM is now using 237MB, but I assume that's because we've increased the minimum. It's not taking advantage of the memory.
I took a look at the initial file size and it's 8.5MB. Not small, but even after parsing should fit into memory just fine. It could get 50x larger and still fit.
What other ways are available to generate XML in the way that I (and so many other users) require? I imagine there must be something that will take advantage of memory, hashing/indexing etc. Thanks!
One Star

Re: Append XML performance degrades exponentially

"Append the source XML file" is very inefficient and must be the bottleneck of this job, because it handles the input rows and the data in the source XML file at the same time. All of this data is processed in memory instead of being stored on disk, so as the source XML file gets larger, the processing becomes more and more expensive. As a result, the job becomes very slow.
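To illustrate why work like this can blow up, here is a toy model in Java. It ASSUMES that each appended row has to walk every node already held in the in-memory document; that is a guess for illustration, not a description of what tAdvancedOutputXML actually does internally. The class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Toy cost model (an assumption, not Talend's real implementation):
// every new row scans all nodes already in the in-memory document.
class AppendCostModel {

    // Counts node visits when 'rows' rows are appended one at a time.
    // Under this model the total is rows * (rows - 1) / 2.
    static long nodeVisits(int rows) {
        List<Integer> document = new ArrayList<>();
        long visits = 0;
        for (int row = 0; row < rows; row++) {
            // Walk the existing document before appending the new node.
            for (int i = 0; i < document.size(); i++) {
                visits++;
            }
            document.add(row);
        }
        return visits;
    }

    public static void main(String[] args) {
        for (int rows : new int[] {350, 5000, 17000}) {
            System.out.println(rows + " rows -> " + nodeVisits(rows) + " node visits");
        }
    }
}
```

Even this simple model grows quadratically with the row count. The slowdown measured earlier in the thread is steeper still, so the real component may be doing more work per row than the model assumes.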
Here are some workarounds.
No.1: Split the output into several files. You can find this option in Advanced Settings.
No.2: Split the input rows. Imagine putting 17K rows and the data in the source file into memory together. That must be very slow.
Splitting the 17K rows into several files may help, but because of the huge source XML file, the job will still be very slow.
No.3: Modify the JVM arguments. I made a mistake in my previous post: changing TalendOpenStudio-win32-x86.ini is wrong here.
You can change the JVM arguments as shown in the following image.
In short, if you want fast job execution, don't use "Append the source XML file".
If you have to use it, apply No.2 and No.3.
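For No.3, one way to confirm that the new JVM arguments actually reach the JVM running the job is to print the heap limits from inside it. The sketch below is plain Java using the standard `Runtime` API; the class name is hypothetical, and running it from a tJava component at the start of the job is an assumption about how you would wire it in. `-Xms` and `-Xmx` are the standard JVM flags for initial and maximum heap size.

```java
// Prints the heap limits of the JVM it runs in.
// HeapCheck is a hypothetical helper name, not part of Talend.
class HeapCheck {

    static long toMegabytes(long bytes) {
        return bytes / (1024L * 1024L);
    }

    // max = upper limit (roughly -Xmx), committed = heap currently reserved,
    // used = committed heap minus free space inside it.
    static String report() {
        Runtime rt = Runtime.getRuntime();
        long max = rt.maxMemory();
        long committed = rt.totalMemory();
        long used = committed - rt.freeMemory();
        return String.format("max=%dMB committed=%dMB used=%dMB",
                toMegabytes(max), toMegabytes(committed), toMegabytes(used));
    }

    public static void main(String[] args) {
        System.out.println(report());
    }
}
```

If the reported maximum does not change after editing the arguments, the setting was probably applied to a different JVM than the one executing the job.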
Hope this will help you.
Four Stars

Re: Append XML performance degrades exponentially

Just a note: I tried these suggestions, but none really worked, so I had to do it more manually. My solution is here, and it resulted in an enormous (~1000x) performance improvement:
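The linked write-up is not reproduced in this thread, so the following is NOT that solution. It is only a generic sketch of one manual approach, under the assumption that each loop has already been written to its own XML document with the same root element: parse both once, copy the second loop's elements under the first root, and serialize a single time. It uses the standard JAXP DOM and Transformer APIs; the class name, element names, and sample data are made up.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: merges the children of two XML roots in one pass.
class XmlLoopMerge {

    static String merge(String firstXml, String secondXml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document target = builder.parse(new InputSource(new StringReader(firstXml)));
            Document source = builder.parse(new InputSource(new StringReader(secondXml)));

            // Copy every child of the second root under the first root.
            // importNode returns a copy, so the source tree is left untouched.
            Node child = source.getDocumentElement().getFirstChild();
            while (child != null) {
                target.getDocumentElement().appendChild(target.importNode(child, true));
                child = child.getNextSibling();
            }

            // Serialize the merged document exactly once.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(target), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("XML merge failed", e);
        }
    }

    public static void main(String[] args) {
        String orders = "<root><order id=\"1\"/><order id=\"2\"/></root>";
        String customers = "<root><customer id=\"7\"/></root>";
        System.out.println(merge(orders, customers));
    }
}
```

Each document is parsed and written only once here, so the cost grows with the total size rather than with rows times existing nodes. For files too large to hold in memory, a streaming API would be the more natural choice than DOM.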

Re: Append XML performance degrades exponentially

Very interesting liquidcool! Thanks for the feedback.
Six Stars

Re: Append XML performance degrades exponentially

Just a note: it's 2016 now and Talend STILL has this issue. I'm trying to output an XML file with multiple loops. If I output into separate files (no append, each loop in its own XML file), the job takes 18 seconds. When I append the second loop into the first file, the job runs for 4 to 6 minutes for only 10,000 records. I have 2.5 million to process, so using "out of the box" Talend won't work for XML output. Ridiculous performance. There has to be a way to merge, append, or process multiple-loop XML output efficiently.

