Java Out of Memory Occurring for large process of csv files

One Star

Java Out of Memory Occurring for large process of csv files

Hi
I have been running a Talend job that processes a large number of CSV files, and it has run for months without issue. The last successful run had just over 20,000 files. The set is now up to almost 23,000 and all of a sudden it is running out of memory:
Exception in component tRunJob_2
java.lang.RuntimeException: Child job return 1. It doesn't terminate normally.
Exception in thread "main" java.lang.Error: java.lang.OutOfMemoryError: Java heap space
The main job runs 4 subjobs, and the error above occurs on the second one. The first job simply takes each file (inside a directory), filters it by a single column value and outputs the file into another directory. The second job takes the output files and, one by one, runs a tSortRow and a tUniqRow and then a filter based on a column value and a date value.
Any idea why an extra 2,000 files would cause Talend to run out of memory? I've tried upping the heap size and it is still running out of memory.
Any help at all would be greatly appreciated.
Thanks
Suzy
One Star

Re: Java Out of Memory Occurring for large process of csv files

Hi,
I suggest you use a monitoring tool like VisualVM to find out which objects are actually consuming all the memory.
If all files are processed sequentially, one would not expect it to run out of memory, even with this number of files.
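If attaching VisualVM directly is awkward on the server, one alternative (assuming a JDK is installed there; the pid and file name below are just placeholders) is to take a heap dump from the command line and open the resulting file in VisualVM afterwards:
jmap -dump:live,format=b,file=talend_heap.hprof <pid>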
Hope this helps.
Regards,
Arno
One Star

Re: Java Out of Memory Occurring for large process of csv files

Thanks for your reply, Arno. I forgot to mention that I exported this job and am running it on a server through the run script. Do you think this could make a difference to why it runs out of memory?
Thanks
Suzy
One Star

Re: Java Out of Memory Occurring for large process of csv files

Hi,
No, this shouldn't make any difference. It should even run better from the script on the command line, because no graphical interface is needed there.
You can still use VisualVM, but instead of connecting it to the local Java VM you should make a small adjustment to the job start script so that it accepts a remote monitoring connection. (If you need more info on what these parameters are, I'll look them up for you.)
Regards,
Arno
One Star

Re: Java Out of Memory Occurring for large process of csv files

Hi Arno,
If you could help me with the parameters for allowing the job to accept a remote monitoring connection, that would be great.
Thanks!
Suzy
One Star

Re: Java Out of Memory Occurring for large process of csv files

Hi Suzy,
To make the Talend application listen for remote monitoring, you should add the following parameters to the .sh script:
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.port=8100
-Dcom.sun.management.jmxremote=true
-Djava.rmi.server.hostname=192.168.10.101

Of course you should change the IP address to reflect your situation and make sure that port 8100 is open in the firewall of the server running the job (alternatively, change the port number to another open port that is not in use).
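For reference, with those options added the java line in the exported .sh would look roughly like the following; the heap sizes, classpath and job class name are placeholders for whatever your own export contains:
java -Xms256M -Xmx1024M \
  -Dcom.sun.management.jmxremote=true \
  -Dcom.sun.management.jmxremote.port=8100 \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false \
  -Djava.rmi.server.hostname=192.168.10.101 \
  -cp <exported classpath> myproject.myjob_0_1.MyJob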
Regards,
Arno
One Star

Re: Java Out of Memory Occurring for large process of csv files

Thanks a mil!
One Star

Re: Java Out of Memory Occurring for large process of csv files

Hi,
The issue I'm finding seems to be with processing a CSV file that is 190 MB. I tried using the 'sort on disk' option, but that causes:
java.lang.OutOfMemoryError: GC overhead limit exceeded
What's the best way to process large CSV files? (Attached image of my subjob.)
Five Stars

Re: Java Out of Memory Occurring for large process of csv files

increase java memory
One Star

Re: Java Out of Memory Occurring for large process of csv files

That didn't work on our server.
One Star

Re: Java Out of Memory Occurring for large process of csv files

Hi Suzy,
Still, Jugal's solution should help. The GC overhead limit exceeded error occurs when the JVM is spending too much time on garbage collection (around 98% of the time spent in GC while freeing less than 2% of the heap, if I'm correct).
Allowing the job to use more memory (-Xmx) should increase the available free heap. To test this, you could also temporarily reduce the size of your file to see if that prevents the job from crashing.
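The heap size is set on the java line of the exported .sh; for example, changing the existing option there from something like
-Xmx1024M
to
-Xmx4096M
(the starting value depends on how the job was exported) gives the job roughly 4 GB of heap to work with.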
I've seen a lot of these GC overhead errors when parsing XML, where the document gets loaded into memory using the complexParser in the SAX utility. With VisualVM I analyzed this and saw that extending memory helped (to an extent) in preventing this error from occurring.
Did you monitor the job with VisualVM already? If so, could you upload a screenshot of VisualVM from when the job crashed?
Regards,
Arno
One Star

Re: Java Out of Memory Occurring for large process of csv files

I was finally able to get it working on the server; some configuration was needed, but upping the heap size resolved the out of memory issue.
I think ultimately though we'll have to find a better way to sort the data.
Thanks a million guys for your help!
Cheers
Suzy
Five Stars

Re: Java Out of Memory Occurring for large process of csv files

:)
One Star

Re: Java Out of Memory Occurring for large process of csv files

Hi,
Thanks for the feedback. Glad we could help. :)
Arno
One Star

Re: Java Out of Memory Occurring for large process of csv files

If you go into the Advanced settings of the output component (the one receiving the data), you can define a batch size. If you define a low batch and commit size, like 1K, then it doesn't hold as much in memory and can get through the wide data in small chunks.
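To illustrate what that setting does (this is only a sketch of the batch/commit pattern, not Talend's actual generated code; the connection details, table and column names are made up), the output component roughly amounts to something like:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class BatchInsertSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical target database; replace with your own connection details.
        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/mydb", "user", "password");
        conn.setAutoCommit(false); // commit manually every batch

        PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO my_table (col_a, col_b) VALUES (?, ?)");

        int batchSize = 1000; // the "1K" batch/commit size
        int count = 0;
        for (String[] row : readCsvRows()) { // stand-in for the incoming CSV flow
            ps.setString(1, row[0]);
            ps.setString(2, row[1]);
            ps.addBatch();
            if (++count % batchSize == 0) {
                ps.executeBatch(); // flush this batch to the database
                conn.commit();     // committed rows no longer have to be held in memory
            }
        }
        ps.executeBatch(); // flush any remaining rows
        conn.commit();
        ps.close();
        conn.close();
    }

    // Placeholder for however the rows are actually read; not the point of the sketch.
    private static Iterable<String[]> readCsvRows() {
        return java.util.Collections.<String[]>emptyList();
    }
}

With a small batch and commit size the job only ever keeps one batch of rows pending, instead of accumulating everything before a single commit at the end.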