[resolved] tSortRow and Large Files

I am just starting with TOS 5.6.0 and I am trying to sort a large CSV file (2.5GB, 11M rows, 45 columns). I have set the JVM heap to 2GB and tried various buffer sizes for the external sort in the Advanced tab. The stack trace shows that the out-of-memory error occurs in various places, but the result is always similar to:
Exception in thread "Thread-0" java.lang.OutOfMemoryError: Java heap space
at java.util.LinkedList.listIterator(LinkedList.java:667)
at java.util.AbstractList.listIterator(AbstractList.java:284)
at java.util.AbstractSequentialList.iterator(AbstractSequentialList.java:222)
at routines.system.RunStat.sendMessages(RunStat.java:261)
at routines.system.RunStat.run(RunStat.java:225)
at java.lang.Thread.run(Thread.java:662)
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.lang.StringBuilder.toString(StringBuilder.java:430)
at com.talend.csv.CSVReader.endColumn(CSVReader.java:131)
at com.talend.csv.CSVReader.readNext(CSVReader.java:301)
at johnmdm.sqlinout_0_1.SQLInOut.tFileInputDelimited_1Process(SQLInOut.java:3380)
at johnmdm.sqlinout_0_1.SQLInOut.runJobInTOS(SQLInOut.java:5199)
at johnmdm.sqlinout_0_1.SQLInOut.main(SQLInOut.java:5056)
--john
2 REPLIES
Community Manager

Re: [resolved] tSortRow and Large Files

Hi
Take a look at this KB article. To resolve this error, try storing the data on disk instead of in memory: check the 'Sort on disk' box on the Advanced settings tab of the tSortRow component.
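For anyone wondering why sorting on disk avoids the heap error, here is a minimal sketch of the general external merge sort idea in plain Java. It is not the code tSortRow generates. The class name ExternalSortSketch, the chunkSize parameter and the temp-file naming are made up for illustration, and it compares whole lines as strings rather than selected columns. Only chunkSize lines are held in memory at once; each sorted chunk goes to a temporary file, and the files are then merged.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Illustrative only: a generic external merge sort over text lines.
class ExternalSortSketch {

    // Sorts the lines of input into output, keeping at most chunkSize lines in memory.
    static void sort(Path input, Path output, int chunkSize) throws IOException {
        List<Path> runs = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(input)) {
            List<String> chunk = new ArrayList<>(chunkSize);
            String line;
            while ((line = in.readLine()) != null) {
                chunk.add(line);
                if (chunk.size() == chunkSize) {
                    runs.add(writeRun(chunk));
                    chunk.clear();
                }
            }
            if (!chunk.isEmpty()) {
                runs.add(writeRun(chunk));
            }
        }
        merge(runs, output);
    }

    // Sorts one chunk in memory and writes it to a temporary file (a "run").
    static Path writeRun(List<String> chunk) throws IOException {
        Collections.sort(chunk);
        Path run = Files.createTempFile("sort-run-", ".txt");
        Files.write(run, chunk);
        return run;
    }

    // Merges the sorted runs; the heap holds one pending line per run.
    static void merge(List<Path> runs, Path output) throws IOException {
        PriorityQueue<Map.Entry<String, BufferedReader>> heap =
                new PriorityQueue<>((a, b) -> a.getKey().compareTo(b.getKey()));
        List<BufferedReader> readers = new ArrayList<>();
        try (BufferedWriter out = Files.newBufferedWriter(output)) {
            for (Path run : runs) {
                BufferedReader reader = Files.newBufferedReader(run);
                readers.add(reader);
                String first = reader.readLine();
                if (first != null) {
                    heap.add(new AbstractMap.SimpleEntry<>(first, reader));
                }
            }
            while (!heap.isEmpty()) {
                Map.Entry<String, BufferedReader> smallest = heap.poll();
                out.write(smallest.getKey());
                out.newLine();
                String next = smallest.getValue().readLine();
                if (next != null) {
                    heap.add(new AbstractMap.SimpleEntry<>(next, smallest.getValue()));
                }
            }
        } finally {
            for (BufferedReader reader : readers) {
                reader.close();
            }
            for (Path run : runs) {
                Files.deleteIfExists(run);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical usage: java ExternalSortSketch in.csv out.csv
        sort(Paths.get(args[0]), Paths.get(args[1]), 100_000);
    }
}
```

The trade-off is the one the buffer setting controls: a smaller chunk uses less heap but produces more temporary files to merge, so the sort does more disk I/O.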
Best regards
Shong

Re: [resolved] tSortRow and Large Files

Hi, I am trying to solve a performance issue with sorting a huge file (50 million records, 6 columns) on an integer column plus an alpha column. tSortRow takes around 30 minutes with sort on disk enabled.
I am using TOS 5.6.2 and evaluating this sort for my POC. Please advise on an optimized job design.