Problem reading large files

One Star

Problem reading large files

Hi
I have a job that reads and processes a positional file. The file is read with a tFileInputPositional component connected to a tConvertType component. Six child jobs are also triggered from the parent job on SubJob Ok; tFileInputPositional is the first component of the parent job. The input file is 530 MB and the records are 354 bytes long.
The job runs correctly for small files, but for the 530 MB file it throws a java.lang.OutOfMemoryError on the tFileInputPositional component after 100233 rows. Xmx is set to 1024M in Window > Preferences > Talend > Run/Debug.
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.lang.String.<init>(String.java:208)
at java.io.BufferedReader.readLine(BufferedReader.java:331)
at java.io.BufferedReader.readLine(BufferedReader.java:362)
at org.talend.fileprocess.delimited.RowParser.readRecord(RowParser.java:156)
at x.y.z.tFileInputPositional_1Process(z.java:7532)

I tried setting the Xmx parameter to 1800M and 2048M, but that results in an error saying the JVM cannot be created:
Could not create the Java virtual machine.
Error occurred during initialization of VM
Could not reserve enough space for object heap

Has anyone come across and resolved this issue? Is there a different way I should be reading large files?
Community Manager

Re: Problem reading large files

Hello
First, you can read the related topics 4460 and 5283.
You can try to split the long record into several columns, then join them in another component.
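For example, cutting one fixed-length record into columns by position could look roughly like this in plain Java (illustrative only; the offsets and field names are made up, not your real layout):

// Illustrative sketch only: cutting a fixed-length (positional) record into
// columns by offset. The offsets and field names are made up.
public class PositionalSplit {
    public static void main(String[] args) {
        String record = "0000012345" + "20090101" + "      10.50"; // sample record
        String accountId = record.substring(0, 10);
        String date = record.substring(10, 18);
        String amount = record.substring(18).trim();
        System.out.println(accountId + " | " + date + " | " + amount);
    }
}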
Best regards

shong
----------------------------------------------------------
Talend | Data Agility for Modern Business
One Star

Re: Problem reading large files

Hi,
Could you please post a screenshot of your job and give an example of your input data?
Bye
Volker
One Star

Re: Problem reading large files

Hi again,
to help you out a little, I think there are a few ways to solve the problem:
* If you use any aggregate or sort components, check whether you really need them or whether there is an alternative. If you do need them, try to sort externally (there are dedicated components for that) and use aggregate components that rely on the sort order. This prevents Talend from using big chunks of memory to store the data temporarily.
* In tMap you can also store temporary data in an external file on disk to reduce memory usage.
* Try to cut your job into two or more parts that run one after another. End the first part with a temporary file that becomes the input for the second one.
* Try to read only a part of your file (you could, for example, create a job that splits your file into chunks of 1000 rows); see the sketch below this list.
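To illustrate the last point, here is a rough sketch of such a splitter in plain Java (illustrative only; the file names are placeholders, and the same logic can of course be built with Talend components):

// Illustrative sketch only (not Talend-generated code): split a large file
// into chunks of 1000 lines that can then be processed one after another.
import java.io.*;

public class FileSplitter {
    public static void main(String[] args) throws IOException {
        final int chunkSize = 1000;   // rows per chunk
        int rowCount = 0;
        int chunkIndex = 0;
        BufferedReader reader = new BufferedReader(new FileReader("bigfile.dat"));
        BufferedWriter writer = new BufferedWriter(new FileWriter("chunk_0.dat"));
        String line;
        while ((line = reader.readLine()) != null) {
            if (rowCount > 0 && rowCount % chunkSize == 0) {
                writer.close();       // finish the current chunk
                chunkIndex++;
                writer = new BufferedWriter(new FileWriter("chunk_" + chunkIndex + ".dat"));
            }
            writer.write(line);
            writer.newLine();
            rowCount++;
        }
        writer.close();
        reader.close();
    }
}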
Hope this will help you.
Bye
Volker
One Star

Re: Problem reading large files

Hi Volker, excellent suggestions. It was the aggregate that was causing the exception; the error message pointing to tFileInputPositional threw me off. I tried splitting the job into subjobs and then into a child job, but it still seems to fail at the aggregate. I'm trying to aggregate on 10 fields to create a summary record that can be used by multiple child jobs. I've tried various options such as using sorted input, a sorted aggregate, etc. I might have to use a custom Java component to do it. Any other suggestions? Please let me know. Thanks.
One Star

Re: Problem reading large files

Hi,
did you try tAggregateSortedRow (in your first screenshot you have already sorted the data)? Additionally, there is a component for external sorting.
What are you doing in your tAggregateRow, and which information do you really need? Can you reduce the number of columns, for example with tFilterColumns?
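The advantage of aggregating on sorted input is that only one group has to be kept in memory at a time. A simplified sketch of the idea (not the actual generated code; the "key;amount" layout and the file names are made up):

// Simplified sketch of aggregation over input that is already sorted by key.
// Illustrative only: each line is assumed to look like "key;amount".
import java.io.*;

public class SortedAggregate {
    public static void main(String[] args) throws IOException {
        BufferedReader reader = new BufferedReader(new FileReader("sorted_input.csv"));
        PrintWriter out = new PrintWriter(new FileWriter("summary.csv"));
        String currentKey = null;
        double sum = 0.0;
        String line;
        while ((line = reader.readLine()) != null) {
            String[] parts = line.split(";");
            String key = parts[0];
            double amount = Double.parseDouble(parts[1]);
            if (currentKey != null && !currentKey.equals(key)) {
                out.println(currentKey + ";" + sum);   // flush the finished group
                sum = 0.0;
            }
            currentKey = key;
            sum += amount;
        }
        if (currentKey != null) {
            out.println(currentKey + ";" + sum);       // flush the last group
        }
        out.close();
        reader.close();
    }
}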
Bye
Volker
One Star

Re: Problem reading large files

I was trying to create a summary file that groups on 10 fields and sums the amount of each line in the group. I tried tAggregateSortedRow but it did not work either. I was not able to try the external sort since I do not have a desktop license for the sort tool we use on the server.
I managed to solve the problem by using tMap to concatenate the 10 fields into a single field (keyField; I know, pretty creative after debugging for a few days ;-)) and grouping/aggregating by that keyField. I later extract the fields into the required layout. This solves the problem, but I'm afraid it will not scale properly, so I'm going to keep looking. Maybe an external sort (like you suggested) or custom Java code is the way to go.
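Stripped down, the workaround amounts to something like this (an illustrative sketch with made-up field values, not the actual job code; the real job does it with tMap and tAggregateRow):

// Illustrative sketch of the concatenated-key workaround.
// The grouping fields are joined into one keyField, the amounts are summed per
// key, and the key is later split back into the original fields.
import java.util.HashMap;
import java.util.Map;

public class KeyFieldAggregate {
    public static void main(String[] args) {
        // Hypothetical rows: {field1, field2, amount} - the real job has 10 grouping fields.
        String[][] rows = {
            {"A", "X", "10.5"},
            {"A", "X", "4.5"},
            {"B", "Y", "7.0"}
        };
        Map<String, Double> sums = new HashMap<String, Double>();
        for (String[] row : rows) {
            String keyField = row[0] + "|" + row[1];   // concatenate the grouping fields
            double amount = Double.parseDouble(row[2]);
            Double current = sums.get(keyField);
            sums.put(keyField, current == null ? amount : current + amount);
        }
        for (Map.Entry<String, Double> e : sums.entrySet()) {
            String[] fields = e.getKey().split("\\|"); // extract the fields again
            System.out.println(fields[0] + ";" + fields[1] + ";" + e.getValue());
        }
    }
}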
One Star

Re: Problem reading large files

Hi,
I'll take a look into the source code next week.
Bye
Volker
One Star

Re: Problem reading large files

Hi,
Today I looked into the generated code (tAggregateRow) and I now really understand your problem. The data in tAggregateRow is stored in a nested HashMap construct. It looks very ugly and I think it uses an extreme amount of memory, and memory usage will increase rapidly with the number of key columns and with a low density of the key values.
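Just to illustrate what I mean by the nested-map pattern, here is a simplified, hypothetical reconstruction (not the real generated code) with only two grouping columns:

// Hypothetical, simplified illustration of one HashMap nested per grouping
// column; with 10 grouping columns this becomes a 10-level structure, and
// memory grows quickly with the number of distinct key combinations.
import java.util.HashMap;

public class NestedMapAggregate {
    public static void main(String[] args) {
        HashMap<String, HashMap<String, Double>> groups =
                new HashMap<String, HashMap<String, Double>>();

        // One incoming row with two key columns and an amount (made-up values).
        String key1 = "A";
        String key2 = "X";
        double amount = 10.5;

        HashMap<String, Double> inner = groups.get(key1);
        if (inner == null) {
            inner = new HashMap<String, Double>();     // a new inner map per outer key
            groups.put(key1, inner);
        }
        Double sum = inner.get(key2);
        inner.put(key2, sum == null ? amount : sum + amount);

        System.out.println(groups);
    }
}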
So your solution is the best workaround, and I think it should replace the standard behavior of the Talend component.
Bye
Volker
One Star

Re: Problem reading large files

Thanks Volker!
Employee

Re: Problem reading large files

Indeed Volker, you are right; this is the reason why we improved tAggregateRow in the 3.1.0 M1/M2 releases. The new tAggregateRow no longer uses the very ugly nested Maps.
Therefore I will update "tAggregateRowOpt" on the "Exchange" page (previously called "Ecosystem"), which will be an exact copy of the 3.1 tAggregateRow. The "tAggregateRowOpt" is useful for users running TOS versions before 3.1.

One Star

Re: Problem reading large files

That's interesting. I'll take a look at the new version. I have also done some brainstorming myself on how to avoid the memory consumption (and improve speed too).