Iterate on rows of a large file and free memory between each row

Four Stars

Iterate on rows of a large file and free memory between each row

Hi all, 

 

I am following the advice of @rhall_2_0 to create this topic, rather than prolonging the thread "How to iterate on tFileInputFullRow rows?", and to make my question clearer.

 

My use case is that I have a fairly large XML file with multiple rows (the file is actually 300 MB for 23,000 rows). From that file I only need to create one file per row, containing one of the columns (I do not need to deal with the other columns).

 

I tried the following:

iterate on each row, use a tJavaFlex to extract the data I need, do a simple replace in the text, and write each row to an output file whose name is based on the ID of the row.

 

This approach ends up with an "out of memory" error, even with the 300 MB file...

It seems tIterateToFlow tries to load the whole XML file into RAM, and each row/iterate step more or less does the same, only discarding the parts after the iteration.

I know I can increase the memory of the job, and it actually works when I give 8 GB of RAM to the job.

 

But even so, my question is: is there a way to tell Talend to:

- take one row

- do a processing on it

- free its memory

- go on with the next row

 

Or is it mandatory to load the whole input before doing the row-by-row iterations? If so, how could I handle a 20 GB file if I receive one some day (my computer has 16 GB of RAM)?
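The four-step pattern above (take one row, process it, free its memory, move on) is exactly what a streaming pull parser does. As a minimal sketch outside of Talend, here is how it could look in plain Java with StAX; the element names row/column1 and the helper extractColumn1 are assumptions for illustration, not the actual job:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class RowByRow {
    // Streams through the XML one event at a time; only the current
    // row's data is ever held in memory, never the whole document.
    public static List<String> extractColumn1(String xml) throws Exception {
        List<String> out = new ArrayList<>();
        XMLStreamReader r = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        while (r.hasNext()) {
            if (r.next() == XMLStreamConstants.START_ELEMENT
                    && "column1".equals(r.getLocalName())) {
                // take one row's value, process it (here: just collect it),
                // then let it become garbage-collectable before the next row
                out.add(r.getElementText());
            }
        }
        r.close();
        return out;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<rows><row id=\"1\"><column1>a</column1></row>"
                   + "<row id=\"2\"><column1>b</column1></row></rows>";
        System.out.println(extractColumn1(xml));
    }
}
```

In a real job each extracted value would be written to its own output file instead of collected in a list, so memory use stays flat regardless of file size.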

 

Thank you all in advance for your help.

Best regards. 


Accepted Solutions
Sixteen Stars

Re: Iterate on rows of a large file and free memory between each row

SAX does not require the whole document to be loaded into memory. It essentially reads the XML from top to bottom and parses elements as it hits them (it's not precisely like that, but that is a good way of thinking about it). As such, it has a much lighter footprint on memory BUT it does limit look forward and back XPath functionality. However, from what you have said, your XPaths should not require that. Give it a try.
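To illustrate the top-to-bottom behavior described above, here is a minimal SAX sketch in plain Java (not the Talend component itself); the element names row/column1 are assumptions based on the file described in this thread:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxSketch {
    // SAX pushes events (start tag, text, end tag) to the handler as it
    // reads; the document is never materialized in memory as a whole.
    static class RowHandler extends DefaultHandler {
        StringBuilder current = new StringBuilder();
        boolean inColumn1 = false;
        int rowCount = 0;
        String lastColumn1 = null;

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            if ("row".equals(qName)) rowCount++;
            if ("column1".equals(qName)) { inColumn1 = true; current.setLength(0); }
        }

        @Override
        public void characters(char[] ch, int start, int len) {
            if (inColumn1) current.append(ch, start, len);
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            // once </row> closes, everything buffered for that row can be
            // written out and garbage-collected; nothing else is retained
            if ("column1".equals(qName)) { inColumn1 = false; lastColumn1 = current.toString(); }
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<rows><row id=\"1\"><column1>a</column1></row>"
                   + "<row id=\"2\"><column1>b</column1></row></rows>";
        RowHandler h = new RowHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), h);
        System.out.println(h.rowCount + " rows, last column1 = " + h.lastColumn1);
    }
}
```

The trade-off mentioned above follows directly from this model: since events arrive in document order and are then discarded, the parser cannot jump backward or forward, which is why look-ahead/look-behind XPath is unavailable in SAX mode.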


All Replies
Sixteen Stars

Re: Iterate on rows of a large file and free memory between each row

Can you give us an example of your XML so that we can understand the complexity of the file? You refer to "rows" in your question. I take it you mean loops within one XML file rather than multiple XML documents in one file?

 

If this just happens to be a very simple XML file with lots of loops, you may be able to solve your problem using just a tFileInputXML, then going to the "Advanced settings" tab and changing the "Generation Mode" to "SAX".

Four Stars

Re: Iterate on rows of a large file and free memory between each row

Hi, 

The XML file represents a "table" style (flat 2D data file); the content could be summarized like:

<row id="1">
   <column1>content</column1>
   <column2>content</column2>
   ...
   <column200>content</column200>
</row>
<row id="2">
   ...
</row>

My question being "how do I avoid Talend loading everything into RAM (or control how it flushes the RAM)?", can I infer from your answer that the "SAX" setting allows each row to be fully processed before its memory is flushed and the next iteration is launched?

 

Best regards. 

 

 


Four Stars

Re: Iterate on rows of a large file and free memory between each row

It is working like a charm!

Thank you very much !