Running out of memory with tAggregateRow component

One Star

Running out of memory with tAggregateRow component

Hi,
I'm trying to aggregate 5M+ records with tAggregateRow and am running out of Java heap space. Is there any way around this issue other than increasing the VM memory setting (Xmx)? I tried increasing it to 1.5 GB (Xmx1536M), but the job still runs out of heap space at 400K rows.
Below is the exception message, in case it is of any help:
Starting job Load_Hostel_Products_Prices_Allocations at 16:19 27/09/2013.

connecting to socket on port 4050
connected
disconnected
disconnected
disconnected
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Unknown Source)
at java.lang.String.<init>(Unknown Source)
at java.lang.StringBuilder.toString(Unknown Source)
at hi360_20130926.load_hostel_products_prices_allocations_0_1.Load_Hostel_Products_Prices_Allocations.tFileInputDelimited_1Process(Load_Hostel_Products_Prices_Allocations.java:43509)
at hi360_20130926.load_hostel_products_prices_allocations_0_1.Load_Hostel_Products_Prices_Allocations.tFileInputDelimited_12Process(Load_Hostel_Products_Prices_Allocations.java:17241)
at hi360_20130926.load_hostel_products_prices_allocations_0_1.Load_Hostel_Products_Prices_Allocations.tFileInputDelimited_11Process(Load_Hostel_Products_Prices_Allocations.java:15048)
at hi360_20130926.load_hostel_products_prices_allocations_0_1.Load_Hostel_Products_Prices_Allocations.tFileInputDelimited_4Process(Load_Hostel_Products_Prices_Allocations.java:12681)
at hi360_20130926.load_hostel_products_prices_allocations_0_1.Load_Hostel_Products_Prices_Allocations.tFileDelete_2Process(Load_Hostel_Products_Prices_Allocations.java:9797)
disconnected
disconnected
disconnected
disconnected
disconnected
disconnected
disconnected
disconnected
disconnected
disconnected
at hi360_20130926.load_hostel_products_prices_allocations_0_1.Load_Hostel_Products_Prices_Allocations.tMysqlInput_2Process(Load_Hostel_Products_Prices_Allocations.java:9655)
at hi360_20130926.load_hostel_products_prices_allocations_0_1.Load_Hostel_Products_Prices_Allocations.tSalesforceBulkExec_7Process(Load_Hostel_Products_Prices_Allocations.java:8707)
at hi360_20130926.load_hostel_products_prices_allocations_0_1.Load_Hostel_Products_Prices_Allocations.tMysqlInput_1Process(Load_Hostel_Products_Prices_Allocations.java:7903)
at hi360_20130926.load_hostel_products_prices_allocations_0_1.Load_Hostel_Products_Prices_Allocations.tFileDelete_1Process(Load_Hostel_Products_Prices_Allocations.java:4344)
at hi360_20130926.load_hostel_products_prices_allocations_0_1.Load_Hostel_Products_Prices_Allocations.tSalesforceBulkExec_5Process(Load_Hostel_Products_Prices_Allocations.java:4202)
at hi360_20130926.load_hostel_products_prices_allocations_0_1.Load_Hostel_Products_Prices_Allocations.tSalesforceInput_7Process(Load_Hostel_Products_Prices_Allocations.java:3203)
at hi360_20130926.load_hostel_products_prices_allocations_0_1.Load_Hostel_Products_Prices_Allocations.tSalesforceBulkExec_4Process(Load_Hostel_Products_Prices_Allocations.java:2775)
at hi360_20130926.load_hostel_products_prices_allocations_0_1.Load_Hostel_Products_Prices_Allocations.tSalesforceInput_6Process(Load_Hostel_Products_Prices_Allocations.java:1782)
at hi360_20130926.load_hostel_products_prices_allocations_0_1.Load_Hostel_Products_Prices_Allocations.runJobInTOS(Load_Hostel_Products_Prices_Allocations.java:47743)
at hi360_20130926.load_hostel_products_prices_allocations_0_1.Load_Hostel_Products_Prices_Allocations.main(Load_Hostel_Products_Prices_Allocations.java:47525)
Job Load_Hostel_Products_Prices_Allocations ended at 16:41 27/09/2013.
Seventeen Stars

Re: Running out of memory with tAggregateRow component

tAggregateRow keeps an entry in memory for every unique record in the input stream. It needs to collect nearly everything because the component can calculate the uniqueness only at the end of the flow.
If you are able to read the data in a sorted order you can use tAggregateSortedRow. This component releases all data it has already inspected, which the sort order makes possible. It saves a lot of memory but needs sorted data.
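
To illustrate why sorted input helps, here is a minimal sketch in plain Java. It is not Talend's generated code; the class name `SortedSumSketch`, the method `sumByKey`, and the two-column row layout are assumptions made for the example. Because the rows arrive sorted by key, a group is known to be complete as soon as the key changes, so only one running total has to be held at a time.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration of aggregating sorted input; not a Talend API.
class SortedSumSketch {

    // Sums the value column per key. Each row is {key, value}.
    // Assumes the rows are already sorted by key.
    static List<String> sumByKey(Iterator<String[]> sortedRows) {
        // Results are collected in a list only to keep the sketch short;
        // a real job would pass each finished group downstream instead.
        List<String> out = new ArrayList<>();
        String currentKey = null;
        long runningSum = 0;
        while (sortedRows.hasNext()) {
            String[] row = sortedRows.next();
            if (currentKey != null && !currentKey.equals(row[0])) {
                // Key changed: the previous group is complete, so emit it
                // and reset the single running total.
                out.add(currentKey + "=" + runningSum);
                runningSum = 0;
            }
            currentKey = row[0];
            runningSum += Long.parseLong(row[1]);
        }
        if (currentKey != null) {
            out.add(currentKey + "=" + runningSum); // flush the last group
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
            new String[] {"A", "10"},
            new String[] {"A", "5"},
            new String[] {"B", "7"});
        System.out.println(sumByKey(rows.iterator())); // prints [A=15, B=7]
    }
}
```

If the input were not sorted, the same key could reappear later in the stream, and this approach would emit that group more than once.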
One Star

Re: Running out of memory with tAggregateRow component

Thanks for your reply. A couple of thoughts on it:
"It needs to collect nearly everything because the component can calculate the uniqueness only at the end of the flow."

Not sure what "uniqueness" you are talking about. I get that the final aggregation result can only be produced after all rows were processed, but I think a lot of the processing ("first", "last", "sum" functions) can be done on the go, and processed rows can then be discarded.
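
As a rough sketch of what "on the go" aggregation could look like, the plain Java below folds each row into a small per-group accumulator and keeps nothing else. The names `RunningAggregateSketch`, `Acc`, and `accept` are made up for the example, and nothing in this thread confirms whether tAggregateRow works this way internally. Note that even with this approach, memory still grows with the number of distinct group keys.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of incremental aggregation; not a Talend API.
class RunningAggregateSketch {

    // Per-group state: just enough to answer "first", "last" and "sum".
    static final class Acc {
        String first;
        String last;
        long sum;
    }

    // One accumulator per distinct key, in order of first appearance.
    private final Map<String, Acc> groups = new LinkedHashMap<>();

    // Folds one row into its group's accumulator; the row itself is not kept.
    void accept(String key, String label, long amount) {
        Acc acc = groups.get(key);
        if (acc == null) {
            acc = new Acc();
            acc.first = label; // "first" is fixed by the first row of the group
            groups.put(key, acc);
        }
        acc.last = label;      // "last" is overwritten by every row
        acc.sum += amount;     // "sum" is updated incrementally
    }

    Map<String, Acc> result() {
        return groups;
    }

    public static void main(String[] args) {
        RunningAggregateSketch agg = new RunningAggregateSketch();
        agg.accept("A", "x", 10);
        agg.accept("B", "y", 7);
        agg.accept("A", "z", 5);
        Acc a = agg.result().get("A");
        System.out.println(a.first + " " + a.last + " " + a.sum); // prints x z 15
    }
}
```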
"If you are able to read the data in a sorted order you can use tAggregateSortedRow."

I had a look at tAggregateSortedRow but I noticed it had a setting called "Input rows count". Does that mean I need to know the total number of rows it will be aggregating before I run the job? Or can this setting be skipped?
