One Star

Running out of memory with tAggregateRow component

Hi,
I'm trying to aggregate 5M+ records with tAggregateRow and I'm running out of Java heap space. Is there any way around this other than increasing the JVM memory setting (Xmx)? I tried raising it to 1.5 GB (Xmx1536M), but the job still runs out of heap space at 400K rows.
Below is the exception message, in case it helps:
Starting job Load_Hostel_Products_Prices_Allocations at 16:19 27/09/2013.

connecting to socket on port 4050
connected
disconnected
disconnected
disconnected
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Unknown Source)
at java.lang.String.<init>(Unknown Source)
at java.lang.StringBuilder.toString(Unknown Source)
at hi360_20130926.load_hostel_products_prices_allocations_0_1.Load_Hostel_Products_Prices_Allocations.tFileInputDelimited_1Process(Load_Hostel_Products_Prices_Allocations.java:43509)
at hi360_20130926.load_hostel_products_prices_allocations_0_1.Load_Hostel_Products_Prices_Allocations.tFileInputDelimited_12Process(Load_Hostel_Products_Prices_Allocations.java:17241)
at hi360_20130926.load_hostel_products_prices_allocations_0_1.Load_Hostel_Products_Prices_Allocations.tFileInputDelimited_11Process(Load_Hostel_Products_Prices_Allocations.java:15048)
at hi360_20130926.load_hostel_products_prices_allocations_0_1.Load_Hostel_Products_Prices_Allocations.tFileInputDelimited_4Process(Load_Hostel_Products_Prices_Allocations.java:12681)
at hi360_20130926.load_hostel_products_prices_allocations_0_1.Load_Hostel_Products_Prices_Allocations.tFileDelete_2Process(Load_Hostel_Products_Prices_Allocations.java:9797)
disconnected
disconnected
disconnected
disconnected
disconnected
disconnected
disconnected
disconnected
disconnected
disconnected
at hi360_20130926.load_hostel_products_prices_allocations_0_1.Load_Hostel_Products_Prices_Allocations.tMysqlInput_2Process(Load_Hostel_Products_Prices_Allocations.java:9655)
at hi360_20130926.load_hostel_products_prices_allocations_0_1.Load_Hostel_Products_Prices_Allocations.tSalesforceBulkExec_7Process(Load_Hostel_Products_Prices_Allocations.java:8707)
at hi360_20130926.load_hostel_products_prices_allocations_0_1.Load_Hostel_Products_Prices_Allocations.tMysqlInput_1Process(Load_Hostel_Products_Prices_Allocations.java:7903)
at hi360_20130926.load_hostel_products_prices_allocations_0_1.Load_Hostel_Products_Prices_Allocations.tFileDelete_1Process(Load_Hostel_Products_Prices_Allocations.java:4344)
at hi360_20130926.load_hostel_products_prices_allocations_0_1.Load_Hostel_Products_Prices_Allocations.tSalesforceBulkExec_5Process(Load_Hostel_Products_Prices_Allocations.java:4202)
at hi360_20130926.load_hostel_products_prices_allocations_0_1.Load_Hostel_Products_Prices_Allocations.tSalesforceInput_7Process(Load_Hostel_Products_Prices_Allocations.java:3203)
at hi360_20130926.load_hostel_products_prices_allocations_0_1.Load_Hostel_Products_Prices_Allocations.tSalesforceBulkExec_4Process(Load_Hostel_Products_Prices_Allocations.java:2775)
at hi360_20130926.load_hostel_products_prices_allocations_0_1.Load_Hostel_Products_Prices_Allocations.tSalesforceInput_6Process(Load_Hostel_Products_Prices_Allocations.java:1782)
at hi360_20130926.load_hostel_products_prices_allocations_0_1.Load_Hostel_Products_Prices_Allocations.runJobInTOS(Load_Hostel_Products_Prices_Allocations.java:47743)
at hi360_20130926.load_hostel_products_prices_allocations_0_1.Load_Hostel_Products_Prices_Allocations.main(Load_Hostel_Products_Prices_Allocations.java:47525)
Job Load_Hostel_Products_Prices_Allocations ended at 16:41 27/09/2013.
2 REPLIES
Seventeen Stars

Re: Running out of memory with tAggregateRow component

tAggregateRow keeps data in memory for every unique group in the input stream. It needs to hold nearly everything because the component can determine uniqueness only at the end of the flow.
If you are able to read the data in sorted order, you can use tAggregateSortedRow. Because of the sort order, this component can release all data it has already processed. It saves a lot of memory but needs sorted data.
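To illustrate why sorted input helps, here is a minimal Java sketch of aggregating a stream that is already sorted by the group key. It is not the code Talend generates for tAggregateSortedRow; the class `SortedAggregateSketch`, the `Row` and `Group` types, and the two-column schema are all hypothetical. The point is only that a group can be emitted and forgotten as soon as the key changes, so a single group is held in memory at a time.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class SortedAggregateSketch {

    /** One input row: a group key and a numeric value (hypothetical schema). */
    public static final class Row {
        final String key;
        final long value;
        public Row(String key, long value) { this.key = key; this.value = value; }
    }

    /** One output row: the group key and the sum of its values. */
    public static final class Group {
        final String key;
        final long sum;
        Group(String key, long sum) { this.key = key; this.sum = sum; }
        @Override public String toString() { return key + "=" + sum; }
    }

    /**
     * Sums values per key, assuming rows arrive sorted by a non-null key.
     * Only the current group is tracked; it is emitted when the key changes.
     */
    public static List<Group> sumSorted(Iterator<Row> rows) {
        // Collected in a list here for simplicity; a real job would pass
        // each finished group downstream instead of keeping it.
        List<Group> out = new ArrayList<>();
        String currentKey = null;
        long currentSum = 0;
        boolean open = false; // true once at least one row has been read
        while (rows.hasNext()) {
            Row r = rows.next();
            if (open && !r.key.equals(currentKey)) {
                out.add(new Group(currentKey, currentSum)); // group is complete
                currentSum = 0;
            }
            currentKey = r.key;
            currentSum += r.value;
            open = true;
        }
        if (open) {
            out.add(new Group(currentKey, currentSum)); // flush the last group
        }
        return out;
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(new Row("a", 1), new Row("a", 2), new Row("b", 5));
        System.out.println(sumSorted(rows.iterator())); // prints [a=3, b=5]
    }
}
```

If the input is not actually sorted, this approach silently produces several partial rows for the same key, which is why the sort requirement matters.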
One Star

Re: Running out of memory with tAggregateRow component

Thanks for your reply. A couple of thoughts on it:
"It needs to collect nearly everything because the component can calculate the uniqueness only at the end of the flow."

Not sure what "uniqueness" you are talking about. I get that the final aggregation result can only be produced after all rows were processed, but I think a lot of the processing ("first", "last", "sum" functions) can be done on the go, and processed rows can be discarded.
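The "on the go" idea can be sketched in Java as one fixed-size accumulator per group key, updated as each row arrives. This is only an illustration of the argument, not a description of how tAggregateRow works internally; the class `RunningAggregateSketch` and its `Acc` type are hypothetical. With this approach memory grows with the number of distinct keys rather than the number of rows, although functions that need every value of a group (a list or a distinct count, for example) could not be handled this way.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RunningAggregateSketch {

    /** Accumulator for one group: fixed size no matter how many rows the group has. */
    public static final class Acc {
        long first;
        long last;
        long sum;
        long count;

        void add(long v) {
            if (count == 0) {
                first = v; // remember the first value seen for this group
            }
            last = v;      // overwritten by every row, so it ends up as the last value
            sum += v;
            count++;
        }

        @Override public String toString() {
            return "first=" + first + ",last=" + last + ",sum=" + sum;
        }
    }

    // One accumulator per distinct key; insertion order is kept for readable output.
    private final Map<String, Acc> groups = new LinkedHashMap<>();

    /** Folds one row into its group's accumulator; the row itself is not stored. */
    public void accept(String key, long value) {
        groups.computeIfAbsent(key, k -> new Acc()).add(value);
    }

    public Map<String, Acc> result() {
        return groups;
    }

    public static void main(String[] args) {
        RunningAggregateSketch agg = new RunningAggregateSketch();
        agg.accept("a", 10);
        agg.accept("b", 7);
        agg.accept("a", 5);
        System.out.println(agg.result());
    }
}
```

Unlike the sorted approach, results are only final once the whole input has been read, so every group's accumulator stays in memory until the end.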
"If you are able to read the data in a sorted order you can use tAggregateSortedRow."

I had a look at tAggregateSortedRow but I noticed it had a setting called "Input rows count". Does that mean I need to know the total number of rows it will be aggregating before I run the job? Or can this setting be skipped?