One Star

Talend job using tAggregateRow consumes 39 GB of memory

First off, I am using Talend Open Studio for Big Data (5.4.1).
I have a Talend job that attempts to aggregate a 6.5 GB file. It uses a tAggregateRow component to do this. The tAggregateRow component groups by 1 column. Its operation is a count(distinct) on 1 column. The file being processed has 5 columns in the schema, all strings.
When the aggregation starts, the JVM consumes 39 GB of memory to process this 6.5 GB file. This seems extremely inefficient.
Does anyone know what I am doing wrong or what I might do differently?
I've contacted Talend support, and the engineer's suggestion was to perform this aggregation outside of Talend (such as in a database). However, it seems unreasonable that a Talend job would consume 39 GB of memory to aggregate a 6.5 GB file. Less than 20 GB of memory seems more reasonable.
Your expert advice is greatly appreciated.
Seventeen Stars

Re: Talend job using tAggregateRow consumes 39 GB of memory

tAggregateRow works by collecting all keys and all output columns in memory and building the aggregations at the end of the flow. If the number of rows is large, this component can easily run out of memory.
I suggest you read your data sorted by the group-by column and use the tAggregateSortedRow component instead. This component depends on sorted input, and because the input is sorted it can free the rows for a group as soon as the key column(s) change.
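
To illustrate why sorted input keeps memory low, here is a minimal Java sketch of a count(distinct) over rows that are already sorted by the group key. This is not the code Talend generates; the class name SortedCountDistinct, the method aggregate, and the row layout (row[0] is the group key, row[1] is the value to count) are assumptions made for the example. Only the distinct values of the current group are held at any time.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

// Hypothetical illustration, not Talend's implementation.
public class SortedCountDistinct {

    // Expects rows sorted by row[0]; emits "key=count" each time the key changes.
    public static List<String> aggregate(Iterator<String[]> sortedRows) {
        List<String> out = new ArrayList<>();
        String currentKey = null;
        Set<String> distinct = new HashSet<>(); // values seen for the current key only
        while (sortedRows.hasNext()) {
            String[] row = sortedRows.next();
            if (currentKey != null && !currentKey.equals(row[0])) {
                // Key changed: the previous group is complete, so emit it and free its values.
                out.add(currentKey + "=" + distinct.size());
                distinct.clear();
            }
            currentKey = row[0];
            distinct.add(row[1]);
        }
        if (currentKey != null) {
            out.add(currentKey + "=" + distinct.size()); // flush the last group
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
            new String[] {"a", "x"},
            new String[] {"a", "x"},
            new String[] {"a", "y"},
            new String[] {"b", "z"});
        System.out.println(aggregate(rows.iterator())); // prints [a=2, b=1]
    }
}
```

With unsorted input, a group can never be declared complete before the last row has been read, so every group's values must stay in memory until the end of the flow. The sort itself still has to happen somewhere upstream, and for a 6.5 GB file that step may need to spill to disk.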