Talend job using tAggregateRow consumes 39 GB of memory

One Star

First off, I am using Talend Open Studio for Big Data (5.4.1).
I have a Talend job that aggregates a 6.5 GB file using a tAggregateRow component. The component groups by one column, and its only operation is a count(distinct) on one column. The file being processed has 5 columns in its schema, all strings.
When the aggregation starts, the JVM consumes 39 GB of memory to process this 6.5 GB file. This seems extremely inefficient.
Does anyone know what I am doing wrong or what I might do differently?
I've contacted Talend support, and the engineer's suggestion was to perform this aggregation outside of Talend (such as in a database). However, it seems unreasonable that a Talend job would consume 39 GB of memory to aggregate a 6.5 GB file. Less than 20 GB of memory seems more reasonable.
Your expert advice is greatly appreciated.
Seventeen Stars

Re: Talend job using tAggregateRow consumes 39 GB of memory

tAggregateRow collects all keys and all output columns in memory and builds the aggregations only at the end of the flow. If the number of rows is large, this component can easily run out of memory.
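To illustrate why memory grows with the input, here is a minimal Python sketch of that collect-everything-then-aggregate pattern. It is not Talend's actual implementation; the function name `count_distinct_in_memory` and the sample rows are hypothetical, and each row is reduced to a (group key, counted value) pair.

```python
from collections import defaultdict

def count_distinct_in_memory(rows):
    """Hypothetical illustration: group by the first field, count distinct second fields."""
    # Every group's set of distinct values stays resident until the input ends,
    # so peak memory grows with the number of distinct (key, value) pairs.
    seen = defaultdict(set)
    for key, value in rows:
        seen[key].add(value)
    # Results can only be produced after the whole flow has been read.
    return {key: len(values) for key, values in seen.items()}

# Unsorted sample input (made up for the example).
rows = [("a", "x"), ("b", "y"), ("a", "x"), ("a", "z"), ("b", "y")]
result = count_distinct_in_memory(rows)
print(result)
```

With string columns, each retained value also carries per-object overhead in the JVM, which is one plausible reason the footprint can exceed the file size by a wide margin.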
I suggest you read your data sorted by the group-by column and use the tAggregateSortedRow component. This component depends on sorted input, and because the input is sorted it can release the rows of a group as soon as the key column(s) change.
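The idea behind aggregating sorted input can be sketched in Python as follows. This is only an illustration of the technique, not the code tAggregateSortedRow generates; the function name `count_distinct_sorted` and the sample rows are hypothetical, and the sketch assumes the input has already been sorted by the group key (the sort itself is outside this example).

```python
from itertools import groupby
from operator import itemgetter

def count_distinct_sorted(rows):
    """Hypothetical illustration: rows must arrive sorted by the first field."""
    # groupby only groups consecutive rows with the same key, which is
    # exactly what sorted input guarantees.
    for key, group in groupby(rows, key=itemgetter(0)):
        # Only the current group's distinct values are held in memory;
        # the set is discarded as soon as the key changes.
        distinct = {value for _, value in group}
        yield key, len(distinct)

# Sample input, already sorted by the group key (made up for the example).
sorted_rows = [("a", "x"), ("a", "x"), ("a", "z"), ("b", "y"), ("b", "y")]
streamed = list(count_distinct_sorted(sorted_rows))
print(streamed)
```

Peak memory here is bounded by the largest single group rather than by the whole file, at the cost of requiring a sort beforehand.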

