Optimizing joins in Talend spark batch jobs



I'm having an issue with a Spark batch job in Talend. The job is simple: it reads a file from HDFS, performs a left outer join with another file on HDFS (via a tMap) on a single key, and writes the result back to HDFS. What I noticed is odd: the resulting Spark job performs a cogroup at one point and tries to gather the whole dataset onto a single task before writing it to HDFS. As a result, if the dataset is big enough, the job fails with an OutOfMemoryError: Java heap space.
Why does Talend handle the join this way? Is it possible to optimize it?
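For reference, the alternative to a shuffle-based cogroup is a map-side ("broadcast") join, where the smaller side is loaded into memory and each partition of the large side is joined locally, so no single task ever has to hold the whole joined dataset. The following is a minimal plain-Python sketch of that idea (not Talend's generated code; all names are illustrative):

```python
# Sketch of a map-side (broadcast) left outer join: the small lookup side
# is materialized in a dict, the large side is streamed row by row.
# This mirrors what Spark does when the smaller table fits in memory and
# can be broadcast, avoiding the cogroup shuffle described above.

def broadcast_left_outer_join(large_rows, small_rows, key):
    # Build an in-memory lookup from the small side (the "broadcast" table).
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)

    # Stream the large side: emit one joined row per match, or the left
    # row unchanged when there is no match (left outer semantics).
    for left in large_rows:
        matches = lookup.get(left[key])
        if matches:
            for right in matches:
                extra = {f"r_{k}": v for k, v in right.items() if k != key}
                yield {**left, **extra}
        else:
            yield {**left}

# Tiny illustrative datasets (hypothetical, not from the original job).
orders = [{"id": 1, "cust": "a"}, {"id": 2, "cust": "b"}, {"id": 3, "cust": "x"}]
names = [{"cust": "a", "name": "Alice"}, {"cust": "b", "name": "Bob"}]
result = list(broadcast_left_outer_join(orders, names, "cust"))
```

Because only the small side is held in memory, memory use is bounded by the lookup table, not by the full dataset; this is why replicated/broadcast joins scale where a single-task cogroup does not.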

Re: Optimizing joins in Talend spark batch jobs

Have you tried allocating more memory to the Job execution by setting the -Xmx Java VM parameter, and storing the data on disk instead of in memory in the tMap?
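For concreteness, a sketch of what those settings can look like (the values are illustrative only and should be tuned to your cluster; where exactly they are entered depends on your Talend version):

```shell
# JVM heap for the Job (typically under the Run tab's Advanced settings
# as a JVM argument):
-Xmx4096m

# Standard Spark memory settings, if you control the spark-submit of the
# generated job:
--conf spark.executor.memory=4g
--conf spark.driver.memory=4g
```

Raising the heap only postpones the problem if all the data still lands on one task, but combined with storing tMap lookup data on disk it can get a skewed job through.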
Best regards
