Optimizing joins in Talend Spark batch jobs


Hi,
I'm having an issue with a Spark batch job in Talend. The job is fairly simple: it reads a file from HDFS, performs a left outer join with another HDFS file on a single key (using a tMap), and writes the result back to HDFS. What I've noticed is strange: the generated Spark job performs a cogroup at some point and tries to gather the whole dataset onto a single task before writing it to HDFS! If the dataset is big enough, this causes an OutOfMemoryError: Java heap space.
Why does Talend handle joins this way? Is it possible to optimize it?
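To make this concrete, here is a minimal sketch in plain Spark (Scala) of roughly what the job does; the paths, the key column id, and the CSV format are placeholders, not the actual job. A plain left outer join shuffles (cogroups) both sides by the key, and hinting a broadcast join on the lookup side is one common way to avoid that shuffle when the lookup file fits in memory:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object JoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("join-sketch").getOrCreate()

    // Placeholder paths standing in for the two HDFS files.
    val big    = spark.read.option("header", "true").csv("hdfs:///data/big.csv")
    val lookup = spark.read.option("header", "true").csv("hdfs:///data/lookup.csv")

    // A plain left outer join on one key: Spark plans a shuffle (cogroup),
    // repartitioning both sides by the key before joining.
    val shuffled = big.join(lookup, Seq("id"), "left_outer")

    // If the lookup side is small enough to fit in memory, a broadcast
    // hint avoids the shuffle: each task joins against a local copy.
    val broadcasted = big.join(broadcast(lookup), Seq("id"), "left_outer")

    broadcasted.write.mode("overwrite").parquet("hdfs:///data/out")
    spark.stop()
  }
}
```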
Walid.

Re: Optimizing joins in Talend Spark batch jobs

Hi,
Have you tried allocating more memory to the Job execution by setting the -Xmx Java VM parameter, and storing the data on disk instead of in memory in the tMap?
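For illustration, here is a rough Spark-side sketch of both ideas (the memory value and path are placeholders; in a Talend Spark Job the executor memory is usually set in the Spark configuration of the Run view, and the tMap offers a store-on-disk option for lookup data):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object MemorySketch {
  def main(args: Array[String]): Unit = {
    // Raising executor memory is the cluster-side analogue of a bigger -Xmx.
    val spark = SparkSession.builder()
      .appName("memory-sketch")
      .config("spark.executor.memory", "4g") // placeholder value
      .getOrCreate()

    val df = spark.read.parquet("hdfs:///data/big") // placeholder path

    // MEMORY_AND_DISK lets partitions spill to disk instead of failing
    // with an OutOfMemoryError when they do not fit in memory.
    df.persist(StorageLevel.MEMORY_AND_DISK)
    println(df.count()) // materialize once so later steps reuse the cache

    spark.stop()
  }
}
```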
Best regards
Sabrina
