
Optimizing joins in Talend Spark Batch Jobs

I'm having an issue with a Spark Batch Job in Talend. The Job is simple: it reads a file from HDFS, performs a left outer join with another HDFS file on a single key (using a tMap), and writes the result back to HDFS. What I've noticed is odd: the resulting Spark job performs a cogroup at one point and tries to gather the entire dataset in a single task before writing it to HDFS. If the dataset is big enough, this results in an OutOfMemoryError: Java heap space.
Why does Talend handle joins this way? Is it possible to optimize it?
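
To make the symptom concrete, here is a minimal pure-Python sketch of a cogroup-style left outer join. It is not Talend's generated code and does not use Spark; the function name, the partition count, and the sample data are all made up for illustration. It only shows the general mechanism: a cogroup routes every record with the same key to the same task, so if most rows share one key value (an assumption, since the key distribution isn't described above), one task ends up holding almost everything.

```python
from collections import defaultdict


def cogroup_left_outer_join(left, right, num_partitions):
    """Left outer join of (key, value) pairs, cogroup style (illustrative only).

    Returns the joined rows and the number of input records each task received.
    """
    # Shuffle step: records from both inputs are routed by hash(key), so all
    # records sharing a key land in the same partition (task).
    partitions = [(defaultdict(list), defaultdict(list))
                  for _ in range(num_partitions)]
    load = [0] * num_partitions
    for side, rows in enumerate((left, right)):
        for key, value in rows:
            p = hash(key) % num_partitions
            partitions[p][side][key].append(value)
            load[p] += 1

    # Join step: each task pairs the grouped values for its own keys.
    joined = []
    for left_groups, right_groups in partitions:
        for key, left_values in left_groups.items():
            # Left outer join: keep left rows even without a match.
            right_values = right_groups.get(key) or [None]
            for lv in left_values:
                for rv in right_values:
                    joined.append((key, lv, rv))
    return joined, load


# Hypothetical data: a small lookup and two main inputs of 1000 rows each.
lookup = [(k, f"ref{k}") for k in range(4)]
even = [(i % 4, i) for i in range(1000)]   # keys spread evenly
skewed = [(0, i) for i in range(1000)]     # every row has the same key

_, even_load = cogroup_left_outer_join(even, lookup, 4)
_, skewed_load = cogroup_left_outer_join(skewed, lookup, 4)
print(even_load)    # [251, 251, 251, 251]
print(skewed_load)  # [1001, 1, 1, 1]
```

If the real job looks like the skewed case, checking how the join key is distributed (including empty or default values) would be a reasonable first step before tuning memory.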

Re: Optimizing joins in Talend Spark Batch Jobs

Have you tried allocating more memory to the Job execution by setting the -Xmx Java VM parameter, and storing the data on disk instead of in memory in the tMap?
Best regards