Pre-load “Spark.Yarn.Jar” to speed up execution of a Spark Job

Talend Version (Required)       6.3.1

Additional Versions

Cloudera 5.7 & HDP 2.5 and above

Spark 1.6 and above

Product (Required) Talend Data Fabric
Component (Required) Spark Job Setting

Problem Description

Each run of a Spark Job uploads a talend-spark-assembly-x.x.x-SNAPSHOT-hadoopx.x.x-cdhx.x.x.jar package, which affects HDFS performance and takes up HDFS space. Running a Spark Job from Talend Studio can be very time consuming, especially for Spark Jobs that interact with a Hadoop server installed in a remote location. Manually uploading the spark-assembly-xxx jar (over 100 MB) to a target HDFS directory, using PuTTY or SSH, speeds up execution considerably.

Problem root cause

The jars below are large (over 100 MB), which affects performance:

HDP 2.5: talend-spark-assembly-1.6.0-cdh5.7.0-hadoop2.6.0-with-hive-07122016.jar

CDH 5.8: talend-spark-assembly-1.6.0-cdh5.8.1-hadoop2.6.0-cdh5.8.1-with-hive.jar

Solution or Workaround
  1. Upload the jar directly into any target HDFS location of the Big Data server:
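The screenshot that originally illustrated this step is no longer available. A minimal command-line sketch, assuming the jar is available on a machine with an HDFS client and using a hypothetical target directory /user/talend/jars (substitute your own HDFS location):

```shell
# Hypothetical target directory; not from the original article
hdfs dfs -mkdir -p /user/talend/jars

# Upload the Talend Spark assembly jar directly into HDFS
hdfs dfs -put talend-spark-assembly-1.6.0-cdh5.8.1-hadoop2.6.0-cdh5.8.1-with-hive.jar /user/talend/jars/
```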




  2. Copy the jar from a local machine to a folder on the Hadoop server:
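The screenshots for this copy are no longer available; a minimal sketch using scp, with a hypothetical hostname and destination folder:

```shell
# Hostname, user, and destination folder are examples, not from the original article
scp talend-spark-assembly-1.6.0-cdh5.8.1-hadoop2.6.0-cdh5.8.1-with-hive.jar \
    talend@hadoop-server:/tmp/
```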




    Copying in progress:




    Copying complete:




  3. Move the uploaded jar to an HDFS directory by issuing the following command on the Hadoop server side:
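The command shown in the original screenshot is missing. A likely equivalent, assuming the jar was copied to /tmp on the Hadoop server in the previous step and a hypothetical HDFS directory /user/talend/jars:

```shell
# Put the jar from the Hadoop server's local filesystem into HDFS
hdfs dfs -put /tmp/talend-spark-assembly-1.6.0-cdh5.8.1-hadoop2.6.0-cdh5.8.1-with-hive.jar /user/talend/jars/
```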




  4. Check the HDFS location for the jar, and add the Spark Yarn Jar in the Advanced Properties of the Job settings:
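A quick check from the Hadoop server side, assuming a hypothetical HDFS directory /user/talend/jars:

```shell
# List the directory and confirm the jar and its size are present
hdfs dfs -ls /user/talend/jars
```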




  5. Define the location of spark.yarn.jar in the Advanced Properties of the Spark configuration:
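The property value itself is not shown in the surviving text. In the Spark configuration's Advanced properties, spark.yarn.jar takes the full HDFS URI of the uploaded jar; a sketch with a hypothetical NameNode address and path:

```
spark.yarn.jar = hdfs://namenode-host:8020/user/talend/jars/talend-spark-assembly-1.6.0-cdh5.8.1-hadoop2.6.0-cdh5.8.1-with-hive.jar
```

With this property set, Spark Jobs reference the pre-loaded jar instead of re-uploading the assembly on every run.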



JIRA ticket number N/A
Version history
Revision #: 9 of 9
Last update: 10-11-2017 10:26 AM



If the Job Server is running on an Edge Node, does the jar need to be loaded on the Edge Node or on the Hive target server?




It should be the Hive target server, regardless of whether the task runs on the Job Server or from the Studio.