Pre-load “spark.yarn.jar” to speed up execution of a Spark Job

Talend Version (Required)       6.3.1

Summary

Each run of a Spark Job uploads the talend-spark-assembly-x.x.x-SNAPSHOT-hadoopx.x.x-cdhx.x.x.jar package to HDFS, which slows Job execution and consumes HDFS space.
Additional Versions

Cloudera 5.7 & HDP2.5 and above

Spark 1.6 and above

Product (Required) Talend Data Fabric
Component (Required) Spark Job Setting
Problem Description

Running a Spark Job from Talend Studio can be very time consuming, especially for Spark Jobs that interact with a Hadoop server installed in a remote location. Uploading the spark-assembly jar (over 100 MB) manually, using PuTTY or SSH, to a target HDFS directory speeds up execution significantly.

Problem root cause

The jars below are large, so uploading them on every run degrades performance.

HDP 2.5: talend-spark-assembly-1.6.0-cdh5.7.0-hadoop2.6.0-with-hive-07122016.jar

CDH 5.8: talend-spark-assembly-1.6.0-cdh5.8.1-hadoop2.6.0-cdh5.8.1-with-hive.jar

Solution or Workaround
  1. Upload the jar directly into any target HDFS location of the Big Data server:

     big_Jar.PNG

  2. Copy the jar from a local folder to a folder on the Hadoop server:

     copyJarToHadoopServer.PNG

     Copying in progress:

     Copying.PNG

     Copying complete:

     completeCopy.PNG

  3. Move the uploaded jar to an HDFS directory by issuing the following command on the Hadoop server side:

     hdfs_put.PNG

  4. Check the HDFS location for the jar, and add the Spark Yarn Jar in the Advanced Properties of the Job settings:

     retrieveSchema.PNG

  5. Define the location of spark.yarn.jar in the Advanced Properties of the Spark configuration:

     spark.yarn.jar.PNG
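Once the jar is in HDFS, the step-5 Advanced Property can be written as a single key/value pair. The NameNode host, port, and HDFS path below are hypothetical placeholders matching the example commands above, not values from the original article:

```
Property: spark.yarn.jar
Value:    "hdfs://namenode-host:8020/user/talend/lib/talend-spark-assembly-1.6.0-cdh5.8.1-hadoop2.6.0-cdh5.8.1-with-hive.jar"
```

With this property set, Spark resolves the assembly jar from HDFS instead of uploading it on each Job submission.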
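Steps 2 and 3 above can be sketched as the shell commands below. The hostname (`hadoop-server`), user name, and paths (`/tmp`, `/user/talend/lib`) are illustrative assumptions, not values from this article; substitute the ones for your cluster and jar version.

```shell
# Copy the assembly jar from the local machine to the Hadoop server
# (hostname, user, and paths are example placeholders).
scp talend-spark-assembly-1.6.0-cdh5.8.1-hadoop2.6.0-cdh5.8.1-with-hive.jar \
    user@hadoop-server:/tmp/

# On the Hadoop server: put the jar into the target HDFS directory.
hdfs dfs -mkdir -p /user/talend/lib
hdfs dfs -put /tmp/talend-spark-assembly-1.6.0-cdh5.8.1-hadoop2.6.0-cdh5.8.1-with-hive.jar \
    /user/talend/lib/

# Verify that the jar is now in HDFS.
hdfs dfs -ls /user/talend/lib
```

These commands require a live Hadoop cluster and the `hdfs` client on the server's PATH; they only need to be run once per jar version.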

JIRA ticket number N/A
Version history
Revision #:
9 of 9
Last update:
10-11-2017 10:26 AM
Comments
sjain

Good Article 

sjain

If the Job Server is running on an Edge Node, does the jar need to be loaded on the Edge Node or on the Hive target server?

boyuan

sjain,

It should be the Hive target server, no matter whether the task runs on the Job Server or from Studio.