Pre-load “spark.yarn.jar” to speed up execution of a Spark Job

Talend Version: 6.3.1

Summary

Each run of a Spark Job uploads a talend-spark-assembly-x.x.x-SNAPSHOT-hadoopx.x.x-cdhx.x.x.jar package, which degrades HDFS performance and takes up HDFS space.
Additional Versions

Cloudera CDH 5.7 and HDP 2.5 or later

Spark 1.6 or later

Product: Talend Data Fabric
Component: Spark Job settings
Problem Description

Running a Spark Job from Talend Studio can be very time consuming, especially for Spark Jobs that interact with a Hadoop cluster installed in a remote location. Uploading the spark-assembly-xxx jar (over 100 MB) to a target HDFS directory once, manually over PuTTY or SSH, speeds up execution considerably.

Problem root cause

The jars below are large, and re-uploading them on every run hurts performance:

HDP 2.5: talend-spark-assembly-1.6.0-cdh5.7.0-hadoop2.6.0-with-hive-07122016.jar

CDH 5.8: talend-spark-assembly-1.6.0-cdh5.8.1-hadoop2.6.0-cdh5.8.1-with-hive.jar

Solution or Workaround
  1. Upload the jar directly into any target HDFS location of the Big Data server:

     [Screenshot: big_Jar.PNG]

  2. Move the jar from the local machine to a folder on the Hadoop server (a sample command follows the screenshots):

     [Screenshot: copyJarToHadoopServer.PNG]

     Copying in progress:

     [Screenshot: Copying.PNG]

     Copying complete:

     [Screenshot: completeCopy.PNG]
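
     The copy can be done with scp from the machine that holds the jar. This is only a sketch; the user name, host name, and target directory below are placeholders, not values from this article:

         scp talend-spark-assembly-1.6.0-cdh5.8.1-hadoop2.6.0-cdh5.8.1-with-hive.jar user@hadoop-server:/tmp/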

     

  3. Move the uploaded jar into an HDFS directory by issuing an hdfs put command on the Hadoop server side (a sample form follows the screenshot):

     [Screenshot: hdfs_put.PNG]
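
     Assuming the jar was copied to /tmp on the Hadoop server and the HDFS target directory is /user/talend/lib (both placeholders), a typical form is:

         hdfs dfs -mkdir -p /user/talend/lib
         hdfs dfs -put /tmp/talend-spark-assembly-1.6.0-cdh5.8.1-hadoop2.6.0-cdh5.8.1-with-hive.jar /user/talend/lib/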

     

  4. Check the HDFS location for the jar (a sample verification command follows the screenshot), and add the Spark Yarn jar in the Advanced properties of the Job settings:

     [Screenshot: retrieveSchema.PNG]
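
     From the command line, the upload can be verified by listing the placeholder directory used above:

         hdfs dfs -ls /user/talend/lib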

     

  5. Define the location of spark.yarn.jar in the Advanced properties of the Spark configuration (a sample entry follows the screenshot):

     [Screenshot: spark.yarn.jar.PNG]
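
     The Advanced properties table takes a property/value pair. A sketch of the entry, reusing the placeholder HDFS path from the steps above (substitute your own NameNode host, port, and directory):

         Property: "spark.yarn.jar"
         Value:    "hdfs://namenode:8020/user/talend/lib/talend-spark-assembly-1.6.0-cdh5.8.1-hadoop2.6.0-cdh5.8.1-with-hive.jar"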

JIRA ticket number: N/A
Comments
sjain

Good Article 

sjain

If the Job Server is running on an edge node, does the jar need to be loaded on the edge node or on the Hive target server?

boyuan

sjain,

It should be the Hive target server, regardless of whether the task runs on the Job Server or in Studio.

ahsiao

Hi Bo,

Just curious, do you happen to have the list of jars that are used for Spark 2 by any chance? It seems that the talend-spark-assembly jar is no longer used for Spark 2, so this approach doesn't quite work in new Spark 2 Jobs.

Thanks,

Arthur

ahsiao

To answer my own question: yes, it does work with Spark 2.x, and there is another article that can be used: https://help.talend.com/reader/x~UhXb9twRsML6YgvpiaSw/Z69iDRnhIEiMm~~mhJec2g

Also, to find the CDH jars, look in /etc/spark/conf and open the classpath.txt file; it lists the locations and names of all the jars Spark uses. Load all of those jars into HDFS, then run the Job as described in the link above. A sketch of that upload follows.
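
For Spark 2, a bulk upload along those lines might look like the sketch below; the HDFS target directory is a placeholder:

    # List the jars Spark uses on a CDH node (classpath.txt holds one jar path per line)
    cat /etc/spark/conf/classpath.txt

    # Upload every listed jar into a placeholder HDFS directory
    hdfs dfs -mkdir -p /user/talend/spark2-lib
    while read -r jar; do
        hdfs dfs -put "$jar" /user/talend/spark2-lib/
    done < /etc/spark/conf/classpath.txt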