Orchestrate Spark Batch Job using subjobs (tRunJob) in parent job

Hi everyone, 

I am migrating some data processing tasks to Talend and I cannot get a Spark Job to run from a tRunJob component.

Requirements:

- List some HDFS Paths

- Execute a Spark Job

Environment:

HDP 2.6.4

Spark 2.2.0

Talend Big Data 7.0.x

Java 1.8

This is a simple test scenario:

spark orch talend.JPG

Every subjob runs successfully separately.

List_Files_HDFS is a conventional Talend job that lists the HDFS files in a directory; the results are stored in an output buffer.

Demo_Paul is a simplified Spark Job that processes some JSON files in the provided directories. For now I have removed the parameter from tRunJob_3 to keep the scenario simple.

This is the actual configuration of the subjob:

spark orch talend 2.JPG

I have to check "Use an independent process to run subjob" because the Spark Job requires it; otherwise I get an error.

The problem is that when I execute the parent job from the Studio, I get this error:

Exception in component tRunJob_3 (hdfs_to_Teradata_Orchestration)
java.lang.RuntimeException: Child job returns 1. It doesn't terminate normally.
Error: Could not find or load main class com.demo_paul_0_1.Demo_Paul

I know the problem is related to this independent execution, but in theory it should work.

Any hint would be appreciated. 

 

Paul
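
For readers unfamiliar with the message above: "Could not find or load main class" is printed by the java launcher itself when the class named on the command line is not on the classpath of the newly started JVM, and the launcher then exits with code 1. The sketch below reproduces that outside Talend. It assumes that an independent-process subjob is started as a separate JVM with its own -cp value; the class ChildJvmDemo, the method runChild and the empty temporary directory are hypothetical names used only for illustration, not anything generated by the Studio.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.file.Files;

public class ChildJvmDemo {

    /**
     * Starts a separate JVM with the given classpath and main class,
     * appends everything the child prints to 'output', and returns
     * the child's exit code.
     */
    public static int runChild(String classpath, String mainClass, StringBuilder output) {
        // Use the same Java installation that runs this program.
        String javaBin = System.getProperty("java.home")
                + File.separator + "bin" + File.separator + "java";
        ProcessBuilder pb = new ProcessBuilder(javaBin, "-cp", classpath, mainClass);
        pb.redirectErrorStream(true); // merge stderr into stdout
        try {
            Process child = pb.start();
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(child.getInputStream()))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    output.append(line).append(System.lineSeparator());
                }
            }
            return child.waitFor();
        } catch (IOException e) {
            throw new IllegalStateException("Could not start child JVM", e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for child JVM", e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical classpath: an empty directory that cannot contain the job class.
        String emptyDir = Files.createTempDirectory("empty-classpath").toString();
        StringBuilder output = new StringBuilder();
        int exitCode = runChild(emptyDir, "com.demo_paul_0_1.Demo_Paul", output);
        System.out.println("Child exit code: " + exitCode);
        System.out.print(output);
    }
}
```

If the child's classpath does not contain the job class, the parent only sees a non-zero exit code plus the launcher's message, which matches the shape of the tRunJob_3 error above. The first thing to inspect is therefore the classpath the child process is started with.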


Accepted Solutions

Re: Orchestrate Spark Batch Job using subjobs (tRunJob) in parent job

Hi Nikhil,

I finally found the problem. We had installed Talend Studio on the C: drive but created our workspaces on the D: drive. At first I thought it was related to blank spaces in the workspace path, but in the end I moved the workspace to the C: drive, the same drive as the Studio, and it worked.

BR.

Paul
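
A quick way to spot this kind of setup is to compare the drive letters of the Studio installation directory and the workspace directory. The sketch below does this with plain string handling so that it behaves the same on any operating system; the class DriveCheck, its methods and both example paths are hypothetical and only illustrate the check. It does not explain why the Studio failed across drives.

```java
import java.util.Locale;

public class DriveCheck {

    /**
     * Returns the upper-cased drive letter of a Windows path such as
     * "D:\\Talend\\workspace", or null if the path has no drive prefix.
     */
    public static String driveOf(String windowsPath) {
        if (windowsPath != null
                && windowsPath.length() >= 2
                && Character.isLetter(windowsPath.charAt(0))
                && windowsPath.charAt(1) == ':') {
            return windowsPath.substring(0, 1).toUpperCase(Locale.ROOT);
        }
        return null;
    }

    /** True only if both paths carry a drive letter and the letters match. */
    public static boolean sameDrive(String firstPath, String secondPath) {
        String first = driveOf(firstPath);
        String second = driveOf(secondPath);
        return first != null && first.equals(second);
    }

    public static void main(String[] args) {
        // Hypothetical locations; replace with the real Studio and workspace paths.
        String studioDir = "C:\\Talend\\Studio";
        String workspaceDir = "D:\\Talend\\workspace";

        if (sameDrive(studioDir, workspaceDir)) {
            System.out.println("Studio and workspace are on the same drive.");
        } else {
            System.out.println("Studio and workspace are on different drives: "
                    + driveOf(studioDir) + " vs " + driveOf(workspaceDir));
        }
    }
}
```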



All Replies

Re: Orchestrate Spark Batch Job using subjobs (tRunJob) in parent job

Hi,

Are you using a Spark job or a standard DI job for the orchestration process?

Warm Regards,

Nikhil Thampi


Re: Orchestrate Spark Batch Job using subjobs (tRunJob) in parent job

I am using a standard DI job for the orchestration.

Re: Orchestrate Spark Batch Job using subjobs (tRunJob) in parent job

If you are running the subjob as an independent process, it should not throw the error.

Could you please avoid the iteration and pass only one value to the demo subjob (which still needs to be executed as an independent process) and see whether the error persists?

Warm Regards,

Nikhil Thampi


Re: Orchestrate Spark Batch Job using subjobs (tRunJob) in parent job

Unfortunately that did not help; the error persists:

spark orch talend 3.JPG


Re: Orchestrate Spark Batch Job using subjobs (tRunJob) in parent job

Hi nikhilthampi,

I tried compiling the job, then verified that the subjob JAR is listed in the .classpath and that the corresponding .jar is in the project folder.

I executed the .bat file and it worked, so the issue is in the Studio (Talend Big Data 7.0).

Could you please advise me how to verify the classpath in the Studio, or how to debug this issue?

OS: Windows Server 2012
Java: JDK 1.8

Best regards, Paul
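
One generic way to check a classpath, independent of Talend, is a small diagnostic class that lists the entries that do not exist on disk and tests whether a given class can be loaded. This is only a sketch: ClasspathCheck and its methods are hypothetical names, and it assumes you can obtain the -cp value the failing run uses (for example from the generated .bat file or from the command line of the running process) and start this class with that same value.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class ClasspathCheck {

    /** Returns the classpath entries that do not exist on disk. */
    public static List<String> missingEntries(String classpath) {
        List<String> missing = new ArrayList<>();
        for (String entry : classpath.split(File.pathSeparator)) {
            if (entry.isEmpty()) {
                continue; // ignore empty segments such as a trailing separator
            }
            // For wildcard entries like "lib/*", check the directory itself.
            String onDisk = entry.endsWith("*")
                    ? entry.substring(0, entry.length() - 1)
                    : entry;
            if (!new File(onDisk).exists()) {
                missing.add(entry);
            }
        }
        return missing;
    }

    /** True if the class can be found by this program's class loader. */
    public static boolean isLoadable(String className) {
        try {
            // 'false' means: locate the class but do not run its static initializers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Class name taken from the error message; pass another one as the first argument.
        String className = args.length > 0 ? args[0] : "com.demo_paul_0_1.Demo_Paul";
        String classpath = System.getProperty("java.class.path");

        for (String entry : missingEntries(classpath)) {
            System.out.println("Missing classpath entry: " + entry);
        }
        System.out.println(className + " loadable: " + isLoadable(className));
    }
}
```

A missing entry that points at the workspace, or a job class that is not loadable, would narrow the problem down to how the classpath for the independent process is built.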

Re: Orchestrate Spark Batch Job using subjobs (tRunJob) in parent job

Hi,

Could you please try the link below to run the job in debug mode?

https://help.talend.com/reader/bYQYL576UebCj3V76mUoeA/0dAZ~fdGwtsNaUX4YaWe7w

Warm Regards,

Nikhil Thampi

