In a Spark Job, a tJavaRow component gets a NULL value from a context variable

Problem Description

You have a Spark Job with two subflows linked by an OnSubjobOk trigger.

  • In the first subflow, a tJavaRow component stores a value in a context variable.
  • In the second subflow, another tJavaRow component reads the value of that context variable.


At run time, the second tJavaRow component finds that the context variable is NULL instead of holding the value set in the first tJavaRow.
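
For illustration, here is a minimal sketch of the code the two tJavaRow components might contain. It assumes a hypothetical String context variable named myValue and a single String column named value; neither name comes from the original Job.

    // tJavaRow in the first subflow: store the incoming value
    // in the context variable.
    context.myValue = input_row.value;
    output_row.value = input_row.value;

    // tJavaRow in the second subflow: read the context variable back.
    // Because this code runs on an executor JVM, context.myValue is
    // NULL here rather than the value set in the first subflow.
    output_row.value = input_row.value;
    System.out.println("context.myValue = " + context.myValue); // prints null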


Root Cause

In a standard Job, you can update a context variable in one subflow and read it in another because all of the code runs in the same Java Virtual Machine (JVM). In a Spark Job, however, this does not work: execution is distributed across several executors, each running in its own JVM.


When a function passed to a Spark operation is executed on a remote cluster node, it works on separate copies of all the variables used in the function. These variables are copied to each machine, and no updates to the variables on the remote machine are propagated back to the driver program.
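
This is the classic Spark closure pitfall, and it can be reproduced outside Talend. The following is a minimal, self-contained Java sketch using plain Spark rather than Talend-generated code; the class and variable names are illustrative.

    import java.util.Arrays;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class ClosureDemo {
        // Lives in the driver JVM; each executor loads its own copy of the class.
        static int counter = 0;

        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext("local[*]", "ClosureDemo");
            JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4));

            // The lambda is shipped to the executors, which increment their
            // own copies of counter; nothing is propagated back to the driver.
            numbers.foreach(n -> counter += n);

            // On a real cluster this prints 0, because the driver's copy was
            // never updated. (In local mode, the driver and executors share
            // one JVM, which can mask the problem.)
            System.out.println("counter = " + counter);
            sc.stop();
        }
    }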


Solution

Instead of passing data between subflows through context variables, use the tCacheOut and tCacheIn components: persist the data with tCacheOut in the first subflow, then read it back with tCacheIn in the second.
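
tCacheOut and tCacheIn are configured in the Studio rather than hand-coded, but the idea behind them can be sketched in plain Spark Java. This is an illustration of the concept, not the components' actual generated code: the first subflow persists the data it produces, and the second subflow reads the persisted data back instead of relying on a driver-side variable.

    import java.util.Arrays;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.storage.StorageLevel;

    public class CacheHandoff {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext("local[*]", "CacheHandoff");

            // "First subflow": compute the values and persist them,
            // which is roughly what tCacheOut does.
            JavaRDD<String> produced = sc
                    .parallelize(Arrays.asList("a", "b", "c"))
                    .map(String::toUpperCase);
            produced.persist(StorageLevel.MEMORY_AND_DISK());

            // "Second subflow": read the cached data back, which is
            // roughly what tCacheIn does.
            produced.foreach(v -> System.out.println("got " + v));
            sc.stop();
        }
    }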
