Talend 6.4.1 Spark Job Fails with 'IllegalStateException: unread block data' Exception

Talend Version: 6.4.1

Summary: A Talend 6.4.1 Job fails due to a Java version incompatibility between the generated Job code and the Java version used by the Hadoop cluster.

Product: Talend Big Data
Component: Spark

Problem Description

When executing a Talend 6.4.1 Spark Job against Cloudera 5.10 (or 5.8), it fails with a java.lang.IllegalStateException: unread block data exception.

The Java exception stack shows the following:

org.apache.spark.scheduler.TaskSetManager - Lost task 1.0 in stage 0.0 (TID 1, n0.talend.fr, executor 1): java.lang.IllegalStateException: unread block data
at java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2421)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1382)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
...
Caused by: java.lang.IllegalStateException: unread block data
at java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2421)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1382)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:115)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:222)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

The Spark Job consists of a simple flow:

tFixedFlowInput -> tMap -> tLogRow

and is executed against a "Cloudera QuickStart CDH 5.8 VM" environment.

When the same Job is generated with Talend Studio 6.3.1, it runs fine.

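For context, the same flow expressed directly against the Spark Java API would look roughly like the sketch below. This is illustrative only and is not the code that Talend Studio generates; the class name FixedFlowSketch and the sample values are made up.

import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

// Illustrative sketch only -- mirrors the flow: fixed input
// (tFixedFlowInput) -> mapping (tMap) -> console output (tLogRow).
// Written without lambdas so it also compiles at the 1.7 compliance
// level discussed in the workaround below.
public class FixedFlowSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("FixedFlowSketch");
        JavaSparkContext sc = new JavaSparkContext(conf);
        try {
            JavaRDD<String> input = sc.parallelize(Arrays.asList("a", "b", "c"));
            JavaRDD<String> mapped = input.map(new Function<String, String>() {
                @Override
                public String call(String value) {
                    return value.toUpperCase();
                }
            });
            List<String> rows = mapped.collect();
            for (String row : rows) {
                System.out.println(row);
            }
        } finally {
            sc.stop();
        }
    }
}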

Problem Root Cause

This issue can occur when there is a Java version incompatibility between the generated Job code and the Java version used by the Hadoop cluster where the Job runs. In this specific case, the Job code in Talend Studio 6.4.1 is compiled with JDK 8, while the "Cloudera QuickStart CDH 5.8 VM" environment uses Java 7.

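You can confirm the incompatibility by inspecting the class-file header of the compiled Job classes: the major version is 51 for Java 7 and 52 for Java 8. Below is a minimal sketch; ClassVersionCheck is a made-up name, and the class file to inspect would be extracted from the built Job archive.

import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

// Minimal sketch: reads the header of a compiled .class file and
// prints its major version (51 = Java 7, 52 = Java 8).
public class ClassVersionCheck {
    public static void main(String[] args) throws IOException {
        DataInputStream in = new DataInputStream(new FileInputStream(args[0]));
        try {
            if (in.readInt() != 0xCAFEBABE) {   // class-file magic number
                System.err.println("Not a class file: " + args[0]);
                return;
            }
            in.readUnsignedShort();             // minor version (ignored)
            int major = in.readUnsignedShort(); // major version
            System.out.println(args[0] + ": major version " + major);
        } finally {
            in.close();
        }
    }
}

A major version of 52 in the Job classes, combined with a Java 7 runtime on the cluster, matches the failure described above.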

 

Note also that a default Cloudera Express (5.10) installation uses JDK 7 for the Cloudera Hadoop cluster. So this issue can potentially occur with Talend 6.4.1 against other clusters as well, not only the "Cloudera QuickStart" VM.

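To verify which Java runtime a cluster node actually uses, run java -version on the node, or compile and run a trivial probe such as the following sketch (RuntimeVersionCheck is a made-up name):

// Minimal sketch: prints the version and home of the JVM it runs on.
public class RuntimeVersionCheck {
    public static void main(String[] args) {
        System.out.println("java.version = " + System.getProperty("java.version"));
        System.out.println("java.home    = " + System.getProperty("java.home"));
    }
}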

Solution or Workaround

One solution consists of setting the compiler compliance level to 1.7 in Talend Studio.

  1. Set the Compiler compliance level to 1.7 in Talend Studio:

    1. Click Window > Preferences.
    2. Select Java > Compiler.
    3. Under JDK Compliance, set the Compiler compliance level to 1.7.
  2. Rebuild the Job.

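Studio applies this compliance setting itself when it compiles the generated Job code, so no manual compilation is needed. Purely to illustrate what the setting means, the following sketch compiles a source file at the same 1.7 source/target level through the standard javax.tools API (CompileAt17 and MyJob.java are hypothetical names):

import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

// Illustrative sketch: compiles a source file at the 1.7 source/target
// level, the programmatic equivalent of javac -source 1.7 -target 1.7.
public class CompileAt17 {
    public static void main(String[] args) {
        JavaCompiler javac = ToolProvider.getSystemJavaCompiler();
        int status = javac.run(null, null, null,
                "-source", "1.7", "-target", "1.7", "MyJob.java");
        System.out.println(status == 0 ? "compiled for Java 7" : "compilation failed");
    }
}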

     

Another solution consists of changing the JDK version used by the Hadoop cluster where the Job runs. For example, in a Cloudera 5.10 installation, the default JDK is 1.7 (found in /usr/java/jdk1.7.0_67-cloudera). Using Cloudera Manager, you can configure a Custom Java Home Location that points to a different Java version than the default (such as JDK 8 rather than JDK 7). For details on upgrading the JDK used by Cloudera, refer to the Cloudera documentation.

JIRA ticket number: TBD-5477