Spark SQL query with many small tables under the broadcast threshold leading to 'java.lang.OutOfMemoryError'

Problem Description


A Spark SQL query in a Big Data Spark Job that joins many small tables under the broadcast threshold may fail with the following exception in the execution log.


Exception in thread "broadcast-hash-join-4" java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:535)
at org.apache.spark.sql.execution.joins.UnsafeHashedRelation$.apply(HashedRelation.scala:403)
at org.apache.spark.sql.execution.joins.HashedRelation$.apply(HashedRelation.scala:128)
...
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

...
Exception in thread "broadcast-hash-join-1" java.lang.OutOfMemoryError: Java heap space
at com.esotericsoftware.kryo.util.IdentityObjectIntMap.resize(IdentityObjectIntMap.java:410)
at com.esotericsoftware.kryo.util.IdentityObjectIntMap.put(IdentityObjectIntMap.java:106)
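
The default broadcast threshold (spark.sql.autoBroadcastJoinThreshold) is 10 MB, so every table below that size is planned as a broadcast hash join. When one query joins many such small tables, Spark builds an in-memory hash table for each of them on separate broadcast-hash-join threads (visible in the log above), and the combined memory pressure can exhaust the heap. As a hypothetical illustration (table names and join keys are made up), a query shape like the following can trigger the problem:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("broadcast-oom-illustration").getOrCreate()

// A fact table plus many small dimension tables, each under the 10 MB threshold.
val smallDimensions = Seq("dim_country", "dim_product", "dim_currency", "dim_channel") // ... and many more

val fact = spark.table("fact_sales")

// Each join below is planned as a BroadcastHashJoin because the dimension tables
// are under spark.sql.autoBroadcastJoinThreshold; the hash tables for all of them
// are built in memory while the query runs.
val joined = smallDimensions.foldLeft(fact) { (df, dimName) =>
  df.join(spark.table(dimName), Seq(dimName + "_key")) // hypothetical join key columns
}

joined.count()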


Root Cause

This is an Apache Spark issue; see SPARK-12358 (https://issues.apache.org/jira/browse/SPARK-12358) for details.


Solution

To resolve this issue, perform the following steps:
  1. On your Spark Job, select the Spark Configuration tab.

  2. In the Advanced properties section, add the parameter "spark.sql.autoBroadcastJoinThreshold" and set its value to "-1". This disables automatic broadcast joins, so Spark no longer builds an in-memory hash table for each small table (see the configuration sketch after these steps).

  3. Regenerate the Job in TAC.

  4. Run the Job again.
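
The Advanced properties setting is the Talend way of passing the property to the Spark configuration. If you need to apply the same workaround outside the Studio, for example in standalone Spark code on Spark 2.x or later, a minimal sketch looks like this (the application name is hypothetical):

import org.apache.spark.sql.SparkSession

// Setting spark.sql.autoBroadcastJoinThreshold to -1 disables automatic broadcast
// joins, so Spark falls back to sort-merge joins instead of building an in-memory
// hash table for every small table.
val spark = SparkSession.builder()
  .appName("my-spark-job") // hypothetical application name
  .config("spark.sql.autoBroadcastJoinThreshold", "-1")
  .getOrCreate()

// Equivalent spark-submit form:
//   spark-submit --conf spark.sql.autoBroadcastJoinThreshold=-1 ...

The trade-off is that joins on small tables become shuffle-based, which can be slower, but the query no longer has to hold all of the broadcast hash tables in memory at once.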
