Problem Description


A Spark SQL query in a Big Data Spark Job that joins many small tables, each below the broadcast threshold, may fail with exceptions such as the following in the execution log.


Exception in thread "broadcast-hash-join-4" java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(
at org.apache.spark.sql.execution.joins.UnsafeHashedRelation$.apply(HashedRelation.scala:403)
at org.apache.spark.sql.execution.joins.HashedRelation$.apply(HashedRelation.scala:128)
at java.util.concurrent.ThreadPoolExecutor$

Exception in thread "broadcast-hash-join-1" java.lang.OutOfMemoryError: Java heap space
at com.esotericsoftware.kryo.util.IdentityObjectIntMap.resize(
at com.esotericsoftware.kryo.util.IdentityObjectIntMap.put(


Root Cause

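This is an Apache Spark issue. Spark builds an in-memory hash relation for every table it decides to broadcast, and with many such tables the heap can run out.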
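The effect of the threshold can be illustrated with a rough model. The sketch below is not Spark's planner code: the function names and the job with 40 lookup tables of 8 MB each are hypothetical, and it counts raw table sizes only, ignoring the extra memory needed to build each hash relation. The 10 MB value is Spark's documented default for spark.sql.autoBroadcastJoinThreshold.

```python
# Illustrative model only; this is not Spark's actual planner code.

# Spark's default for spark.sql.autoBroadcastJoinThreshold is 10 MB.
DEFAULT_THRESHOLD_BYTES = 10 * 1024 * 1024


def is_auto_broadcast(size_bytes, threshold_bytes=DEFAULT_THRESHOLD_BYTES):
    """A table qualifies when its estimated size is within the threshold.

    With a threshold of -1 no non-negative size can qualify, which is
    why that value switches automatic broadcasting off.
    """
    return 0 <= size_bytes <= threshold_bytes


def total_broadcast_bytes(table_sizes, threshold_bytes=DEFAULT_THRESHOLD_BYTES):
    """Sum the raw sizes of all tables that would be broadcast."""
    return sum(s for s in table_sizes if is_auto_broadcast(s, threshold_bytes))


# Hypothetical job: 40 lookup tables of 8 MB each, all under the default.
sizes = [8 * 1024 * 1024] * 40

print(total_broadcast_bytes(sizes))      # 335544320 (320 MB of raw data)
print(total_broadcast_bytes(sizes, -1))  # 0 (nothing is broadcast)
```

Each table is small on its own, but together they account for 320 MB of raw data that must be held in memory, before any hashing or serialization overhead.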



Solution

To resolve this issue, perform the following steps:
  1. On your Spark Job, select the Spark Configuration tab.

  2. In the Advanced properties section, add the property "spark.sql.autoBroadcastJoinThreshold" and set its value to "-1", which disables automatic broadcast joins.

  3. Regenerate the Job in Talend Administration Center (TAC).

  4. Run the Job again.
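For a Spark application built outside the Studio, the same property can be set when the session is created. The following is a minimal sketch, assuming a PySpark-style builder whose config(key, value) method returns the builder for chaining; the helper name with_auto_broadcast_disabled is hypothetical and not part of any library.

```python
# Property name and value taken from the steps above.
BROADCAST_THRESHOLD_KEY = "spark.sql.autoBroadcastJoinThreshold"


def with_auto_broadcast_disabled(builder):
    """Return the builder with automatic broadcast joins turned off.

    `builder` is assumed to behave like pyspark.sql.SparkSession.builder,
    whose config(key, value) call returns the builder itself.
    """
    return builder.config(BROADCAST_THRESHOLD_KEY, "-1")
```

With PySpark installed, the helper could be used as with_auto_broadcast_disabled(SparkSession.builder).getOrCreate(). Note that with broadcasting disabled Spark falls back to other join strategies, so some joins may run more slowly than before.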

