A Spark Batch Job fails to execute when using a custom query in the tOracleInput component

Problem Description

A Spark Job is built with a tOracleInput component and a tFileOutputDelimited component. The tOracleInput component uses a complex query that includes sub-queries, as shown below:

"select a.*,TRUST_OFCR_CD from (select * from acct_portfolio where ACCT_PP_CD NOT IN ( SELECT RULE_CD FROM DOWNSTRM_FILT_RULE_MSTR WHERE DOWNSTRM_SYS_CD ='ALL' AND INCL_EXCL_IND='E' AND SRC_SYS_CD='US' AND RULE_TYP_CD = 'PP_CD' MINUS SELECT RULE_CD FROM DOWNSTRM_FILT_RULE_MSTR WHERE DOWNSTRM_SYS_CD ='DMS' AND INCL_EXCL_IND='I' AND SRC_SYS_CD='US' AND RULE_TYP_CD = 'PP_CD) a left outer join ACCT_TRUST_DETL apt on a.acct_key=apt.acct_key"

 

When running the Job, the following error is observed:

java.lang.RuntimeException: [6.14] failure: ``)'' expected but identifier RULE_CD found
( SELECT RULE_CD
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:36)
at org.apache.spark.sql.catalyst.DefaultParserDialect.parse(ParserDialect.scala:67)

Root Cause

This issue is caused by a limitation of Spark 1.6. In this use case, the Hadoop cluster runs Spark 1.6, and Spark SQL does not support sub-queries used as WHERE-clause predicates (such as the NOT IN sub-query above) before version 2.0. The parser fails as soon as it reaches the opening parenthesis of that sub-query, which is exactly what the error message shows.
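Note that Spark 1.6 does accept a sub-query in the FROM clause (a derived table); it is only the sub-query inside the WHERE predicate that it cannot parse. A minimal illustration of the difference, reusing the table names from the query above:

-- Parses under Spark 1.6: the sub-query is a derived table in the FROM clause
SELECT * FROM (SELECT * FROM acct_portfolio) a

-- Fails under Spark 1.6 with the parser error above; supported from Spark 2.0
SELECT * FROM acct_portfolio
WHERE ACCT_PP_CD NOT IN (SELECT RULE_CD FROM DOWNSTRM_FILT_RULE_MSTR)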

Solution

There are two possible solutions:

  1. Use joins instead of sub-queries in the SQL query (a sample rewrite is sketched after this list).

  2. Upgrade to Spark 2.0+.
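For the first option, the NOT IN sub-query can be emulated with a left outer join plus an IS NULL filter, and the MINUS with a self join on the rules table. The following is a minimal sketch of such a rewrite, reusing the table and column names from the original query; it assumes RULE_CD is never NULL in the exclusion rule set (with NULLs present, NOT IN and a join-based rewrite behave differently), and it should be adapted before being pasted into the tOracleInput Query field:

SELECT a.*, TRUST_OFCR_CD
FROM acct_portfolio a
LEFT OUTER JOIN
    (-- 'ALL'/'E' rules with no matching 'DMS'/'I' rule (emulates MINUS)
     SELECT e.RULE_CD
     FROM DOWNSTRM_FILT_RULE_MSTR e
     LEFT OUTER JOIN DOWNSTRM_FILT_RULE_MSTR i
         ON i.RULE_CD = e.RULE_CD
        AND i.DOWNSTRM_SYS_CD = 'DMS' AND i.INCL_EXCL_IND = 'I'
        AND i.SRC_SYS_CD = 'US' AND i.RULE_TYP_CD = 'PP_CD'
     WHERE e.DOWNSTRM_SYS_CD = 'ALL' AND e.INCL_EXCL_IND = 'E'
       AND e.SRC_SYS_CD = 'US' AND e.RULE_TYP_CD = 'PP_CD'
       AND i.RULE_CD IS NULL) excl
    ON a.ACCT_PP_CD = excl.RULE_CD
LEFT OUTER JOIN ACCT_TRUST_DETL apt
    ON a.acct_key = apt.acct_key
WHERE excl.RULE_CD IS NULL -- keeps only accounts outside the exclusion set (emulates NOT IN)

After this rewrite, the only remaining sub-query is the derived table in the FROM clause, which the Spark 1.6 parser accepts.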
