Five Stars

tHiveInput is not working in batch jobs (Spark) in a Kerberos cluster.

We are trying to use Talend batch (Spark) jobs to access Hive in a Kerberos cluster, but we are getting the error "Can't get Master Kerberos principal for use as renewer".

 

Using the standard (non-Spark) jobs in Talend, we are able to access Hive without any issue.

 

Sample Batch Job:

 

[Screenshot: talend_issues.PNG]

Below are our observations:

 

  1. When running Spark jobs, Talend is able to connect to the Hive metastore and validate the syntax; for example, if I provide a wrong table name it returns "table not found".
  2. When we run select count(*) on a table that has no data, it returns "NULL", but if some data is present in HDFS for the table, it fails with the error "Can't get Master Kerberos principal for use as renewer".

I am not sure exactly what is causing the token problem. Could someone help us find the root cause?

One more thing to add: if I read/write to HDFS (instead of Hive) using Spark batch jobs, it works. So the only problem is with Hive and Kerberos.

Moderator

Re: tHiveInput is not working in batch jobs (spark) in a kerberos cluster.

Hello,

The error indicates that you are trying to access a kerberized resource with an unsecured client configuration.

In the batch job, did you select the Kerberos configuration in the tHDFSConfiguration component?
Also, where does the configuration come from: Repository or Built-In?

Best regards

Sabrina

--
Don't forget to give kudos when a reply is helpful and click Accept the solution when you think you're good with it.
Four Stars

Re: tHiveInput is not working in batch jobs (spark) in a kerberos cluster.

We are encountering the same issue. To answer your questions (in our case): yes, we have selected the Kerberos configuration in tHDFSConfiguration, and the configuration is Built-In.

 

Regards,

Erick

Five Stars

Re: tHiveInput is not working in batch jobs (spark) in a kerberos cluster.

Hi Sabrina,

 

Yes, we have already selected Kerberos in the HDFS configuration, and reading/writing HDFS with batch jobs works. The only problem occurs when selecting data from Hive with the tHiveInput component, and only in batch jobs, not in standard jobs.

 

Please clarify: does Talend use /etc/spark/conf/ at all for batch jobs?

 

Thanks

 

Four Stars

Re: tHiveInput is not working in batch jobs (spark) in a kerberos cluster.

Hi @rspwilliam, did you find a solution for this? We have a similar issue...

Moderator

Re: tHiveInput is not working in batch jobs (spark) in a kerberos cluster.

Hello,

There is a JIRA issue about Spark jobs failing when importing a Hive schema using HDP 2.5 with Kerberos authentication:

https://jira.talendforge.org/browse/TBD-4470

Let us know if this is the case you are encountering.

 

Best regards

Sabrina

Four Stars

Re: tHiveInput is not working in batch jobs (spark) in a kerberos cluster.

Hi @xdshi,

We are using Talend 6.4.1 and HDP 2.6.2. I think my problem is different. My Talend job uses the same components as described in this post, in an HA + kerberized cluster context. The Spark job works fine until we add a tHiveInput/tHiveOutput component. We then get a recurring error that loops until the Hive sub-job created by YARN times out:

[WARN ]: org.apache.hadoop.ipc.Client - Exception encountered while connecting to the server : org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]

We tried to add -Djava.security.krb5.conf as described here (http://coheigea.blogspot.fr/2017/09/configuring-kerberos-for-hive-in-talend.html), but this didn't help.
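For reference, a sketch of what we tried (the property names are the standard Spark driver/AM JVM-option settings; /etc/krb5.conf is the default path on our nodes, so adjust it for your cluster):

```properties
# Passed under Spark configuration > Advanced properties in the Talend job
spark.driver.extraJavaOptions   -Djava.security.krb5.conf=/etc/krb5.conf
spark.yarn.am.extraJavaOptions  -Djava.security.krb5.conf=/etc/krb5.conf
```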

Hoping that someone here might have suggestions about other Talend or Spark properties to make this work.

I will share something here if Talend support gets me a solution. Meanwhile, any other suggestions are more than welcome.

Regards,

 

Five Stars

Re: tHiveInput is not working in batch jobs (spark) in a kerberos cluster.

Hi,

 

We have created a workaround: we put all the Hadoop configs into a fat jar and load it into Talend using tLibrary. The configurations are then not taken from the Talend HDFS configuration / Talend Hive configuration.

 

Neither we nor Talend Support have found a proper solution yet.

 

Five Stars

Re: tHiveInput is not working in batch jobs (spark) in a kerberos cluster.

@yham Please let me know whether it helps or not.

Four Stars

Re: tHiveInput is not working in batch jobs (spark) in a kerberos cluster.

Thank you @rspwilliam. Could you please give an example of the contents of the jar file? Does it contain the Hadoop XML files, or a properties file (key=value)? How do you reference those properties in your job? Do you use "spark.yarn.am.extraJavaOptions" or "spark.driver.extraJavaOptions"?

I appreciate your help.

Five Stars

Re: tHiveInput is not working in batch jobs (spark) in a kerberos cluster.

@yham

It contains all the site XMLs required to connect to the cluster, not key=value properties. You need to build a proper Maven jar and add all the Hadoop files.

 

In your Talend job, add the jar using tLibrary, which will include all the config files when the job is built and deployed. We haven't passed any extraJavaOptions parameter.
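If it helps anyone reproduce this: a jar is just a zip archive with the files at its root, so the config bundle can also be built with any zip tool instead of Maven. A minimal sketch in Python (the file list is an assumption based on a typical HDP client; take the actual XMLs from your cluster, e.g. /etc/hadoop/conf and /etc/hive/conf):

```python
import zipfile

# Site XMLs a Spark job needs to reach a kerberized cluster
# (typical HDP client set; adjust for your distribution).
site_files = [
    "core-site.xml",
    "hdfs-site.xml",
    "yarn-site.xml",
    "mapred-site.xml",
    "hive-site.xml",
]

def build_config_jar(jar_path, files):
    """Bundle config files at the root of a jar (a jar is a zip archive)."""
    with zipfile.ZipFile(jar_path, "w") as jar:
        for f in files:
            # arcname drops any directory prefix so the *-site.xml files
            # sit at the jar root, where classpath lookup expects them.
            jar.write(f, arcname=f.split("/")[-1])
```

The resulting jar is then attached to the job with tLibrary as described above.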

Four Stars

Re: tHiveInput is not working in batch jobs (spark) in a kerberos cluster.

Unfortunately this didn't solve our case. I generated a simple jar file with all the XMLs:

- core-site.xml

- hdfs-site.xml

- hive-site.xml

- krb5.conf

- mapred-site.xml

- tez-site.xml

- yarn-site.xml

It seems that the files were taken into account, because I had to delete a Python property in core-site.xml that generated an error.

Thank you anyway 🙂