A Talend Job with Hadoop source/target fails intermittently with a "Connection reset by peer" error

Symptoms

Hadoop uses Kerberos authentication. A Talend Job that uses a Hadoop source or target (HDFS or a Hive table) fails intermittently with the following error:

 

[FATAL]: bda_prod.socmed_main_initial_0_1.SOCMED_Main_Initial - tRunJob_5 Child job returns 1. It doesn't terminate normally.
Exception in component tHDFSConnection_1
java.io.IOException: Login failure for hdfs/xxx@xxx from keytab /home/talenduser/hdfs.keytab: javax.security.auth.login.LoginException: Connection reset by peer (connect failed)
	at org.apache.hadoop.security.UserGroupInformation.loginUserFromKeytab(UserGroupInformation.java:962)
	at bda_prod.count_source_line_number_0_1.Count_Source_Line_Number.tHDFSConnection_1Process(Count_Source_Line_Number.java:1911)
	at bda_prod.count_source_line_number_0_1.Count_Source_Line_Number.tMysqlConnection_1Process(Count_Source_Line_Number.java:1757)
	at bda_prod.count_source_line_number_0_1.Count_Source_Line_Number.runJobInTOS(Count_Source_Line_Number.java:5760)
	at bda_prod.count_source_line_number_0_1.Count_Source_Line_Number.runJob(Count_Source_Line_Number.java:5054)
	at bda_prod.berita_table_data_hive_0_1.Berita_Table_Data_Hive.tRunJob_1Process(Berita_Table_Data_Hive.java:3984)
	at bda_prod.berita_table_data_hive_0_1.Berita_Table_Data_Hive.tHiveConnection_1Process(Berita_Table_Data_Hive.java:3367)
	at bda_prod.berita_table_data_hive_0_1.Berita_Table_Data_Hive.tHDFSConnection_1Process(Berita_Table_Data_Hive.java:3046)
	at bda_prod.berita_table_data_hive_0_1.Berita_Table_Data_Hive.tMysqlConnection_1Process(Berita_Table_Data_Hive.java:2829)
	at bda_prod.berita_table_data_hive_0_1.Berita_Table_Data_Hive.tJava_1Process(Berita_Table_Data_Hive.java:2595)
	at bda_prod.berita_table_data_hive_0_1.Berita_Table_Data_Hive.runJobInTOS(Berita_Table_Data_Hive.java:10785)
	at bda_prod.berita_table_data_hive_0_1.Berita_Table_Data_Hive.runJob(Berita_Table_Data_Hive.java:9765)
	at bda_prod.berita_table_data_main_initial_0_1.BERITA_TABLE_DATA_Main_Initial.tRunJob_1Process(BERITA_TABLE_DATA_Main_Initial.java:4492)
	at bda_prod.berita_table_data_main_initial_0_1.BERITA_TABLE_DATA_Main_Initial.tWaitForFile_1Process(BERITA_TABLE_DATA_Main_Initial.java:3728)
	at bda_prod.berita_table_data_main_initial_0_1.BERITA_TABLE_DATA_Main_Initial.runJobInTOS(BERITA_TABLE_DATA_Main_Initial.java:7805)
	at bda_prod.berita_table_data_main_initial_0_1.BERITA_TABLE_DATA_Main_Initial.main(BERITA_TABLE_DATA_Main_Initial.java:6526)
Caused by: javax.security.auth.login.LoginException: Connection reset by peer (connect failed)
	at com.sun.security.auth.module.Krb5LoginModule.attemptAuthentication(Krb5LoginModule.java:808)
	at com.sun.security.auth.module.Krb5LoginModule.login(Krb5LoginModule.java:617)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at javax.security.auth.login.LoginContext.invoke(LoginContext.java:755)
	at javax.security.auth.login.LoginContext.access$000(LoginContext.java:195)
	at javax.security.auth.login.LoginContext$4.run(LoginContext.java:682)
	at javax.security.auth.login.LoginContext$4.run(LoginContext.java:680)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:680)
	at javax.security.auth.login.LoginContext.login(LoginContext.java:587)
	at org.apache.hadoop.security.UserGroupInformation.loginUserFromKeytab(UserGroupInformation.java:953)
	... 15 more
Caused by: java.net.SocketException: Connection reset by peer (connect failed)
	at java.net.PlainSocketImpl.socketConnect(Native Method)
	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
	at java.net.Socket.connect(Socket.java:589)
	at sun.security.krb5.internal.TCPClient.<init>(NetClient.java:63)
	at sun.security.krb5.internal.NetClient.getInstance(NetClient.java:43)
	at sun.security.krb5.KdcComm$KdcCommunication.run(KdcComm.java:393)
	at sun.security.krb5.KdcComm$KdcCommunication.run(KdcComm.java:364)
	at java.security.AccessController.doPrivileged(Native Method)
	at sun.security.krb5.KdcComm.send(KdcComm.java:348)
	at sun.security.krb5.KdcComm.sendIfPossible(KdcComm.java:253)
	at sun.security.krb5.KdcComm.send(KdcComm.java:229)
	at sun.security.krb5.KdcComm.send(KdcComm.java:200)
	at sun.security.krb5.KrbAsReqBuilder.send(KrbAsReqBuilder.java:316)
	at sun.security.krb5.KrbAsReqBuilder.action(KrbAsReqBuilder.java:361)
	at com.sun.security.auth.module.Krb5LoginModule.attemptAuthentication(Krb5LoginModule.java:776)
	... 28 more

 

Diagnosis

The stack trace shows that the Job uses the HDFS keytab to request a Kerberos ticket for the HDFS service of Hadoop, and that the request fails with a "Connection reset by peer" error while the Kerberos client opens a TCP socket to the KDC (sun.security.krb5.internal.TCPClient). This error normally means that the remote side dropped the connection, typically because of a network issue such as a timeout, or because the KDC server is unresponsive, for example under high load.
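Because the failure is intermittent, it can help to probe the KDC repeatedly from the machine that runs the Job and count how often the TCP connection fails. The following is a minimal sketch, not part of Talend or Hadoop: the class name KdcProbe and its methods are hypothetical, and the host name in the comment is a placeholder. Port 88 is the standard Kerberos port; the actual KDC host and port for your realm are defined in krb5.conf.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper: checks whether a TCP connection to the KDC can be opened.
final class KdcProbe {

    // Returns true if a TCP connection to host:port opens within timeoutMillis.
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Resets, refusals, timeouts, and unknown hosts all end up here.
            System.err.println("Connect to " + host + ":" + port + " failed: " + e.getMessage());
            return false;
        }
    }

    // Probes the endpoint several times and returns the number of failed attempts.
    static int countFailures(String host, int port, int attempts,
                             int timeoutMillis, long pauseMillis) throws InterruptedException {
        int failures = 0;
        for (int i = 0; i < attempts; i++) {
            if (!canConnect(host, port, timeoutMillis)) {
                failures++;
            }
            if (i < attempts - 1) {
                Thread.sleep(pauseMillis); // pause between probes
            }
        }
        return failures;
    }
}

// Example call (placeholder host name):
// int failed = KdcProbe.countFailures("kdc.example.com", 88, 20, 3000, 500L);
```

A non-zero failure count from the Job host, especially at the times the Job fails, points to the network path or the KDC rather than to the Job itself.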

 

Solution

This problem occurs when communication between the Hadoop client and the Kerberos (KDC) server fails at the time of authentication. The issue is not related to Talend. Investigate the network connectivity to the KDC and the load on the KDC server.
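Until the underlying network or KDC problem is fixed, one possible mitigation, where the login call is under your control (for example in custom Java code), is to retry the Kerberos login a few times with an increasing delay. The following is a minimal sketch under that assumption. LoginRetry and withRetry are hypothetical names, not a Talend or Hadoop API, and the sketch does not change how the tHDFSConnection component logs in.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical helper: retries an action that fails with an IOException.
final class LoginRetry {

    // Runs the action up to maxAttempts times, doubling the delay after each
    // IOException. The last IOException is rethrown when all attempts fail.
    static <T> T withRetry(Callable<T> action, int maxAttempts, long initialDelayMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delay = initialDelayMillis;
        for (int attempt = 1; ; attempt++) {
            try {
                return action.call();
            } catch (IOException e) {
                if (attempt >= maxAttempts) {
                    throw e; // give up and surface the original error
                }
                System.err.println("Attempt " + attempt + " failed: " + e.getMessage()
                        + "; retrying in " + delay + " ms");
                Thread.sleep(delay);
                delay *= 2;
            }
        }
    }
}

// Possible usage with Hadoop on the classpath (principal and keytabPath are placeholders):
// LoginRetry.withRetry(() -> {
//     org.apache.hadoop.security.UserGroupInformation.loginUserFromKeytab(principal, keytabPath);
//     return null;
// }, 3, 1000L);
```

Retrying only hides short outages. If the logins keep failing, the cause still has to be found on the network or on the KDC server.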

Version history
Revision #: 5 of 5
Last update: 01-22-2018 01:06 PM