Connection to HDFS on tHDFSInput takes a long time (HDFS HA)

One Star

Hello,
I have a kerberized High Availability (HA) HDFS cluster and followed the instructions here for setting up my connection to the HA cluster:

When configuring the Authentication settings in a tHDFSInput component, after changing to the "_HOST" value the job ran much slower than when using the fully qualified host name.
Here is the log file from the job in question.
Starting job test1 at 11:36 10/09/2014.
connecting to socket on port 3931
connected
: org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
: org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
: org.apache.hadoop.util.Shell - Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
    at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
    at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
    at org.apache.hadoop.util.Shell.<clinit>(Shell.java:326)
    at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:76)
    at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93)
    at org.apache.hadoop.security.Groups.<init>(Groups.java:77)
    at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240)
    at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255)
    at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:232)
    at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:718)
    at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:703)
    at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:605)
    at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:2554)
    at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:2546)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2412)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167)
    at bigdatademoproject.test1_0_1.test1.tHDFSInput_1Process(test1.java:508)
    at bigdatademoproject.test1_0_1.test1.runJobInTOS(test1.java:903)
    at bigdatademoproject.test1_0_1.test1.main(test1.java:762)
# Kickstart file automatically generated by anaconda.
#version=DEVEL
install
cdrom
lang en_US.UTF-8
keyboard us
network --onboot no --device em1 --bootproto dhcp --noipv6
network --onboot no --device em2 --bootproto dhcp --noipv6
network --onboot no --device em3 --bootproto dhcp --noipv6
network --onboot no --device em4 --bootproto dhcp --noipv6
firewall --service=ssh
authconfig --enableshadow --passalgo=sha512
selinux --enforcing
timezone --utc America/New_York
bootloader --location=mbr --driveorder=sdb --append="crashkernel=auto rhgb quiet"
# The following is the partition information you requested
# Note that any partitions you deleted are not expressed
# here so unless you clear all partitions first, this is
# not guaranteed to work
#clearpart --none

#part /boot --fstype=ext4 --asprimary --size=200
#part / --fstype=ext4 --asprimary --size=200000
#part /var --fstype=ext4 --grow --asprimary --size=200
#part swap --size=64000

%packages
@base
@client-mgmt-tools
@console-internet
@core
@debugging
@development
@directory-client
@emacs
@hardware-monitoring
@java-platform
@large-systems
@legacy-unix
@network-file-system-client
@network-tools
@performance
@perl-runtime
@system-management-snmp
@security-tools
@server-platform
@server-platform-devel
@server-policy
@system-admin-tools
pax
python-dmidecode
oddjob
sgpio
device-mapper-persistent-data
systemtap-client
jpackage-utils
samba-winbind
certmonger
pam_krb5
krb5-workstation
tcp_wrappers
perl-DBD-SQLite
p11-kit-trust
%end
disconnected
Job test1 ended at 11:42 10/09/2014.


We have Hadoop installed on Red Hat Enterprise Linux, and Talend Studio runs on a Windows 7 desktop. The problem is:

When configuring the Authentication settings in a tHDFSInput component, after changing to the "_HOST" value the job ran much slower than when using the fully qualified host name.

I'm simply using one component (tHDFSInput) in two different ways:

1. Connecting directly to the Active NameNode (which may not always work, since I have High Availability in place).
2. Connecting to the nameservice URL using the "_HOST" value.

Configuration 1 reads the small text file stored on HDFS at the expected speed, but it does not work reliably in an HDFS cluster configured for High Availability, because the Active NameNode can change. Configuration 2 works with High Availability; however, it takes an unexpectedly long time to make the connection and read the same small text file.
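For context, the client-side HA settings I'm describing correspond to an hdfs-site.xml along these lines (a minimal sketch; the nameservice name `mycluster`, the host names, and the realm are placeholders, not my actual values):

```xml
<configuration>
  <!-- Logical nameservice used in place of a single NameNode host -->
  <property>
    <name>dfs.nameservices</name>
    <value>mycluster</value>
  </property>
  <property>
    <name>dfs.ha.namenodes.mycluster</name>
    <value>nn1,nn2</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn1</name>
    <value>namenode1.example.com:8020</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn2</name>
    <value>namenode2.example.com:8020</value>
  </property>
  <!-- Client-side failover between the two NameNodes -->
  <property>
    <name>dfs.client.failover.proxy.provider.mycluster</name>
    <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  </property>
  <!-- Kerberos principal; _HOST is substituted with each NameNode's FQDN at runtime -->
  <property>
    <name>dfs.namenode.kerberos.principal</name>
    <value>hdfs/_HOST@EXAMPLE.COM</value>
  </property>
</configuration>
```

With settings like these, the NameNode URI in tHDFSInput becomes hdfs://mycluster rather than a single host:port.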

Any assistance would be greatly appreciated.
Community Manager

Re: Connection to HDFS on tHDFSInput takes a long time (HDFS HA)

Hi
There is a KB article about enabling the HDFS High Availability feature in the Studio; I hope it can help you.
: org.apache.hadoop.util.Shell - Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.

It is a well-known error; see jlolling's comment in this topic.
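As a rough sketch of the usual workaround (the install path C:\hadoop is an assumption; winutils.exe must sit in its bin\ subfolder), the lookup can be satisfied by pointing hadoop.home.dir at such a directory before any Hadoop class is loaded:

```java
public class WinutilsWorkaround {
    public static void main(String[] args) {
        // Hypothetical path: a folder containing bin\winutils.exe (adjust to your setup).
        // This must run BEFORE any Hadoop client call (e.g. FileSystem.get), because
        // org.apache.hadoop.util.Shell reads the property in a static initializer.
        System.setProperty("hadoop.home.dir", "C:\\hadoop");
        System.out.println(System.getProperty("hadoop.home.dir")); // prints C:\hadoop
    }
}
```

Note that on Windows this only silences the warning path; the job itself does not need winutils.exe to read from HDFS.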
Best regards
Shong
----------------------------------------------------------
Talend | Data Agility for Modern Business