tImpalaConnection - HDFS settings, Kerberos & High Availability?

One Star

tImpalaConnection - HDFS settings, Kerberos & High Availability?

Hello - I've had success using tHiveConnection & tHiveCreateTable to connect to my HiveServer2 and create an external table with data already stored on my Kerberized, HA-HDFS cluster.  I'm trying to do the same using Impala components in TOS-BD.
I am unable to find where I enter the following in the tImpalaConnection settings (I looked in Basic & Advanced):

Namenode principal - this is needed for Kerberos access to my HDFS cluster and was easy to find in tHiveConnection, but is nowhere to be found in tImpalaConnection
Namenode URI - also needed for access to my HDFS cluster; again easy to find in tHiveConnection but missing from tImpalaConnection
"Hadoop properties" where I can set the HA configuration - this is needed so that Talend can reach the Active Namenode via the nameservice URI
I'm using Talend Open Studio for Big Data version 5.6.0.20141024_1545.
I may just need to add these settings somewhere else - but I don't see where they would be (I'm familiar with the setup of the tHDFS and tHive components). Screenshots of what I'm seeing and comparing (tHive components vs. tImpala components) are attached.
Also - I think some users may need to be able to set Kerberos principals for the ResourceManager in the Impala components if they have Impala set up to work through YARN. I don't have that setup, but I believe Impala can be configured that way.
Thank you!
One Star

Re: tImpalaConnection - HDFS settings, Kerberos & High Availability?

Anyone have any ideas?
One Star

Re: tImpalaConnection - HDFS settings, Kerberos & High Availability?

Anyone? I must be doing something wrong - I would appreciate even a snide comment. :-)
Six Stars

Re: tImpalaConnection - HDFS settings, Kerberos & High Availability?

Well, I don't want to be snide, but I don't quite understand what you are asking for. :-) Impala is its own thing (no MapReduce, no JobTracker needed), and as far as I know you really only talk to impalad through JDBC (same driver as Hive) or ODBC. The connection to the namenodes is configured at the Impala daemon level (hdfs-site.xml, maybe?), and for HA you would need to use the Impala HA proxy (to load-balance across the impalad instances) plus the nameservice settings in hdfs-site.xml. Are you not able to connect to Impala with the components? Can you post the error?
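To make the "same driver as Hive" part concrete, here is a rough JDBC sketch of how I would expect a client (which is all your Talend job is here) to talk to impalad. The hostname, database, and Kerberos principal are placeholders for your environment; 21050 is the usual impalad port for the HiveServer2 protocol, and the connection assumes you already have a valid Kerberos ticket:

// Rough sketch: connecting to impalad with the Hive JDBC driver.
// Hostname, database and principal are placeholders; a valid Kerberos ticket is assumed.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ImpalaJdbcSketch {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        String url = "jdbc:hive2://impalad-host.example.com:21050/default;"
                + "principal=impala/_HOST@EXAMPLE.COM"; // drop the principal on an unsecured cluster
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}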
One Star

Re: tImpalaConnection - HDFS settings, Kerberos & High Availability?

Hi jholman - Thank you for your feedback and input.
I am mostly asking about the Talend side. I guess I was confused as to why the tHiveConnection component needs the Namenode URI at all - and if Talend has a reason for needing it with tHiveConnection, then I'd expect it to need it for tImpalaConnection as well.
Likewise, tHiveConnection has Hadoop Properties under its Advanced settings, which is where I'd put the HA configuration settings from hdfs-site.xml. tImpalaCreateTable also has URI path and Set file location parameters - I may be confused as to the purpose they serve.
When defining external tables, I believe both Hive and Impala need to interact (in some form) with HDFS, so they will need the HA configuration if Talend is acting as the client - correct? Or does that get passed on to a cluster-side service (like HiveServer2 and impalad)?
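To be concrete, the end result I'm after from tImpalaCreateTable is roughly the DDL below (the table name and HDFS path are made up; "nameservice1" is my HA nameservice, and the connection details are the same placeholders as before). The part I'm unsure about is whether the LOCATION gets resolved by Talend or by impalad:

// Made-up example of the DDL I want tImpalaCreateTable to end up issuing.
// The point is that LOCATION uses the HA nameservice URI, not a single namenode host.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateExternalTableSketch {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://impalad-host.example.com:21050/default"); // placeholder host
             Statement stmt = conn.createStatement()) {
            stmt.execute("CREATE EXTERNAL TABLE IF NOT EXISTS web_logs (line STRING) "
                    + "LOCATION 'hdfs://nameservice1/data/web_logs'");
        }
    }
}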
I appreciate your help.
Six Stars

Re: tImpalaConnection - HDFS settings, Kerberos & High Availability?

In the case of Impala it really is more of a client-server model, like HiveServer2, where impalad does all the work. Whichever impalad instance you connect to becomes the coordinator node: it parses the query, distributes execution among the nodes, collects the results from the other nodes, and returns them to the JDBC client (in this case your Talend job).
In the case of Hive, you are either connecting to an existing Hive server (Standalone, via JDBC), in which case you are a client, or instantiating your own Hive instance whose job is to parse the SQL, convert it into MapReduce jobs (Embedded), and execute them via the JobTracker, which makes you the server (sort of). If you are using the embedded mode, you will need the namenode and JobTracker configuration so your instance can run the mappers and reducers. If you are using standalone, I don't think those values do anything, since the server instance you are connecting to should already have them configured; all they will do is set the system properties ("mapred.job.tracker", "fs.default.name") in your client process.
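In other words (this is just a sketch of my understanding, not Talend's actual generated code), in standalone mode filling in those fields amounts to something like this on the client side, and the HiveServer2 you connect to never sees it:

// Sketch of my understanding only, not Talend's generated code.
// The values are placeholders; in a real job they would come from the component settings.
public class ClientSidePropsSketch {
    public static void main(String[] args) {
        System.setProperty("fs.default.name", "hdfs://nameservice1");
        System.setProperty("mapred.job.tracker", "jobtracker.example.com:8021");
        // These only affect this JVM; a standalone HiveServer2 already has its own config.
        System.out.println("fs.default.name = " + System.getProperty("fs.default.name"));
        System.out.println("mapred.job.tracker = " + System.getProperty("mapred.job.tracker"));
    }
}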
One Star

Re: tImpalaConnection - HDFS settings, Kerberos & High Availability?

So if I am using HiveServer2, I'd never need to enter values for the HDFS settings in the tHive connection - just like I don't need to enter any HDFS info for the tImpala components. Is that correct?
I'm running some tests with queries and will test using tImpalaCreateTable.
Thank you for the help.
Six Stars

Re: tImpalaConnection - HDFS settings, Kerberos & High Availability?

Yep. Your cluster can have one or more HiveServer2 instances (using ZooKeeper for DTC and locking), and all you should need is the HiveServer2 JDBC driver or something like Beeline to connect to an instance. I did forget there is also the Thrift API you could use, but that's probably not super relevant here. The only other values you would be interested in are the Kerberos settings.
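For the Kerberos settings, the client side boils down to having a valid ticket (kinit, or a keytab login) plus the server's principal in the JDBC URL. A rough sketch, assuming a keytab on the machine running the job - the principals, keytab path, host, and port are all placeholders, and whether the driver picks up the login this way can depend on your Hadoop/Hive versions:

// Sketch of the Kerberos pieces only; principals, keytab path and host are placeholders.
import java.sql.Connection;
import java.sql.DriverManager;
import org.apache.hadoop.security.UserGroupInformation;

public class KerberizedHs2Sketch {
    public static void main(String[] args) throws Exception {
        // Log in from a keytab instead of relying on an existing kinit ticket cache.
        UserGroupInformation.loginUserFromKeytab(
                "talend@EXAMPLE.COM", "/etc/security/keytabs/talend.keytab");

        Class.forName("org.apache.hive.jdbc.HiveDriver");
        String url = "jdbc:hive2://hiveserver2.example.com:10000/default;"
                + "principal=hive/_HOST@EXAMPLE.COM"; // the server's principal, not yours
        try (Connection conn = DriverManager.getConnection(url)) {
            System.out.println("Connected: " + !conn.isClosed());
        }
    }
}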