Using JDBC components to connect to Kerberos-enabled Impala

Overview

This article explains how to use JDBC components to connect to a Kerberos-enabled Impala service. The same Job design also works for Impala configurations that do not use Kerberos.

 

Environment

  • Talend Studio 6.5.1, though you can utilize this design on any Talend 6.x environment
  • Cloudera 5.13.2

 

Configuring the connection

Configure the Hadoop Cluster connection in metadata in Studio.

  1. Right-click Hadoop Cluster, then click Create Hadoop Cluster.
  2. Select the distribution and version of your Hadoop cluster, then select Retrieve configuration from Ambari or Cloudera.

    import.png

     

  3. Enter the Cloudera Manager URL, along with your user credentials, then click Next. Cluster information will be retrieved and populated.

    update.png

     

  4. Once the cluster information is populated, click Check Services to ensure that Studio can successfully connect to the cluster.

    check.png

     

 

Creating the Job

According to the Cloudera documentation, you can connect to Impala over JDBC with either of two drivers: the Cloudera JDBC driver or the Hive JDBC driver. This article uses the Talend JDBC components with the Cloudera Impala JDBC driver. Start by creating a Job that writes a file to HDFS, creates a table in Impala, loads the file into that table, and then reads the table back.

  1. Right-click Job Designs, click Create Standard Job, and give it a name.
  2. In the Designer, add a tPreJob component. You will attach your HDFS and JDBC connections to this component with On Component Ok connections, so the connections are opened first and can be reused throughout your Job.
  3. For the tHDFSConnection, drag the HDFS connection from the Hadoop Cluster connection created above to the canvas, then select tHDFSConnection as the component to create.
  4. Add a tJDBCConnection, then connect it to the tHDFSConnection component using an On Component Ok connection.
  5. Download the Impala JDBC driver that you will use in your tJDBCConnection component from Cloudera's website:

    https://www.cloudera.com/downloads/connectors/impala/jdbc/2-6-3.html

    Download the version of the driver that is compatible with the version of Impala you have on the cluster.

  6. After you download and unzip the driver archive, the folder for the JDBC4 version of the Impala driver should contain the following libraries:

    libraries.png

     

  7. Add all of these libraries to the Driver JAR section of your tJDBCConnection component, as shown below:

    driverJar.png

     

  8. With the driver libraries added, configure the JDBC URL for the connection.

    To identify the URL string that you need, follow the Cloudera instructions for structuring the URL by authentication method in the Cloudera JDBC Driver for Impala documentation. Because you are configuring your components to connect to a Kerberized Impala, the JDBC URL to use is:

    url.png

    This URL specifies:

    • The Impala daemon host
    • The port (usually 21050)
    • The authentication mechanism (Kerberos)
    • The Kerberos realm of the cluster
    • The fully qualified domain name of the Impala host
    • The Kerberos Service Name of the Impala service
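
The parts of the URL listed above can be assembled as in the following minimal Java sketch. It assumes the property names AuthMech, KrbRealm, KrbHostFQDN, and KrbServiceName from the Cloudera JDBC Driver for Impala documentation; the host, realm, and service name are placeholders, and the class and method names are hypothetical. Check the property names against the documentation for the driver version you downloaded.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Sketch: assemble a Kerberos JDBC URL for the Cloudera Impala driver. */
class ImpalaKerberosUrl {

    /** Every argument is a placeholder that you replace with your cluster's values. */
    static String build(String host, int port, String realm,
                        String hostFqdn, String serviceName) {
        // LinkedHashMap keeps the properties in the order they are added.
        Map<String, String> props = new LinkedHashMap<>();
        props.put("AuthMech", "1");               // assumed: 1 selects Kerberos
        props.put("KrbRealm", realm);             // Kerberos realm of the cluster
        props.put("KrbHostFQDN", hostFqdn);       // FQDN of the Impala host
        props.put("KrbServiceName", serviceName); // Kerberos service name of Impala

        // Properties are separated from the host part, and from each other, by ';'.
        StringJoiner url = new StringJoiner(";");
        url.add("jdbc:impala://" + host + ":" + port);
        props.forEach((key, value) -> url.add(key + "=" + value));
        return url.toString();
    }

    public static void main(String[] args) {
        // Hypothetical values for illustration only.
        System.out.println(build("impalad01.example.com", 21050,
                "EXAMPLE.COM", "impalad01.example.com", "impala"));
    }
}
```

In the tJDBCConnection component, you enter the resulting string in the JDBC URL field as a quoted Java string.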

     

    If you leave the JDBC URL as shown above, the driver looks for a Kerberos ticket on whatever machine launches the Job and uses it for the connection. To control whether the driver uses a Kerberos ticket or a keytab, add one more parameter, KrbAuthType. Its value depends on what you are trying to achieve:

    • 0 when you want the driver to automatically detect the method to use
    • 1 to use keytab over JAAS
    • 2 to use a Kerberos ticket cache

     

    For this example, use 1 as the value for this property, as you want to use a JAAS configuration with keytab. So, when you are done configuring your JDBC URL, it should look like this:

    JDBCurl.png
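
As an illustration of the KrbAuthType setting, the sketch below models the three values described above and appends the chosen one to an existing URL. The enum, its constant names, and the base URL are hypothetical; only the property name KrbAuthType and the values 0, 1, and 2 come from this article.

```java
/** Sketch: the KrbAuthType values described above, appended to a base URL. */
enum KrbAuthType {
    AUTO_DETECT(0),   // driver detects the method to use
    JAAS_KEYTAB(1),   // keytab through a JAAS configuration
    TICKET_CACHE(2);  // existing Kerberos ticket cache

    final int code;

    KrbAuthType(int code) {
        this.code = code;
    }

    /** Appends ;KrbAuthType=<code> to a URL that does not already set it. */
    String appendTo(String baseUrl) {
        if (baseUrl.contains("KrbAuthType=")) {
            throw new IllegalArgumentException("KrbAuthType is already set");
        }
        String separator = baseUrl.endsWith(";") ? "" : ";";
        return baseUrl + separator + "KrbAuthType=" + code;
    }

    public static void main(String[] args) {
        // Hypothetical base URL; use the Kerberos URL for your own cluster.
        String base = "jdbc:impala://impalad01.example.com:21050;AuthMech=1";
        System.out.println(JAAS_KEYTAB.appendTo(base));
    }
}
```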

     

    The full configuration of the component should look like this:

    fullconf.png

     

    At this point, your Job should look like this:

    job.png

     

  9. Add a tRowGenerator component that uses TalendDataGenerator functions to generate 100 rows of data in two columns: one named fname, and the other lname.

    tRowGenerator.png

     

  10. Configure the tRowGenerator to write the data directly to HDFS using a tHDFSOutput component that uses the tHDFSConnection you created above, connecting to it using a main row:

    hdfs.png

     

  11. Connect the tHDFSOutput to a tJDBCRow component using an On Component Ok connection. This tJDBCRow creates the table in Impala into which you will load your data, using the JDBC connection you set up in the tJDBCConnection:

    tHDFSOutput.png
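
The exact query used in the tJDBCRow is only visible in the screenshot, so the following is a sketch of what a table-creation statement for this step could look like, written as the Java string expression a tJDBCRow query field expects. The table name demo_names, the ';' field delimiter, and the text file format are assumptions; match them to the schema and separator configured in your tHDFSOutput.

```java
/** Sketch: a CREATE TABLE statement for the two generated columns. */
class ImpalaCreateTable {

    /** The table name is a hypothetical value chosen by the caller. */
    static String createTableSql(String table) {
        return "CREATE TABLE IF NOT EXISTS " + table
                + " (fname STRING, lname STRING)"
                // Assumed delimiter: must match the separator of the HDFS file.
                + " ROW FORMAT DELIMITED FIELDS TERMINATED BY ';'"
                + " STORED AS TEXTFILE";
    }

    public static void main(String[] args) {
        System.out.println(createTableSql("demo_names"));
    }
}
```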

     

  12. Connect the tJDBCRow component to another tJDBCRow component with an On Component Ok connection. This second tJDBCRow loads the data you created into the Impala table using the JDBC connection.
  13. Set up the second tJDBCRow as follows:

    loadhdfs.png
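
One way to load an HDFS file into an Impala table is the LOAD DATA INPATH statement. The sketch below builds such a statement as a Java string; it is an assumption about what the load step runs, not a transcription of the screenshot, and the HDFS path and table name are hypothetical.

```java
/** Sketch: a LOAD DATA statement that moves an HDFS file into a table. */
class ImpalaLoadData {

    /** Both arguments are hypothetical values supplied by the caller. */
    static String loadDataSql(String hdfsPath, String table) {
        // Reject quotes rather than attempt to escape them in this sketch.
        if (hdfsPath.contains("'")) {
            throw new IllegalArgumentException("Path must not contain a single quote");
        }
        return "LOAD DATA INPATH '" + hdfsPath + "' INTO TABLE " + table;
    }

    public static void main(String[] args) {
        System.out.println(loadDataSql("/user/talend/demo_names.csv", "demo_names"));
    }
}
```

Note that LOAD DATA INPATH moves the file into the table's directory rather than copying it, so the file no longer exists at the original HDFS path afterward.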

     

  14. Connect the second tJDBCRow to a tJDBCInput component with an On Component Ok connection. The tJDBCInput reads the rows from the table above and outputs them to a tLogRow:

    readImpala.png
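
For comparison, the read step corresponds roughly to the plain JDBC code below. This is a sketch using only the standard java.sql API; the class name, table name, and row limit are hypothetical, and the Connection is expected to come from a URL like the one configured in the tJDBCConnection.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

/** Sketch: read the two columns back over an open JDBC connection. */
class ImpalaReadSketch {

    /** Builds the query; the table name and limit are caller-supplied. */
    static String selectSql(String table, int limit) {
        return "SELECT fname, lname FROM " + table + " LIMIT " + limit;
    }

    /** Runs the query and returns each row as "fname|lname". */
    static List<String> readRows(Connection conn, String table, int limit)
            throws SQLException {
        List<String> rows = new ArrayList<>();
        // try-with-resources closes the statement and result set.
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(selectSql(table, limit))) {
            while (rs.next()) {
                rows.add(rs.getString(1) + "|" + rs.getString(2));
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Printing the query needs no cluster; readRows needs a live connection.
        System.out.println(selectSql("demo_names", 10));
    }
}
```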

     

  15. The final addition to the Job is to connect a tPostJob component to a tJDBCClose component with an On Component Ok connection, so you can close the connection you opened:

    tPostJob.png

     

    The complete Job should look like this:

    jobcomplete.png

     

 

Configuring the Impala connection to work with JAAS

  1. You need a JAAS file with the information for the keytab, such as the one below, residing on the system that you will use to run the Job:

    jaas.png
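
The contents of the JAAS file are only shown in the screenshot. As a sketch, the Java helper below renders a typical keytab entry for the JDK's Krb5LoginModule. The entry name Client, the keytab path, and the principal are assumptions; confirm the entry name the driver expects in the Cloudera JDBC Driver for Impala documentation.

```java
/** Sketch: render the text of a JAAS keytab entry. */
class JaasKeytabEntry {

    /** All three arguments are placeholders for your own environment. */
    static String render(String entryName, String keytabPath, String principal) {
        return entryName + " {\n"
                + "  com.sun.security.auth.module.Krb5LoginModule required\n"
                + "  useKeyTab=true\n"                       // authenticate with a keytab
                + "  keyTab=\"" + keytabPath + "\"\n"        // path to the keytab file
                + "  principal=\"" + principal + "\"\n"      // principal stored in the keytab
                + "  doNotPrompt=true;\n"                    // never ask for a password
                + "};\n";
    }

    public static void main(String[] args) {
        // Hypothetical values for illustration only.
        System.out.print(render("Client",
                "/etc/security/keytabs/talend.keytab", "talend@EXAMPLE.COM"));
    }
}
```

Save the printed text as a file on the machine that runs the Job.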

     

  2. On the Run tab of the Job, in Advanced Settings > Use Specific JVM Arguments, add the following JVM parameter to specify the JAAS file you will use:

    VMarg.png
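
The standard Java system property for pointing the JVM at a JAAS file is java.security.auth.login.config. Assuming that is the parameter shown in the screenshot, the sketch below formats the JVM argument and checks at run time that the property points to a readable file. The class name and the file path are hypothetical.

```java
import java.nio.file.Files;
import java.nio.file.Paths;

/** Sketch: format the JAAS JVM argument and verify it at run time. */
class JaasPreflight {

    static final String PROP = "java.security.auth.login.config";

    /** Returns the argument to add under Use Specific JVM Arguments. */
    static String jvmArgument(String jaasPath) {
        return "-D" + PROP + "=" + jaasPath;
    }

    /** True only if the property is set and names a readable regular file. */
    static boolean jaasFileConfigured() {
        String value = System.getProperty(PROP);
        return value != null
                && Files.isRegularFile(Paths.get(value))
                && Files.isReadable(Paths.get(value));
    }

    public static void main(String[] args) {
        // Hypothetical path; use the location of your own JAAS file.
        System.out.println(jvmArgument("/opt/talend/conf/jaas.conf"));
        System.out.println("JAAS file configured: " + jaasFileConfigured());
    }
}
```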

     

 

Running the Job

Run the Job to confirm that it connects to the Impala daemon using Kerberos, loads data into the table, and reads it back:

run.png

 

Version history
Revision: 5 of 5
Last update: 07-20-2018 12:02 PM