Using Hive Components to Connect to HiveServer2 High Availability


This article explains how to configure your Hive components to use HiveServer2 High Availability instead of connecting to a single HiveServer2 instance.



Environment

  • Talend Studio 6.3.1
  • Hortonworks 2.6.1



  1. Set up your system for Kerberos tickets.
    1. Install the MIT Kerberos Client.

    2. Apply the JCE Unlimited Strength Jurisdiction Policy Files to your Java install.

    3. Ensure that you have a local copy of the krb5.conf file used by your cluster, placed in the following locations:

      • Linux: /etc/krb5.conf
      • Windows: C:\Windows\krb5.ini, %JAVA_HOME%\jre\lib\security\krb5.ini, C:\ProgramData\MIT\Kerberos5\krb5.ini

        Note: On Windows, the file is named krb5.ini, not krb5.conf as on Linux. If the file is not placed with the correct name, Studio will not be able to use it.

    4. Ensure that your Talend Studio has access to your KDC server and that it can resolve the KDC from your DNS.

    5. Run kinit to obtain a new ticket, then run klist to confirm that you have a valid ticket:
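For example, from a terminal on the machine running Studio (the principal and realm below are placeholders for your own):

```
# Obtain a new Kerberos ticket (replace with your principal and realm)
kinit myuser@EXAMPLE.COM

# Confirm that a valid ticket is now cached
klist
```

klist should list a ticket cache entry for your principal with a valid expiration time before you continue.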



  2. Configure the Hadoop Cluster connection in metadata in your Studio.

    1. Right-click Hadoop Cluster then click Create Hadoop Cluster.

    2. Select the distribution and version of your Hadoop cluster, then select Retrieve configuration from Ambari or Cloudera.



    3. Enter your Ambari URL along with your user credentials and click Next.

    4. Cluster information will be retrieved and populated.



    5. After the information is populated, Studio displays a warning that the resource manager is invalid. A closer look shows that the resource manager's port is missing. This is because, in the Hortonworks configuration files used in HA mode, the resource managers are referenced by hostname only; no port is specified.

    6. When Resource Manager HA is enabled in Hortonworks, the default port changes from 8050 (for a single resource manager) to 8032. Enter 8032 as the resource manager port. Click Check Services to ensure that your Studio can connect successfully to the cluster.
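For reference, this is roughly what the Resource Manager HA entries look like in the cluster's yarn-site.xml; the rm-ids and host names below are placeholders. Note that the hostname properties carry no port, which is why Studio cannot infer one:

```xml
<!-- Sketch of yarn-site.xml with Resource Manager HA enabled;
     IDs and host names are placeholders. -->
<property>
  <name>yarn.resourcemanager.ha.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.rm-ids</name>
  <value>rm1,rm2</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm1</name>
  <value>master1.example.com</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm2</name>
  <value>master2.example.com</value>
</property>
```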





Build Job

  1. When you use High Availability for HiveServer2, multiple instances of HiveServer2 are registered with ZooKeeper. The cluster does this so that ZooKeeper can direct a client request to a HiveServer2 instance that is running at that point in time, or load balance among the running HiveServer2 instances based on workload. As a result, the client (in this case Talend) needs to go through ZooKeeper to find the HiveServer2 instance to use, based on availability and workload.
  2. Because of this, you have to design your Hive connections in a particular way to take advantage of the High Availability provided.
  3. Start by creating a Job that creates a file in HDFS, loads that file in Hive, and then reads it.
  4. Right-click Job Designs, then Create Standard Job, and give it a name.
  5. In the canvas, add a tPreJob component and attach to it the HDFS connection and Hive connection (linked by an On Component Ok connection) that you will use throughout your Job.
  6. For the tHDFSConnection, drag the HDFS connection from the Hadoop Cluster metadata created above onto the canvas, and choose tHDFSConnection as the component to add.
  7. Manually add a tHiveConnection and connect it to the tHDFSConnection component using an On Component Ok connection.
  8. What you will now mimic in Talend is the JDBC URL that your cluster provides for addressing HiveServer2 High Availability. Here is an example of how that JDBC URL looks:
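A HiveServer2 HA JDBC URL typically has the following shape; the ZooKeeper host names, port, and namespace below are placeholder values, so substitute your own:

```
jdbc:hive2://zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2
```

The host list is the ZooKeeper quorum, and the serviceDiscoveryMode and zooKeeperNamespace parameters tell the driver to resolve an available HiveServer2 instance through ZooKeeper.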


    Based on the JDBC URL above, here is how the tHiveConnection component should be configured:



  9. As you can see above, the host is the ZooKeeper quorum, the port is the ZooKeeper port, and the Additional JDBC Settings field tells the JDBC connection that ZooKeeper will be used to discover the HiveServer2 instance; it also includes the ZooKeeper namespace with which the HiveServer2 instances are registered.

    At this point, your Job should look like this:



  10. Add a tRowGenerator that will use the Talend Data Generator functions to generate 10 rows of data in two columns: one named firstname and the other named lastname:



  11. Have the tRowGenerator write the data directly to HDFS through a tHDFSOutput component that uses the tHDFSConnection you created above, connecting the two with a Main row:



  12. Connect the tHDFSOutput to your tHiveCreateTable component with an On Component Ok connection. tHiveCreateTable will create your Hive Table using the Hive connection you set up in the tHiveConnection:



  13. Connect this component to tHiveLoad with an On Component Ok connection. This will load the file you wrote to HDFS into the Hive table:



  14. Connect tHiveLoad to a tHiveInput component with an On Component Ok connection. This will read the information from the table above and output it to a tLogRow:



  15. The last part of your design of the Job is to connect a tPostJob component to a tHiveClose component with an On Component Ok connection, so that you can close the connection you opened:



  16. Your completed Job should look like this:
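As a cross-check, the Hive steps in this Job (tHiveCreateTable, tHiveLoad, tHiveInput) correspond roughly to the following HiveQL; the table name, field delimiter, and HDFS path are assumptions for illustration:

```sql
-- Sketch only: table name, delimiter, and HDFS path are placeholders.
-- tHiveCreateTable: create the table for the generated data.
CREATE TABLE IF NOT EXISTS names (firstname STRING, lastname STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ';';

-- tHiveLoad: load the file written by tHDFSOutput into the table.
LOAD DATA INPATH '/user/talend/names.csv' INTO TABLE names;

-- tHiveInput -> tLogRow: read the rows back.
SELECT firstname, lastname FROM names;
```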




Run the Job

  1. Run your Job to verify that you connected to the active HiveServer2 instance via ZooKeeper, and that you are able to create the table, load data into it, and read from it:





Version history
Revision: 9 of 9
Last update: 04-13-2019 12:35 PM