This article explains how to configure your Hive components to use HiveServer2 High Availability instead of connecting to a single HiveServer2 instance.
Install the MIT Kerberos Client.
Apply the JCE Unlimited Strength Jurisdiction Policy Files to your Java install.
Ensure that you have a local copy of the krb5.conf file used by your cluster, placed in the following locations:
Windows: C:\Windows\krb5.ini, %JAVA_HOME%\jre\lib\security\krb5.ini, C:\ProgramData\MIT\Kerberos5\krb5.ini
Note: On Windows, the krb5 file is named krb5.ini, not krb5.conf as on Linux. If the file is not given the correct name, the Studio will not be able to use it.
Ensure that your Talend Studio has access to your KDC server and that it can resolve the KDC from your DNS.
Perform a kinit to get a new ticket, and do a klist to confirm you have a valid ticket:
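For example (the principal is a placeholder; substitute your own Kerberos user):

```
kinit myuser@EXAMPLE.COM
klist
```

klist should list a valid ticket-granting ticket for your principal before you continue.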
Configure the Hadoop Cluster connection in metadata in your Studio.
Right-click Hadoop Cluster then click Create Hadoop Cluster.
Select the distribution and version of your Hadoop cluster, then select Retrieve configuration from Ambari or Cloudera.
Enter your Ambari URL along with your user credentials and click Next.
Cluster information will be retrieved and populated.
After the information is populated, the Studio warns that the resource manager is invalid. A closer look shows that the port is missing for the resource manager. This is because, in the Hortonworks configuration files, resource managers running in HA mode are referenced by hostname only; no port is specified.
When Resource Manager HA is used in Hortonworks, the default port changes from 8050 (single resource manager) to 8032. Enter 8032 as the port for the resource manager. Click Check services to ensure that your Studio can connect successfully to the cluster.
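The missing port can be traced back to the cluster's yarn-site.xml: with Resource Manager HA enabled, each resource manager is referenced by hostname only, along the lines of this illustrative fragment (hostnames are placeholders):

```xml
<!-- Illustrative yarn-site.xml fragment with Resource Manager HA enabled.
     Note that only hostnames are given; no ports appear. -->
<property>
  <name>yarn.resourcemanager.ha.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.rm-ids</name>
  <value>rm1,rm2</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm1</name>
  <value>master1.example.com</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm2</name>
  <value>master2.example.com</value>
</property>
```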
You will now mimic in Talend the JDBC URL that your cluster provides as the way to address the HiveServer2 High Availability. Here is an example of how that JDBC URL looks:
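A representative URL of this form (the ZooKeeper hostnames, port, and namespace are placeholders; substitute your cluster's own values):

```
jdbc:hive2://zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2
```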
Based on the JDBC URL above, here is how the tHiveConnection component should be configured:
As you can see above, the host is the ZooKeeper quorum and the port is the ZooKeeper port. The Additional JDBC Settings field tells the JDBC connection to use ZooKeeper for discovering the HiveServer2 instances, and includes the ZooKeeper namespace under which the HiveServer2 instances are registered.
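To see how those pieces fit together, the small helper below assembles the same discovery URL from the host (ZooKeeper quorum), port, and namespace values configured in tHiveConnection. The class, method, and hostnames are illustrative, not part of any Talend API:

```java
// Sketch: build a HiveServer2 HA JDBC URL from the same pieces configured
// in tHiveConnection. Hostnames and the namespace are placeholders.
public class HiveHaUrl {
    static String buildUrl(String zkQuorum, int zkPort, String namespace) {
        // Append the ZooKeeper client port to each host in the quorum.
        StringBuilder hosts = new StringBuilder();
        for (String host : zkQuorum.split(",")) {
            if (hosts.length() > 0) hosts.append(',');
            hosts.append(host.trim()).append(':').append(zkPort);
        }
        return "jdbc:hive2://" + hosts
                + "/;serviceDiscoveryMode=zooKeeper"
                + ";zooKeeperNamespace=" + namespace;
    }

    public static void main(String[] args) {
        System.out.println(buildUrl(
                "zk1.example.com, zk2.example.com, zk3.example.com",
                2181, "hiveserver2"));
    }
}
```

Because service discovery happens through ZooKeeper, the URL never names an individual HiveServer2 host; the driver asks the quorum which instance is active at connect time.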
At this point, your Job should look like this:
Add a tRowGenerator that will use the Talend Data Generator functions to generate 10 rows of data in two columns: one named firstname and the other named lastname:
Have the tRowGenerator write the data directly to HDFS using a tHDFSOutput component that uses the tHDFSConnection you created above, connecting the two with a Main row:
Connect the tHDFSOutput to your tHiveCreateTable component with an On Component Ok connection. tHiveCreateTable will create your Hive Table using the Hive connection you set up in the tHiveConnection:
Connect this component to tHiveLoad with an On Component Ok connection. This will load the file you wrote to HDFS to the Hive Table:
Connect tHiveLoad to a tHiveInput component with an On Component Ok connection. This will read the information from the table above and output it to a tLogRow:
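The three Hive steps above correspond roughly to the following HiveQL; the table name, field delimiter, and HDFS path are illustrative and depend on how you configure the components:

```
-- Illustrative equivalents of tHiveCreateTable, tHiveLoad, and tHiveInput.
-- Table name and HDFS path are placeholders.
CREATE TABLE IF NOT EXISTS people (
  firstname STRING,
  lastname  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ';';

LOAD DATA INPATH '/user/talend/people.csv' INTO TABLE people;

SELECT firstname, lastname FROM people;
```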
The last part of your design of the Job is to connect a tPostJob component to a tHiveClose component with an On Component Ok connection, so that you can close the connection you opened:
Your completed Job should look like this:
Run your Job to verify that you successfully connected to the active HiveServer2 instance via ZooKeeper, and that you are able to create the table, load data into it, and read from it: