This KB article explains how to configure Talend Studio, using Spark 1.6, to work with Kerberized Kafka as supported by Hortonworks 2.4 and later. The example Job reads from a Kafka topic and outputs the messages to a tLogRow component.
Install the MIT Kerberos Client.
Apply the JCE Unlimited Strength Jurisdiction Policy Files to your Java install.
Ensure that you have a copy of the krb5.conf file used by your cluster locally, and placed in the following locations:
Windows: C:\Windows\krb5.ini, %JAVA_HOME%\jre\lib\security\krb5.ini, C:\ProgramData\MIT\Kerberos5\krb5.ini
Note: On Windows, the file is named krb5.ini, not krb5.conf as on Linux. If the file is not placed with the correct name, Studio will not be able to use it.
Ensure that your Talend Studio has access to your KDC server and that it can resolve the KDC from your DNS.
Perform a kinit to get a new ticket, and do a klist to confirm you have a valid ticket:
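For example, from a command prompt (the principal and realm below are placeholders for your own values):

```
kinit mykafkauser@EXAMPLE.COM
klist
```

The klist output should show a valid ticket-granting ticket for your principal, with an expiration time in the future.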
Ensure that your Studio has access to all the cluster nodes, and that the nodes can reach back to your Studio (see the Spark security documentation). This is required because Talend uses the YARN-client mode, in which the Spark driver runs on the machine from which the Job is launched.
Configure the Hadoop Cluster connection in metadata in your Studio.
Right-click Hadoop Cluster then click Create Hadoop Cluster.
Select the distribution and version of your Hadoop cluster, then select Retrieve configuration from Ambari or Cloudera.
Enter your Ambari URL along with your user credentials and click Next.
Cluster information will be retrieved and populated.
After the information is populated, you will see a warning that the Resource Manager address is invalid. If you take a closer look, you will see that the port is missing for the Resource Manager. This is because, when Resource Manager HA is enabled, the Hortonworks configuration files reference the Resource Managers by hostname only, without a port.
When Resource Manager HA is enabled in Hortonworks, the port changes from the single-instance default of 8050 to 8032. Enter 8032 as the Resource Manager port, then click Check services to verify that your Studio can connect to the cluster.
Before you start setting up the Job, create two JAAS files that point to the keytab to use: one for the Spark driver and one for the Spark executors.
Notice that the path to the keytab in the executor JAAS file is a relative path. This is because Spark will ship the keytab to the executor containers along with the JAAS file, so both will reside in the same location.
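A minimal executor JAAS file might look like the following sketch; the keytab file name, principal, and realm are placeholders for your own values:

```
KafkaClient {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  storeKey=true
  keyTab="./mykafkauser.keytab"
  principal="mykafkauser@EXAMPLE.COM"
  serviceName="kafka";
};
```

Note the relative keytab path ("./"), which resolves inside the executor container's working directory once Spark has shipped the files there.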
For the Spark driver, the JAAS file will look like this:
Notice that the path used for the keytab is local to your Studio, as the Spark Driver will be run on your local system in this scenario.
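For example, on a Windows machine running Studio (the path, keytab file name, principal, and realm are placeholders):

```
KafkaClient {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  storeKey=true
  keyTab="C:/kerberos/mykafkauser.keytab"
  principal="mykafkauser@EXAMPLE.COM"
  serviceName="kafka";
};
```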
Right-click Job Designs, click Create Big Data Streaming Job, and give it a name.
From the Hadoop Cluster connection created above, drag the HDFS connection to the canvas, then add a tHDFSConfiguration component. Notice that it also populates the Run tab > Spark configuration information for you, so that the Job knows how to communicate with Spark.
For the configuration of your Kafka component, add the following:
Notice that, for the broker, the port is 6667 and not 9092, which is the usual default value. You can confirm the correct port by looking at your Kafka configuration file in /usr/hdp/current/kafka-broker/conf/server.properties. The property to look for is port=6667.
Select From beginning so that, for this test case, the Job reads from the start of the topic and you can check all the messages in it.
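As an illustration, the basic settings of the Kafka component might look like this; the broker hostname and topic name are placeholders for your own values:

```
Broker list : hdpnode01.example.com:6667
Topic name  : "mytopic"
Offset      : From beginning
```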
In the Advanced settings of the tKafkaInput component, add the following:
This property lets the broker know that you are doing a kerberized, not a PLAINTEXT, connection. If you don’t specify it, Kafka will not allow the connection.
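The property in question is the Kafka consumer security.protocol setting. In the tKafkaInput Advanced settings it would be entered as a key/value pair like this (quoting shown as it would appear in a Studio properties table):

```
"security.protocol"    "SASL_PLAINTEXT"
```

SASL_PLAINTEXT tells the Kafka 0.9 client to authenticate via SASL/GSSAPI (Kerberos) over a non-TLS connection, rather than using the default PLAINTEXT protocol.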
Add a tlogrow output to your canvas to capture the messages from Kafka and output them to the screen.
Go to your Run tab > Spark configuration, and in the Advanced Properties section, add the following Spark properties:
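A plausible set of properties, assuming the executor JAAS file and keytab described earlier (the file names and paths are placeholders), ships both files to the executor containers and points the executor JVMs at the JAAS file:

```
spark.yarn.dist.files            C:/kerberos/kafka_client_jaas.conf,C:/kerberos/mykafkauser.keytab
spark.executor.extraJavaOptions  -Djava.security.auth.login.config=./kafka_client_jaas.conf
```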
The last thing you need to do before running the Job is to add the following JVM arguments in the Run tab > Advanced settings:
The first option specifies the HDP cluster version to use, and the second one points the Spark driver at the driver JAAS file, ensuring it is passed to the Job JVM and available for use.
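For example (the HDP version string and the JAAS file path are placeholders for your own values):

```
-Dhdp.version=2.4.0.0-169
-Djava.security.auth.login.config=C:/kerberos/driver_jaas.conf
```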
Run the Job, and verify that the Kafka consumer authenticates through JAAS and that the messages from the topic appear in the console.