Spark Streaming with Kerberized Kafka

This KB article explains how to set up Talend Studio, using Spark 1.6, to work with kerberized Kafka as supported by Hortonworks 2.4 and later. The example Job reads from a Kafka topic and outputs the messages to a tLogRow component.

 

Environment

  • Talend Studio 6.3.1
  • Hortonworks 2.5.3

 

Prerequisites

  1. Set up your system for Kerberos tickets.
    1. Install the MIT Kerberos Client.

    2. Apply the JCE Unlimited Strength Jurisdiction Policy Files to your Java install.

    3. Ensure that you have a copy of the krb5.conf file used by your cluster placed locally in the following locations:

      • Linux: /etc/krb5.conf
      • Windows: C:\Windows\krb5.ini, %JAVA_HOME%\jre\lib\security\krb5.ini, C:\ProgramData\MIT\Kerberos5\krb5.ini

        Note: On Windows, the file is named krb5.ini, not krb5.conf as on Linux. If the file is not placed with the correct name, the Studio will not be able to use it.
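        A minimal krb5.ini/krb5.conf is sketched below for illustration; the realm EXAMPLE.COM and the KDC host name are placeholders and must match your cluster's actual Kerberos configuration:

          [libdefaults]
            default_realm = EXAMPLE.COM

          [realms]
            EXAMPLE.COM = {
              kdc = kdc.example.com
              admin_server = kdc.example.com
            }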

    4. Ensure that your Talend Studio has access to your KDC server and that it can resolve the KDC from your DNS.

    5. Perform a kinit to get a new ticket, and do a klist to confirm you have a valid ticket:

      kinit.png
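      For illustration, the commands are along these lines (the principal below is a placeholder for your own Kerberos principal):

        kinit myuser@EXAMPLE.COM
        klist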

       

  2. Ensure that your Studio has access to all the cluster nodes and that they can reach back to your Studio, as described in the Spark security documentation. This is required because Talend uses the YARN-client mode, in which the Spark driver is spun up at the same location from which the Job is run.

  3. Configure the Hadoop Cluster connection in metadata in your Studio.

    1. Right-click Hadoop Cluster, then click Create Hadoop Cluster.

    2. Select the distribution and version of your Hadoop cluster, then select Retrieve configuration from Ambari or Cloudera.

      import_option.png

       

    3. Enter your Ambari URL along with your user credentials and click Next.

    4. Cluster information will be retrieved and populated.

      cluster_conn.png

       

    5. You will notice that, after the information is populated, a warning appears stating that the resource manager is invalid. If you take a closer look, you will see that the port is missing for the resource manager. This is because, in the Hortonworks config files, the resource managers used in HA mode are referenced only by hostname; no port is specified.

    6. When Resource Manager HA is used in Hortonworks, the default port changes from 8050 (used with a single resource manager) to 8032. Enter 8032 as the port for the resource manager, as shown below. Click Check services to ensure that your Studio can connect to the cluster successfully.

      cluster_conn_repo.png

      check_svcs.png

       

    7. Before you start setting up the Job, ensure that you have two jaas files, one for the Spark driver and one for the Spark executors, each pointing to the keytab to use.

    8. The jaas file to be used by the executors must look like this:

      jaas.png

      Notice that the path to the keytab for the executors is a relative path. That is because you are going to have Spark ship the keytab to the executor containers along with the jaas file, so both files reside in the same location.
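      As a sketch, the executor jaas file (named kafka_client_jaas.conf later in this article) is roughly along these lines; the keytab file name and the principal are placeholders:

        KafkaClient {
          com.sun.security.auth.module.Krb5LoginModule required
          useKeyTab=true
          storeKey=true
          keyTab="./kafka.keytab"
          principal="myuser@EXAMPLE.COM";
        };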

    9. For the Spark driver, the jaas file will look like this:

      jaas_spark.png

      Notice that the path used for the keytab is local to your Studio, as the Spark driver runs on your local system in this scenario.
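      A corresponding sketch of the driver jaas file (kafka_driver_jaas.conf) follows; the absolute keytab path and the principal are placeholders for the actual location on your Studio machine:

        KafkaClient {
          com.sun.security.auth.module.Krb5LoginModule required
          useKeyTab=true
          storeKey=true
          keyTab="C:/keytabs/kafka.keytab"
          principal="myuser@EXAMPLE.COM";
        };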

 

Build Job

  1. Right-click Job Designs, click Create Big Data Streaming Job, and give it a name.

  2. From the Hadoop Cluster connection created above, drag the HDFS connection to the canvas and add it as a tHDFSConfiguration component. Notice that this also populates the Run tab > Spark configuration information for you, so that the Job knows how to communicate with Spark.

    kafka_stream.png

     

  3. Drag a tKafkaInput component from the Palette to the canvas.
  4. For the configuration of your Kafka component, add the following:

    tkafkainput_1.png
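    The exact values depend on your cluster; as an illustrative sketch, the key settings are along these lines (the broker host and topic name are placeholders):

      Broker list : kafkahost01.example.com:6667
      Topic name  : mytopic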

     

  5. Notice that, for the broker, the port is 6667 and not 9092, the usual default value. You can confirm the correct port by looking at your Kafka configuration file in /usr/hdp/current/kafka-broker/conf/server.properties; the property to look for is port=6667.
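    For example, on a broker node you can check the configured port with:

      grep ^port /usr/hdp/current/kafka-broker/conf/server.properties
      port=6667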

  6. For this test case, select From beginning so that all the messages in the Kafka topic are read from the start.

  7. In the Advanced settings of the tKafkaInput component, add the following:

    tkafkainput_1_adv.png

     

    This property tells the broker that you are making a kerberized connection rather than a PLAINTEXT one. If you do not specify it, Kafka will not allow the connection.
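    The property shown above is most likely the security protocol, added as a Kafka consumer property, for example:

      "security.protocol"    "PLAINTEXTSASL"

    On some Kafka/HDP versions the expected value is "SASL_PLAINTEXT" instead; use the value configured on your brokers.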

  8. Add a tLogRow component to your canvas to capture the messages from Kafka and output them to the console.

  9. Go to your Run tab > Spark configuration, and in the Advanced Properties section, add the following Spark properties:

    spark_adv_prop.png

     

    • The first property is spark.files. It tells Spark to send the jaas file you created above, together with the keytab you are going to use, to the Spark cache and distribute them to the Spark executors. The example specifies that the Job should find them in the location from which it is run. Ensure that you have given unique names to the jaas files used by the executors and by the Spark driver, so that you don't pass the wrong jaas file. In this case, the one for the executors is named kafka_client_jaas.conf and the one for the driver is named kafka_driver_jaas.conf.
    • The second property is spark.executor.extraJavaOptions. It tells the Spark executors to use jaas for the Kerberos authentication to Kafka and gives the location of the jaas file.
    • The third property is spark.driver.extraJavaOptions. It tells the Spark driver to use jaas for the Kerberos authentication as well, and specifies the location of the jaas file to use and the Hortonworks version (hdp.version) to use at the cluster level.
    • The fourth property is spark.yarn.am.extraJavaOptions. It passes the same string of JVM options as the property above; according to the Spark documentation, it must also be set when YARN-client mode is used to ensure that your JVM properties are passed.
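    Put together, and assuming the jaas file names above, a keytab named kafka.keytab, a placeholder jaas location on the Studio machine, and an illustrative HDP build number, the four properties look roughly like this:

      spark.files                      ./kafka_client_jaas.conf,./kafka.keytab
      spark.executor.extraJavaOptions  -Djava.security.auth.login.config=./kafka_client_jaas.conf
      spark.driver.extraJavaOptions    -Djava.security.auth.login.config=C:/keytabs/kafka_driver_jaas.conf -Dhdp.version=2.5.3.0-37
      spark.yarn.am.extraJavaOptions   -Djava.security.auth.login.config=C:/keytabs/kafka_driver_jaas.conf -Dhdp.version=2.5.3.0-37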
  10. The last thing you need to do before running the Job is to add the following JVM arguments in the Run tab > Advanced settings:

    jvm_arg.png

     

    The first argument specifies the cluster version that you want to use, and the second one is required by the Spark driver, which runs within the Job JVM, so that these properties are available to it at run time.
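    Assuming the same placeholder jaas location and HDP build number as above, the JVM arguments look roughly like this:

      -Dhdp.version=2.5.3.0-37
      -Djava.security.auth.login.config=C:/keytabs/kafka_driver_jaas.conf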

    hdpha.png

     

 

Run Job

  1. Run the Job, and check that your Kafka consumer authenticates with JAAS and that the messages come through.

    runjob.png

 

Additional Notes

  1. The same settings used above for the Kafka consumer will also work for the Kafka producer.
  2. This setup can only be used with Hortonworks 2.4 and later versions, where Spark 1.6 includes the backported Spark feature that supports kerberized Kafka.
  3. For Cloudera, this setup won't work, as official support for kerberized Kafka is only available on Spark 2.1 with the Kafka 0.10 connector. In this case, the recommended approach is to upgrade to Talend 6.4.1, which officially supports kerberized Kafka and Spark 2.1.
  4. After upgrading to Studio 6.4, you no longer need to maintain two different jaas files. You only need to provide the jaas file used for the Spark driver in the Kerberos authentication section of the new tKafkaConfiguration component, and the Studio takes care of the rest. In addition, in Spark Configuration > Advanced Properties, you only need the Spark properties that specify the hdp.version; the properties used to copy the jaas file and keytab, and to specify the location of the jaas file, are no longer needed.
 
Comments
Four Stars

Hi,

 

Thanks for the elaborate explanation, but I have some doubts.

First, for YARN-client mode: if we mention ./kafka.keytab in the jaas.conf file, the job will look for the keytab in the current working directory on the local machine from which we triggered the job (the driver scenario), while the executors will get the files from their relative path. If we mention <somepath>/kafka.keytab, the job will copy the keytab file and the jaas file for the executors, but whenever an executor tries to authenticate it will fail, as the keytab has to be present at the relative path for the executor. So for the driver in YARN-client mode, the job can find the keytab at the <somepath>/kafka.keytab location but the executors will fail, and if we mention ./kafka.keytab instead, the driver will fail because the file is not present in the current directory. The job also tries to copy the jaas file, truststore, and keystore files to the current working directory in YARN-client mode.

For YARN-cluster mode: if we mention <somepath>/kafka.keytab in the jaas.conf file, the job will copy the jaas.conf file and the keytab file to the staging HDFS location, but the executors will fail, as the keytab has to be present at the relative path. I then changed the path to ./kafka.keytab in the jaas.conf file, but in this scenario the job failed because the keytab file was not present in the current working directory from which I triggered the job.

So what would be the solution in this scenario for YARN-cluster mode?

Community Manager

Hi @T3MMnA,

 

Thanks for your patience. I have asked the author, @pnomikos, for assistance with your question, and he'll reply as soon as he can.

Alyce

Employee

Hi @T3MMnA ,

 

 

First, as you can see in my post, I mention that before version 6.4.1 you have to use two different jaas files: one for the executors and one for the Spark driver. The jaas file used by the Spark driver references the full path to the keytab on the local filesystem, while the jaas file for the executors uses the relative path. As you can see, I also named my two JAAS files differently so that I know which one is which and can set them correctly. Again, this is before 6.4.1; as of 6.4.1, our R&D has implemented in the code the upload of the jaas file and keytab without you needing to use those properties, and you only need one jaas file with the full path to the keytab on the local filesystem, as the Talend code of the job takes care of creating the one for the executors.

 

Now, a big difference when you use YARN-cluster mode is that the driver no longer runs in your JobServer or Studio (depending on where you started the job from); it is shipped to the cluster and runs in a YARN container. So the full path to the keytab in the JAAS file for the driver no longer makes sense, because for the driver it now needs to be a relative path, just like for the executors, since it is running in a YARN container. So after Talend 6.4.1 (and since YARN-cluster mode for Talend jobs is primarily found in 7.0), you still leave the JAAS file with the full path to the keytab so that the job can send it to the cluster, and the code within the job then takes care of rewriting the path to a relative one for the executors; it should also do so for the Spark driver, as it now has to be a relative path. If you still get errors, I recommend opening a ticket with our support team so that we can investigate, as for YARN-cluster mode after Talend 6.4.1 the job might not rewrite the keytab path in the JAAS file for the Spark driver to a relative path, because when that functionality was added we still supported YARN-client mode for most distros, which meant the rewrite was only needed for the executors.