Spark Streaming with Kerberized Kafka

This KB article explains how to set up Talend Studio using Spark 1.6 to work with Kerberized Kafka, which is supported by HortonWorks 2.4 and later. The example Job reads from a Kafka topic and outputs to a tLogRow component.

 

Environment

  • Talend Studio 6.3.1
  • HortonWorks 2.5.3

 

Prerequisites

  1. Set up your system for Kerberos tickets.
    1. Install the MIT Kerberos Client.

    2. Apply the JCE Unlimited Strength Jurisdiction Policy Files to your Java install.

    3. Ensure that you have a local copy of the krb5.conf file used by your cluster, placed in the following locations:

      • Linux: /etc/krb5.conf
      • Windows: C:\Windows\krb5.ini, %JAVA_HOME%\jre\lib\security\krb5.ini, C:\ProgramData\MIT\Kerberos5\krb5.ini

        Note: On Windows, the krb5 file is named krb5.ini, not krb5.conf as on Linux. If the file is not placed with the correct name, Studio won’t be able to use it.

    4. Ensure that your Talend Studio has access to your KDC server and that it can resolve the KDC from your DNS.

    5. Perform a kinit to get a new ticket, and do a klist to confirm you have a valid ticket:

      kinit.png
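
      For example, assuming a principal such as myuser@EXAMPLE.COM (a placeholder; your principal and realm will differ), the commands are:

      kinit myuser@EXAMPLE.COM
      klist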

       

  2. Ensure that your Studio has access to all the cluster nodes and that they can reach back to your Studio (as described in the Spark security documentation), since Talend uses the YARN-client mode, in which the Spark driver is spun up at the same location from which the Job is run.

  3. Configure the Hadoop Cluster connection in metadata in your Studio.

    1. Right-click Hadoop Cluster, then click Create Hadoop Cluster.

    2. Select the distribution and version of your Hadoop cluster, then select Retrieve configuration from Ambari or Cloudera.

      import_option.png

       

    3. Enter your Ambari URL along with your user credentials and click Next.

    4. Cluster information will be retrieved and populated.

      cluster_conn.png

       

    5. You will notice that, after the information is populated, a warning appears saying that the resource manager is invalid. A closer look shows that the port is missing for the resource manager. This is because, in the Hortonworks configuration files used in HA mode, the resource managers are referenced by hostname only, and no port is specified.

    6. When Resource Manager HA is used in Hortonworks, the default port changes from 8050 (for a single resource manager) to 8032. Enter 8032 as the port for the resource manager, as shown below. Click Check Services to ensure that your Studio can connect successfully to the cluster.

      cluster_conn_repo.png

      check_svcs.png

       

    7. Before you start setting up the Job, ensure that you have one jaas file for the Spark driver and one for the Spark executors, each pointing to the keytab to use.

    8. The jaas file to be used by the executors must look like this:

      jaas.png

      Notice that the path to the keytab for the executor is a relative path. That is because Spark will ship the keytab to the executor containers along with the jaas file, so both files will reside in the same location.
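
      A minimal sketch of what such a jaas file contains, assuming a keytab named kafka.keytab and a principal myuser@EXAMPLE.COM (both placeholders; substitute your own values):

      KafkaClient {
          com.sun.security.auth.module.Krb5LoginModule required
          useKeyTab=true
          storeKey=true
          useTicketCache=false
          keyTab="./kafka.keytab"
          principal="myuser@EXAMPLE.COM"
          serviceName="kafka";
      };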

    9. For the Spark driver, the jaas file will look like this:

      jaas_spark.png

      Notice that the path used for the keytab is local to your Studio, as the Spark Driver will be run on your local system in this scenario.
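
      A corresponding sketch for the driver, assuming the keytab sits on the Studio machine at C:/kerberos/kafka.keytab (a placeholder path):

      KafkaClient {
          com.sun.security.auth.module.Krb5LoginModule required
          useKeyTab=true
          storeKey=true
          useTicketCache=false
          keyTab="C:/kerberos/kafka.keytab"
          principal="myuser@EXAMPLE.COM"
          serviceName="kafka";
      };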

 

Build Job

  1. Right-click Job Designs, click Create Big Data Streaming Job, and give it a name.

  2. From the Hadoop Cluster connection created above, drag the HDFS connection to the canvas and add it as a tHDFSConfiguration component. Notice that this also populates the Run tab > Spark configuration for you, so that the Job knows how to communicate with Spark.

    kafka_stream.png

     

  3. Drag a tKafkaInput component from the palette to the canvas.
  4. For the configuration of your Kafka component, add the following:

    tkafkainput_1.png

     

  5. Notice that, for the broker, the port is 6667 and not 9092, which is the usual default value. You can confirm the correct port by looking at your Kafka configuration file in /usr/hdp/current/kafka-broker/conf/server.properties. The property to look for is port=6667.
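
    For example, on a broker node you can check the port with:

    grep ^port /usr/hdp/current/kafka-broker/conf/server.properties

    which should return port=6667.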

  6. For this test case, select From beginning so that the Job reads from the beginning of the topic and you can check all the messages in it.

  7. In the Advanced settings of the tKafkaInput component, add the following:

    tkafkainput_1_adv.png

     

    This property lets the broker know that you are making a Kerberized connection rather than a PLAINTEXT one. If you don’t specify it, Kafka will not allow the connection.
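
    As an assumption (confirm against the screenshot above and your cluster configuration), with the Kafka consumer used by Spark 1.6 on HDP this is typically the security.protocol property, for example:

    security.protocol    "PLAINTEXTSASL"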

  8. Add a tLogRow component to your canvas to capture the messages from Kafka and output them to the console.

  9. Go to your Run tab > Spark configuration and, in the Advanced Properties section, add the following Spark properties (a consolidated sketch follows the list below):

    spark_adv_prop.png

     

    • The first property is spark.files. It tells Spark to send the jaas file you created above, together with the keytab you are going to use, to the Spark cache and distribute them to the Spark executors. The example specifies that the Job should find them in the location from which it is run. Give the jaas files used by your executors and your Spark driver distinct names so that the wrong file is not passed; in this case, the executor file is named kafka_client_jaas.conf and the driver file is named kafka_driver_jaas.conf.
    • The second property is spark.executor.extraJavaOptions. It tells the Spark executors to use jaas for Kerberos authentication to Kafka, and where to find the jaas file.
    • The third property is spark.driver.extraJavaOptions. It tells the Spark driver that it also needs to use jaas for Kerberos authentication, where to find its jaas file, and which HortonWorks version (hdp.version) to use at the cluster level.
    • The fourth property is spark.yarn.am.extraJavaOptions. It passes the same string of JVM options as the previous property; according to the Spark documentation, it must also be set because YARN-client mode is used, to ensure that your JVM options are passed to the YARN application master.
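
    Taken together, and assuming a keytab named kafka.keytab, a driver jaas file at /path/to/kafka_driver_jaas.conf, and an hdp.version of 2.5.3.0-37 (all placeholder values; substitute your own), the four properties might look like this:

    spark.files                        "kafka_client_jaas.conf,kafka.keytab"
    spark.executor.extraJavaOptions    "-Djava.security.auth.login.config=./kafka_client_jaas.conf"
    spark.driver.extraJavaOptions      "-Djava.security.auth.login.config=/path/to/kafka_driver_jaas.conf -Dhdp.version=2.5.3.0-37"
    spark.yarn.am.extraJavaOptions     "-Djava.security.auth.login.config=/path/to/kafka_driver_jaas.conf -Dhdp.version=2.5.3.0-37"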
  10. The last thing you need to do before running the Job is to add the following JVM arguments in the Run tab > Advanced settings:

    jvm_arg.png

     

    The first option specifies the cluster version that you want to use, and the second is required by the Spark driver; setting them here ensures that they are passed to the Job JVM and are available at runtime.
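
    As a sketch, using the same placeholder values as above:

    -Dhdp.version=2.5.3.0-37
    -Djava.security.auth.login.config=/path/to/kafka_driver_jaas.conf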

    hdpha.png

     

 

Run Job

  1. Run the Job and check that your Kafka consumer authenticates with JAAS and that the messages come through.

    runjob.png

 

Additional Notes

  1. The same settings used above for the Kafka consumer will also work for the Kafka producer.
  2. This setup can only be used with HortonWorks 2.4 and later versions, which ship Spark 1.6 with the Spark support for Kerberized Kafka backported.
  3. For Cloudera, this setup won’t work, as official support for Kerberized Kafka is only available with Spark 2.1 and the Kafka 0.10 connector. In this case, the recommended approach is to upgrade to Talend 6.4.1, which officially supports Kerberized Kafka with Spark 2.1.
  4. After upgrading to Studio 6.4, you no longer need to maintain two different jaas files. You only need to provide the jaas file used by the Spark driver in the Kerberos authentication section of the new tKafkaConfiguration component, and Studio takes care of the rest. In addition, in Spark Configuration > Advanced Properties, you only need the Spark properties that specify hdp.version; the properties used to copy the jaas file and keytab, and to specify the location of the jaas file, are no longer required.