This article discusses how to set up Spark jobs in Talend Studio to utilize Dynamic Context.
Install the MIT Kerberos Client.
Apply the JCE Unlimited Strength Jurisdiction Policy Files to your Java install.
Ensure that you have a local copy of the krb5.conf file used by your cluster, placed in the following locations:
Windows: C:\Windows\krb5.ini, %JAVA_HOME%\jre\lib\security\krb5.ini, C:\ProgramData\MIT\Kerberos5\krb5.ini
Note: On Windows, the name of the krb5 file is krb5.ini and not krb5.conf like in Linux. If not placed with the correct name, Studio won’t be able to use the file.
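For illustration, a minimal krb5.ini has this shape (the realm and KDC hostnames below are placeholders; always use the actual file from your cluster rather than writing one by hand):

```ini
[libdefaults]
    default_realm = EXAMPLE.COM

[realms]
    EXAMPLE.COM = {
        kdc = kdc.example.com
        admin_server = kdc.example.com
    }

[domain_realm]
    .example.com = EXAMPLE.COM
```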
Ensure that your Talend Studio has access to your KDC server and that it can resolve the KDC from your DNS.
Perform a kinit to get a new ticket, and do a klist to confirm you have a valid ticket:
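The exact commands depend on your principal and realm; with a hypothetical principal myuser@EXAMPLE.COM, they look like this:

```shell
# Obtain a new Kerberos ticket for your principal (you will be prompted for the password)
kinit myuser@EXAMPLE.COM

# List the ticket cache to confirm a valid TGT is present
klist
```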
Ensure that your Studio has access to all the cluster nodes and that they can reach back to your Studio, consistent with the Spark security documentation. Talend uses the YARN-client mode, in which the Spark driver runs at the location from which the Job is launched.
Configure the Hadoop Cluster connection in metadata in your Studio.
Right-click Hadoop Cluster then click Create Hadoop Cluster.
Select the distribution and version of your Hadoop cluster, then select Retrieve configuration from Ambari or Cloudera.
Enter your Ambari URL along with your user credentials and click Next.
Cluster information will be retrieved and populated.
After the information is populated, Studio displays a warning that the resource manager is invalid. A closer look shows that the resource manager's port is missing. This is because in the Hortonworks configuration files, the resource managers used in HA mode are referenced by hostname only, with no port.
When Resource Manager HA is enabled in Hortonworks, the port changes from the single-instance default of 8050 to 8032. Enter 8032 as the resource manager port, then click Check Services to ensure that your Studio can connect successfully to the cluster.
Due to the nature of distributed Jobs, it is a best practice to treat the context of a Spark Job as immutable once the Job starts; this is why tContextLoad is not offered in the Spark palette. Within a Spark Job, different parts of the Job run on different executors, so it is very difficult to maintain a globally consistent context state across executors.
The recommended approach is to utilize a DI Job to orchestrate the context of the Spark Job. Most of the time, a trigger will decide the context to be used in the Job, so the approach is to have the trigger part in DI set the context, and then pass it to the Spark job for the main part of the job.
Right-click Job Designs, then click Create Standard Job and give it a name. To simulate a trigger that causes a certain context to be loaded, this test Job connects a tFixedFlowInput to a tFileOutputDelimited, writing one entry to a file.
From the tFileOutputDelimited, a Run If trigger goes to a tJava component that will set your context. In the Run If trigger, set a condition that checks that at least one line was written to the file.
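A Run If condition is a Java boolean expression evaluated against the Job's globalMap. The original article showed the condition in a screenshot; the sketch below simulates the idea with a plain Map, assuming the output component is named tFileOutputDelimited_1 (the actual key depends on your component's name):

```java
import java.util.HashMap;
import java.util.Map;

public class RunIfConditionSketch {

    // In generated Talend code, globalMap holds per-component statistics;
    // "<component>_NB_LINE" is the number of lines that component processed.
    static boolean shouldFire(Map<String, Object> globalMap) {
        // Typical Run If condition, entered as a bare expression in Studio:
        return ((Integer) globalMap.get("tFileOutputDelimited_1_NB_LINE")) > 0;
    }

    public static void main(String[] args) {
        Map<String, Object> globalMap = new HashMap<>();
        globalMap.put("tFileOutputDelimited_1_NB_LINE", 1); // simulate one written row
        System.out.println(shouldFire(globalMap)); // prints "true": the tJava branch runs
    }
}
```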
Once a new line is detected in the file, the condition fires and execution moves into the next component, tJava, which sets the context.
The tJava component sets a new value for your context variable and prints it to confirm that the value is being picked up.
In this example's Contexts tab, the variable is defined with no default value, so whatever value you set in the tJava component is used.
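The tJava body itself was shown in a screenshot in the original article. The sketch below simulates what such a body does, using a stand-in for the context class that Studio generates from the Contexts tab; the variable name myContextVar and the assigned value are hypothetical:

```java
public class TJavaContextSketch {

    // Stand-in for the generated context class; in a real Job, Studio
    // generates one public field per variable defined on the Contexts tab.
    static class Context {
        public String myContextVar; // hypothetical variable, no default value
    }

    // Equivalent of the tJava component body: assign the dynamic value,
    // then print it to confirm the value is being picked up.
    static void setContext(Context context, String value) {
        context.myContextVar = value;
        System.out.println("context.myContextVar = " + context.myContextVar);
    }

    public static void main(String[] args) {
        Context context = new Context();
        setContext(context, "dynamic_value"); // value decided by the trigger
    }
}
```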
With the context that you want your Spark Job to use now set, connect the tJava component to a tRunJob component configured to pass your context to the Spark Job.
In this case, the tRunJob transmits the whole context to the Spark job.
Right-click Job Designs, click Create Big Data Batch Job, and give it a name.
Use this Spark job in the tRunJob component as the one to be executed.
Drag the HDFS connection created above from the Hadoop Cluster metadata to the canvas and add it as a tHDFSConfiguration component. Notice that it immediately fills in the Run tab > Spark configuration information, so the Job knows how to communicate with Spark.
Add a tFixedFlowInput component from the palette to the canvas.
In its configuration, add content that reads the context passed from the DI orchestrator Job, to confirm that it reaches your Spark executors.
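The tFixedFlowInput content was shown in a screenshot in the original article; typically its single-table column value is simply the context variable expression. The sketch below simulates that, again with a hypothetical variable name myContextVar standing in for whatever you defined on the Contexts tab:

```java
public class SparkContextReadSketch {

    // Stand-in for the generated context class; in the real Job, the value
    // is injected by the tRunJob component in the DI orchestrator Job.
    static class Context {
        public String myContextVar; // hypothetical variable name
    }

    // Equivalent of the tFixedFlowInput column value expression,
    // i.e. just "context.myContextVar", evaluated on the Spark side.
    static String fixedFlowValue(Context context) {
        return context.myContextVar;
    }

    public static void main(String[] args) {
        Context context = new Context();
        context.myContextVar = "dynamic_value"; // passed in by tRunJob
        // The downstream tLogRow prints the row, confirming the executors
        // see the value set by the DI orchestrator Job.
        System.out.println("|" + fixedFlowValue(context) + "|");
    }
}
```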
This component outputs the value to a tLogRow component, to confirm that you received the correct value from the DI Job.
Here are the two jobs that you will have in the end: one DI job for orchestrating and setting up your dynamic context, and one Spark job that will be utilizing it.
Check that the DI Job that dynamically sets the context produces the expected context output.
Then confirm that the Spark Job received and used the dynamic context set by the DI orchestrator Job.
The same Job design works for all Hadoop distributions.
You can also extend this design by adding a tContextLoad setup in the DI orchestrator Job, to load the context from a database or file and set it before running the Spark Job.