Azure Data Lake Store Gen1 (ADLS) is a hyper-scale Big Data store. It is common for solutions within the Azure environment to store data in ADLS and process it with other compute resources, such as Azure Databricks. Azure Databricks is a consumption-based managed Spark service that simplifies processing of Big Data and Artificial Intelligence workloads. Processing data with Talend within this environment is a common pattern. Talend 7.1 adds support for executing Jobs in Azure Databricks 3.5 LTS. This article explains the process of creating a solution that processes data stored in ADLS with Databricks using Talend.
Service principals are a means of authenticating within the Azure environment.
Create a service principal from the Azure Portal by navigating to Azure Active Directory and selecting App Registrations. Then select New application registration.
Provide a name for the registration and a URL. For a service principal, the URL is required but not used. Select Web app / API as the Application type. Click Create.
Record the resulting information, which includes the Application ID, also known as the Client ID. The Client ID is used during the configuration of the Talend Job and during the creation and configuration of the Databricks cluster, so be sure to save it for later.
Create a key by clicking Settings to open a blade containing the Keys option. Enter a description and select an expiration. Click Save.
Record the key value. Important: this is the only opportunity to capture the key value. The key is required to configure the Talend Job and Databricks cluster.
Grant the service principal access to the Azure Data Lake API by selecting Required Permissions, then click Add. Select Azure Data Lake.
Grant full access to the ADLS service and click Select, then Done.
The last Azure Active Directory data element needed is the OAUTH 2.0 TOKEN ENDPOINT.
Navigate to App Registrations and select Endpoints.
Click the copy button next to the textbox containing the OAUTH 2.0 TOKEN ENDPOINT. Save the resulting value to use when you configure the Databricks cluster and the Talend Job.
At this point, you should have captured and stored three values for future use, as shown in the examples below. Note: these values are examples and will not work in your environment.
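To make the shape of the three values concrete, here is a sketch of how they might be kept together for later use. Every value below is a hypothetical placeholder, not a working credential; only the general format (a GUID for the Client ID, and a login.microsoftonline.com token endpoint containing your tenant ID) reflects what the portal actually produces.

```python
# Hypothetical placeholder values -- record your own from the portal.
captured = {
    # Application ID (Client ID) of the service principal: a GUID
    "client_id": "12345678-abcd-1234-abcd-1234567890ab",
    # Key (client secret) recorded when the key was created
    "client_secret": "s0me+Generated/Key=Value",
    # OAUTH 2.0 TOKEN ENDPOINT copied from the Endpoints blade;
    # <tenant-id> stands in for your directory's tenant GUID
    "token_endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}
```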
If an ADLS store does not exist, create one using the Azure Portal.
Provide a unique Name, Subscription, Resource group, and Location for the new ADLS. Click Create.
Grant permission to the previously created service principal, so that it can interact with the ADLS. Navigate to ADLS Data explorer for the appropriate ADLS instance.
Depending on security requirements, grant the previously created service principal access to the appropriate location within the ADLS, such as a folder. Select Access > Add, then search for the service principal using the description you created earlier.
Select the service principal, then click Select. Choose the appropriate permissions, the permission scope, and default or access permission. Once the permissions are selected, click OK. Note that to access an object at a lower level, the account must have read and execute permissions on all ancestors of that item. For more information on ADLS permissions, see the Access control in Azure Data Lake Gen 1 article referenced in the Resources section of this article.
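The ancestor rule above can be illustrated with a small sketch. This is toy logic over an in-memory permission map, not the ADLS API: to read a file, the principal needs read and execute on every ancestor folder plus read on the file itself. The paths and permission sets are hypothetical.

```python
# Toy illustration of the ADLS Gen1 ancestor rule (not the ADLS API).
def can_read(acls, path):
    """acls maps a path to the set of permissions granted, e.g. {"r", "x"}."""
    parts = path.strip("/").split("/")
    # Read and execute are required on every ancestor directory.
    for i in range(1, len(parts)):
        ancestor = "/" + "/".join(parts[:i])
        if not {"r", "x"} <= acls.get(ancestor, set()):
            return False
    # Finally, read permission on the object itself.
    return "r" in acls.get(path, set())

# Hypothetical ACLs: the principal can traverse /data and /data/sensors
# and read the file, so access succeeds.
acls = {
    "/data": {"r", "x"},
    "/data/sensors": {"r", "x"},
    "/data/sensors/TempHumidData.csv": {"r"},
}
```

Dropping execute from any ancestor (for example, granting only read on /data) makes the same check fail, which is why granting access only at the leaf folder is not sufficient.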
Databricks offers Spark as a managed service. An Azure Databricks service has zero or more clusters. This section discusses the provisioning or identification of a Databricks service instance, and the creation and configuration of a cluster within that service.
If one does not exist, create an Azure Databricks Service.
As recommended, ensure that the ADLS instance is in the same location as the Azure Databricks service. For more information, see the Azure Databricks article referenced in the Resources section of this article.
Select the newly created Azure Databricks Service or choose an existing one.
Click Launch Workspace.
Databricks utilizes clusters to execute Jobs. Click the Clusters icon on the left to navigate to the Clusters section.
Initially, a workspace does not have clusters associated with it. Select Create Cluster to start the creation process.
Assign a name to the cluster.
Select Standard as the Cluster Mode type.
Select 3.5 LTS, the version required by Talend 7.1, as the Databricks Runtime Version.
Select the appropriate sizing of the Driver Type and Worker Type, based on the expected workloads.
Auto Termination is appropriate in non-production environments, where cost management is of greater concern than responsiveness. When Auto Termination is enabled, the cluster shuts down after the specified period of inactivity. The default is 120 minutes, but you can adjust it according to your requirements.
The Spark Configuration section of the cluster is used to capture information necessary for the Jobs to access ADLS.
Add the items in Table 1 to the Spark Configuration section, using the previously captured values.
Note: replace <insert client id here> with the Application Id, replace <insert client secret key here> with the key value associated with the Application/Service Principal, and replace <insert url endpoint here> with the OAUTH 2.0 TOKEN ENDPOINT you captured earlier.
Table 1 - Spark Configuration
spark.hadoop.dfs.adls.oauth2.client.id <insert client id here>
spark.hadoop.dfs.adls.oauth2.credential <insert client secret key here>
spark.hadoop.dfs.adls.oauth2.refresh.url <insert url endpoint here>
Your configuration section should look like this:
Talend Studio requires the Databricks cluster endpoint for execution. The URL is typically in the format https://&lt;location&gt;.azuredatabricks.net; in this case, it is https://eastus2.azuredatabricks.net. Make a note of this value, as it is needed later.
You can capture the Cluster Id in two ways. One way is by examining the URL. The second, and preferred way, is by looking at the Environment section of the Spark UI tab of the cluster.
Search for ClusterId to locate the value.
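As a convenience, the Cluster ID can also be pulled out of the browser URL while viewing the cluster. This sketch assumes the URL follows the common #/setting/clusters/&lt;cluster-id&gt;/... pattern; the URL and cluster ID shown are hypothetical.

```python
import re

def cluster_id_from_url(url):
    """Extract the cluster ID from a workspace URL of the assumed form
    https://<location>.azuredatabricks.net/#/setting/clusters/<id>/configuration."""
    match = re.search(r"/clusters/([^/]+)", url)
    return match.group(1) if match else None

# Hypothetical URL for illustration only.
url = "https://eastus2.azuredatabricks.net/#/setting/clusters/0904-162233-abcd123/configuration"
```

The Spark UI's Environment tab remains the authoritative source if your workspace URL is structured differently.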
To grant Talend Studio permissions to push a Job to the Spark cluster, you must first generate a token in the Databricks workspace.
Click the User icon on the top left of the Databricks workspace, then select User Settings.
Click Generate New Token on the Access Tokens tab.
Provide a comment describing the purpose of the token and a lifetime in days for that token.
Make a note of the generated token.
At this point you should have captured the following information:
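Before configuring Talend Studio, the token can be sanity-checked outside the workspace by calling the Databricks REST 2.0 clusters/list API with it. The sketch below only builds the authenticated request; the endpoint and token values are hypothetical placeholders, and actually sending the request requires network access to your workspace.

```python
import urllib.request

def build_clusters_list_request(endpoint, token):
    """Build a GET request for the Databricks REST 2.0 clusters/list API,
    authenticated with the personal access token as a Bearer token."""
    return urllib.request.Request(
        endpoint.rstrip("/") + "/api/2.0/clusters/list",
        headers={"Authorization": "Bearer " + token},
    )

# Hypothetical values; a real check would then call:
#   urllib.request.urlopen(req)
# and expect an HTTP 200 with a JSON list of clusters.
req = build_clusters_list_request("https://eastus2.azuredatabricks.net", "dapiXXXXXXXX")
```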
Sources for the Job are available in the attached DatabricksADLSTempHumidFile.zip and TempHumidData.csv files.
Talend 7.1 added support for executing Big Data Jobs in Databricks 3.5 LTS. For example, a Big Data Batch Job can now target Databricks for execution. The example Job reads a CSV file from ADLS containing a timestamp, temperature, humidity, and probe temperature. The Job then computes the average of the temperature and probe temperature and writes the results back to a different location within ADLS.
The example Talend Job looks like this:
Use the tAzureFSConfiguration component to provide Spark with the authentication information necessary to access ADLS. In this case, copy the previously captured values into the appropriate settings of the component. Again, Client Id corresponds to the Application Id of the service principal. The Client key value is the same value captured during the creation of the key for the service principal.
Use the tFileInputDelimited component to read the input file from ADLS. Note that ADLS is case sensitive, and that the tAzureFSConfiguration component is being used to define storage.
Supplying a schema for the input file simplifies later calculations.
Use the tMap component to compute the average temperature. The AverageTemp expression is a simple average: (row2.AmbientTempF + row2.ProbeTempF) / 2.0.
Use the tFileOutputDelimited component to write out the original values, with the newly computed values.
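The Job's overall logic can be sketched in plain Python: read delimited rows, append an AverageTemp column mirroring the tMap expression, and write everything back out. The AmbientTempF and ProbeTempF column names come from the Job's schema; the Timestamp and Humidity names and the sample row are assumptions for illustration, and the in-memory file objects stand in for the ADLS input and output components.

```python
import csv
import io

def add_average_temp(reader, writer):
    """Append AverageTemp = (AmbientTempF + ProbeTempF) / 2.0 to each row,
    mirroring the Job's tMap expression."""
    rows = csv.DictReader(reader)
    out = csv.DictWriter(writer, fieldnames=rows.fieldnames + ["AverageTemp"])
    out.writeheader()
    for row in rows:
        row["AverageTemp"] = (float(row["AmbientTempF"]) + float(row["ProbeTempF"])) / 2.0
        out.writerow(row)

# Hypothetical sample data standing in for TempHumidData.csv.
source = io.StringIO(
    "Timestamp,AmbientTempF,Humidity,ProbeTempF\n"
    "2018-01-01T00:00:00,70.0,40.0,72.0\n"
)
sink = io.StringIO()
add_average_temp(source, sink)
```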
After you create and test the Job in local mode, configure it to execute on a Databricks cluster.
Select Databricks from the Distribution drop-down list. Populate Endpoint, Cluster ID, and Token using the previously captured values. The Token is the one generated in the Databricks workspace under User Settings.
Run the Job and ensure that it completes successfully. Note that on the first run, required JARs are uploaded to the cluster’s file system. This process may take some time, depending on your connection speeds.
Under certain circumstances, the Databricks cluster recycles before the Job executes. Under normal conditions, the cluster returns to a functioning state and the Job runs.
View the execution results by using the ADLS data explorer.