How to connect to ADLS Gen2 using Azure Databricks

Overview

This article shows you how to design a Talend Spark Databricks Job that connects securely to, and interacts with, Azure Data Lake Storage (ADLS) Gen2.

 

Environment

  • Talend Studio 7.2.1
  • Databricks 5.4
  • ADLS Gen2 Storage

 

Prerequisites

 

Configuring Azure

 

Create a service principal

 

  1. Create a service principal from the Azure Portal by navigating to Azure Active Directory and selecting App Registrations, then selecting New registration.

    App_Registration_2.JPG

     

  2. Provide the name of the registration, then select the Accounts in this organizational directory only (Talend only - Single tenant) radio button under Supported account types. For a service principal, the Redirect URI is optional, so leave it blank. Click Register.

    App_Registration.JPG

     

  3. Record the Application (client) ID information. You'll need it later.

    endpoint_step3.jpg

     

  4. Create a key for your service principal by selecting Certificates & secrets from the menu on the left, then selecting New client secret.

    certificates_and_secrets.JPG

     

  5. Enter a description and select an expiration. Click Add.

    certificates_and_secrets_2.JPG

     

  6. Record the key value. Important: this is the only opportunity you'll have to capture the key value, so make sure you save it; you'll need it later.
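
     Tip: if you prefer to script this setup, the Azure CLI can create the app registration, service principal, and key in one step. A minimal sketch, assuming you are logged in with the Azure CLI (the name talend-adls-sp is only an example):

     az ad sp create-for-rbac --name talend-adls-sp

     The JSON output includes appId (the Application (client) ID from Step 3), password (the key value from Step 6), and tenant (your directory ID).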

     

Capture the OAuth 2.0 token endpoint

  1. On the Overview menu, select Endpoints.

    endpoints (1).JPG

     

  2. After the Endpoints window opens, use the copy button next to OAuth 2.0 token endpoint to capture the information; you'll need it in the Databricks Job.

    endpoints2.JPG
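
     For reference, the copied OAuth 2.0 token endpoint has this general shape, where the GUID is your Azure AD directory (tenant) ID:

     https://login.microsoftonline.com/<directory-tenant-id>/oauth2/token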

     

Access ADLS Gen2 storage

This section shows you how to use the Azure CLI and Azure Storage Explorer applications to verify that your service principal has permission to access the ADLS Gen2 storage.

  1. In the Azure CLI, pass the Application (client) ID you captured in the Create a service principal section of this article to get the Object ID of your service principal, which you'll use to set permissions on the ADLS Gen2 storage. Enter the following command:
    az ad sp show --id <application-client-id>

    The return output looks like this:

    azure_cli.JPG
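
     If you only need the Object ID, you can filter the output with the CLI's built-in JMESPath query support. Note that this assumes a 2019-era Azure CLI, where the property is named objectId; newer, Microsoft Graph-based CLI versions rename it to id:

     az ad sp show --id <application-client-id> --query objectId --output tsv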

     

  2. Open Azure Storage Explorer and locate your ADLS Gen2 storage. You'll find the Blob Containers node under it; you'll use one of its containers in your Talend Job.

    azure_storage_explorer_2.JPG

     

  3. Right-click the container and select Manage Access. In the window that opens, add the Object ID you captured in Step 1, then set the permissions to meet your requirements.

    azure_storage_explorer.JPG

     

Creating Databricks secrets

This section shows you how to use Databricks secrets to store the credentials for the ADLS Gen2 storage, and reference them in your Jobs.

  1. Create a new secret scope.

    create_secret_scope_and_list.JPG
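
     In text form, the command is the following; this sketch assumes the legacy databricks-cli, and the scope name talendadlsgen2 matches the Notebook code in Step 8:

     databricks secrets create-scope --scope talendadlsgen2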

     

  2. Add a secret to the scope by running the following command:

    create_and_add_info_on_secret.JPG
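
     In text form (again assuming the legacy databricks-cli; the key name adlscredentials matches the Notebook code in Step 8):

     databricks secrets put --scope talendadlsgen2 --key adlscredentials

     The put command opens a text editor for the secret value, which is the Notepad window referenced in the next step.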

     

  3. After the Notepad window opens, add and save your service principal key.

    create_and_add_info_on_secret_2.JPG

     

  4. Repeat the process to create secrets for the Client ID and the Endpoint.

    list_of_databricks_Secrets.JPG
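
     To confirm that all three secrets exist in the scope, you can list them with the same CLI:

     databricks secrets list --scope talendadlsgen2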

     

  5. Following the instructions in the Process data stored in Azure Data Lake Store with Databricks using Talend article, complete the steps in the Create a Cluster section to create a Databricks cluster. That article uses a Databricks 3.5 LTS cluster in its example, but the same steps apply when creating a 5.4 cluster.

  6. After the cluster is created and running, navigate to the main Azure Databricks Workspace page, then select Create a Blank Notebook.

    databricks_blank_notebook.jpg

     

  7. Name the Notebook, select Scala from the Language pull-down list, then select the 5.4 cluster you created in Step 5 from the Cluster pull-down list. Click Create.

    create_databricks_notebook.jpg

     

  8. To leverage the Databricks secrets you created to mount the ADLS Gen2 storage in DBFS and validate that you can read from it, add the following Scala code:

    // Build the OAuth 2.0 client-credentials configuration for ABFS,
    // reading the client ID, key, and token endpoint from the
    // Databricks secrets created above.
    val configs = Map(
      "fs.azure.account.auth.type" -> "OAuth",
      "fs.azure.account.oauth.provider.type" -> "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
      "fs.azure.account.oauth2.client.id" -> dbutils.secrets.get(scope = "talendadlsgen2", key = "adlsclientid"),
      "fs.azure.account.oauth2.client.secret" -> dbutils.secrets.get(scope = "talendadlsgen2", key = "adlscredentials"),
      "fs.azure.account.oauth2.client.endpoint" -> dbutils.secrets.get(scope = "talendadlsgen2", key = "adlsendpoint")
    )

    // Mount the ADLS Gen2 container in DBFS. Optionally, you can append
    // a directory to the source URI of your mount point.
    dbutils.fs.mount(
      source = "abfss://adlsgen2test@pnomikosadlsv2.dfs.core.windows.net/",
      mountPoint = "/mnt/adls",
      extraConfigs = configs)

    // Optionally, validate that you can read from the mount point
    // val df = spark.read.text("/mnt/adls/databrickstest2/part-00000")
    // df.show()

    // Optionally, unmount the mount point from DBFS
    // dbutils.fs.unmount(mountPoint = "/mnt/adls")
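
    If you want an extra check, you can add one more line after the mount call to list the mount point (an optional sketch; this line is not part of the original Notebook):

    display(dbutils.fs.ls("/mnt/adls"))
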
  9. Run the Notebook. Notice that the ADLS Gen2 storage is mounted in DBFS and is accessible by your Databricks cluster.

    databricks_notebook_for_mounting_adlsgen2.JPG

    Optional: consider adding the following before building your Talend Studio Job:

    • To mount or unmount the ADLS Gen2 storage from DBFS, or to verify from your Talend pipeline that it is mounted, you can leverage a tRESTClient component in a DI Job to call the Notebook through the Databricks Jobs API, as defined on the Databricks Runs submit page.

    • Leverage the Databricks Jobs API to call the Notebook you created by adding a tJavaRow component with the following JSON request (a formatted view of the payload follows the screenshot below):

      row8.string = new String("{\"run_name\": \"sparkadlsgen2\", \"new_cluster\": {\"spark_version\": \""+ context.Notebook_Spark_Version + "\",\"node_type_id\": \"" + context.Node_Type + "\",\"num_workers\": " + context.databricks_Workers +"},\"notebook_task\": { \"notebook_path\":\"/Users/" + context.databricks_user + "/adlsgen2mount\"}}");

      tjavarow.jpg
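
       For readability, the string assembled in the tJavaRow resolves to a Runs submit payload of the following shape (the context values are placeholders resolved at run time), which the tRESTClient POSTs to the Jobs API runs/submit endpoint:

       {
         "run_name": "sparkadlsgen2",
         "new_cluster": {
           "spark_version": "<context.Notebook_Spark_Version>",
           "node_type_id": "<context.Node_Type>",
           "num_workers": <context.databricks_Workers>
         },
         "notebook_task": {
           "notebook_path": "/Users/<context.databricks_user>/adlsgen2mount"
         }
       }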

       

    • Connect the tRESTClient component to the tJavaRow component using a Row > Main connection.

      tjavarow_tRESTClient.jpg

       

    • Configure the tRESTClient component, as shown below:

      tRESTClient.JPG

       

Building a Talend Studio Job

  1. In the Repository, right-click Job designs, click Create Big Data Batch Job, then name the Job. Click Finish.

    Studio_Create_Big_Data_Batch_Job.JPG

     

  2. Add a tFileInputDelimited component to the designer. In the Folder/File path, enter the path to the file using the mount point you created in DBFS, for example "/mnt/adls/databrickstest2/part-00000". Notice that the Define a storage configuration component check box is cleared: when no storage configuration component is defined, the component reads from the DBFS filesystem by default.

    tFileInputDelimited.JPG

     

  3. Click the [...] button next to Edit schema and define the schema.

    tFileInputDelimitedSchema.JPG

     

  4. Add a tFileOutputDelimited component and connect it to the tFileInputDelimited using the Row > Main connection. In the Folder path, enter the path using the DBFS mount point, selecting the folder in ADLS Gen2 where the output will be written. Again, leave the Define a storage configuration component check box cleared so that the component writes to DBFS by default.

    tFileOutputDelimited.JPG

     

  5. Click the Run tab and select Spark Configuration, then, using the information you collected during the creation of the Databricks cluster, configure the connection to your Databricks cluster.

    Note: you can leave the DBFS dependencies folder blank, or set a path if you want the Job dependencies to be uploaded to a specific location. Talend 7.2.1 offers a patch that adds support for Databricks 5.4 clusters; if you need those dependencies, you can request the patch from Talend Support.

    spark_configuration.jpg

     

  6. Your Job should look like this:

    finished_job.JPG

     

  7. Run the Job.

  8. Review the output and verify that you have successfully connected to ADLS Gen2 using your Databricks cluster.

    jobrunOutput.JPG

     

  9. Open Azure Storage Explorer and verify that the folder exists and that the output is correct.

    azure_storage_explorer_output.JPG

    azure_storage_explorer_output_2.JPG

Conclusion

This article showed you how to use Azure and Databricks secrets to design a Talend Spark Databricks Job that securely interacts with Azure Data Lake Storage (ADLS) Gen2.
