Running a Spark Job on a Microsoft Azure Databricks cluster

Overview

This article shows you how to create a sample Spark Job and run it on a Microsoft Azure Databricks cluster.

 

Powered by Apache Spark, Databricks is one of the first platforms to provide serverless computing. It also provides automated cluster management that scales according to the load.
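
Autoscaling is a property of the Databricks cluster itself. As an illustration only (this is not part of the Job you build below), the following minimal Java sketch creates such a cluster through the Databricks Clusters REST API (POST /api/2.0/clusters/create); the workspace URL, token, runtime version, VM size, and worker counts are all placeholder assumptions you would replace with your own values.

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    // Illustration only: create an autoscaling Azure Databricks cluster via
    // the Clusters REST API. Every quoted value below is a placeholder.
    public class CreateAutoscalingCluster {
        public static void main(String[] args) throws Exception {
            String workspaceUrl = "https://<your-region>.azuredatabricks.net";
            String token = "<your-personal-access-token>";

            // An "autoscale" block (instead of a fixed "num_workers") lets
            // Databricks grow and shrink the cluster with the load.
            String payload = "{"
                    + "\"cluster_name\": \"talend-demo\","
                    + "\"spark_version\": \"5.5.x-scala2.11\","
                    + "\"node_type_id\": \"Standard_DS3_v2\","
                    + "\"autoscale\": {\"min_workers\": 2, \"max_workers\": 8}"
                    + "}";

            HttpURLConnection conn = (HttpURLConnection)
                    new URL(workspaceUrl + "/api/2.0/clusters/create").openConnection();
            conn.setRequestMethod("POST");
            conn.setRequestProperty("Authorization", "Bearer " + token);
            conn.setRequestProperty("Content-Type", "application/json");
            conn.setDoOutput(true);
            try (OutputStream os = conn.getOutputStream()) {
                os.write(payload.getBytes(StandardCharsets.UTF_8));
            }
            System.out.println("HTTP " + conn.getResponseCode());
        }
    }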

 

Prerequisites

To follow this article, you need Talend Studio with Big Data capabilities and a running Microsoft Azure Databricks cluster, together with its endpoint URL, cluster ID, and a personal access token.

Job Design

  1. Open Talend Studio.

  2. In the Repository view, expand Job Designs, right-click Big Data Batch, then select Create Big Data Batch Job.


  3. In the pop-up window, enter Databricks_Sample in the Name text box. Fill in the Purpose and Description text boxes. Click Finish.


  4. Search for the tRowGenerator component in the Palette on the right, then drag it to the Designer.


  5. In the Basic settings view of the tRowGenerator component, clear the Define a storage configuration component check box.


  6. Double-click the tRowGenerator component. In the pop-up window, click the green + sign three times to add three columns.


  7. Rename the newColumn column to ID, change the Type to Integer, then select Numeric.sequence(String,int,int) from the Functions drop-down menu.


  8. Rename the newColumn1 column to FirstName, leave the Type as String, then select TalendDataGenerator.getFirstName() from the Functions drop-down menu. Similarly, rename the newColumn2 column to LastName, leave the Type as String, then select TalendDataGenerator.getLastName() from the Functions drop-down menu. Click OK.


  9. Search for the tLogRow component in the Palette on the right, then drag it to the Designer.


  10. Right-click the tRowGenerator component, select Row > Main, then click the tLogRow component to connect them. A conceptual sketch of this flow appears after this procedure.


  11. In the Run view, select the Spark Configuration tab. Clear the Use local mode check box, then select Databricks from the Distribution drop-down menu.

  12. Configure the Endpoint, Cluster ID, and Token using your Microsoft Azure Databricks cluster registration settings. One way to sanity-check these values outside the Studio is shown in the second sketch after this procedure.


  13. Select the Basic Run tab. Click Run.


  14. After the Job completes successfully, review the output in the Spark driver logs in the Azure Databricks portal.
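
Conceptually, the Job you just built generates rows and then logs them (see step 10). The following is a rough, self-contained Java Spark sketch of that same flow, for illustration only; it is not the code Talend Studio generates, and the hard-coded names merely stand in for the values Numeric.sequence() and TalendDataGenerator.getFirstName()/getLastName() produce at run time.

    import java.util.Arrays;
    import java.util.List;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.RowFactory;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructType;

    public class DatabricksSampleSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("Databricks_Sample")
                    .getOrCreate();

            // Schema mirroring the tRowGenerator columns from steps 6-8.
            StructType schema = new StructType()
                    .add("ID", DataTypes.IntegerType)
                    .add("FirstName", DataTypes.StringType)
                    .add("LastName", DataTypes.StringType);

            // A sequential ID plus sample names, standing in for
            // Numeric.sequence() and TalendDataGenerator.get*Name().
            List<Row> rows = Arrays.asList(
                    RowFactory.create(1, "Arthur", "Stewart"),
                    RowFactory.create(2, "Betty", "Walker"),
                    RowFactory.create(3, "Chris", "Young"));
            Dataset<Row> df = spark.createDataFrame(rows, schema);

            // tLogRow equivalent: print the rows to the console / driver log.
            df.show();
            spark.stop();
        }
    }

When the Job is submitted through Talend Studio instead, the equivalent output appears in the Spark driver logs, as described in step 14.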

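If the Job cannot reach the cluster, the settings from step 12 can be sanity-checked outside the Studio. The hypothetical snippet below calls the Databricks Clusters REST API (GET /api/2.0/clusters/get) to fetch the cluster state; the endpoint URL, cluster ID, and token are placeholders you must replace with your own values.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    // Hypothetical check of the Databricks connection settings used in the
    // Spark Configuration tab. All three values below are placeholders.
    public class CheckDatabricksCluster {
        public static void main(String[] args) throws Exception {
            String endpoint = "https://<your-region>.azuredatabricks.net"; // Endpoint
            String clusterId = "<your-cluster-id>";                        // Cluster ID
            String token = "<your-personal-access-token>";                 // Token

            URL url = new URL(endpoint + "/api/2.0/clusters/get?cluster_id=" + clusterId);
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestProperty("Authorization", "Bearer " + token);

            // HTTP 200 plus a JSON body containing "state" (for example
            // RUNNING) indicates the settings are valid and the cluster
            // is reachable.
            System.out.println("HTTP " + conn.getResponseCode());
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()))) {
                in.lines().forEach(System.out::println);
            }
        }
    }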

Conclusion

This article showed you how to build a sample Spark Job in Talend Studio and how to run it on the Spark engine of a Microsoft Azure Databricks cluster.
