Talend Job execution using Apache Airflow

Overview

Apache Airflow is a platform to programmatically author, schedule, and monitor workflows. Airflow represents each workflow as a Directed Acyclic Graph (DAG) of tasks. For more information, see the Apache Airflow Documentation page.

 

This article shows you how to leverage Apache Airflow to orchestrate, schedule, and execute Talend Data Integration (DI) Jobs.

 

Environment

  • Apache Airflow 1.10.2
  • Nexus 3.9
  • WinSCP 5.15
  • PuTTY

 

Prerequisites

  1. Apache Airflow installed on a server (follow the Installing Apache Airflow on Ubuntu/AWS installation instructions).
  2. Python 2.7 installed on the Airflow server.
  3. Java 1.8 installed on the Airflow server.
  4. Access to the Nexus server from the Airflow server (in this example, both Nexus and Airflow are installed on the same server).
  5. Talend 7.x Jobs published to the Nexus repository. (For more information on how to set up a CI/CD pipeline to publish Talend Jobs to Nexus, see Configuring Jenkins to build and deploy project items in the Talend Help Center.)
  6. Access to the setup_files.zip file (attached to this article).
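
Two of the prerequisites above (Java on the Airflow server, and network access to Nexus) can be checked before any DAG is deployed. The following is a minimal sketch, not part of setup_files.zip: the function name check_prerequisites is hypothetical, the host and port shown are placeholders, and the code assumes Python 3 for shutil.which, even though the article's Airflow server runs Python 2.7.

```python
import shutil
import socket


def check_prerequisites(nexus_host, nexus_port, timeout=3.0):
    """Return a dict mapping each checked prerequisite to True or False."""
    results = {
        # Java must be on PATH so the Talend Job launcher script can start a JVM.
        "java_on_path": shutil.which("java") is not None,
    }
    try:
        # A successful TCP connection only shows that the port is open; it does
        # not prove that the repository exists or that credentials are valid.
        with socket.create_connection((nexus_host, nexus_port), timeout=timeout):
            results["nexus_reachable"] = True
    except OSError:
        results["nexus_reachable"] = False
    return results


if __name__ == "__main__":
    # "localhost" and 8081 are placeholder values; substitute your own.
    print(check_prerequisites("localhost", 8081))
```

A False value for either key suggests revisiting the corresponding prerequisite before continuing.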

 

Process flow

  1. Develop Talend DI Jobs using Talend Studio.
  2. Publish the DI Jobs to the Nexus repository using the Talend CI/CD module.
  3. Execute the Directed Acyclic Graph (DAG) in Apache Airflow:
    • The first step in the DAG is to download the Job executable from Nexus using a customized script.
    • The second step is to execute the downloaded Job.

    Flow.jpg
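
The two DAG steps described above can be sketched as a pair of Airflow tasks chained in order. This is an illustration under stated assumptions, not the contents of talend_job_dag_template.py: the helper names, the task IDs, the arguments passed to download_job.sh, and the jobs/<job_name>/<job_name>_run.sh layout of the extracted Job are all hypothetical. The import paths match Airflow 1.10.x.

```python
from datetime import datetime


def download_command(airflow_home, job_name, job_version):
    # Hypothetical argument list: the arguments that download_job.sh really
    # expects are defined in setup_files.zip and are not shown in this article.
    return "sh {home}/scripts/download_job.sh {name} {version}".format(
        home=airflow_home, name=job_name, version=job_version
    )


def run_command(airflow_home, job_name):
    # Assumes the downloaded archive unpacks to jobs/<job_name>/<job_name>_run.sh.
    # The trailing space is deliberate: BashOperator treats a command ending in
    # ".sh" as a Jinja template file name and tries to load it.
    return "sh {home}/jobs/{name}/{name}_run.sh ".format(
        home=airflow_home, name=job_name
    )


def build_dag(dag_id, airflow_home, job_name, job_version):
    # Imported here so the helpers above can be used without Airflow installed.
    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    # schedule_interval=None means the DAG runs only when triggered.
    dag = DAG(dag_id, start_date=datetime(2019, 1, 1), schedule_interval=None)

    download = BashOperator(
        task_id="download_job",
        bash_command=download_command(airflow_home, job_name, job_version),
        dag=dag,
    )
    run = BashOperator(
        task_id="run_job",
        bash_command=run_command(airflow_home, job_name),
        dag=dag,
    )

    # Step 2 starts only after step 1 succeeds.
    download >> run
    return dag
```

In a file placed in the dags folder, the result would be assigned to a module-level variable, for example dag = build_dag("talend_sample_job", "/home/airflow/airflow", "sample_job", "0.1"), because Airflow collects DAG objects from the module's global namespace.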

     

Configuration and execution

  1. Log in to the Airflow server through SSH using WinSCP or PuTTY.
  2. Create two folders named jobs and scripts under the AIRFLOW_HOME folder.

    Image 1.png

     

  3. Extract setup_files.zip, then copy the shell scripts (download_job.sh and delete_job.sh) to the scripts folder.

    Image 2.png

     

  4. Copy the talend_job_dag_template.py file from setup_files.zip to your local machine and update the following variables:

    • nexus_host
    • nexus_port
    • airflow_home
    • nexus_repo
    • job_group_id
    • job_name
    • job_version

    Also, update the default_args dictionary based on your requirements.

    Image 3.jpg

    For more information, see the Apache Airflow documentation: Default Arguments.

  5. The provided DAG template is configured to be triggered externally. If you plan to schedule the task instead, update the schedule_interval parameter in the DAG definition with a value that matches your scheduling requirements.

    Task.jpg

    For more information on values, see the Apache Airflow documentation: DAG Runs.

  6. Rename the updated template file and place it in the dags folder under the AIRFLOW_HOME folder.
  7. After the Airflow scheduler picks up the DAG file, a compiled file with the same name and with a .pyc extension is created.

    Image 4.jpg

     

  8. Refresh the Airflow UI screen to see the DAG.

    Note: If the DAG is not visible on the User Interface under the DAGs tab, restart the Airflow webserver and the Airflow scheduler.

    Image 5.jpg

     

  9. To schedule the task, toggle the button to On. You can also run the task manually.

    Image 5 on.jpg

  10. Monitor the run status on the Airflow UI.
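
As an illustration of steps 4 and 5 above, the sketch below shows what the updated variables, a default_args dictionary, and a schedule_interval value might look like. Every value is a placeholder, and the nexus_artifact_url helper is hypothetical: it assumes the Job is stored in a Maven-format Nexus 3 repository and served under the standard /repository/<repo>/<group path>/<artifact>/<version>/ layout. The actual download_job.sh script may build its URL differently.

```python
from datetime import datetime, timedelta

# Step 4: connection and Job settings (all values are placeholders).
nexus_host = "localhost"
nexus_port = "8081"
airflow_home = "/home/airflow/airflow"
nexus_repo = "releases"
job_group_id = "org.example.talend"
job_name = "sample_job"
job_version = "0.1.0"

# Step 4: defaults applied to every task in the DAG.
default_args = {
    "owner": "airflow",
    "depends_on_past": False,
    "start_date": datetime(2019, 6, 1),
    "email_on_failure": False,
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

# Step 5: None keeps the DAG externally triggered. Alternatives include a
# preset, a cron expression, or a timedelta.
schedule_interval = None
# schedule_interval = "@daily"
# schedule_interval = "0 2 * * *"          # every day at 02:00
# schedule_interval = timedelta(hours=6)


def nexus_artifact_url(host, port, repo, group_id, name, version):
    """Build the assumed Maven-layout download URL for a Job archive."""
    # Maven layout stores the group ID as nested folders: a.b.c -> a/b/c.
    group_path = group_id.replace(".", "/")
    return (
        "http://{host}:{port}/repository/{repo}/"
        "{group}/{name}/{version}/{name}-{version}.zip"
    ).format(
        host=host, port=port, repo=repo,
        group=group_path, name=name, version=version
    )


if __name__ == "__main__":
    print(nexus_artifact_url(nexus_host, nexus_port, nexus_repo,
                             job_group_id, job_name, job_version))
```

Opening the printed URL in a browser or with a download tool is a quick way to confirm that the group ID, Job name, and version match what was published to Nexus.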

 

Conclusion

In this article, you learned how to author, schedule, and monitor workflows from the Airflow UI, and how to download and trigger Talend Jobs for execution.

Version history
Revision #: 14 of 14
Last update: 06-26-2019 01:49 PM

Comments

Can Argo be used with Talend the same way Airflow is here?