Executing Talend ETL Jobs on Amazon Fargate using Airflow

Overview

This article shows you how to execute containerized Talend ETL Jobs on a serverless platform, Amazon Fargate, using Apache Airflow.

 

This is a continuation of the Talend Community Knowledge Base article, Provisioning and executing Talend ETL Jobs on Amazon EKS using Airflow.

 

Prerequisites

  1. Follow the Prerequisites described in the previous article of this series.

  2. Perform the steps in the Installing Airflow with Docker section of the previous article.

    Note: For Amazon Fargate, Airflow version 1.10.3 is required.

  3. IAM permissions for Airflow to launch ECS tasks.
  4. Basic understanding of serverless concepts and platforms, specifically, Amazon Fargate.

  5. Download and extract the DAG_ECSOperator_Fargate.zip file (attached to this article).

 

Objective and logical architecture

In this article, you will execute the following steps to deploy a Talend Data Integration (DI) Job to Amazon Fargate.

  1. Import a demo Job into Talend Studio.
  2. Publish a Talend Job to Amazon ECR.
  3. Create a Fargate cluster.
  4. Create a task in Fargate with containerized Talend Jobs.
  5. Create a DAG in Airflow using ECS Operator.
  6. Trigger and run Fargate tasks with Airflow.

    Architecture_new.jpg

     

Importing and publishing the Job

Following the instructions in the Provisioning and executing Talend ETL Jobs on Amazon EKS using Airflow article, complete the steps for importing the Job into Talend Studio and publishing it to Amazon ECR.

Sources for the project are available in the preparation_files.zip file (attached to this article).

 

Creating a Fargate cluster

  1. Open the AWS console and search for ECS (the console layout may vary depending on the AWS release).

  2. Click Clusters > Create Cluster.

    Ecs_Create_Cluster.jpg

     

  3. Select the Networking only cluster template. Click Next step.

    create_cluster.jpg

     

  4. Enter a Cluster name, then select the Create VPC check box. Click Create.

    Cluster_name.jpg

    For more information on AWS VPC, see the Amazon Virtual Private Cloud page.

     

  5. Ensure that the Fargate cluster is created; this may take a few minutes.
  6. After the cluster is created, navigate to the VPC dashboard in the AWS console.
  7. Observe that a new VPC was created, then enter a name for it, for example, Airflow_Fargate_VPC.

    Fargate_vpc.jpg

     

  8. Click the Route Table hyperlink.
  9. Notice that the Route Table is associated with multiple subnets. Ensure that an Internet gateway is attached to the Route Table so that the client application, Airflow, can reach the Fargate cluster and launch containers in these subnets.

    Route_table_settings.jpg

     

  10. Select the Subnet Associations tab, then copy the Subnet IDs.

    Subnet_ids.jpg

    Note: The Subnet IDs must be provided in the Airflow DAG AWS VPC network configuration.
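The Subnet IDs and Internet gateway setting map directly onto the awsvpc network configuration that the DAG passes to ECS. A minimal sketch in Python, with placeholder subnet IDs (substitute the ones you copied):

```python
# Sketch of the awsvpc network configuration the Airflow DAG will pass to
# ECS when it runs the Fargate task. The subnet IDs below are placeholders
# for the ones copied from the Subnet Associations tab.
network_configuration = {
    "awsvpcConfiguration": {
        "subnets": [
            "subnet-0123456789abcdef0",  # placeholder: your first subnet ID
            "subnet-0fedcba9876543210",  # placeholder: your second subnet ID
        ],
        # ENABLED lets the task route through the Internet gateway, which
        # Fargate needs in order to pull the Talend Job image from ECR.
        "assignPublicIp": "ENABLED",
    }
}

print(network_configuration["awsvpcConfiguration"]["assignPublicIp"])
```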

     

Creating a task in Fargate with a containerized Job

  1. Click Task Definitions > Create new Task Definition, then select Fargate as the launch type. Click Next step.

  2. Configure the task details, as shown below:

    Task_Def_1.jpg

     

  3. Select Add container to add one or more containers to your task.

    Button_Add_container.jpg

     

  4. Fill in the Container name, the Image (or repository URI), and the Memory Limits (MiB) fields. Click Add.

    Note: Provide the ECR repository details of your Talend Job.

    container_def.jpg

     

  5. Click Create to complete the task creation.
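The console steps above can also be expressed as a register-task-definition payload. A sketch, assuming example names, sizes, and an example ECR image URI (use your own repository URI from step 4):

```python
# Sketch of the Fargate task definition created above, as the payload you
# would pass to the ECS register-task-definition API. The family, container
# name, CPU/memory values, and ECR image URI are example values.
task_definition = {
    "family": "talend-tmap-job",              # example task definition name
    "requiresCompatibilities": ["FARGATE"],   # Fargate launch type (step 1)
    "networkMode": "awsvpc",                  # the only mode Fargate supports
    "cpu": "256",                             # task-level CPU units
    "memory": "512",                          # task-level memory (MiB)
    "containerDefinitions": [
        {
            "name": "tmap_1",                 # example container name (step 4)
            # Example ECR repository URI of the containerized Talend Job:
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/talend/tmap_1:latest",
            "memory": 512,                    # Memory Limits (MiB) field
            "essential": True,
        }
    ],
}

print(task_definition["family"])
```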

 

Fargate Authorization

To provision and run container tasks on Fargate, the client application must assume an IAM role with ECS permissions.

Note: In the previous article, the Airflow EC2 instance was configured to assume the IAM k8s_role, which has full access to ECS resources.

k8s_ecs_role.jpg
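The article grants the Airflow role full ECS access, but a narrower policy would also work. A hypothetical minimal policy sketch (the action names are standard ECS/IAM actions; scope Resource down further in production):

```python
import json

# Hypothetical minimal IAM policy for the role Airflow assumes. The article
# uses full ECS access; this sketch lists only the actions typically needed
# to launch and monitor Fargate tasks.
ecs_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ecs:RunTask",        # launch the Fargate task
                "ecs:DescribeTasks",  # poll the task until it stops
                "ecs:StopTask",       # clean up if the DAG run is killed
                "iam:PassRole",       # hand the task execution role to ECS
            ],
            "Resource": "*",          # scope down in production
        }
    ],
}

print(json.dumps(ecs_policy, indent=2))
```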

 

Creating a DAG using ECS Operator

  1. Open the DAG, DAG_TMap_1_ECS_FG.py, located in the DAG_ECSOperator_Fargate.zip file.

  2. Review the Fargate task and network configuration.

    DAG_fargate.jpg

     

  3. Copy the DAG to the Airflow dags folder.

  4. Edit the Dockerfile, located in the Airflow folder, and change the Airflow version to 1.10.3.

    cd ~/airflow
    vi Dockerfile
    #Update AIRFLOW_VERSION to 1.10.3
    ARG AIRFLOW_VERSION=1.10.3
    # Save the file
  5. Build an Airflow Docker image with the new version.

    docker build -t xxx/docker-airflow-aws-ecs:1.10.3 .
  6. Edit the docker-compose-CeleryExecutor.yml file and update the image with the new version.

    cd ~/airflow
    vi docker-compose-CeleryExecutor.yml
    # update image name to reflect the build in step # 5
    image: xxx/docker-airflow-aws-ecs:1.10.3
    # save the file
  7. Launch the Airflow services.

    docker-compose -f docker-compose-CeleryExecutor.yml up -d
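The DAG in the zip file wires these pieces together through the ECS Operator. A sketch of the operator arguments, assuming example cluster, task definition, subnet, and region values (in the real DAG these are passed to the ECSOperator from airflow.contrib.operators.ecs_operator):

```python
# Sketch of the arguments DAG_TMap_1_ECS_FG.py passes to the ECS Operator.
# The cluster name, task definition family, subnet ID, and region are
# example values; replace them with the ones you created earlier.
ecs_task_args = {
    "task_id": "task_ecs_fargate_tmap_1_1",
    "task_definition": "talend-tmap-job",     # example task definition family
    "cluster": "talend-fargate-cluster",      # example Fargate cluster name
    "launch_type": "FARGATE",
    "overrides": {"containerOverrides": []},  # no runtime overrides needed
    "network_configuration": {
        "awsvpcConfiguration": {
            "subnets": ["subnet-0123456789abcdef0"],  # placeholder subnet ID
            "assignPublicIp": "ENABLED",
        }
    },
    "region_name": "us-east-1",               # example region
}

# In the DAG itself this dict is unpacked into the operator, roughly:
# task = ECSOperator(dag=dag, aws_conn_id="aws_default", **ecs_task_args)
print(ecs_task_args["task_id"])
```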

 

Executing the Fargate task

  1. Log in to the Airflow Web UI.

    To access the Airflow Web UI, type the following URL into your browser.

    http://AIRFLOW_EC2_INSTANCE_PUBLIC_IP:8080/admin
  2. Verify that the DAG_TMap_1_ECS_FG is in the Airflow Web UI.

    DAG_Runnung.jpg

     

  3. In the AWS Console, search for ECS and open the ECS Service.

  4. Click Clusters and review the Pending tasks.

  5. In Airflow, run the DAG. A task is launched on ECS using the ECS Operator.

  6. Notice that the number of Pending tasks in the cluster increases.

  7. Click the cluster hyperlink and open the Tasks view.

    Fargate_Task_pending.jpg

     

  8. Select the Tasks tab. Notice that the provisioned task appears in the Task column with a status of Running or Stopped, and that the Started By column shows Airflow.

    Task_ex_status_fg_airflow_user.jpg

     

  9. Click the Task ID, and review the log output from your Talend Job.

    Execution_logs.jpg

     

  10. In the Airflow Web UI, select DAG_TMap_1_ECS_FG, then click Graph View.

  11. Review all of the tasks and operators in the DAG.

    Task_Executed.jpg

     

  12. Click task_ecs_fargate_tmap_1_1, then click View Logs.

  13. Review the Airflow task execution logs. In this case, the logs show that the ECS task started and then stopped.

    Fargate_Tsk_execution_logs.jpg

     

Conclusion

In the previous article, you learned how to containerize and provision a Talend DI Job on an EKS cluster using Airflow. In this article, you went one step further and executed the same Job on a fully serverless platform. With Amazon Fargate, you do not need to manage servers, clusters, or cluster scaling; the platform handles it for you.

Version history
Revision #:
18 of 18
Last update:
07-31-2019 07:35 AM