This article shows you how to execute containerized Talend ETL Jobs on a serverless platform, Amazon Fargate, using Apache Airflow.
This is a continuation of the Talend Community Knowledge Base article, Provisioning and executing Talend ETL Jobs on Amazon EKS using Airflow.
Follow the Prerequisites described in the previous article of this series.
Perform the steps in the Installing Airflow with Docker section of the previous article.
Note: For Amazon Fargate, Airflow version 1.10.3 is required.
You need a basic understanding of serverless concepts and platforms, specifically Amazon Fargate.
In this article, you will execute the following steps to deploy a Talend Data Integration (DI) Job to Amazon Fargate.
Following the instructions in the Provisioning and executing Talend ETL Jobs on Amazon EKS using Airflow article, complete the steps in the following sections:
Sources for the project are available in the preparation_files.zip file (attached in this article).
Open the AWS console and search for ECS (the console layout may vary depending on the AWS release).
Click Clusters > Create Cluster.
Select the Networking only cluster template. Click Next step.
Enter a Cluster name, then select the Create VPC check box. Click Create.
For more information on AWS VPC, see the Amazon Virtual Private Cloud page.
Note: The Subnet IDs must be provided in the Airflow DAG AWS VPC network configuration.
Click Task Definitions > Create new Task Definition, then select Fargate as the launch type. Click Next step.
Configure the task details, such as the Task definition name, Task role, Task memory, and Task CPU.
Select Add container to add one or more containers to your task.
Fill in the Container name, the Image (or repository URI), and the Memory Limits (MiB) fields. Click Add.
Note: Provide the ECR repository details of your Talend Job.
Click Create to complete the task creation.
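The console steps above produce a task definition that could also be expressed as the request body for boto3's register_task_definition. The sketch below is illustrative only: the family, container name, ECR URI, and CPU/memory values are placeholders, not values from this article.

```python
# Hypothetical Fargate task definition; every name and URI below is a placeholder.
task_definition = {
    "family": "talend-tmap-task",                  # assumed task definition name
    "requiresCompatibilities": ["FARGATE"],        # Fargate launch type
    "networkMode": "awsvpc",                       # required for Fargate tasks
    "cpu": "512",                                  # 0.5 vCPU
    "memory": "1024",                              # 1 GiB task memory
    "containerDefinitions": [
        {
            "name": "tmap-container",              # assumed container name
            # ECR repository URI of your Talend Job image (placeholder):
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/talend/tmap_1:latest",
            "memory": 1024,                        # Memory Limits (MiB)
            "essential": True,
        }
    ],
}

# With boto3, this dict could be registered as:
#   boto3.client("ecs").register_task_definition(**task_definition)
```

Note that "awsvpc" network mode and the "FARGATE" compatibility entry are mandatory for Fargate; the console sets them for you when you pick the Fargate launch type.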
To provision and run container tasks on Fargate, client applications must assume the ECS role.
Note: In the previous article, the Airflow EC2 instance is configured to assume the IAM k8s_role and has Full access to ECS resources.
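For reference, a Fargate task execution role (the role ECS uses to pull your image from ECR and write logs) carries a standard trust policy like the sketch below, shown here as a Python dict. This is stock AWS policy JSON and is separate from the k8s_role that the Airflow EC2 instance assumes.

```python
import json

# Standard trust policy for an ECS task execution role used by Fargate.
# The role itself (for example, ecsTaskExecutionRole) also needs the
# AmazonECSTaskExecutionRolePolicy managed policy attached.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "ecs-tasks.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

print(json.dumps(trust_policy, indent=2))
```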
Open the DAG, DAG_TMap_1_ECS_FG.py, located in the DAG_ECSOperator_Fargate.zip file.
Review the Fargate task and network configuration.
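If you want a feel for what to look for while reviewing the attached DAG, the core of a Fargate-style ECSOperator call looks roughly like the sketch below. The cluster name, task definition name, subnets, and region are placeholders (copy the real values from DAG_TMap_1_ECS_FG.py and your own VPC); in Airflow 1.10.3 the operator is provided by airflow.contrib.operators.ecs_operator.

```python
# Hypothetical keyword arguments for an ECSOperator task targeting Fargate.
# All names and IDs are placeholders except the task_id, which matches the DAG.
ecs_task_kwargs = {
    "task_id": "task_ecs_fargate_tmap_1_1",
    "task_definition": "talend-tmap-task",     # assumed task definition name
    "cluster": "talend-fargate-cluster",       # assumed cluster name
    "launch_type": "FARGATE",
    "overrides": {"containerOverrides": []},   # no per-run overrides
    "network_configuration": {
        "awsvpcConfiguration": {
            # Subnet IDs from the VPC created with the cluster:
            "subnets": ["subnet-aaaaaaaa", "subnet-bbbbbbbb"],
            "assignPublicIp": "ENABLED",       # lets the task pull from ECR
        }
    },
    "region_name": "us-east-1",                # assumed region
}

# In the DAG these arguments would be used as:
#   from airflow.contrib.operators.ecs_operator import ECSOperator
#   run_job = ECSOperator(dag=dag, **ecs_task_kwargs)
```

The awsvpcConfiguration block is where the Subnet IDs noted earlier must go; without them the Fargate task has no network to launch into.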
Copy the DAG to the Airflow dags folder.
Edit the Dockerfile, located in the Airflow folder, and change the Airflow version to 1.10.3.
cd ~/airflow
vi Dockerfile
# Update AIRFLOW_VERSION to 1.10.3
ARG AIRFLOW_VERSION=1.10.3
# Save the file
Build an Airflow Docker image with the new version.
docker build -t xxx/docker-airflow-aws-ecs:1.10.3 .
Edit the docker-compose-CeleryExecutor.yml file and update the image with the new version.
cd ~/airflow
vi docker-compose-CeleryExecutor.yml
# Update the image name to reflect the build in step 5
image: xxx/docker-airflow-aws-ecs:1.10.3
# Save the file
Launch the Airflow services.
docker-compose -f docker-compose-CeleryExecutor.yml up -d
Log in to the Airflow Web UI.
To access the Airflow Web UI, type the following URL into your browser.
Verify that the DAG_TMap_1_ECS_FG is in the Airflow Web UI.
In the AWS Console, search for ECS and open the ECS Service.
Click Clusters and review the Pending tasks.
In Airflow, run the DAG. A task is launched on ECS leveraging the ECS Operator.
Notice the Pending tasks in the cluster.
Click the cluster hyperlink and open the Tasks view.
Select the Tasks tab. Notice that the provisioned task appears in the Task column with a status of Running or Stopped, and that the Started By column shows the Airflow user.
Click the Task ID, and review the log output from your Talend Job.
In the Airflow Web UI, select DAG_TMap_1_ECS_FG, then click Graph View.
Review all of the tasks and operators in the DAG.
Click task_ecs_fargate_tmap_1_1, then click View Logs.
Review the Airflow task execution logs. In this case, the ECS Task started and stopped.
In the previous article, you learned how to containerize and provision a Talend DI Job on an EKS cluster using Airflow. In this article, you went one step further and executed the same Job on a fully serverless platform. You learned that you do not need to manage servers, clusters, or cluster scaling, because Amazon Fargate handles this for you.