Talend DI Job execution in AWS Lambda using Apache Airflow

Overview

This article shows you how to leverage Apache Airflow to orchestrate, schedule, and execute Talend Data Integration (DI) Jobs in an AWS Lambda environment.

 

Environment

  • Apache Airflow 1.10.2
  • AWS Lambda
  • Amazon API Gateway
  • Amazon CloudWatch Logs
  • Nexus 3.9
  • WinSCP 5.15
  • PuTTY

 

Prerequisites

  1. Apache Airflow installed on a server (follow the Installing Apache Airflow on Ubuntu/AWS installation instructions).
  2. Python 2.7 installed on the Airflow server.
  3. A licensed Amazon Web Services (AWS) account.
  4. Access to the Nexus server from AWS Lambda.
  5. Talend 7.x Jobs published to the Nexus repository. (For more information on how to set up a CI/CD pipeline to publish Talend Jobs to Nexus, see Configuring Jenkins to build and deploy project items in the Talend Help Center.)
  6. Access to the Setup_files.zip file (attached to this article).

 

Process flow

  1. Develop Talend DI Jobs using Talend Studio.
  2. Publish the DI Jobs to the Nexus repository using the Talend CI/CD module.
  3. Execute the Directed Acyclic Graph (DAG) in Apache Airflow:
    • The DAG calls Amazon API Gateway, triggering the AWS Lambda function.
    • The Lambda function downloads the Talend DI Job executable binaries from the Nexus repository.
    • The Lambda function executes the downloaded Job binaries.
    • The Lambda function returns a response to the API call and marks the task as completed in Apache Airflow.

    Lambda.jpg
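The DAG reaches the Lambda function through the public invoke URL that API Gateway assigns to the endpoint. As a minimal sketch, the URL follows a fixed pattern; the API ID, region, stage, and resource name below are placeholders, so substitute the values shown in your own API Gateway Details section:

```python
def build_invoke_url(api_id, region, stage, resource):
    """Return the public invoke URL for an API Gateway endpoint."""
    return "https://{0}.execute-api.{1}.amazonaws.com/{2}/{3}".format(
        api_id, region, stage, resource
    )

# Hypothetical values, for illustration only.
url = build_invoke_url("abc123xyz", "us-east-1", "default", "talend_job_runner")
```

This is the address the Airflow connection and DAG (configured later in this article) will call.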

     

Configuration and execution

 

Configuring AWS Lambda

  1. Log in to your AWS account and open the AWS Lambda service.
  2. Click the Create function button.

    Lambda-1st page.png

     

  3. Select Author from scratch, then enter a Function name. Select Python 3.6 from the Runtime pull-down menu.

    Lambda-2nd page_1.png

     

  4. Under Permissions, select Create a new role with basic Lambda permissions from the Execution role pull-down menu. Click Create function.

    Lambda-2nd page_2.png

     

  5. After the function is created, select API Gateway from the Add triggers panel on the left to add a trigger to the function.

    Lambda-Trigger.png

    For more information, see the Amazon API Gateway page.

  6. When the API Gateway trigger is added, a Configuration required warning appears. Click the API Gateway tile to configure the trigger details. In the Configure triggers section, select Create a new API from the API pull-down menu and Open from the Security pull-down menu, then click Add.

    Lambda-Trigger_Config.png

     

  7. Click the Save button in the upper right corner.

    Image 1.png

     

  8. Select the API Gateway tile and review the Details section of the Open API.

    Image 2.png

     

  9. Copy the code from the lambda_function_code.py file (located in Setup_files.zip) into lambda_function.py in the Function Code window.

    Image 3.png
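The article ships the actual handler in Setup_files.zip; as a hedged sketch of what lambda_function_code.py is expected to do, the handler shells out to download_job.sh to fetch the Job binaries from Nexus, runs the Job's launcher script, and returns an API Gateway (Lambda proxy) style response. The script paths and launcher name below are assumptions for illustration:

```python
import json
import subprocess


def lambda_handler(event, context):
    """Download the Talend Job binaries via download_job.sh, run the
    Job's shell launcher, and return an API Gateway proxy response.
    Paths and script names here are illustrative assumptions."""
    # Lambda only allows writes under /tmp, so the download script
    # should unpack the Job zip there.
    download = subprocess.call(["sh", "download_job.sh"])
    if download != 0:
        return make_response(500, "Failed to download Job from Nexus")
    # A published Talend Job zip contains a <job_name>_run.sh launcher.
    run = subprocess.call(["sh", "/tmp/job/job_run.sh"])
    status = 200 if run == 0 else 500
    return make_response(status, "Job exit code: {0}".format(run))


def make_response(status_code, message):
    """Build the response shape API Gateway expects from a Lambda
    proxy integration: a statusCode and a JSON string body."""
    return {"statusCode": status_code, "body": json.dumps({"message": message})}
```

The statusCode/body shape is what lets API Gateway relay the result back to the Airflow task that made the call.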

     

  10. Create a new file named download_job.sh and save it under the lambda_function folder. Copy the code from the download_job_code.sh file (located in Setup_files.zip) into the new file you created.

    Image 4.png
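The download script fetches the published Job zip from Nexus. Assuming the Jobs were published into a standard Maven-layout repository (the layout Nexus 3 uses for Talend CI/CD artifacts), the URL the script would fetch with curl or wget can be derived from the Maven coordinates. All names and the base URL below are placeholders:

```python
def nexus_artifact_url(base_url, repository, group_id, artifact_id,
                       version, extension="zip"):
    """Build the download URL for an artifact in a Nexus 3
    Maven-layout repository: the group id becomes a path, and the
    file is named <artifact_id>-<version>.<extension>."""
    group_path = group_id.replace(".", "/")
    return "{0}/repository/{1}/{2}/{3}/{4}/{3}-{4}.{5}".format(
        base_url.rstrip("/"), repository, group_path,
        artifact_id, version, extension
    )

# Hypothetical coordinates, for illustration only.
url = nexus_artifact_url("http://nexus:8081", "snapshots",
                         "org.example", "my_job", "0.1.0")
```

Keep the coordinates in the script in sync with the ones your CI/CD pipeline used when publishing the Job.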

     

  11. In the same window, scroll down to Basic settings and increase the Memory (MB) setting to a reasonable amount, 2048 MB in this case. Click Save.

    Image 5.png

     

Configuring Apache Airflow

  1. Log in to the Airflow Web UI.
  2. Navigate to Admin > Connections and create a new connection. Enter aws_api in the Conn Id field and leave the Conn Type field empty. In the Host field, add the host address of the API endpoint you created in the Configuring AWS Lambda section. Click Save.

    Image 6.png

     

  3. Edit the lambda_DAG_call_template.py file and assign values to the variables, as shown below:

    Image 7.png

    Make sure to provide the http_conn_id and endpoint values in the SimpleHttpOperator calls. In this case, the http_conn_id is aws_api.
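As a sketch of what the edited template might look like, assuming Apache Airflow 1.10.x: the DAG id, endpoint path, and dates below are placeholders, and http_conn_id must match the aws_api connection created earlier.

```python
# Sketch along the lines of lambda_DAG_call_template.py (Airflow 1.10.x).
from datetime import datetime

from airflow import DAG
from airflow.operators.http_operator import SimpleHttpOperator

default_args = {
    "owner": "airflow",
    "start_date": datetime(2019, 1, 1),
}

dag = DAG(
    dag_id="talend_lambda_job",
    default_args=default_args,
    schedule_interval=None,  # None = trigger the DAG externally
)

run_talend_job = SimpleHttpOperator(
    task_id="run_talend_job",
    http_conn_id="aws_api",        # the Airflow connection created above
    endpoint="talend_job_runner",  # resource path of the API Gateway endpoint
    method="POST",
    data="{}",
    dag=dag,
)
```

The operator resolves the Host value stored in the aws_api connection and appends the endpoint path, so the task ends up calling the API Gateway invoke URL.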

  4. The provided DAG template is configured to be triggered externally. If you plan to schedule the task instead, update the schedule_interval parameter based on your scheduling requirements. For more information on valid values, see the Apache Airflow documentation: DAG Runs.

  5. Rename the updated file and place it in the dags folder under the AIRFLOW_HOME folder.

    airflow_file.png

     

  6. The Airflow webserver picks up the file and creates a DAG task in the Airflow Console, under the DAGs tab.

    Note: If the DAG is not visible on the User Interface under the DAGs tab, restart the Airflow webserver and Airflow scheduler.

    airflow task.png

     

Executing the Job and reviewing the logs

  1. To execute the Talend Job, toggle the DAG to On, then run the Airflow task you created to trigger the AWS Lambda function.
  2. Monitor the task execution on the Airflow Web UI.
  3. Review the Job logs in the Amazon CloudWatch Logs service. Open the CloudWatch service and select Logs from the menu on the left. Select your Lambda function from the Log groups and open the log.

    Cloudwatch.png

     

Conclusion

In this article, you learned how to execute Talend DI Jobs in AWS Lambda and how to use Apache Airflow to schedule the Jobs, an approach that can be extended to more complex orchestration and scheduling plans.

Version history
Revision #: 11 of 11
Last update: 07-02-2019 09:02 AM
Comments

Hi,

Please let us know how we can connect to on-premise systems from the cloud.

Let me explain the use case: we have Talend DI Jobs where the destination database is in the cloud and the source database is on-premise. When we run a Job through the Ubuntu .sh file, it should connect to the source system, extract the data, and load it into the destination database in the cloud.

How can we connect to the on-premise systems (local/private network)? Please suggest the approach in detail, as I am new to AWS and have no idea how to set up a gateway between on-premise and cloud.

Thanks in advance.

 

Employee

@kolen 

 

Hi,

Talend provides many on-premises components as well as AWS components, so you can design Jobs for your use case with an on-premises source and an AWS target. I have attached a screenshot of the AWS component categories below.

AWS.png

When you deploy a designed Job to a Job server (Ubuntu in this case), a .sh file is created, and that file is executed when you run the Job.

Thanks,
Kapil