Introduction to Talend and Amazon Real-Time Machine Learning



This article explains how to use Talend to harness the capabilities of Amazon Web Services (AWS) Machine Learning (ML) services in real-time mode. The goal is to help Talend enthusiasts integrate AWS ML and Talend without worrying about underlying complexities.


It covers:

  • Data preparation for an AWS Machine Learning model

  • Creation of an AWS Machine Learning model

  • Creation of a real-time prediction endpoint for the AWS Machine Learning model

  • Configuration of a Talend routine to call an AWS Machine Learning real-time endpoint

  • Execution of a sample Job with an AWS Machine Learning real-time prediction



Environment: Talend 7.0.1


Data preparation for an AWS Machine Learning model

The first step in using the AWS ML service is to identify source data for training the ML model. Amazon S3 and Amazon Redshift are the two services that can serve as training data sources for an AWS ML model.


In this article, you are going to create an ML model based on the Iris Data Set provided by the University of California, Irvine. The data set classifies Iris plants into three groups (Iris Setosa, Iris Versicolour, and Iris Virginica) based on sepal length, sepal width, petal length, and petal width.




Talend can be seamlessly used to load data to AWS S3 and AWS Redshift.



For more information on loading data using Talend with S3 and Redshift, see the Talend Help Center.


In this example, S3 is identified as the source to train the AWS ML model, and the data was loaded to the bucket in Amazon S3 using Talend.
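Later in the setup, AWS ML asks whether the first line of the CSV contains the column names, so the file loaded to S3 needs a header row. The raw Iris file from UCI has no header. The following is a minimal sketch of adding one before the load. The class name `IrisCsvPrep` is hypothetical, and the header names are assumptions taken from the sample JSON records and the class target used in this article.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class IrisCsvPrep {
    // Assumed column names: they mirror the sample records and the "class" target in this article.
    static final String HEADER = "sepal_length,sepal_width,petal_length,petal_width,class";

    // Prepends the header row, skips blank lines, and rejects rows that do not have five columns.
    static List<String> withHeader(List<String> rawRows) {
        List<String> out = new ArrayList<>();
        out.add(HEADER);
        for (String row : rawRows) {
            String trimmed = row.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines, if any
            }
            if (trimmed.split(",", -1).length != 5) {
                throw new IllegalArgumentException("Expected 5 columns: " + trimmed);
            }
            out.add(trimmed);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList(
                "5.1,3.5,1.4,0.2,Iris-setosa",
                "",
                "7.0,3.2,4.7,1.4,Iris-versicolor");
        for (String line : withHeader(raw)) {
            System.out.println(line);
        }
    }
}
```

In a Talend Job the same result can be reached by enabling a header on the output file component; the sketch only illustrates the expected file shape.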



Creation of an AWS Machine Learning model

Creating an ML model involves two sub-tasks, which the AWS ML process combines into a single flow:

  • Creation of a training data source in an AWS ML service

  • Creation of an AWS ML model

  1. Once the data source is ready in Amazon S3, select the AWS region of your choice for Machine Learning activities. After selecting the region, go to the AWS ML service page, and select Standard setup, then Launch.



  2. From the Input Data page, select the Amazon S3 button, and complete the bucket and file name details (where the Iris dataset is located). For Datasource name, type Iris_Dataset.



  3. If the AWS ML service is using the bucket for the first time, you are prompted to provide read permission to the bucket. Select Yes for the query.



  4. Select Continue.



  5. The Schema page displays details of the dataset. Answer Yes to the question, Does the first line in your CSV contain the column names? Click Continue.



  6. Select the entry in the Target column whose value has to be predicted. For this example, select the class row, then click Continue.



  7. From the Row ID page, answer No to the question, Does your data contain an identifier? Click Review.



  8. Review the data source information generated by AWS ML, review your settings, then click Continue.



  9. The next page provides training and evaluation settings of the ML model. Select the Default (Recommended) setting, to set aside 30% of the data for the evaluation process.



  10. From the final Review page, select Create ML model to create the Machine Learning Model in AWS.



  11. The model creation process can run from several minutes to several hours, based on the input data set size and the number of columns. The status is Pending until the model processing is complete, then status changes to Completed.



Repeat these steps in AWS whenever the ML model has to be rebuilt, for example after a major change in the pattern of the source data.
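To make the 30% evaluation share in step 9 concrete, here is a small arithmetic sketch. It assumes the full 150-record Iris data set and a simple round-down rule; how AWS ML actually selects and rounds the evaluation rows is not covered in this article. The class name `EvaluationSplit` is hypothetical.

```java
final class EvaluationSplit {
    // Returns {trainingRows, evaluationRows} for the given evaluation percentage.
    static int[] split(int totalRows, int evaluationPercent) {
        if (totalRows < 0 || evaluationPercent < 0 || evaluationPercent > 100) {
            throw new IllegalArgumentException("Invalid row count or percentage");
        }
        // Integer arithmetic: the evaluation share is rounded down (an assumption of this sketch).
        int evaluationRows = totalRows * evaluationPercent / 100;
        return new int[] { totalRows - evaluationRows, evaluationRows };
    }

    public static void main(String[] args) {
        int[] parts = split(150, 30);
        System.out.println(parts[0] + " training rows, " + parts[1] + " evaluation rows");
    }
}
```

With 150 records and a 30% share, the sketch yields 105 training rows and 45 evaluation rows.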


Creation of a real-time prediction endpoint for the AWS ML model

  1. Once the model is generated and is in Completed status, go to the Prediction section of the AWS ML model and select Create endpoint. The new endpoint is used to send requests and receive responses in real-time between Talend and AWS ML.



  2. Once the endpoint is ready, the status changes to Ready, and the endpoint URL is displayed.




Configure a Talend routine to call an AWS ML real-time endpoint

  1. Connect to Talend Studio and create a new routine called AWS_ML_RT_Predict that connects to the AWS ML endpoint to transmit the incoming JSON record and process the data. The routine also collects the predict response back from the AWS ML Predict function.




  2. Insert the following code into the Talend routine:

    package routines;

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    // Amazon SDK 1.11.380
    import com.amazonaws.auth.AWSStaticCredentialsProvider;
    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.services.machinelearning.AmazonMachineLearning;
    import com.amazonaws.services.machinelearning.AmazonMachineLearningClientBuilder;
    import com.amazonaws.services.machinelearning.model.PredictRequest;
    import com.amazonaws.services.machinelearning.model.PredictResult;
    // Jackson Jar 2.6.6
    import com.fasterxml.jackson.core.JsonParseException;
    import com.fasterxml.jackson.core.type.TypeReference;
    import com.fasterxml.jackson.databind.JsonMappingException;
    import com.fasterxml.jackson.databind.ObjectMapper;

    public class AWS_ML_RT_Predict {
        public static String RT_Predict(String AWS_Access_Key, String AWS_Secret_Key, String AWS_regionName, String AWS_ML_model_id, String AWS_endpoint, String ML_request) throws JsonParseException, JsonMappingException, IOException {
            // AWS Connection
            BasicAWSCredentials basic = new BasicAWSCredentials(AWS_Access_Key, AWS_Secret_Key);
            AmazonMachineLearning awsClient = AmazonMachineLearningClientBuilder.standard().withCredentials(new AWSStaticCredentialsProvider(basic)).withRegion(AWS_regionName).build();
            // Create AWS Predict Request Object
            PredictRequest predReq = new PredictRequest();
            ObjectMapper mapperObj = new ObjectMapper();
            // Convert Input JSON string to Map values
            Map<String, String> recMap = mapperObj.readValue(ML_request, new TypeReference<HashMap<String, String>>() {});
            // Send map values to AWS ML Predict
            predReq.setMLModelId(AWS_ML_model_id);
            predReq.setPredictEndpoint(AWS_endpoint);
            predReq.setRecord(recMap);
            // Receive the predict result
            PredictResult predResult = awsClient.predict(predReq);
            String ML_response_JSON = predResult.toString();
            return ML_response_JSON;
        }
    }


  3. The Talend routine needs additional JAR files. Install the following JAR files in the routine:

    1. AWS SDK 1.11.380

    2. Jackson core 2.6.6

    3. Jackson Annotations 2.6.0

    4. Jackson Databind 2.6.6

    5. org.apache.commons.logging

    6. httpcore

    7. httpclient

    8. joda-time


  4. Add additional Java libraries to the routine by selecting Edit Routine Libraries.



  5. Select New in the pop-up window to add libraries to the routine.



  6. Select Artifact repository(local m2/nexus) and go to the Install a new module window.



  7. Select the JAR file from the local drive.



  8. Select Detect the module install status to verify whether the module is already installed.



  9. If the JAR file is not installed, the status changes from the error flag to Install a module followed by JAR file name. Click OK to load the JAR file to the routine. Once all the JAR files are installed, click Finish.
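After the JAR files are installed, it can help to confirm that they are actually visible to the routine. The following is a minimal sketch that checks for one representative class per library. The class name `RoutineLibraryCheck` is hypothetical, and the representative class names are chosen here for illustration; they are not a list published by Talend.

```java
import java.util.ArrayList;
import java.util.List;

final class RoutineLibraryCheck {
    // Returns true if the class can be located on the current classpath (without initializing it).
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, RoutineLibraryCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    // Returns the subset of class names that could not be found.
    static List<String> missing(String... classNames) {
        List<String> notFound = new ArrayList<>();
        for (String name : classNames) {
            if (!isOnClasspath(name)) {
                notFound.add(name);
            }
        }
        return notFound;
    }

    public static void main(String[] args) {
        // One representative class per JAR listed in step 3 (illustrative choice).
        List<String> notFound = missing(
                "com.amazonaws.services.machinelearning.AmazonMachineLearning", // AWS SDK
                "com.fasterxml.jackson.core.JsonParseException",                // Jackson core
                "com.fasterxml.jackson.annotation.JsonProperty",                // Jackson annotations
                "com.fasterxml.jackson.databind.ObjectMapper",                  // Jackson databind
                "org.apache.commons.logging.LogFactory",                        // commons-logging
                "org.apache.http.HttpEntity",                                   // httpcore
                "org.apache.http.client.HttpClient",                            // httpclient
                "org.joda.time.DateTime");                                      // joda-time
        System.out.println(notFound.isEmpty() ? "All libraries found" : "Missing: " + notFound);
    }
}
```

Run outside Talend Studio, the check reports every library as missing unless the JAR files are on the classpath, which is the expected behavior.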




Talend sample Job with an AWS Machine Learning real-time prediction

The setup activities are complete, and the routine can be used in any Talend Job as a user-defined function. The Talend routine generates real-time predictions based on the AWS ML model. In this example, nine sample JSON records from the Iris dataset are read from the input_data.txt file attached to this article.

{"sepal_length" : "4.8","sepal_width" : "3","petal_length" : "1.4","petal_width" : "0.1"}
{"sepal_length" : "4.3","sepal_width" : "3","petal_length" : "1.1","petal_width" : "0.1"}
{"sepal_length" : "4.4","sepal_width" : "2.9","petal_length" : "1.4","petal_width" : "0.2"}
{"sepal_length" : "5.8","sepal_width" : "2.7","petal_length" : "4.1","petal_width" : "1"}
{"sepal_length" : "5.6","sepal_width" : "2.5","petal_length" : "3.9","petal_width" : "1.1"}
{"sepal_length" : "5.9","sepal_width" : "3.2","petal_length" : "4.8","petal_width" : "1.8"}
{"sepal_length" : "7.9","sepal_width" : "3.8","petal_length" : "6.4","petal_width" : "2"}
{"sepal_length" : "6.8","sepal_width" : "3.2","petal_length" : "5.9","petal_width" : "2.3"}
{"sepal_length" : "6.7","sepal_width" : "3.3","petal_length" : "5.7","petal_width" : "2.5"}
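Each record is a flat JSON object whose values are all strings, which matches the string-to-string map the routine passes to AWS ML. As a minimal sketch, a record in this shape could be built as follows. The class `IrisRecordJson` and its methods are hypothetical helpers, not part of Talend or the AWS SDK.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class IrisRecordJson {
    // Serializes a flat map into the same one-line layout as the sample records.
    static String toJson(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (!first) {
                sb.append(",");
            }
            sb.append('"').append(escape(e.getKey())).append("\" : \"")
              .append(escape(e.getValue())).append('"');
            first = false;
        }
        return sb.append("}").toString();
    }

    // Escapes backslashes and double quotes only; control characters are not handled in this sketch.
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    public static void main(String[] args) {
        Map<String, String> record = new LinkedHashMap<>(); // keeps insertion order
        record.put("sepal_length", "4.8");
        record.put("sepal_width", "3");
        record.put("petal_length", "1.4");
        record.put("petal_width", "0.1");
        System.out.println(toJson(record));
    }
}
```

The numeric measurements are kept as strings on purpose, so that the record deserializes cleanly into a map of string values.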


The following diagram shows the overall Job flow for the AWS ML real-time prediction:



The configuration details for each Talend component are as follows:

  1. Use a tFileInputFullRow component to read the file and to process each row.



  2. Use a tJavaRow to call the Talend routine AWS_ML_RT_Predict to generate the prediction value based on the configuration details. The RT_Predict method of the Talend routine will process the incoming data and provide the Prediction value as output in String format. The parameters required for the method are:

    • AWS Access Key

    • AWS Secret Key

    • AWS region id, for example, us-east-1, eu-west-1

    • AWS Machine Learning model ID (generated by AWS once the model is successfully created)

    • AWS Machine Learning endpoint, for example,

    • Input JSON string

      Since the AWS Access Key and Secret Key are credentials, Talend recommends handling them through context variables rather than hard-coding them in the Job. The context variables can be passed to the Job at runtime.



  3. Use a tMap component to replace the "=" character present in the output String with ":". This step makes data parsing easier in the downstream components.



  4. Use a tExtractJSONFields component to extract the PredictedLabel value from the output JSON.




  5. Use a tLogRow to capture the output and print the data to the console.
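The post-processing in steps 3 and 4 can be sketched in plain Java. This sketch uses a regular expression in place of the tExtractJSONFields component, and the response string in `main` is only an assumed, illustrative shape of the routine's output, not a captured AWS response. The class name `PredictionParser` is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PredictionParser {
    // Matches the value that follows "PredictedLabel:", with or without surrounding quotes.
    private static final Pattern LABEL =
            Pattern.compile("PredictedLabel\\s*:\\s*\"?([^,\"}]+)\"?");

    // Step 3: replace every "=" with ":" so that key/value pairs use one separator.
    static String normalize(String response) {
        return response.replace("=", ":");
    }

    // Step 4: pull out the PredictedLabel value, or null if it is not present.
    static String predictedLabel(String response) {
        Matcher m = LABEL.matcher(normalize(response));
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        // Assumed response shape, for illustration only.
        String response = "{Prediction: {PredictedLabel: Iris-setosa,"
                + "PredictedScores: {Iris-setosa=0.98,Iris-versicolor=0.01,Iris-virginica=0.01},"
                + "Details: {Algorithm=SGD, PredictiveModelType=MULTICLASS}}}";
        System.out.println(predictedLabel(response));
    }
}
```

If the real output differs from the assumed shape, only the pattern needs to change; the two-step structure stays the same.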



Once the Job is executed, the data is processed, and AWS ML predicts the target group in real-time.




This scenario is a use case of integrating Talend with AWS ML. Instead of using a file, you can transmit input data to a Talend routine from a queue or a web service.
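Because the routine only needs a JSON string, the prediction step can be kept independent of where the records come from. The following is a minimal sketch of that idea. The class `PredictionFlow` is hypothetical, and the predictor in `main` is a stub standing in for a call to the AWS_ML_RT_Predict routine; no AWS request is made here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

final class PredictionFlow {
    // Applies the predictor to every non-blank record, whatever the source (file, queue, web service).
    static List<String> predictAll(Iterable<String> records, Function<String, String> predictor) {
        List<String> results = new ArrayList<>();
        for (String record : records) {
            if (record == null || record.trim().isEmpty()) {
                continue; // ignore empty messages or blank file lines
            }
            results.add(predictor.apply(record));
        }
        return results;
    }

    public static void main(String[] args) {
        // Any Iterable works as a source; a fixed list stands in for a file, queue, or web service.
        List<String> source = Arrays.asList(
                "{\"sepal_length\" : \"4.8\",\"sepal_width\" : \"3\",\"petal_length\" : \"1.4\",\"petal_width\" : \"0.1\"}",
                "");
        // Stub predictor. In a Job this would wrap the routine call, which throws checked
        // exceptions and therefore needs a try/catch inside the lambda.
        Function<String, String> stub = record -> "stub prediction for a record of " + record.length() + " characters";
        for (String result : predictAll(source, stub)) {
            System.out.println(result);
        }
    }
}
```

Swapping the input then means changing only the `Iterable` that feeds the loop, while the prediction call stays untouched.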



Version history: revision 42 of 42, last updated 02-25-2019 01:11 AM