Talend and Amazon Personalize integration

Overview

 

This article shows you how to integrate Talend with Amazon Personalize to create individualized recommendations for customers. Amazon Personalize helps Talend developers, without prior machine learning experience, to build a product recommendation engine.

 

This article is a continuation of the Talend Amazon Web Service Machine Learning integration series. You can read the previous articles, Introduction to Talend and Amazon Real-Time Machine Learning, Talend and Amazon Comprehend Integration, Talend and Amazon Translate Integration, Talend and Amazon Rekognition Integration, Talend and Amazon Polly Integration, and Talend and Amazon Transcribe Integration in the Talend Community Knowledge Base.

 

Environment for Talend and AWS

This article was written using Talend 7.1.1. However, you can configure earlier or later versions of Talend with the logic provided to integrate Amazon Personalize.

 

Currently, Amazon Personalize is only available in selected AWS regions. Talend recommends verifying the availability of the service from the AWS Global Infrastructure, Region Table, before creating the overall application architecture.

 

Introduction to Amazon Personalize

A quick example of Amazon Personalize is the creation of a recommendation engine for a movie database. As shown in the diagram below, each user accessing a movie database will have their own personalized list of favorite movies. The goal of a good movie recommendation engine is to build a personalized favorite movie list for each user, that matches their dream list as closely as possible, based on the underlying data model and recommendation strategy.

Figure: Movie favorite list

 

Amazon Personalize can be broadly classified into the following modules:

  • Dataset groups and datasets
  • Events
  • Solutions and Campaigns

Figure: High-level block diagram of Amazon Personalize

 

Dataset groups and datasets

Three categories of historical datasets are used to build the recommendation engine in Amazon Personalize:

  • user (metadata information about users like age and gender)

  • item (metadata information about items like movies and retail products)
  • user-item (metadata information showing the relationship between user and item)

These three datasets are grouped into a dataset group, and each dataset group can contain at most one dataset of each type.
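
As a rough illustration of that grouping rule, the following sketch models a dataset group in memory. The class, enum, and S3 paths are hypothetical names chosen for this example; they are not part of the AWS SDK or of Talend.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical model: the three dataset categories described in this article.
enum DatasetType { USERS, ITEMS, INTERACTIONS }

// Hypothetical in-memory stand-in for a dataset group (not an AWS SDK class).
class DatasetGroupSketch {
    private final Map<DatasetType, String> datasets = new EnumMap<>(DatasetType.class);

    // Registers the location of a dataset and rejects a second dataset of the same type.
    void addDataset(DatasetType type, String s3Path) {
        if (datasets.containsKey(type)) {
            throw new IllegalStateException("Dataset group already has a " + type + " dataset");
        }
        datasets.put(type, s3Path);
    }

    // Number of datasets registered so far (0 to 3).
    int size() {
        return datasets.size();
    }
}
```

Adding a second dataset of a type that is already present throws an exception, which mirrors the one-per-type constraint.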

 

Datasets and dataset groups are used as base information to build the recommendation engine.

 

Events

Events are used to capture any new information coming into the recommendation engine. Capturing new events into the recommendation engine is not in the scope of this article.

 

Solutions and Campaigns

Solutions are the trained machine learning models created based on the datasets in a dataset group. Campaigns are created by deploying a solution version of an Amazon Personalize Recommendation model.

 

Practical use case

This use case examines the high-level steps included in the Job flow, where Talend integrates with Amazon Personalize to create a Customer Product Recommendation System.

 

Customer Product Recommendation System

In the current digital age, Product Recommendation Systems have become very popular, and many companies tune their product catalogs so that the recommendations match each customer's ideal wish list.

Figure: Customer Product Recommendation System

 

The diagram above shows the stages of the overall flow. Talend simplifies the complex scenarios required for the use case with its graphical application design interface and data orchestration capabilities. The stages in the flow are:

  1. Data, extracted from source systems, is used as base information to create the recommendation model.

  2. Talend loads the data from source systems to Amazon S3 under the user, item, and user-item dataset categories.

  3. IT performs a one-time activity to create Amazon Personalize solutions and campaigns using datasets.

  4. The customer logs in to the front-end web application, triggering a personalized recommendation request.

  5. The data flows through the Kafka queue, and Talend parses the input message from the request.

  6. Talend calls the Amazon Personalize service to get the list of recommended item IDs.

  7. Talend sends the response back to Kafka, which in turn shares it with the front-end web server.

  8. The customer views the personalized recommendation on the web page.
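
Stages 5 to 7 above can be pictured with the following minimal sketch. It is only an illustration under several assumptions: plain in-memory queues stand in for the Kafka topics, the request message is assumed to contain just a user ID, the response format (userId:item1,item2) is invented for the example, and the recommender function stands in for the Amazon Personalize call.

```java
import java.util.List;
import java.util.Queue;
import java.util.function.Function;

// Hypothetical sketch of the request/response stages; not Talend or Kafka code.
class RecommendationFlowSketch {

    // Handles one request: reads a message, asks the recommender for item IDs,
    // and writes "userId:item1,item2,..." to the response queue.
    // Returns false when there is no pending request.
    static boolean handleOne(Queue<String> requests,
                             Queue<String> responses,
                             Function<String, List<String>> recommender) {
        String message = requests.poll();
        if (message == null) {
            return false;
        }
        String userId = message.trim();                   // stage 5: parse the input message
        List<String> itemIds = recommender.apply(userId); // stage 6: get recommended item IDs
        responses.add(userId + ":" + String.join(",", itemIds)); // stage 7: send the response
        return true;
    }
}
```

In a real Job, the queues would be replaced by Kafka input and output components and the function by the routine call described later in this article.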

 

Create an Amazon Personalize campaign

You can create an Amazon Personalize campaign either through the Amazon Web Services console or through API calls. Talend Jobs can be created to verify the status of these asynchronous API calls and to create the campaigns programmatically. This article uses the Amazon Web Services console to create the campaign, so its scope is limited to generating the customized recommendation data.
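
If you take the API route, checking the status of an asynchronous creation call usually comes down to a polling loop. The following is a minimal sketch, not the method used in this article: the status supplier is a hypothetical stand-in for whatever describe call you use, and the status strings (CREATE PENDING, CREATE IN_PROGRESS, ACTIVE, CREATE FAILED) are assumed from the Amazon Personalize documentation and should be verified for your SDK version.

```java
import java.util.function.Supplier;

// Hypothetical polling helper; the supplier stands in for a describe-status API call.
class StatusPollerSketch {

    // Polls until the status is ACTIVE or CREATE FAILED, or until maxAttempts is reached.
    // Returns the last status seen (null if maxAttempts is 0).
    static String waitForTerminalStatus(Supplier<String> statusSupplier,
                                        int maxAttempts,
                                        long sleepMillis) {
        String status = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            status = statusSupplier.get();
            if ("ACTIVE".equals(status) || "CREATE FAILED".equals(status)) {
                return status;
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag and stop polling
                return status;
            }
        }
        return status;
    }
}
```

The caller decides what to do with a non-terminal result, for example failing the Job or retrying later.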

 

  1. In the AWS console, navigate to Amazon Personalize > Dataset groups > Create dataset group, then create a dataset group. Click Next.

    image.png

     

  2. Using the users.csv, movies.csv, and ratings.csv files (located in the sample_test_files.zip file attached to this article), create three datasets. Data import Jobs are used to load the different datasets (user, item, user-item) from Amazon S3 to the Amazon Personalize service.

    image.png

     

  3. After loading the datasets, create a solution model based on the datasets. If you are not familiar with the nuances of the various algorithms, choose Automatic Recipe by selecting the Automatic (AutoML) radio button. This selection lets Amazon Web Services identify a suitable machine learning algorithm for you. Click Next.

    image.png

     

  4. Click Create solution version, and Amazon Personalize generates the solution version based on the available metrics. The Solution process verifies the competency and success probability of each available algorithm added in Step 3. Then it picks the best possible recipe from the available list.

    image.png

     

  5. Create a campaign for a specific solution, and set the minimum provisioned transactions per second (TPS), in this case, 10.

    image.png

     

  6. Click Create campaign. After the campaign is created successfully, use its campaign ARN (CampaignARN) in the Talend Job to make the recommendations API call.

    image.png
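
Because the Talend Job receives both a region name and a campaign ARN, a small sanity check on the ARN can catch copy-and-paste mistakes early. The sketch below assumes the general ARN layout arn:partition:service:region:account-id:resource and a resource of the form campaign/name; the class name and the sample ARN are hypothetical.

```java
// Hypothetical helper for inspecting a campaign ARN; not part of the AWS SDK.
class CampaignArnSketch {

    // Returns the region segment of an ARN laid out as
    // arn:partition:service:region:account-id:resource
    static String regionOf(String arn) {
        String[] parts = arn.split(":", 6);
        if (parts.length != 6 || !"arn".equals(parts[0])) {
            throw new IllegalArgumentException("Not an ARN: " + arn);
        }
        return parts[3];
    }

    // True when the ARN names the personalize service and a campaign resource.
    static boolean isPersonalizeCampaign(String arn) {
        String[] parts = arn.split(":", 6);
        return parts.length == 6
                && "arn".equals(parts[0])
                && "personalize".equals(parts[2])
                && parts[5].startsWith("campaign/");
    }
}
```

For example, comparing regionOf(campaignArn) with the region context variable before calling the service gives a clearer error than a failed API call.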

     

Configure a Talend routine for Amazon Personalize

Create a Talend user routine, by performing the following steps.

  1. Connect to Talend Studio, and create a new routine called AWS_Personalize. The routine connects to the Amazon Personalize service, sends the recommendation request, and collects the response.

    image.png

     

  2. Insert the following code into the Talend routine:

    package routines;
    
    //Amazon SDK 1.11.613
    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.auth.AWSStaticCredentialsProvider;
    import com.amazonaws.services.personalizeruntime.AmazonPersonalizeRuntime;
    import com.amazonaws.services.personalizeruntime.AmazonPersonalizeRuntimeClientBuilder;
    import com.amazonaws.services.personalizeruntime.model.GetRecommendationsRequest;
    import com.amazonaws.services.personalizeruntime.model.GetRecommendationsResult;
    
    import org.apache.commons.logging.LogFactory;
    import com.fasterxml.jackson.databind.ObjectMapper;
    import com.fasterxml.jackson.annotation.JsonView;
    import org.apache.http.protocol.HttpRequestExecutor;
    import org.apache.http.client.HttpClient;
    import org.apache.http.conn.DnsResolver;
    import org.joda.time.format.DateTimeFormat;
    
    public class AWS_Personalize {
    
        /**
         * Calls the Amazon Personalize GetRecommendations API for a campaign and
         * returns the result's toString() text, which the Job post-processes before parsing.
         * itemId is required by related-items recipes, userId by user-personalization recipes.
         */
        public static String GetRecommendations(String AWS_Access_Key, String AWS_Secret_Key, String AWS_regionName, String campaignArn, String itemId, Integer numResults, String userId)
        {
            // AWS connection: static credentials and an explicit region
            BasicAWSCredentials awsCreds = new BasicAWSCredentials(AWS_Access_Key, AWS_Secret_Key);
            AmazonPersonalizeRuntime personalizeRuntime = AmazonPersonalizeRuntimeClientBuilder.standard()
                    .withCredentials(new AWSStaticCredentialsProvider(awsCreds))
                    .withRegion(AWS_regionName)
                    .build();
    
            // Build the GetRecommendations request for the deployed campaign
            GetRecommendationsRequest request = new GetRecommendationsRequest()
                    .withCampaignArn(campaignArn)
                    .withItemId(itemId)
                    .withNumResults(numResults)
                    .withUserId(userId);
    
            // Call the service and return the result as text for parsing in the Job
            GetRecommendationsResult result = personalizeRuntime.getRecommendations(request);
            return result.toString();
        }
    }
    
    
  3. The Talend routine needs additional JAR files. Install the following JAR files in the routine:

    • aws-java-sdk-core 1.11.613
    • aws-java-sdk-personalizeruntime 1.11.613
    • apache.commons.logging 1.2.0
    • Jackson core 2.9.7
    • Jackson Annotations 2.9.0
    • Jackson Databind 2.9.7
    • httpcore 4.4.10
    • httpclient 4.5.6
    • joda-time 2.9.4
  4. Add additional Java libraries to the routine. For more information on how to add Java libraries, see the Talend and Amazon Comprehend Integration article in this series.

The setup activities are complete. The next section shows sample Jobs for the functionalities described in the practical use cases.

 

For ease of understanding, and to keep the focus on the integration between Talend and Amazon Personalize, the sample Job uses a CSV file for input and a tLogRow component for output.

 

Talend sample Job for Amazon Personalize

The sample_users.csv file, attached to this article, provides the data for the sample Job. The data from the input file is transmitted to the Amazon Personalize service, and the response is captured. The response from the Amazon Personalize service (in JSON format) is parsed, and the recommended item IDs are matched with their corresponding movie names. The final output is printed to the console using a tLogRow component.
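
To make the parsing step concrete, here is a minimal sketch that pulls item IDs out of the response text with a regular expression. It assumes the text contains entries of the form ItemId: 296 or "ItemId":"296"; the exact layout of the response string depends on the SDK version, so treat the pattern as an assumption to verify against your own output. The sample Job itself parses the response with Talend components rather than with this class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for the response text; the response layout is an assumption.
class ResponseParserSketch {

    // Matches ItemId followed by a colon and a value, with or without double quotes.
    private static final Pattern ITEM_ID =
            Pattern.compile("\"?ItemId\"?\\s*:\\s*\"?([^\",}\\s]+)");

    // Returns the item IDs in the order they appear in the response text.
    static List<String> extractItemIds(String responseText) {
        List<String> ids = new ArrayList<>();
        Matcher matcher = ITEM_ID.matcher(responseText);
        while (matcher.find()) {
            ids.add(matcher.group(1));
        }
        return ids;
    }
}
```

The order of the returned list follows the order of the recommendations in the response, which is worth preserving if the ranking matters downstream.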

 

The configuration details are as follows:

  1. Create a new Standard Job called AWS_Personalize_sample_job, or use the sample Job, AWS_Personalize_sample_job.zip, attached to this article.

  2. The first stage in associating the routine to a Talend Job is to add the routines to the newly created Job, by selecting Setup Routine Dependencies.

    image.png

     

  3. Add the AWS_Personalize routine to the User routines section of the pop-up screen, to link the newly created routine to the Talend Job.

    image.png

     

  4. Review the overall Job flow, shown in the following diagram.

    image.png

     

  5. Configure the context variables, as shown below:

    image.png

     

  6. The input file for the Job, sample_users.csv, attached to this article, contains the list of users for whom recommendations are fetched from the Amazon Personalize service.

    image.png

  7. Configure the tFileInputDelimited component, as shown below:

    image.png

     

  8. Use the tMap component to call the Amazon Personalize service through the Talend routine. Pass the parameters in the same order as in the routine's function signature, as shown in the following expression:

    AWS_Personalize.GetRecommendations(context.AWS_Access_Key, context.AWS_Secret_Key, context.AWS_regionName, context.campaignArn, "10", context.numResults,user_input.user ).replaceAll(" ItemList ", "\"ItemList\"").replaceAll(" ItemId ", "\"ItemId\"") 
  9. Configure the tMap component layout, as shown below:

    image.png

     

  10. The output from the Amazon Personalize call is a string in JSON format. Parse the output text into the output variables. Leave the user column empty because you map it directly from the input flow.

    image.png

     

  11. The movie ID from the recommended output is joined with movies.csv to get the movie names. Configure the tFileInputDelimited component, as shown below:

    image.png

     

  12. Using a tMap component, join the movie ID with the movies.csv lookup data.

    image.png

     

  13. Notice that the joined data passes to the tLogRow component, which displays the output in the console.

    image.png

    In practical scenarios, the output at this stage can be passed to downstream systems for further processing and storage.
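
The lookup in steps 11 and 12 amounts to a keyed join between the recommended item IDs and the movie list. The following sketch shows the same idea in plain Java. It assumes each lookup line has the form id,title (the real column layout of movies.csv may differ), and the placeholder title for unmatched IDs is an illustrative choice, similar to a left outer join in tMap.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the lookup join; the Job does this with tMap.
class MovieLookupSketch {

    // Builds an id -> title map from lines of the assumed form "id,title".
    // Only the first comma is treated as a separator, so titles may contain commas.
    static Map<String, String> loadTitles(List<String> csvLines) {
        Map<String, String> titles = new HashMap<>();
        for (String line : csvLines) {
            int comma = line.indexOf(',');
            if (comma > 0) {
                titles.put(line.substring(0, comma).trim(), line.substring(comma + 1).trim());
            }
        }
        return titles;
    }

    // Keeps every recommended ID, in order; IDs without a match get a placeholder title.
    static List<String> join(List<String> itemIds, Map<String, String> titles) {
        List<String> rows = new ArrayList<>();
        for (String id : itemIds) {
            rows.add(id + "|" + titles.getOrDefault(id, "(unknown)"));
        }
        return rows;
    }
}
```

Keeping unmatched IDs visible, instead of dropping them, makes it easier to spot a lookup file that is out of date with the trained model.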

     

Threshold limits for data processing

The latest information about the threshold limits of the Amazon Personalize service can be found on the Limits in Amazon Personalize page of the AWS documentation.

 

Conclusion

This article demonstrates a use case that integrates Talend with the Amazon Personalize service. In real-world scenarios, data can flow from multiple source systems, such as batch files, web services, queues, or APIs. Talend can integrate all these diverse source systems with the Amazon Personalize service in a straightforward way.

 

Citations

AWS Documentation:

Version history: revision 16 of 16, last updated 09-27-2019 02:54 PM