Automate file processing from Google Storage with Talend Cloud

Overview

Automating file processing from cloud storage is one of the most common integration use cases. This article explains how to integrate Google Cloud Storage with Talend Cloud to handle it.

 

Prerequisites

Scenario

  1. A file is pushed to a Google Storage bucket.
  2. A trigger calls a Cloud function.
  3. The Cloud function calls a Talend Flow.
  4. The Talend Flow retrieves the file to process it based on the parameters sent by the Cloud function.
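
As a rough sketch of steps 3 and 4, the Cloud function only needs to translate the storage event into the parameters the Flow expects. The event fields (event.data.bucket, event.data.name) and the parameter names (fileBucket, fileKey) are the ones used later in this article; the helper name toFlowParameters and the sample event values are hypothetical.

```javascript
// Hypothetical helper: maps a Cloud Storage event to the context
// parameters consumed by the Talend Job (fileBucket and fileKey).
function toFlowParameters(event) {
  return {
    fileBucket: event.data.bucket, // bucket that received the file
    fileKey: event.data.name       // object name, including the folder
  };
}

// Sample event with made-up values, for illustration only.
const sampleEvent = {
  data: { bucket: "my-bucket", name: "cloud-function/sample.csv" }
};
console.log(toFlowParameters(sampleEvent));
```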

Google Cloud Storage

Google Cloud Storage is the object storage service on the Google Cloud Platform (GCP). If you are familiar with Amazon Web Services, the equivalent is S3.

 

Creating a Bucket

  1. Log in to your Google Cloud account, then go to the Storage page.

    Capture d’écran 2018-06-15 à 11.07.59.png

     

  2. Click Create Bucket.

    Capture d’écran 2018-06-15 à 11.08.09.png

     

  3. Configure your bucket.

    • Name: Name your bucket.
    • Default storage class: Define the type of storage:
      • Multi-Regional: This option replicates your bucket across multiple regions in an area (for example, Europe).
      • Regional: This option replicates your bucket across zones in a specific region.
      • Nearline and Coldline: These two options are intended for archiving and for data that is accessed infrequently.
    • Location: Specify the location of your bucket.
    • Labels: Set the metadata for your bucket.
    • Encryption: Select the type of encryption you want to use.

    Capture_d’écran_2018-06-15_à_11_09_29.png

     

  4. Once your bucket is created, add a folder to classify your incoming files.

    Capture_d’écran_2018-06-15_à_11_10_17.png

    Capture d’écran 2018-06-15 à 11.10.39.png

     

  5. Create an access key for your application. In Google Cloud Storage, navigate to the Settings section, and click Create a new key.

  6. Retrieve and secure the keys in a safe place.

    Capture_d’écran_2018-06-15_à_11_11_05.png

     

Creating a Talend Job

 

Create a Talend Job that retrieves a file from Google Cloud Storage and displays its data in the console. A real-world Job would be more complex. The demo Job is available in the GSRead.zip file attached to this article.

 

The Job is composed of the following steps:

  1. Create a connection to Google Cloud Storage.
  2. Retrieve the file from the Bucket.
  3. Read the file and display its content in the console.
  4. Close the connection to Google Cloud Storage.

    Capture d’écran 2018-06-15 à 11.02.09.png

     

  5. Configure the context variables:

    1. tempFolder: the folder used to store temporary data. On a Talend Cloud Engine, it is /tmp.

    2. fileKey: the key of the file in the GCP bucket, composed of the folder and the file name (for example, cloud-function/sample.csv).

    3. fileBucket: the Google bucket you created.

    4. gcpAccessKey: the Google Storage access key you created.

    5. gcpSecretKey: the Google Storage secret key you created.

    Capture d’écran 2018-06-15 à 11.01.50.png

     

  6. Configure the tGSConnection component:

    Capture d’écran 2018-06-15 à 11.02.22.png

     

  7. Configure the tGSGet component:

    Capture d’écran 2018-06-15 à 11.02.37.png

     

    The component configuration splits fileKey because its value looks like cloud-function/sample.csv, and only the file name, sample.csv, is needed.

     

  8. Configure the tFileInputRaw component:

    Capture d’écran 2018-06-15 à 11.02.50.png

  9. Your Job is now configured. Test it by setting valid values for the context variables.
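
To illustrate the split used in the tGSGet step, the sketch below shows the same idea in JavaScript: keep only the part of fileKey after the last slash. The Job itself expresses this in the component settings; the function name fileNameFromKey is hypothetical, and the code is not taken from the demo Job.

```javascript
// Hypothetical helper: returns the file name portion of an object key.
function fileNameFromKey(fileKey) {
  const parts = fileKey.split("/");
  return parts[parts.length - 1];
}

console.log(fileNameFromKey("cloud-function/sample.csv")); // prints "sample.csv"
console.log(fileNameFromKey("sample.csv"));                // prints "sample.csv"
```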

Publishing a Job to Talend Cloud

For more information on publishing a Job to Talend Cloud, see Connecting Talend Studio to Talend Integration Cloud.

  1. Right-click your Job, and select Publish to Cloud.

    Capture d’écran 2018-06-15 à 11.03.15.png

     

  2. Configure your export as needed, then click Finish.

    Capture d’écran 2018-06-15 à 11.03.37.png

     

  3. Once the Job is published, click Open Job Flow.

    Capture d’écran 2018-06-15 à 11.05.13.png

     

    If you are logged in to Talend Cloud, you should see your configuration.

  4. Configure the Job as appropriate.

    Capture d’écran 2018-06-15 à 11.05.36.png

    Capture d’écran 2018-06-15 à 11.16.32.png

     

Google Cloud Function

A Google Cloud Function allows you to implement the following logic:

  • Each time you create a file in your bucket, the function will call your Talend Job.

The Cloud Function in this article is written in JavaScript (Node.js). The function and package code are available in the gcp-cloud-function.zip file attached to this article.

 

  1. Find your Flow Id. In Talend Cloud, select your flow and check the Flow Id:

    Capture_d’écran_2018-06-15_à_11_18_27.png

     

  2. In your Google Cloud Platform console, go to Cloud Functions.

    Capture d’écran 2018-06-15 à 14.08.15.png

  3. Create a function.

    1. Name: Give your function a name.

    2. Memory allocated: Select the memory needed for your function.

    3. Trigger: For this scenario, select Cloud Storage bucket.

    4. Event Type: Select Finalize/Create so that the function is triggered for each new file.

    5. Bucket: The bucket you created.

    Capture d’écran 2018-06-15 à 14.34.44.png

     

    You can use the inline editor to create the function.

    Capture d’écran 2018-06-15 à 14.38.25.png

    /**
     * Triggered by a change to a Cloud Storage bucket.
     *
     * @param {!Object} event The Cloud Functions event.
     * @param {!Function} callback The callback function.
     */
    exports.processFile = (event, callback) => {
        console.log('Processing file: ' + event.data.name);

        // Body: the Flow to execute and the context parameters passed to the Job
        var body = {
            executable: "<Talend Flow ID>",
            parameters: {
                fileBucket: event.data.bucket,
                fileKey: event.data.name
            }
        };
        var jsonString = JSON.stringify(body);
        console.log(jsonString);

        // Include the request library for Node.js
        var request = require('request');

        // Basic Authentication credentials
        var username = "<Talend Cloud user>";
        var password = "<Talend Cloud Password>";
        // Buffer.from replaces the deprecated new Buffer() constructor
        var authenticationHeader = "Basic " + Buffer.from(username + ":" + password).toString("base64");

        // Request
        var options = {
            method: 'POST',
            url: "https://ipaas.us.cloud.talend.com/api/v1.1/executions",
            body: jsonString,
            headers: {
                "Content-Type": "application/json",
                "Accept": "application/json",
                "Authorization": authenticationHeader
            }
        };

        // Call Executions. Signal completion only after Talend Cloud has
        // answered, so the function is not terminated before the request ends.
        request(options, function responseCall(error, response, responseBody) {
            if (error) {
                console.error(error);
                callback(error);
                return;
            }
            console.log(response.statusCode);
            console.log(JSON.stringify(responseBody));
            console.log("Done!");
            callback();
        });
    };
    

    This function is rather simple. It performs three steps:

    • Create a JSON body to send to Talend Cloud, as shown below:
    {
      "executable": "57f64991e4b0b689a64feed0",
      "parameters": {
        "fileKey": "cloud-function/sample.csv",
        "fileBucket": "mgainhao-demo"
      }
    }
      You can test your API call on the Talend Cloud API-Executions page of the Talend Cloud API documentation.
    • Add basic authentication for the connection to Talend Cloud.
    • Create a request.

  4. Because the function uses the request module, update package.json to add it as a dependency.

    Capture d’écran 2018-06-15 à 14.35.49.png

     

  5. Click Create. The function is created.

    Capture d’écran 2018-06-15 à 14.40.16.png

    Capture d’écran 2018-06-15 à 14.41.12.png

     

  6. You can access the dashboard by clicking the name of the function.

    Capture d’écran 2018-06-15 à 14.41.19.png

     

  7. Test your function. Create a new file in the Google Cloud bucket.

    Capture d’écran 2018-06-15 à 14.42.14.png

     

  8. Your function is called.

    Capture d’écran 2018-06-15 à 14.42.35.png

     

  9. In Talend Cloud, verify that there are new executions of your Job.

    Capture d’écran 2018-06-15 à 14.42.48.png

     

  10. In the logs, you should be able to see the content of the file in the tLogRow_1 section.

    Capture d’écran 2018-06-15 à 14.43.34.png
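
To make the request construction easier to test outside Google Cloud, it can be pulled into a pure helper. This is a minimal sketch, assuming the same endpoint, headers, and body layout as the function above; the helper name buildExecutionRequest and the sample credentials are hypothetical.

```javascript
// Hypothetical helper: builds the options object passed to request().
// Endpoint, headers, and body layout mirror the function in this article.
function buildExecutionRequest(flowId, username, password, parameters) {
  const credentials = Buffer.from(username + ":" + password).toString("base64");
  return {
    method: "POST",
    url: "https://ipaas.us.cloud.talend.com/api/v1.1/executions",
    body: JSON.stringify({ executable: flowId, parameters: parameters }),
    headers: {
      "Content-Type": "application/json",
      "Accept": "application/json",
      "Authorization": "Basic " + credentials
    }
  };
}

// Made-up values, for illustration only.
const options = buildExecutionRequest(
  "<Talend Flow ID>", "user", "pass",
  { fileBucket: "my-bucket", fileKey: "cloud-function/sample.csv" }
);
console.log(options.body);
```

Because the helper has no side effects, its output can be checked locally before the function is deployed.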

     

Conclusion

Calling the Talend Cloud API from a Cloud Function makes it straightforward to process files with Talend Cloud as soon as they arrive in cloud storage.

Version history
Revision #: 22 of 22
Last update: 04-14-2019 02:17 PM