Using Data Preparation as part of a Data Ingestion Pipeline

Introduction

This article describes how to use Talend Data Preparation (Data Prep) as part of a data ingestion pipeline. This is a common business use case in which data from various disparate systems is required to be translated to a common format, standardized, reformatted, validated, and then passed in that standard format to other systems.

 

This example shows how you can build a Talend Job to ingest data from a database, convert it to a common format, run a Talend Data preparation against that data, and then pass it on to other systems, in this case exporting as both a Comma Separated Value (CSV) and an MS Excel file. Any of these systems can, of course, be replaced by any number of things, but what will be common will be translating the data to a common format and running a data preparation.

 

Building the Talend Job

The Job you will build is shown below:

job.png

 

Each component will be dealt with separately.

 

Input the Data

In this example, you will input data from a MySQL database. You set up a database table in MySQL using some example data; a section of the data is shown below. This shows the first part of your use case, getting data from a system in a specific format. You could use any number of different systems, but this example uses a database.

 

database.png

 

To represent that data in Talend you need to define a schema, as shown below:

schema.png

 

You will then use the tMysqlInput component to get the data from this table. The component configuration is shown below:

component_config.png

 

Transform the Data

You now need to transform the data into the common format you are using in your use case. This common format is shown below, and contains the following columns:

columns.png

 

This is your common format. In this use case, you want to translate ALL incoming data into this format. To do so, use a tMap component and map the fields as shown below:

tMap.png

 

Run a Data Preparation

The next step is to use a Data Preparation to run against the data. To do so, use the tDataprepRun component. On the component configuration tab, you can either use an existing preparation, or create a new one. In this example, you will create a new one.

  1. On the Component tab, click Create a new one, then select the Edit preparation in your browser icon.

    tDataprepRun.png

     

  2. The Data Preparation home page window opens in your web browser and the Data Preparation you just created is displayed:

     

    dataprep.png

     

  3. Select the Data Preparation to edit it. You can choose any number of things to do to the data; this example's use case is shown below:

    dataprepUpdate.png

     

    This example shows selecting a few simple functions for a small number of columns in the data. You may choose to follow this example or choose your own.

 

Manage the Output

In this example, you will send the data to two different places and formats. You will use the tReplicate component to send multiple copies of the output to one comma separated value file and one MS Excel spreadsheet. This is shown below:

tReplicate.png

 

Configure the components like this:

  1. Replicate the input/output schema in the tReplicate Component:

    tReplicate2.png

     

  2. Configure the final output components like this:

    • For the tFileOutputExcel component:

      tFileOutputExcel.png

       

    • For the tFileOutputDelimited component:

      tFileOutputDelimited.png

       

 

Run the Job

The Job is now built and configured. The next step is to run it, as shown below:

runJob.png

 

Once run, you should get output similar to this:

output.png

 

Examine the Output

In this example, you have a dataset containing 6040 records. You can now examine the output files, either by opening them or using the Talend Data Viewer:

dataviewer.png

 

Comparing this to the input data, you can see that the data has indeed been transformed by running through a Data Preparation.

output2.png

 

Version history
Revision #:
6 of 6
Last update:
‎09-27-2017 01:01 PM
Updated by:
 
Labels (1)
Tags (1)