How to do a full outer join in a Talend batch Job

Overview

The Talend tJoin and tMap ETL components offer in-memory join functionalities like Inner Join and Left Outer Join. However, there is no option for Full Outer Join in those components. This article provides two design examples for how to perform the Full Outer Join logic in your Talend jobs.

 

Environment

The examples below were created using:

  • Talend Data Fabric 6.3.1
  • Oracle JDK Build 1.8
  • Windows 10

 

Prerequisite

The example Jobs use two files, Empdet and Emploc, as shown below. These two file are also attached to this article for reference.

Empdet:

empdet.png

EmpLoc:

emploc.png

 

Example 1—Performing Two Left Outer Joins with Files

In this first example, you will read data from two files and perform two Left Outer Joins with two tMap components. You will reverse the main flow and the lookup flow on the second tMap, and then proceed with filtering any duplicate records in the resulting set. This job will load the lookup records directly from the files. Since the left outer join operation is done twice, this example is recommended for scenarios where the number of records in the two files is predictable, and known up front. This example will not work for tens of millions of records, since it will duplicate the data in memory and then filter out the unique records, which will require a lot of RAM if you are dealing with millions of records in each file.

 

Follow the instructions below to create the job:

  1. Place the following components in your job: tFileInputDelimited, tMap, tUnite, tUniqRow, and tLogRow.
  2. Connect the components as shown below. You are going to do two Left Outer Joins in the two tMaps and while doing so, you will change the order of the main input and lookup files.

    job1.png

     

  3. Configure the tFileInputDelimited components to read the files. The screenshot below shows the configuration for reading one file.

    tFileInputDelimited.png

     

  4. Perform the configuration for the other tFileInputDelimited components.

  5. Configure the tMap_1 component as shown below:

    tMap1.png

     

  6. Define the tMap_2 component as shown below:

    tMap2.png

     

  7. Edit the schema in the tUnite component as shown below:

    tUnite.png

     

  8. The tUniqRow component is used to remove the duplicates. You will have duplicates, since you are doing two left outer joins. Configure the component as follows:

    tUniqRow.png

     

  9. Configure the tLogRow component to display the output in the console. Run the Job and you will see the result of the Full Outer Join.

    tLogRow.png

     

 

Example 2—Use a Database to perform the Full Outer Join

A more elegant approach to performing a Full Outer Join is to leverage a database. The data from the two input files can be loaded into a database like MySQL, and then you can leverage the SQL syntax and power of the database to perform the Full Outer Join. In this example, you will use a MySQL database. However, any compliant RDBMS can be used to achieve this functionality.

 

Follow the instructions below to create the job.

  1. Place the following components in your job: tMysqlConnection, tFileInputDelimited, tMap, tMysqlOutput, tMysqlInput, and tLogRow. This example uses tMap to cleanse the null data. In the emploc.txt file, emp_id has a null value in a few records. A composite primary key cannot hold a null value, so the example uses tMap to convert nulls into a default number (-99).

    job2.png

     

  2. Configure the tMysqlConnection component as shown below. It is recommended to create a shared connection and use it in your data integration Jobs.

    tMysqlConnection.png

     

  3. As shown in example 1 above, configure the tFileInputDelimited components for both the files.

  4. Configure and link a tMysqlOutput component to each tFileInputDelimited, as shown in the Job above, to insert the data into two different tables (you will insert the data into two tables, and then perform a full outer join on the two tables).

  5. You will run the full outer join command through the database, so configure the tMysqlInput as shown below:

    tMysqlInput.png

     

  6. The full outer join query used is shown below:

    "SELECT t1.empid,t1.name,t1.desg,t2.loc FROM dev.tab1 t1
    LEFT JOIN dev.tab2 t2 ON t1.empid = t2.empid
    UNION
    SELECT t2.empid,t1.name,t1.desg,t2.loc FROM dev.tab1 t1
    RIGHT JOIN dev.tab2 t2 ON t1.empid = t2.empid"

     

  7. The tables in MySQL are created with the following structure:

    SQL_File4.png

     

  8. Run the Job and observe the result of the full outer join:

    result.png

     

Note: It is possible to replace the MySQL database with an in-memory database like HSQLDB. Talend provides the tHSQLDBInput and tHSQLDBOutput components that can be configured to behave as an in-memory database tied to the Java process. This enables the Job designer to leverage the Full Outer Join SQL syntax to perform this functionality. The downside is still the use of memory on the execution server to cache all the data from the files into memory. Relational databases, like MySQL, are a better solution since they can leverage the use of cursors and have been optimized to perform full outer joins.

Version history
Revision #:
3 of 3
Last update:
‎09-08-2017 01:56 PM
Updated by:
 
Labels (1)
Tags (1)