How to capture data errors in Talend

Problem Description

One key element for designing a successful data integration (DI) Job is being able to capture errors that occur because of invalid data. When an error occurs, you may see incomplete or invalid data, though you cannot easily identify issues with data causing this behavior.

 

Root Cause

Whenever you get incomplete or invalid data in output, it is caused by one of these reasons:

  • Poor-quality input: data is rejected while being extracted from source systems because it does not meet certain criteria, as when string data appears in a numeric column
  • Rejection during the transformation stage: data may be rejected midstream in DI processes because it fails data quality checks, or it cannot be found in a lookup table
  • Data fails to load into the target database because of database constraints

 

Solution

Here are solutions for capturing invalid data at various stages of the DI process.

 

Capturing rejected records in source data

For data coming from sources that are not compliant with the source metadata or schema, you can use a Reject row from the Input component, as shown below:

KB-DataErrors-1.png

 

In this example, one of the records has an invalid value in the city_id (Integer) column, and it was sent to the RejectedCities output. Notice that for the Reject row to be enabled, the Die on error property check box for the Input component must be clear. For relational database sources, you can use validation rules as outlined in the documentation.

 

Handling records that fail data quality checks

The second scenario is when incoming records fail your data quality checks. For example, you may need to validate multiple columns against regular expressions. You can leverage Talend Data Quality components to see whether they match required patterns. These components provide two output flows: matching data and non-matching data. You can redirect non-matching records to a file or table and inspect why they failed.

 

KB-DataErrors-2.png

 

You can also review Talend documentation for examples using other components in this family, including tSchemaComplianceCheck.

 

Handling records with no matching lookup values

If your validation check is based on a list of values available in another lookup table, you can use a tMap component to join data with the lookup data and set your join model to Inner Join. All records that don’t have a match can be captured in a new output in tMap whose Catch lookup inner join reject property is set to true.

 

KB-DataErrors-3.png

 

Capturing records rejected by the target database

If you are trying to insert or update data in the target database and the data violates any integrity constraints set by the database, you can use the Reject flow from the Output component to capture the records in error. Here’s an example of Retrieving data in error with a Reject link from Talend documentation. Depending on the Output component you use, you may need to disable its ability to process rows in batch. For instance, with tOracleOutput, you must clear the Use Batch Size check box.

Version history
Revision #:
8 of 8
Last update:
‎02-24-2019 11:17 PM
Updated by: