
Best way of dealing with rejected records of data cleaning components?

One Star


Hi all,

I have a job where I receive a file with a number of records (each row is a user, with columns for the user's phone number, email address, postcode, etc.) and I want to load these records into a database. Before doing so, I have built a flow with several data-cleaning components (tFilterRow, tPatternCheck, ...) that make sure the records follow certain rules (name without invalid characters, name longer than 2 characters, no overseas postcodes, ...). Records that break a rule get 'flagged', that is, they are sent to the 'reject' flows of the various data-cleaning components.

What I am trying to find out is the best way of dealing with these 'flagged' records. When the job finishes running, I would like to manually analyse the rejects (all rows with names containing invalid characters, names of 2 characters or fewer, etc.) and decide whether they are genuine rejects or false positives (names like 'Jo' are still valid, for example), or even correct them (when they contain invalid characters). Once these corrections are made, I would like to add the records back to the flow of 'unflagged' records so that they get loaded into the database as well.

This can be achieved with an Excel file in a fairly manual process (send the rejects from each component to a separate tab of an Excel file and correct them by hand), but I was wondering whether there is a more effective way of doing this. I have tried using the Talend Stewardship Console, but since it is designed for handling matching rejects, it doesn't really fit this case.
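For reference, the single-file variant of that manual process could be sketched like this (plain Python, not Talend; the rule names and the postcode check are purely illustrative assumptions): every reject from every rule lands in one review file with a 'reject_reason' column, instead of one tab per component.

```python
# Sketch: funnel rejects from several cleaning rules into a single
# review file with a reject_reason column. Rules are illustrative only.
import csv
import re

# Each rule returns True when the record FAILS it.
RULES = {
    "invalid_chars_in_name": lambda r: not re.fullmatch(r"[A-Za-z '-]+", r["name"]),
    "name_too_short":        lambda r: len(r["name"]) < 3,
    "overseas_postcode":     lambda r: not r["postcode"].startswith("SW"),  # assumed rule
}

def split_records(records):
    """Split records into clean rows and rejects tagged with every failed rule."""
    clean, rejects = [], []
    for rec in records:
        reasons = [name for name, failed in RULES.items() if failed(rec)]
        if reasons:
            rejects.append({**rec, "reject_reason": ";".join(reasons)})
        else:
            clean.append(rec)
    return clean, rejects

records = [{"name": "Alice", "postcode": "SW1A"},
           {"name": "Jo",    "postcode": "SW19"}]
clean, rejects = split_records(records)

# One file for the whole manual review, instead of one Excel tab per component.
with open("rejects_for_review.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "postcode", "reject_reason"])
    writer.writeheader()
    writer.writerows(rejects)
```

The reason column makes it easy to sort and batch-review rejects by rule, and corrected rows can simply be re-fed into the main flow.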

Any ideas? 
Best regards

Four Stars

Re: Best way of dealing with rejected records of data cleaning components?

Speaking from my experience...
In a previous assignment I did this:
1 - Load every row as strings into a LOAD_<table>, setting a status field to "not processed". This LOAD_<table> is now your source data.
2 - Process all "not processed" rows, changing their status to "processed" or "with error". At the same time, load any row with status "manually validated" without putting it through the validation process.
3 - A team member built a web service (you can also just go through the DB manually) so that support could review each "with error" or "invalid" row and either:
        fix it, marking it as:
                 "manually validated" - if it contains valid data that always fails the rules, like your 'Jo' example (this should be avoided as much as possible),
                 or "not processed" - if they just fixed, say, a postal code and want everything to be validated again;
        or discard it (mark it as "invalid").
4 - Go to 2 (or to 1 if you have more data to load).
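The status-driven staging pattern above can be sketched in a few lines (here with SQLite and a toy validation rule; the table name LOAD_users, the columns, and the rules are illustrative assumptions, not Talend specifics):

```python
# Sketch of the staging-table workflow: rows start as "not processed",
# validation moves them to "processed" or "with error", and rows a human
# has marked "manually validated" bypass validation entirely.
import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE LOAD_users (
        id       INTEGER PRIMARY KEY,
        name     TEXT,
        postcode TEXT,
        status   TEXT DEFAULT 'not processed'
    )
""")

def validate(name, postcode):
    """Toy rules: alphabetic name of 2+ chars, 5-digit postcode."""
    return bool(re.fullmatch(r"[A-Za-z ]{2,}", name or "")) and \
           bool(re.fullmatch(r"\d{5}", postcode or ""))

def process_pending(conn):
    # Step 2 of the workflow: pick up both pending and manually
    # validated rows; the latter skip validation and load directly.
    rows = conn.execute(
        "SELECT id, name, postcode, status FROM LOAD_users "
        "WHERE status IN ('not processed', 'manually validated')"
    ).fetchall()
    for rid, name, postcode, status in rows:
        ok = status == "manually validated" or validate(name, postcode)
        # A real job would also insert "ok" rows into the target table here.
        conn.execute("UPDATE LOAD_users SET status = ? WHERE id = ?",
                     ("processed" if ok else "with error", rid))
    conn.commit()
```

Because every state transition lives in one status column, re-running step 2 after support fixes a row (resetting it to "not processed") is idempotent for everything already marked "processed" or "invalid".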

Hope it helps

