Best way of dealing with rejected records of data cleaning components?

One Star

Best way of dealing with rejected records of data cleaning components?

Hi all,

I have a job where I get a file with a number of records (each row contains a user with each column being phone number, email address, postcode, etc. of the user) and I want to load these records into a database. Before I do it, I have created a flow that has certain data cleaning components (tfilterrow, tpatterncheck...) that make sure the records follow certain rules (name without invalid chars, name longer than 2 chars, postcode no overseas...). Otherwise, these records get 'flagged', that is, they are added to the 'reject' flows of the different data cleaning components.

What I am trying to find out is the best way of dealing with these 'flagged' records. When the job finishes running, I would like to manually analyse these rejects (all rows with names with invalid characters, names longer than 2 chars, etc.) and decide if they are actually rejects instead of false positives (names like 'Jo' are still valid, for example) or even correct them (when they have invalid characters). Once these corrections are made, I would like to add these records to the flow with the 'unflagged' records so that they get added to the database as well.

This can be achieved with an excel file in a quite manual process(get rejects from each component in one tab of an excel file and correct them manually), but I was thinking if there could be some more effective way of doing this. I have tried using Talend Stewardship Console, but given that it is designed for working with Matching rejections it doesnt really work.

Any ideas? 
Best regards

Four Stars

Re: Best way of dealing with rejected records of data cleaning components?

Talking from my experience....
In a previous assignment I did this:
1 - Loaded every row as Strings to a LOAD_<table>, setting a status field as "not processed". Now this LOAD_<table> is your source data.
2  - Process all "not processed" rows, changing their status to "processed" or "with error". 
       At the same time, load any row that's with status "manually validated" without going through the validation process.
3 - A team member built a webservice (you can just go manually through the BD) for support to go through any "with error" or "invalid" row and:
        fix it, marking it as:
                 "manually validated" - if it contains valid data that always fail rules, like your Jo example (this should be avoided as much as possible) 
                  or "not processed" - if they just fixed a postal code and want everything to be validated again
         or just get rid of it (mark as "invalid")
4 - Go to 2 (or 1 if you have more data to load)


Hope it helps

2019 GARNER MAGIC QUADRANT FOR DATA INTEGRATION TOOL

Talend named a Leader.

Get your copy

OPEN STUDIO FOR DATA INTEGRATION

Kickstart your first data integration and ETL projects.

Download now

What’s New for Talend Summer ’19

Watch the recorded webinar!

Watch Now

Best Practices for Using Context Variables with Talend – Part 4

Pick up some tips and tricks with Context Variables

Blog

How Media Organizations Achieved Success with Data Integration

Learn how media organizations have achieved success with Data Integration

Read

Agile Data lakes & Analytics

Accelerate your data lake projects with an agile approach

Watch