I have data that needs to be validated in three ways: a schema compliance check for mandatory fields and date formats; a unique-row check to get rid of duplicates; and a tMap to transform fields as well as run rules such as "field not null", "date A > date B", or "field1 = field2 * field3".
When I first set the validations up, they ran one after the other, but set up that way the validations do not all run against the FULL file: the duplicate check only sees the subset that passed schema compliance, and the tMap only sees the subset that passed the duplicate check.
I rearranged things into a series of jobs, each consisting of input -> validation -> output. Then one final job runs the whole string of validations, so that I still get the correct "good" file as output.
Is there a better way to make sure all validations run on ALL rows than what I'm doing?
All of the validation processes filter the rows. Can I ask why you need to validate all input rows again?
The reason I need all rows validated by every test is so that we can tell the client every validation a given row fails.
For example, in this scenario:
input file -> schema check -> tMap validations -> output file
if you have 100 rows, 80 make it through the schema check, and 50 make it through the tMap validations to the output file.
But we don't know which rows have both issues, i.e. fail the schema check AND fail the tMap validations.
It's better to let the client know up front that they have to fix both issues, rather than have them fix the schema issue, only to have the row in the corrected file then fail the tMap validations.
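Outside of Talend, the pattern you're after (every check sees every row, and each row accumulates its full list of failures) can be sketched like this. The field names and rules here are hypothetical, just mirroring the examples above:

```python
# Sketch: run every validation against the FULL input and collect all
# failure reasons per row, instead of chaining filters. Field names
# (date_a, date_b, field1, ...) are made up for illustration.
from datetime import date

rows = [
    {"id": 1, "date_a": date(2023, 1, 5), "date_b": date(2023, 1, 1),
     "field1": 6, "field2": 2, "field3": 3},
    {"id": 2, "date_a": None, "date_b": date(2023, 1, 1),
     "field1": 9, "field2": 2, "field3": 3},
]

# Each check is independent and sees every row, not just the rows
# that survived earlier checks.
checks = {
    "date_a not null":          lambda r: r["date_a"] is not None,
    "date_a > date_b":          lambda r: r["date_a"] is not None
                                          and r["date_a"] > r["date_b"],
    "field1 = field2 * field3": lambda r: r["field1"] == r["field2"] * r["field3"],
}

def validate(rows):
    """Return, for each row, the complete list of validations it fails."""
    report = []
    for r in rows:
        failures = [name for name, ok in checks.items() if not ok(r)]
        report.append({"id": r["id"], "failures": failures})
    return report

for entry in validate(rows):
    print(entry)
```

Here row 2 is reported with all three failures at once, which is exactly the "tell the client everything up front" behavior; a chained design would have stopped it at the first check.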