Seven Stars

[resolved] Remove all duplicate rows from flow (including original)?

I have a series of jobs which perform complex mappings that might result in duplicate keys. I would like to separate out ALL records with a duplicated key to be manually resolved. (tUniqRow only separates out the second and subsequent records with the duplicated key, letting the original record with that key through.)
Is there another way to do this other than:
-- write the mapped flow to a file
-- read it back twice, once in full and once counting the records grouped on the key
-- join them using the key and then filter them on the count?
Ideally, the tSchemaComplianceCheck would do this based on the key fields specified in the schema.
I'm using Integration Suite 3.2.3 with a Java project.
Thanks!
1 ACCEPTED SOLUTION

Accepted Solutions
One Star

Re: [resolved] Remove all duplicate rows from flow (including original)?

An efficient and clean way is to use tAggregateRow to count the key column, join back to the input with tMap, and then filter out all rows whose count is more than 1. Look at the attached pic.
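In plain Java terms, the logic amounts to something like this (a rough sketch of the idea only, not Talend-generated code; the sample rows are invented):

```java
// Sketch: tAggregateRow builds a (key -> count), the tMap joins each input row
// back to that count on the key, and the filter routes rows with count > 1 to
// the duplicates output.
import java.util.*;

public class CountJoinFilter {
    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(        // {key, payload}
            new String[]{"A", "first"},
            new String[]{"B", "only"},
            new String[]{"A", "second"});

        // "tAggregateRow": group on the key and count
        Map<String, Integer> countByKey = new HashMap<>();
        for (String[] row : rows) {
            countByKey.merge(row[0], 1, Integer::sum);
        }

        // "tMap" join on the key + filter on the count
        List<String[]> uniques = new ArrayList<>();
        List<String[]> duplicates = new ArrayList<>();
        for (String[] row : rows) {
            (countByKey.get(row[0]) > 1 ? duplicates : uniques).add(row);
        }

        System.out.println("uniques=" + uniques.size() + ", duplicates=" + duplicates.size()); // 1, 2
    }
}
```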
4 REPLIES
Community Manager

Re: [resolved] Remove all duplicate rows from flow (including original)?

Hello
(tUniqRow only separates out the second and subsequent records with the duplicated key, letting the original record with that key through.)

Yes, first you need to use tUniqRow to get the unique records and the duplicate records, and output them to two temp files or to memory (tHashOutput). On the next subJob, use two tFileInputDelimited components to read the records back from the two temp files (or tHashInput components to read them from memory), do an inner join, and take the unmatched rows.
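Roughly, in plain Java, the idea looks like this (a sketch only, not what the generated job actually contains; the sample keys are invented):

```java
// subJob 1: tUniqRow-style split into "uniques" (first occurrence of each key)
// and "duplicates" (second and subsequent occurrences).
// subJob 2: inner-join uniques to duplicates on the key; the UNMATCHED unique
// rows are the ones whose key never repeats.
import java.util.*;

public class UniqRowThenJoin {
    public static void main(String[] args) {
        List<String> keys = Arrays.asList("A", "B", "A");

        // subJob 1: split (would go to temp files / tHashOutput in the job)
        Set<String> seen = new HashSet<>();
        List<String> uniques = new ArrayList<>();
        List<String> duplicates = new ArrayList<>();
        for (String k : keys) {
            (seen.add(k) ? uniques : duplicates).add(k);
        }

        // subJob 2: inner join on the key; keep the rows that do NOT match
        Set<String> duplicateKeys = new HashSet<>(duplicates);
        List<String> trulyUnique = new ArrayList<>();
        for (String k : uniques) {
            if (!duplicateKeys.contains(k)) {   // unmatched in the inner join
                trulyUnique.add(k);
            }
        }
        System.out.println(trulyUnique);        // [B]
    }
}
```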
Best regards

shong
----------------------------------------------------------
Talend | Data Agility for Modern Business
Seven Stars

Re: [resolved] Remove all duplicate rows from flow (including original)?

Hi shong, your suggestion doesn't seem to achieve exactly what I want. At the end it gives me only the truly uniquely-keyed records, but I want all the duplicately-keyed records as well, i.e. two sets of data whose combined record count equals that of my original table.
Thanks
Community Manager

Re: [resolved] Remove all duplicate rows from flow (including original)?

Hello
but I want all the duplicately-keyed records as well

On the second subJob, also get the matched rows. On the third subJob, merge the matched rows with the duplicated rows (stored in tHashOutput_2).
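Continuing the earlier sketch in plain Java (again only an illustration with invented keys, not Talend-generated code):

```java
// In the second subJob also collect the MATCHED rows from the join, and in a
// third subJob merge them with the stored duplicates, so both copies of every
// duplicated key end up in one flow.
import java.util.*;

public class CollectAllDuplicates {
    public static void main(String[] args) {
        // outputs of the first subJob's tUniqRow-style split
        List<String> uniques = Arrays.asList("A", "B");   // first occurrences
        List<String> duplicates = Arrays.asList("A");     // later occurrences (tHashOutput_2)

        // second subJob: join uniques to duplicates on the key
        Set<String> duplicateKeys = new HashSet<>(duplicates);
        List<String> unmatched = new ArrayList<>();        // truly unique keys
        List<String> matched = new ArrayList<>();          // originals of duplicated keys
        for (String k : uniques) {
            (duplicateKeys.contains(k) ? matched : unmatched).add(k);
        }

        // third subJob: merge the matched originals with the stored duplicates
        List<String> allDuplicateKeyed = new ArrayList<>(matched);
        allDuplicateKeyed.addAll(duplicates);

        System.out.println("unique: " + unmatched);             // [B]
        System.out.println("duplicated: " + allDuplicateKeyed); // [A, A]
    }
}
```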
Best regards
shong
----------------------------------------------------------
Talend | Data Agility for Modern Business
One Star

Re: [resolved] Remove all duplicate rows from flow (including original)?

An efficient and clean way is to use tAggregateRow to count the key column, join back to the input with tMap, and then filter out all rows whose count is more than 1. Look at the attached pic.