Remove duplicates with condition

One Star

Remove duplicates with condition

Hi all,
I want to transfer specific .txt files into an Oracle DB using a delta update approach.
The challenge at this point is to remove duplicates from the original file by applying a specific condition. For example:
The file rows are of the format
cust_id,<>,<>,date,<>,<>,...
but there are duplicate rows with the same cust_id but different dates.
What I want to accomplish is to transfer only the row that has the maximum date and disregard the rest.
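For example, given these two rows for the same customer (made-up data, with the placeholder columns filled in):

101,A,B,2015-03-01,X,Y
101,A,B,2015-06-15,X,Y

only the second row (the one dated 2015-06-15) should be transferred.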
I found that removing duplicates can be done with the tUniqRow component, but it transfers only the first unique row it encounters.
Any suggestions will be appreciated.
Thanks in advance,
One Star

Re: Remove duplicates with condition

I should also add that I tried sorting the file by date descending before feeding it into tUniqRow, but sorting the file takes a long time to complete (approx. 650k records), so I was wondering about a more optimal approach.
Sixteen Stars

Re: Remove duplicates with condition

Without the data sorted, this is not going to be terribly efficient any way you try it. Why not try the tAggregateRow component? Aggregate by cust_id (and any other columns that are appropriate) and use the Max function for the date in the operations section. Then you are essentially combining the sorting with the unique operation.
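For illustration only (this is not the component itself, just the equivalent logic in plain Java, with a made-up Row record standing in for your schema), tAggregateRow grouping by cust_id with Max(date) effectively does this:

import java.time.LocalDate;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class MaxDatePerCustomer {
    // Made-up row type; your real schema has more columns.
    record Row(String custId, LocalDate date) {}

    public static void main(String[] args) {
        List<Row> rows = List.of(
            new Row("101", LocalDate.of(2015, 3, 1)),
            new Row("101", LocalDate.of(2015, 6, 15)),
            new Row("102", LocalDate.of(2015, 4, 2)));

        // Group by cust_id, keeping only the row carrying the latest date.
        Map<String, Row> latest = rows.stream()
            .collect(Collectors.toMap(
                Row::custId,
                r -> r,
                (a, b) -> a.date().isAfter(b.date()) ? a : b));

        latest.values().forEach(System.out::println); // one row per cust_id
    }
}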
One Star

Re: Remove duplicates with condition

Thanks for the suggestion. I will try that, although I think the result will be the same, since sorting will again be performed. In my initial attempt I got an OutOfMemoryError during the sorting process (I run Talend locally on my PC).
One alternative I've come up with is to write the raw data into a temp table in the DB and then select the distinct cust_id / max date combination. Although again slow (about 20 minutes to complete), this finished successfully.
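In case it helps anyone, the query side of that approach can be done in one pass with an analytic function, so the whole row survives rather than just the cust_id/date pair. A minimal JDBC sketch, assuming made-up table and column names (tmp_customer_load, load_date) and connection details:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class DbDedupe {
    public static void main(String[] args) throws SQLException {
        // ROW_NUMBER() keeps the full row per cust_id, newest date first.
        String sql =
            "SELECT cust_id, load_date " +
            "FROM (SELECT t.*, ROW_NUMBER() OVER " +
            "        (PARTITION BY cust_id ORDER BY load_date DESC) rn " +
            "      FROM tmp_customer_load t) " +
            "WHERE rn = 1";
        // Requires the Oracle JDBC driver on the classpath.
        try (Connection con = DriverManager.getConnection(
                 "jdbc:oracle:thin:@//localhost:1521/ORCL", "user", "pass");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery(sql)) {
            while (rs.next()) {
                System.out.println(rs.getString("cust_id")
                    + " " + rs.getDate("load_date"));
            }
        }
    }
}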
Sixteen Stars

Re: Remove duplicates with condition

Sorting in a DB is always going to be quicker. I would have suggested that, but the latency of loading the data in, sorting it, and then extracting it out again will negate any benefits. If it were me, I would do everything I need to do in Talend first, load it all to a DB, and sort and de-dupe there. There is no point doing memory-intensive calculations in Talend if you can do them in a DB and there is no other need to keep the data in memory.
The process of de-duplication requires either that the data is sorted or that every unique key is stored in memory so that any subsequent duplicate can be thrown away. You could probably write some code in Java using a tJavaFlex to do this. As every row passes through, add it to a HashMap with the key being the cust_id and the value being the date. Check every row's cust_id against the HashMap (do a key lookup). If the key is found and the date is greater than the stored date, replace it, etc., etc. That might improve performance; there is a rough sketch at the end of this post.
I think loading to the DB and doing the dedupe stuff there will be quickest. But I would be interested in the tJavaFlex idea.
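Roughly, the idea looks like this outside of Talend (an untested sketch; the file name, column positions, and ISO date format are all assumptions, and note you have to keep the whole surviving line per key, not just the date, so you can emit it at the end):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.time.LocalDate;
import java.util.HashMap;
import java.util.Map;

public class StreamingDedupe {
    public static void main(String[] args) throws IOException {
        // Per cust_id, remember the latest date seen and the line that carried it.
        // ~650k keys fits comfortably in memory on a desktop JVM.
        Map<String, LocalDate> maxDate = new HashMap<>();
        Map<String, String> keptLine = new HashMap<>();

        try (BufferedReader in = Files.newBufferedReader(Paths.get("input.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] f = line.split(",");
                String custId = f[0];
                LocalDate date = LocalDate.parse(f[3]); // date in the 4th column
                LocalDate seen = maxDate.get(custId);
                if (seen == null || date.isAfter(seen)) {
                    maxDate.put(custId, date);
                    keptLine.put(custId, line);
                }
            }
        }
        // In a tJavaFlex, this flush would live in the end code section.
        keptLine.values().forEach(System.out::println);
    }
}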
One Star

Re: Remove duplicates with condition

Indeed, I performed the dedupe at the DB level after sorting the data, and then fed the result to the tUniqRow component.
This was the most efficient solution.
Sixteen Stars

Re: Remove duplicates with condition

If you deduped in the DB, surely you don't need the tUniqRow? You can shave some more time off this :-)