Six Stars

Remove all duplicate rows

Hi all,

I'm new to Talend and I have a question,

Have any component let me remove all duplicate rows?

Sample data:

Name, Age

HieuDoan, 15

LinhNa, 16

HieuDoan, 20

NamL, 17

 

I have tried tUniqRow but it still keep the first duplicate row and not remove all duplicate:

The ouput I got with tUniqRow :

HieuDoan, 15

LinhNa, 16

NamL, 17

 

The output i need:

LinhNa, 16

NamL, 17

 

Can you give me a suggest? Thanks

8 REPLIES
Community Manager

Re: Remove all duplicate rows

Hi
Not a component can achieve it directly, you can cache the result into memory and do an inner join in the next subjob to get the expected result. eg:
tfileinputdelimimited--main--tuniqrow--unique--thashoutput1
--duplicated--thashoutput2
           |
onsubjobok
          |
tHashinput1--main--tmap--tlogrow
                                 |
                           lookup
                                |
                      thashinput2
thashinput1: read data from thashoutput1
thashinput2: read data from thashoutput2

on tMap: do an inner join based on the keys you defined on tuniqrow, set the 'Catch lookup inner join reject' option as true in the output table.

Regards
Shong

----------------------------------------------------------
Talend | Data Agility for Modern Business
Six Stars

Re: Remove all duplicate rows

Hi Shong,
My data source is Redshift and it have a lots data, where can I cache result into to use later ?
Community Manager

Re: Remove all duplicate rows

Store the data on disk instead of memory if there are lots of data.

----------------------------------------------------------
Talend | Data Agility for Modern Business
Ten Stars

Re: Remove all duplicate rows

Talend do absolutely correct - tUniqRow DO NOT REMOVE all rows with duplicated, it is remove duplicates and leave original unique value

 

Your example - illustrate it great

 

if You want remove all rows which value in table more than 1 time

it could be easy done by SQL query, if You prefer Talend, You can do

 

tRedshiftInput -> tAggregate (count by value) -> tFilterRwo (condition - count == 1)

result will be as You wish :

 

LinhNa, 16
NamL, 17

 

 

as addition:

HieuDoan, 15 and HieuDoan, 20

generally  also 2 people with same name, but different age :-), but this is other topic

-----------
Six Stars

Re: Remove all duplicate rows

If your source is Redshift, why not do this in DB, with a simple query?

 

SELECT *

FROM TABLE

MINUS 

SELECT distinct *

FROM TABLE;

 

Depending on the size of the table, you need to tune the query. However, most likely you can get the best performance doing it within DB.  

 

 

Six Stars

Re: Remove all duplicate rows

Thank Vapukov,
How to do it by SQL query?

Six Stars

Re: Remove all duplicate rows

Did you try?

SELECT *

FROM TABLE

MINUS 

SELECT distinct *

FROM TABLE;

Ten Stars

Re: Remove all duplicate rows

in general case (if we accept You are try to exclude all names which only 1 in database, and not include any of records with more than 1 value)

it work as:

 

SELECT 
     t1.*
FROM table_name t1 INNER JOIN 
     (
     SELECT
            -- next could be changed for MAX depending from logic
            MIN(id) 
     FROM table_name
     GROUP BY "name"
     HAVING count(*) = 1 
     ) t2 ON t1.id = t2.id



code above mean - table have primary key and it name  "id", if table do not have primary key, logic could be different - for example

 

SELECT 
     t1.*
FROM table_name t1 INNER JOIN 
     (
     SELECT
            name 
     FROM table_name
     GROUP BY "name"
     HAVING count(*) = 1 
     ) t2 ON t1.name = t2.name

 

it could be adjusted for real structure, because now I type "theoretical code" Smiley Happy

-----------