Remove all duplicate rows

Five Stars

Remove all duplicate rows

Hi all,

I'm new to Talend and I have a question,

Have any component let me remove all duplicate rows?

Sample data:

Name, Age

HieuDoan, 15

LinhNa, 16

HieuDoan, 20

NamL, 17

 

I have tried tUniqRow but it still keep the first duplicate row and not remove all duplicate:

The ouput I got with tUniqRow :

HieuDoan, 15

LinhNa, 16

NamL, 17

 

The output i need:

LinhNa, 16

NamL, 17

 

Can you give me a suggest? Thanks

Community Manager

Re: Remove all duplicate rows

Hi
Not a component can achieve it directly, you can cache the result into memory and do an inner join in the next subjob to get the expected result. eg:
tfileinputdelimimited--main--tuniqrow--unique--thashoutput1
--duplicated--thashoutput2
           |
onsubjobok
          |
tHashinput1--main--tmap--tlogrow
                                 |
                           lookup
                                |
                      thashinput2
thashinput1: read data from thashoutput1
thashinput2: read data from thashoutput2

on tMap: do an inner join based on the keys you defined on tuniqrow, set the 'Catch lookup inner join reject' option as true in the output table.

Regards
Shong

----------------------------------------------------------
Talend | Data Agility for Modern Business
Five Stars

Re: Remove all duplicate rows

Hi Shong,
My data source is Redshift and it have a lots data, where can I cache result into to use later ?
Community Manager

Re: Remove all duplicate rows

Store the data on disk instead of memory if there are lots of data.

----------------------------------------------------------
Talend | Data Agility for Modern Business
Forteen Stars

Re: Remove all duplicate rows

Talend do absolutely correct - tUniqRow DO NOT REMOVE all rows with duplicated, it is remove duplicates and leave original unique value

 

Your example - illustrate it great

 

if You want remove all rows which value in table more than 1 time

it could be easy done by SQL query, if You prefer Talend, You can do

 

tRedshiftInput -> tAggregate (count by value) -> tFilterRwo (condition - count == 1)

result will be as You wish :

 

LinhNa, 16
NamL, 17

 

 

as addition:

HieuDoan, 15 and HieuDoan, 20

generally  also 2 people with same name, but different age :-), but this is other topic

-----------
Seven Stars

Re: Remove all duplicate rows

If your source is Redshift, why not do this in DB, with a simple query?

 

SELECT *

FROM TABLE

MINUS 

SELECT distinct *

FROM TABLE;

 

Depending on the size of the table, you need to tune the query. However, most likely you can get the best performance doing it within DB.  

 

 

Five Stars

Re: Remove all duplicate rows

Thank Vapukov,
How to do it by SQL query?

Seven Stars

Re: Remove all duplicate rows

Did you try?

SELECT *

FROM TABLE

MINUS 

SELECT distinct *

FROM TABLE;

Forteen Stars

Re: Remove all duplicate rows

in general case (if we accept You are try to exclude all names which only 1 in database, and not include any of records with more than 1 value)

it work as:

 

SELECT 
     t1.*
FROM table_name t1 INNER JOIN 
     (
     SELECT
            -- next could be changed for MAX depending from logic
            MIN(id) 
     FROM table_name
     GROUP BY "name"
     HAVING count(*) = 1 
     ) t2 ON t1.id = t2.id



code above mean - table have primary key and it name  "id", if table do not have primary key, logic could be different - for example

 

SELECT 
     t1.*
FROM table_name t1 INNER JOIN 
     (
     SELECT
            name 
     FROM table_name
     GROUP BY "name"
     HAVING count(*) = 1 
     ) t2 ON t1.name = t2.name

 

it could be adjusted for real structure, because now I type "theoretical code" Smiley Happy

-----------

2019 GARNER MAGIC QUADRANT FOR DATA INTEGRATION TOOL

Talend named a Leader.

Get your copy

OPEN STUDIO FOR DATA INTEGRATION

Kickstart your first data integration and ETL projects.

Download now

What’s New for Talend Summer ’19

Watch the recorded webinar!

Watch Now

Best Practices for Using Context Variables with Talend – Part 4

Pick up some tips and tricks with Context Variables

Blog

How Media Organizations Achieved Success with Data Integration

Learn how media organizations have achieved success with Data Integration

Read

Downloads and Trials

Test drive Talend's enterprise products.

Downloads