One Star

Delete duplicate rows / records in a table or a file ?

Hi !
Is there a solution to delete duplicate rows / records in an input file or an input database?
Thanks a lot.
Kim

Accepted Solutions
Employee

Re: Delete duplicate rows / records in a table or a file ?

There is currently no component able to de-duplicate rows automatically. However, you can use tPerlRow for this.
Consider a job with a tFileInputDelimited, a tPerlRow and a tLogRow, linked with row-type links. With an input schema whose first two columns form the key, put the following Perl code in the tPerlRow:
next if (++$unique_keys->{ $row->[0] }{ $row->[1] } > 1); # key on the first 2 columns

In the input file, I have:
000048;2030708847018;-000000764930;-0000079000
000108;3597670011292;-000000096000;-0000065000
000108;3597670011292;-000000098000;-0000065000 <= duplicate of the previous line, considering only the first 2 columns
000017;2030708846752;-000005140270;-0000043000
000048;2030708847759;-000003993560;-0000035000
000032;2030708845274;-000001032030;-0000020000
000102;2030708845311;-000001248220;-0000018000
000104;2030708847063;-000000103710;-0000015000
000032;2030708845274;-000001032033;-0000020000 <= duplicate of the 6th line
000004;2030708845182;-000000969950;-0000013000

In my output, I have:
000048|2030708847018|-000000764930|-0000079000
000108|3597670011292|-000000096000|-0000065000
000017|2030708846752|-000005140270|-0000043000
000048|2030708847759|-000003993560|-0000035000
000032|2030708845274|-000001032030|-0000020000
000102|2030708845311|-000001248220|-0000018000
000104|2030708847063|-000000103710|-0000015000
000004|2030708845182|-000000969950|-0000013000

Two lines fewer.
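For reference, here is a minimal standalone Perl sketch of the same logic outside Talend, run against the sample data above (the ';' input and '|' output delimiters mirror the listings):

#!/usr/bin/perl
# Minimal sketch: keep only the first occurrence of each two-column key.
use strict;
use warnings;

my %unique_keys;
while (my $line = <DATA>) {
    chomp $line;
    my @row = split /;/, $line;
    # skip the row if this two-column key has already been seen
    next if ++$unique_keys{ $row[0] }{ $row[1] } > 1;
    print join('|', @row), "\n";
}

__DATA__
000048;2030708847018;-000000764930;-0000079000
000108;3597670011292;-000000096000;-0000065000
000108;3597670011292;-000000098000;-0000065000
000017;2030708846752;-000005140270;-0000043000
000048;2030708847759;-000003993560;-0000035000
000032;2030708845274;-000001032030;-0000020000
000102;2030708845311;-000001248220;-0000018000
000104;2030708847063;-000000103710;-0000015000
000032;2030708845274;-000001032033;-0000020000
000004;2030708845182;-000000969950;-0000013000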

All Replies
Employee

Re: Delete duplicate rows / records in a table or a file ?

In addition, I would say that we've just developed a dedicated component for this specific task: tUniqRow. Uniqueness is based on the schema's key columns.
This component was developed today (based on your post from today!). It will be downloadable in a few days...
Feel free to email me if you need it urgently!
Regards,
One Star

Re: Delete duplicate rows / records in a table or a file ?

Will the new component have a log output? A customizable warning message?
For example, the list of duplicate rows, or the number of duplicated rows, to simplify corrections? ;-))
Employee

Re: Delete duplicate rows / records in a table or a file ?

tUniqRow has two associated global variables: NB_LINE_INPUT and NB_LINE_OUTPUT. To know how many rows were rejected, simply compute the difference. In a tMsgBox triggered after your main sub-job, put this message:
sprintf(
    '%d lines were rejected by tUniqRow_1',
    $_globals{tUniqRow_1}{NB_LINE_INPUT} - $_globals{tUniqRow_1}{NB_LINE_OUTPUT}
)

I didn't plan to store the rejected rows. The possible issue is that they might be numerous and consequently take a lot of memory.
One Star

Re: Delete duplicate rows / records in a table or a file ?

Maybe not the rejected rows themselves, but the line numbers would be useful for correcting the problem.
Which line is rejected? Always the second occurrence, then the third, and so on?
Employee

Re: Delete duplicate rows / records in a table or a file ?

Currently, the tMap component offers the possibility of having two outputs, one being the complement of the other. As an illustration, create a job with a tFileInputDelimited, a tMap, a tFileOutputDelimited (for unique rows) and a tLogRow (for duplicate rows). See the associated screenshot.
My input file is still:
000048;2030708847018;-000000764930;-0000079000
000108;3597670011292;-000000096000;-0000065000
000108;3597670011292;-000000098000;-0000065000 <= duplicate of the previous line, considering only the first 2 columns
000017;2030708846752;-000005140270;-0000043000
000048;2030708847759;-000003993560;-0000035000
000032;2030708845274;-000001032030;-0000020000
000102;2030708845311;-000001248220;-0000018000
000104;2030708847063;-000000103710;-0000015000
000032;2030708845274;-000001032033;-0000020000 <= duplicate of the 6th line
000004;2030708845182;-000000969950;-0000013000

In the tFileInputDelimited, I name the 4 input columns: col1, col2, col3 and col4.
Put a row link between the tFileInputDelimited and the tMap.
In the tMap:
1. add an output table, name it "uniq"
2. add a constraint on the "uniq" output: ++$uniq_keys->{ $row1->{col1} }{ $row1->{col2} } == 1
3. add another output table, name it "duplicate" and check the "reject" checkbox
4. drag and drop all the input columns into "uniq" and into "duplicate"
5. save your tMap configuration by clicking "OK"
Outside of the tMap, link it to the tFileOutputDelimited with the row link "uniq" and link it to the tLogRow with the row link "duplicate".
In the tFileOutputDelimited, you'll have all the unique rows; in the run-job output (Perl standard output), you'll see the duplicate rows:
000108|3597670011292|-000000098000|-0000065000
000032|2030708845274|-000001032033|-0000020000
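As a rough Perl sketch of what the tMap constraint plus the "reject" checkbox compute (outside Talend, trimmed to the two duplicated keys from the sample):

#!/usr/bin/perl
# Sketch: the first occurrence of a key satisfies the constraint ("uniq"
# output); later occurrences fail it and land in the reject ("duplicate") table.
use strict;
use warnings;

my %uniq_keys;
my (@uniq, @duplicate);
while (my $line = <DATA>) {
    chomp $line;
    my @row = split /;/, $line;
    if (++$uniq_keys{ $row[0] }{ $row[1] } == 1) {
        push @uniq, join('|', @row);       # constraint true: "uniq" output
    } else {
        push @duplicate, join('|', @row);  # constraint false: "duplicate" output
    }
}
print "uniq rows:\n",      map { "$_\n" } @uniq;
print "duplicate rows:\n", map { "$_\n" } @duplicate;

__DATA__
000108;3597670011292;-000000096000;-0000065000
000108;3597670011292;-000000098000;-0000065000
000032;2030708845274;-000001032030;-0000020000
000032;2030708845274;-000001032033;-0000020000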

One Star

Re: Delete duplicate rows / records in a table or a file ?

Thanks! ;-))
One Star

Re: Delete duplicate rows / records in a table or a file ?

The tMap component is magic!!! :-)
One Star

Re: Delete duplicate rows / records in a table or a file ?

For large files you may find that
next if (++$unique_keys->{ $row->[0] }{ $row->[1] } > 1); # key on the first 2 columns
takes a toll on your memory usage as you accumulate more and more unique keys. You can use DBM-backed storage in similar code with MLDBM (or other comparable Perl modules).
http://search.cpan.org/~chamas/MLDBM-2.01/lib/MLDBM.pm
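A minimal sketch of what that could look like, assuming the DB_File and Storable backends are available. One caveat: MLDBM cannot update nested values in place, so the top-level entry has to be fetched, modified and stored back:

#!/usr/bin/perl
# Sketch: the %unique_keys hash tied to an on-disk DBM file via MLDBM.
use strict;
use warnings;
use Fcntl;
use MLDBM qw(DB_File Storable);   # DB_File storage, Storable serialization

tie my %unique_keys, 'MLDBM', 'unique_keys.db', O_CREAT | O_RDWR, 0640
    or die "Cannot tie unique_keys.db: $!";

# MLDBM does not support in-place updates of nested structures,
# so fetch the whole entry, bump the counter, and write it back.
sub seen_before {
    my ($key1, $key2) = @_;
    my $inner = $unique_keys{$key1} || {};
    my $count = ++$inner->{$key2};
    $unique_keys{$key1} = $inner;
    return $count > 1;
}

# in the tPerlRow, the test then becomes something like:
# next if seen_before($row->[0], $row->[1]);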
One Star

Re: Delete duplicate rows / records in a table or a file ?

Hi,
Regarding the uniqueness check of data to be loaded, I have a problem: the tUniqRow component checks uniqueness, based on the indicated columns, only within the file the job is currently processing, not against rows already loaded into the table.
The job flow is as follows:
tFileList --> tFileInputDelimited --> tMap --> tUniqRow --(uniques)--> tVerticaOutput

If the table already has:      and the file being processed has:

id  name                        id  name
1   a                           4   d
2   b                           5   g
3   c                           1   a

the job I have configured inserts all values, so the resultant table has:

id  name
1   a
2   b
3   c
4   d
5   g
1   a   <= I don't want this row

Is there any way to handle this case, since Vertica does not check key constraints at load time?
Seven Stars

Re: Delete duplicate rows / records in a table or a file ?

You should create a new topic for a new question.
Unfortunately, tVerticaOutput does not have an "Insert if not exist" option for "Action on data", so your options are:
1. Use the "Insert or Update" or "Update or insert" option if that's suitable; or
2. Read the existing data as a lookup to tMap, sending only the inner-join rejects to Vertica (a sketch of what this computes follows below).
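A minimal Perl sketch of what option 2 computes; the ids in %existing are hypothetical stand-ins for the lookup read from the Vertica table:

#!/usr/bin/perl
# Sketch: rows whose key matches the lookup are dropped (inner join);
# rows with no match (inner-join rejects) are the ones safe to insert.
use strict;
use warnings;

# hypothetical lookup: ids already present in the target table
my %existing = map { $_ => 1 } (1, 2, 3);

my @incoming = ([4, 'd'], [5, 'g'], [1, 'a']);
for my $row (@incoming) {
    my ($id, $name) = @$row;
    next if $existing{$id};        # matched the lookup: already in the table
    print "insert: $id $name\n";   # inner-join reject: new row to insert
}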
One Star

Re: Delete duplicate rows / records in a table or a file ?

Hi,

Can you show the tMap settings that achieve this condition?
"2. Read the existing data as a lookup to tMap, sending only inner-join rejects to Vertica."

with regards,
banu
Seven Stars

Re: Delete duplicate rows / records in a table or a file ?

Read scenarios 1 and 2 of the tMap help.
One Star

Re: Delete duplicate rows / records in a table or a file ?

Hi,
"2. Read the existing data as a lookup to tMap, sending only inner-join rejects to Vertica."
I have used the approach above, but the problem is that the very first time we run the job, the DB table is empty. The lookup therefore returns zero rows, the input data is never compared, and hence no data is inserted.

tFileList ----> tFileInputDelimited ----> tMap ----> tVerticaOutput
                                           |
                                           |
                                      tVerticaInput

How can this be solved?
Moderator

Re: Delete duplicate rows / records in a table or a file ?

Hi,
Link a tUnite component to the tFileInputDelimited to unite your input data from the tFileList:

tFileList ----> tFileInputDelimited ----> tUnite ----> tMap ----> tVerticaOutput
                                                        |
                                                        |
                                                   tVerticaInput

In addition to that, check the "Trim all the String/Char columns" option in the Advanced settings of tVerticaInput.
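As a rough Perl sketch of why uniting first matters: with all the files merged into one flow, a duplicate check can track keys across files rather than within each file separately (the glob pattern below is a hypothetical stand-in for the tFileList):

#!/usr/bin/perl
# Sketch: one %seen hash shared across every file, so a key loaded from
# an earlier file is still caught as a duplicate in a later one.
use strict;
use warnings;

my %seen;
for my $file (glob 'input_*.csv') {        # hypothetical file pattern
    open my $fh, '<', $file or die "Cannot open $file: $!";
    while (my $line = <$fh>) {
        chomp $line;
        my ($id) = split /;/, $line;
        next if $seen{$id}++;              # already seen in an earlier row or file
        print "$line\n";
    }
    close $fh;
}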
Let me know if it is OK with you
Best regards
Sabrina
One Star

Re: Delete duplicate rows / records in a table or a file ?

Hi all,
I have two files, for example:

File1.column1    File2.column1
1                1
2                2
3                5
4                6

From these I want to get, from each file, only the values that do not exist in the other.
I used flow as
tFileInputDelimited (1, 2) ----> tUnite ----> tSortRow ----> tUniqRow ----> tFileOutputDelimited
I'm getting output as:

column1 (using unique)    column1 (using duplicate)
1                         1
2                         2
3
4
5
6

The required output is:

column1 (from File1)    column1 (from File2)
3                       5
4                       6

That is, I want to drop the values that exist in the other file, as mentioned above.
Could anyone please provide a solution?
Thanks in advance.

Re: Delete duplicate rows / records in a table or a file ?

Hello, I've got something, but it's a bit complicated (you have to read the same file twice, and there are two tMaps).
The key is to get the right type of join/reject on the tMap.
First you read the two files, do an inner join between them, and catch the inner-join rejects in logger #1.
Then you catch the output rejects and, through an inner join with a second read of file #2, catch the inner-join rejects again in logger #2.
You then have two outputs with your columns.
   See screenshots.
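In plain Perl, the result the two tMaps compute is a symmetric difference; a minimal sketch with the sample values from the question:

#!/usr/bin/perl
# Sketch: values only in File1 and values only in File2 (symmetric difference).
use strict;
use warnings;

my @file1 = (1, 2, 3, 4);
my @file2 = (1, 2, 5, 6);

my %in1 = map { $_ => 1 } @file1;
my %in2 = map { $_ => 1 } @file2;

# inner-join rejects of File1 against File2: values only in File1
my @only_in_file1 = grep { !$in2{$_} } @file1;   # 3, 4
# inner-join rejects of File2 against File1: values only in File2
my @only_in_file2 = grep { !$in1{$_} } @file2;   # 5, 6

print "only in File1: @only_in_file1\n";
print "only in File2: @only_in_file2\n";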
One Star

Re: Delete duplicate rows / records in a table or a file ?

Thanks for your reply.
But how do I make the flow a lookup when joining from the tFileInputDelimited to the tMap?

Re: Delete duplicate rows / records in a table or a file ?

I don't know if that's what you mean, but if you want to switch a link from 'main' to 'lookup', just right-click on it, then "set as main" / "set as lookup".
One Star

Re: Delete duplicate rows / records in a table or a file ?

I followed the same settings as in the screenshots above, but could not get it to work.
The same results come out again:

column1 (using unique)
1
2
3
4
5
6

But I want the output as:

column1 (using duplicate)
1
2

Re: Delete duplicate rows / records in a table or a file ?

Could you add some screenshots of your job / tMap settings / input files?
One Star

Re: Delete duplicate rows / records in a table or a file ?

Thanks for your reply.
Yes, I got the output from File1 as:
Column1
3
4
But in the same job I also want to get the unique values from File2:
5
6
What could I do?