Remove files that have the same content

Five Stars


Hello!

 

I'd like to remove all files that have the same content, keeping one copy of each.

The final result should be a set of files that all have different content.

My file name format is fileName_timestamp.csv

For example:

My directory looks like this:

- fileName_t1m3st4mp.csv

- fileName_0th3rt1m3st4mp.csv

- fileName_4n0th3rt1m3st4mp.csv

 

Content in my files looks like this :

 

fileName_t1m3st4mp.csv

This is a content

fileName_0th3rt1m3st4mp.csv

This is a content

fileName_4n0th3rt1m3st4mp.csv

This is a different content

 

When I run the job:

fileName_0th3rt1m3st4mp.csv should be deleted.

Now my directory should only contain:

- fileName_t1m3st4mp.csv

- fileName_4n0th3rt1m3st4mp.csv

 

I'm using Talend ESB 7.

If you have any suggestions, please share them!

Thanks!


Accepted Solutions
Eleven Stars


removeduplicate.JPG

It worked: the duplicate files were removed. Give it a try.

Regards
Abhishek KUMAR



All Replies
Eleven Stars


Try tFileList, tMemorizeRows, tFileCompare and tFileDelete.

I'm not sure whether these are part of ESB.

Regards
Abhishek KUMAR
Five Stars


Thanks for your response!

Those components are indeed in ESB.

I need to compare each file with all the others, and I'm not sure how to do that with a tFileCompare component, since it only allows one input.

Can you walk me through your thinking?

 

Best regards,

 

Eleven Stars


You are right, you might run into issues with tFileCompare.

1) Get the checksum of each file (the tFileProperties MD5 option in the flow below).

2) Find the files that have the same checksum and delete the duplicates.

tFileList --> tFileProperties (MD5 option) --> tFileOutput

OnSubjobOk

tFileInput --> tUniqRow (get the duplicate file names based on checksum) --> tFlowToIterate --> tFileDelete

This should work.
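
For anyone who wants to see the logic outside Talend, the two steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not the Talend job itself: the function names (`md5_of`, `find_duplicates`, `remove_duplicates`) and the `*.csv` pattern are made up for the example, and files with equal MD5 checksums are treated as having equal content.

```python
import hashlib
from pathlib import Path


def md5_of(path: Path) -> str:
    """Return the MD5 hex digest of a file's bytes, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(directory: Path, pattern: str = "*.csv") -> list[Path]:
    """Return every file whose checksum was already seen on an earlier file."""
    seen: dict[str, Path] = {}
    duplicates: list[Path] = []
    # Sorting by name is an assumption; it decides which copy is kept.
    for path in sorted(directory.glob(pattern)):
        checksum = md5_of(path)
        if checksum in seen:
            duplicates.append(path)
        else:
            seen[checksum] = path
    return duplicates


def remove_duplicates(directory: Path, pattern: str = "*.csv") -> list[Path]:
    """Delete the duplicate files and return the paths that were removed."""
    removed = find_duplicates(directory, pattern)
    for path in removed:
        path.unlink()
    return removed
```

Which copy survives depends on the iteration order. Here the first file in name order is kept, so you would sort differently (for example by timestamp) if a specific copy has to stay.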
Regards
Abhishek KUMAR
Five Stars


Here you're mainly checking the file name, not the actual content.

I think I found something. I can log the content and the file name independently, but I can't find a way to get both of them at the same time.

My goal here is to get an output that contains every file name together with its content (fileName;fileContent).

I guess I'll be able to use a tUniqRow to check for duplicate content once I've figured this out.
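
To make the idea concrete, here is a rough Python sketch of that (fileName;fileContent) output followed by a "unique by content" split. It is an illustration only and not a Talend component: `name_content_rows` and `split_unique_and_duplicates` are hypothetical names, and the UTF-8 encoding and `*.csv` pattern are assumptions.

```python
from pathlib import Path


def name_content_rows(directory: Path, pattern: str = "*.csv") -> list[tuple[str, str]]:
    """Build one (fileName, fileContent) row per file, ordered by file name."""
    return [
        (path.name, path.read_text(encoding="utf-8"))
        for path in sorted(directory.glob(pattern))
    ]


def split_unique_and_duplicates(rows):
    """Keep the first row for each distinct content; later rows are duplicates."""
    seen = set()
    uniques, duplicates = [], []
    for name, content in rows:
        if content in seen:
            duplicates.append((name, content))
        else:
            seen.add(content)
            uniques.append((name, content))
    return uniques, duplicates
```

The file names in the duplicates list are the candidates for deletion. Holding full file contents in memory is fine for small files, but for large ones a checksum per file is lighter.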

JLHZkWa.png

Eleven Stars


tFileProperties computes the checksum from the file content, not the file name.
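
As a quick illustration of why a content checksum ignores the file name, here is a small Python sketch using the example files from the first post. It shows the general behaviour of an MD5 digest over bytes and makes no claim about how tFileProperties is implemented; `md5_hex` is a helper name invented for the example.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 hex digest of the given bytes; the name is never involved."""
    return hashlib.md5(data).hexdigest()


# File names mapped to their contents, as in the example from the question.
files = {
    "fileName_t1m3st4mp.csv": b"This is a content",
    "fileName_0th3rt1m3st4mp.csv": b"This is a content",
    "fileName_4n0th3rt1m3st4mp.csv": b"This is a different content",
}

checksums = {name: md5_hex(data) for name, data in files.items()}

# Same content gives the same checksum, whatever the file is called.
same = checksums["fileName_t1m3st4mp.csv"] == checksums["fileName_0th3rt1m3st4mp.csv"]
different = checksums["fileName_t1m3st4mp.csv"] != checksums["fileName_4n0th3rt1m3st4mp.csv"]
print(same, different)
```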
Regards
Abhishek KUMAR
Eleven Stars


Did it solve your problem?
Regards
Abhishek KUMAR
Five Stars


Which component did you rename "selectMD5Option"?
I'll try that.
