Four Stars

Checking duplicate in the content of 2 or more records

I have a stream of files arriving continuously from a server, and I need to load them one by one into HDFS. Because multiple files arrive, I need to check whether the content of two files is the same; if it is, I need to reject the duplicate file and not load its content.

Could anyone please help me achieve this?

 

Thanks

  • Big Data
  • Data Integration
8 REPLIES
Eleven Stars TRF

Re: Checking duplicate in the content of 2 or more records

As a trivial solution:

- Use tFileInputRaw to read both files as a single column, then a tMap to make an inner join.

- If the number of records in the result equals the number of records in the input, there is no difference.
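
Outside Talend, the row-count idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each file is already loaded as a list of lines, the files fit in memory, and the helper name `same_rows` is hypothetical. It counts repeated lines explicitly, because a plain inner join on files with repeated lines would inflate the result count.

```python
from collections import Counter


def same_rows(rows_a, rows_b):
    """Return True when both inputs hold the same rows, ignoring order."""
    count_a = Counter(rows_a)
    count_b = Counter(rows_b)
    # "Inner join": keep each row as many times as it occurs in both inputs.
    matched = sum((count_a & count_b).values())
    # No difference when every row of both inputs found a partner.
    return matched == sum(count_a.values()) == sum(count_b.values())
```

It could be called as `same_rows(open("file1.txt").read().splitlines(), open("file2.txt").read().splitlines())`, where the file names are placeholders.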


TRF
Seven Stars

Re: Checking duplicate in the content of 2 or more records

Hi,

 

You can create a mediation route with a cFile and a cIdempotentConsumer component.

 

https://help.talend.com/reader/YtJvt25ynUgZ~sfL~L5dAg/2Gc7F~jW8E2LUycJ0xpLdQ
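
To illustrate the idea behind an idempotent consumer (remember an identifier for every message already seen and drop repeats), here is a rough Python sketch, not Talend code. Using a SHA-256 digest of the file content as the identifier is an assumption made for this thread's use case, and the class name `ContentDeduplicator` is hypothetical. The set lives in memory only, so it is lost on restart.

```python
import hashlib


class ContentDeduplicator:
    """Accept each distinct content once and reject later repeats."""

    def __init__(self):
        # Digests of every content accepted so far.
        self._seen = set()

    def accept(self, content: bytes) -> bool:
        """Return True for new content, False for an already-seen duplicate."""
        digest = hashlib.sha256(content).hexdigest()
        if digest in self._seen:
            return False
        self._seen.add(digest)
        return True
```

Each incoming file would be read as bytes and passed to `accept`; only files for which it returns True would be loaded into HDFS.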

 

Eric

Four Stars

Re: Checking duplicate in the content of 2 or more records

Thanks for replying.

 

I tried the approach with two tFileInputRaw components for the two files, but tFileInputRaw only offers Object as the data type, so I couldn't get the row count.

 

My situation is this:

I have multiple records coming in from the server, and I need to validate whether their contents are the same. Maybe tFileInputRaw also limits the number of input files we can use.

 

Thanks for the suggestion. Could you please elaborate on the solution in case I misunderstood the approach?

Eleven Stars TRF

Re: Checking duplicate in the content of 2 or more records

Sorry, I meant tFileInputFullRow instead of tFileInputRaw.


TRF
Four Stars

Re: Checking duplicate in the content of 2 or more records

Thanks for replying. 

 

I am looking for a solution in Talend Open Studio.

 

Thanks

Eleven Stars TRF

Re: Checking duplicate in the content of 2 or more records

tFileInputFullRow is available in TOS (Talend Open Studio).

TRF
Four Stars

Re: Checking duplicate in the content of 2 or more records

Sorry that went to you; I meant to send that message to Eric.

 

Anyway, tFileInputFullRow will contain only one file. What should the approach be if I am comparing some 50-60 files?

 

thanks

Eleven Stars TRF

Re: Checking duplicate in the content of 2 or more records

Use one parent job to iterate over the file list (tFileList), plus one child job (called using tRunJob) that receives the current filename (to avoid comparing a file with itself), iterates over the same file list, and compares the files two by two.

This needs some thought for optimization (for example, don't compare both file1/file2 and file2/file1).
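
As a rough illustration of that optimization outside Talend, the Python sketch below visits each unordered pair of files exactly once. It assumes the file paths are already collected in a list (the role tFileList plays in the job), and the function name `duplicate_pairs` is hypothetical.

```python
import filecmp
from itertools import combinations


def duplicate_pairs(paths):
    """Yield (first, second) for every pair of files with identical bytes."""
    # combinations() emits each unordered pair once, so file1/file2 is
    # checked, while file2/file1 and file1/file1 are never generated.
    for first, second in combinations(sorted(paths), 2):
        # shallow=False forces a byte-by-byte comparison instead of
        # trusting file size and modification time.
        if filecmp.cmp(first, second, shallow=False):
            yield first, second
```

For 50-60 files this is still on the order of n*(n-1)/2 comparisons; hashing each file once and grouping by digest would be a possible alternative if that becomes too slow.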


TRF