I have a stream of files arriving continuously from a server, and I need to load them one by one into HDFS. Because multiple files arrive, I need to check whether the content of two files is the same; if it is, I need to reject the duplicate file and ignore its content.
Could anyone please help me achieve this?
As a trivial solution:
- tFileInputRaw to read both files as a single column + tMap to make an inner join.
- If the number of records in the result equals the number in the input, there is no difference.
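Outside Talend, the same "are these two files identical?" check can be sketched in a few lines of Python. This is only an illustration of the logic, not the tFileInputRaw + tMap job itself; the function name `same_content` is made up for the example.

```python
import filecmp


def same_content(path_a, path_b):
    """Return True when both files hold exactly the same bytes."""
    # shallow=False forces a byte-by-byte comparison instead of
    # trusting file size and modification time alone.
    return filecmp.cmp(path_a, path_b, shallow=False)
```

A file for which `same_content` returns True against an already loaded file would be the one to reject.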
You can create a mediation route with a cFile and a cIdempotentConsumer component.
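To show the idea behind an idempotent consumer, here is a minimal sketch in plain Python, not Talend or Camel code. It assumes the message identifier is a SHA-256 digest of the file content, and it uses an in-memory set where a real route would use an idempotent repository. `content_digest` and `DuplicateFilter` are hypothetical names.

```python
import hashlib


def content_digest(path, chunk_size=1 << 20):
    """Hash a file's bytes in chunks so large files are not loaded at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


class DuplicateFilter:
    """Remember the digests already seen and reject repeated content."""

    def __init__(self):
        self.seen = set()  # assumption: in-memory only, lost on restart

    def accept(self, path):
        key = content_digest(path)
        if key in self.seen:
            return False  # same content was already accepted: reject
        self.seen.add(key)
        return True
```

Each incoming file is hashed once and checked against the set, so the cost stays linear in the number of files however many have already arrived.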
Thanks for replying.
I tried this approach with two tFileInputRaw components for the two files, but tFileInputRaw only offers Object as the data type, so I couldn't get the row count.
My situation is this:
I have multiple records coming in from the server, and I need to validate whether their contents are the same. Perhaps tFileInputRaw limits the number of input files we can use.
Thanks for the suggestion; could you please elaborate on the solution, in case I didn't understand the approach?
Sorry that went to you; I was sending that message to Eric.
Anyway, tFileInputFullRow will contain only one file. What should the approach be if I am comparing some 50-60 files?
One parent job iterates over the file list (tFileList), plus one child job (called using tRunJob) that receives the current filename (to avoid comparing a file with itself), iterates over the same file list, and compares the files two by two.
This needs some thought for optimization (for example, don't compare both file1/file2 and file2/file1).
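The pairing logic (never compare a file with itself, never compare the same pair twice) can be sketched in Python as follows. This only illustrates the iteration order, not the tFileList/tRunJob jobs; `find_duplicate_pairs` is a hypothetical name.

```python
import filecmp
from itertools import combinations


def find_duplicate_pairs(paths):
    """Return every unordered pair of files whose bytes are identical."""
    duplicates = []
    # combinations(paths, 2) yields each pair once, in list order,
    # so (file1, file2) appears but (file2, file1) and (file1, file1) do not.
    for first, second in combinations(paths, 2):
        if filecmp.cmp(first, second, shallow=False):
            duplicates.append((first, second))
    return duplicates
```

For 60 files this makes 60 × 59 / 2 = 1770 comparisons instead of the 3600 a naive double loop would run.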