One Star

[resolved] Difference in data rows

How can I find the difference between data in two data files?
I have two files that contain rows of customer data, with the first column being the primary key.
Suppose the first file (reference.csv) has the following records:
1,John,Smith,04/02/1970,36000
2,Jane,Doe,09/23/2000,12000
3,Richard,Johnson,12/02/1990,29000
And the second file (compare.csv )has these following records:
1,John,Smith,04/02/1970,77777
2,Jane,Doe,09/23/2000,12000
3,Richard,Johnson,12/02/1990,29000
4,Mary,Jones,03/12/1956,52000
I would like to have an output file (difference.csv) with the following records:
1,John,Smith,04/02/1970,77777
4,Mary,Jones,03/12/1956,52000
So, the job would look for any rows in compare.csv that differ from or do not exist in reference.csv, then output only those rows to difference.csv.
In the previous example, the output showed the first record from compare.csv because the fifth column changed from "36000" in reference.csv to "77777" in compare.csv.
It also showed the fourth record from compare.csv because it did not exist in reference.csv.
1 ACCEPTED SOLUTION

Accepted Solutions
One Star

Re: [resolved] Difference in data rows

My 2 cents, with help of answer that I saw in another thread:
all fields that are used as foreign keys, must be initialized. If you have null values, it will be "join rejected".
NULL != NULL
3 REPLIES
One Star

Re: [resolved] Difference in data rows

Never mind, I solved the problem. It wasn't working before because there was a problem with the format of the new data file, but now it works. Solution:
I used the two input files as inputs to a tmap, with the newer data file (compare.csv) being the main row and the older one (reference.csv) being the lookup input.
I used every field from the main input (compareIn) as a foreign key in the lookup input (referenceIn) so that it compares entire rows,
then used every field from compareIn as an output (differenceOut).
I activated the "Inner join" on referenceIn and also activated "Inner Join reject" on differenceOut,
so that it would reject all rows from compareIn that are identical to any rows in referenceIn,
and include only all other rows.
I'm not sure if this is the most logically sound way to perform the operation. I'm very new to Talend Open Studio. If there's a more elegant solution, please post it or where to find it.
It works perfectly, however.
One Star

Re: [resolved] Difference in data rows

Hi Mark,
I tried to do the same thing what u did. But i am not able to see constraints or joins in tMap. Please help me.
My requirement is to compare a DB table with a flat file.
I converted the DB table to flat file and compare with the other positioned file.
Regards,
Manoj.V
One Star

Re: [resolved] Difference in data rows

My 2 cents, with help of answer that I saw in another thread:
all fields that are used as foreign keys, must be initialized. If you have null values, it will be "join rejected".
NULL != NULL