Five Stars

how to do Data comparison(source & destination) and decide if its new data or existing data

Hi,

 

I want to implement one functionality in which need to compare source data with destination and depending on the result would like to insert or update at destination tables.

 

How we can implement this in Talend.

 

Thanks

6 REPLIES
Twelve Stars TRF
Twelve Stars

Re: how to do Data comparison(source & destination) and decide if its new data or existing data

Hi,

Depending of the target, you may have the choice to decide the operation (insert, update, insert or update, update or insert, upsert and so on).

So, in these cases, you don't have to decide to split the flow giving 2 data sets with 1 operation (insert / update).

If you don't have this facility on your target (Excel for example), you need to read the existing data first, then using a classical tMap with a lookup, get the matching rows on a 1rst flow (for update) and rejected rows on a 2nd flow (for insert).

 

Hope this helps.


TRF
Five Stars

Re: how to do Data comparison(source & destination) and decide if its new data or existing data

Hi,

 

My Target is postgressql and even if 1 field id changes i would like to update and for new records would like to insert.

 

Do you have any example for this implementation.

Six Stars

Re: how to do Data comparison(source & destination) and decide if its new data or existing data

Hi there,

 

If you need a database independent way to know if records in a table have changed, using the free version of Talend (the subscription versions already have CDC components for this) then you can use the following general approach:

 

  • Run your query as normal, using whichever database input component is appropriate
  • Feed the flow through a tAddCRCRow, flagging all of the fields you wish to track for changes in the "Use in CRC" column
  • Store the resultant CRC values for all records in a separate database table ("latest"), along with a suitable unique key for the record. Don't bother storing anything more than this, as it'll take up unnecessary space, and slow down everything else you'll be doing. The table should be emptied first.
  • You'll now be able to compare the CRC value for each record against the one from the previous run ("previous", see below) to identify:
    • New Records - Will have a record in the "latest" table, but not the "previous" one
    • Updated Records - Will have records in both the "latest" and "previous" tables, but the CRC value will be different
    • Deleted Records - Will have a record in the "previous" table, but not the "latest" one
  • Use the above to handle the different cases in your job, using the unique key to query the source database and retrieve the original records.
  • Copy the records in the "latest" table to a "previous" one for comparison on the next run. This table should be emptied first.

 

Things to be aware of:

 

  • If you ever change the fields used in the CRC, or even their type, then the CRC values you have stored will all be different, even if the underlying data hasn't changed, so in this case you'll need to alter the CRC fields after a successful run of the job, and then re-run it, ignoring the changes, to populate the "previous" table
  • You must ensure that only one instance of the job can ever be run at a time. As it's emptying tables and moving records etc., this will go horribly wrong if two jobs are trying to do this at the same time. As this is so important, I'd recommend using a "lock" of some sort, e.g. creating an empty file on disk somewhere, and then deleting it once the job has completed, then if this file is present when your job runs, you know there is another instance running, so just exit immediately. You could also do something similar with records in a database table, but in all honesty, the lock file approach is simpler and quicker.
  • I'm really busy at the moment, and have no time/desire to provide an actual job for the above which you can just pick up and use. If nothing else, there are likely to be a multitude of questions as a result of doing so. To implement this sort of thing you need to fully understand what you're doing, so if how you'd do the above isn't immediately obvious to you (I hope I've explained it clearly, and from a Talend perspective it only really requires simple database components, plus some basic SQL knowledge, specifically JOINS, to compare the "previous" and "latest" records) , then please spend the time necessary to learn these components and concepts, rather than just asking someone on this forum to do it for you. If you put this into a production environment and anything does go wrong, then you'll need to know why, and how to deal with it, otherwise you're going to look a fool, and if I were your employer, you'd have some very serious questions to answer.

 

Regards,

 

 

Chris

Six Stars

Re: how to do Data comparison(source & destination) and decide if its new data or existing data

Just though I'd preempt any comments/questions regarding generating the CRC/checksum directly in the database instead of in the Talend job.

 

Most DBMSs have a checksum or similar function, and using this is definitely much more efficient than returning a full result set from a query, and generating a CRC for each row in Talend.

 

However, each DBMS has its own approach, with everything from MSSQL Server's simple, elegant CHECKSUM function, to MySQL's awful "convert every individual column to a string, handling NULLs correctly, concatenate them all together, and generate a CRC or MD5 hash from this" approach.

 

Not all data sources will have this of course, and using the approach detailed above, you could even do change detection on data from a spreadsheet or CSV file.

 

Doing things in a Talend job means you can also join multiple data sources (of completely different types if necessary) and carry out any processing, filtering etc. before generating a CRC of the columns you're interested in tracking changes for, which is really powerful.

 

If you do decide to generate your checksums on the database server, avoid just doing a CHECKSUM(*) for an entire table. This might appear to be a nice quick way to do things, but if you add/remove fields on the table/query, then the checksums returned for each row will change immediately. Instead, either use a view which explicitly defines the columns to be returned, in which case you *can* do a CHECKSUM(*) on that (recommended) or specify the individual fields in the CHECKSUM function. Doing this will mean that you can add new fields and they wont affect the checksums until you're ready to update the view/query, and can carry out a refresh of the "previous" date as detailed above.

 

 

Five Stars

Re: how to do Data comparison(source & destination) and decide if its new data or existing data

Thanks for the information. 

 

Just to clarify i am not expecting job implementation from anyone here. just requesting direction to some problems which we are facing in implementation. I think this community is to share problems and is someone faced similar issues in the past and all, etc.. we can discuss and help each other..

Six Stars

Re: how to do Data comparison(source & destination) and decide if its new data or existing data

Hi there,

 

Absolutely, and my apologies if it sounded like I was suggesting you were expecting this. If I'd felt you were looking for a full implementation rather than just help/suggestions as to how your goal might be accomplished, then I'd simply not have replied to your post at all.

 

There have been posts recently from people who are certainly looking for someone to do their work for them, and they've blatantly ignored perfectly good, straightforward suggestions as to how they might accomplish their task, and instead asked again for a full solution. My comments in this regard were very much intended for them.

 

I use the free versions of Talend products for a good deal of my day-to-day work, and so think it's only fair that I try to spend an hour each day giving back to the community via this forum. I'm therefore more than happy to suggest approaches which have worked or me in the past, and help where I can. If that involves exchanges over many days, to work through a problem, then I'm happy to spend the time doing this, as long as I feel that it's a two way process.

 

 

Regards,

 

 

Chris