I have a source file which I receive on daily basis and it has data appended from previous day's data and some updated records. Since it is file there is no key column or any date fields. How can I perform Incremental load using Talend. Is there any component to fetch most recent data from file. Please assist.
Thanks a lot.
One way is to keep track of the records that are read earlier at the db level. When the job read the file again, it can fetch the value from the db and using that you can change you header value.
You can keep the header value in the context that can be updated on each run.
Hope that helps !
Thanks for your reply. Is there any function like MD5 in talend to fetch the updated data and new data from source file. Based on MD5 checksum value will compare with source and target, changes/null in checksum value will treat as new or update.
Please let me know if we have any MD5 similar in Talend.
Yes, Talend have a component tAddCRCRow which calculates a surrogate key based on one or several columns and adds it to the defined schema.
For details you can refer the documentation at the below link -
Thanks for your prompt reply. I am trying to learn and explore more on talend. I tried using tAddCRCRow and initial load it generated a unique numeric value in CRC column for all the records. Next run I have few new records and changes to existing records. for new records it is getting new crc value but for updated once no change. How to compare the previous(first load) target file and current source file to fetch updated records.
Thanks in advance...
What I know and have faced in our implementation, new CRC do get created whenever there is a change in the value of any column.
However we faced the same problem with tAddCRCRow when we were pushing more than 1 million of records. With that many number of records there is a probability, we may get duplicate CRC.
So, in that case its not a reliable solution.
You'll need to store the CRC value for each record somewhere, usually a database table created for this, and then do a lookup and compare them to see if they've changed since the job was last run.
However, you'll need a unique identifier for each record to store along with the CRC, so you can then look up the previous value, and you've stated that there is no key for the input data.
If, and only if, you can guarantee that the order of the records in your source file won't change, and there's absolutely no chance of records being deleted, just new ones added to the end of the file, then you could add a new integer field into the flow, probably with a tMap, then use Talend's Sequence function to populate this for each row, and use this as your key when storing and looking up the previous CRC value.
Just out of interest, if your source data has no unique key, how do you intend updating the existing records in your database?