Get records from a DB table without a timestamp column incrementally

One Star

Get records from a DB table without a timestamp column incrementally

Hi buddies,
Do you know how to get records from a DB table without a timestamp column incrementally? Now I am facing such a thing ~ The source tables do not have a timestamp column to indicate when this record is created, but I need to get the records every night incrementally...
Anybody can help? ~~
Thank you so much!! Smiley Happy
Four Stars

Re: Get records from a DB table without a timestamp column incrementally

Hi,
Usually when timestamp is not available, unique key column is reference key. It is also assumed that there are no previous deletions or updates in the source data for past records.
Assuming that above is the case. You create a system table and after completion of your initial load, take the max_id from your source and put it inside the table with timestamp.
- During next execution, select all the records from source having ID > max_id. This will give you all the new records.
- Once this step is done, update your system table with new max_id and timestamp.
Repeat this process and you would get incremental records.
This is one scenario and there could be other scenarios as well.
Please put your design and source data details and we can identify more options.
Vaibhav
One Star

Re: Get records from a DB table without a timestamp column incrementally

Hi Vaibhav,
Thank you so much!
Yes, "max ID" is a good way, but, how to get the updated records?
We have many tables, but most of them do not have "CreateTime" or "UpdateTime" column. We cannot get the change in triggers, because that will impact the performance of the production database ~ we can only do data extracting termly.
And, the "max ID" may have some issues, for Oracle, we use sequences, and the sequences have cache, it means, if there are 2 threads are using this sequence for new ID (these 2 threads are both for inserting data to the same table), thread1's current indexes are from 10001 to 10010, thread2's current indexes are from 10011 to 10020. You know, it is possible that the record with ID 10015 is inserted into the table before the record with ID 10005.
So, do you have a way to handle these restrictions?
Thank you so much, Smiley Happy
Four Stars

Re: Get records from a DB table without a timestamp column incrementally

Usually when you don't have a timestamp in the source column many times it becomes tricky to identify updated and deleted records.
Regarding how to identify the changed, updated or deleted reocrds?
I am not sure how much records you have in your source data. But if the records are not in millions, then you can do a compare of source to target and identify
- Reject records from source and
- Reject records from target
- tMap will help you in this, but to get the deleted records you will have to swap main and lookup.
This will way you will get deleted and updated records from source.
Regarding the thread issue...
You will have to identify what is your business scenario and how can you identify the .
Or you can try to capture last record from the source and set it as a benchmark for further data retrieval.
Vaibhav
One Star

Re: Get records from a DB table without a timestamp column incrementally

What I am giving is not a Talend Solution. You can create a CDC (db2) or Materialized View (in case of Oracle) that captures the changes on the table. Also products like Oracle GoldenGate solution will mine through the archive redo log files and capture the changes and send to a different table.
Updates and Inserts can be captured.  Trick is to capture the deletes which can be done either by above solutions or compare the entire source to target which can be expensive when dealing with multi million rows. 
Five Stars

Re: Get records from a DB table without a timestamp column incrementally

To do this properly, you need some form of Change Data Capture on your database beit old-style triggers and log tables, timestamps on rows or some other mechanism provided by the RDBMS.
Failing this, you can really only resort to extracting the data in full and then net-changing this data set against your last data set.
You can then pass on the net changes to where they need to go.
This is a flexible and reliable way of aciving your goal, provided that you can accept the cost of the bulk extract. In my experience it's often acceptable for reasonably large tables that you need to extract not too frequently. Of course, if you need to extract a huge table every 5 minutes, then this is a non-starter and you really do need to go back to the source database and request some propper CDC.