I am integrating data from 2 countries now and I have this problem.
Problem is that my scd start and scd end field get messed up in dimension having some SCD fields. (therefore I get duplicates in fact table)
So I made a visual example.
I load first country 2017 then 2018. And after that I load second country 2017. (I cannot load all years at once because of the volume). This is what I get:
scd_start scd_end version Active surrogate_Key_UUID business_key SCD_Attribute Country
2017-02-20 2018-02-12 1 false 9a5d41b5-5db8-43fa-922e-21e30addee1f A001 333 UK
2018-02-12 2017-01-09 2 false 2dc88bf2-20c9-48ec-b6ad-f2407f8d77a4 A001 444 UK
2017-01-09 9999-12-31 3 true aaaa8732-a49f-41bd-8e19-3ad7deefb062 A001 333 USA
As you can see, because I am loading older data than already is there, I get scd_end older that scd_new.
I want to ask, if you can help me solve this issue?
PS. I don't want to used SCD component from talend, so I need to do this with SQL logic.
What checks should I implement and how to do the updates?
If there is already some written material to this issue, could you please refer me?
Thank you very much!
This is an old problem that I can remember experiencing in the days that I used pure SQL for DI (....with maybe a bit of Informatica later on). Essentially you need to handle this yourself very strictly. I don't believe that the SCD component will do it for you (I may be wrong about that).
Essentially you need to check....
1) Is my record part of an existing dimension? If it is not, then just add it with a "end of time" end date. If it is, goto the next step.
2) Is my record's start date after the last record's start date? If yes, treat it as a normal new record. If not, goto the next step.
3) Is my record's start date older than the oldest record's (of that dimension) start date? If so, add your record and edit the currently oldest record's start date to be the end date of the new record. Done. If not, goto the next step.
4) If you've got this far you are essentially slotting your new record inbetween others in your dimension. You will need to either split a single record into 2 (if the new record's start date is after the old record's start date AND the new record's end date is before the old record's end date) OR have to end a record earlier and start the currently following record later (according to your new record's start and end dates).
This is not easy, but as long as you follow those rules (I think I have covered them all), you should be OK.
Introduction to Talend Open Studio for Data Integration.
Practical steps to developing your data integration strategy.
Create systems and workflow to manage clean data ingestion and data transformation.