Inserting older records into dimension than already inserted

Five Stars

Inserting older records into dimension than already inserted

Hello,

 

I am integrating data from 2 countries now and I have this problem.

Problem is that my scd start and scd end field get messed up in dimension having some SCD fields. (therefore I get duplicates in fact table)

So I made a visual example.

I load first country 2017 then 2018. And after that I load second country 2017. (I cannot load all years at once because of the volume). This is what I get:

 

scd_start           scd_end      version        Active             surrogate_Key_UUID                              business_key        SCD_Attribute    Country
2017-02-20    2018-02-12       1               false      9a5d41b5-5db8-43fa-922e-21e30addee1f        A001                        333                UK
2018-02-12    2017-01-09       2               false      2dc88bf2-20c9-48ec-b6ad-f2407f8d77a4          A001                        444                UK
2017-01-09    9999-12-31       3                true       aaaa8732-a49f-41bd-8e19-3ad7deefb062        A001                        333                USA

 

As you can see, because I am loading older data than already is there, I get scd_end older that scd_new.

 

I want to ask, if you can help me solve this issue?

PS. I don't want to used SCD component from talend, so I need to do this with SQL logic.

What checks should I implement and how to do the updates?

If there is already some written material to this issue, could you please refer me?

 

Thank you very much!

Highlighted
Sixteen Stars

Re: Inserting older records into dimension than already inserted

This is an old problem that I can remember experiencing in the days that I used pure SQL for DI (....with maybe a bit of Informatica later on). Essentially you need to handle this yourself very strictly. I don't believe that the SCD component will do it for you (I may be wrong about that). 

 

Essentially you need to check....

1) Is my record part of an existing dimension? If it is not, then just add it with a "end of time" end date. If it is, goto the next step.

2) Is my record's start date after the last record's start date? If yes, treat it as a normal new record. If not, goto the next step.

3) Is my record's start date older than the oldest record's (of that dimension) start date? If so, add your record and edit the currently oldest record's start date to be the end date of the new record. Done. If not, goto the next step.

4) If you've got this far you are essentially slotting your new record inbetween others in your dimension. You will need to either split a single record into 2 (if the new record's start date is after the old record's start date AND the new record's end date is before the old record's end date) OR have to end a record earlier and start the currently following record later (according to your new record's start and end dates).

 

This is not easy, but as long as you follow those rules (I think I have covered them all), you should be OK.