slowly changing dimension & inferred records in dimension

hi,
i am getting my head around Talend and the way things are supposed to be done.
so far i find it to be a tool with loads of finesse.
but i am not sure that i understand how to create a job that will tackle the following scenario.
a table from a operational system is used to as a source for a fact table in a star DB schema, this is table: F_ORDERS
together with this fact table there are other three dimensional tables, each has a source table from the operational system:
D_DATE
D_COLOUR
D_SIZE
so each dimensional has a surrogated key created as an incremental integer (DATE_ID, COLOUR_ID, SIZE_ID)
these are related between fact and dimensional tables as foreign keys to maintain integrity.
ideally, the processing would be:
load dimension tables using Slowly Changing Dimension process (or alike)
load fact table, replace operational system keys with dimensions' surrogated keys, derive any fact to report upon and save into fact table.
QUESTION:
how can Talend handle a situation when a new value is found in the fact table but does not exist in the dimension table?
how is it possible to insert those new values into the dimension table, and pick-up their newly assigned surrogated key values to continue process this fact table?
thanks,

Nicolas
www.brainpowered.net
7 REPLIES

Re: slowly changing dimension & inferred records in dimension

since nobody replied
i will try to expand:
let's assume that my dimension D_COLOUR was loaded yesterday evening.
and this morning a new clothing range started to be produced with the colour AQUATIC_BLUE; but there is no code for it on our dimension table until next morning when the dimension is refreshed.
but during the day we are extracting data.
so how can i assign a value/surrogated key for the new colour, setting a flag to indicate that this records requires attention/details on the dimension table
while
processing the fact table.
i am trying to understand the Talend way of doing this.

thanks

Nicolas
Moderator

Re: slowly changing dimension & inferred records in dimension

Hi Nicolas,
To answer your question, yes Talend can handle the fact that dimension tables get delayed.
There is several ways to deal with it:
1- Usually if your dimension tables are delayed, facts related to these dimensions should be rejected.
Because DWH is about ensuring quality of data and strong integrity you should reject the fact that have no match in dimension tables.
You can then use a recycle mecanisme the day after to reload previous rejects.
For example you can use a tUnite to merge daily facts with rejected facts and try to reload them using the standard loading of DWH.
This way you won't have to developp a specific recycling mecanism.
2- You want to ensure artificial reference integrity between facts and dimension.
You should first detect Ids from facts that are not present in dimensions (existing+current load).
Based on this list you should create new rows that you append to the daily dimension load inserting them using the tSCD component (Type 1, 2, 3 supported and it will take care of surrogate keys).
Then based on the same list of "special care" dimension rows, make an update on this newly inserted rows by setting a Delayed_flag column.
You have to do it in two steps since tSCD doesn't take care of delayed dimensions.
The day after you should remove all rows with delayed_flag and start over, so if the dimension is found=>ok, else you recreate an artificial dimension with a delayed_flag column.

I would recommand 1- since 2- will make dimension tables different from reality. And your reports will show that you have bought xxx number of "Generic article yyy" instead of you have bought xxx number of a red pull-over from brand zzz
In order to have facts and dimensions in sync you could use some ODS or temp storage.
You can write a bugtrack with a feature request of tSCD to handle delayed dimensions in the component.
Adding a tFact componant that take care of retieving tek keys from dim tables before loading could be also interesting.
regards,
Benjamin

Re: slowly changing dimension & inferred records in dimension

thanks bboutros,
i would seem that it is necessary to extract the unique value and test their existence against each dimension.
and i will be posting a feature request to see how tSCD can be enhanced to support delayed dimension records.
regards,
Nicolas

Re: slowly changing dimension & inferred records in dimension

bboutros,
i was trying to login into the bugtracker bu ti keep receiving this error message:
SYSTEM WARNING: session_destroy() : Trying to destroy uninitialized session
SYSTEM WARNING: Cannot modify header information - headers already sent by (output started at /home/www/talendforge.org/bugs/core/error_api.php:166)
i will try again, at a latter date.
One Star

Re: slowly changing dimension & inferred records in dimension

Can some one share the details of how you can implement this step by step. I don't know while loading the fact table how we can possibly tag the rejects into a separate table to make them available for loading in the next run.
Seventeen Stars

Re: slowly changing dimension & inferred records in dimension

You can only use surrogate keys between fact and dimension if you are absolutely sure the dimension will not change within a time slice of a fact. In this case you would need for one fact data set more than one surrogate key to the dimension.
I would suggest you combine the facts with the dimension by the natural key of the dimension and a between clause for the fact timestamp between the dimension time validity.
One Star

Re: slowly changing dimension & inferred records in dimension

Hi nicolasdiogo,
I don't know whether this would resolve your problem or not..
Have you tried using the "Catch lookup inner join reject" property available in the output table property? If you connect the dimension (lookup) table with the source data using inner join, those data which do not have the key will be rejected and you can get those data in the output table using the above property set to true... Check this out to store the rejected rows seperately