Loading fact and dimension table in same subjob

One Star

Loading fact and dimension table in same subjob

Let's say I have a log file that contains several feilds, one of them is a text field and I want to create a dimension table based off of the contents of that text field. What I want to do is have talend parse that file and first check the text column against a dimension table, if it doesn't already exist add a new column with that text, once done (regardless of whether it had to add anyting or not) do a lookup on that table and return the id for the record and then import that into a fact table.
I know I can do this in 2 subjobs, the first parsing through and checking against the dimension table, the second doing the lookup and then loading the data, but is there a way that I can do this all in 1 subjob so it doesn't have to parse the same file twice?
- Peter
One Star

Re: Loading fact and dimension table in same subjob

Hi Peter,
I'm not sure if this can all be done in one subjob, however if you want to minimise the number of passes of the main input file you could consider doing more in the first subjob i.e.
- perform the lookup against the dimension table
- write all matched fact records to the fact table (since you have the key from the lookup)
- add the new dimension rows from the unmatched facts
- write unmatched fact records to a file
then in subjob 2 you will only be parsing a subset of the original file i.e. those fact records that didn't have an existing dimension. It could prove a lot more efficient if you have a larger number of matches than non-matches.
Another alternative is to generate the surrogate key within Talend rather than the DB. This would mean that every time you fail to match to the dimension you could call a routine that generates a new unique surrogate key. You can then use the generated key for both dimension and fact table inserts in the same subjob. This would mean only a single pass of the input file.
Regards,
Rick
One Star

Re: Loading fact and dimension table in same subjob

Hi Peter,
I have had a go at running this in a single subjob and managed to get it to work.
I read in the fact data and immediately write all matched facts to the fact table.
Nonmatched facts are passed to a tmap that writes the new record to the dimension table that I have set to commit every 1 and insert 1 row at a time (in advanced settings). This tmap also copies the fact data to a second output flow.
The fact data passes to a third tmap that joins to the dimension table. I have set the dimension table to reload on every row, so that it will pick up the newly inserted dimension row. This tmap writes out to the fact table with the new key added.
This did work for me on very small volumes, so I can't really say that this will be more efficient than either of the methods above. I also have a reservation regarding how the DB and Talend handle the insert then the query and will it always be in synch? i.e. is there the possibility that the query will begin and complete before the dimension insert completes, thus causing a failure?
Regards,
Rick
One Star

Re: Loading fact and dimension table in same subjob

Hi Peter,
I have had a go at running this in a single subjob and managed to get it to work.
I read in the fact data and immediately write all matched facts to the fact table.
Nonmatched facts are passed to a tmap that writes the new record to the dimension table that I have set to commit every 1 and insert 1 row at a time (in advanced settings). This tmap also copies the fact data to a second output flow.
The fact data passes to a third tmap that joins to the dimension table. I have set the dimension table to reload on every row, so that it will pick up the newly inserted dimension row. This tmap writes out to the fact table with the new key added.
This did work for me on very small volumes, so I can't really say that this will be more efficient than either of the methods above. I also have a reservation regarding how the DB and Talend handle the insert then the query and will it always be in synch? i.e. is there the possibility that the query will begin and complete before the dimension insert completes, thus causing a failure?
Regards,
Rick

Rick,
I think I understand what you're saying here, but is it possible you could put up a screen cap of the job so I can see the layout and thus be sure I'm understanding you correctly?
- Peter S
One Star

Re: Loading fact and dimension table in same subjob

Here you go, hope this helps....
One Star

Re: Loading fact and dimension table in same subjob

Here you go, hope this helps....

For the "not matched" loader are you using items that failed an inner join?
- Peter S
One Star

Re: Loading fact and dimension table in same subjob

Hi Peter,
Yes it is an inner join, with the NotMatched being the rejects from the inner join.
Regards,
Rick