ETL: how to update my datawarehouse quickly

One Star

ETL: how to update my datawarehouse quickly

Hello everybody,
I wonder how to use the ETL of Talend to update a datawarehouse from another database (with a lot of datas).
Just for information, I am a newbie to the use of ETL.
I think it will take too much time if I reload all my datawarehouse every night.
I have seen that it exists the possibility to use the instruction "insert or update" but is it enough to have a good performance?
Does it exist another technical to update only the difference between the database and the datawarehouse?
Thank you in advance for yours answers,
Chris
Employee

Re: ETL: how to update my datawarehouse quickly

In your case, i think the best thing is in fact to have like a modification date in the main tables of your database source.
If you have / or can have this field here, then you can retrieve easily the last data modified, so it can be fast enough.
If you do an "insert or update" only with all rows, the performances will be really bad. You need to filter the rows first.
If you have for example 10 millions rows in your database, if you add a good filter with the last data / last modified data, maybe you will have like 1000 rows only, so an insert / update will be ok and fast enough.
Nicolas
One Star

Re: ETL: how to update my datawarehouse quickly

Thank you for your answer.
I understand your solution with a date, but sometime it is not possible to modify the database of the client to add a date.
In this case, what are the best solution?
Chris
One Star

Re: ETL: how to update my datawarehouse quickly

Maybe the last question :-)
If I want to apply your solution, ie to find the difference with a date of modification, what is the object to use in talend and how do that?
I actually use 3 objects:
- tMySQLInput
- tMap
- tMySQLOutput
Maybe in the tMap object but I don't find how to make a condition to import or not a new line.
Thank you,
Chris
One Star

Re: ETL: how to update my datawarehouse quickly

Hi there.
I don't know much about MySQL, but we use Oracle. We created CDC tables to capture changed or inserted rows on the source database. You can also use triggers to copy write the record to your CDC table on update or insert. Create an extra column in your CDC table to indicate if it was an insert or update. Use the CDC table as source, and clear it everytime after your load. Hope this helps.
Philip
One Star

Re: ETL: how to update my datawarehouse quickly

chris23
The most important think is to use a good index.
If you use date modification put a index on it.

DN
One Star

Re: ETL: how to update my datawarehouse quickly

I have found a solution with the use of the Talend object MySQLSCD.
I don't know how to add an image here so I will explain my solution.
3 elements:
- tMySQLInput -----> tMap -----> MySQLSCD
With tMap, I choose which column I need to update in my datawarehouse.
And with MySQLSCD I am able to define when my job have to reload the datawarehouse.
Now I wonder how is it possible by my job to compare the client database and the datawarehouse with a good performance.
I think that we have to install the talend job on the client server and a cache is used with the copy of the datawarehouse on the client server (where is the database of the client) and the comparison is done localy.
Then, all new values (which follow the rules defined in MySQLSCD ) are updated to the datawarehouse.
Is it right?
Chris