Hi, I am loading 25 tables, from mssql to postgres. Each table approximately contains 1 million rows. Now, I load it first using regular components.
Once, the full load is done, I want to run the delta load, meaning I want to use cdc on sql server and use the changed data to either update or insert data in the tables of postgres
I can do this using SCD components of talend or make my own SCD flow.
I prefer using the SCD component as it takes care of a lot manual update/insert code I have to write otherwise
I see the performance with SCD is not as good as the manual update/insert code
What can I do to enhance the performance of SCD? I have read in a few places, that SCD is slow, if yes, why would this component be exposed when SCD coding could be done manually?
I would prefer to go with SCD as maintaining type1,type2, type3 and hybrid type is managed by the SCD component. This would also prevent a lot of other bug prone code in the manual update/delete route
Please suggest as to how I can go about this?
all depends from number of daily changes
yes, SCD is slow, but for example:
You have 500 000 customers, for an average company it could be up to 2000 customers changed per day
and this with 5 SCD operations per seconds give us 400sec, 15min - not dramatical
but if You try to use the same technics for 100k daily updates - it is another case
Thanks for the reply. Let's say I decide to run my job every 15 minutes and the change in data is in terms of 10-15 rows every run.
And rarely , I might get a bulk change say around 5000 rows.
Will using the SCD under such circumstances, be still okay?
There is no standard benchmarkign number for the scenario you have explained. There are multiple factors which will decide the performance in this case. Some of the basic factors are CPU utilization, memory utilization, network bandwidth. Also we need to keep in mind that you need to factor the system resources for both Talend and the database you are using for the SCD.
A performance bottleneck in one single area can create problems in your plan.
Please refer the Talend reference architecture to get some high level system resources to be allocated while installing Talend.
But you need to talk to your DBAs and see where you can tune at DB level also. Even after all these precautions, sometimes there will be bottle necks. In those cases, you will have to see how you can parallelize your tasks, the areas where you can fine tune the flow etc.
It is a continuous exercise but create some PoCs and verify in which stage you are facing the bottle necks and then you can plan the tuning exercise from that stage.
Talend named a Leader.
Kickstart your first data integration and ETL projects.
Learn how to do cool things with Context Variables
Find out how to migrate from one database to another using the Dynamic schema
Pick up some tips and tricks with Context Variables