Am working in data anonymisation project and using Talend as tool to anonymise the data.
The design is to do anonymisation as in situ.
As anonymisation is required only for the PPI fields we are taking one table as source which has annoymised data and trying update the same fields for mutiple tables (6 to be exact) in the same job.
As data volume is enormous in the output tables (ranging from 11 Million - 48 Million ) in the output table the load was not possible in one go. Hence we used the thread logic which would divide the job into 8 thread to fasten the process.
Hence, there is a restartabilty factor involved.
As an observation 10,000 X 8 threads from input is taking 2 hours approaximately. Which is very slow.
Hence, we tried to divide into 6 different jobs where in one job updates each table. The tables having 24 Mil and 48 Mil tables were still slow. Therefore, we used the insert method for the two tables.
Somehow the above improvisation reduced the time but we have a concern. If we follow the insert design it would need to have all fields of the tables in the schema. Hence addition of any new fields in future would need a code change. This concern has been raised, hence, the design is rejected.
We also though for the tOracleOutputBulkExec however it has restart-ability issues.
Would you kindly kind have a look at the attached design and recommend us input from your end to make the design efficient.
Please let me know for any information have missed to provide.
Student Loan Company, 100, Bothwell Street, Glasgow
Solved! Go to Solution.
We have used the default Cursor size in the TDBInput componet which is 1000
The commit is happening after a specific number of rowstoprocess we are running on the child job one. ( say 01 Million )
Now 1 million records are sourced from input table.
Once it goes for update to output table more or less records would be matched. Hence for some tables updates would be done in thousands and some it would be done in multi millions
Eg of one Run:
Source Input Table - 1 Million ( join of 2 tables )
Target Output 1 - 01 Million ( one of the source table )
Target Output 2 - 700 k
Target Output 3 - 150 k
Target Output 4 - 900 k
Target Output 5 - 1.48 Mil
Target Output 6 - 745 k
Commits happens after this for one thread, like wise it will do for all 08 threads.
Please let me know for any other questions.
Also something to consider, are the threads processing data in one table/partition or is a given thread trying to update all partitions? If partitions even exist. If you table to be updated in partitioned in some way, try and get each thread to only update one partition at a time.
You could probably increase the TDBInput cursor size to 10K or 100K, maybe even 1M.
Usually I use 1M sometimes I have even used 2M, just depends on how many columns are in the DB table and your Oracle instance's resources.
Talend named a Leader.
Kickstart your first data integration and ETL projects.
Learn how to make your data more available, reduce costs and cut your build time
Read about OTTO's experiences with Big Data and Personalized Experiences
Take a look at this video about Talend Integration with Databricks