When i run a subjob, the main table has more than 200+ thousands records. I used "Lookup Model:Reload at each row" to filter Lookup table in tMap. Also added conditions in main table. However, the whole of job still will take about 20min.
I did some filers through globalmap in lookup table.
In tMap, implement "Left Out Join" and "Reload at each row" lookup model.
Complete to run the whole of job will take so long time(20min).
Do you have any idea to do the performance tunning?
BTW are there any good ways to use connection pool in Talend?
Thank you in advance!
Solved! Go to Solution.
The data is from different database.
I think i have already added enough conditions to filter in the DB, at the same time, did some deals in the tMap, so i have no idea how to improve the performance through other ways!
OK, it looks like you are doing the sort of things I would try to speed things up. Another thing to test is where the bottleneck is. Can you disconnect your DB write/update components and just run the code where the join is happening. Does that massively speed things up? If so (as I have seen before) this could very well be a insert/update issue. Also, are you inserting or updating.....or both?
OK. Somewhere where you might get a slight improvement is by removing the joins on the columns you are filtering your lookup on. They are not needed. You will only be returning rows that hold the same values as the row that is in coming in in your main flow of data. Therefore you can remove those from the query and from the join. This shouldn't resolve this issue, but will be more efficient.
However, before doing that are you sure that loading the lookup on every row is the most efficient way of doing this? Sometimes it is without question, but if your main flow is 200,000 rows, that is 200,000 queries fired to the DB. I am assuming you have probably looked into this, but it is worth considering if you haven't....and easier to test before removing the joins I advised you to remove above.
I tried to remove the joins, however looks like there is no more efficient.
And i think i have to load the lookup on every row because the lookup table has 428,634,831 rows. Or else it should be more slower.
OK, do you know how many rows are returned for every row that the lookup is fired for? In theory you should only return the rows that you need if you are going to fire 200,000 queries. If you are returning more rows and having to filter them in Java, maybe you can tailor your filtering in your query?
Another solution to this will require some playing around, but may just work. I suspect that you can filter preemptively find some values to filter your lookup query on, before you carry out your lookup once. To do this you will need to change the job to dump your main flow into a tHash component (store it in memory). You have some columns that you need to join on or filter by. Now, if you load the main flow preemptively you *may* be able to find some data to use to filter the lookup by. For example, lets say that your lookup data has a column called "alphabet" which holds every letter from a to z. However, your main flow only returns "a", "g" and "y". If you know this before running the lookup query, you can add an "IN" filter to the query and pass in your comma separated list of "a", "g" and "y" to the SQL query. This might remove 3/4 of yoru lookup data. If you are only loading this once and into memory, you will lose the latency of firing the query for every row and the number of rows to be checked each time is also massively reduced.
Another thing I have just noticed, do you need to match ALL ROWS on the match model? If you don't then switch this off. It will mean that all data will need to be checked for every incoming row.
Just only one row is returned for every row that the lookup is fired for.
You are right the best practice is to find some values to filter my lookup query and carry out load once which is more efficient.
However, if I can not find some values to filter my lookup query. Can I use "Parallelization" to execute the job? Then use the connection pool to control the tMSSqlConnection. How do you think? Or do you have any idea?
Using parallelization will help in this scenario, but it could also slow things down. The rule of thumb for the number of threads is the number of cores - 1. If you start with this and tweak (up and down) you will find an optimum. This also depends on how much other work your machine is doing.
Connection pooling will not help you here. The connection component will handle connections for every db component in your job....if they are hooked up to the connection component. Connection pooling would only really be of use in the situation where the job is running multiple times in parallel.
Yeah, we'd like to run the job in multiple times in parallel.
However, seems connection pooling can not be used in TBD 6.2.1. Right?
I don't understand how connection pooling would help you. The Connection component maintains the connections for the job. If you have jobs running in parallel (ie multiple instance of a job NOT a section of the job as demonstrated in your screenshots), the Connection component in each instance will handle this. Connection pooling is only really a requirement for a webservice where thousands of instances of a service can be using connections concurrently. This is handled quite easily using Spring and has been supported for quite a while now.