I am reading a file with 10 million records in Talend and applying transformations based on joins against several lookup files.
The lookup datasets also contain millions of records. This is causing a severe performance bottleneck while running the job.
For performance enhancement, I have tried the following:
a) I haven't used any sorting components.
b) I am using the Temp data directory option in tMap.
c) For the large lookup files, I am reading only the columns that are required for the lookup.
d) I increased the JVM heap size to 32 GB.
e) I have also tried the parallelization components (tPartitioner, tCollector, etc.).
After trying all of the above steps, the job performance has still not improved. It is failing with an Out Of Memory error.
Version : Talend 5.2.1
I have also tried the above steps in Talend Open Studio 6.1.1, without the partitioning components since they are not available there.
Please guide me on what more can be done to improve the performance.
Make sure you limit the use of the BigDecimal data type in your schemas. Avoid it if you don't need it. On Java 8 on 64-bit systems, the long data type is wide enough for most values, so reserve BigDecimal for monetary calculations that genuinely need it.
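To make the advice above concrete, here is a small sketch (illustrative names, not from the original post) of keeping amounts as long cents inside the job and only touching BigDecimal at the parse/format boundary, which avoids allocating a BigDecimal object per row:

```java
import java.math.BigDecimal;

public class AmountDemo {
    // Convert a decimal string to long cents once, at the input boundary.
    static long toCents(String amount) {
        return new BigDecimal(amount).movePointRight(2).longValueExact();
    }

    // Convert back to a decimal string only when writing output.
    static String fromCents(long cents) {
        return BigDecimal.valueOf(cents).movePointLeft(2).toPlainString();
    }

    public static void main(String[] args) {
        long a = toCents("19.99");
        long b = toCents("0.01");
        long sum = a + b;                    // plain long arithmetic per row
        System.out.println(fromCents(sum));  // 20.00
    }
}
```

A long column in the schema costs 8 bytes per row, while each BigDecimal is a full object with its own internal state, which adds up quickly across millions of lookup rows.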
I have seen the link you gave, but there's one problem: if I have lookups attached to the flow, those lookups are loaded again and again for every batch run.
How can this be avoided, so that the lookups are loaded only once?
If it is a small set of data, then you can cache the data with a tHashOutput component and use a tHashInput on the lookup. The lookup will then be done in memory.
If you are looking up a huge volume of data that depends on the input, then it is better to reload at each batch, to avoid loading a huge dataset and causing your job to run out of memory.
The volume of data you are looking up will determine how much RAM your job will need.
You may need to experiment with various designs to figure out what works best for you.
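As a rough illustration of the tHashOutput/tHashInput pattern described above (this is not Talend-generated code; the lookup data and loader are hypothetical), the idea is to populate an in-memory map once and let every batch probe the same map:

```java
import java.util.HashMap;
import java.util.Map;

public class LookupCache {
    private static Map<String, String> cache; // loaded once, reused by every batch

    // Stand-in for the tHashOutput step: load the small lookup a single time.
    static Map<String, String> load() {
        if (cache == null) {
            cache = new HashMap<>();
            // in a real job this would read the lookup file or DB once
            cache.put("CUST001", "Alice");
            cache.put("CUST002", "Bob");
        }
        return cache;
    }

    public static void main(String[] args) {
        // Each "batch" probes the same in-memory map (the tHashInput side).
        for (int batch = 1; batch <= 4; batch++) {
            String name = load().getOrDefault("CUST001", "UNKNOWN");
            System.out.println("batch " + batch + " -> " + name);
        }
    }
}
```

The trade-off is exactly the one stated above: this only works when the whole lookup fits comfortably in the JVM heap.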
I have a source file with 2 million records and 5 lookup files of the same size (2 million records each). I am processing in batches of 0.5 million, i.e. 4 batches will run.
This means the 2 million records in each of the 5 lookups will be loaded 4 times.
1) Will this reloading of the lookups consume more memory?
2) When the second batch comes into play, will it also consider the lookup data loaded for the first batch when looking up? If yes, then this would become a one-to-many relationship, if we assume unique join values in all the lookups.
2) It depends on your design.
I am using the Load Once option in tMap for the lookup model. How can I make sure the lookup data is reloaded freshly when a new batch starts?
If you are following the KB article, each time the loop iterates, the tMap triggers the lookup and loads it again. So you set your tMap lookups to Load Once, and the batch iteration takes care of the reload. However, the other challenge is that you are looking up from a file. It is harder to implement this logic with a file, since you generally need to read the whole file each time. If you can stage the content of the file in a staging DB, the job design becomes much simpler, using SELECT statements with upper and lower bound limits.
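To sketch the bounded-select idea above (the table and column names STG_LOOKUP, ROW_ID, join_key, and payload are hypothetical): once the lookup file is staged in a database table with a sequential row id, each batch reads only its own slice instead of the whole file:

```java
public class BatchBounds {
    // Build the bounded SELECT for one batch of the staged lookup table.
    static String batchQuery(int batch, int batchSize) {
        long lower = (long) (batch - 1) * batchSize + 1;
        long upper = (long) batch * batchSize;
        return "SELECT join_key, payload FROM STG_LOOKUP"
             + " WHERE ROW_ID BETWEEN " + lower + " AND " + upper;
    }

    public static void main(String[] args) {
        // 2,000,000 staged rows in batches of 500,000 -> 4 bounded queries
        for (int b = 1; b <= 4; b++) {
            System.out.println(batchQuery(b, 500_000));
        }
    }
}
```

In an actual job you would put the equivalent query in a tMysqlInput/tOracleInput (or similar) component, with the bounds supplied via context variables driven by the batch loop.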