Dear Talend Support Team, We have a huge input file with more than 4 million rows in it. This file is read by tFileInputPositional, and its data flow is then linked to tMap. There are also lookups against database tables, but these tables don't contain many rows. The problem is the enormous memory consumption. We need a way to keep memory usage moderate. Is there a way to read the huge input file in parts, process each part, and then read the rest?
Hi Hilderich, To solve the memory problem, you can have tMap save the lookup records to the file system. In any case, when the tFileInput component reads the file, it doesn't read all the rows at once; it reads records in chunks and passes them on to tMap. It is the tMap component that collects all the lookup records in memory (or on disk), performs the join operation, and passes the result to the next component after processing. Storing intermediate records in the file system will help you solve the memory problem. This option is available in the property settings of the input section of tMap (third icon from the left at the top of the input side). Thanks, Vaibhav
Hello Vaibhav, Thanks for your answer. I forgot to mention that this option (store temp data to file) is already in use. Unfortunately, the memory consumption has not improved. While the job is running I can observe the temp files being written to disk, but the consumption is still at its maximum. The problem might be the last tMap component before the data are stored in the database. But this final tMap has no lookup, so I cannot save the flow temporarily to disk there. Any other ideas? Kind regards, Hilderich
Hi, you can try disabling parts of the job, which will help you identify which component or section is consuming the memory. You could also break the one job into small subjobs and pass data from parent to child, or use files between the processing steps. Performing all tasks in a single job is not an optimized way to deal with a large amount of data and joins; you can even distribute the join processing in stages if possible. Vaibhav
The bottleneck is the tDenormalize component. Without it, memory consumption does not climb to its limit. Any suggestions on how to replace it with a more efficient approach? By the way: your image attachment function here is broken; I cannot attach any images anymore.
We need to group the data but exclude the field "LKZ" from the grouping keys. This way we get the values of "LKZ" comma-separated, which is what we want. All of this can be done, and is already realized, by tDenormalize in the job above.
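For clarity, the denormalize step can be sketched in plain Python. The field names (ID, NAME) are illustrative assumptions, not the actual schema; only "LKZ" comes from the thread. Note that this naive approach, like tDenormalize, keeps one entry per group in memory, which is where the consumption comes from on a 4-million-row file with many distinct groups:

```python
from collections import OrderedDict

def denormalize(rows, delimiter=","):
    """Group rows on all fields except 'LKZ' and comma-join the LKZ values.

    Keeps one entry per group in memory, so memory grows with the
    number of distinct groups -- the same behaviour that makes
    tDenormalize expensive on large inputs.
    """
    groups = OrderedDict()
    for row in rows:
        # The group key is every field except LKZ.
        key = tuple((k, v) for k, v in row.items() if k != "LKZ")
        groups.setdefault(key, []).append(row["LKZ"])
    for key, lkz_values in groups.items():
        out = dict(key)
        out["LKZ"] = delimiter.join(lkz_values)
        yield out

rows = [
    {"ID": 1, "NAME": "A", "LKZ": "DE"},
    {"ID": 1, "NAME": "A", "LKZ": "FR"},
    {"ID": 2, "NAME": "B", "LKZ": "IT"},
]
print(list(denormalize(rows)))
# -> [{'ID': 1, 'NAME': 'A', 'LKZ': 'DE,FR'}, {'ID': 2, 'NAME': 'B', 'LKZ': 'IT'}]
```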
--- Just an idea: you could put a tFilterRow component before tDenormalize and split the rows on a particular key value in a way that does not conflict with the grouping required by tDenormalize. Then you can have two tDenormalize components, one on the main flow and one on the reject flow, dividing the memory usage between the two components. You could also use a sort component before tDenormalize so it receives sorted data and can process it more quickly.
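The sort suggestion can be taken one step further: once the input is sorted by the grouping fields, the denormalization can run as a single streaming pass that holds only the current group in memory, so memory is bounded by the largest group rather than the whole file. A minimal sketch in Python (field names ID/NAME are illustrative assumptions; `rows` must already be sorted by the group key):

```python
import itertools

def group_key(row):
    """Group key: every field except LKZ, in field order."""
    return tuple((k, v) for k, v in row.items() if k != "LKZ")

def denormalize_sorted(rows, delimiter=","):
    """Streaming denormalize for input already sorted by the group key.

    itertools.groupby only ever buffers one group at a time, so memory
    use is bounded by the largest group, not by the row count.
    """
    for key, group in itertools.groupby(rows, key=group_key):
        out = dict(key)
        out["LKZ"] = delimiter.join(r["LKZ"] for r in group)
        yield out

# Usage: input sorted on the grouping fields.
sorted_rows = [
    {"ID": 1, "NAME": "A", "LKZ": "DE"},
    {"ID": 1, "NAME": "A", "LKZ": "FR"},
    {"ID": 2, "NAME": "B", "LKZ": "IT"},
]
print(list(denormalize_sorted(sorted_rows)))
# -> [{'ID': 1, 'NAME': 'A', 'LKZ': 'DE,FR'}, {'ID': 2, 'NAME': 'B', 'LKZ': 'IT'}]
```

In Talend terms this corresponds to a tSortRow (or an external sort for very large files) feeding tDenormalize, so the component never has to accumulate unsorted groups.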