Four Stars

Reading large flat files in stream mode: doubts

Hi,
There is a scenario where we need to read millions of records from a database and join them against a lookup with several million records...
- To ease the lookup, the lookup data is saved to a temp path (set via a variable) inside tMap.
- To save the memory the main data would otherwise occupy, the filesystem is used as the input instead of the database: the table data is saved to a flat file before the process starts.
The question is: if we enable the feature for reading the data from a file stream, will it help solve the memory-related issues of the previous case, i.e. reading data from the file/database without a stream?
Will this load the data from the flat file in a continuous stream and join it, via tMap, against the lookup data available in the temp path?
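For clarity, by "stream mode" I mean reading the flat file record by record instead of loading it completely into memory; roughly like this in plain Java (file name and delimiter are just examples):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class StreamReadSketch {
    public static void main(String[] args) throws IOException {
        // Only one line is held in memory at a time, however large the file is.
        try (BufferedReader reader = new BufferedReader(new FileReader("main_data.csv"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split(";"); // delimiter assumed
                // process one record, then let it go out of scope
            }
        }
    }
}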
https://help.talend.com/search/all?query=tFileOutputDelimited&content-lang=en
Vaibhav
5 REPLIES
Seventeen Stars

Re: Reading large flat files in stream mode: doubts

I would create a dedicated lookup table containing only the essential columns, plus an index that matches the search condition exactly.
Then I would use this table (probably prepared in a previous job), set the lookup model to "Reload at each row", and use globalMap expressions to build the lookup's WHERE condition.
With this scenario you avoid memory problems, and it can run very fast (of course not as fast as holding the lookup in memory itself).
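To illustrate the pattern (table, key, and column names are just examples): in tMap the lookup model is set to "Reload at each row", a globalMap key is declared from the main flow, and the query of the lookup database input references that key, so each main row triggers one indexed lookup query:

// tMap lookup settings (sketch):
//   Lookup Model: Reload at each row
//   globalMap Key: "LOOKUP_ID"   Value: row1.customer_id
//
// Query of the lookup database input, executed once per main row;
// the WHERE clause hits the dedicated index:
"SELECT customer_id, customer_name FROM lookup_slim WHERE customer_id = "
    + globalMap.get("LOOKUP_ID")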
Four Stars

Re: Reading large flat files in stream mode: doubts

I would create a dedicated lookup table containing only the essential columns, plus an index that matches the search condition exactly.
>> I can't have an index on the lookup column in this case.
My doubt remains whether loading data via a file input stream would improve performance compared to normal data loading from a file/database.
Vaibhav
Seventeen Stars

Re: Reading large flat files in stream mode: doubts

Hi Vaibhav,
I do not think a file input stream will improve your performance: a stream has to be read through, possibly all the way to its end, to reach the record you need. A database does this task far more effectively than anything you could implement yourself.
If your main database does not allow creating additional tables or indexes, you could try a database like Java DB or H2. Both can run as in-memory databases and offer considerable speed. You could copy the massive lookup data into such a database.
Reading a file as the lookup with the option "Reload at each row" is definitely a performance killer.
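A minimal sketch of the in-memory database idea with H2 (table and column names invented; assumes the H2 driver is on the classpath):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class H2LookupSketch {
    public static void main(String[] args) throws Exception {
        // In-memory H2 database; DB_CLOSE_DELAY=-1 keeps it alive for the whole JVM/job.
        try (Connection con = DriverManager.getConnection("jdbc:h2:mem:lookup;DB_CLOSE_DELAY=-1")) {
            try (Statement st = con.createStatement()) {
                st.execute("CREATE TABLE lookup_slim(id INT, name VARCHAR(100))");
                st.execute("CREATE INDEX idx_lookup_id ON lookup_slim(id)");
                // ... bulk-copy the massive lookup data in here ...
            }
            // Per-row lookup through the index instead of holding all rows in tMap:
            try (PreparedStatement ps = con.prepareStatement(
                    "SELECT name FROM lookup_slim WHERE id = ?")) {
                ps.setInt(1, 42); // key taken from the current main row
                try (ResultSet rs = ps.executeQuery()) {
                    if (rs.next()) {
                        String name = rs.getString("name"); // matched lookup value
                    }
                }
            }
        }
    }
}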
Four Stars

Re: Reading large flat files in stream mode: doubts

Thanks Jan,
OK, so based on this, what I take away is to break the lookup as well as the source data into chunks/batches and process them... this should certainly improve the performance of the job execution and make millions of records, or any volume, manageable quite quickly...
I will try this breakup, and I am sure it will help with managing large data in the lookups as well as in the source...
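A rough sketch of the batching idea (chunk size and names invented): collect the keys of one source chunk and fetch only the matching lookup rows with a single IN query, instead of one query per row or one giant load:

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ChunkedLookupSketch {
    public static void main(String[] args) {
        int chunkSize = 3; // tiny for the demo; tune to available memory in practice
        List<Integer> sourceKeys = Arrays.asList(1, 2, 3, 4, 5, 6, 7);
        for (int i = 0; i < sourceKeys.size(); i += chunkSize) {
            List<Integer> chunk = sourceKeys.subList(i, Math.min(i + chunkSize, sourceKeys.size()));
            String inList = chunk.stream().map(String::valueOf).collect(Collectors.joining(","));
            // One lookup query per chunk:
            String sql = "SELECT id, name FROM lookup_slim WHERE id IN (" + inList + ")";
            System.out.println(sql); // in the real job, execute this against the lookup DB
        }
    }
}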
Thanks 
Vaibhav
Seventeen Stars

Re: Reading large flat files in stream mode: doubts

Of course you should do that. And if the source rows do not depend on particular chunks of the lookup data, simply wipe out the lookup records that are not needed.
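For the "wipe out unneeded lookup records" part, a simple two-pass sketch (file names and key position invented): first collect the distinct join keys that actually occur in the source, then keep only the lookup rows with those keys:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class LookupFilterSketch {
    public static void main(String[] args) throws IOException {
        // Pass 1: collect the distinct join keys present in the source file.
        Set<String> neededKeys = new HashSet<>();
        try (BufferedReader src = new BufferedReader(new FileReader("main_data.csv"))) {
            String line;
            while ((line = src.readLine()) != null) {
                neededKeys.add(line.split(";")[0]); // join key assumed in the first column
            }
        }
        // Pass 2: stream the big lookup file and keep only the matching rows,
        // e.g. writing them to a slim lookup file used by the real job.
        try (BufferedReader lkp = new BufferedReader(new FileReader("lookup_data.csv"))) {
            String line;
            while ((line = lkp.readLine()) != null) {
                if (neededKeys.contains(line.split(";")[0])) {
                    // write the line to the reduced lookup (output omitted)
                }
            }
        }
    }
}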