I would like to optimize the attached job to run faster. It currently takes 11-12 minutes to run.
Advanced Settings: -Xms256M -Xmx60240M
I have additional transforms that start from the same input file and also run for this length of time. I would really appreciate some guidance on how to improve this job. For example, is there a way to avoid reading row by row and instead pull only the information needed directly into the job?
Most of the elapsed time comes from the tFileInputDelimited components reading the BillOfMaterial.tab file; you read it twice, for a total of about 10 minutes.
You should try adding a tHashOutput after the 1st tFileInputDelimited and replacing the 2nd tFileInputDelimited with the corresponding tHashInput.
Depending on the actual size of the input file, this should be significantly faster.
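To illustrate the idea behind tHashOutput/tHashInput: the file is parsed once into memory, and every later pass hits the in-memory rows instead of re-reading the disk. A minimal plain-Java sketch of that pattern (the file name `bom_sample.tab` and its contents are made up for the example; the real BillOfMaterial.tab schema is not shown in the thread):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.List;

public class CacheDemo {
    // Read the tab-delimited file once and keep the parsed rows in memory,
    // analogous to tHashOutput; later subjobs reuse the list like tHashInput.
    static List<String[]> readOnce(String path) throws IOException {
        List<String[]> rows = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = br.readLine()) != null) {
                rows.add(line.split("\t", -1)); // .tab file: tab-delimited
            }
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample standing in for BillOfMaterial.tab
        try (PrintWriter pw = new PrintWriter("bom_sample.tab")) {
            pw.println("part1\tassemblyA");
            pw.println("part2\tassemblyB");
        }

        List<String[]> cache = readOnce("bom_sample.tab"); // the only disk read
        // Both "passes" now use the in-memory cache instead of re-reading the file.
        System.out.println(cache.size());    // first pass: row count
        System.out.println(cache.get(0)[1]); // second pass: a lookup
    }
}
```

The trade-off is memory: the whole file must fit in the JVM heap, which is why the Xmx setting matters for this approach.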
I've tried feeding the tFileInputDelimited into a tHashOutput, but it doesn't work: it stops processing after 4M records.
I've also tried using tBufferOutput with a Dynamic Schema, but I don't know how to convert it back to a multi-field schema once I read it back with tBufferInput.
What if you have just a tFileInputDelimited in your job? This would confirm whether the read time is due to the file size rather than the operations performed in the following tMap.
If it still runs for 5 minutes, you don't have many options; otherwise, please share both of your tMaps (Map_Cols and tMap_9).
Reading the file alone, without the operations, takes more than 5 minutes.
I'm not sure why it would take so long just to read the file, even given the record count; it's just a text file. I expect this file to grow considerably, so I would like to find a way to optimize.
I made smaller files, using a maximum of 3M records per file.
The design changed to process the files: tFileList > tFileInputDelimited > .. > .. > tBufferOutput
OnSubjobOk --> tBufferInput >>>> tHashInput
The end of the job uses tFileList again to do the lookups in each file, then appends to the same output file.
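The partitioning step above (capping each file at a fixed record count) can be sketched in plain Java; the file name `bom.tab`, the `.part` suffix, and the tiny chunk size are illustrative stand-ins, not the actual job's values:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;

public class SplitDemo {
    // Split a delimited file into chunks of at most maxRows records each,
    // mirroring the "3M records per file" partitioning (maxRows is shrunk
    // here for illustration). Returns the number of part files written.
    static int split(String path, int maxRows) throws IOException {
        int fileIdx = 0, rowCount = 0;
        PrintWriter out = null;
        try (BufferedReader br = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = br.readLine()) != null) {
                if (out == null || rowCount == maxRows) {
                    if (out != null) out.close();
                    // Hypothetical naming scheme: bom.tab.part0, bom.tab.part1, ...
                    out = new PrintWriter(path + ".part" + fileIdx++);
                    rowCount = 0;
                }
                out.println(line);
                rowCount++;
            }
        } finally {
            if (out != null) out.close();
        }
        return fileIdx;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical 7-row input standing in for the real file
        try (PrintWriter pw = new PrintWriter("bom.tab")) {
            for (int i = 0; i < 7; i++) pw.println("row" + i);
        }
        System.out.println(split("bom.tab", 3)); // 7 rows at 3 per file
    }
}
```

Each part file can then be iterated with tFileList, which is what keeps any single subjob's memory footprint bounded.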
This has reduced the processing time to 7 minutes, but that is still too high.