We have a job on Windows that polls a network folder for new and updated files.
tWaitForFile is used to trigger the job. See the attached screenshots for the configuration. We have an issue related to the tWaitForFile component.
Typically a number of files arrive in the folder within a relatively short period of time, e.g. 9 files within 5 minutes. What happens is that Talend starts processing the first file. During the processing, more files appear in the folder. In the end tWaitForFile is executed the right number of times (execution count = file update count), but some files are processed twice and some files are not processed at all.
Somehow Talend mixes up which files it has processed and which it has not. Is something wrong in our configuration, or what could be the issue?
Many thanks in advance.
Could you please clarify a few things:
- Is the folder empty initially?
If yes, that means the files are newly inserted.
If no, that means the files are being updated each time. Would any file be updated twice?
You can try moving/archiving the processed file to a separate folder using "((String)globalMap.get("tWaitForFile_1_FILENAME"))" so that processed and unprocessed files do not get confused.
If a file is updated twice, the second update will be treated as a new file, because the already processed copy has been moved away.
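A rough sketch of the archiving step in plain Java, for example as the body of a tJava step after the processing subjob. The class name, the archive folder and the sample file are made up for illustration; in the job, the path would come from the globalMap value quoted above (whether that value is a full path or just a file name is an assumption you should check in your version).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class ArchiveProcessedFile {

    // Moves a processed file into archiveDir and returns its new location.
    // archiveDir is a hypothetical folder, not taken from the job in this thread.
    public static Path archive(Path processedFile, Path archiveDir) throws IOException {
        Files.createDirectories(archiveDir);
        Path target = archiveDir.resolve(processedFile.getFileName());
        // REPLACE_EXISTING: a file delivered again overwrites its older archived copy.
        return Files.move(processedFile, target, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        // Temporary folders stand in for the watched network folder and the archive.
        Path inbox = Files.createTempDirectory("inbox");
        Path archiveDir = Files.createTempDirectory("archive");
        Path file = Files.write(inbox.resolve("orders.csv"),
                "id;qty\n1;5\n".getBytes(StandardCharsets.UTF_8));

        Path archived = archive(file, archiveDir);
        System.out.println("still in inbox: " + Files.exists(file));
        System.out.println("archived to: " + archived);
    }
}
```

Keeping the archive outside the watched folder (or at least outside anything the trigger scans) avoids the move itself being seen as a new file event.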
Let me know if this solves your issue.
File-level triggers (especially over a network) are always a source of issues.
A small example: a series of ls -l commands.
As you can see, the file is already there but it keeps growing.
What happens if the job starts reading it while its size is still 0? The file system caches file operations, and the network adds one more layer.
A better way: put the Talend job on the same server as the files (this still does not guarantee 100% protection against all collisions), and also adapt your original process to write the file to another folder and rename it into place at the end. A rename does not transfer data, it only changes the link to the file, so it is much faster and reduces collisions.
The best way: send the files to a message queue. There is a big choice, but these are supported by all Talend Studios:
It provides a guarantee mechanism similar to database transactions.
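The write-then-rename idea on the producer side could look roughly like this. It is only a sketch: the class and method names and the staging folder are hypothetical, and it assumes the staging folder and the watched folder are on the same volume, since only then is the rename a pure metadata operation.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class StagedFileDelivery {

    // Writes the content into stagingDir first, then renames it into watchedDir,
    // so the watcher never sees a half-written file.
    // Assumption: both folders are on the same volume. If they are not, ATOMIC_MOVE
    // throws AtomicMoveNotSupportedException instead of silently copying.
    public static Path deliver(Path stagingDir, Path watchedDir, String fileName, byte[] content)
            throws IOException {
        Path staged = stagingDir.resolve(fileName);
        Files.write(staged, content);
        return Files.move(staged, watchedDir.resolve(fileName), StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        // Temporary folders stand in for the real staging and watched folders.
        Path staging = Files.createTempDirectory("staging");
        Path watched = Files.createTempDirectory("watched");

        Path delivered = deliver(staging, watched, "orders.csv",
                "id;qty\n1;5\n".getBytes(StandardCharsets.UTF_8));
        System.out.println("delivered: " + delivered + " (" + Files.size(delivered) + " bytes)");
    }
}
```

Note that what happens when the target name already exists is implementation specific with ATOMIC_MOVE, so a producer that re-sends the same file name may need extra handling.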
Thanks for the quick reply.
I have been thinking about this period when a file is visible in the folder but not yet complete. I agree that this can definitely be an issue, but I'm not fully convinced that it is the issue here. Basically there are 2 reasons
Regarding Ajinkya_Gonnade's message: we have tried both ways, clearing the folder before adding files and just overwriting the existing files.
A quick fix would be to build a custom loop that picks any file found in the folder, processes it, moves it to an archive folder afterwards, and then restarts. In our case, where the frequency of file creation is low, the loop could just pick any file without worrying that new incoming files would cause some file never to be processed.
But again, I would of course like to use the standard components instead.
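For reference, one pass of such a loop might be sketched like this in plain Java. All names here are hypothetical, the processing step is just a callback, and a real job would call drainOnce repeatedly with a pause in between.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class FolderDrainLoop {

    // One pass: process every regular file currently in inbox, then move it to archiveDir.
    // Returns the names of the files handled in this pass, in processing order.
    public static List<String> drainOnce(Path inbox, Path archiveDir, Consumer<Path> processor)
            throws IOException {
        List<Path> pending;
        try (Stream<Path> entries = Files.list(inbox)) {
            pending = entries.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        }
        List<String> handled = new ArrayList<>();
        for (Path file : pending) {
            processor.accept(file); // placeholder for the real processing
            Files.move(file, archiveDir.resolve(file.getFileName()),
                    StandardCopyOption.REPLACE_EXISTING);
            handled.add(file.getFileName().toString());
        }
        return handled;
    }

    public static void main(String[] args) throws IOException {
        // Temporary folders stand in for the network folder and the archive.
        Path inbox = Files.createTempDirectory("inbox");
        Path archiveDir = Files.createTempDirectory("archive");
        Files.write(inbox.resolve("b.csv"), "2".getBytes(StandardCharsets.UTF_8));
        Files.write(inbox.resolve("a.csv"), "1".getBytes(StandardCharsets.UTF_8));

        System.out.println(drainOnce(inbox, archiveDir, f -> {})); // prints [a.csv, b.csv]
        System.out.println(drainOnce(inbox, archiveDir, f -> {})); // prints []
    }
}
```

Because every handled file leaves the folder, a file that arrives during a pass is simply picked up by the next one, so nothing depends on matching trigger events to file names.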
That's exactly what I meant by this, and a loop is the obvious choice for continuous iteration.
"You can try moving/archiving the processed file to a separate folder using "((String)globalMap.get("tWaitForFile_1_FILENAME"))" so that processed and unprocessed files do not get confused.
If a file is updated twice, the second update will be treated as a new file, because the already processed copy has been moved away."
Also, printing the current file name on each execution will show you which files are processed and how many times.
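A small sketch of that diagnostic, in case it helps: it prints the file name on every trigger and keeps a count per name. The class is hypothetical; in the job, the name passed to record would be the tWaitForFile_1_FILENAME value from globalMap, and the counter would have to live somewhere that survives between iterations.

```java
import java.util.Map;
import java.util.TreeMap;

public class TriggerTally {

    // File name -> number of times it was reported by the trigger.
    private final Map<String, Integer> counts = new TreeMap<>();

    // Call once per trigger with the file name reported for that execution.
    public int record(String fileName) {
        int n = counts.merge(fileName, 1, Integer::sum);
        System.out.println("trigger #" + n + " for " + fileName);
        return n;
    }

    public Map<String, Integer> counts() {
        return counts;
    }

    public static void main(String[] args) {
        TriggerTally tally = new TriggerTally();
        tally.record("a.csv");
        tally.record("b.csv");
        tally.record("a.csv");
        System.out.println(tally.counts()); // prints {a.csv=2, b.csv=1}
    }
}
```

Comparing the final counts with the list of files that actually arrived shows directly which names were triggered twice and which never appeared.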