One Star

Traversing directories and matching with file lists

Hi everyone,
I would like to build a Job that processes files delivered into a dynamic directory structure via RSYNC and matches the file names against an "already imported" file list to avoid duplicates. How can that be done with Talend?
The long story:
I have a directory somewhere which is synchronized with the file system of a remote server via RSYNC.
Within that directory, the RSYNC process maintains a dynamic directory tree (meaning: it removes or creates directories at will).
Within the directories, the RSYNC process places files.
Now I want to build a Talend Job that traverses that dynamic directory tree and makes a list of all files therein.
Naturally, some of these files will have been processed before, so I must then match the file list against an "already imported" list based on the file names (the file names are unique).
The remaining files from the list I must sort by file age (oldest first), uncompress them (the files come as .gz), and import their content to a database.
The file names of all processed files must now be added to that "already imported" list.
Any ideas on how that can be done?
Thanks
Matt
3 REPLIES
Five Stars

Re: Traversing directories and matching with file lists

Design the job like this:
tFileList---iterate---tFileProperties---tMap-----Uniquefiles----tFileUnArchive---(A)
                                        |                                                          
                                        Lookup of old filelist                                      

(A)---tFileInput---tMap--tDb
|
OnSubjobOk--add to processed files list


In tFileList you can choose to include subdirectories in the traversal, so you need not worry about dynamic directory creation; just provide the parent directory path.
Hope this will help you.
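
For readers who want to see the logic behind the first half of that flow (list files recursively, drop the ones already imported, order by age), here is a minimal plain-Java sketch. It is not what Talend generates, only an illustration; the class name NewFileFinder, the method names, and the one-file-name-per-line format of the "already imported" list are assumptions made for this example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper class, named for illustration only.
public class NewFileFinder {

    // Read the "already imported" list (assumed: one file name per line).
    // Returns an empty set if the list file does not exist yet.
    public static Set<String> loadImportedNames(Path listFile) throws IOException {
        if (!Files.exists(listFile)) {
            return new HashSet<>();
        }
        return new HashSet<>(Files.readAllLines(listFile, StandardCharsets.UTF_8));
    }

    // Walk the whole tree under root, keep .gz files whose names are not in
    // alreadyImported, and return them oldest first (by last-modified time).
    public static List<Path> findNewFiles(Path root, Set<String> alreadyImported)
            throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                .filter(Files::isRegularFile)
                .filter(p -> p.getFileName().toString().endsWith(".gz"))
                .filter(p -> !alreadyImported.contains(p.getFileName().toString()))
                .sorted(Comparator.comparingLong(NewFileFinder::lastModified))
                .collect(Collectors.toList());
        }
    }

    // Last-modified time in milliseconds; wraps the checked exception so the
    // method can be used as a sort key inside the stream.
    private static long lastModified(Path p) {
        try {
            return Files.getLastModifiedTime(p).toMillis();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Matching on the bare file name works here only because the original post states that file names are unique across the whole tree.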
One Star

Re: Traversing directories and matching with file lists

Hi Umesh,
thank you very, very much for your help.
I am currently trying to implement the Job as you suggested - but I seem to have some issues with it.
For instance, what does tDB mean? And what is the second tMap used for?
Thanks
Matt
Five Stars

Re: Traversing directories and matching with file lists

tDb means your database component. The first tMap does the lookup against the existing list and passes on only the new files. That gives you the new compressed files, which you can extract with tFileUnArchive and then read with tFileInput. The second tMap is for any transformation of the data before loading, if needed.
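
To make the extract-then-record steps concrete, here is a rough plain-Java sketch of what happens to each new file: decompress the .gz, then append its name to the processed list once the import has succeeded. This is only an illustration of the idea, not Talend-generated code; the class GzipStep, its method names, and the plain-text list file are hypothetical, and the database load itself is left out.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;
import java.util.zip.GZIPInputStream;

// Hypothetical helper class, named for illustration only.
public class GzipStep {

    // Decompress source (.gz) into target, replacing target if it exists.
    public static void gunzip(Path source, Path target) throws IOException {
        try (InputStream in = new GZIPInputStream(Files.newInputStream(source))) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    // Append one file name per line to the "already imported" list,
    // creating the list file on first use.
    public static void markProcessed(Path listFile, String fileName) throws IOException {
        Files.write(listFile,
            (fileName + System.lineSeparator()).getBytes(StandardCharsets.UTF_8),
            StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }
}
```

Calling markProcessed only after the load has finished mirrors the OnSubjobOk trigger in the job design: a file that fails to import is not recorded and will be picked up again on the next run.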