simultaneous ftp download

One Star

simultaneous ftp download

Dear all,
I set up a job that downloads files from an FTP server. My job is attached.
FtpFileList lists all zip files; afterwards some information is transformed (e.g. abs_path). I was able to avoid problems with FTP timeouts by using the hash input/output components (see http://www.talendforge.org/forum/viewtopic.php?pid=128456#p128456).
What about the FTP files?
- In the meantime, users could delete files (.zip)
- more than 100 files at a time are possible
What about the job?
- I use crontab, which executes the download job regularly
- it can happen that two jobs are executed at the same time
- local folders use job-related names (unique naming for destination folders)
When one job is running, everything works fine. But when two jobs run simultaneously, zips are downloaded multiple times!
See my job design on second screenshot.
Anyway, my thoughts are:
Before a zip is downloaded, add a ".progress" extension (files with this extension are ignored by the FTP file list component).
Second, I test whether the zip still exists (it could have been deleted by a user or renamed by another job). Then I check whether the file already has the ".progress" extension. If not, I rename it to ".progress" and download it. Unfortunately it doesn't work, and honestly I have no further ideas. I don't even know how to google it... I couldn't find any related topics. Is my job design suboptimal?
Does anyone have ideas? Maybe another approach is better...
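For illustration, the rename-as-lock idea above can be sketched outside Talend. This is not Talend code, just a minimal Python sketch assuming a local POSIX filesystem, where rename is atomic, so exactly one of two racing workers can claim a given file (over FTP, the equivalent RNFR/RNTO rename is only as atomic as the server makes it):

```python
import os
import tempfile

def try_claim(path):
    """Claim a file by renaming it to '<name>.progress'.
    On a POSIX filesystem rename is atomic, so if two workers
    race for the same file, exactly one rename succeeds."""
    try:
        os.rename(path, path + ".progress")
        return True
    except OSError:
        return False  # someone else claimed (or deleted) it first

# Demo: two simulated job runs attempt to claim the same five files.
workdir = tempfile.mkdtemp()
files = [os.path.join(workdir, "file%d.zip" % i) for i in range(5)]
for f in files:
    open(f, "w").close()

claimed_by_1 = [f for f in files if try_claim(f)]  # "job 1" downloads these
claimed_by_2 = [f for f in files if try_claim(f)]  # "job 2" gets only the rest

# No file is ever claimed twice, and every file is claimed exactly once.
print(len(claimed_by_1) + len(claimed_by_2))  # 5
```

The sketch shows why the design is sound in principle: duplicates can only appear if the rename on the server is not actually exclusive.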
Seventeen Stars

Re: simultaneous ftp download

It sounds like you have found a good solution. Renaming a file to flag it as being in a processing state is a commonly used method. As for it not working: I have used this approach for a couple of months, and it should work.
One Star

Re: simultaneous ftp download

dear jlolling,
it doesn't work when two jobs are executed simultaneously. This happens when the first job has not finished its downloads before the next one starts. Contrary to my expectations, renaming does not prevent the jobs from downloading the same files. Let's say we have 10 files and Job 2 starts after Job 1 has downloaded 3 files. I get this result:
Job 1:
file 1 - 10 (all files are downloaded!)
Job 2:
file 4 - 10
Ideally both jobs would share the remaining files. So my expected result would be (for example):
Job 1:
file 1-3
file 5, 7, 10
Job 2:
file 4, 6, 8, 9
The job logs tell me that files could not be renamed or could not be found. I have no idea why it doesn't work...
One Star

Re: simultaneous ftp download

How about creating a log (file or DB) that keeps track of what every new job run should do? When a job starts, it gets the contents of the working directory (a list of all files). Let's say there are 5 files when it starts; all 5 files (by name) are logged as being processed by that job run, and the job downloads each file in succession. If another job starts while the first is running, here's what it should do: get a list of all files in the directory (say there are now 8 files in total), then check the log to see whether any of the 8 files are already being processed. It finds 5 of the file names in the log, so the second job logs that it is processing files 6, 7 and 8 only.
As for file names changing mid-stream, the only suggestion I can come up with is a similar logging mechanism that verifies, against the final system of record, that the contents of the files are not processed twice. For example, suppose a file is renamed from Customer1.zip to Customer2.zip. Process the files normally (per the logic above). When all files are downloaded, load each file into a staging table or file. If your files contain customer records with a primary key, your job would then check that it is not inserting the records from a particular file twice into the Customer table. If the file name change has business significance, you can use that information to do updates, deletes or whatever you need to do... but only a log (table or file) will help you do that.
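The primary-key check against the system of record can be sketched with a throwaway SQLite table (the `customer` table and its columns here are hypothetical, not from the thread): loading the same records twice, e.g. from a renamed duplicate file, becomes a no-op.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT)")

def load_file(rows):
    # INSERT OR IGNORE silently skips rows whose primary key already exists,
    # so re-processing a file's contents cannot create duplicates.
    conn.executemany(
        "INSERT OR IGNORE INTO customer (customer_id, name) VALUES (?, ?)", rows)
    conn.commit()

load_file([(1, "Alice"), (2, "Bob")])   # contents of Customer1.zip
load_file([(1, "Alice"), (2, "Bob")])   # same contents, renamed Customer2.zip
count = conn.execute("SELECT COUNT(*) FROM customer").fetchone()[0]
print(count)  # 2, not 4
```

The same idea works with an UPDATE-or-INSERT (upsert) if the renamed file is supposed to carry changed data rather than duplicates.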
One Star

Re: simultaneous ftp download

For every job run, you could create a GUID (a unique id for the job run) and write the file names to a log file with that GUID:
GUID File Name
1 Customer1.zip
1 Customer2.zip
1 Customer3.zip
1 Customer4.zip
1 Customer5.zip
When the second job runs, it appends to the log file as follows (use tMap or another component to avoid inserting the same file name twice):
GUID File Name
1 Customer1.zip
1 Customer2.zip
1 Customer3.zip
1 Customer4.zip
1 Customer5.zip
2 Customer6.zip
2 Customer7.zip
2 Customer8.zip
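A minimal sketch of this append-with-dedup log, in Python rather than Talend components (the in-memory `log` list stands in for the shared log file):

```python
import uuid

log = []  # (run_id, file_name) pairs: the log file from the post

def register_run(ftp_listing):
    """Append files not yet in the log under a fresh run id,
    and return the file names this run should download."""
    run_id = str(uuid.uuid4())
    seen = {name for _, name in log}          # names any earlier run logged
    mine = [n for n in ftp_listing if n not in seen]
    log.extend((run_id, n) for n in mine)
    return mine

first = register_run(["Customer%d.zip" % i for i in range(1, 6)])   # 5 files
second = register_run(["Customer%d.zip" % i for i in range(1, 9)])  # now 8 files
print(first)   # Customer1.zip .. Customer5.zip
print(second)  # Customer6.zip, Customer7.zip, Customer8.zip only
```

In a Talend job the `seen` lookup would be the tMap dedup step mentioned above; the important property is that the check and the append happen against the same shared log.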
One Star

Re: simultaneous ftp download

dear willm,
the approach you describe also came to mind. In that case you could also use job-related extensions (instead of renaming to .whatever, rename to .j##time-stamp##).
Your job design makes sure a file won't be downloaded twice - that is correct. But I liked the idea of having additional jobs share the remaining files (which would increase efficiency).
Anyway, the following problem occurred to me while I designed the job that renames a file to .progress and downloads it. Please refer to the second screenshot. We can ignore the tFileExists for the zip and look at this flow:
tFileExists (.progress) --->(if) (.progress not exists) tFtpRename --->(main) tFtpGet
I guess my problem is the use of the main link in the last step. If I use "Subjob Ok" instead, the file is downloaded even when .progress existed beforehand (the renaming does not happen, but the download does). It's not related to the if-link into the tFtpRename component.
I'm not sure the main link really does what it should. What I want is: when .progress does not exist, rename AND download the file; when it already exists, NEITHER try to rename it NOR download it.
I will also try an if-link with the same condition I used for tFtpRename. My hunch is that my approach is correct but my job design doesn't fit it.
One Star

Re: simultaneous ftp download

Dear all,
I just tested it out, again.
Downloading big files doesn't cause any problems. The remaining packages are divided correctly - no duplicates across the current job folders. Fine.
Downloading small files (max. 1 MB) causes trouble. The jobs don't share the remaining packages correctly; in this case I found all remaining files in both job folders (see above).
For me this means:
a) the job design is correct
b) with small files there is only a short time between checking whether the zip/.progress still exists, renaming, and downloading. This explains some of the error messages that occur (could not find zip, renaming error).
Anyone? I guess it means you either work with job-related lists/extensions or accept the risk of duplicates...
One Star

Re: simultaneous ftp download

This is obvious - but I'll state it nonetheless :-)... The reason you have a problem with small files is that they are processed fast, BUT there is latency (lag) in the round-trip communication between your Talend Job Execution Server and the FTP server.
How about?
+ At the start of the job, get the list of file names on the FTP server and log them locally on your Talend Job Server (if a file name has already been added to the local log, don't duplicate it); give new files a status of 0 (not downloaded) in your log
+ Next, iterate through the file names in your log and select one that has not been downloaded (status = 0). BEFORE you even go to the FTP server, mark the file's status as 1 (in progress) in your local log; then go fetch the file...
+ Once the FTP transfer is complete, change the file's status to 2 (downloaded)
+ Move on to the next file name in your log with status = 0, and repeat...
This way, if more than one job is running at a time, they'll share the workload. Updates to your log will be very fast, which will minimize, if not completely resolve, your latency issue with the FTP server.
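The steps above can be sketched with a small status table. SQLite stands in for the shared log here (a single in-process connection, so the race itself is only simulated; a real deployment needs a log both jobs can reach, with atomic updates). The claim works because the UPDATE only flips a row whose status is still 0, and the rowcount shows whether this job won:

```python
import sqlite3

NEW, IN_PROGRESS, DONE = 0, 1, 2

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log (name TEXT PRIMARY KEY, status INTEGER)")
conn.executemany("INSERT INTO log VALUES (?, ?)",
                 [("file%d.zip" % i, NEW) for i in range(6)])

def claim_next():
    """Atomically flip one status-0 row to 1 and return its name."""
    row = conn.execute("SELECT name FROM log WHERE status=? LIMIT 1",
                       (NEW,)).fetchone()
    if row is None:
        return None                       # nothing left to claim
    cur = conn.execute("UPDATE log SET status=? WHERE name=? AND status=?",
                       (IN_PROGRESS, row[0], NEW))
    conn.commit()
    # rowcount == 0 means another job claimed it between SELECT and UPDATE
    return row[0] if cur.rowcount == 1 else claim_next()

def finish(name):
    conn.execute("UPDATE log SET status=? WHERE name=?", (DONE, name))
    conn.commit()

# Two interleaved "jobs" draining the same log.
job1, job2 = [], []
while True:
    a, b = claim_next(), claim_next()
    if a:
        job1.append(a)   # job 1 "downloads" a, then marks it done
        finish(a)
    if b:
        job2.append(b)
        finish(b)
    if a is None and b is None:
        break
print(sorted(job1 + job2))  # every file appears exactly once
```

Each job ends up with a disjoint share of the files, which is exactly the load-sharing behaviour asked for earlier in the thread.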
One Star

Re: simultaneous ftp download

Thanks willm. The approach you describe here makes sense to me and sounds good (I like the idea of multi-level logging).
The explanation was not as obvious as you thought - thanks for that. What I can do on my side is mount this directory on my Job Execution Server, which should reduce latency significantly. Initially I thought it might be a problem with the job design, but after my tests and the feedback here, that doesn't seem to be the case. Unfortunately the idea of mounting only came up late, and honestly I found the problem interesting (besides, you won't be able to mount folders in every situation).
So, thanks for all your ideas and feedback. It helped me a lot!
Employee

Re: simultaneous ftp download

My solution for this topic is list: