[resolved] How to pick a file from S3 with latest date

Hi All,
Every 2 hours I get a new file in S3, and I have to pick the latest file from S3 based on its timestamp.
EX : My_File_20141104000001.csv
      My_File_20141104030001.csv
Can someone help me implement this using Talend?
Thanks in advance.
Rajesh

Accepted Solutions

Re: [resolved] How to pick a file from S3 with latest date

This is a very common task that is not especially easy to implement in Talend.
Please have a look at my example job below and let me know if this helps you, or if I can assist further.
To get the newest file, we get a list of files and then retrieve the properties of each of them. We then sort the file properties by "mtime" (the last modified time) and grab the newest for further processing.
1) tFileList: this component is configured to look for files that start with my chosen string
2) tFileProperties: this component retrieves the properties of each file
3) tBufferOutput: this component stores the file properties in memory so we can sort them once we've got info on all the files
4) tBufferInput: this component reads from the buffer we populated with file property information
5) tSortRow: this component sorts the files by mtime descending (meaning the newest file will be first in the list)
6) tSampleRow: this component grabs only the first row coming out of tSortRow
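
For readers who want to see the logic of the steps above outside Talend, here is a minimal Python sketch. It is not the Talend job itself: `FileProps`, `newest_file` and the mtime values are hypothetical names and sample data chosen for illustration, and the list stands in for whatever tFileList and tFileProperties would return.

```python
from typing import List, NamedTuple, Optional


class FileProps(NamedTuple):
    """Stand-in for one row of file properties (hypothetical structure)."""
    name: str
    mtime: int  # last-modified time; sample values below are made up


def newest_file(props: List[FileProps], prefix: str) -> Optional[FileProps]:
    """Return the most recently modified file whose name starts with prefix."""
    # Step 1 (tFileList): keep only files that start with the chosen string.
    matching = [p for p in props if p.name.startswith(prefix)]
    # Step 5 (tSortRow): sort by mtime descending so the newest file is first.
    ordered = sorted(matching, key=lambda p: p.mtime, reverse=True)
    # Step 6 (tSampleRow): keep only the first row, if there is one.
    return ordered[0] if ordered else None


files = [
    FileProps("My_File_20141104000001.csv", 1000),
    FileProps("My_File_20141104030001.csv", 2000),
    FileProps("Other_File_20141104040000.csv", 3000),
]
latest = newest_file(files, "My_File")
```

With this sample data, `latest` is the `My_File_20141104030001.csv` row, because the file with the largest mtime does not match the prefix and is filtered out first.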


All Replies

Re: [resolved] How to pick a file from S3 with latest date

Hi Rajesh
Let me first check that I've captured your requirements correctly:
1. Your source folder is fixed.
2. You intend to run your job every 2 hrs.
3. On every execution, you wish to pick the latest file (irrespective of its name).
Your confirmation would help in formulating a better solution.
MathurM

Re: [resolved] How to pick a file from S3 with latest date

Hi MathurM,
Thanks for your reply.
1. Your source folder is fixed.
Yes, my source folder is fixed.
2. You intend to run your job every 2 hrs.
Yes, my job has to run every 2 hrs.
3. On every execution, you wish to pick the latest file (irrespective of its name).
My file name will always start with the same prefix, i.e. (My_File), and my job has to pick only the file that starts with (My_File) and has the latest date.
Thanks
Rajesh
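
Because the example file names carry a timestamp, the rule described here can also be sketched by parsing the name instead of reading the modified time. The Python below is only an illustration: it assumes the 14 digits are in yyyyMMddHHmmss form (the thread does not confirm this), and `PATTERN` and `latest_by_name` are hypothetical names.

```python
import re
from datetime import datetime
from typing import Iterable, Optional

# Assumption: names look like My_File_<14 digits>.csv, digits = yyyyMMddHHmmss.
PATTERN = re.compile(r"^My_File_(\d{14})\.csv$")


def latest_by_name(keys: Iterable[str]) -> Optional[str]:
    """Return the key whose embedded timestamp is the most recent."""
    best_key, best_ts = None, None
    for key in keys:
        match = PATTERN.match(key)
        if match is None:
            continue  # ignore files that do not follow the naming scheme
        ts = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
        if best_ts is None or ts > best_ts:
            best_key, best_ts = key, ts
    return best_key


picked = latest_by_name([
    "My_File_20141104000001.csv",
    "My_File_20141104030001.csv",
    "readme.txt",
])
```

Parsing the name avoids depending on the upload time, which may differ from the time encoded in the file name.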

Re: [resolved] How to pick a file from S3 with latest date

Hi JohnGarrettMartin, I feel that with your solution above we have drifted a bit from the original problem.
Hi Rajesh,
I would suggest you try an approach along the lines of the job shown below.
Here,
1. We first create a start flag (assigning it a value, say 'T').
2. Using the tFileList component, we iterate over all the files in the source folder. This component itself allows us to sort the files. We can sort them on 'modified date' and arrange them in 'ASC' or 'DESC' order. In the present case, we choose 'DESC'.
3. Next, we process each file in the iteration only if an 'IF' condition holds, i.e. the 'FLAG' equals 'T'.
4. On successful processing of the file, on an 'OnSubjobOk' link, we change the 'FLAG' to, say, 'F'.
5. As a result, after the first file is processed successfully, the flag changes from 'T' to 'F'. The 'IF' condition is then no longer fulfilled, so no further files are processed.
This way, only the latest file in the source folder is processed on every execution.
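
The flag logic above can be sketched in plain Python as follows. This is an illustration of the control flow only, not the Talend job: `process_latest_only` is a hypothetical name, and the (name, modified time) pairs are made-up sample data.

```python
from typing import Callable, List, Tuple


def process_latest_only(
    files: List[Tuple[str, int]],
    process: Callable[[str], None],
) -> List[str]:
    """Process only the first file of a newest-first iteration."""
    flag = "T"  # step 1: start flag
    processed = []
    # Step 2: iterate over the files sorted on modified date, DESC.
    for name, _mtime in sorted(files, key=lambda f: f[1], reverse=True):
        # Step 3: process the file only while the flag equals 'T'.
        if flag == "T":
            process(name)
            processed.append(name)
            # Step 4: after successful processing, switch the flag to 'F'.
            flag = "F"
    return processed


seen: List[str] = []
done = process_latest_only(
    [("My_File_20141104000001.csv", 1), ("My_File_20141104030001.csv", 2)],
    seen.append,
)
```

The loop still visits every file, as the iteration does in the job, but only the newest one passes the condition.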
Hope this helps.
MathurM

Re: [resolved] How to pick a file from S3 with latest date

Hi,
Do you have rights to move files from the S3 bucket to another folder?
If yes, then once the files are processed, move them to an archive folder. This is much simpler than implementing workarounds...
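
A rough sketch of the archive idea, using local directories as a stand-in for the S3 bucket and archive folder (on S3 itself, a move is a copy followed by a delete). `process_and_archive` and the folder names are hypothetical, and the actual processing step is left as a placeholder.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List


def process_and_archive(inbox: Path, archive: Path, prefix: str = "My_File") -> List[str]:
    """Process every matching file in inbox, then move it to archive."""
    archive.mkdir(parents=True, exist_ok=True)
    handled = []
    for path in sorted(inbox.glob(prefix + "*")):
        if not path.is_file():
            continue
        # Placeholder: the real job would load or transform the file here.
        shutil.move(str(path), str(archive / path.name))
        handled.append(path.name)
    return handled


with tempfile.TemporaryDirectory() as tmp:
    inbox = Path(tmp) / "inbox"
    archive = Path(tmp) / "archive"
    inbox.mkdir()
    (inbox / "My_File_20141104000001.csv").write_text("a\n")
    (inbox / "My_File_20141104030001.csv").write_text("b\n")
    handled = process_and_archive(inbox, archive)
    remaining = sorted(p.name for p in inbox.iterdir())
    archived = sorted(p.name for p in archive.iterdir())
```

After each run the source folder is empty, so the next run only sees files that arrived since then and no "latest file" selection is needed.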
Vaibhav