One Star

[resolved] Using tFileUnarchive on nested folder structure in S3

Here's a simple job which I have built.

[Image: example job layout showing how to conditionally unzip files from S3]
As usual, I connect to S3, list all the relevant objects in the bucket with tS3List, and pass each one to tS3Get.

In the above job I set tS3Get up to fetch every object that is iterated on by the tS3List component by setting the key as:

((String)globalMap.get("tS3List_1_CURRENT_KEY"))
and then downloading it to:

"C:/Talend/5.6.1/studio/workspace/S3_downloads/" + ((String)globalMap.get("tS3List_1_CURRENT_KEY"))
The extra bit I've added starts with a Run If conditional link from the tS3Get to the tFileUnarchive, with the condition:

((String)globalMap.get("tS3List_1_CURRENT_KEY")).endsWith(".zip")
This checks whether the file being downloaded from S3 is a .zip file.
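As a rough standalone sketch of the same check (not Talend-generated code), the condition can be written so that it also tolerates a missing key and upper-case extensions. The class name, the method name isZipKey and the sample keys are made up for illustration; in the job itself the value would come from globalMap.get("tS3List_1_CURRENT_KEY").

```java
import java.util.Locale;

class ZipKeyCheck {
    // Returns true when the S3 object key names a .zip file.
    // Null-safe and case-insensitive, unlike a bare endsWith(".zip").
    static boolean isZipKey(Object key) {
        if (!(key instanceof String)) {
            return false; // covers null and unexpected types
        }
        return ((String) key).toLowerCase(Locale.ROOT).endsWith(".zip");
    }

    public static void main(String[] args) {
        // Hypothetical keys, shaped like the month-wise folders described in this thread.
        System.out.println(isZipKey("May2015/File.zip"));
        System.out.println(isZipKey("May2015/File.ZIP"));
        System.out.println(isZipKey("May2015/File.txt"));
        System.out.println(isZipKey(null));
    }
}
```

Whether the case-insensitive match is wanted depends on how the objects are named in the bucket; the original condition only matches a lower-case ".zip".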

The tFileUnarchive component then needs to be told what to unzip, which will be the file I just downloaded:

"C:/Talend/5.6.1/studio/workspace/S3_downloads/" + ((String)globalMap.get("tS3List_1_CURRENT_KEY"))
and where to extract it to:

"C:/Talend/5.6.1/studio/workspace/S3_downloads"
This then puts any extracted files in the same place as the ones that didn't need extracting.
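The same base directory is typed into tS3Get, tFileUnarchive and tFileList. A minimal sketch of keeping it in one place is below, assuming the base directory is passed in as a single value (in a Talend job that could be a context variable, which this job does not currently use). The class name, the helper targetFor and the sample key are hypothetical.

```java
class DownloadPaths {
    // Joins a base directory and an S3 object key with exactly one "/" between them.
    static String targetFor(String baseDir, String s3Key) {
        String base = baseDir.endsWith("/") ? baseDir : baseDir + "/";
        String key = s3Key.startsWith("/") ? s3Key.substring(1) : s3Key;
        return base + key;
    }

    public static void main(String[] args) {
        String key = "May2015/File.zip"; // hypothetical key

        // Local Windows base directory used in this post.
        System.out.println(targetFor("C:/Talend/5.6.1/studio/workspace/S3_downloads", key));

        // A Linux-style base directory, such as the one mentioned later in this thread.
        System.out.println(targetFor("/home/work/talend/", key));
    }
}
```

With the base directory as a parameter, only that one value has to change between the local machine and another environment; the rest of each expression stays the same.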

From here I can iterate through the downloads folder looking for the file types I want, by setting the tFileList directory to "C:/Talend/5.6.1/studio/workspace/S3_downloads" and the glob expression to "*.txt", since in this case I wanted to read only the txt files (including the zipped ones) I had in S3.

Finally, I read the delimited files by setting the file name in the tFileInputDelimited component to:

((String)globalMap.get("tFileList_1_CURRENT_FILEPATH"))
In this example I simply print the rows to the console (in my original job I have a tMap and an output table).

Now the issues are:

The first is that I have a bucket on S3, say 'Analysis', and inside it I have month-wise folders such as 'May2015', 'June2015' and so on. So tFileUnarchive extracts the files together with their folder under the path I specified, e.g. C:/Talend/5.6.1/studio/workspace/S3_downloads/May2015/File.txt. When I then iterate with tFileList I cannot find the file, because tFileList only searches the top level of C:/Talend/5.6.1/studio/workspace/S3_downloads. How can I make tFileList go inside the subfolders?

Also, here I have shown how to do it on my local system, but how do I achieve the same when running on an EMR cluster? That is, how do I change the path structure (or anything else) so that the job can run on EMR?

Any help on this is greatly appreciated.
2 REPLIES
Community Manager

Re: [resolved] Using tFileUnarchive on nested folder structure in S3

So tFileUnarchive extracts the files together with their folder under the path I specified, e.g. C:/Talend/5.6.1/studio/workspace/S3_downloads/May2015/File.txt. When I then iterate with tFileList I cannot find the file, because tFileList only searches the top level of C:/Talend/5.6.1/studio/workspace/S3_downloads. How can I make tFileList go inside the subfolders?

Check the 'Includes subdirectories' box on tFileList so that it also searches the subdirectories.
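To picture what a recursive listing returns compared with a top-level one, here is a small plain-Java sketch. It is not how tFileList is implemented, only an illustration of the idea; the class name, the method listTxtFiles and the sample files are invented for the example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class RecursiveTxtList {
    // Walks root and every subdirectory, returning all regular files ending in ".txt".
    static List<Path> listTxtFiles(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                .filter(Files::isRegularFile)
                .filter(p -> p.getFileName().toString().endsWith(".txt"))
                .sorted()
                .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Build a throwaway folder shaped like S3_downloads/May2015/File.txt.
        Path root = Files.createTempDirectory("S3_downloads");
        Path month = Files.createDirectories(root.resolve("May2015"));
        Files.write(month.resolve("File.txt"), "a;b".getBytes());
        Files.write(root.resolve("Top.txt"), "c;d".getBytes());

        // Both the top-level file and the one inside May2015 are listed.
        for (Path p : listTxtFiles(root)) {
            System.out.println(root.relativize(p));
        }
    }
}
```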
Best regards
Shong
One Star

Re: [resolved] Using tFileUnarchive on nested folder structure in S3

Hi, I have done that and now it's reading the files within the directories.
But how do I do the same on EMR?
How should I define the path there? I use the path /home/work/talend/
Any help on this would be appreciated.