Splitting File into Several Smaller Files in HDFS

Eight Stars


I'm trying to split a file in HDFS into several smaller files that will also be stored in HDFS. I can do this locally with the tFileOutputDelimited component, which has the "Split Output in Several Files" option under "Advanced Settings". However, the tHDFSOutput component doesn't have that option. I need to split the input file so that each output file contains one record, and all output files are stored in HDFS. Can anyone point me in the right direction? Thanks!

 

David


Accepted Solutions
Sixteen Stars

Re: Splitting File into Several Smaller Files in HDFS

Just replace the tHDFSOutput component with the tHMap and that should do it. Although I would probably add a tMap before that to filter out bad records....so long as you know what "bad" looks like, this will save a lot of time and processing.


All Replies
Sixteen Stars

Re: Splitting File into Several Smaller Files in HDFS

An easy method that jumps to mind (I've not tried this) is to read the data from your file using the tHDFSInput and feed it into a tFlowToIterate. This will iterate over each row and convert each of your columns into globalMap variables with a key of "row_name.column_name". Connect the tFlowToIterate to a tJava. In that component, generate your filename. I'm guessing you can use something from the data row for this. If the row feeding your tFlowToIterate is called "row1" and your filename columns are called "name" and "extension" (I know, your data will not look like this), you can set your filename globalMap variable like this....

globalMap.put("filename", ((String)globalMap.get("row1.name"))+"."+((String)globalMap.get("row1.extension")));

Once this is done in your tJava (do whatever you need to do to generate a unique filename), connect it to a tFixedFlowInput component. Generate your schema in that component and set the column values to use the globalMap values stored in the tFlowToIterate. Then connect the tFixedFlowInput to the tHDFSOutput. Set the filename of the tHDFSOutput to ((String)globalMap.get("filename")).
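For illustration, the expressions end up looking something like this. I'm sticking with the made-up "name"/"extension" columns from above, and the HDFS directory is just an example path you'd replace with your own:

// tFixedFlowInput ("Use Single Table") column values, read back from globalMap
// name column:      ((String)globalMap.get("row1.name"))
// extension column: ((String)globalMap.get("row1.extension"))

// tHDFSOutput file name expression (example directory)
"/user/talend/output/" + ((String)globalMap.get("filename"))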

 

It will look something like this...

[Attached screenshot: Screen Shot 2018-02-05 at 22.45.11.png]

Eight Stars

Re: Splitting File into Several Smaller Files in HDFS

Thank you for the detailed reply. I really appreciate it!

 

Can I ask a follow-up question? Instead of splitting the file into one file per record, is there a way I can send just one record at a time to another component, either using a buffer or with tJavaRow?

 

The issue is that I'm mapping records using tHMap. I have all the raw records in one large file. If I send that file to tHMap directly, it will throw an error on the first bad record it sees and not process any other records. I've found that if I split the records so that it processes one file (containing one record) at a time, it will only throw out the bad records and keep processing the good ones. I'm doing this locally, but I need to do it in HDFS, hence my question. However, there are hundreds of thousands of records, and HDFS really prefers to work with large files instead of many small files. So, is there a way I can extract one record at a time from the large file and send it to the tHMap component directly, without first making a new file? I've tried everything I can think of, but just can't get it to work.

 

Thanks again for your reply.

 

David

Sixteen Stars

Re: Splitting File into Several Smaller Files in HDFS

Just replace the tHDFSOutput component with the tHMap and that should do it. Although I would probably add a tMap before that to filter out bad records....so long as you know what "bad" looks like, this will save a lot of time and processing.
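The tMap filter is just a boolean Java expression on the output table, something like this (the column names and the definition of "bad" here are placeholders, since only you know what your records look like):

// tMap output filter expression: keep only rows that look well-formed
row1.name != null && !row1.name.trim().isEmpty() && row1.extension != null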

Eight Stars

Re: Splitting File into Several Smaller Files in HDFS

Thanks again for the quick response. Your suggestion works just fine: I took out the tJava and used the tFlowToIterate ITERATION context variable to create the (mapped) output filenames. I've been struggling with this problem for quite a while, and I really appreciate your advice on how to solve it.
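In case it helps anyone else reading this, the filename expression built from the iteration counter looks roughly like this (tFlowToIterate_1 is just the component name in my job, and the directory and extension are examples):

// one output file per record, numbered by the current iteration of tFlowToIterate_1
"/user/david/split/record_" + ((Integer)globalMap.get("tFlowToIterate_1_CURRENT_ITERATION")) + ".txt"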

Sixteen Stars

Re: Splitting File into Several Smaller Files in HDFS

Not a problem. Experiment with your solution. It might not be the most efficient way of achieving your goal. It just seemed a logical potential solution once I understood what you wanted.