[resolved] part-0001, part-0002, part-0003.... in tHDFSOutput

Seven Stars

[resolved] part-0001, part-0002, part-0003.... in tHDFSOutput

OK, I finally figured out how to design a MapReduce job in Talend and run it on a Hadoop cluster. The very first job was pretty simple: I was trying to create a copy of an existing file on HDFS. This is what the job looks like:
tHDFSInput ------ tMap ------ tHDFSOutput
So, I had everything set up in Talend and on the Hadoop cluster. I designed the job and executed it on Hadoop using the Oozie scheduler (it's pretty good, I must say), and the job ran successfully. I went to my output folder, /user/hdfs/output/temp, to look at the file. The job did create output at that location, but in small chunks:
1) part-0001.txt
2) part-0002.txt
3) part-0003.txt
4) ......
This is confusing and disturbing at the same time. If my output is created on HDFS in chunks, how am I supposed to read it as a single file in the first place? If I ever need to create an output file that will be used as input for the next step, there is no way I can read these chunks as one whole file.
I believe there is something missing with this tHDFSOutput component. A detailed explanation of my concerns would be really appreciated.

Accepted Solutions
Employee

Re: [resolved] part-0001, part-0002, part-0003.... in tHDFSOutput

Hi,
I'm glad to see you finally figured out your issue.
About your question, there is a simple answer:
--> The number of output files equals the number of reducers if your map/reduce job contains at least one reducer.
--> #output_files = #reducers
--> For a map-only job, the number of output files equals the number of mappers (i.e. the number of input blocks).
This is standard Hadoop behavior: you can tweak the number of reducers/mappers with various Hadoop properties, but in any case your output will be written as chunked part files.
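For example, you can force a single reducer so that the job writes exactly one part file. This is only a sketch: the property name depends on your Hadoop version (mapred.reduce.tasks on older clusters, mapreduce.job.reduces on newer ones), the jar and driver class below are placeholders, and passing -D options this way assumes the driver uses Hadoop's ToolRunner:

# my-job.jar and com.example.MyDriver are hypothetical placeholders
hadoop jar my-job.jar com.example.MyDriver -D mapred.reduce.tasks=1 input_folder/ output_folder/

Note that a single reducer serializes the whole reduce phase, so it is rarely a good idea for large datasets.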
How can you work around it?
On the cluster side, you can use the command 'hadoop fs -getmerge inputfolder/ outputfilename', which concatenates all the part files in a folder into a single file on the local filesystem.
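For instance, with the output folder from your post (the local and HDFS destination names below are just examples):

# concatenate all part files into ONE file on the local filesystem
hadoop fs -getmerge /user/hdfs/output/temp /tmp/merged_output.txt
# optionally push the merged file back to HDFS
hadoop fs -put /tmp/merged_output.txt /user/hdfs/output/merged_output.txt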
Within Talend, you can design a Data Integration Job (a Standard Job in the palette) that lists the files contained in a folder, iterates over them, and reads each one in order to send the data to another sink (HDFS, a local file, a database, or any other storage). tHDFSList -- Iterate --> tHDFSInput -- row1 --> tFileOutputDelimited (in Append mode) would be a correct design.
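If you only need the merged content as a single stream and your part files are plain text, a wildcard cat from the command line achieves the same thing:

# stream every part file in (sorted) order into one local file
hadoop fs -cat /user/hdfs/output/temp/part-* > /tmp/merged_output.txt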
I hope my answer is going to help you.
Cheers,
Rémy.

All Replies
Seven Stars

Re: [resolved] part-0001, part-0002, part-0003.... in tHDFSOutput

By the way, I'm using Talend 5.3 Enterprise Edition...
Seven Stars

Re: [resolved] part-0001, part-0002, part-0003.... in tHDFSOutput

Hey rdubois,
Thanks for the reply. I understand, but I'm still not convinced. I'll work on this further and get back to you.

Re: [resolved] part-0001, part-0002, part-0003.... in tHDFSOutput

Hi,
Please let me know if you figured anything out. Although the method above works, it is time-consuming, and it also moves the blocks out of HDFS and back again just to append them into a single target.
If I have to do the same for multiple sets of files, it will take even more time.
Thanks,
Swami.
Seven Stars

Re: [resolved] part-0001, part-0002, part-0003.... in tHDFSOutput

In version 5.4.1, the tHDFSOutput component has a "Merge configuration" option which solves this problem. Please see the attached screenshot. I'm pretty sure this option was not available in earlier versions. Hope this helps.