split input file into multiple output files

I have a Perl program that takes an input file containing a date column (rows not necessarily in order) and writes multiple output files, grouping all rows with the same date together.
Here's an example of the file
data, date
"lots of data","20090101"
"more data","20090105"
"even more data","20090101"
"you guessed it, data!", "20090105"

and the output files:
file: out_20090101
data,date
"lots of data","20090101"
"even more data","20090101"
file: out_20090105
data,date
"more data","20090105"
"you guessed it, data!","20090105"

My question is how would you guys design a Talend job to do this type of operation?
I've considered using a tMap to filter by dates, sending each row to a separate tFileOutputDelimited component.

I don't like this solution mainly because I would need 31 tFileOutputDelimited components (one per day of the month). Can any of you guys think of a better way?
8 REPLIES
Employee

Re: split input file into multiple output files

Here comes my first solution. The input file is read 31+1 times (so hopefully it's a short file), but you only use simple components:
- first, we list all available dates
- then we read the file again and filter on the first date
- then we read the file again and filter on the second date
- and so on for all available dates
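Outside Talend, the multi-pass idea above can be sketched in plain Java. This is only an illustration: the class name and the assumption that the date is the last comma-separated field (as in the sample file) are mine, not part of the actual job.

```java
import java.util.*;

public class MultiPassSplit {
    // Extract the date, assumed to be the last comma-separated field,
    // quoted or not (matches the sample file in the question).
    static String dateOf(String row) {
        return row.substring(row.lastIndexOf(',') + 1).replace("\"", "").trim();
    }

    // Pass 1 lists the distinct dates; then one extra pass per date
    // filters the matching rows, mirroring the 31+1 reads of the job.
    public static Map<String, List<String>> split(List<String> rows) {
        Set<String> dates = new TreeSet<String>();
        for (String row : rows) dates.add(dateOf(row));           // pass 1
        Map<String, List<String>> out = new LinkedHashMap<String, List<String>>();
        for (String date : dates) {                               // passes 2..N+1
            List<String> matching = new ArrayList<String>();
            for (String row : rows) {
                if (dateOf(row).equals(date)) matching.add(row);
            }
            out.put(date, matching);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList(
            "\"lots of data\",\"20090101\"",
            "\"more data\",\"20090105\"",
            "\"even more data\",\"20090101\"");
        System.out.println(split(rows).keySet()); // prints [20090101, 20090105]
    }
}
```

For an in-memory list this is harmless, but on a real file each "pass" means a full re-read, which is the cost discussed below.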
Employee

Re: split input file into multiple output files

Here comes my second solution (which of course I like much better), though I must admit it's really a trick...
The "trick" is that the tPerlRow changes the output file handle for each row:
@output_row = @input_row;
my $date = $input_row[1]; # the date column
if (not defined $outputs{$date}) {
    open($outputs{$date}, '>', '/tmp/out_'.$date);
}
$output_FH_tFileOutputDelimited_1 = $outputs{$date};

As the output file path configured on the component has no importance, I've set it to /dev/null.
The advantage is that the input file is read only once and nothing needs to be sorted. The drawback is that it's not very good for maintenance.

Re: split input file into multiple output files

thanks plegall!
As an aside, the second solution is very similar to the Perl script I wrote to solve this issue.
Unfortunately (for my sanity, I'm a Perl guy) I'm stuck using Java at my current position, so the second solution is not possible, to my knowledge.
The first solution, while creative and workable, has a catch: the files I'm dealing with are between 200MB and 5GB, and reading them 31 times might be a bit... slow.

My solution is also a bit of a trick:
I just call my Perl script ;)

Any other ideas would be most welcome. It's an interesting (yet difficult) problem to solve with Talend.
Six Stars

Re: split input file into multiple output files

Bring the "data" and "date" columns into a tFlowToIterate, which iterates over a tJava with something like this in it:
try {
    String output = row1.data + "," + row1.date;
    // Should make the path a context variable
    String filename = "/tmp/out_" + row1.date;
    // Single-quote the payload so the shell doesn't split or expand it
    String[] bashCommand = new String[]{"sh", "-c", "echo '" + output + "' >> " + filename};
    Process child = Runtime.getRuntime().exec(bashCommand);
    child.waitFor(); // wait so appends from successive rows don't interleave
} catch (Exception e) {
    e.printStackTrace(); // don't swallow failures silently
}

On Windows you would need to change "sh -c" to "cmd /c". This is pretty similar to one of Pierrick's suggestions, only in Java form and using the OS to append, avoiding a ton of tFileOutput components. You will still need to write the header.
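If shelling out per row feels fragile (quoting issues, one process per row), the same append-with-header-on-create idea can be kept in pure Java inside the tJava. A minimal sketch; the directory argument, file naming, and header are assumptions taken from the sample in the question:

```java
import java.io.*;

public class AppendWithHeader {
    // Append one row to out_<date> in dir, writing the "data,date" header
    // first when this call creates the file.
    public static void append(String dir, String date, String row) throws IOException {
        File f = new File(dir, "out_" + date);
        boolean isNew = !f.exists();
        PrintWriter out = new PrintWriter(new FileWriter(f, true)); // append mode
        try {
            if (isNew) out.println("data,date");
            out.println(row);
        } finally {
            out.close();
        }
    }
}
```

Opening and closing the file on every row is still wasteful for multi-GB inputs; this just removes the external shell from the loop.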
Josh
One Star

Re: split input file into multiple output files

Here is a Java translation of plegall's second solution. Very good idea... (but, as he mentioned, not the best way for maintenance):
// start code of tJavaFlex
java.util.HashMap<String, Object> fileHandles = new java.util.HashMap<String, Object>();

// main code of tJavaFlex
row2.date = row1.date;
row2.data = row1.data;
if (fileHandles.containsKey(row1.date)) {
    CsvWritertFileOutputDelimited_1 = (com.csvreader.CsvWriter) fileHandles.get(row1.date);
} else {
    String fileName = context.testData + "/" + jobName + "/out_" + row1.date;
    CsvWritertFileOutputDelimited_1 = new com.csvreader.CsvWriter(
        new java.io.BufferedWriter(new java.io.OutputStreamWriter(
            new java.io.FileOutputStream(fileName, false),
            "ISO-8859-15")), ',');
    fileHandles.put(row1.date, CsvWritertFileOutputDelimited_1);
    String[] headColu = new String[2];
    headColu[0] = "data";
    headColu[1] = "date";
    CsvWritertFileOutputDelimited_1.writeRecord(headColu);
}

// end code of tJavaFlex
for (Object fileHandle : fileHandles.values()) {
    ((com.csvreader.CsvWriter) fileHandle).close();
}
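The same single-pass, one-writer-per-date pattern can also be tried outside a Talend job. The sketch below is a standalone version under stated assumptions: plain java.io writers stand in for the generated com.csvreader.CsvWriter, and the file names and header come from the sample in the question.

```java
import java.io.*;
import java.util.*;

public class SplitSinglePass {
    // One pass over the rows: keep an open writer per date in a map,
    // creating it (and writing the header) on first sight of that date.
    public static void split(String dir, List<String[]> rows) throws IOException {
        Map<String, PrintWriter> writers = new HashMap<String, PrintWriter>();
        try {
            for (String[] row : rows) {                 // row = {data, date}
                PrintWriter w = writers.get(row[1]);
                if (w == null) {
                    w = new PrintWriter(new FileWriter(new File(dir, "out_" + row[1])));
                    w.println("data,date");
                    writers.put(row[1], w);
                }
                w.println("\"" + row[0] + "\",\"" + row[1] + "\"");
            }
        } finally {
            // mirror the tJavaFlex end code: close every handle once
            for (PrintWriter w : writers.values()) w.close();
        }
    }
}
```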

Bye
Volker
One Star

Re: split input file into multiple output files

Hi, I'm trying to do the same thing in Talend with Perl on Windows, but /dev/null doesn't exist. I tried 'NUL', but Talend gets an error:
" faile to write row at C:\Program Files\Talend\TIS_TE-All-r17347-V2.4.2\workspace\.Perl\AFFILIATES.job_GeneralAdExport_0.2.pl line 2887."
Any advice for doing this with a tPerlRow on Windows?
Thanks,
Barry
Employee

Re: split input file into multiple output files

Hi, I'm trying to do the same thing in Talend with Perl on Windows, but /dev/null doesn't exist. I tried 'NUL', but Talend gets an error

Instead of the Unix "/dev/null", use "C:/temp/talend_dev_null" (assuming "C:/temp" exists). The difference from the Unix /dev/null is that an empty C:/temp/talend_dev_null file will actually be created.
One Star

Re: split input file into multiple output files

Any ideas on how to make Volker Brehm's solution better for maintenance? More flexible?