One Star

Write records into multiple files based on a field in the data - large

Hi,
(I've asked this same question on an old related topic, but have duplicated it here to raise its visibility.)
I am new to Talend and have a use case to split very large incoming files into individual files based on a key field in the data itself. 
I have seen a few solutions on here and other forums based around using tUniqueRow to identify the keys contained in the data, then tFlowToIterate to iterate through each key 
However, I suspect that approach will not perform well in my case:
* Input volumes are very large > 1 billion rows daily
* Set of key values found in each source file is variable
* Set of possible key values is known but quite large - around 200 currently - and growing as new values are added to the source system. 
* Data distribution is very skewed - a small number of key values (2-3) make up the bulk (up to 50%) of the data, while some key values may number only in the hundreds 
Obviously I don't want to iterate through a billion row dataset x 200 times.
Is there any way for Talend to manage writing to a variable number of multiple output files concurrently, writing each row to the appropriate output file?
( For anyone who's come from an Ab Initio background, basically I'm looking for behaviour akin to Ab Initio's Write Multiple Files component. )
Many thanks,
Andy
7 REPLIES
Fifteen Stars

Re: Write records into multiple files based on a field in the data - large

The short answer is "not if you don't want to use some Java". The longer answer is "anything is possible in Talend if you can write Java".
Unfortunately, if you want to use standard Talend components, doing what you want will require you to pre-emptively set up an output file component for each of the key values you are looking for.
However, I don't believe it would be too hard to use a tJavaFlex to build a solution where a new FileWriter is instantiated per new key value that is received. Each new FileWriter could be kept in a HashMap (this is off the top of my head by the way, so will not be the optimum solution) and only closed when the final record is received. Every time a matching key value is passed to the tJavaFlex, the appropriate FileWriter would be used to write to the correct file. 
The thing you would have to worry about would be memory. This would potentially be a massive issue with >1 billion records. 
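To make that idea a bit more concrete, here is a minimal sketch of the pattern in plain Java (a sketch only: the field layout, the key position and the output path are invented for illustration, and in a real job the rows would come from the Talend flow rather than a hard-coded array):

import java.io.FileWriter;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class SplitByKeySketch {

    public static void main(String[] args) throws IOException {
        //One FileWriter per key value, created lazily the first time the key is seen
        Map<String, FileWriter> writers = new HashMap<String, FileWriter>();

        //Stand-in for the incoming rows; the key is assumed to be the first field
        String[][] rows = {
                {"KEY_A", "value1", "value2"},
                {"KEY_B", "value3", "value4"},
                {"KEY_A", "value5", "value6"}
        };

        try {
            for (String[] row : rows) {
                String key = row[0];
                FileWriter writer = writers.get(key);
                if (writer == null) {
                    //First time this key is seen: open its output file
                    writer = new FileWriter("/tmp/" + key + ".csv");
                    writers.put(key, writer);
                }
                writer.write(String.join(";", row) + "\n");
            }
        } finally {
            //Only close the writers once the final record has been processed
            for (FileWriter writer : writers.values()) {
                writer.close();
            }
        }
    }
}

The only per-record work is a HashMap lookup and a write; each writer stays open until the end of the run, which is where the memory and file-handle concern mentioned above comes in.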
I may give this a try since I think it might be a good idea for an Exchange component. I'm not promising anything straight away, but it sounds like a problem worth tackling. 
Rilhia Solutions
Five Stars

Re: Write records into multiple files based on a field in the data - large

I think, sometimes, the challenge is a little bit bigger than “How do I do this in Talend?”, and maybe 1 billion rows per day is one of those cases. This is a lot of data by anyone's standard.
I'm assuming that this data is not in a single multi-gigabyte or terabyte file.
If you were doing this in Ab Initio, you'd probably have paid a £1M+ license fee and a huge amount for the hardware needed to support its scalable processing to deal with this type of challenge.
If this was a 1 million or 10 million row file, then iterating through the data might be an inelegant but pragmatic approach; but this is unlikely to cut it for data of this size.
If you have enough hardware at your disposal, then maybe you could parse large files many times. Or maybe there is an alternative approach to meet the requirement.

I think a much bigger understanding of the requirements and constraints is needed to really answer this question.
One Star

Re: Write records into multiple files based on a field in the data - large

Thanks for your comments. I suspected "write Java" might be one of the answers. Unfortunately we don't have that level of Java ability in our team. 
I agree that managing memory is likely to be an issue with a large number of open files and large volumes of records.

Correct, the data does not arrive in a single file. We receive thousands of files per day, which we batch up for our ETL process. In total this averages 500-600 GB per day, though it can peak at nearly 1 TB, and it will continue to grow organically over the coming months and years.

It wasn't my intention to compare Talend unfavourably to Ab Initio; I realise they are coming from quite different angles. However, Talend does claim to handle Big Data ...

We'd likely run such a job on our Cloudera Hadoop cluster, which is fairly modest as Hadoop clusters go. Iterating through the data is almost certainly not going to cut it. There is a possible solution using Pig before the data gets to Talend (though it is not volume tested yet), but really our intention is for Talend to be our ETL tool of choice on Hadoop. 
Five Stars

Re: Write records into multiple files based on a field in the data - large

I have no doubt that Talend can handle 1bn rows per day; it just seems to me that, for this data volume, the full requirement needs to be understood, as there are some architectural questions to be answered.

With many files arriving throughout the day, your splitting could, for example, be done on much smaller data sets, meaning that multiple passes of each data set may be reasonable when splitting, and memory becomes less of an issue (for a single process).
And of course, you could process multiple input files simultaneously.
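To illustrate that last point, here is a rough sketch of processing several input files at once in plain Java (purely illustrative assumptions: the /data/incoming directory, the pool size of 4 and the splitFileByKey placeholder; in practice the parallelism could equally come from running several Talend job instances):

import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelFileSplit {

    public static void main(String[] args) throws InterruptedException {
        //Treat each incoming file as an independent task; 4 workers is an arbitrary choice
        ExecutorService pool = Executors.newFixedThreadPool(4);

        File[] inputFiles = new File("/data/incoming").listFiles();
        if (inputFiles != null) {
            for (File input : inputFiles) {
                pool.submit(() -> splitFileByKey(input));
            }
        }

        //Wait for all of the per-file splits to finish
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    private static void splitFileByKey(File input) {
        //Placeholder for the per-file splitting logic (e.g. the HashMap-of-writers approach)
        System.out.println("Splitting " + input.getName());
    }
}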

I wasn't trying to suggest a comparison of Talend vs. Ab Initio.

It was really just a reference to the hardware requirement for Ab Initio to do its thing. You wouldn't be running Ab Initio on a Windows Server with 16GB of RAM. If you're planning on having a number of high-spec Talend servers, you can process your data accordingly.
Fifteen Stars

Re: Write records into multiple files based on a field in the data - large

I agree with tal00000; I think you need to approach this from a "divide and conquer" direction. You already have your data divided into multiple files, so it would make a great deal of sense to use several machines to process batches of these files concurrently. Maybe there is a reason why this can't be done?
With regard to Java, I feel I must impress upon you just how important a skill Java is with Talend. Talend simply generates Java code. Each job is essentially a Java application. While you can create basic to medium level integration jobs with Talend not knowing any Java, as soon as you want to take the next step, you really need to start looking at the code. Things like debugging are made much easier and making use of third party APIs can really open the door to some impressive functionality. As an example, I have put together this little demo of how you *could* solve your problem using two Talend components and a bit of Java. 
I tested this on 10,000,000 records over 30 random groups. It took about 140 seconds to finish on my laptop. So it is not super fast, but should easily be able to cater for your 1 billion records in a few hours on one machine. If you have this running on several machines concurrently, then it will work a lot quicker.
Here is a screenshot of the job (it is simply a tFileInputDelimited connected to a tJavaFlex)....

The tFileInputDelimited is just reading the 10,000,000 record file. It does nothing special. The code in the tJavaFlex is what splits the recordset into a file per key. You will have to excuse the column names I have used since I was just doing this to see if I could. 
The key column is "newColumn". 
The tJavaFlex is made up of three sections: Start Code, Main Code and End Code. I will show each of those sections below....
Start Code
//Below we create a HashMap which will contain FileOutputStream objects
java.util.HashMap<String, java.io.FileOutputStream> fileOutputStreamMap = new java.util.HashMap<String, java.io.FileOutputStream>();

Main Code
//Set the groupKey variable
String groupKey = row1.newColumn;
//Create your file row by concatenating the values in the row to a String
String tmpVal = groupKey + ";" +
        row1.newColumn1 + ";" +
        row1.newColumn2 + ";" +
        row1.newColumn3 + ";" +
        row1.newColumn4 + ";" +
        row1.newColumn5 + ";" +
        row1.newColumn6 + "\n";

//Convert the String to a byte array
byte[] contentInBytes = tmpVal.getBytes();

//If the FileOutputStream for the groupKey exists, do below...
if (fileOutputStreamMap.containsKey(groupKey)) {
    fileOutputStreamMap.get(groupKey).write(contentInBytes);
} else { //If the FileOutputStream for the groupKey does not exist, do below...
    //create a new FileOutputStream for the groupKey and add it to the HashMap
    fileOutputStreamMap.put(groupKey, new java.io.FileOutputStream("C:/Talend/OpenSource/6.2/TOS_BD-20160510_1709-V6.2.0/workspace/" + groupKey + ".csv", false));
    //write to the file
    fileOutputStreamMap.get(groupKey).write(contentInBytes);
}

End Code
//Create an Iterator for the HashMap keyset
java.util.Iterator<String> it = fileOutputStreamMap.keySet().iterator();
//Iterate over the HashMap key set and close the FileOutputStreams
while (it.hasNext()) {
    java.io.FileOutputStream tmpStream = fileOutputStreamMap.get(it.next());
    tmpStream.close();
}

This code will produce a file per groupKey with the data split according to the groupKey. You will also notice that the Java is relatively simple.
Hope it helps
Rilhia Solutions
One Star

Re: Write records into multiple files based on a field in the data - large

Wow, thanks for that Richard. I will give that approach a try with my dataset and see how I go. Yes, that's certainly a lot less Java than I was expecting to be required for this problem. While we have no hard-core Java programmers in our team, we've no problem dipping our toes into the water, so we would be comfortable implementing something like this.
Many thanks!
Fifteen Stars

Re: Write records into multiple files based on a field in the data - large

I'd be interested to hear if it works. You will need to take into consideration potential memory issues (but you can adjust the RAM permitted to be used by Talend jobs) and may need to think about concatenating files if you "divide and conquer" using multiple machines to process several files in parallel.
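For the concatenation step, a minimal plain-Java sketch of merging the per-machine part files back into one file per key (the /data/parts layout, the KEY_A naming and the single-key scope here are assumptions made purely for illustration):

import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ConcatKeyFiles {

    public static void main(String[] args) throws IOException {
        //Hypothetical layout: each machine wrote KEY_A.part1.csv, KEY_A.part2.csv, ... into /data/parts
        Path partsDir = Paths.get("/data/parts");
        Path merged = Paths.get("/data/merged/KEY_A.csv");
        Files.createDirectories(merged.getParent());

        try (Stream<Path> files = Files.list(partsDir);
             OutputStream out = Files.newOutputStream(merged,
                     StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {

            List<Path> parts = files
                    .filter(p -> p.getFileName().toString().startsWith("KEY_A."))
                    .sorted()
                    .collect(Collectors.toList());

            for (Path part : parts) {
                //Stream each part into the merged file without loading it fully into memory
                Files.copy(part, out);
            }
        }
    }
}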
Rilhia Solutions