Batch processing in talend job.

Four Stars

Hi team,

I need to implement batch processing in my Talend job. How can I achieve it? The scenario is as below.

Suppose I have 30000 records in my file and I need to process 1000 records at a time; after that, the next 1000 records should be processed.

How can I achieve this scenario? 30000 records means 30 batches. Please help me with this scenario. I am using Talend Data Fabric version 6.4.1.

 

Thanks,

Bhushan 

Fourteen Stars TRF

Re: Batch processing in talend job.

Hi,

Redirect your input to a tFileOutputDelimited.

Enter the output filename, tick the option "Split output in several files" in the "Advanced settings" tab, and enter the value 1000 into the field "Rows in each output file". This will create n files based on the filename, with 1000 records in each.

In the next subjob, use a tFileList to iterate over this list of files and get the records from each file.
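
To make the idea concrete, here is a rough plain-Java sketch (my own illustration, not Talend component code) of the chunking that the "Split output in several files" option performs for you; the file names input.csv and out_0.csv, out_1.csv, ... are only examples:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class SplitIntoChunks {
    public static void main(String[] args) throws IOException {
        int rowsPerFile = 1000; //same value as "Rows in each output file"

        //Hypothetical input: one record per line (header handling omitted)
        List<String> lines = Files.readAllLines(Paths.get("input.csv"));

        int fileIndex = 0;
        for (int start = 0; start < lines.size(); start += rowsPerFile) {
            int end = Math.min(start + rowsPerFile, lines.size());
            //30000 rows -> out_0.csv ... out_29.csv, 1000 rows each
            Files.write(Paths.get("out_" + fileIndex++ + ".csv"), lines.subList(start, end));
        }
    }
}

In the job itself you don't write any of this; the component does it for you, and the tFileList in the next subjob simply iterates over the generated files.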


TRF
Four Stars

Re: Batch processing in talend job.

Hi TRF,

Thanks for the reply. I am getting XML files as a source, and each XML file contains 30000 records. I don't want to create multiple input files; I just need to create batches from one XML file and process each batch one by one. Each batch contains 1000 records. How can I divide the records, or how can I create batches of 1000 records?

 

Thanks,

Bhushan.

Fourteen Stars TRF

Re: Batch processing in talend job.

You can go with arrays or lists (maybe with a little Java code), but the solution I've proposed is the simplest, and you don't have to worry about performance as tFileInputDelimited (and tFileOutputDelimited) are very fast.


TRF
Four Stars

Re: Batch processing in talend job.

Hi TRF,

Thanks for the reply. Can you please give me the Java code to create batches of records, because I am not a Java guy? If another option is possible, please suggest that as well.

I am not able to divide the actual XML file into multiple files because my job will not support it.

 

Thanks,

Bhushan.

Fourteen Stars TRF

Re: Batch processing in talend job.

As soon as you know how to consume your input XML file, you are able to produce as many CSV files as needed and then consume these files, without a single line of Java code:

 

tFileInputXML --> tFileOutputDelimited (to produce n files)
   |
   | (on subjob OK)
   |
tFileList --(iterate over the CSV file list)--> tFileInputDelimited --> next components to process the records

Consider this solution a good approach, as you don't have to code anything yourself, especially if you're not a Java developer.


TRF
Four Stars

Re: Batch processing in talend job.

Hi TRF,

Thanks for the reply. This approach is not suitable because I have more than 1000 source files to process, so I need to divide the records into batches within the job itself and process them one by one.

 

Thanks,

Bhushan.

Sixteen Stars

Re: Batch processing in talend job.

Well, you have a couple of choices: you can either do it all in memory (if you have enough) or you can do as @TRF suggested and output to a file or a database table (the database table would be my first choice). I'm assuming that in-memory is your preferred choice, in which case you will need to use some Java. I have some code which I have used in the past and which will definitely work, but you will need to understand Java and the tJavaFlex component to use it.

 

First of all, take a look at the Java code for the job. Your components are linked by "rows", and each "row" has a rowStruct class. This is useful to you: if your row is called "row1", your rowStruct class will be "row1Struct". You can use this class to store your rows in an ArrayList, and if you want to batch your rows up you can use a combination of ArrayLists inside a HashMap. The code below shows how I would write my data in batches of 10 in a tJavaFlex.

 

Start Code

//Create your HashMap
java.util.HashMap<Integer,java.util.ArrayList<row1Struct>> map = new java.util.HashMap<Integer,java.util.ArrayList<row1Struct>>();

//Create a rowCount variable and a currentBatch variable   
int rowCount = 0; 
int currentBatch = 0;  

//Create your first instance of the your array to hold your first batch of rows
java.util.ArrayList<row1Struct> array = new java.util.ArrayList<row1Struct>();

Main Code

//If the rowCount is a nonzero multiple of 10, store the completed batch in the map, increment the current batch and start a new array
if(rowCount%10==0 && rowCount!=0){
	map.put(Integer.valueOf(currentBatch), (java.util.ArrayList<row1Struct>)array.clone());
	currentBatch++;	
	array = new java.util.ArrayList<row1Struct>();
}

//For each row increment the rowCount
rowCount++;

//Important - Create a new row1Struct object and 
//copy your row data to it
row1Struct tmpRow = new row1Struct();
tmpRow.newColumn = row1.newColumn;
tmpRow.newColumn1 = row1.newColumn1;

//Add your tmpRow to the array
array.add(tmpRow);

End Code

//At the end, catch any array that hasn't already been added to the HashMap (map)
map.put(Integer.valueOf(currentBatch),array);   

//Add the map to the globalMap to be used later  
globalMap.put("map", map); 

That will store your data in batches of 10 records. To retrieve them by batch, use another tJavaFlex like the one below.

 

Start Code

//Create your HashMap object and set to be what is contained in your globalMap
java.util.HashMap<Integer,java.util.ArrayList<row1Struct>> map =  (java.util.HashMap<Integer,java.util.ArrayList<row1Struct>>)globalMap.get("map");

//Retrieve a batch from the HashMap. YOU WILL NEED TO MODIFY THIS TO SUIT YOUR REQUIREMENT. I have hard-coded it to batch 0 only
java.util.ArrayList<row1Struct> array = (java.util.ArrayList<row1Struct>)map.get(0);

//Create an iterator to iterate over the batch
java.util.Iterator<row1Struct> it = array.iterator();

//Start a While loop
while(it.hasNext()){

Main Code

//Here I am simply printing a column value from the row, but you can treat the data returned here like any other Talend data.

System.out.println(it.next().newColumn);

End Code

//Here we simply close the while loop.
}

Obviously this takes a little bit of code, but it does give you the exact control you want. The other thing to think about is the memory consumed; however, this should easily manage 30000 rows.
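
If you need to process every batch rather than the hard-coded batch 0, one possible extension (my own sketch, reusing the "map" global variable and the row1Struct schema from the code above) is to loop over the batch numbers, for example in a tJava:

//Retrieve the map written by the first tJavaFlex (same cast as in its Start Code)
java.util.HashMap<Integer,java.util.ArrayList<row1Struct>> map = (java.util.HashMap<Integer,java.util.ArrayList<row1Struct>>)globalMap.get("map");

//Process the batches in order: 0, 1, 2, ...
for (int batch = 0; batch < map.size(); batch++) {
    java.util.ArrayList<row1Struct> array = map.get(batch);

    //Replace the print with whatever per-batch processing you need
    for (row1Struct tmpRow : array) {
        System.out.println("batch " + batch + ": " + tmpRow.newColumn);
    }
}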

Fourteen Stars TRF

Re: Batch processing in talend job.

I can't understand why this approach is not suitable.

If you have 1000 input files to process, just add a tFileList before the tFileInputXML and that's all.

Each file will be divided into 1000-record chunks, and then each chunk will be processed (maybe loaded into a database or anything else, depending on what you have to do).

That's a very common design when you want to deal with a limited and controllable number of records.


TRF
Fourteen Stars TRF

Re: Batch processing in talend job.

Did this help?
If so, please mark your case as solved.

TRF
Four Stars

Re: Batch processing in talend job.

Hi TRF,

Sorry for the late reply. I have not tried the given solution because of urgent deliverables. I will let you know when I try it.

Thanks for the answer.

 

Thanks,

Bhushan