
OutOfMemoryError: GC overhead limit exceeded on large XML files

Hi,
I am using Talend 3.0.4 (mandatory, I think, because SpagoBI 3.6 ships with a Talend engine v3.0.4). One job extracts data with tFileInputXML in SAX mode (the other modes give me heap out-of-memory errors, which I think is worse) from large XML files that currently go up to 2 GB and might get bigger in the future.
It is quite a simple job (tFileInputXML ---> tMap (no processing, just mapping fields) ---> tMysqlOutput). I even tried raising Xmx to 2048 MB, but that didn't help.
I also tried something I've seen here on the forum: setting 1000 as the number of rows to buffer on the tMap and a commit limit of 1000 on the tMysqlOutput. That didn't help either.
First: I would like to know whether I can use a more recent version with SpagoBI 3.6 (I'm afraid of building big, complex jobs only to find out later that I can't deploy them to the server, or that there are compatibility problems).
Second: is there a way to solve this problem? (Two days ago I had a problem copying large files with tFileCopy; it turned out to be a bug fixed in later versions, so I downloaded the fixed filecopy.jar, replaced the one I had, and it worked like a charm.)
Thank you.
3 REPLIES

Re: OutOfMemoryError: GC overhead limit exceeded on large XML files

Your version of Talend is older than mine, but I also usually get errors with large XML files.
One approach we developed here is to split the XML file into smaller chunks (usually never larger than 64 MB) with some Java code in a routine, and then process all the part files in sequence.
This lets me use the regular XML parser (much faster than SAX for me) and allows better XPath expressions in the schema definition.
The function I use is below (it only works if the loop tag does not appear anywhere else inside the file):
// imports needed in the routine class:
// java.io.FileInputStream, java.io.FileOutputStream, java.io.PrintStream, java.util.Scanner
public static boolean split_file(String filename, int maxpart, String tagname, String roottag, String nsdeclaration) {
    FileOutputStream fout = null;
    PrintStream outstream = null;
    Scanner s = null;
    int part = 0;
    int partsize = 0;
    boolean partnew = true;
    String partfile, suffix, token;
    // base name of the part files: the original name without its .xml extension
    partfile = filename.replaceFirst("\\.xml$", "");
    try {
        s = new Scanner(new FileInputStream(filename), "utf-8");
        // read one record at a time, using the closing loop tag as the delimiter
        s.useDelimiter("</" + tagname + ">");
        while (s.hasNext()) {
            if (partnew) { // begin a new part file
                suffix = String.format("_part%04d.xml", part);
                fout = new FileOutputStream(partfile + suffix);
                outstream = new PrintStream(fout, false, "utf-8");
                if (part > 0) { // insert leading XML declaration and root tag
                    outstream.println("<?xml version=\"1.0\" encoding=\"utf-8\"?>");
                    outstream.println("<" + roottag + " " + nsdeclaration + ">");
                }
                partsize = 0;
                partnew = false;
            }
            // append the next record
            token = s.next();
            outstream.print(token);
            // restore the closing loop tag, unless this token already holds the closing root tag
            if (token.indexOf("</" + roottag + ">") < 0) outstream.println("</" + tagname + ">");
            partsize += token.length();
            if (partsize > maxpart) { // time to wrap this part up
                outstream.println("</" + roottag + ">");
                outstream.close(); // also closes fout
                outstream = null;
                fout = null;
                part++;
                partnew = true;
            }
        }
        // close whatever part file is still open
        if (outstream != null) {
            outstream.close();
        }
        return true;
    } catch (Exception e) {
        System.out.println(e.getMessage());
        if (outstream != null) {
            outstream.close();
        }
        return false;
    } finally {
        if (s != null) {
            s.close();
        }
    }
}
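Just to illustrate the arguments (the path, tags, and namespace below are placeholders for whatever your XML uses, and the routine class name is whatever you gave it when you created the routine):

    // hypothetical call: split /data/orders.xml into parts of roughly 64 MB,
    // looping on <order> records under an <orders xmlns="..."> root
    boolean ok = XmlSplitter.split_file(
            "/data/orders.xml",                      // source file
            64 * 1024 * 1024,                        // maximum part size, in characters
            "order",                                 // loop tag
            "orders",                                // root tag
            "xmlns=\"http://example.com/orders\"");  // namespace declaration copied from the root tag
    if (!ok) {
        System.out.println("split_file failed, see the console for the error");
    }

This would produce /data/orders_part0000.xml, /data/orders_part0001.xml, and so on, which a tFileList can then pick up with a filemask like "*_part*.xml".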

Re: OutOfMemoryError: GC overhead limit exceeded on large XML files

Thank you for this neat code, it might save my project! Luckily my looping tag does not appear inside the data tags :)
Do you suggest I add a new routine and call it from a tJava component, or create a new component altogether? I ask because later I will have to deploy the jobs on the SpagoBI server's Talend engine, and I don't know exactly what will be deployed!
I'm a bit new to tweaking Talend to fit my needs.
EDIT: I created a new routine and called the function from a tJava component with the help of a tFileList, and it works like a charm. With XML files capped at 60 MB, the parsing now runs smoothly with no heap or GC exceptions.
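In case it helps anyone, the tJava code is essentially the following (the routine class, the tFileList component name, and the tags are just examples; use whatever names exist in your own job):

    // tJava iterated by tFileList_1: split the file currently being listed
    // into parts of roughly 60 MB, looping on <row> records under a <rows> root
    String currentFile = (String) globalMap.get("tFileList_1_CURRENT_FILEPATH");
    boolean ok = XmlSplitter.split_file(
            currentFile,
            60 * 1024 * 1024,   // maximum part size, in characters
            "row",              // loop tag (example)
            "rows",             // root tag (example)
            "");                // namespace declaration, empty when the root has none
    if (!ok) {
        System.out.println("Failed to split " + currentFile);
    }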
As for the SpagoBI deployment, I will look at that later when I set up the server.

Re: OutOfMemoryError: GC overhead limit exceeded on large XML files

Sorry for the delay in answering, but I usually add a tJava in a tPrejob component.
Thiago