One Star

large xml to vertica tables

I am trying to take a large XML file (550 MB) and convert it into a Vertica database with multiple tables. The XML file is actually a concatenation of thousands of other XML files. These mini files can have one of approximately a dozen different formats. So the question is, how can I go about accomplishing this? I attempted to load the large XML through the File XML wizard, but I got an out-of-memory error. I wrote up a bug on this and it was deemed to be by design. So what are my options for achieving my goal? I could break the massive XML into the individual small XMLs in order to avoid the memory problem. But this (I assume) would lead to an extremely long processing time on the Vertica insert side, as (I assume) only a single row would be inserted at a time. This is not how Vertica likes to process data, which is why I was hoping to process the entire file at once.
Any thoughts or suggestions?
4 REPLIES

Re: large xml to vertica tables

How much memory are you allowing for this job?
A simple option would be to increase it.
Failing that, split your main file and process the many tables in parallel.
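To make the "process in parallel" idea concrete, here is a minimal sketch outside of Talend, assuming the big file has already been split into one file per target table. The names `load_all`, `files_by_table`, and `load_one` are hypothetical; `load_one` stands in for whatever actually loads a single file into a single table.

```python
from concurrent.futures import ThreadPoolExecutor


def load_all(files_by_table, load_one, max_workers=4):
    """Run load_one(table, path) for every table concurrently.

    files_by_table: dict mapping a table name to the file that feeds it.
    load_one: hypothetical callable that loads one file into one table.
    Returns a dict mapping each table name to load_one's return value.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit one job per table, then wait for each result in turn.
        futures = {
            table: pool.submit(load_one, table, path)
            for table, path in files_by_table.items()
        }
        # result() re-raises any exception the worker hit, so failures surface.
        return {table: future.result() for table, future in futures.items()}
```

Threads are a reasonable fit here because each load mostly waits on the database rather than on the local CPU.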
One Star

Re: large xml to vertica tables

My workstation has 6 GB of RAM and I bumped the Java -Xmx option up to 4 GB. I submitted a bug on this and they said it was simply WAD (working as designed), with no recommendation of how much memory would be able to handle the XML file.
What do you mean by processing all the many tables in parallel? I was thinking of taking the large XML and attempting to split it into its dozen "types" of XML files and processing each one, as a first attempt at breaking the size up into something that can be handled without running out of memory. This process will eventually get moved to a server environment with more memory and processors. Is there a way to tell Talend to perform different ETL processes in parallel?
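For what it is worth, the split-by-type idea can be sketched in a few lines of Python outside of Talend. This is only an illustration under some loud assumptions: the concatenated documents sit under a single root element, each record is a direct child of that root, records are flat (one level of child fields), and all records of a type list their fields in the same order. The function name `split_by_type` and the `open_output` callback are hypothetical.

```python
import csv
import xml.etree.ElementTree as ET


def split_by_type(source, open_output):
    """Stream an XML file and write one CSV per record type.

    source: file path or binary file object containing the XML.
    open_output: hypothetical callback, open_output(tag) returns a
        writable text file object for that record type.
    Returns a dict mapping each record type to its row count.
    """
    writers = {}
    counts = {}
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # consume the root's "start" event
    depth = 0  # nesting depth below the root
    for event, elem in context:
        if event == "start":
            depth += 1
            continue
        depth -= 1
        if depth != 0:
            continue  # not a direct child of the root
        tag = elem.tag
        if tag not in writers:
            writers[tag] = csv.writer(open_output(tag))
            # Assumption: the first record of a type defines the header.
            writers[tag].writerow([child.tag for child in elem])
        writers[tag].writerow([(child.text or "").strip() for child in elem])
        counts[tag] = counts.get(tag, 0) + 1
        root.clear()  # drop finished records so memory stays flat
    return counts
```

Each resulting file can then be bulk loaded into its table in one go (Vertica's COPY statement is meant for this), which avoids the one-row-at-a-time insert problem.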
Seventeen Stars

Re: large xml to vertica tables

In TOS 5.2.0, the advanced settings of tFileInputXML include a Generation Mode option. There you can switch to a SAX-based parser, which does not hold every node in memory and therefore should not increase the memory demands.
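To illustrate why a streaming parser keeps memory low, here is a small Python sketch of the same principle. It is not what Talend generates internally; it uses the standard library's `iterparse` and assumes flat records that are direct children of the root element. The function name `stream_records` and the `record_tag` parameter are made up for the example.

```python
import xml.etree.ElementTree as ET


def stream_records(source, record_tag):
    """Yield one dict per <record_tag> element without keeping the whole tree.

    source: file path or binary file object containing the XML.
    record_tag: element name of the records to extract (assumed to be
        direct children of the root, with one level of child fields).
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # grab the root from its "start" event
    for event, elem in context:
        if event == "end" and elem.tag == record_tag:
            yield {child.tag: (child.text or "").strip() for child in elem}
            # Detach everything parsed so far from the root; without this
            # the tree would keep growing just like a DOM parse.
            root.clear()
```

Because each record is discarded as soon as it has been handed over, memory use depends on the size of one record rather than on the size of the file.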
One Star

Re: large xml to vertica tables

Jolling, even when I tried this using the SAX generation mode with a big XML file (300 MB), I got the same memory error. What should I do if reading a big XML file with tFileInputXML is not possible?