One Star

[resolved] Split a large XML file into small files with talend

Hi,
I am trying to integrate data from a large XML file (300 MB). Is there a way to do it with Talend?
1 ACCEPTED SOLUTION

Accepted Solutions
One Star

Re: [resolved] Split a large XML file into small files with talend

Seiif - Can you do a simple job where you use a tFileInputFullRow to read the XML file and spit out to a tLogRow? If that works - which means your job will run, you can parse it using the 'cruder' Talend-specific solution that I mentioned above.
Let me know if you can do this...
12 REPLIES
Four Stars

Re: [resolved] Split a large XML file into small files with talend

What is the problem that you are facing in doing this?
Vaibhav
One Star

Re: [resolved] Split a large XML file into small files with talend

The problem is that I can't load the XML file (300 MB) into the XML metadata.
Every time I try to do this, Talend crashes.
Six Stars

Re: [resolved] Split a large XML file into small files with talend

I have done this using the Perl library XML::Twig: I just used a tSystem to call Perl/Twig and split the XML.
One Star

Re: [resolved] Split a large XML file into small files with talend

Jholman, could you give me more details about this, please?
One Star

Re: [resolved] Split a large XML file into small files with talend

Hi Seiif - Before suggesting alternatives (below), have you changed your XML parser to SAX in tFileInput, increased your heap size for the job and tried it? DOM parser is very memory intensive whereas SAX is not...
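
To illustrate why a streaming parser helps, here is a minimal sketch outside Talend, in Python. It is not the Talend component's implementation. It walks the XML one element at a time and discards each record once it has been seen, instead of building the whole tree as a DOM parser would. The record tag `row` and the sample data are assumptions made for the example.

```python
import io
import xml.etree.ElementTree as ET

def count_records(source, record_tag):
    """Stream through the XML and count elements named record_tag."""
    count = 0
    # iterparse yields each element as soon as its closing tag is read,
    # so the full document is never held in memory as one finished tree.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == record_tag:
            count += 1
            # Drop the record's children and text; only an empty element
            # shell stays attached to its parent.
            elem.clear()
    return count

# Hypothetical sample standing in for the 300 MB file.
sample = io.BytesIO(b"<rows><row><id>1</id></row><row><id>2</id></row></rows>")
print(count_records(sample, "row"))
```

The same function accepts a file path in place of the in-memory sample.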

Like jholman, I've done this using the sed utility in a shell script (.sh) on the filesystem, called from a tSystem. Using sed, I looked for a particular tag (the open tag for the XML record), and wherever I found it, I extracted the text between the open and close tags.
Another, cruder method I used recently was reading the file as plain text (tFileInputFullRow), looking for these markers in the XML, marking them with an increment counter (sequence), and then splitting the file using tMap. This was for queue data that needed to be processed for each 'row'.
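
The marker-and-counter idea described above can be sketched roughly as follows, in Python rather than as a Talend job. It assumes each record starts on a line containing a known open marker and ends on a line containing a known close marker; the markers `<row>` and `</row>` and the sample text are hypothetical. If the open tag carries attributes, a marker such as `<row ` would be needed instead.

```python
import io

def tag_lines_with_sequence(lines, open_marker):
    """Yield (sequence, line); the sequence increments on each open marker."""
    seq = 0
    for line in lines:
        if open_marker in line:
            seq += 1
        yield seq, line.rstrip("\n")

def group_records(lines, open_marker, close_marker):
    """Collect the lines of each record into one string, keyed by sequence."""
    records = {}
    inside = False
    for seq, line in tag_lines_with_sequence(lines, open_marker):
        if open_marker in line:
            inside = True
        if inside:
            records.setdefault(seq, []).append(line)
        if close_marker in line:
            inside = False
    return {seq: "\n".join(parts) for seq, parts in records.items()}

# Hypothetical sample: the wrapper lines <rows> and </rows> are skipped.
sample = io.StringIO(
    "<rows>\n"
    "<row>\n<id>1</id>\n</row>\n"
    "<row>\n<id>2</id>\n</row>\n"
    "</rows>\n"
)
for seq, text in group_records(sample, "<row>", "</row>").items():
    print(seq, repr(text))
```

Each grouped record could then be written to its own small file, or several records could be batched per file. Because this is plain text matching and not XML parsing, it only works when the markers appear on predictable lines.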

One Star

Re: [resolved] Split a large XML file into small files with talend

Hi Willm, I have changed my XML parser to SAX in tFileInput and increased the heap size for the job, but I still have the same problem.
Thanks for your valuable suggestion.
One Star

Re: [resolved] Split a large XML file into small files with talend

Seiif - Can you do a simple job where you use a tFileInputFullRow to read the XML file and spit out to a tLogRow? If that works - which means your job will run, you can parse it using the 'cruder' Talend-specific solution that I mentioned above.
Let me know if you can do this...
One Star

Re: [resolved] Split a large XML file into small files with talend

It works with tFileInputFullRow. I will try the cruder method and let you know the results. Thanks, Willm.
Six Stars

Re: [resolved] Split a large XML file into small files with talend

Please see the relevant documentation for Twig here: http://search.cpan.org/dist/XML-Twig/tools/xml_split/xml_split
It also provides a mechanism for merging them back together again.
One Star

Re: [resolved] Split a large XML file into small files with talend

Josh - would the execution server for your job (using Twig) need to have Perl installed?
Thanks.
Will
Six Stars

Re: [resolved] Split a large XML file into small files with talend

Yes, you would need a Perl install; you can install Twig with CPAN. If you are on Windows, ActivePerl should work fine.
One Star

Re: [resolved] Split a large XML file into small files with talend

Hi,
I found another solution for the heap space error. The approach is to read the big XML file as a CSV file with the tFileInputDelimited component and then pass the data to tFileOutputXML or tAdvancedFileOutputXML, which splits it into small XML files.
The next step is to load the data from these files into a database. My problem now is to find a way to schedule the processing of all these files by applying process management techniques.
Thanks for your help.
Seif
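
One simple way to handle the follow-up question is to process the small files one at a time in a fixed order. Below is a minimal sketch in Python, not a Talend job; in Talend, a tFileList iteration could play a similar role. The file pattern `part_*.xml`, the `row` element, and the `count_rows` step are all hypothetical stand-ins for the real database load.

```python
import tempfile
from pathlib import Path
import xml.etree.ElementTree as ET

def process_split_files(directory, pattern, handle_file):
    """Call handle_file on each matching file, one at a time, in name order."""
    results = []
    for path in sorted(Path(directory).glob(pattern)):
        results.append((path.name, handle_file(path)))
    return results

def count_rows(path):
    """Stand-in for the load step: parse one small file, count <row> children."""
    return len(ET.parse(str(path)).getroot().findall("row"))

# Create a few hypothetical split files, then process them in sequence.
with tempfile.TemporaryDirectory() as tmp:
    for i in range(3):
        Path(tmp, f"part_{i:03d}.xml").write_text("<rows><row/><row/></rows>")
    for name, n in process_split_files(tmp, "part_*.xml", count_rows):
        print(name, n)
```

Since each file is small, parsing it fully is cheap, and only one file is in memory at any moment.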