One Star

[resolved] Split really big xml file in multiple XML files

Hello,
I have to split a 160 GB XML file.
I found a solution in this topic: http://www.talendforge.org/forum/viewtopic.php?id=25072
But my file is so big (160 GB) that I can't use tFileInputXML: I get an OutOfMemory error.
Is there another way to split huge XML files using Talend, or perhaps a small program that I can run from the tSSH component?

For reference, this is what the XML file looks like:
<ExampleDatabase>
<DatabaseEntry>
A lot of things.
</DatabaseEntry>
<DatabaseEntry>
Other things
</DatabaseEntry>
<DatabaseEntry>
Other things again
</DatabaseEntry>
</ExampleDatabase>

I want to split it at the boundaries between <DatabaseEntry> elements, so that no entry is cut in half.

Thank you.
1 ACCEPTED SOLUTION

One Star

Re: [resolved] Split really big xml file in multiple XML files

Thank you for your help, Mbaroudi!
I just found this this morning: http://linux.die.net/man/1/xml_split
This Linux command splits the file into files of a chosen size and keeps the XML structure.
But I think I'm going to try your way, Mbaroudi, so that the job will also run correctly on Windows if needed.
Pikerman: sorry, I don't know much about PHP (please create a separate topic about this).
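
For anyone who needs something portable (Linux or Windows) in the same spirit as xml_split, here is a minimal Python sketch of a streaming split. It is not xml_split itself and not a Talend component. It assumes the entries are direct children of the root element, that the root has no attributes or namespaces, and that splitting by a fixed number of entries per file is acceptable. The function name split_xml and the output pattern are hypothetical names chosen for illustration.

```python
import xml.etree.ElementTree as ET


def split_xml(source, entries_per_file, out_pattern="part_{:04d}.xml",
              entry_tag="DatabaseEntry"):
    """Stream `source` and write every `entries_per_file` entries to a new file.

    Assumes entries are direct children of a root without attributes or
    namespaces. Returns the list of file names written.
    """
    written = []
    out = None          # currently open output file, if any
    in_file = 0         # entries written to the current file
    root = None
    depth = 0

    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            depth += 1
            if depth == 1:
                root = elem
                open_tag = ("<%s>\n" % root.tag).encode()
                close_tag = ("</%s>\n" % root.tag).encode()
            continue

        depth -= 1
        if depth != 1:
            continue  # not the end of a direct child of the root

        if elem.tag == entry_tag:
            if out is None:
                name = out_pattern.format(len(written) + 1)
                out = open(name, "wb")
                out.write(open_tag)
                written.append(name)
            elem.tail = "\n"
            out.write(ET.tostring(elem))  # serialize one entry at a time
            in_file += 1
            if in_file == entries_per_file:
                out.write(close_tag)
                out.close()
                out, in_file = None, 0
        root.clear()  # release finished children so memory stays bounded

    if out is not None:  # close the last, partially filled file
        out.write(close_tag)
        out.close()
    return written
```

Only one entry is held in memory at a time, so the size of the input file should not matter as long as each individual <DatabaseEntry> fits in memory. A call such as split_xml("big.xml", 100000) would write part_0001.xml, part_0002.xml and so on into the current directory; "big.xml" is a placeholder file name.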
4 REPLIES
Seventeen Stars

Re: [resolved] Split really big xml file in multiple XML files

In Talend 5.3.1, the tFileInputXML component has an advanced option, Generation Mode: "Fast and low memory consumption (SAX)".
One Star

Re: [resolved] Split really big xml file in multiple XML files

Yes, I know; I already use SAX. But even with it, a 160 GB XML file is far too big.
One Star

Re: [resolved] Split really big xml file in multiple XML files

Hi,
You can use XSLT to split a huge XML file with Talend's tXSLT component:
Source code:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <!-- 1-based, inclusive range of DatabaseEntry elements to keep -->
  <xsl:param name="startPosition"/>
  <xsl:param name="endPosition"/>
  <!-- Identity transform: copy everything else unchanged -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>
  <!-- Select only the DatabaseEntry children, so that position() counts entries and not whitespace text nodes -->
  <xsl:template match="ExampleDatabase">
    <xsl:copy>
      <xsl:apply-templates select="DatabaseEntry"/>
    </xsl:copy>
  </xsl:template>
  <!-- Copy an entry only if it falls inside the requested range; "<" must be escaped as &lt; inside an attribute -->
  <xsl:template match="DatabaseEntry">
    <xsl:if test="position() >= $startPosition and position() &lt;= $endPosition">
      <xsl:copy>
        <xsl:apply-templates select="@* | node()"/>
      </xsl:copy>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>

(Note, by the way, that because this is based on the identity transform, it works even if ExampleDatabase isn't the top-level element.)
You still need to count the DatabaseEntry elements in the source XML, and then run the transform repeatedly, each time with the values of the parameters $startPosition and $endPosition that select the next range of entries.
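
The counting step and the start/end values for each run can be sketched as follows. This is only an illustration in Python, not part of the tXSLT component: count_entries and position_windows are hypothetical helper names, and the sketch assumes the entries are direct children of the root element. The count is done with a streaming parser, because loading a file of this size into a tree would run into the same memory limit. Keep in mind that XSLT 1.0 processors generally build the whole input tree in memory as well, so it is worth trying the transform on a smaller sample first.

```python
import xml.etree.ElementTree as ET


def count_entries(source, tag="DatabaseEntry"):
    """Count direct children of the root named `tag` without keeping the tree."""
    count = 0
    depth = 0
    root = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            depth += 1
            if depth == 1:
                root = elem
        else:
            depth -= 1
            if depth == 1:  # a direct child of the root has just closed
                if elem.tag == tag:
                    count += 1
                root.clear()  # drop finished children to keep memory flat
    return count


def position_windows(total, chunk_size):
    """Return 1-based, inclusive (startPosition, endPosition) pairs."""
    return [(start, min(start + chunk_size - 1, total))
            for start in range(1, total + 1, chunk_size)]
```

For example, position_windows(7, 3) gives [(1, 3), (4, 6), (7, 7)]; each pair would be passed as $startPosition and $endPosition to one run of the stylesheet.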