tFileInputXML or tFileInputMSXML with a large, complex XML file

Hello,
I'm currently extracting data from a large XML file (~600 MB) with a complex, recursive structure. The DTD of the file is the following:
<!ELEMENT address ( city | country | province | street | zipcode )* >
<!ELEMENT africa ( item+ ) >
<!ELEMENT age ( #PCDATA ) >
<!ELEMENT annotation ( author, description, happiness ) >
<!ELEMENT asia ( item+ ) >
<!ELEMENT australia ( item+ ) >
<!ELEMENT author EMPTY >
<!ATTLIST author person NMTOKEN #REQUIRED >
<!ELEMENT bidder ( date, time, personref, increase ) >
<!ELEMENT bold ( #PCDATA | emph | keyword )* >
<!ELEMENT business ( #PCDATA ) >
<!ELEMENT buyer EMPTY >
<!ATTLIST buyer person NMTOKEN #REQUIRED >
<!ELEMENT categories ( category+ ) >
<!ELEMENT category ( name, description ) >
<!ATTLIST category id ID #REQUIRED >
<!ELEMENT catgraph ( edge+ ) >
<!ELEMENT city ( #PCDATA ) >
<!ELEMENT closed_auction ( seller, buyer, itemref, price, date, quantity, type, annotation ) >
<!ELEMENT closed_auctions ( closed_auction+ ) >
<!ELEMENT country ( #PCDATA ) >
<!ELEMENT creditcard ( #PCDATA ) >
<!ELEMENT current ( #PCDATA ) >
<!ELEMENT date ( #PCDATA ) >
<!ELEMENT description ( parlist | text )* >
<!ELEMENT edge EMPTY >
<!ATTLIST edge from NMTOKEN #REQUIRED >
<!ATTLIST edge to NMTOKEN #REQUIRED >
<!ELEMENT education ( #PCDATA ) >
<!ELEMENT emailaddress ( #PCDATA ) >
<!ELEMENT emph ( #PCDATA | bold | keyword )* >
<!ELEMENT end ( #PCDATA ) >
<!ELEMENT europe ( item+ ) >
<!ELEMENT from ( #PCDATA ) >
<!ELEMENT gender ( #PCDATA ) >
<!ELEMENT happiness ( #PCDATA ) >
<!ELEMENT homepage ( #PCDATA ) >
<!ELEMENT incategory EMPTY >
<!ATTLIST incategory category NMTOKEN #REQUIRED >
<!ELEMENT increase ( #PCDATA ) >
<!ELEMENT initial ( #PCDATA ) >
<!ELEMENT interest EMPTY >
<!ATTLIST interest category NMTOKEN #REQUIRED >
<!ELEMENT interval ( start, end ) >
<!ELEMENT item ( location, quantity, name, payment, description, shipping, incategory+, mailbox ) >
<!ATTLIST item featured ( yes ) #IMPLIED >
<!ATTLIST item id ID #REQUIRED >
<!ELEMENT itemref EMPTY >
<!ATTLIST itemref item ID #REQUIRED >
<!ELEMENT keyword ( #PCDATA | bold | emph )* >

<!ELEMENT listitem ( parlist | text )* >
<!ELEMENT location ( #PCDATA ) >
<!ELEMENT mail ( from, to, date, text ) >
<!ELEMENT mailbox ( mail* ) >
<!ELEMENT name ( #PCDATA ) >

<!ELEMENT namerica ( item+ ) >
<!ELEMENT open_auction ( annotation | bidder | current | initial | interval | itemref | privacy | quantity | reserve | seller | type )* >
<!ATTLIST open_auction id ID #REQUIRED >
<!ELEMENT open_auctions ( open_auction+ ) >
<!ELEMENT parlist ( listitem+ ) >
<!ELEMENT payment ( #PCDATA ) >
<!ELEMENT people ( person+ ) >
<!ELEMENT person ( address | creditcard | emailaddress | homepage | name | phone | profile | watches )* >
<!ATTLIST person id ID #REQUIRED >
<!ELEMENT personref EMPTY >
<!ATTLIST personref person NMTOKEN #REQUIRED >
<!ELEMENT phone ( #PCDATA ) >
<!ELEMENT price ( #PCDATA ) >
<!ELEMENT privacy ( #PCDATA ) >
<!ELEMENT profile ( age | business | education | gender | interest )* >
<!ATTLIST profile income NMTOKEN #REQUIRED >
<!ELEMENT province ( #PCDATA ) >
<!ELEMENT quantity ( #PCDATA ) >
<!ELEMENT regions ( africa, asia, australia, europe, namerica, samerica ) >
<!ELEMENT reserve ( #PCDATA ) >
<!ELEMENT samerica ( item+ ) >
<!ELEMENT seller EMPTY >
<!ATTLIST seller person NMTOKEN #REQUIRED >
<!ELEMENT shipping ( #PCDATA ) >
<!ELEMENT site ( regions, categories, catgraph, people, open_auctions, closed_auctions ) >
<!ELEMENT start ( #PCDATA ) >
<!ELEMENT street ( #PCDATA ) >
<!ELEMENT text ( #PCDATA | bold | emph | keyword )* >
<!ELEMENT time ( #PCDATA ) >
<!ELEMENT to ( #PCDATA ) >
<!ELEMENT type ( #PCDATA ) >
<!ELEMENT watch EMPTY >
<!ATTLIST watch open_auction NMTOKEN #REQUIRED >
<!ELEMENT watches ( watch* ) >
<!ELEMENT zipcode ( #PCDATA ) >
My problem is that when I try to create XML file metadata for this file, a Java heap space error is thrown while the "schema viewer" is being built. I have also tried the tFileInputXML and tFileInputMSXML components with the SAX generation mode; this works for simple structures but not for recursive ones.
Do you know whether Talend offers a different, simpler way to extract data from such an XML file than extracting everything with XPath queries?
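To make the question concrete, here is a minimal sketch of the kind of streaming extraction I have in mind, written in plain Java with the standard StAX API (`javax.xml.stream`) rather than a Talend component. It reads the document one event at a time, so the whole tree is never held in memory, and it outputs one `id|name` row per `person` element from the DTD above. The class name `PersonStreamReader`, the method `readPersons`, and the `id|name` row format are hypothetical and only for illustration. The sketch assumes every `person` has at most one `name` child.

```java
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: streams over the XML and keeps only the current person in memory.
final class PersonStreamReader {
    private PersonStreamReader() {}

    /** Returns one "id|name" row per person element found in the input. */
    public static List<String> readPersons(Reader input) {
        // Rows are collected in a list to keep the sketch short; for a very
        // large file each row would be written out instead of accumulated.
        List<String> rows = new ArrayList<>();
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Do not process DTDs or resolve external entities while reading.
            factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
            factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
            XMLStreamReader xml = factory.createXMLStreamReader(input);

            String personId = null;
            String personName = "";
            boolean inPerson = false;

            while (xml.hasNext()) {
                int event = xml.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String tag = xml.getLocalName();
                    if ("person".equals(tag)) {
                        inPerson = true;
                        personId = xml.getAttributeValue(null, "id");
                        personName = "";
                    } else if (inPerson && "name".equals(tag)) {
                        // name is declared as (#PCDATA), so it holds text only.
                        personName = xml.getElementText();
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && "person".equals(xml.getLocalName())) {
                    rows.add(personId + "|" + personName);
                    inPerson = false;
                }
            }
            xml.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse XML stream", e);
        }
        return rows;
    }
}
```

The `inPerson` flag matters because `name` also appears under `item` and `category`, so matching on the tag name alone would mix the three. Something like this could presumably be called from a tJava component or a user routine, but I would prefer a built-in component if one can do the same.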
Thank you.
1 REPLY
Re: tFileInputXML or tFileInputMSXML with a large, complex XML file

Hello,
Did you resolve the issue?
I also need to process huge files (>500 MB).
Thanks,
Sairam