XML parsing with unknown structure

One Star

Please have a look at the XML structure below. We are looking for a way to parse this document without having to specify its exact structure in advance. Is there a way to do this with the standard Talend components?
Basically, what I would like to do is the following:
# Start parsing the document.
# Loop through the <product> elements.
# For each element inside a product, move to the next (new) element and save its value, its parent ID, and so on.
EXAMPLE XML STRUCTURE
<?xml version="1.0" encoding="utf-8"?>
<ONIXmessage>
  <product>
    <a001>A14528039</a001>
    <a002>01</a002>
    <productidentifier>
      <b221>02</b221>
      <b244>3790827584</b244>
    </productidentifier>
    <productidentifier>
      <b221>03</b221>
      <b244>9783191072551</b244>
    </productidentifier>
    <b246>01</b246>
    <b012>BB</b012>
    <series>
      <seriesidentifier>
        <b273>01</b273>
        <b233>Set-ID</b233>
        <b244>C181</b244>
      </seriesidentifier>
      <b018>Contributions to Management Science</b018>
      <b019>1386</b019>
      <b020>1236</b020>
    </series>
  </product>
</ONIXmessage>

It would be great to have a way to parse this and get the following output.
EXAMPLE OUTPUT
ID  Parent  Key                Value
1           a001               A14528039
2           a002               01
3           productidentifier
4   3       b221               02
5   3       b244               3790827584
6           productidentifier
7   6       b221               03
8   6       b244               9783191072551
9           b246               01
10          b012               BB
11          series
12  11      seriesidentifier
13  12      b273               01
14  12      b233               Set-ID
15  12      b244               C181
16  11      b018               Contributions to Management Science
17  11      b019               1386
18  11      b020               1236

In other words: we don't know which elements to expect in the XML structure. The component should simply create a table listing every (sub)element with its key and value. I think a lot of people would want a component like this, and in my opinion it is very strange that an ETL tool like Talend does not have one.
Seventeen Stars

Re: XML parsing with unknown structure

What you need is something like a normalizer for XML that returns all tags with their values and the path to each tag. With such a component (it should use a SAX parser, which does not hold the whole DOM in memory) you should be able to retrieve everything, even from unknown structures. I will take a look at whether I can create such a component.
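
To illustrate the idea, here is a minimal sketch of such a normalizer in plain Java, using the JDK's built-in SAX parser. It is not a Talend component; the class name XmlFlattener, the Row type and the recordElement parameter are assumptions made up for this example. It also assumes that IDs keep counting across records and that only leaf elements carry a value.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not a Talend class): flattens XML of unknown structure
// into ID / Parent / Key / Value rows while streaming through the document.
class XmlFlattener extends DefaultHandler {

    // One output row; parentId is null for elements directly below a record element.
    static final class Row {
        final int id;
        final Integer parentId;
        final String key;
        String value = ""; // stays empty for container elements

        Row(int id, Integer parentId, String key) {
            this.id = id;
            this.parentId = parentId;
            this.key = key;
        }

        @Override
        public String toString() {
            String parent = parentId == null ? "" : parentId.toString();
            return id + "\t" + parent + "\t" + key + "\t" + value;
        }
    }

    private final String recordElement;                  // e.g. "product" (assumption)
    private final List<Row> rows = new ArrayList<>();
    private final Deque<Row> open = new ArrayDeque<>();  // currently open elements
    private final StringBuilder text = new StringBuilder();
    private boolean inRecord = false;
    private int nextId = 1;

    private XmlFlattener(String recordElement) {
        this.recordElement = recordElement;
    }

    // Parses the source and returns one row per element found inside a record element.
    static List<Row> flatten(InputSource source, String recordElement) {
        XmlFlattener handler = new XmlFlattener(recordElement);
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return handler.rows;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        text.setLength(0);
        if (!inRecord) {
            inRecord = qName.equals(recordElement); // skip everything outside a record
            return;
        }
        Row parent = open.peek();
        Row row = new Row(nextId++, parent == null ? null : parent.id, qName);
        rows.add(row);
        open.push(row);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (!inRecord) {
            return;
        }
        if (open.isEmpty()) { // the record element itself was closed
            inRecord = false;
            return;
        }
        // The buffer is reset at every tag, so it only holds text that directly
        // precedes this closing tag; for containers that is just whitespace.
        open.pop().value = text.toString().trim();
        text.setLength(0);
    }

    public static void main(String[] args) {
        String xml = "<ONIXmessage><product><a001>A14528039</a001>"
                + "<productidentifier><b221>02</b221></productidentifier>"
                + "</product></ONIXmessage>";
        for (Row row : flatten(new InputSource(new StringReader(xml)), "product")) {
            System.out.println(row);
        }
    }
}
```

Only the stack of open elements is kept during parsing, not a DOM. The sketch still collects all rows in a list; for very large files the handler could pass each row on as soon as its element closes.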
One Star

Re: XML parsing with unknown structure

We have already written some Java code that does the parsing. The problem is that we want to wrap this Java code in a Talend component. Ideally the component would take an input parameter (the XML file to be parsed) and output rows containing key/value pairs. How do we create such a component?
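
To make the question concrete, this is roughly the shape we have in mind for the Java side: the XML input goes in, and rows come out one at a time. The sketch below is plain Java built on the JDK's StAX pull parser and uses no Talend API; the class name XmlRowIterator, the row layout {id, parentId, key, value} and the recordElement parameter are assumptions for illustration, not our actual code.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.NoSuchElementException;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical wrapper (plain Java, no Talend API): takes the XML input and
// hands out one {id, parentId, key, value} row per call to next().
class XmlRowIterator implements Iterator<String[]> {
    private final XMLStreamReader reader;
    private final String recordElement;                        // e.g. "product" (assumption)
    private final Deque<String> openIds = new ArrayDeque<>();  // ids of open ancestors
    private final StringBuilder text = new StringBuilder();
    private boolean inRecord = false;
    private int nextId = 1;
    private String[] pending; // element opened, value not yet known
    private String[] ready;   // completed row waiting to be returned

    XmlRowIterator(Reader xml, String recordElement) {
        this.recordElement = recordElement;
        try {
            this.reader = XMLInputFactory.newInstance().createXMLStreamReader(xml);
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Could not open XML", e);
        }
    }

    @Override
    public boolean hasNext() {
        try {
            // Pull parser events until one row is complete or the document ends.
            while (ready == null && reader.hasNext()) {
                switch (reader.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        startElement(reader.getLocalName());
                        break;
                    case XMLStreamConstants.CHARACTERS:
                    case XMLStreamConstants.CDATA:
                        text.append(reader.getText());
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        endElement();
                        break;
                    default:
                        break;
                }
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Malformed XML", e);
        }
        return ready != null;
    }

    @Override
    public String[] next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        String[] row = ready;
        ready = null;
        return row;
    }

    private void startElement(String name) {
        text.setLength(0);
        if (!inRecord) {
            inRecord = name.equals(recordElement); // skip everything outside a record
            return;
        }
        ready = pending; // a pending element that gets a child is a container (empty value)
        String id = String.valueOf(nextId++);
        String parent = openIds.isEmpty() ? "" : openIds.peek();
        pending = new String[] {id, parent, name, ""};
        openIds.push(id);
    }

    private void endElement() {
        if (!inRecord) {
            return;
        }
        if (openIds.isEmpty()) { // the record element itself was closed
            inRecord = false;
            return;
        }
        openIds.pop();
        if (pending != null) { // a leaf was closed: its text is the value
            pending[3] = text.toString().trim();
            ready = pending;
            pending = null;
        }
        text.setLength(0);
    }

    public static void main(String[] args) {
        String xml = "<ONIXmessage><product><a001>A14528039</a001>"
                + "<productidentifier><b221>02</b221></productidentifier>"
                + "</product></ONIXmessage>";
        Iterator<String[]> rows = new XmlRowIterator(new StringReader(xml), "product");
        while (rows.hasNext()) {
            System.out.println(String.join("\t", rows.next()));
        }
    }
}
```

A pull-style iterator like this seems to fit a component that emits one row per loop iteration, but how it would be hooked into Talend's component framework is exactly the part we do not know.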
One Star

Re: XML parsing with unknown structure

Hi jlolling, can you please provide an update? What steps do we need to take to convert our Java code into a component?