Extract entire XML element (with subelements)

One Star

Extract entire XML element (with subelements)

Hello,
I'm new to Talend (just a week) and I have a situation that I cannot solve. I'm working with XML files, trying to read and write stuff.
Description of the problem:
I have and XML that can have 3 types of elements (not at the same time) at the same location in the tree.
For example:
<bill>
<product>
<price>22</price>
<book>Bla bla1</book>
</product>
<product>
<price>23</price>
<wine>Bla bla2<wine>
</product>
<product>
<price>2</price>
<cheese>Bla bla3<cheese>
</product>
</bill>

What I want to do is to cycle through the <product>, filter the results and output the entire XML without the products that are <book>, for example.
The idea is that I cannot output the info normally using a schema because at the same location in <product> I can have <book> or <wine> or <cheese>.
The output XML would look like this:
<bill>
<product>
<price>23</price>
<wine>Bla bla2<wine>
</product>
<product>
<price>2</price>
<cheese>Bla bla3<cheese>
</product>
</bill>

I already cycle through products and I've already put in the condition.
I've also already tried to extract the node <product> as a 'Document' and output it to the other XML, but it doesn't work and throws this exception:
Error on line 2 of document  : Content is not allowed in prolog. Nested exception: Content is not allowed in prolog.

Can I extract the <product> element with all its subelements and output it to another XML?
I'm kinda out of ideas. The only thing I can think of is to copy the entire XML file and somehow delete the nodes that I don't want, but that would only be a temporary solution. I still want to "compose" the output XML.
So, do you have any suggestions?
P.S.: Any idea is appreciated.
One Star

Re: Extract entire XML element (with subelements)

Hi,
When reading the source xml, make sure you click the "Get node" checkbox in the tFileInputXML (see screenshot)
This makes is possible to read through all products and filter out any product that hasn't got a <book> element.
Regards,
Arno
One Star

Re: Extract entire XML element (with subelements)

Will try that right now.
I was always wondering what's with that checkbox.
Thanks for the suggestion and the quick reply!
One Star

Re: Extract entire XML element (with subelements)

You're welcome.
Hope it helps.... if not: I'd be glad to help you out further.
Best regards,
Arno
One Star

Re: Extract entire XML element (with subelements)

Ok, I've tried it and it doesn't quite work. I've tried to import the node as a Document and write it back as a Document, but the problem is that I can only put it as the element value (between the tags).
More precisely it outputs the XML node as text into the output XML. And the "<" and ">" are transformed into "<" and ">".
So now I'm back to square one, I guess.
A wild try would be to actually do a one-on-one association between the input file and the output file (using all the tags) and then select the option to exclude non existant tags in the output file. This would be a "brute force" attempt.
Any other ideas? Maybe a Talend component that I haven't used?
P.S.: Talend is a great product and it has a great community. Thank you.
One Star

Re: Extract entire XML element (with subelements)

I have the same issue as mentioned in this post:
http://www.talendforge.org/forum/viewtopic.php?id=9926
One Star

Re: Extract entire XML element (with subelements)

Hi,
I still think the solution with get nodes should work.
Screen shot 1 is an overview of the complete extraction job. On screen 2 you'll see the output of the tLogRow, you could replace the tLogRow with a tMap to transform (and filter out rows with a test in the product column)
Screen 3 shows the settings for the tInputFileXML while the last screenshot shows the settings for the tExtractXMLField component.
Hope this helps.
Regards,
Arno

Re: Extract entire XML element (with subelements)

Hi andreibranescu,
Use tFileInputXML:
Create you columns:
Price, Book, Wine, Cheese.
Loop XPath Query = "/bill/product"
Column | XPath query
Price | "Price"
Book | "Book"
Wine | "Wine"
Cheese | "Cheese"
Join this by a main row into tFilterRow:
In TFilterRow add the column Book
and the condition will be:
InputColumn = Book
Function = Empty
Operator = Equals
Value = null
Then attach the filter row to what you want for testing purposes add it to a tLogRow.
you do not need to bring in the nodes (As this will look if there is a value in the Book field it will not bring it through).
Regards,
Brandon
One Star

Re: Extract entire XML element (with subelements)

Hello,
First of all, thank you for your suggestions. They are very helpful.
I have some questions for you, and I will show you my ideas.
I still think the solution with get nodes should work.

What is the type of your 'product' variable (the one that you get as node)? Is it a Document, or a String, or something else?
Join this by a main row into tFilterRow:

I don't know exactly what you mean by this. Did you mean that I should use tJoin, or just connect the XMLinput to the filterRow?
I have two solutions, although none of them is perfect.
Solution 1 - simple
This solution is illustrated in the images 4 through 6. What I do is try to read all the tags (like brandon_stubbs93 suggested) from the XML and then filter the books. I then output all the tags with the advancedXML module and I uncheck the "Create empty element if needed" option, so the fields that are empty won't be shown. This gives me the right result, but if I do want some fields that are empty to appear, this solution will not work anymore.

Solution 2 - more complex
In this solution (images 1 through 3) I also read all the tags, I filter the rows and than I filter the columns. So, I end up with tables where only the wanted elements appear, and not the empty ones. I can then put these tables into the XML using advancedXML in append mode. The only problem is that in append mode, I am forced to select a Group element, and if the loop element is directly in the root node, I cannot do that, because the root node cannot be selected as a Group element. To test this, I simply nested the <product> into another element and I got the result I wanted. But this was for the test file. In the real file, the loop element is directly into the root, so I can't use this solution.
What do you think about these solutions?
In the complex solution, can I replace advancedXML with writeXMLField? If I could, how would I do that? I didn't yet understand how writeXMLField works.
In the first one, can I limit the elements that will not be written prior to passing them to the advancedXML? Can I accomplish this with tMap?
Thanks again!

Re: Extract entire XML element (with subelements)

Hi Sorry I meant just connect the XMLinput to the filterRow
I uncheck the "Create empty element if needed" option, so the fields that are empty won't be shown. This gives me the right result, but if I do want some fields that are empty to appear, this solution will not work anymore.

All you do for this is add a tMap before your tAdvancedFileOutputXML.
In the tMap the fields you want to display even if they have null values just add the following:
row2.book==null?"":row2.book
You will need to still keep "Create empty element if needed" UNCHECKED or else all the nulls will create the elements.
I hope this helps.
Regards
Brandon
One Star

Re: Extract entire XML element (with subelements)

Yes, that helps me to eliminate the filterRow element and use a tMap instead.
But, any ideas regarding the xmlWriteField?
One Star

Re: Extract entire XML element (with subelements)

Hi,
Yeah the 'product' variable is of type String. This gets treated by the tExtractXMLField as XML document.
As seen in the screenshots I added an output XML component, to create the file. Note the fact, as Brandon said, that you should uncheck the "Create empty element if needed" on the Advanced settings tab of the tAdvancedFileOutputXML component.
Regards,
Arno
One Star

Re: Extract entire XML element (with subelements)

I get it now!
Thank you!