
Using Document data type for XML processing in Talend

I'm confused about the standard procedure for processing XML files in Talend. I read about the recommendation to use the Document data type, as shown in the attached screenshot. But there seems to be another way: a separate schema column per XML node. What is the difference between the two approaches? Are there advantages or disadvantages? Are there scenarios where only one of the two works? What is the preferred way for XML-to-XML transformations with many loops? Do both ways work with mappings (tMap/tXMLMap)?
1 REPLY

Re: Using Document data type for XML processing in Talend

ohofrichter - I will attempt to answer your questions - gurus - keep me honest...
In the beginning, there was tMap. It was (and is) the main transformation component within Talend. If you had XML files and needed to transform the elements within the file and write either to a DB or to another XML file, you had to extract the XML elements as separate fields using, say, tFileInputXML or tFileInputMSXML (or another XML input component) and pass them to tMap for transformation. Then tXMLMap was introduced to receive XML documents directly and perform the transformation in one step.
Because tMap and tXMLMap are not start components, something upstream has to read the file first. So what's the difference between reading a file as individual fields and reading it as a document before passing it to tMap/tXMLMap?
When you use the tFileInputXML component, for example, you can extract the individual elements (fields) by setting up your schema with those fields. Coming out of the component, you'd have a row of data with the individual fields exposed (much like reading a table or flat file). This can be passed into a tMap or tXMLMap for transformation. (Yes, tXMLMap can receive data rows with individual fields! This is useful when you want to transform the rows and create an XML document directly in the transform step.)
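To make the "one row per element, one column per field" idea concrete, here is a minimal sketch in plain Java using the JDK's XPath API. It is not what Talend generates; the class name, the sample XML and the element names are all made up for illustration. It only mirrors the idea of one loop XPath plus a relative XPath per schema column.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class LoopXPathRows {
    // Hypothetical sample data; the element names are invented for this sketch.
    static final String XML =
        "<catalog>"
      + "<product><id>1</id><name>Lamp</name></product>"
      + "<product><id>2</id><name>Desk</name></product>"
      + "</catalog>";

    // The loop XPath selects the repeating element; each column is an XPath
    // evaluated relative to the current loop node.
    static List<String[]> extractRows(String xml, String loopXPath, String... columnXPaths)
            throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList loopNodes = (NodeList) xpath.evaluate(loopXPath, doc, XPathConstants.NODESET);

        List<String[]> rows = new ArrayList<>();
        for (int i = 0; i < loopNodes.getLength(); i++) {
            Node current = loopNodes.item(i);
            String[] row = new String[columnXPaths.length];
            for (int c = 0; c < columnXPaths.length; c++) {
                row[c] = xpath.evaluate(columnXPaths[c], current);
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) throws Exception {
        // Each <product> becomes one row with the columns id and name.
        for (String[] row : extractRows(XML, "/catalog/product", "id", "name")) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

The rows that come out look like what you would get from a table or flat file, which is why either tMap or tXMLMap can take them from there.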
Alternatively, when you simply want to read an XML file without extracting the elements in the read step, you set the tFileInputXML component to read the XML as a 'Document' in your schema. When you do this, the XML is read as-is and passed into the tXMLMap component. In this case, you would NOT use the tMap component, because it cannot parse XML documents, only 'tables' of data. Once the document is in tXMLMap, you can join it with other data from databases or XML files, do transformations, and generate XML documents or 'tables' of data.
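As a rough illustration of the Document approach, the sketch below parses the XML once and hands the whole tree to a later step, which decides which parts it needs. The JDK's org.w3c.dom.Document stands in for Talend's Document type here, and the class name, method names and sample XML are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class DocumentColumn {
    // Read step: parse the XML and pass it on as a single value.
    // No fields are extracted at this point.
    static Document readAsDocument(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
    }

    // Mapping step: navigates the tree only when it needs a value.
    static String firstProductName(Document doc) throws Exception {
        return XPathFactory.newInstance().newXPath()
            .evaluate("/catalog/product[1]/name", doc);
    }

    public static void main(String[] args) throws Exception {
        Document doc = readAsDocument(
            "<catalog><product><name>Lamp</name></product></catalog>");
        System.out.println(firstProductName(doc));
    }
}
```

The point of the design is that the read step stays trivial and all structural decisions (loops, joins, output shape) move into the mapping step.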
Now on your other question - "a separate schema column per xml node". Every XML file has a root node - the base of the file. That root has child nodes, which may have children of their own, and so on, which makes the file nested. What you then have to examine is whether you can 'extract' all the nested elements of the XML with a single XPath query. The XPath query defines, from the root, the path to get all the fields. Take a look at the attached files. In one, you have one XPath query, because you can map from the root node all the way down to the child node in the XML (just above the lowest elements) and get all the elements. In the other image, you see that we have two child nodes that are siblings, not nested one inside the other. In this case, we need separate XPath queries to get the respective elements.
What do you do in either case? The tFileInputXML component is designed to have one root XPath query. So if you use it on a file that has sibling child nodes as shown here, you will not get all the data in the XML *correctly*. In this case, you need a component that can read the separate nodes, and for this you could use the tFileInputMSXML component. The 'MS' in the name stands for 'multi-schema' - it is able to read one file and create separate schemas from it using the XPath queries you define.
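Here is a small sketch of why sibling nodes need separate loops, again in plain Java rather than Talend code. The XML structure is an assumption made for illustration (the attached files are not reproduced here): two sibling branches under the root, each with its own repeating element and a different number of entries.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class SiblingLoops {
    // Hypothetical sample: <products> and <keywords> are siblings, not nested.
    static final String XML =
        "<catalog>"
      + "<products><product>Lamp</product><product>Desk</product></products>"
      + "<keywords><keyword>home</keyword><keyword>office</keyword>"
      + "<keyword>sale</keyword></keywords>"
      + "</catalog>";

    // Runs one loop XPath and returns the text of every node it selects.
    static List<String> loop(String xml, String loopXPath) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(loopXPath, doc, XPathConstants.NODESET);
        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            values.add(nodes.item(i).getTextContent());
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        // Two loops, two result sets with different row counts (2 and 3).
        System.out.println(loop(XML, "/catalog/products/product"));
        System.out.println(loop(XML, "/catalog/keywords/keyword"));
    }
}
```

With two products and three keywords there is no single repeating element whose rows would cover both lists, so a single loop XPath cannot return all the data correctly. Each branch gets its own loop and, in Talend terms, its own schema.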
Because we have the tXMLMap component which is very versatile and powerful, you can also parse single-schema and multi-schema files directly within the component. Within the tXMLMap component, you can read in an XML document, and you can set loops and groups to read the XML correctly. Looking at the example data in the attached XMLs, you'd set loops at the 'product' and 'keywords' child nodes... For more, see examples at the bottom of https://help.talend.com/search/all?query=tXMLMap+operation&content-lang=en
Hope this helps...