Four Stars AT
Four Stars

Read XML file where field names are not xml nodes

Hi, 

 

I am pulling data from an API that returns data in an XML format. I am now trying to put that data in to a CSV. 

The XML data is not in a standard format where the filed names are either the xml nodes or an attribute in the node. The way the XML is structured is that the filed names are all listed first under <columnNames> and then each value is under <rows>. The standard tXMLMap and tExtractXMLField do not meet the needs for me. Wondering if anyone else has come across a similar XML and has found a way to make it work. 

Appreciate any help on this. 

 

Here's a sample of what the XML looks like. 

<?xml version="1.0" encoding="UTF-8"?><root><count>3312</count><name>Service Request Report</name><columnNames>Incident ID</columnNames><columnNames>Reference #</columnNames><columnNames>Subject</columnNames><columnNames>Status</columnNames><columnNames>Source Hierarchy</columnNames><columnNames>Category ID</columnNames><columnNames>Service</columnNames><columnNames>Severity</columnNames><columnNames>District</columnNames><columnNames>Lattitude</columnNames><columnNames>Longitude</columnNames><columnNames>Date Created</columnNames><columnNames>Date Last Updated</columnNames><columnNames>Department</columnNames><columnNames>Escalated</columnNames><columnNames>Year</columnNames><columnNames>Quarter</columnNames><columnNames>Week</columnNames><columnNames>Total Incidents</columnNames><columnNames>No Of Days </columnNames><rows>41191</rows><rows>170824-000522</rows><rows>My San Jose -General Request-City</rows><rows>Closed</rows><rows>CX Console</rows><rows>City</rows><rows>General Request</rows><rows/><rows/><rows/><rows/><rows>'2017-08-24 15:45:30'</rows><rows>'2017-08-26 23:01:36'</rows><rows>Police</rows><rows/><rows>2017</rows><rows>3</rows><rows>34</rows><rows>1</rows><rows>6</rows><rows>41194</rows><rows>170824-000523</rows><rows>My San Jose -General Request-City</rows><rows>Closed</rows><rows>CX Console</rows><rows>City</rows><rows>General Request</rows><rows/><rows/><rows/><rows/><rows>'2017-08-24 15:53:24'</rows><rows>'2017-08-24 15:53:27'</rows><rows>Information Technology</rows><rows/><rows>2017</rows><rows>3</rows><rows>34</rows><rows>1</rows><rows>6</rows><rows>41196</rows><rows>170824-000529</rows><rows>My San Jose -General Request-City</rows><rows>Closed</rows><rows>CX Console</rows><rows>City</rows><rows>General Request</rows><rows/><rows/><rows/><rows/><rows>'2017-08-24 15:54:15'</rows><rows>'2017-08-26 23:01:35'</rows><rows>Police</rows><rows/><rows>2017</rows><rows>3</rows><rows>34</rows><rows>1</rows><rows>6</rows><rows>41198</rows><rows>170824-000531</rows><rows>My San Jose -General Request-Other</rows><rows>Closed</rows><rows>CX Console</rows><rows>Other</rows><rows>General Request</rows><rows/><rows/><rows/><rows/><rows>'2017-08-24 15:59:06'</rows><rows>'2017-08-26 23:01:34'</rows><rows>External</rows><rows/><rows>2017</rows><rows>3</rows><rows>34</rows><rows>1</rows><rows>6</rows></root>

7 REPLIES
Twelve Stars

Re: Read XML file where field names are not xml nodes

That is not nice XML to work with. I would suggest trying to use the tExtractXMLField component (https://help.talend.com/reader/jomWd_GKqAmTZviwG_oxHQ/Pc9RZi6IiWw1Kb~Jf0bE0Q) with some XPath queries.

 

Having looked at the XML you gave as an example, it looks like the "row" elements holding the values (these should probably be "columns" within "row" elements) are just repeated for several actual rows. I suspect that if you have 10 columns identified above, every 10 "row" elements will likely represent a real row, with the values in the same order as the "column" data above. So if you have 30 "row" elements, that represents 3 actual rows. Not nice to work with, but you can certainly achieve your goal using a tExtractXMLField and XPaths

Rilhia Solutions
Four Stars AT
Four Stars

Re: Read XML file where field names are not xml nodes

You got it right. The column names are listed in the beginning in columnNames elements and the data or the values are under rows elements. And I agree this is not a nice XML but this is what I got Smiley Sad

 

I used the tExtractXMLField and it gives me all the rows and the column names. I am not sure how to create the mapping between the column names and the rows. Do you have any example of how this can be done?

I am guessing this needs working with the code instead of the designer view. I am fairly new to Talend so would need some hand holding here.

Thanks.

Twelve Stars

Re: Read XML file where field names are not xml nodes

Are you writing your job specifically for this XML (ie with the columns you have shown) or for this type of XML layout with varying columns? I suspect it is the second. If it is the second way, there are a couple of things to consider. First is, how have you managed to get your column and row data using the tExtractXMLField component? Did you simply loop on "column" and then loop on "row" (returning a row for every column and every row)? That is probably the approach that I would use for an XML file which isn't fixed (ie the column numbers can change). While looping through the columns I would keep a count of how many rows there are and concatenate each value with the next (with a comma between them). That would sort the CSV  header row (if you need on). To do this you can simply use a tMap and a tMap variable where you append the column values to it. These do not drop values between rows, so by the end of the columns, your tMap variable will be the complete CSV header row. The next component would be a tAggregateRow. You would simply group by a hard coded 1 (set up in the tMap) and then use the "LAST" function to return only the last row (the completed Header row).

 

Once that is done, you can use the same sort of logic with the row data BUT you will need to make sure you create a new row for every n "row" values concatenated. You can do this in a similar way, but your "group key" (used for the tAggregateRow) will need to increment by 1 for every complete row.

 

I've not necessarily given you everything you will need for this, but I think I have given you enough to work with. The best way to learn Talend is to try things out. See if you can extrapolate from what I have said here and see if you can get closer to your required solution. If you have any problems after trying something, give us a shout  

Rilhia Solutions
Eight Stars

Re: Read XML file where field names are not xml nodes

Hi,

 

Here is a suggestion:

1-Transform your file with a tReplace components to get "proper" records. The expected output is:

?xml version="1.0" encoding="UTF-8"?>
<root>
<count>3312</count>
<name>Service Request Report</name>
<columnNames>Incident ID</columnNames>
<columnNames>Reference #</columnNames>
<columnNames>Subject</columnNames>
<columnNames>Status</columnNames>
<columnNames>Source Hierarchy</columnNames>
<columnNames>Category ID</columnNames>
<columnNames>Service</columnNames>
<columnNames>Severity</columnNames>
<columnNames>District</columnNames>
<columnNames>Lattitude</columnNames>
<columnNames>Longitude</columnNames>
<columnNames>Date Created</columnNames>
<columnNames>Date Last Updated</columnNames>
<columnNames>Department</columnNames>
<columnNames>Escalated</columnNames>
<columnNames>Year</columnNames>
<columnNames>Quarter</columnNames>
<columnNames>Week</columnNames>
<columnNames>Total Incidents</columnNames>
<columnNames>No Of Days </columnNames>
<record>
<rows>41191</rows>
<rows>170824-000522</rows>
<rows>My San Jose -General Request-City</rows>
<rows>Closed</rows>
<rows>CX Console</rows>
<rows>City</rows>
<rows>General Request</rows>
<rows/>
<rows/>
<rows/>
<rows/>
<rows>'2017-08-24 15:45:30'</rows>
<rows>'2017-08-26 23:01:36'</rows>
<rows>Police</rows>
<rows/>
<rows>2017</rows>
<rows>3</rows>
<rows>34</rows>
<rows>1</rows>
<rows>6</rows>
</record>
<record>
<rows>41194</rows>

...

 

With such a tReplace, it should work:

Capture.PNG

 

2-Then with a tXmlMap, you can change the name of your elements.

 

Here is an example. My input file is:

<Root>
<Record>
<Value>A</Value>
<Value>B</Value>
</Record>
<Record>
<Value>C</Value>
<Value>C</Value>
</Record>
</Root>

 

I want to get:

<?xml version="1.0" encoding="ISO-8859-15"?>
<Root>
<Record>
<Value1>A</Value1>
<Value2>A</Value2>
</Record>
<Record>
<Value1>C</Value1>
<Value2>C</Value2>
</Record></Root>

 

Here is my tXmlMap:

Capture.PNG

Eric

 

Four Stars AT
Four Stars

Re: Read XML file where field names are not xml nodes

Thanks eric44.

 

I am unable to get tReplace to work. Here is the error message I get when executing. 

 

tReplaceError.png

Four Stars AT
Four Stars

Re: Read XML file where field names are not xml nodes

Thanks for the suggestion. I am trying eric44 solutions. If that doesn't work I will try your recommendation. 

Eight Stars

Re: Read XML file where field names are not xml nodes

Hi,

 

What is your input type format for your tReplace ? Document ? If yes, you have to convert it to string with a tConverttype component before your tReplace.

 

Eric