Four Stars

extract data from xml

I have an XML and wan to extract data from it.

I read first the xml file tFileinputxml, then extraxt fields in 2 steps

 1/ I get data from (tExtractXML)

      InputFile
      DocumentType
      DocumentName

and the nodes inside the <fieldsets> node

 2/ I extract data from <fieldsets>

but in this case I have something like : field name =  and Value =

for example :

            <field name="Contract Number" value="192" />
            <field name="Author" value="mycompagny" />

So at the end I have somthing like :

InputFile  |  DocumentType  |   DocumentName | Contract Number | 192

InputFile  |  DocumentType  |   DocumentName | Author | mycompagny

 

but  I would like to get only one line like

InputFile  |  DocumentType  |   DocumentName | 192 | mycompagny

 

this is my xml example and the result I get today :

 

<DocImporterSession>
   <Document>
      <InputFile>file1.pdf</InputFile>
      <DocumentType>Contrat</DocumentType>
      <DocumentName>Contract test</DocumentName>
       <fieldsets>
         <fieldset name="Contrat">
            <field name="Contract Number" value="456" />
            <field name="Author" value="mycompagny" />
          </fieldset>
      </fieldsets>
   </Document>
   <Document>
      <InputFile>test2.pdf</InputFile>
      <DocumentType>Order</DocumentType>
      <DocumentName>test Order</DocumentName>
      <fieldsets>
         <fieldset name="Contrat">
            <field name="Contract Number" value="192" />
            <field name="Author" value="mycompagny" />
          </fieldset>
      </fieldsets>
   </Document>
</DocImporterSession>

 

result :

file1.pdf|Contrat|Contract test|Contract Number|456
file1.pdf|Contrat|Contract test|Author|mycompagny
test2.pdf|Order|test Order|Contract Number|192
test2.pdf|Order|test Order|Author|mycompagny

 

What I want :

 

file1.pdf|Contrat|Contract test| 456 | mycompagny
test2.pdf|Order|test Order|192 | mycompagny

If anybody can help me

 

  • Data Integration
1 REPLY
Ten Stars

Re: extract data from xml

To understand this and work through it you need to consider XML like a mini relational database. What you are experiencing here is pretty much the same as joining two tables where 1 row in the first table corresponds to 2 rows in the second table. In this case you would get 2 rows back. There are ways to mitigate this using XPath queries, but you may as well deal with the problem you have here as it will help you in other areas.

 

1) When you extract your data you will need to also extract the field name with the value (as you have shown). Once you have extracted your data, you need to sort it by your group columns (the columns that are all the same). You can do this with a tSortRow component.

2) Next, connect to a tMap. The tMap should have an output table which has all of the columns you wish. In your example you extra columns are "Author" and "Contract Number". So your output table should have the following columns; "InputFile", "DocumentType", "DocumentName", "Author" and "ContractNumber".

3)  Now, in the tMap there are tMap variables. These are useful as they hold their values between rows. Thanks to this, you can flatten your structure. What you need to do is create a variable which you can use to check when the group key columns change. Since tMap variables are stored between rows AND they are processed from top to bottom, this is relatively straight forward....but can take some thinking about. Take a look at this tutorial which covers this in a simple example. You may need to recreate it and play around with it to figure it out. But once you have it figured it is REALLY useful in Talend.

4) Once you have identified when the group keys change, you can work with other tMap variables. In this example we have an "Author" field and a "ContractNumber" field arriving in different rows. However, we know they will arrive consecutively. As such the rule we need to apply is as follows....

When an "Author" field arrives and it is the first row of a new group (identified by your group columns), set the value of an "Author" tMap variable to the value of that field. If the "Author"
field has not arrived and it is a new group, set the "Author" variable to null. Otherwise set the "Author" tMap variable to itself.
When a "ContractNumber" field arrives and it is the first row of a new group (identified by your group columns), set the value of a "ContractNumber" tMap variable to the value of that field. If
the "ContractNumber" field has not arrived and it is a new group, set the "ContractNumber" variable to null. Otherwise set the "ContractNumber" tMap variable to itself.

Pass the value of the "Author" field to the "Author" output table column.
Pass the value of the "ContractNumber" field to the "ContractNumber" output table column.

5) Now connect a tAggregateRow component. Group by the "InputFile", "DocumentType" and "DocumentName" columns. In the Operations section, set the "Author" and "ContractNumber" columns to a function of "Last".

 

You will now have flattened your data.

 

The reason this works is because of the tMap variables. For every new group of "InputFile", "DocumentType" and "DocumentName" you either set the "Author" field to the value or to null. Likewise with the "ContractNumber". These values are stored between rows of the group. So by the end of the group (the last record), all of the tMap variables corresponding to the fields on different rows will have been set. Since you pass these values to the corresponding output table fields for every row, the very last row passed out for that group will hold the complete set of data. In the next component you can group by the group key and use a function which will return the last record returned for both "AUthor" and "ContractNumber". This will result on one row being returned per group with all of the fields populated.

 

Hopefully this will open a few doors and give you a few ideas on how useful the tMap can be :-)

Rilhia Solutions