Help with tExtractXMLField for XHTML

I am writing a job to extract content from Word documents and .html files and load it into Elasticsearch. I am using tTikaExtractor to extract the contents of the files. I have the following components in my job.

 

tFileList-->tTikaExtractor-->tRowGenerator-->tExtractXMLField-->tFileOutputDelimited

 

The process seems to work up to tRowGenerator. However, tExtractXMLField is not returning any data. I have the following settings in the tExtractXMLField component:

Loop XPath query = "/html/head/"

The mapping values (column = XPath query) are:

"title" = "/title"

"body" = "/html/body" 

I am also not sure how to extract the creator value from <meta name="dc:creator" content="Tshak"/> in the data.

 

The following is the output coming out of tRowGenerator:

<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="date" content="2018-04-20T14:18:00Z"/>
<meta name="cp:revision" content="4"/>
<meta name="Total-Time" content="1"/>
<meta name="extended-properties:AppVersion" content="16.0000"/>
<meta name="meta:paragraph-count" content="1"/>
<meta name="meta:word-count" content="11"/>
<meta name="dc:creator" content="Tshak"/>
<meta name="extended-properties:Company" content="Tshak"/>
<meta name="Word-Count" content="11"/>
<meta name="publisher" content="Tshak"/>
<meta name="meta:page-count" content="1"/>
<meta name="dc:publisher" content="Tshak"/>
<title>Test Extraction</title>
</head>
<body><p><b><u>Help Desk</u></b></p>
<p><a name="_GoBack"/>First paragraph content</p>
<p/>
<p><b><u>Helpdesk Portal</u></b></p>
<p>Second paragraph content</p>
<p/>
<p/>
</body></html>

 

Appreciate your help!

Re: Help with tExtractXMLField for XHTML

Thanks for your response, Manohar. Your suggestion works! I am able to extract the title and body content from the XML (XHTML).
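
For anyone checking the XPath logic outside Talend, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree. It is not the Talend configuration and not necessarily what was suggested in this thread. It assumes a trimmed-down copy of the Tika output above; the prefix "x" and the function name extract_fields are made up for illustration. It shows two things: a path without the XHTML namespace finds nothing because the document declares a default namespace, and an attribute predicate on name selects the dc:creator meta element so its content attribute can be read.

```python
import xml.etree.ElementTree as ET

# Trimmed-down sample modeled on the Tika output shown in the question.
XHTML = """<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="dc:creator" content="Tshak"/>
<title>Test Extraction</title>
</head>
<body><p>First paragraph content</p>
<p>Second paragraph content</p>
</body></html>"""

# "x" is an arbitrary local alias (an assumption of this sketch)
# for the default namespace declared on the <html> element.
NS = {"x": "http://www.w3.org/1999/xhtml"}


def extract_fields(xhtml: str) -> dict:
    """Pull title, creator and paragraph text out of an XHTML string."""
    root = ET.fromstring(xhtml.encode("utf-8"))

    # Attribute predicate: pick the <meta> whose name is "dc:creator".
    creator = root.find("x:head/x:meta[@name='dc:creator']", NS)

    return {
        # No namespace in the path, so nothing matches and this is None.
        "unqualified_title": root.findtext("head/title"),
        # Namespace-qualified path matches the <title> element.
        "title": root.findtext("x:head/x:title", namespaces=NS),
        "creator": creator.get("content") if creator is not None else None,
        "paragraphs": [p.text for p in root.findall("x:body/x:p", NS)],
    }


if __name__ == "__main__":
    print(extract_fields(XHTML))
```

In plain XPath the creator lookup corresponds to something like meta[@name='dc:creator']/@content evaluated under the head element. How that has to be written in tExtractXMLField depends on the loop query and on how the component handles namespaces, which this sketch does not cover.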