One Star

How to read data from a word file

Hi Talend Team,
I want to read some data from a Word file.
Is there a component that can read a Word file directly?
Or is there any other way to do it?
Regards,
Sandeep.
4 REPLIES

Re: How to read data from a word file

Hi,
there is a discussion on LinkedIn about this topic (or was it you who asked the question there?): http://www.linkedin.com/groupItem?view=&gid=812977&type=member&item=107111395&qid=06608beb-085b-4573...
Still, the problem with a Word document is that it is unstructured: it can contain tables, text, images, links, headers, and other embedded documents. An Excel sheet can be read because it at least consists of tables. A Word document cannot be read directly; you need an intermediate step to extract any structured information. In theory, you could write a script that saves the Word document as clear text, but wouldn't you lose information that way?
If you know what is in the Word document, e.g. CSV (comma-separated values), you can use the POI API or Visual Basic to extract the data from Word, usually as delimited values (CSV), and then use Talend to do something useful with the data.
Carpe diem
Gabriel

Re: How to read data from a word file

Hi Gabriel,
First of all, thank you for your reply.
I have a requirement to read data from a Microsoft Word file.
I am well aware that a Word file is unstructured, but I just want to match a pattern in the file and read the data next to it.
For example:
Name : kathi
Place : USA
with a specified delimiter.
I want to match the key "Name" and read the value "kathi" in TOS.
Regards,
Sandeep.
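
For illustration, here is a minimal sketch in plain Java of the matching described above, assuming the Word document has already been converted to clear text. The class name KeyValueMatcher, the method parse, and the choice of ":" as the delimiter are assumptions made for this example; none of them come from Talend or from the thread.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: collects "Key : value" rows from already-extracted text.
class KeyValueMatcher {
    // Key is everything before the first ':', value is the rest of the row.
    // The ':' delimiter is an assumption taken from the example rows.
    private static final Pattern PAIR =
            Pattern.compile("\\s*([^:]+?)\\s*:\\s*(.*?)\\s*");

    static Map<String, String> parse(String text) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : text.split("\\R")) {      // one row at a time
            Matcher m = PAIR.matcher(line);
            if (m.matches()) {                       // rows without a delimiter are skipped
                // Keys are lower-cased so that "name" finds "Name".
                fields.put(m.group(1).toLowerCase(Locale.ROOT), m.group(2));
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<String, String> fields = parse("Name : kathi\nPlace : USA\n");
        System.out.println(fields.get("name"));   // prints kathi
        System.out.println(fields.get("place"));  // prints USA
    }
}
```

Inside a Talend job the same regular expression could probably be used in a regex-based input component instead of custom code, but that is untested here.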

Re: How to read data from a word file

Hi Sandeep,
then I'd create a script using the POI API (or any Word manipulation API, e.g. Lucene) to extract the document body as clear text. (I usually deploy all my routines as web services; it is easier and more accessible than trying to make a new Talend component.) Then:
- for every document (tFileList)
- extract content as clear text (tSSH, tWebService) into a temporary file
- read per row (tFileInputFullRow)
- check if file contains searched string (tFilterRow)
- read other rows necessary (tFileInputRegex)
There is no out-of-the-box Talend component to extract clear text from a Word document, though. In theory, you could reuse the WordExtractor from the Lucene project (it uses POI as well).
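
The row-level steps above can be sketched in plain Java to show the logic outside Talend. This is only a sketch under assumptions: the extraction step is not shown, and the clear text is assumed to already sit in *.txt files in one directory (for .doc files, Apache POI's own WordExtractor class can produce such text). The class name ExtractedTextScanner and the method findValues are hypothetical names for this example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper mirroring the job outline: list files, read rows, filter, extract.
class ExtractedTextScanner {
    static List<String> findValues(Path dir, String key, String delimiter) throws IOException {
        List<String> values = new ArrayList<>();
        String wanted = key.toLowerCase(Locale.ROOT);
        // For every extracted document (the tFileList step).
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, "*.txt")) {
            for (Path file : files) {
                // Read per row (the tFileInputFullRow step).
                for (String row : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                    int cut = row.indexOf(delimiter);
                    if (cut < 0) {
                        continue; // row has no delimiter, so it cannot be a key/value pair
                    }
                    String rowKey = row.substring(0, cut).trim().toLowerCase(Locale.ROOT);
                    // Keep only rows whose key is the searched string (the tFilterRow step).
                    if (rowKey.equals(wanted)) {
                        values.add(row.substring(cut + delimiter.length()).trim());
                    }
                }
            }
        }
        return values;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the temporary file written by the extraction step.
        Path dir = Files.createTempDirectory("extracted");
        Files.write(dir.resolve("doc1.txt"),
                Arrays.asList("Name : kathi", "Place : USA"), StandardCharsets.UTF_8);
        System.out.println(findValues(dir, "name", ":")); // prints [kathi]
    }
}
```

The order in which a directory listing returns files is not guaranteed, so the values may come back in a different order when several documents are scanned.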
Gabriel

Re: How to read data from a word file

Hi Gabriel,
Thank you once again for your reply.
So we can extract the text with a script that uses the POI API. Could you please mail or post the procedure for creating a sample job? It would be a great help to me.

Regards,
Sandeep.