Hi, there is a discussion on LinkedIn about this topic (or was it you who asked the question there?): http://www.linkedin.com/groupItem?view=&gid=812977&type=member&item=107111395&qid=06608beb-085b-4573... Still, I'd say the problem with a Word document is that it is unstructured: it can contain tables, text, images, links, headers, other documents. You could read data from an Excel sheet, because there at least you have tables. So you cannot go directly from a Word doc; you need a step that extracts the structured information first. In theory you could write a script that saves your Word document as plain text, but wouldn't you lose information? If you know what is in the Word document, e.g. CSV (comma-separated values), you can use the POI API or Visual Basic to extract the data from Word, usually as delimited values (CSV), and then use Talend to do something useful with the data. Carpe diem, Gabriel
Hi Gabriel, First of all, thank you for your reply. I have a requirement to read data from a Microsoft Word file. I am well aware that a Word file is unstructured, but I just want to match a pattern in the file and read the data next to it. For example, the file contains "Name : kathi Place : USA", with a specified delimiter. I want to match "Name" and read the value "kathi" in TOS. Regards, Sandeep.
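To make the requirement concrete, here is a minimal Java sketch of the matching step only. It assumes the Word content has already been extracted to a plain string, that the delimiter is ":", and that a value is a single token without spaces. The class and method names (LabelValueMatcher, findValue) are made up for illustration and are not part of Talend or POI.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LabelValueMatcher {

    // Returns the first whitespace-free token that follows "<label> <delimiter>" in the text.
    // Assumption: the value contains no spaces (true for "kathi" and "USA" in the example).
    public static Optional<String> findValue(String text, String label, String delimiter) {
        Pattern pattern = Pattern.compile(
                "\\b" + Pattern.quote(label) + "\\s*" + Pattern.quote(delimiter) + "\\s*(\\S+)",
                Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(text);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String text = "Name : kathi Place : USA";
        // The label is matched case-insensitively, so "name" finds "Name".
        System.out.println(findValue(text, "name", ":").orElse("<not found>"));
        System.out.println(findValue(text, "Place", ":").orElse("<not found>"));
    }
}
```

Values that contain spaces would need a different stop condition, for example the next known label or the end of the line.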
Hi Sandeep, then I'd create a script using the POI API (or any other Word manipulation API, e.g. Lucene) to extract the document body as plain text. (I usually deploy all my routines as web services; it is easier and more accessible than trying to build a new Talend component.) Then, for every document (tFileList): extract the content as plain text (tSSH, tWebService) into a temporary file, read it row by row (tFileInputFullRow), check whether a row contains the searched string (tFilterRow), and read the other rows you need (tFileInputRegex). There is, however, no out-of-the-box Talend component that extracts plain text from a Word document. In theory, you could reuse the WordExtractor from the Lucene project (it uses POI as well). Gabriel
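The row-based part of that flow (read per row, filter, regex) can be sketched in plain Java as below. This is only an illustration of what the tFileInputFullRow, tFilterRow and tFileInputRegex steps would do, not Talend-generated code. It assumes the text was already extracted into a UTF-8 temporary file with one "Label : value" pair per row; the class name ExtractedTextScanner and the method valuesFor are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExtractedTextScanner {

    // Matches rows such as "Name : kathi"; group 1 is the label, group 2 the trimmed value.
    private static final Pattern ROW = Pattern.compile("^\\s*(\\w+)\\s*:\\s*(.+?)\\s*$");

    public static List<String> valuesFor(Path textFile, String label) throws IOException {
        List<String> values = new ArrayList<>();
        // Read the extracted text row by row (the tFileInputFullRow step).
        for (String row : Files.readAllLines(textFile, StandardCharsets.UTF_8)) {
            // Cheap pre-filter: skip rows that do not mention the label (the tFilterRow step).
            if (!row.toLowerCase().contains(label.toLowerCase())) {
                continue;
            }
            // Split the row into label and value (the tFileInputRegex step).
            Matcher matcher = ROW.matcher(row);
            if (matcher.matches() && matcher.group(1).equalsIgnoreCase(label)) {
                values.add(matcher.group(2));
            }
        }
        return values;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the temporary file that the text extraction step would produce.
        Path tmp = Files.createTempFile("word-extract", ".txt");
        Files.write(tmp, Arrays.asList("Some heading", "Name : kathi", "Place : USA"),
                StandardCharsets.UTF_8);
        System.out.println(valuesFor(tmp, "Name"));
        Files.delete(tmp);
    }
}
```

The extraction of the Word document itself is deliberately left out here, since it depends on which library (for example POI) you put behind the script or web service.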