read pdf file or word document

Seven Stars

Hi all,

I need to read the content of a PDF file (text, not images) and extract the text that is in bold. Since there are no PDF-related components, I tried converting the PDF to a Word document before reading it with Talend. I tried reading the Word document with tFileInputFullRow and tFileInputDelimited, but with no luck.

How can I read the data in either of these formats?

Any help is appreciated.


Thanks in advance.


Re: read pdf file or word document


Here is a custom component written by a Talend community user and shared on the Talend Exchange portal.

If you want to install a custom component into the Studio, the Talend Help Center document "How to install and update a custom component" will help.

Best regards


Seven Stars

Re: read pdf file or word document


Thanks for your reply. I understand that the tpdftoText component can convert a PDF into a text file, but my requirement is to read the PDF or Word (.docx) file and apply transformations while reading it.

Can I read the PDF file as is and extract the required strings from it by applying filters?
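
For the Word route, one possible approach is a small piece of plain Java, for example in a custom routine. A .docx file is a ZIP archive whose main text lives in word/document.xml, and a run of text (w:r) is bold when its run properties (w:rPr) contain a w:b element that is not switched off. The sketch below uses only the JDK. The class name DocxBoldExtractor and its methods are made up for illustration and are not part of Talend, and it only detects bold set directly on a run, not bold inherited from a paragraph or character style. It does not help with the PDF itself: the JDK has no PDF parser, so that route would need a third-party library.

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not a Talend API: collects the text of bold runs in a .docx file.
final class DocxBoldExtractor {
    // WordprocessingML namespace used by word/document.xml.
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Opens the .docx archive and parses its main document part.
    static List<String> extractBoldFromDocx(String path) throws Exception {
        try (ZipFile zip = new ZipFile(path)) {
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                throw new IllegalArgumentException("Not a .docx file: " + path);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return extractBold(in);
            }
        }
    }

    // Returns the text of every run (w:r) that carries an active w:b property.
    static List<String> extractBold(InputStream documentXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        Document doc = factory.newDocumentBuilder().parse(documentXml);

        List<String> boldRuns = new ArrayList<>();
        NodeList runs = doc.getElementsByTagNameNS(W_NS, "r");
        for (int i = 0; i < runs.getLength(); i++) {
            Element run = (Element) runs.item(i);
            if (!isBold(run)) {
                continue;
            }
            // A run can hold several w:t elements; join them in order.
            StringBuilder text = new StringBuilder();
            NodeList texts = run.getElementsByTagNameNS(W_NS, "t");
            for (int j = 0; j < texts.getLength(); j++) {
                text.append(texts.item(j).getTextContent());
            }
            if (text.length() > 0) {
                boldRuns.add(text.toString());
            }
        }
        return boldRuns;
    }

    // A run is treated as bold when w:rPr contains w:b and w:val is not false/0/off.
    static boolean isBold(Element run) {
        NodeList props = run.getElementsByTagNameNS(W_NS, "rPr");
        if (props.getLength() == 0) {
            return false;
        }
        NodeList bold = ((Element) props.item(0)).getElementsByTagNameNS(W_NS, "b");
        if (bold.getLength() == 0) {
            return false;
        }
        String val = ((Element) bold.item(0)).getAttributeNS(W_NS, "val");
        return !(val.equals("false") || val.equals("0") || val.equals("off"));
    }

    public static void main(String[] args) throws Exception {
        // Usage: java DocxBoldExtractor.java path/to/file.docx
        for (String run : extractBoldFromDocx(args[0])) {
            System.out.println(run);
        }
    }
}
```

Each returned string is one bold run, so a phrase that Word split into several runs comes back in pieces. Any further filtering or transformation can then be applied to that list in the job.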