Seven Stars

read pdf file or word document

Hi all,

I need to read the content of a PDF file (text, not images) and extract the text that is in bold. Since there are no PDF-related components, I tried converting the PDF to a Word document before reading it with Talend. I tried reading the Word document with tFileInputFullRow and tFileInputDelimited, but with no luck.

How can I read the data in either format?

Any help is appreciated.

 

Thanks in advance.

2 REPLIES
Moderator

Re: read pdf file or word document

Hello,

Here is a custom component written by a Talend community user and shared on the Talend Exchange portal.

https://exchange.talend.com/#marketplaceproductoverview:marketplace=marketplace%252F1&p=marketplace%...

If you want to install a custom component into Studio, the online document TalendHelpCenter: How to install and update a custom component will help.

Best regards

Sabrina

Seven Stars

Re: read pdf file or word document

Hi,

Thanks for your reply. I understand that the tpdftoText component can convert a PDF into a text file. But my requirement is to read the PDF or Word (.docx) file and apply transformations while reading it.

Can I read the PDF file as it is and extract the required string from it by applying filters?
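
For the Word side of the question, here is a minimal sketch of one possible approach, not a Talend component. A .docx file is a ZIP archive whose main text lives in word/document.xml, and directly applied bold appears as a `<w:b/>` element inside a run's properties. The function names below are hypothetical, and the sketch assumes the bold is applied directly to the runs (bold inherited from a paragraph or character style is not detected). It uses only the Python standard library and would run as a pre-processing step outside Talend.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}
OFF_VALUES = {"0", "false", "off"}


def is_bold(run):
    """Return True if the run carries direct bold formatting (<w:b/>)."""
    bold = run.find("w:rPr/w:b", NS)
    if bold is None:
        return False
    # <w:b w:val="false"/> explicitly switches bold off.
    return bold.get(f"{{{W_NS}}}val", "true").lower() not in OFF_VALUES


def bold_text_from_document_xml(xml_bytes):
    """Return one string per paragraph containing its bold text."""
    root = ET.fromstring(xml_bytes)
    results = []
    for para in root.iter(f"{{{W_NS}}}p"):
        parts = []
        for run in para.findall(".//w:r", NS):
            if is_bold(run):
                # A run may hold several <w:t> text nodes.
                parts.append("".join(t.text or "" for t in run.findall("w:t", NS)))
        text = "".join(parts).strip()
        if text:
            results.append(text)
    return results


def bold_text_from_docx(path_or_file):
    """Open a .docx (path or file-like object) and extract its bold text."""
    with zipfile.ZipFile(path_or_file) as archive:
        return bold_text_from_document_xml(archive.read("word/document.xml"))
```

The returned strings could be written to a plain text file, one per line, which a component such as tFileInputFullRow should then be able to read for further filtering. This does not cover reading the PDF directly; that would need a PDF library.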