I need to read the content of a PDF file (text, not images) and extract the text that is in bold. Since Talend has no PDF-related components, I tried converting the PDF to a Word document before reading it with Talend. I tried reading the Word document with tFileInputFullRow and tFileInputDelimited, but with no luck.
How can I read the data in either of these formats?
Any help is appreciated.
Thanks in advance.
Here is a custom component written by a Talend community user and shared on the Talend Exchange portal.
If you want to install a custom component into Studio, this online document will help: TalendHelpCenter: How to install and update a custom component.
Thanks for your reply. I understand that the tpdftoText component can convert a PDF into a text file, but my requirement is to read the PDF or Word (.docx) file and apply transformations while reading it.
Can I read the PDF file as it is and extract the required string from it by applying filters?
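One possible way to handle the filtering step, sketched under assumptions: if the PDF is first converted to plain text (for example with the custom component mentioned above), the resulting lines could be filtered with a regular expression in a Java routine. The class name PdfTextFilter, the method extract, and the "Invoice No:" pattern are hypothetical and only for illustration; they are not part of Talend or of the component. Note that a plain-text conversion normally drops font information, so picking out bold text specifically would probably need a PDF library that exposes font names, which this sketch does not cover.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Talend or of the tpdftoText component.
final class PdfTextFilter {

    // Returns the first capture group (trimmed) of every line that matches
    // the regex. Lines that do not match are skipped.
    static List<String> extract(List<String> lines, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> hits = new ArrayList<>();
        for (String line : lines) {
            Matcher m = pattern.matcher(line);
            if (m.find() && m.groupCount() >= 1 && m.group(1) != null) {
                hits.add(m.group(1).trim());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Sample lines standing in for the text produced from the PDF (assumed data).
        List<String> lines = Arrays.asList(
            "Invoice No: 12345",
            "Some unrelated paragraph",
            "Invoice No: 67890 ");
        System.out.println(extract(lines, "Invoice No:\\s*(.+)"));
    }
}
```

In a job, the same logic could live in a user routine and be called from a tJavaRow or tMap expression, one input row per line of text. That is only one way to wire it up.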