Four Stars

I am trying to download a pdf file, read the pdf file and load it directly into the database using Talend

I am trying to build a data flow that reads a PDF from the local server and loads the file into the database using Talend.

The data type of the column that I am loading is BLOB.

2 REPLIES
Four Stars

Re: I am trying to download a pdf file, read the pdf file and load it directly into the database using Talend

Hi,
I think that you have to do this with a Java routine.
A solution is described at this link: https://www.talendforge.org/forum/viewtopic.php?id=6609
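
For illustration, here is a minimal sketch of what such a routine might look like. It is not the code from the linked thread; the class name `PdfFileRoutine` and both method names are hypothetical. In Talend Studio a user routine would normally be declared `public` inside `package routines;`, which is left out here so the snippet compiles on its own.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical routine class; name and methods are illustrative only.
final class PdfFileRoutine {

    // Well-formed PDF files begin with the ASCII marker "%PDF-".
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Reads the whole file into memory and returns its raw bytes. */
    static byte[] readFileAsBytes(String path) throws IOException {
        return Files.readAllBytes(Paths.get(path));
    }

    /** Cheap sanity check: true if the content starts with the PDF marker. */
    static boolean looksLikePdf(byte[] content) {
        if (content == null || content.length < PDF_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (content[i] != PDF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: PdfFileRoutine <path-to-pdf>");
            return;
        }
        byte[] content = readFileAsBytes(args[0]);
        System.out.println(content.length + " bytes, PDF header present: " + looksLikePdf(content));
    }
}
```

In a job, something like this could be called from a tJavaRow, for example `output_row.content = PdfFileRoutine.readFileAsBytes(input_row.path);`, where `content` (type byte[]) and `path` are hypothetical column names.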
Six Stars

Re: I am trying to download a pdf file, read the pdf file and load it directly into the database using Talend

Hi there,

If you just want to store the PDF file's binary data in a database BLOB field, this can be done very simply, as follows:

Use a tFileInputRaw component, with the Mode set to "Read the file as a bytes array":

tFileInputRaw2.png

Then, in the schema of the output component (e.g. a tMysqlOutput), set the DB Type to BLOB:

tMysqlOutput2_Schema.png
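
For anyone curious what this flow amounts to outside Talend, here is a rough plain-JDBC sketch of the same idea: the file content is bound as a byte[] to a BLOB column. This is not the code Talend generates. The table `pdf_documents`, its columns, and the class name `PdfBlobLoader` are assumptions made for illustration, and a JDBC driver for your database would need to be on the classpath.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Hypothetical loader class; names are illustrative only.
final class PdfBlobLoader {

    // Assumed table: pdf_documents(file_name VARCHAR(255), content BLOB)
    static final String INSERT_SQL =
            "INSERT INTO pdf_documents (file_name, content) VALUES (?, ?)";

    /** Inserts one row with the file name and raw bytes; returns the affected row count. */
    static int insertPdf(Connection conn, String fileName, byte[] content) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(INSERT_SQL)) {
            ps.setString(1, fileName);
            ps.setBytes(2, content); // binds the byte[] as binary data for the BLOB column
            return ps.executeUpdate();
        }
    }

    public static void main(String[] args) throws IOException, SQLException {
        if (args.length != 4) {
            System.err.println("usage: PdfBlobLoader <jdbc-url> <user> <password> <pdf-path>");
            return;
        }
        Path pdf = Paths.get(args[3]);
        byte[] content = Files.readAllBytes(pdf); // same idea as "Read the file as a bytes array"
        try (Connection conn = DriverManager.getConnection(args[0], args[1], args[2])) {
            int rows = insertPdf(conn, pdf.getFileName().toString(), content);
            System.out.println("Inserted " + rows + " row(s)");
        }
    }
}
```

Note that in MySQL a plain BLOB column holds at most 65,535 bytes, so larger PDFs would need MEDIUMBLOB or LONGBLOB.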

 

If, however, you want to read in the PDF and do some processing, e.g. extracting the text, then you will need to do this in Java code using a suitable library such as iText. Be aware that iText, whilst a superb and very feature-rich library, is not free for commercial use, so you would need to buy a licence.

I did take a quick look on Talend Exchange and found a free component, tTikaExtractor, which appears to offer extraction of text from PDF files, so this may be an option, although I have not used it myself.

Regards,

Chris