I am trying to download a pdf file, read the pdf file and load it directly into the database using Talend

I am trying to build a data flow that reads a PDF from the local server and loads the file into the database using Talend.

The data type of the column I am loading into is BLOB.

Re: I am trying to download a pdf file, read the pdf file and load it directly into the database using Talend

Hi,
I think you have to do this with a Java routine.
A solution is described at this link: https://www.talendforge.org/forum/viewtopic.php?id=6609

Re: I am trying to download a pdf file, read the pdf file and load it directly into the database using Talend

Hi there,

If you just want to store the PDF file's binary data in a database BLOB field, this can be done very simply, as follows:

Use a tFileInputRaw with the Mode set to "Read the file as a bytes array":

 

[Screenshot: tFileInputRaw2.png]

Then, in the schema of the output component (e.g. a tMysqlOutput), set the DB Type of the target column to BLOB:

 

[Screenshot: tMysqlOutput2_Schema.png]
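
For anyone who would rather see what those two components amount to in plain Java (for example inside a routine), here is a minimal sketch. It is not the code Talend generates. The class name PdfBlobLoader, the table pdf_documents and its columns file_name and content are made up for illustration, and the looksLikePdf check is an optional extra that is not part of the job described above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Hypothetical helper: roughly what tFileInputRaw -> tMysqlOutput does by hand.
final class PdfBlobLoader {

    // A well-formed PDF begins with the ASCII marker "%PDF-".
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    private PdfBlobLoader() {
    }

    /** Reads the whole file into memory as a byte array. */
    public static byte[] readPdfBytes(Path file) throws IOException {
        return Files.readAllBytes(file);
    }

    /** Optional sanity check: true if the data starts with the PDF marker. */
    public static boolean looksLikePdf(byte[] data) {
        if (data == null || data.length < PDF_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (data[i] != PDF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /**
     * Inserts one row and returns the number of rows written.
     * Table and column names are assumptions; adjust them to your schema.
     */
    public static int insertPdf(Connection conn, String fileName, byte[] data)
            throws SQLException {
        String sql = "INSERT INTO pdf_documents (file_name, content) VALUES (?, ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, fileName);
            ps.setBytes(2, data); // binds the raw bytes to the BLOB column
            return ps.executeUpdate();
        }
    }
}
```

Given an open java.sql.Connection called conn, a call such as PdfBlobLoader.insertPdf(conn, "report.pdf", PdfBlobLoader.readPdfBytes(Paths.get("/path/to/report.pdf"))) would load a single file. One thing to watch on MySQL: a plain BLOB column holds at most 65,535 bytes, so larger PDFs need MEDIUMBLOB or LONGBLOB.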

 

If, however, you want to read in the PDF and do some processing, e.g. extracting the text, then you will need to do this in Java code using a suitable library such as iText. Be aware that iText, whilst a superb and very feature-rich library, is dual-licensed under the AGPL and a commercial licence, so for closed-source commercial use you'd generally need to buy a licence.

 

I did take a quick look on Talend Exchange, and found a free component - tTikaExtractor - which appears to offer extraction of text from PDF files, so this may be an option, although I've not used this.

 

Regards,

Chris
