PDF data source in Talend

One Star

PDF data source in Talend

A widely popular format for storing information is pdf. Is there any connector that can be used to read the content of pdf file in Talend?

Re: PDF data source in Talend

pdf's are the nightmare data source for all ETL tools. Unfortunately Talend is not the exception.
Often, a PDF is represented as a single image. This means that to retrieve any information from the "text" of the PDF, you would have to implement OCR routines. This is not a small task and getting all of the data from a PDF correctly is a big risk of this design.
if you have thousands of PDF's that must be entered to the DB it *might* be worth it to implement OCR and integrate this into a Talend job. My advice is to try very hard to get your data in a machine readable format, and understand what you're getting into if you agree to parse PDF files.

What’s New for Talend Spring ’19

Watch the recorded webinar!

Watch Now


Introduction to Talend Open Studio for Data Integration.


Downloads and Trials

Test drive Talend's enterprise products.


Definitive Guide to Data Integration

Practical steps to developing your data integration strategy.