PDF data source in Talend

PDF is a very popular format for storing information. Is there a connector that can be used to read the content of a PDF file in Talend?

Re: PDF data source in Talend

PDFs are a nightmare data source for every ETL tool, and unfortunately Talend is no exception.
Often, a PDF is represented as a single image. This means that to retrieve any information from the "text" of the PDF, you would have to implement OCR routines. That is not a small task, and the big risk of this design is that you cannot count on all of the data being read from the PDF correctly.
If you have thousands of PDFs that must be entered into the DB, it *might* be worth implementing OCR and integrating it into a Talend job. My advice is to try very hard to get your data in a machine-readable format, and to understand what you're getting into before you agree to parse PDF files.
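
Before committing to OCR, it may help to check how many of your PDFs are really image-only. Below is a rough sketch of that triage step in plain Java, which could live in a custom Talend routine. It assumes you have already pulled the text of each page with some PDF library (for example Apache PDFBox's PDFTextStripper); that extraction step is not shown. The class name `PdfTriage`, the method names, and the 20-character threshold are made up for illustration and are not part of Talend.

```java
import java.util.Arrays;
import java.util.List;

public class PdfTriage {
    // Hypothetical threshold: a page with fewer visible characters than this
    // is treated as image-only (no usable text layer). Tune it for your files.
    static final int MIN_CHARS_PER_PAGE = 20;

    /** Counts pages whose extracted text is too short to be a real text layer. */
    static int countImageOnlyPages(List<String> pageTexts) {
        int count = 0;
        for (String text : pageTexts) {
            // Ignore whitespace so blank lines and spaces do not count as content.
            int visible = (text == null) ? 0 : text.replaceAll("\\s+", "").length();
            if (visible < MIN_CHARS_PER_PAGE) {
                count++;
            }
        }
        return count;
    }

    /** True when at least half of the pages look image-only, or there are no pages. */
    static boolean needsOcr(List<String> pageTexts) {
        if (pageTexts.isEmpty()) {
            return true;
        }
        return countImageOnlyPages(pageTexts) * 2 >= pageTexts.size();
    }

    public static void main(String[] args) {
        // Sample inputs standing in for text extracted by a PDF library.
        List<String> scanned = Arrays.asList("", "  \n");
        List<String> digital = Arrays.asList(
                "Invoice 1001, total due 250.00 EUR",
                "Terms: payment within 30 days");
        System.out.println("scanned needs OCR: " + needsOcr(scanned));
        System.out.println("digital needs OCR: " + needsOcr(digital));
    }
}
```

Files that pass this check can be parsed as text, so only the rest would need the OCR route. Even then, the extracted text has no guaranteed layout, so the warning above about getting all of the data correctly still applies.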