PDFs are the nightmare data source for all ETL tools, and unfortunately Talend is no exception. Often, a PDF is nothing more than an image of each page (a scanned document, for example), with no actual text inside. This means that to retrieve any information from the "text" of the PDF, you would have to implement OCR routines. This is not a small task, and the chance of not extracting all of the data correctly is a big risk of this design. If you have thousands of PDFs that must be loaded into the DB, it *might* be worth implementing OCR and integrating it into a Talend job. My advice is to try very hard to get your data in a machine-readable format, and to understand what you're getting into if you agree to parse PDF files.