Extracting data from PDF and Scanned images through Talend

Hi,

I have a requirement to extract data from PDF, Word, and scanned image files using Talend. Could anyone suggest the best component to use for this?

I am using Talend Big Data Platform version 6.3.

Thanks in advance!

Regards,

Pragya

1 REPLY
Re: Extracting data from PDF and Scanned images through Talend

You will need to use a third-party Java API for this. That is one of Talend's major advantages: you can call third-party APIs from a job, although you will need to be able to write Java to do so. Apache Tika (https://tika.apache.org/) is a good place to start.

You may find a component on Talend Exchange for the Word data, but I don't think there is a Talend component for extracting data from scanned images.
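
To make the Tika suggestion a bit more concrete, here is a minimal sketch of the kind of Java helper that could be called from a tJava or tJavaRow component, or kept in a user routine. It assumes the Tika jars, including the parser modules (for example the tika-app bundle), have been added to the job, and it uses Tika's simple facade class. The class name DocumentTextExtractor and its method names are made up for illustration; they are not part of Talend or Tika.

```java
import java.io.File;
import java.io.IOException;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

// Hypothetical helper class (name chosen for this example only).
public class DocumentTextExtractor {

    // The Tika facade picks a detector and parser based on the file content.
    private static final Tika TIKA = new Tika();

    // Returns the detected media type, e.g. "application/pdf".
    public static String detectType(String path) throws IOException {
        return TIKA.detect(new File(path));
    }

    // Returns the plain text Tika manages to extract from the file.
    // Note: by default the facade truncates long output; call
    // TIKA.setMaxStringLength(-1) beforehand if the full text is needed.
    public static String extractText(String path) throws IOException, TikaException {
        return TIKA.parseToString(new File(path));
    }

    // Small command-line entry point for trying it outside Talend.
    public static void main(String[] args) throws Exception {
        String path = args[0];
        System.out.println("Type: " + detectType(path));
        System.out.println(extractText(path));
    }
}
```

Inside a job, a call such as `String text = DocumentTextExtractor.extractText(filePath);` (where filePath is whatever variable holds the file location) should return the text of a PDF or Word file. For scanned images, Tika can hand the file to Tesseract OCR when Tesseract is installed on the machine. Without an OCR engine, expect little or no text back from an image.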

Rilhia Solutions