I have a requirement to extract data from PDF files, Word documents, and scanned images through Talend. Could anyone please suggest the best component to use for this?
I am using Talend Big Data Platform version 6.3.
Thanks in advance!
You will have to use third-party Java APIs for this. Being able to call third-party APIs is a major advantage of Talend, but you will need to write some Java to do it. Apache Tika (https://tika.apache.org/) is a good place to start.
You may find a component on Talend Exchange for the Word data, but I don't think there is a Talend component for getting data from scanned images.
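To give an idea of what the Tika route might look like, here is a minimal sketch. It assumes the Tika core and parser jars are on the job's classpath. The class name DocumentTextExtractor and both of its methods are hypothetical names made up for illustration; only the Tika facade calls (parseToString and setMaxStringLength) come from the library itself. For scanned images, Tika can hand the file to Tesseract OCR, but only if Tesseract is installed separately on the machine running the job, so treat that part as something to verify in your environment.

```java
import java.io.File;
import java.io.IOException;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

// Hypothetical helper class: the name and methods are illustrative,
// not part of Talend or Tika.
class DocumentTextExtractor {

    // Extracts plain text from a file. The Tika facade detects the file type
    // (PDF, Word, image, ...) and picks a parser from those on the classpath.
    static String extract(File file) throws IOException, TikaException {
        Tika tika = new Tika();
        // parseToString truncates long documents by default; -1 removes the limit.
        tika.setMaxStringLength(-1);
        return tika.parseToString(file);
    }

    // Collapses runs of whitespace (including line breaks) into single spaces,
    // since extracted PDF and OCR text is often irregularly spaced.
    static String collapseWhitespace(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: DocumentTextExtractor <file>");
            return;
        }
        System.out.println(collapseWhitespace(extract(new File(args[0]))));
    }
}
```

In a Talend job, one way to wire this up would be to load the Tika jars with tLibraryLoad, put the class in a custom routine, and call it from a tJavaRow, for example assigning DocumentTextExtractor.extract(new java.io.File(input_row.filePath)) to an output column. The filePath column is an assumed name here; use whatever your schema provides.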