One Star

OCR (Optical character recognition) Scanner for talend.

Hi
I have been scouring the internet for about an hour and a half looking for a way to scan a PDF or DOC file into Talend, but I have not found any answers or components that can help.
I was wondering if anyone has made, or knows of, a component that can scan documents and pass the extracted text into Talend for processing. At the moment I am using Free-OCR, which is not really the way I want to go, as I have to run the program before each Talend process, and that is not very efficient.
I'm really hoping someone has a solution to this.
Thanks in advance.
Dean Wake
P.S. I wasn't quite sure where to post this.
6 REPLIES
Moderator

Re: OCR (Optical character recognition) Scanner for talend.

Hi,
We don't have a component that can scan a PDF or DOC file. Talend is a code-generating ETL tool that uses Java as its underlying technology; the generated code performs the data extraction, transformation and loading.
Best regards
Sabrina
--
Don't forget to give kudos when a reply is helpful and click Accept the solution when you think you're good with it.
One Star

Re: OCR (Optical character recognition) Scanner for talend.

Is it possible for me to request such a component? I know it can be done in Talend, as there are many Java-based OCR SDKs.
Moderator

Re: OCR (Optical character recognition) Scanner for talend.

Hi,
You can open a JIRA issue in the Talend DI project of the JIRA bug tracker to request this new feature. Our component developers will see whether it can be made available in a future version.
You can also create a custom component yourself.
Please see the reference: componentCreation
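As a rough illustration of the do-it-yourself route, here is a minimal sketch of a helper class that calls an external OCR tool from Java, so the OCR step runs inside the job instead of by hand beforehand. It assumes the open-source Tesseract command-line tool is installed and on the PATH (it was not mentioned above; any OCR tool with a command-line interface could be swapped in). The class name OcrRoutine and its method names are hypothetical. Tesseract takes image input, so a PDF would first have to be converted to images by some other means.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; the class and method names are illustrative only.
// To use it as a Talend user routine it would likely need to be declared
// public and placed in the routines package (assumption).
class OcrRoutine {

    // Builds the command line: tesseract <image> stdout -l <lang>
    // ("stdout" tells Tesseract to print the recognized text instead of
    // writing it to a file).
    static List<String> buildCommand(String imagePath, String lang) {
        return Arrays.asList("tesseract", imagePath, "stdout", "-l", lang);
    }

    // Runs the OCR tool on one image and returns the recognized text.
    static String extractText(String imagePath, String lang)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(buildCommand(imagePath, lang));
        // Send the tool's diagnostics to the job's own stderr so that an
        // unread error stream cannot block the child process.
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process process = pb.start();

        StringBuilder text = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                text.append(line).append('\n');
            }
        }

        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("OCR command failed with exit code " + exitCode);
        }
        return text.toString();
    }
}
```

Inside a job, a tJava component could then call something like OcrRoutine.extractText(context.imagePath, "eng"), where context.imagePath is a hypothetical context variable holding the path of the scanned image, and pass the returned string on for processing.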
Best regards
Sabrina
One Star

Re: OCR (Optical character recognition) Scanner for talend.

Hi, you can use an OCR SDK to extract text or images from PDFs. It supports full-page OCR and both automatic and manual zonal OCR, and it can also do some simple image processing, such as deskew and despeckle.
http://www.rasteredge.com/how-to/csharp-imaging/ocr-sdk/
One Star

Re: OCR (Optical character recognition) Scanner for talend.

I have seen it, and it looked wonderful.
One Star

Re: OCR (Optical character recognition) Scanner for talend.

If you want to use free OCR, you can try this free online OCR service. It supports 40+ languages and can save the converted text to an editable TXT file and a searchable PDF document.