How to read pdf file in talend

One Star

How to read pdf file in talend

Hello,
I need help reading the content of a PDF file into a variable so that I can store it in a text field in a database.
What sort of component am I supposed to use?
The process:
- list the files in a folder: ok
- read the file name to find the database row: ok
- read the content of the file to put it in the database ... not ok :(
Does anyone have a solution?
Thanks,
David

Accepted Solutions
One Star

Re: How to read pdf file in talend

I can't share the project because it belongs to my company, sorry about that.
To make this work:
In the Talend Repository, create a new routine:
// template routine Java
package routines;
import java.io.*;
/*
* user specification: the function's comment should contain keys as follows: 1. write about the function's comment.but
* it must be before the "{talendTypes}" key.
*
* 2. {talendTypes} 's value must be talend Type, it is required . its value should be one of: String, char | Character,
* long | Long, int | Integer, boolean | Boolean, byte | Byte, Date, double | Double, float | Float, Object, short |
* Short
*
* 3. {Category} define a category for the Function. it is required. its value is user-defined .
*
* 4. {param} 's format is: {param} <type> <name>
*
* <type> 's value should be one of: string, int, list, double, object, boolean, long, char, date. <name>'s value is the
* Function's parameter name. the {param} is optional. so if you the Function without the parameters. the {param} don't
* added. you can have many parameters for the Function.
*
* 5. {example} gives a example for the Function. it is optional.
*/
public class fichierRef {
    /**
     * readFile: reads the PDF file and returns its content as a string.
     *
     * {talendTypes} String
     *
     * {Category} User Defined
     *
     * {param} string() input: the name of the file to read
     *
     * {example} readFile("/etc/passwd") # hacking in progress ...
     */
    public static String readFile(String fichier) {
        StringBuilder chaine = new StringBuilder();
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new FileInputStream(fichier)))) {
            String ligne;
            // read the file line by line, using the platform default charset
            while ((ligne = br.readLine()) != null) {
                chaine.append(ligne).append("\n");
            }
            return chaine.toString();
        } catch (Exception e) {
            // any error (missing file, read failure) yields an empty string
            return "";
        }
    }
}

In any tMap where you need it, use an expression like this:
routines.fichierRef.readFile(row3.filename).getBytes()
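
A PDF is a binary format, so reading it line by line through a Reader and then calling getBytes() can alter the bytes: line endings are normalised to "\n", and bytes that are not valid in the default charset may be replaced. If the goal is to store the file itself in a binary column, a minimal sketch along the following lines may be safer. The class name PdfBytes and the method readFileBytes are hypothetical, and the sketch assumes the target column accepts a byte[].

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of the original routine. In Talend it would be
// declared public and placed in the "routines" package.
class PdfBytes {

    /**
     * Reads the whole file as raw bytes, without any charset decoding.
     * Returns an empty array on an I/O error, mirroring the "" fallback
     * of readFile above.
     */
    static byte[] readFileBytes(String path) {
        try {
            return Files.readAllBytes(Paths.get(path));
        } catch (IOException e) {
            return new byte[0];
        }
    }
}
```

In a tMap this could then be called as routines.PdfBytes.readFileBytes(row3.filename), assuming the class has been added as a routine under that name.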

All Replies
Community Manager

Re: How to read pdf file in talend

Hello David
Unfortunately, there is no component that can be used to extract data from a PDF file. :(
Best regards

shong
----------------------------------------------------------
Talend | Data Agility for Modern Business
One Star

Re: How to read pdf file in talend

OK, I found a solution: use a tJava after a tFileExist with this code:
// needs "import java.io.*;" in the Import field of the tJava Advanced settings
String chaine = "";
InputStream ips = new FileInputStream((String) globalMap.get("tFileExist_2_FILENAME"));
InputStreamReader ipsr = new InputStreamReader(ips);
BufferedReader br = new BufferedReader(ipsr);
String ligne;
// read the file line by line and append each line to chaine
while ((ligne = br.readLine()) != null) {
    chaine += ligne + "\n";
}
br.close();
In the next component, use the chaine variable from the tJava component.
One Star

Re: How to read pdf file in talend

In the end I prefer another solution:
create a routine (in Java) with a function readFile;
in the tMap before the data insertion, use routines.classname.functionname(pdffilenametoread)
Community Manager

Re: How to read pdf file in talend

Hello friend
Can you share your job and routine on the forum?
Thanks for your support!
Best regards

shong
One Star

Re: How to read pdf file in talend

Note that you could use a PDF library (such as iText) to extract some metadata.
One Star

Re: How to read pdf file in talend

Hi,
Urgent, please.
I am new to Talend.
I need help reading a PDF and writing its contents to a txt file. Can someone help me get started?

I also tried adding the tFileOutputPDF component by putting it in the user component folder (Window > Preferences > Talend > Components > User component folder), but I cannot see it in the palette.
Please help me with some suggestions.

Thanks
jones
Four Stars

Re: How to read pdf file in talend

Hi Cabajones
tFileOutputPDF is a component that you can download from Talend Exchange (http://www.talendforge.org/exchange/).

thanks
B. Anil Kumar
Community Manager

Re: How to read pdf file in talend

hi,

I also tried adding the tFileOutputPDF component by putting it in the user component folder (Window > Preferences > Talend > Components > User component folder), but I cannot see it in the palette.
Please help me with some suggestions.

Thanks
jones

Hi Jones
tFileOutputPDF is used to write data to a PDF file. There is no component that can be used to read data from a PDF file; you need to hand-code the reading in a routine, as arfman did, and call it in a job.
Shong
One Star

Re: How to read pdf file in talend

Hi,
Thanks for the very useful information.
I have written a method to read the PDF.
Can you please help me add the method as a routine so that I can run the code from the Talend tool?
When I create a job I am able to view the code, but I am not able to edit it to add my method.
Please give me a suggestion.

Thanks
caba
Community Manager

Re: How to read pdf file in talend

Check out the documentation https://help.talend.com/search/all?query=Managing+user+routines&content-lang=en
and let us know if you need further assistance.

Re: How to read pdf file in talend

Hello Cabajones
Would you be so kind as to share your routine?
I am sure it would help others too.
thanks,

Re: How to read pdf file in talend

Is there any change in the status of this - "no component exists to read PDFs"?
Given the nature of PDFs, that's what I'd expect, just checking.
Seventeen Stars

Re: How to read pdf file in talend

Why should an ETL tool read a PDF file?

Re: How to read pdf file in talend

I agree it doesn't make good sense but my boss told me to ask. Your answer is reassuring :)
Seventeen Stars

Re: How to read pdf file in talend

Good question. At the moment you have to use self-written code in a tJavaFlex, but I do not know how to read a PDF.
I would google for it. Sorry.
The only problem is: a PDF can be created from images, and the structure of the text follows the layout and has no fixed structure like an HTML table. A solution would mainly be an individual solution for a particular PDF file, and every layout change in the file will have an impact on your code.
Four Stars

Re: How to read pdf file in talend

Is there any change in the status of "no component exists to read PDFs"?
Even if no component exists, is there any way to extract particular columnar data (no physical table structure is drawn in the PDF, but the data is visually divided into columns) and store it in DB table columns?
With Java code and the iText library in a routine I am able to read the PDF file, but, as mentioned above, how do I extract the columns from the PDF?
Any code or URL reference for this would be helpful.
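
One possible starting point, once the text has been extracted, is to split each line on runs of whitespace. The sketch below uses only the Java standard library and assumes that the extracted text keeps the visual columns at least two spaces apart; that depends on the PDF and on how the text was extracted, so it is not guaranteed. The class name ColumnSplitter and the method splitColumns are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only.
class ColumnSplitter {

    /**
     * Splits each non-blank line of already-extracted text into columns,
     * treating any run of two or more whitespace characters as a separator.
     * Single spaces inside a value (e.g. "Blue widget") are kept.
     */
    static List<String[]> splitColumns(String text) {
        List<String[]> rows = new ArrayList<>();
        for (String line : text.split("\\r?\\n")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            rows.add(trimmed.split("\\s{2,}"));
        }
        return rows;
    }
}
```

Each String[] could then be mapped to the columns of a DB row. Any change in the PDF layout is likely to require adjusting the separator rule.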
Sixteen Stars

Re: How to read pdf file in talend

Google "Java API for reading PDF files".
This is an unusual requirement (for the reasons already explained above), but if there is text in the PDF that can be retrieved, the best way is to write a Java routine that makes use of an existing Java API. One of Talend's big advantages over other tools is the ease with which you can write your own components or just add code to a tJavaFlex to make use of third-party APIs.
One Star

Re: How to read pdf file in talend

Hi Talend team,
We have a requirement to read data from one or more PDF files. We would like to know whether Talend provides any component that can read the content of PDF files.
I have gone through various posts found via Google, and most of them say it can be done with a piece of Java code. The issue is that such code is customized for a particular file and is not universally valid for any kind of PDF file. Could you share something on this so that I can get a clear picture and decide whether to go ahead with Talend as the ETL tool for my assignment? Any help would be appreciated.
Thanks