PDF and HTML parsers

Hi,
Does Talend support parsing PDF and HTML files? If so, could you please let me know how to do it?
Thanks & Regards,
Syed
5 REPLIES
Community Manager

Re: PDF and HTML parsers

Hi
There is a custom component, tPDFToText, on Talend Exchange:
http://www.talendforge.org/exchange/index.php?eid=346&product=tos&action=view&nav=1,1,1
It converts a PDF file to a text file, from which you can then extract a delimited area.
For HTML files, you can try the tHTTPTableInput component:
http://www.talendforge.org/exchange/index.php?eid=72&product=tos&action=view&nav=1,1,1
Best regards
Shong
----------------------------------------------------------
Talend | Data Agility for Modern Business

Re: PDF and HTML parsers

Hi Shong,
I have downloaded the PDF parser and added it to TOS. The component generates plain text, but I am not sure which component
to use to read this text file. My PDF file contains a table, which is converted to text as follows:
----------------------------------------------------------------------------------------
Instrument Details

Asset Type:CORPORATE DEBT Provider:BCP Golden Copy

Identifiers
ISIN XS0283708575
CUSIP EG1215284
SEDOL B1P8V35
CFI Code
Titu Code 65083001
Central Code
SIIB Code
RIC
Code Number NA
Issuer Details
Group Issue NO
--------------------------------------------------------------------------
From this file I need to prepare key-value pairs and load them into the DB.
Example:
ISIN : XS0283708575
CUSIP : EG1215284
SEDOL : B1P8V35
Please suggest how I can do this.
Thanks & Regards,
Syed
Community Manager

Re: PDF and HTML parsers

Hi
Do these three records always start with "ISIN", "CUSIP" and "SEDOL"? If so, use tFileInputFullRow to read the file line by line, filter the rows that start with "ISIN", "CUSIP" or "SEDOL" in tFilterRow, and then split each line into multiple fields with tExtractDelimitedFields. For example:
tFileInputFullRow--main-->tFilterRow-->tExtractDelimitedFields-->tLogRow
In tFilterRow, use the advanced mode and set the filter expression as below:
input_row.line.startsWith("ISIN")||input_row.line.startsWith("CUSIP")||input_row.line.startsWith("SEDOL")
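
Outside of a Talend job, the same filter-then-split logic can be sketched in plain Java. This is only an illustration of what the flow above does, not code generated by Talend: the class name IdentifierFilter and its methods are hypothetical, and it assumes the key and the value are separated by whitespace.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class; not part of Talend.
class IdentifierFilter {

    // Same test as the tFilterRow expression, applied to a plain String.
    static boolean keep(String line) {
        return line.startsWith("ISIN")
            || line.startsWith("CUSIP")
            || line.startsWith("SEDOL");
    }

    // Keep matching lines and split each one into key and value
    // on the first run of whitespace.
    static List<String[]> extract(List<String> lines) {
        List<String[]> out = new ArrayList<>();
        for (String line : lines) {
            if (!keep(line)) {
                continue;
            }
            String[] parts = line.trim().split("\\s+", 2);
            if (parts.length == 2) { // skip keys that have no value
                out.add(parts);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "Identifiers",
            "ISIN XS0283708575",
            "CUSIP EG1215284",
            "SEDOL B1P8V35",
            "CFI Code");
        for (String[] kv : extract(lines)) {
            System.out.println(kv[0] + " : " + kv[1]);
        }
    }
}
```

Note that splitting on the first whitespace only works for single-word keys such as these three.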

Best regards
Shong

Re: PDF and HTML parsers

Hi Shong,
I have set the 'Field Separator' to a space in the tExtractDelimitedFields component.
This works fine for the ISIN, CUSIP and SEDOL values, but I also have keys such as 'Titu Code' and 'Central Code', which contain a space themselves.
For these it does not work.
Can you please suggest how I can handle these?

Thanks & Regards,
Syed

Re: PDF and HTML parsers

Hi Shong,
Can you please suggest how to solve this issue?
Thanks & Regards,
Syed
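
One possible way to handle keys that contain a space, sketched in plain Java rather than as a confirmed Talend solution: instead of splitting on the space, match each line against a list of known keys and treat the remainder of the line as the value. The class name KnownKeyParser is hypothetical, and the key list is an assumption taken from the sample text earlier in this thread. In a job, similar logic could presumably live in a tJavaRow or a routine, but that is untested here.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class; not part of Talend.
class KnownKeyParser {

    // Assumed key list, copied from the sample text in this thread.
    // If one key is a prefix of another, list the longer key first.
    static final List<String> KEYS = Arrays.asList(
        "ISIN", "CUSIP", "SEDOL", "CFI Code", "Titu Code",
        "Central Code", "SIIB Code", "RIC", "Code Number", "Group Issue");

    // Returns key -> value in input order; keys without a value map to "".
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> out = new LinkedHashMap<>();
        for (String raw : lines) {
            String line = raw.trim();
            for (String key : KEYS) {
                // Match the whole key, followed by a space or end of line.
                if (line.equals(key) || line.startsWith(key + " ")) {
                    out.put(key, line.substring(key.length()).trim());
                    break;
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "Identifiers",
            "ISIN XS0283708575",
            "Titu Code 65083001",
            "Central Code",
            "Code Number NA",
            "Issuer Details");
        for (Map.Entry<String, String> e : parse(lines).entrySet()) {
            System.out.println(e.getKey() + " : " + e.getValue());
        }
    }
}
```

Lines that match no known key, such as section headings, are simply skipped, and a key with nothing after it (like 'Central Code' in the sample) is kept with an empty value.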