PDF and HTML parsers

Hi,
Does Talend support parsing PDF and HTML files? If so, could you please let me know how to do this?
Thanks & Regards,
Syed
Community Manager

Re: PDF and HTML parsers

Hi
There is a custom component, tPDFToText, on Talend Exchange:
http://www.talendforge.org/exchange/index.php?eid=346&product=tos&action=view&nav=1,1,1
It converts a PDF file to a text file, from which you can then extract a delimited area.
For HTML files, you can try the tHTTPTableInput component:
http://www.talendforge.org/exchange/index.php?eid=72&product=tos&action=view&nav=1,1,1
Best regards
Shong
----------------------------------------------------------
Talend | Data Agility for Modern Business

Re: PDF and HTML parsers

Hi Shong,
I have downloaded the PDF parser and added it to TOS. The component generates plain text, but I am not sure which component to use to read this text file. My PDF file contains a table, which is converted to text as follows:
----------------------------------------------------------------------------------------
Instrument Details

Asset Type:CORPORATE DEBT Provider:BCP Golden Copy

Identifiers
ISIN XS0283708575
CUSIP EG1215284
SEDOL B1P8V35
CFI Code
Titu Code 65083001
Central Code
SIIB Code
RIC
Code Number NA
Issuer Details
Group Issue NO
--------------------------------------------------------------------------
From this file I need to prepare key-value pairs and load them into the DB.
Example:
ISIN : XS0283708575
CUSIP : EG1215284
SEDOL : B1P8V35
Please suggest how I can do this.
Thanks & Regards,
Syed
Community Manager

Re: PDF and HTML parsers

Hi
Do these three records always start with "ISIN", "CUSIP" and "SEDOL"? If so, use a tFileInputFullRow to read the file line by line, filter the rows that start with "ISIN", "CUSIP" or "SEDOL" in a tFilterRow, and then split each line into multiple fields with tExtractDelimitedFields. For example:
tFileInputFullRow--main-->tFilterRow-->tExtractDelimitedFields-->tLogRow
In tFilterRow, use the advanced mode and set the filter expression as below:
input_row.line.startsWith("ISIN")||input_row.line.startsWith("CUSIP")||input_row.line.startsWith("SEDOL")
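To check what this expression keeps and drops before wiring it into a job, the same test can be run outside Talend. The following is only a minimal sketch: the class and method names are made up for illustration, and the sample lines are taken from the text posted earlier in this thread.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Talend: mirrors the tFilterRow expression above.
final class IdentifierLineFilter {

    // Same condition as the filter expression, applied to one line of text.
    // Assumes the line is not null.
    static boolean isIdentifierLine(String line) {
        return line.startsWith("ISIN")
                || line.startsWith("CUSIP")
                || line.startsWith("SEDOL");
    }

    // Keeps only the lines that pass the condition, preserving their order.
    static List<String> keepIdentifierLines(List<String> lines) {
        return lines.stream()
                .filter(IdentifierLineFilter::isIdentifierLine)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "Identifiers",
                "ISIN XS0283708575",
                "CUSIP EG1215284",
                "SEDOL B1P8V35",
                "CFI Code",
                "Titu Code 65083001");
        keepIdentifierLines(lines).forEach(System.out::println);
    }
}
```

Note that the condition only looks at the start of each line, so rows such as "Titu Code 65083001" are dropped unless their prefixes are added to the expression.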

Best regards
Shong

Re: PDF and HTML parsers

Hi Shong,
I have set the 'Field Separator' to a space in the tExtractDelimitedFields component.
This works fine for the ISIN, CUSIP and SEDOL values, but I also have keys such as 'Titu Code' and 'Central Code', which contain a space themselves.
For these it does not work.
Could you please suggest how to handle these?

Thanks & Regards,
Syed

Re: PDF and HTML parsers

Hi Shong,
Can you please suggest how to solve this issue?
Thanks & Regards,
Syed
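
The thread has no posted answer to the multi-word key problem. One possible approach, shown here only as a sketch, is to stop splitting on the space and instead match each line against a list of known key names, treating whatever follows the key as the value. The key list below is an assumption taken from the sample text in this thread, and the class and method names are made up for illustration. In a Talend job, logic like this would usually sit in a tJavaRow or a custom routine rather than in tExtractDelimitedFields.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, not part of Talend.
final class KeyValueSplitter {

    // Assumed key names, copied from the sample text in this thread.
    static final List<String> KNOWN_KEYS = List.of(
            "ISIN", "CUSIP", "SEDOL", "CFI Code", "Titu Code",
            "Central Code", "SIIB Code", "RIC", "Code Number", "Group Issue");

    // Returns the (key, value) pair for a line that starts with a known key,
    // or an empty Optional for any other line (headings, blank lines).
    // A key with nothing after it, such as "CFI Code", yields an empty value.
    static Optional<Map.Entry<String, String>> split(String line) {
        String trimmed = line.trim();
        for (String key : KNOWN_KEYS) {
            if (trimmed.equals(key)) {
                return Optional.of(Map.entry(key, ""));
            }
            if (trimmed.startsWith(key + " ")) {
                String value = trimmed.substring(key.length()).trim();
                return Optional.of(Map.entry(key, value));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "Identifiers",
                "ISIN XS0283708575",
                "CFI Code",
                "Titu Code 65083001",
                "Code Number NA");
        for (String line : lines) {
            split(line).ifPresent(e ->
                    System.out.println(e.getKey() + " : " + e.getValue()));
        }
    }
}
```

Matching against known keys avoids guessing where the key ends, which a plain space split cannot do for a line like "CFI Code" that has no value. The trade-off is that the key list has to be maintained whenever the PDF layout gains a new field.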
