How to use the PDFs at different URLs and integrate their parameters into a file

Seven Stars


Hi all, 

I saw the other topic posted. Unfortunately the solution does not fit my needs. 

I have several PDFs on the site https://www.cert.ssi.gouv.fr/. How can I extract information from each PDF, export it to a database, and integrate it into a table?

 

Thank you very much,

Twelve Stars


PDFs are not data files!!!
Are there tags inside?
Can you get them as Excel files?
Do you have an OCR tool?

Francois Denis

Tag as "solved" for others! Kudos to thanks!

Twelve Stars


Perhaps you need an RPA application.

Francois Denis


Seven Stars


First of all, thank you for your answer, but I didn't understand it correctly. What I want to do is take the PDFs from the following site, https://www.cert.ssi.gouv.fr/, extract the data table that exists in each PDF, and then integrate these into a table.

regards

Twelve Stars


It's good advertising for this site, but:
PDF files are for printing; they contain printable data.
Sometimes they also contain data in PDF tags, useful for indexing.
To convert a PDF to text, you need to use OCR.
RPA automates a human process.
Talend is an ETL tool: it works with data, it does not work with PDFs (as far as I know).

Francois Denis


Seven Stars



I found a tpdftotext component created by other users on Talend Exchange, but I need to extract the table that is in each PDF, so it doesn't work for me.

Employee


Hi,

 

    If you are using a custom component, I would suggest contacting the author of the component directly. Reading from PDFs is not a good strategy, as the data in a PDF is laid out for human readers. But if you have to read the data in the PDF, why not go to the source system that provides the data to the PDF and pick it up from there?

 

   That is the ideal approach in an enterprise environment.

 

Tail note: Amazon is creating a new feature called Textract to read PDFs, but it is currently in preview mode. Once it's ready, you can make API calls from Talend to get the result set. There are also a lot of third-party companies that allow API calls to fetch the data. You can try that route.

 

Warm Regards,
Nikhil Thampi

Please appreciate our Talend community members by giving Kudos for sharing their time for your query. If your query is answered, please mark the topic as resolved :-)

Seven Stars


Hi nikhilthampi,
First of all, thank you for your time and answer. I'm actually new to Talend, so let me ask my very simple question with an example: I would like to know how I can get the data in the DOCUMENT MANAGEMENT section of the page https://www.cert.ssi.gouv.fr/alerte/CERTFR-2019-ALE-008/
into a table like this:
Reference:              CERTFR-2019-ALE-008
Title:                  Vulnerability in Microsoft SharePoint Server
Date of first version:  29 May 2019
Date of last version:   29 May 2019
Source(s):              Microsoft Security Bulletin CVE-2019-0604 dated February 12, 2019

thanks 

regards

Employee


Hi,

 

    The simple answer is that there are no standard components in the Talend palette for this requirement. There might be components created by Talend community members on exchange.talend.com.

 

     The other option is to write custom Java code to read the data using routines in Talend, or to call a third-party API using REST calls from Talend.
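
For the custom Java route, here is a minimal sketch of what such code could look like. It assumes the text of the DOCUMENT MANAGEMENT block has already been extracted from the PDF or the page by some other means (a PDF library or a third-party API; that step is not shown). The class name DocumentManagementParser, its methods, and the label list are hypothetical: the labels are copied from the example in the question above, and nothing here is a Talend component. The code only splits each "label value" line into a field and builds one quoted CSV line per document.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Talend: turns the "label value" lines
// of one document into a row that can be loaded into a table.
final class DocumentManagementParser {

    // Assumption: these labels are the ones shown in the question's example.
    static final List<String> LABELS = Arrays.asList(
            "Reference",
            "Title",
            "Date of first version",
            "Date of last version",
            "Source(s)");

    // Maps each known label to the text that follows it on its line.
    // A colon right after the label is optional and is dropped.
    static Map<String, String> parse(String text) {
        Map<String, String> row = new LinkedHashMap<>();
        for (String rawLine : text.split("\\R")) {
            String line = rawLine.trim();
            for (String label : LABELS) {
                if (line.startsWith(label)) {
                    String value = line.substring(label.length()).trim();
                    if (value.startsWith(":")) {
                        value = value.substring(1).trim();
                    }
                    row.put(label, value);
                    break;
                }
            }
        }
        return row;
    }

    // Builds one CSV line in LABELS order. Every value is quoted, because
    // a value such as the source description may itself contain a comma.
    static String toCsvLine(Map<String, String> row) {
        StringBuilder sb = new StringBuilder();
        for (String label : LABELS) {
            if (sb.length() > 0) {
                sb.append(',');
            }
            String value = row.getOrDefault(label, "");
            sb.append('"').append(value.replace("\"", "\"\"")).append('"');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "Reference: CERTFR-2019-ALE-008\n"
                + "Date of first version      29 May 2019\n";
        // Labels missing from the input come out as empty quoted fields.
        System.out.println(toCsvLine(parse(text)));
    }
}
```

In a Talend job, the two static methods could be placed in a routine and called from a tJava or tJavaRow component, with the resulting fields mapped to the columns of the output table. Labels that are not found in the text simply produce empty fields, so the column order stays the same for every document.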

 

Warm Regards,
Nikhil Thampi


 
