Hi rhall_2_0, thank you so much for your solution. It is a very nice and well-structured tutorial.
Unfortunately, it does not solve my problem. I tried your solution, but I cannot get data out of dynamic sites like this one: "https://www.risultati.it/partita/MPX1oKd9/#informazioni-partita". There are many other sites of this type; I am interested in a particular football match. I cannot do this because I do not understand how the data is exposed on the site; I think it is fetched from a database in some way. The HTML that you get with tHttpRequest in Talend does not contain all of the data that I need. Any other ideas? Thank you again in advance.
Yes, I noticed this. In fact, I have been fighting with this for a while. It is not very urgent, because I do not have to do this at work, but it is a very interesting thing that I would like to do. However, I would like to thank you once again for your time! I hope to hear from you again if anyone finds a way to do it.
First of all, thank you for taking the time to help me.
In fact, I want to parse XML / HTML from the site https://www.cert.ssi.gouv.fr/
When I try to extract data from the HTML page that comes with the component, everything works fine, but that page is very simple: it does not have any divs or blockquotes and is structured using only tables. When I try a page that uses more HTML tags, such as blockquotes, it seems that tHTTPTableInput does not recognize the tables, so it throws:
"Exception in component tHTTPTableInput_1 java.lang.ArrayIndexOutOfBoundsException:"
I am expecting to get a table like that.
That is, I want to parse the HTML and extract all the CERTEFs with a title and a publication date, and all the VECs that exist in each CERTEF.
I do not know which component I can use, and with which configuration, to extract exactly that table.
Thank you for helping me.
This is not going to be easy, and there is no component I know of which will just do it for you. I think you will need to use a bit of code. I have written a post which describes how I achieved something very similar. It is very complicated, but there will not be an "easy" way of achieving this, I am afraid.
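As a rough illustration of what "a bit of code" might look like, here is a minimal sketch using only the Java standard library. It assumes that the "VECs" in the question are CVE identifiers of the form CVE-<year>-<number>; that is a guess, not something confirmed in this thread. The class name CveExtractor, the method extractCves, and the HTML fragment are all hypothetical, and this is not the third-party library approach from the post mentioned above. It only shows that identifiers can be pulled out of a page regardless of whether they sit in tables, divs, or blockquotes.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: not part of Talend or of any library named in this thread.
final class CveExtractor {

    // Assumption: identifiers look like CVE-<4-digit year>-<4 or more digits>.
    private static final Pattern CVE = Pattern.compile("CVE-\\d{4}-\\d{4,}");

    /** Returns the distinct identifiers found in the text, in order of first appearance. */
    static List<String> extractCves(String html) {
        Set<String> found = new LinkedHashSet<>();
        Matcher m = CVE.matcher(html);
        while (m.find()) {
            found.add(m.group());
        }
        return new ArrayList<>(found);
    }

    public static void main(String[] args) {
        // Made-up fragment for illustration; not copied from the real site.
        String html = "<div><blockquote><a href=\"#\">CVE-2019-0001</a></blockquote>"
                + "<p>See also CVE-2019-12345 and CVE-2019-0001.</p></div>";
        System.out.println(extractCves(html)); // prints [CVE-2019-0001, CVE-2019-12345]
    }
}
```

In a Talend job, something like this could be called from a tJavaRow or a custom routine on the raw page content. Titles and publication dates would still need page-specific parsing, which a plain regular expression handles poorly.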
First, thank you for your answer.
Anyway, I created a job as follows, but I have a problem writing the code.
I searched the questions in the community and found this, but it doesn't work: https://www.rilhia.com/tutorials/using-third-party-java-library-scrape-content-table-web-page
I also don't know how I can use this approach for my project, because the site has several divs, PDFs, and links, and the data is not in specific tables.
@mitra1367 I recreated the work that was hosted at the link that no longer works here: https://community.talend.com/t5/Design-and-Development/Extract-Multiple-table-using-tHTTPTableInput-...
I think I may have said that it is tricky and you need to understand the third party Java API that I mention. Take a look at the documentation for that.
Unfortunately, scraping websites is notoriously hard because there is no standard way of displaying data, so your solution will usually be entirely bespoke to your problem.