Five Stars

Extract Data from URL in Talend

Hello everyone. I'm trying to do web crawling with Talend, and I have managed it both with the tHttpInputTable component from Talend Exchange and with Java code, by importing the jsoup library into a tJavaFlex component. The results are great, and I would like to be able to do the same on any site, but I'm still new to this field. Could someone give me a short overview and tell me what I'm missing? For simple, static sites such as "http://www.imdb.com/title/tt0068646/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-322d-4646-8962-327b42fe...", which is a movie rating site, I can do it with a few lines of Java. For more complex, non-static sites such as "https://www.risultati.it", which is a live soccer results site, I cannot. What am I missing?
Is jsoup not powerful enough to crawl all kinds of sites? Thanks in advance to anyone who takes the time to open up this field to a newcomer.
4 REPLIES
Fifteen Stars

Re: Extract Data from URL in Talend

I wrote a tutorial on exactly this about 3 years ago. You can find it here: https://www.rilhia.com/tutorials/using-third-party-java-library-scrape-content-table-web-page

I suspect that better libraries are now available, but the thing I learnt doing this is that you need to build your job with the website it is going to scrape in mind. Think of a website as being like XML. You can't build a single job to handle all types of XML. The same applies to websites.

I hope the tutorial gives you a few ideas.
Rilhia Solutions
Five Stars

Re: Extract Data from URL in Talend

Hi rhall_2_0, thank you so much for your solution, a very nice and very well structured tutorial.

Unfortunately, this does not solve my problem. I tried your solution, but I cannot get the data out of these dynamic sites, such as "https://www.risultati.it/partita/MPX1oKd9/#informazioni-partita", and there are many others of this type. I'm interested in a particular football match, and I cannot extract it because I don't understand how the data is exposed on the site; I think it is fetched from a database in some way. The HTML that you get with tHttpRequest in Talend does not contain all of the data that I need. Any other ideas? Thank you again in advance.

Fifteen Stars

Re: Extract Data from URL in Talend

I've looked at the source of the page you posted, and I think you might struggle with this. The page appears to have been written to obfuscate the data and prevent scraping, and it uses a lot of JavaScript. I am not sure you will be able to do this. You *could* try saving the pages locally as HTML and then processing them. That *might* make it slightly easier.
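
A quick way to confirm whether a page falls into this category is to check if the value you see in the browser is present in the raw HTML at all, outside of script blocks. Below is a minimal sketch using only the Java standard library. The class name, method names and sample HTML strings are all made up for illustration (they are not taken from either site mentioned in this thread), and the regex-based tag stripping is a rough diagnostic, not a real HTML parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: checks whether a value is present in the static markup of a page.
class StaticHtmlCheck {

    // Matches a whole <script>...</script> block, including inline code.
    private static final Pattern SCRIPT_BLOCK = Pattern.compile(
            "<script\\b[^>]*>.*?</script\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Matches any remaining tag. Crude, but enough for a yes/no check.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    // Returns the text left after removing script blocks and tags.
    static String visibleText(String html) {
        String withoutScripts = SCRIPT_BLOCK.matcher(html).replaceAll(" ");
        String withoutTags = TAG.matcher(withoutScripts).replaceAll(" ");
        return withoutTags.replaceAll("\\s+", " ").trim();
    }

    // True if the expected value appears in the static text of the page.
    static boolean isInStaticMarkup(String html, String expected) {
        return visibleText(html).contains(expected);
    }

    // Counts script blocks, as a rough hint of how JavaScript-heavy the page is.
    static int countScriptBlocks(String html) {
        Matcher matcher = SCRIPT_BLOCK.matcher(html);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Made-up example of a static page: the value is in the markup itself.
        String staticPage =
                "<html><body><span class=\"rating\">9.2</span></body></html>";
        // Made-up example of a dynamic page: the markup is an empty placeholder
        // that a script would fill in after the page loads.
        String dynamicPage =
                "<html><body><div id=\"score\"></div>"
                + "<script>loadScore('example-id');</script></body></html>";

        System.out.println(isInStaticMarkup(staticPage, "9.2"));
        System.out.println(isInStaticMarkup(dynamicPage, "2 - 1"));
        System.out.println(countScriptBlocks(dynamicPage));
    }
}
```

If the check comes back false for a value you can clearly see in the browser, the data is most likely being filled in by JavaScript after the page loads, and a plain HTTP fetch followed by HTML parsing will not see it. Running the same check on a copy of the page saved from the browser would show whether saving locally actually captures the data.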

Rilhia Solutions
Five Stars

Re: Extract Data from URL in Talend

Yes, I noticed this. In fact, I have been fighting with it for a while. It's not urgent, because I don't have to do this for work, but it's a very interesting thing that I'd like to do. In any case, thank you once again for your time! I hope to hear back if anyone finds a way to do it.