Using Talend to crawl a website

One Star

Hi,
My use case is this: I want to crawl a website, say talend.com, and extract all the information on the site into Hadoop. After that I want to search for specific strings in the data, use them to populate Hive, and create a report.
I want to use Talend to pull data from a website and store it in Hadoop. I watched this video:
http://www.talend.com/resources/webinars/watch/215#validatewebinar
Based on this, when I use a tFileFetch or tHttpRequest and connect to a URI, say "http://talend.com", I only get the first page, which I can save to a file. How can I iterate over the entire contents of the site? I need to know each distinct URL, like talend.com/products etc. How can I iteratively fetch all pages under a master URL?
Seventeen Stars

Re: Using Talend to crawl a website

I would use a regular expression to filter the content of the first page for links. After collecting all the links, you can iterate over them, fetch each one, and repeat the process for the new pages you find, and so on.
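A rough sketch of that idea in plain Java is below (the kind of logic you could put in a custom routine or a tJavaFlex). The regex, the 50-page cap, and the SimpleCrawler/fetch names are only illustrative assumptions, not anything from the Talend API:

import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Scanner;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimpleCrawler {

    // Very rough pattern for absolute href attributes; a real crawler
    // would use an HTML parser and also handle relative links.
    private static final Pattern LINK = Pattern.compile("href=[\"'](http[^\"']+)[\"']");

    public static void main(String[] args) throws Exception {
        String start = "http://talend.com";            // the master URL
        Deque<String> toVisit = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        toVisit.add(start);

        while (!toVisit.isEmpty() && seen.size() < 50) { // small cap for the example
            String url = toVisit.poll();
            if (!seen.add(url)) continue;               // skip pages already fetched

            String html = fetch(url);                   // same role as tFileFetch/tHttpRequest
            // here you would hand "html" to the rest of the job, e.g. write it to HDFS

            Matcher m = LINK.matcher(html);
            while (m.find()) {
                String link = m.group(1);
                if (link.startsWith(start)) {           // stay inside the site
                    toVisit.add(link);
                }
            }
        }
    }

    // Fetch a page body as a string.
    private static String fetch(String url) throws Exception {
        try (Scanner s = new Scanner(new URL(url).openStream(), StandardCharsets.UTF_8.name())) {
            return s.useDelimiter("\\A").hasNext() ? s.next() : "";
        }
    }
}

In a Talend job you would typically keep the collected URLs in a flow (or a context list) and feed them back into tFileFetch/tHttpRequest via an iterate link, rather than doing everything in one Java block as shown here.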
