One Star

Using Talend to crawl a website

My use case is this: I want to crawl a given website and extract all the information on it into Hadoop. After that I want to search the data for specific strings, use the results to populate Hive, and create a report.
I want to use Talend to pull data from a website and store it in Hadoop. I watched this video.
Based on it, when I use a t_FileFetch or t_HttpRequest component and connect to a URI, say "", I only get the first page, which I can save to a file. How can I iterate over the entire contents of a directory? I would need to know each distinct URL, etc. How can I iteratively fetch all files under a master URL?
Seventeen Stars

Re: Using Talend to crawl a website

I would use a regular expression to filter the content of the first page for links. After collecting all the links, you can iterate over them and repeat the process on each page.
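
Outside of Talend, the link-extraction step suggested above can be sketched roughly like this (a minimal illustration, not a Talend job; the pattern and sample HTML are assumptions, and a real crawler should use an HTML parser rather than a bare regex):

```python
import re
from urllib.parse import urljoin

def extract_links(base_url, html):
    """Filter page content for href links with a regular expression
    and resolve relative paths against the base URL."""
    # Minimal pattern for illustration; skips fragment-only links.
    hrefs = re.findall(r'href=["\']([^"\'#]+)["\']', html, re.IGNORECASE)
    return [urljoin(base_url, h) for h in hrefs]

# Sample content standing in for the first fetched page.
html = '<a href="/docs/page1.html">One</a> <a href="http://example.com/page2">Two</a>'
links = extract_links("http://example.com/", html)
print(links)
# → ['http://example.com/docs/page1.html', 'http://example.com/page2']
```

Each extracted URL can then be fetched in turn (in Talend, for example, by feeding the list into an iterating fetch step), and the same extraction applied to every new page until no unseen links remain.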