Component to extract hyperlinks from a web page (HTML, PHP or ASPX)

I am a Talend Open Source newbie (1 week) and I need a component to extract a list of hyperlinks from an HTML page I download with tFileFetch.
The specific hyperlinks I need point to downloadable data files. If I can get a complete list of hyperlinks (one per row in a file),
then in a second step I can filter the list for the ones I am interested in, and in a third step I can iterate over the list and use string functions (from Talend Code\Routines) to build the URLs I want to pass to another tFileFetch to download the 50+ data files on a daily basis.
I have successfully downloaded the HTML page by feeding the original HTML link to tFileFetch.
By HTML hyperlinks I mean everything between "<A" and "</A>".
In general, extracting hyperlinks can be done with Regular Expressions or with XML/XQuery, but Talend's components
assume something close to a regular row-and-column structure (a schema) and blow up on malformed or loosely structured HTML.
Slightly off topic -- one exception (for my application) might be Exchange component tHTTPTableInput (how to install in TOS?).
I researched the topic and found convoluted Regular Expressions (RegEx):
<a.*href=('|")?(http\://.*?(?=\1)).*>\s*(+|.*?)?\s*</a>
http://vidmar.net/weblog/archive/2009/09/10/matching-links-with-regular-expression-in-html.aspx
and this interesting February 2008 blog post, "Showdown - Java HTML Parsing Comparison",
on extracting hyperlinks using XML/XQuery from Java.
http://www.benmccann.com/dev-blog/java-html-parsing-library-comparison/
"So, to test the parsing libraries, I decided to do exactly that and see if I could parse the HTML well enough to extract links from it using an XQuery. The contenders were NekoHTML, HtmlCleaner, TagSoup, and jTidy. "
* * *
"I gave each library an InputStream created from a URL (referred to as urlIS in the code samples below) and expected an org.w3c.dom.Node in return once the parse operation was completed. "
* * *
"Finally, to judge the ability to parse the HTML, I ran the XQuery ?//a? to grab all the <a> tags from the document ."
NOTE: Compare the XQuery "//a" to the Regular Expression "<a.*href=('|")?(http\://.*?(?=\1)).*>\s*(+|.*?)?\s*</a>".
"The only one of these parsing libraries I had used before was jTidy. It was able to extract the links from 5 of the 10 documents. However, the clear winner was HtmlCleaner. It was the only library to successfully clean 10/10 documents. "
* * *
"One drawback to HtmlCleaner is that it?s not available in a Maven repository. Sometimes NekoHTML may be easier to use for this reason."
http://www.benmccann.com/dev-blog/java-html-parsing-library-comparison/
The blog post does not give the complete Java code:
"I implemented each library in its own class extending from an AbstractScraper implementing a Scraper interface I created. "
* * *
"The implementation specific code for each library is below"
So, if we can get the complete Java code from the blog post author, can this be implemented in a custom-code tJava component?
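Even without the author's code, I imagine the HtmlCleaner part would look roughly like the sketch below. This is only my guess, not the blog's implementation: it assumes the HtmlCleaner jar is on the job's classpath (e.g. loaded with tLibraryLoad), and the file paths are placeholders. I wrote it as a standalone program for testing; the body of main() is what I would paste into a tJava.

import java.io.File;
import java.io.FileWriter;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;

public class LinkExtractorSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical input: the page already downloaded by tFileFetch
        File htmlFile = new File("C:/talend/downloads/page.html");

        // HtmlCleaner tolerates malformed HTML and builds a clean tag tree
        HtmlCleaner cleaner = new HtmlCleaner();
        TagNode root = cleaner.clean(htmlFile);

        // The equivalent of the XQuery "//a": every <a> element, at any depth
        TagNode[] anchors = root.getElementsByName("a", true);

        // Write one hyperlink per row, ready for the later filter step
        FileWriter out = new FileWriter("C:/talend/downloads/links.txt");
        try {
            for (TagNode a : anchors) {
                String href = a.getAttributeByName("href");
                if (href != null) {
                    out.write(href + System.getProperty("line.separator"));
                }
            }
        } finally {
            out.close();
        }
    }
}

If something like that works, the one-link-per-row file would feed straight into the filter step I described above.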
As I mentioned at the beginning, I have downloaded a page using tFileFetch
and if I can get a complete list of hyperlinks (one per row in a file)
in a second step I can filter the list (using which Talend component?) for the URLs I am interested in
and then in a third step I can iterate over the list and use string functions (from Talend Code\Routines)
to build the URLs I want to pass to another tFileFetch to download the 50+ data files on a daily basis.
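For that third step, my understanding is that a routine under Code\Routines is just a Java class of static methods, so I picture something along these lines (the base URL and the relative/absolute handling below are made-up examples for illustration, not my real site):

package routines;

public class LinkRoutines {

    /**
     * Turn an href extracted from the page into an absolute URL that can be
     * fed to tFileFetch. The base URL is a placeholder for illustration.
     */
    public static String buildDownloadUrl(String href) {
        String base = "http://www.example.com"; // placeholder site
        if (href == null) {
            return null;
        }
        if (href.startsWith("http://") || href.startsWith("https://")) {
            return href;            // already an absolute URL
        }
        if (href.startsWith("/")) {
            return base + href;     // site-relative link
        }
        return base + "/" + href;   // page-relative link (simplified)
    }
}

In a tMap or tJavaRow expression the call would then be something like LinkRoutines.buildDownloadUrl(row1.href).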
But first, I have to get over this hump (extracting the links) -- can you help?
Thanks
Jim

Re: Component to extract hyperlinks from a web page (HTML, PHP or ASPX)

Another approach:
Java - extract an HTML tag from a String using Pattern and Matcher
http://devdaily.com/blog/post/java/how-extract-html-tag-string-regex-pattern-matcher-group
"Use the Java Pattern and Matcher classes, and supply a regular expression (regex) to the Pattern class that defines the tag you want to extract. Then use the find method of the Matcher class to see if there is a match, and if so, use the group method to extract the actual group of characters from the String that matches your regular expression."
"In the following source code I demonstrate how to extract the contents from a code tag from a longer HTML string:"
* * *
"It's important to note that this example is hard-coded to look for only one occurrence of this group. In a more robust example, where you want to find and extract the contents of every code tag, your code would look more like this, using a while loop with the find method:"
http://devdaily.com/blog/post/java/how-extract-html-tag-string-regex-pattern-matcher-group
This approach seems simpler than a full-blown SAX or DOM parser.
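Adapted to hyperlinks, I would expect the while/find loop to look something like this rough sketch (my own guess, not the article's code; the regex only handles simple, quoted href attributes):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AnchorTagScraper {
    public static void main(String[] args) {
        // A small hard-coded snippet standing in for the page fetched by tFileFetch
        String html = "<p>Data: <a href=\"/files/2010-01-01.csv\">Jan 1</a>"
                    + " and <a href='/files/2010-01-02.csv'>Jan 2</a></p>";

        // group(0) is the whole matched <a ...> tag; group(1) is the first
        // capturing group, i.e. the href value without its quotes
        Pattern anchor = Pattern.compile(
                "<a[^>]+href=[\"']([^\"']+)[\"'][^>]*>",
                Pattern.CASE_INSENSITIVE);

        Matcher m = anchor.matcher(html);
        while (m.find()) {
            System.out.println(m.group(1));  // prints each href, one per line
        }
    }
}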
Jim

Re: Component to extract hyperlinks from a web page (HTML, PHP or ASPX)

I have a proof-of-concept program working, but it requires pre-processing of the HTML file.
The pre-processing consists of inserting a blank space and an end-of-line string after every
</A> tag.
For proof of concept I did the pre-processing in MS Word.
I hope to be able to do the pre-processing using GNU SED (stream editor).
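In the meantime, the same substitution can be expressed directly in Java with a single replaceAll; this is just a sketch with made-up file paths:

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class PreprocessHtml {
    public static void main(String[] args) throws Exception {
        // Made-up input/output paths for illustration
        String html = new String(
                Files.readAllBytes(Paths.get("C:/talend/downloads/page.html")),
                StandardCharsets.UTF_8);

        // Insert a blank space and a line break after every </A> (case-insensitive)
        // so that each anchor tag pair lands on its own line
        String oneTagPerLine = html.replaceAll("(?i)</a>",
                "</a> " + System.getProperty("line.separator"));

        Files.write(Paths.get("C:/talend/downloads/page-split.html"),
                oneTagPerLine.getBytes(StandardCharsets.UTF_8));
    }
}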
While researching SED, I ran across this thread that was relevant to the original topic.
New To Java - java 'sed' like functionality?
http://forums.sun.com/thread.jspa?threadID=743023
Code examples include reading the file name from the command line and
reading the entire file into a string (warning: you have to keep the regex from
matching end tags belonging to later tag pairs -- that's why I read the input a line
at a time and pre-process to make sure each tag pair is on a separate line).
If Java uses zero-based arrays, why is the matched string found at element one?
And do the single-letter variables mean they are using Generics?
Jim