Component to extract hyperlinks from a web page (HTML, PHP or ASPX)

I am a Talend Open Source newbie (1 week) and I need a component to extract a list of hyperlinks from an HTML page I download with tFileFetch.
The specific hyperlinks I need to extract point to downloadable data files. If I get a complete list of hyperlinks (one per row in a file),
then in a second step I can filter the list for the ones I am interested in, and in a third step I can iterate over the list and use string functions (from Talend Code\Routines) to build the URLs I want to pass to another tFileFetch to download the 50+ data files on a daily basis.
I have successfully downloaded the HTML page by feeding the original HTML link to tFileFetch.
By HTML hyperlinks I mean everything between "<A" and "</A>".
In general, extracting hyperlinks can be done with Regular Expressions or with an XML parser plus an XQuery, but Talend's components
assume something close to a regular row and column structure (a schema) and blow up on malformed or loosely structured HTML.
Slightly off topic -- one exception (for my application) might be the Exchange component tHTTPTableInput (how do I install it in TOS?).
I researched the topic and found convoluted Regular Expressions (RegEx):
<a.*href=('|")?(http\://.*?(?=\1)).*>\s*(+|.*?)?\s*</a>
http://vidmar.net/weblog/archive/2009/09/10/matching-links-with-regular-expression-in-html.aspx
and this interesting February 2008 blog post, "Showdown - Java HTML Parsing Comparison",
on extracting hyperlinks from Java using an XQuery.
http://www.benmccann.com/dev-blog/java-html-parsing-library-comparison/
"So, to test the parsing libraries, I decided to do exactly that and see if I could parse the HTML well enough to extract links from it using an XQuery. The contenders were NekoHTML, HtmlCleaner, TagSoup, and jTidy. "
* * *
"I gave each library an InputStream created from a URL (referred to as urlIS in the code samples below) and expected an org.w3c.dom.Node in return once the parse operation was completed. "
* * *
"Finally, to judge the ability to parse the HTML, I ran the XQuery '//a' to grab all the <a> tags from the document."
NOTE: Compare the XML/XQUERY "//a" to the Regular Expression "<a.*href=('|")?(http\://.*?(?=\1)).*>\s*(+|.*?)?\s*</a>".
"The only one of these parsing libraries I had used before was jTidy. It was able to extract the links from 5 of the 10 documents. However, the clear winner was HtmlCleaner. It was the only library to successfully clean 10/10 documents. "
* * *
"One drawback to HtmlCleaner is that it's not available in a Maven repository. Sometimes NekoHTML may be easier to use for this reason."
http://www.benmccann.com/dev-blog/java-html-parsing-library-comparison/
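For comparison, here is a minimal sketch of running the "//a" query with only the JDK's built-in XML and XPath classes (XPath rather than a full XQuery engine, which the JDK does not include). It assumes the input is already well-formed XML, which real-world HTML usually is not; that gap is what the cleaning libraries above are meant to fill. The class and method names (XPathLinks, hrefs) and the sample markup are made up for illustration and are not from the blog post.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: evaluates //a/@href against markup that is already well-formed.
class XPathLinks {

    static List<String> hrefs(String wellFormedXhtml) {
        try {
            // Parse the string into an org.w3c.dom.Document (fails on malformed HTML).
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wellFormedXhtml)));
            // Select the href attribute of every <a> element, in document order.
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//a/@href", doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getNodeValue());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Input could not be parsed as well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><a href=\"a.csv\">A</a>"
                + "<p><a href=\"b.csv\">B</a></p></body></html>";
        for (String href : hrefs(page)) {
            System.out.println(href);
        }
    }
}
```

Note that XPath is case-sensitive, so "//a" would miss <A> tags unless element names are normalized to lower case first.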
The blog post does not give the complete Java code:
"I implemented each library in its own class extending from an AbstractScraper implementing a Scraper interface I created. "
* * *
"The implementation specific code for each library is below"
So, if we can get the complete Java code from the blog post author, can this be implemented in a custom-code tJava component?
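Without the blog author's code, one possible starting point is the lenient HTML parser that ships with the JDK (javax.swing.text.html.parser.ParserDelegator), used here as a stand-in for NekoHTML or HtmlCleaner. It is a different technique from the XQuery approach in the blog post, and how well it copes with a given page is untested. The code is only a guess at how the logic might look as a Talend routine called from tJava; the names HrefCollector and collectHrefs are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects the href of every <a> start tag
// using the JDK's built-in, error-tolerant HTML parser.
class HrefCollector {

    static List<String> collectHrefs(String html) {
        final List<String> hrefs = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        hrefs.add(href.toString());
                    }
                }
            }
        };
        try {
            // Third argument: ignore any charset declared inside the document.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return hrefs;
    }

    public static void main(String[] args) {
        // Mixed-case tags and an unclosed <p>, to mimic loosely structured HTML.
        String page = "<html><body><A HREF=\"data/file1.csv\">File 1</A>"
                + "<p>unclosed paragraph <a href='http://example.com/file2.zip'>File 2</a>"
                + "</body></html>";
        for (String href : collectHrefs(page)) {
            System.out.println(href); // one link per row
        }
    }
}
```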
As I mentioned at the beginning, I have downloaded a page using tFileFetch,
and if I can get a complete list of hyperlinks (one per row in a file),
in a second step I can filter the list (using which Talend component?) for the URLs I am interested in,
and then in a third step I can iterate over the list and use string functions (from Talend Code\Routines)
to build the URLs I want to pass to another tFileFetch to download the 50+ data files on a daily basis.
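The second and third steps might be sketched in plain Java as follows, assuming the links have already been extracted into a list. The names DownloadUrlBuilder and buildDownloadUrls, the ".csv" suffix, and the example.com addresses are placeholders, not anything from the actual job; in Talend the same logic could live in a routine.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: keeps only links with a given suffix and turns them
// into absolute URLs relative to the page they were found on.
class DownloadUrlBuilder {

    static List<String> buildDownloadUrls(String pageUrl, List<String> hrefs, String suffix) {
        URI base = URI.create(pageUrl);
        String wanted = suffix.toLowerCase(Locale.ROOT);
        List<String> urls = new ArrayList<>();
        for (String href : hrefs) {
            String trimmed = href.trim();
            // Step 2: filter, ignoring case in the file extension.
            if (trimmed.toLowerCase(Locale.ROOT).endsWith(wanted)) {
                // Step 3: resolve relative links against the page URL.
                // URI.resolve throws IllegalArgumentException for hrefs that
                // are not valid URIs (for example, ones containing spaces).
                urls.add(base.resolve(trimmed).toString());
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        List<String> hrefs = Arrays.asList(
                "data/a.csv", "about.html", "/files/b.CSV", "http://other.example.org/c.csv");
        for (String url : buildDownloadUrls("http://example.com/reports/index.html", hrefs, ".csv")) {
            System.out.println(url); // each of these could be fed to the second tFileFetch
        }
    }
}
```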
But first, I have to get over this hump (extracting the links) -- can you help?
Thanks
Jim

Re: Component to extract hyperlinks from a web page (HTML, PHP or ASPX)

Another approach:
Java - extract an HTML tag from a String using Pattern and Matcher
http://devdaily.com/blog/post/java/how-extract-html-tag-string-regex-pattern-matcher-group
"Use the Java Pattern and Matcher classes, and supply a regular expression (regex) to the Pattern class that defines the tag you want to extract. Then use the find method of the Matcher class to see if there is a match, and if so, use the group method to extract the actual group of characters from the String that matches your regular expression."
"In the following source code I demonstrate how to extract the contents from a code tag from a longer HTML string:"
* * *
"It's important to note that this example is hard-coded to look for only one occurrence of this group. In a more robust example, where you want to find and extract the contents of every code tag, your code would look more like this, using a while loop with the find method:"
http://devdaily.com/blog/post/java/how-extract-html-tag-string-regex-pattern-matcher-group
This approach seems simpler than a full-blown SAX or DOM parser.
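A minimal sketch of that while-loop idea applied to anchor tags instead of code tags. This is not the code from the devdaily post; the class name AnchorFinder, the pattern, and the sample input are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: returns every <a ...>...</a> element found in a string.
class AnchorFinder {

    // \b keeps "<a" from matching tags such as <abbr>; ".*?" is reluctant,
    // so each match ends at the first closing tag. DOTALL lets a link span lines.
    private static final Pattern ANCHOR =
            Pattern.compile("<a\\b.*?</a>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static List<String> findAnchors(String html) {
        List<String> anchors = new ArrayList<>();
        Matcher matcher = ANCHOR.matcher(html);
        while (matcher.find()) {          // loop until no further match is found
            anchors.add(matcher.group()); // the text of the current match
        }
        return anchors;
    }

    public static void main(String[] args) {
        String html = "<p><A HREF=\"one.csv\">One</A> text <abbr>x</abbr> "
                + "<a href='two.csv'>\nTwo</a></p>";
        for (String anchor : findAnchors(html)) {
            System.out.println(anchor);
        }
    }
}
```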
Jim

Re: Component to extract hyperlinks from a web page (HTML, PHP or ASPX)

I have a proof of concept program working, but it requires pre-processing of the HTML file.
The pre-processing of the HTML file consists of changing all </A> strings to be followed by a
blank space and an end of line string.
For proof of concept I did the pre-processing in MS Word.
I hope to be able to do the pre-processing using GNU SED (stream editor).
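The same pre-processing could probably be done in Java itself rather than in MS Word or sed. A minimal sketch, assuming the whole page is already in a string; the class and method names are hypothetical, and "\n" is used as the end-of-line string.

```java
// Hypothetical helper: puts a blank space and a line break after every
// closing anchor tag so that each tag pair ends on its own line.
class AnchorLineSplitter {

    static String oneAnchorPerLine(String html) {
        // (?i) makes the match case-insensitive; $0 re-inserts the tag as found,
        // so </A> stays </A> and </a> stays </a>.
        return html.replaceAll("(?i)</a>", "$0 \n");
    }

    public static void main(String[] args) {
        String html = "<a href=\"a.csv\">A</a><A href=\"b.csv\">B</A>";
        System.out.print(oneAnchorPerLine(html));
    }
}
```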
While researching SED, I ran across this thread that was relevant to the original topic.
New To Java - java 'sed' like functionality?
http://forums.sun.com/thread.jspa?threadID=743023
The code examples there include reading the file name from the command line and
reading the entire file into a string (warning: the regex has to be controlled so it
doesn't match across multiple end tags from later tag pairs -- that's why I read one line
at a time and pre-process to make sure each tag pair is on a separate line).
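The multiple-end-tag problem comes from greedy matching: ".*" runs on to the last </a> in the input, while the reluctant form ".*?" stops at the first one. A small sketch of the difference on a whole-file string follows. It is not the code from the Sun forum thread; the class name, the patterns, the sample input, and the UTF-8 assumption are all illustrative.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo: counts how many matches a pattern finds in a string.
class GreedyVsReluctant {

    static int countMatches(String regex, String input) {
        Matcher matcher = Pattern
                .compile(regex, Pattern.CASE_INSENSITIVE | Pattern.DOTALL)
                .matcher(input);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Read the file named on the command line (assumed UTF-8),
        // or fall back to a three-link sample.
        String html = args.length > 0
                ? new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8)
                : "<a href=\"a.csv\">A</a>\n<a href=\"b.csv\">B</a>\n<a href=\"c.csv\">C</a>";

        // Greedy: one match from the first <a to the last </a>.
        System.out.println("greedy:    " + countMatches("<a\\b.*</a>", html));
        // Reluctant: one match per tag pair.
        System.out.println("reluctant: " + countMatches("<a\\b.*?</a>", html));
    }
}
```

With the reluctant form, line-at-a-time input and the separate pre-processing step may turn out not to be needed.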
If Java uses zero-based arrays, why is the matched string found at element one?
And do the single-letter variable names mean they are using Generics?
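On the first question: java.util.regex.Matcher does not return an array. Its groups are numbered so that group(0) is the entire match and group(1) is whatever the first pair of parentheses in the pattern captured, which is why the interesting text shows up at index one. Single-letter variable names are just names; generics show up as type parameters in angle brackets, such as List<String>. A small sketch with a made-up pattern and input (not the code from the thread):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo of Matcher group numbering.
class GroupDemo {

    // The parentheses around [^"']* form capture group 1: the href value.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?href\\s*=\\s*[\"']([^\"']*)[\"']", Pattern.CASE_INSENSITIVE);

    // Returns { whole match, first capture group }, or null if nothing matched.
    static String[] wholeAndFirstGroup(String html) {
        Matcher matcher = HREF.matcher(html);
        if (!matcher.find()) {
            return null;
        }
        return new String[] { matcher.group(0), matcher.group(1) };
    }

    public static void main(String[] args) {
        String[] groups = wholeAndFirstGroup("<A HREF=\"data/file1.csv\">File 1</A>");
        System.out.println("group(0): " + groups[0]); // <A HREF="data/file1.csv"
        System.out.println("group(1): " + groups[1]); // data/file1.csv
    }
}
```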
Jim
