One Star

How to extract data from a website?

Hi,
I've got two websites. One website supports SOAP, imports, and so on.
The other website holds about 7000 HTML documents in an identical format, with the information stored in tables.
Now, with the relaunch, I have to move the content from the 7000 files into a database / CMS / SOAP.
I saw that Talend is able to connect over HTTP.
Can I also extract data from HTML tables?
Thank you.
Bye, Chris

20 REPLIES
One Star

Re: How to extract data from a website?

I think there isn't any way to extract data from an HTML table, but if you only have tables you may use a regular expression.
One Star

Re: How to extract data from a website?

Hello Chris,
as Olivier wrote, there is no special component. I had the same problem and it ended up in a tJavaRow with many regexes. But that depends on your HTML structure. I've experimented a little with html2xml converters. If you search on Google you should find different tools (including open source). In the end I couldn't use them because my input was very badly formed.
If you find a solution, please give us feedback.
Bye
Volker
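
The regex approach mentioned in the replies above could look roughly like the following. This is only a minimal sketch in plain Java, assuming flat, reasonably well-formed tables without nesting. The class name TableRegexExtractor and the method extractRows are made up for illustration and are not part of Talend. Inside a tJavaRow the same patterns would be applied to the incoming HTML string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a Talend class: pulls rows and cells out of simple HTML tables.
public class TableRegexExtractor {
    // One <tr>...</tr> block; DOTALL lets '.' span line breaks.
    private static final Pattern ROW = Pattern.compile(
            "<tr\\b[^>]*>(.*?)</tr>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    // One <td> or <th> cell; \b keeps <thead> from matching.
    private static final Pattern CELL = Pattern.compile(
            "<t[dh]\\b[^>]*>(.*?)</t[dh]>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    // Any remaining tag inside a cell, such as <b> or <a href=...>.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    // Returns one list of cell texts per table row found in the input.
    public static List<List<String>> extractRows(String html) {
        List<List<String>> rows = new ArrayList<>();
        Matcher rowMatcher = ROW.matcher(html);
        while (rowMatcher.find()) {
            List<String> cells = new ArrayList<>();
            Matcher cellMatcher = CELL.matcher(rowMatcher.group(1));
            while (cellMatcher.find()) {
                // Strip inline markup and surrounding whitespace from the cell.
                cells.add(TAG.matcher(cellMatcher.group(1)).replaceAll("").trim());
            }
            rows.add(cells);
        }
        return rows;
    }

    public static void main(String[] args) {
        String html = "<table><tr><th>Name</th><th>City</th></tr>"
                + "<tr><td>Chris</td><td><b>Berlin</b></td></tr></table>";
        for (List<String> row : extractRows(html)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

Nested tables, unclosed cells and HTML entities are not handled here, which matches the caveat above that the result depends on the HTML structure.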
One Star

Re: How to extract data from a website?

I have written an open-source function for converting bad HTML to well-formed XML (http://sourceforge.net/projects/light-html2xml) and I would appreciate the chance to test it with your input.
It is a single-pass automaton and it does not need specific objects. It is not yet written in Java but in C# and PHP5 (I will soon rewrite it in Java, especially if you're interested).
One Star

Re: How to extract data from a website?

Yes, I think it would be a really good idea to write it in Java. Then I will create a specific Talend component to perform this action.
Employee

Re: How to extract data from a website?

Hi,
For internal stats we use some Talend jobs that rely on http://cpan.uwinnipeg.ca/module/HTML::TokeParser in tPerl/tPerlRow. We may push a new component onto the stack if you need it.
Hope this helps
One Star

Re: How to extract data from a website?

The Java version of the html2xml function I have written is now downloadable at http://sourceforge.net/projects/light-html2xml
Please send me your comments and remarks about it so I can fix bugs.
One Star

Re: How to extract data from a website?

Yes, you can extract all the data from the 7000 pages. I'm also working on this.
One Star

Re: How to extract data from a website?

I found another helpful thing for this:
http://www.iopus.com/imacros/firefox/?ref=fxmoz
Amazing tool to automate the web, even data extraction works fine.
One could combine its output (e.g. an Excel file) with Talend to get the data into another database.
One Star

Re: How to extract data from a website?

Use vder software to extract data from Amazon.com and output it in XML format. View a screenshot: http://binhgiang.sourceforge.net/xmlalbum/slides/vietspider%20xml%20list%20detail%201.html
and download it from: http://binhgiang.sourceforge.net/site/download.jsp
One Star

Re: How to extract data from a website?

I would suggest Automation Anywhere. It is a great tool for web data extraction and for automating any task. A free trial is available for download at:
http://www.automationanywhere.com/download/freeTrial.htm
Just try it out!
Employee

Re: How to extract data from a website?

You can also try tHTTPTableInput. This component has been designed for extracting data directly from HTML Pages.
http://www.talendforge.org/exchange/tos/extension_view.php?eid=72
Regards
Martin
One Star

Re: How to extract data from a website?

Have you ever wondered whether you can get the full contents of your desired website into a single Excel document?
If so, I have the solution for you at a fairly cheap price.
I can extract most of a website's data and compile it into a single MS Excel 2003 file within just a few days.
It can be any website, from a simple site to complex sites like B2B portals or whatever you can come up with.
Contact me with your website and requirements.
Regards,
Janib Soomro
janib4all@hotmail.com
One Star

Re: How to extract data from a website?

I can make it for you. site.downloader@gmail.com
One Star

Re: How to extract data from a website?

Talend, I am having trouble getting HTML table data into Excel using Talend v4.2.2. I saw that there is a component, thttptable, for a previous version.
Can you help in this regard?
One Star

Re: How to extract data from a website?

Hello Honed,
I'm having the same problem. When I try to fetch data from the HTML page that comes with the component, everything works fine, but that page is very simple: it has no divs or blockquotes and is structured using tables only. When I try a page that uses more HTML tags, such as blockquotes, it is as if tHTTPTableInput does not recognize the tables, so it throws
"Exception in component tHTTPTableInput_1 java.lang.ArrayIndexOutOfBoundsException:"
Does anyone here have the same problem or know how to solve this?

Thanks
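
One possible workaround for pages where the tables sit inside divs or blockquotes is to cut out the table markup first and parse only that fragment. The sketch below is plain Java and assumes tables are not nested inside each other. The class TableIsolator is hypothetical and not part of tHTTPTableInput, and it has not been verified that this avoids the exception in that component.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: returns every <table>...</table> fragment of a page,
// regardless of the tags that surround it.
public class TableIsolator {
    // Non-greedy match from an opening <table> to the next closing </table>.
    // Assumption: tables are not nested; a nested table would be cut short.
    private static final Pattern TABLE = Pattern.compile(
            "<table\\b[^>]*>.*?</table\\s*>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    public static List<String> isolateTables(String html) {
        List<String> tables = new ArrayList<>();
        Matcher matcher = TABLE.matcher(html);
        while (matcher.find()) {
            tables.add(matcher.group());
        }
        return tables;
    }

    public static void main(String[] args) {
        String page = "<div><blockquote><table><tr><td>1</td></tr></table>"
                + "</blockquote></div>";
        for (String table : isolateTables(page)) {
            System.out.println(table);
        }
    }
}
```

Each returned fragment contains nothing but table markup, so it resembles the simple sample page described above.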
One Star

Re: How to extract data from a website?

Hello,
Did you try the DataCrops web extraction tool?
DataCrops allows you to extract data from any website and delivers it in a proper structure. This business data helps you generate leads, and you can easily analyse it and make important decisions for your business!
One Star

Re: How to extract data from a website?

Try this; a free trial version is available for download.
Employee

Re: How to extract data from a website?

You can use Talend for this. It needs a little Java coding, but it is more than possible. I have written a simple tutorial here. It comes with all of the source code in Talend v5.5.1 format.
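
To give an idea of the kind of Java code involved, here is a minimal sketch for the original scenario of many HTML files with an identical format. It is not the tutorial mentioned above. The class HtmlBatchExport, the semicolon separator and the extractor parameter are all assumptions made for illustration. It reads every *.html file in a directory, passes the content to whatever table extractor you supply, and produces delimited lines that a Talend job could then read as a delimited file.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical batch helper, not a Talend class.
public class HtmlBatchExport {

    // Quotes a value for a delimited file: wrap in quotes, double embedded quotes.
    static String csvField(String value) {
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    // Reads each *.html file in inputDir (assumed to be UTF-8), applies the
    // supplied extractor, and returns one delimited line per extracted row,
    // prefixed with the name of the file the row came from.
    public static List<String> export(Path inputDir,
            Function<String, List<List<String>>> extractor) throws IOException {
        List<String> lines = new ArrayList<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(inputDir, "*.html")) {
            for (Path file : files) {
                String html = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
                for (List<String> row : extractor.apply(html)) {
                    StringBuilder line =
                            new StringBuilder(csvField(file.getFileName().toString()));
                    for (String cell : row) {
                        line.append(';').append(csvField(cell));
                    }
                    lines.add(line.toString());
                }
            }
        }
        return lines;
    }
}
```

Keeping the extractor as a parameter means the file handling stays the same whether the rows come from regular expressions or from an HTML parsing library.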
One Star

Re: How to extract data from a website?

Recently I had some problems extracting data, but I found data extractor software from webcontentextractor.com, and it helped me a lot. The software came with excellent support and saved me a lot of time and effort.
One Star

Re: How to extract data from a website?

I have developed a tHTMLInput component which can do better HTML parsing inside a Talend job. You can integrate this component to get the result in table format, and then send these records to a file, a table, or any processing component which supports a main or iterate flow.