tHTMLinput

One Star

I would like to parse the table on the following page:
http://english.mnb.hu/arfolyamok
So the HTML is something like:

<td class="firstcell noborder">AUD</td>
<td>Australian Dollar</td>
<td>1</td>
<td>209.43</td>
<td></td>
<td class="firstcell">KRW</td>
<td>South Korean Won</td>
<td>100</td>
<td>24.87</td>
</tr>

I am trying to use the tHTMLInput component, and I currently get the following message:
Execution failed: Code generation failed.


Could you help me?
Thanks
Didier
Moderator

Re: tHTMLinput

Hi,
Have you already checked the custom component overview on Talend Exchange: tHTMLInput?

Best regards
Sabrina
Sixteen Stars

Re: tHTMLinput

I have written a tutorial that covers this. It is reasonably complicated and is meant to show what you can do with Talend when you combine it with third-party tools and libraries. It comes with an example job that you can probably tailor to your requirements, although doing so requires Java knowledge.
http://www.rilhia.com/node/39
One Star

Re: tHTMLinput

Yes to Sabrina,
and yes, I have read the tutorial, but it is still not clear to me.
Five Stars

Re: tHTMLinput

Hi Dihonore,
Follow the steps below to get the expected result; they do not require Java knowledge. I hope you have seen the original post from the author of the tHTMLInput component here.

Parent Element = "table.MNBDailyRatesUI_Table.mnbtable"
Add columns and configure them as follows:

first column = "td:eq(0)"
second column = "td:eq(1)"
third column = "td:eq(2)"

This way you should get the expected result. If you want to try the selectors first, refer to this URL:

http://try.jsoup.org/~Tmx2BFhR_XBIJE0WJMFj86MpMEM


Hope this solves your problem.
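
In case it helps to see what the column selectors do, here is a minimal Java sketch. It is not the component's code: it uses the JDK's built-in XML parser and XPath rather than jsoup selectors (an assumption made so that the example runs without extra libraries), it only works on well-formed markup, and it assumes one output row per tr element. The class name RateTableSketch, the method parseRows and the wrapping table element are made up for illustration. XPath td[1] plays the role of td:eq(0), td[2] of td:eq(1), and so on.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class RateTableSketch {

    // Returns one String[4] per <tr>: code, name, unit, rate.
    static List<String[]> parseRows(String wellFormedHtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wellFormedHtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();

            // Loop over each table row (hypothetical stand-in for the parent element).
            NodeList rows = (NodeList) xpath.evaluate("//table/tr", doc, XPathConstants.NODESET);
            List<String[]> result = new ArrayList<>();
            for (int i = 0; i < rows.getLength(); i++) {
                Node row = rows.item(i);
                result.add(new String[] {
                    xpath.evaluate("td[1]", row), // like td:eq(0)
                    xpath.evaluate("td[2]", row), // like td:eq(1)
                    xpath.evaluate("td[3]", row), // like td:eq(2)
                    xpath.evaluate("td[4]", row)  // like td:eq(3)
                });
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the markup", e);
        }
    }

    public static void main(String[] args) {
        // Sample rows taken from the first post, wrapped in a table so the markup is well formed.
        String sample = "<table>"
                + "<tr><td class=\"firstcell noborder\">AUD</td><td>Australian Dollar</td>"
                + "<td>1</td><td>209.43</td></tr>"
                + "<tr><td class=\"firstcell\">KRW</td><td>South Korean Won</td>"
                + "<td>100</td><td>24.87</td></tr>"
                + "</table>";
        for (String[] row : parseRows(sample)) {
            System.out.println(String.join("|", row));
        }
        // prints:
        // AUD|Australian Dollar|1|209.43
        // KRW|South Korean Won|100|24.87
    }
}
```

The live page is unlikely to be strict XML, so for the real job the jsoup-style selectors above remain the practical route; the sketch only shows how a column maps to a cell position within each row.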
One Star

Re: tHTMLinput

Here is another example to help me understand how tHTMLInput works, on the same site: http://english.mnb.hu/
I want to extract the official euro rate,
and I have the following HTML source code:

<div class="MNBStatsValue roundedBox">
<span>
<span id="ctl00_WebPartManager1_MNBEuroExchangeRate1880065841_ctl00_euroValueLabel">EUR</span></span>&#160;
<span id="ctl00_WebPartManager1_MNBEuroExchangeRate1880065841_ctl00_euroPriceLabel" class="BaseRateData">310.83</span>
</div>
So is the parent element "div.MNBStatsValue roundedBox" or "span.BaseRateData"?
And how do I get the rate (310.83)?
Thanks
Didier
Five Stars

Re: tHTMLinput

Your parent element is div.MNBStatsValue.roundedBox.
If you just need the euro value, keep the parent as above and set the column value as follows:

Euro Rate="span#ctl00_WebPartManager1_MNBEuroExchangeRate1880065841_ctl00_euroPriceLabel"

This will solve your problem.
One Star

Re: tHTMLinput

Currently I get:
Starting job GetCurrency_BNH_HTML at 11:24 16/07/2015.
connecting to socket on port 3598
connected
310.83|
3  %|
0.6  %|
1.50 %|
disconnected
Job GetCurrency_BNH_HTML ended at 11:24 16/07/2015.
Is there a way to specify the class BaseRateData to get only the rate?
Thanks
Didier
Five Stars

Re: tHTMLinput

Set the parent element to "*" and keep the column setting as is. It will return only the euro rate.

EuroRate="span#ctl00_WebPartManager1_MNBEuroExchangeRate1880065841_ctl00_euroPriceLabel"

The tHTMLInput component will then give the expected result. Otherwise, you can filter the result using tMap to keep only one row.
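
If you go the filtering route, the condition is just a Java boolean expression. Here is a minimal sketch of the idea, assuming the extracted values are the four strings shown in the log above and that the rate is the only plain decimal number among them. The class RateFilterSketch and the methods isPlainRate and keepRates are hypothetical names for illustration, not part of Talend.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class RateFilterSketch {

    // True only for plain decimal numbers such as "310.83";
    // values containing a percent sign, such as "3  %", are rejected.
    static boolean isPlainRate(String value) {
        return value != null && value.trim().matches("\\d+(\\.\\d+)?");
    }

    // Keeps only the values that look like a rate.
    static List<String> keepRates(List<String> values) {
        List<String> kept = new ArrayList<>();
        for (String value : values) {
            if (isPlainRate(value)) {
                kept.add(value.trim());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // The four values from the job output above, without the tLogRow separator.
        List<String> extracted = Arrays.asList("310.83", "3  %", "0.6  %", "1.50 %");
        System.out.println(keepRates(extracted)); // prints [310.83]
    }
}
```

In a tMap filter you would use only the boolean part, applied to whatever your column is called in the input row.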
One Star

Re: tHTMLinput

Starting job GetCurrency_BNH_HTML at 12:50 16/07/2015.
connecting to socket on port 3413
connected
Exception in component tHTMLInput_1
java.lang.NullPointerException
    at pmi.getcurrency_bnh_html_0_1.GetCurrency_BNH_HTML.tHTMLInput_1Process(GetCurrency_BNH_HTML.java:692)
    at pmi.getcurrency_bnh_html_0_1.GetCurrency_BNH_HTML.runJobInTOS(GetCurrency_BNH_HTML.java:1099)
    at pmi.getcurrency_bnh_html_0_1.GetCurrency_BNH_HTML.main(GetCurrency_BNH_HTML.java:920)
disconnected
Job GetCurrency_BNH_HTML ended at 12:50 16/07/2015.


Any other recommendation?
Five Stars

Re: tHTMLinput

Today I will upload a new version of tHTMLInput; please download it. It will give you an option to choose how many times the component tries to connect to a given web page. This will avoid errors to some extent.
One Star

Re: tHTMLinput

Hi Umesh,
I have looked on the Talend Exchange site and I see tHTMLinput with a release date of 22-Apr and tHTMLInput_extended with a release date of 2-July.
Is there another version?
Thanks
Didier
One Star

Re: tHTMLinput

I have installed V2.
Now it works:
connecting to socket on port 3628
connected
.------+--------------.
|      tLogRow_2      |
|=-----+-------------=|
|euro  |parseErrorText|
|=-----+-------------=|
|309.33|null          |
'------+--------------'
disconnected
but the component does not seem very stable:
connected
Exception in component tHTMLInput_2
java.lang.NullPointerException
    at pmi.getcurrency_bnh_html_0_1.GetCurrency_BNH_HTML.tHTMLInput_2Process(GetCurrency_BNH_HTML.java:806)
    at pmi.getcurrency_bnh_html_0_1.GetCurrency_BNH_HTML.runJobInTOS(GetCurrency_BNH_HTML.java:1209)
    at pmi.getcurrency_bnh_html_0_1.GetCurrency_BNH_HTML.main(GetCurrency_BNH_HTML.java:1030)
disconnected

when you execute the same job several times!


Five Stars

Re: tHTMLinput

You need to add a tSleep component to slow the process down and avoid sending requests to the web server too frequently; otherwise it may block your IP address. There are other options, such as a proxy, but they still need to be added to the component. That will come in the next version; right now I am too busy.
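
To illustrate the general idea of retrying with a pause between attempts, here is a minimal Java sketch. It is not the component's implementation: the class RetrySketch, the method fetchWithRetry and its parameters are hypothetical, and treating a null result as a failed attempt is an assumption of this sketch.

```java
import java.util.function.Supplier;

final class RetrySketch {

    // Calls fetch up to maxAttempts times, sleeping pauseMillis between attempts.
    // A null result or a RuntimeException counts as a failed attempt (an assumption).
    static <T> T fetchWithRetry(Supplier<T> fetch, int maxAttempts, long pauseMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        RuntimeException lastError = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                T result = fetch.get();
                if (result != null) {
                    return result;
                }
            } catch (RuntimeException e) {
                lastError = e;
            }
            if (attempt < maxAttempts) {
                try {
                    // Pause so the web server is not hit in quick succession.
                    Thread.sleep(pauseMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while waiting to retry", e);
                }
            }
        }
        throw new IllegalStateException("No result after " + maxAttempts + " attempts", lastError);
    }

    public static void main(String[] args) {
        // Stand-in for the page fetch: fails twice, then returns a value.
        int[] calls = {0};
        String rate = fetchWithRetry(() -> ++calls[0] < 3 ? null : "309.33", 5, 100L);
        System.out.println(rate + " after " + calls[0] + " attempts"); // prints 309.33 after 3 attempts
    }
}
```

Inside a job, the same effect would come from the component's retry option together with tSleep rather than from hand-written code.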
 
One Star

Re: tHTMLinput

With a tSleep:
Starting job GetCurrency_BNH_HTML at 15:45 27/07/2015.
connecting to socket on port 3510
connected
Exception in component tHTMLInput_2
java.lang.NullPointerException
    at pmi.getcurrency_bnh_html_0_1.GetCurrency_BNH_HTML.tHTMLInput_2Process(GetCurrency_BNH_HTML.java:1478)
    at pmi.getcurrency_bnh_html_0_1.GetCurrency_BNH_HTML.tSleep_1Process(GetCurrency_BNH_HTML.java:2076)
    at pmi.getcurrency_bnh_html_0_1.GetCurrency_BNH_HTML.runJobInTOS(GetCurrency_BNH_HTML.java:2338)
    at pmi.getcurrency_bnh_html_0_1.GetCurrency_BNH_HTML.main(GetCurrency_BNH_HTML.java:2159)
disconnected
Job GetCurrency_BNH_HTML ended at 15:46 27/07/2015.