Getting a specific tag from HTML file

Four Stars

Hi everyone,

 

I'm trying to get an href link from an HTML file obtained from an HTTP GET request, but it seems I can't iterate over the correct XML tags to get the data.

The XPath I'm trying to reach is: "//*[@id="node-24615"]/div/div/div/div/center/div[3]/div/table/tbody/tr[2]/td[2]/div/a"

And the link I need to get is: "http://obieebr.banrep.gov.co/analytics/saw.dll?Download&Format=excel2007&Extension=.xlsx&BypassCache..."

Thanks for your help!

 

[Attached screenshot: talend_job.PNG]


Accepted Solutions
Sixteen Stars

Re: Getting a specific tag from HTML file

HTML is not XML, so parsing the page as XML and querying it with XPath will only work in rare cases (when the page happens to be well-formed XML). A better solution is to use something like jsoup: https://jsoup.org/

It will require a bit of Java, but it is entirely possible.
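
To illustrate why the XML route tends to break, here is a minimal sketch using only the XML parser that ships with the JDK (not Talend's own XML components, so treat it as an approximation of what happens inside the job). The class name `HtmlIsNotXml` and both sample snippets are made up for the illustration. It feeds the parser one XHTML-style fragment and one fragment written the way real pages usually are, with an unquoted attribute and an unclosed `<br>`.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: checks whether a piece of markup is well-formed XML.
public class HtmlIsNotXml {

    /** Returns true if the markup parses as well-formed XML, false otherwise. */
    static boolean parsesAsXml(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Replace the default error handler, which prints fatal errors to stderr.
            // DefaultHandler still throws on fatal (well-formedness) errors.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // Raised when the markup is not well-formed XML
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Sample markup invented for this example
        String xhtmlLike = "<div><a href=\"report.xlsx\">Download</a><br/></div>";
        String typicalHtml = "<div><a href=report.xlsx>Download</a><br></div>";

        System.out.println(parsesAsXml(xhtmlLike));   // true
        System.out.println(parsesAsXml(typicalHtml)); // false
    }
}
```

The second fragment is perfectly valid HTML for a browser, but an XML parser rejects it before any XPath can be evaluated. A lenient HTML parser such as jsoup is built to accept that kind of markup.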


All Replies
Four Stars

Re: Getting a specific tag from HTML file

Thanks for your help. I was finally able to import the jsoup library and write a short piece of Java code to extract the link.

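// Required imports: java.io.IOException, org.jsoup.Jsoup,
// org.jsoup.nodes.Document, org.jsoup.nodes.Element
try {
    // Fetch and parse the page, with a 20-second timeout
    Document doc = Jsoup.connect(context.webURI).timeout(20000).get();
    // first() returns null when nothing matches the selector
    Element link = doc.select(context.elementSelector).first();
    if (link != null) {
        // Read the attribute named by context.hrefLabel from the matched element
        context.webURIExcel = link.attr(context.hrefLabel);
    }
} catch (IOException e) {
    e.printStackTrace();
}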

 

Sixteen Stars

Re: Getting a specific tag from HTML file

Nice work!
