[resolved] getting destination url

Six Stars

Hi,
Trying to get the destination url of a url like this: http://objects.icecat.biz/objects/mmo-26204092-2358398.html
The url redirects to a pdf file. I did manage to get the file with tFileFetch (allowing redirects), but I just need the destination url (simply the url of the pdf file in this case), not the file itself. Any ideas?
Thanks,
Henry

All Replies
Community Manager

Re: [resolved] getting destination url

Hi
There is currently no component that returns the real url behind a redirect url. However, you can hand-code it in a tJava component; refer to the following pages:
http://www.programminglogic.com/how-to-find-the-real-url-behind-a-redirect-in-java/
http://stackoverflow.com/questions/2659000/java-how-to-find-the-redirected-url-of-a-url
Best regards
Shong
----------------------------------------------------------
Talend | Data Agility for Modern Business
Six Stars

Re: [resolved] getting destination url

Hi Shong,
Thank you for helping me out, I managed to get it working! Only for one url so far, though. Now I'm trying to get it to work with an input stream (the rows come from tExtractXMLField). How should I make it work for all rows? Should I be using tJavaRow instead of tJava?
My code now is:
String url = row15.url;
java.net.HttpURLConnection con = (java.net.HttpURLConnection) new java.net.URL(url).openConnection();
con.setInstanceFollowRedirects(false);
con.connect();
String realURL = con.getHeaderField("Location");
System.out.println(realURL);

The part not yet working is String url = row15.url; if I replace row15.url with "http://myurl.com", the code works for that url.
Thanks,
Henry
Community Manager

Re: [resolved] getting destination url

Yes, if you want to access the input data flow, use tJavaRow instead of tJava and change your code to:
String url = input_row.url;
java.net.HttpURLConnection con = (java.net.HttpURLConnection) new java.net.URL(url).openConnection();
con.setInstanceFollowRedirects(false);
con.connect();
String realURL = con.getHeaderField("Location");
System.out.println(realURL);

Shong
Six Stars

Re: [resolved] getting destination url

Great, thanks Shong, that works. I've run into 2 small challenges though:
1) If input_row.url ends with .html, the code should look up the destination url; otherwise output_row.url = input_row.url. Any idea how I should include such a condition in tJavaRow? FYI, my current code is:
String url = input_row.url;
System.out.println(url);
java.net.HttpURLConnection con = (java.net.HttpURLConnection) new java.net.URL(url).openConnection();
con.setInstanceFollowRedirects(false);
con.connect();
String realURL = con.getHeaderField("Location");
System.out.println(realURL);
output_row.url = realURL;

2) After a number of connections I get a timeout. I suspect the server on the other side does not accept more than x connections. Could I be missing something in the code above to properly close each connection before fetching the next destination url?
Six Stars

Re: [resolved] getting destination url

Regarding 1) I don't know whether it's best practice, but I seem to have solved it by adding an if statement like this:
String url = input_row.url;
System.out.println(url);
if (StringHandling.INDEX(url,".html")>0){
    java.net.HttpURLConnection con = (java.net.HttpURLConnection) new java.net.URL(url).openConnection();
    con.setInstanceFollowRedirects(false);
    con.connect();
    String realURL = con.getHeaderField("Location");
    System.out.println(realURL);
    output_row.url = realURL;
}
else {
    // pass urls that do not contain .html through unchanged
    output_row.url = input_row.url;
}

Still open: the connection timeout (after x connections the server won't accept more).
Six Stars

Re: [resolved] getting destination url

Regarding 2) it seems adding con.disconnect(); solves the connection issue. I don't know whether it is best practice to disconnect each time, but it works:
//Code generated according to input schema and output schema
output_row.fk_product_id = input_row.fk_product_id;
output_row.fk_supplier_id = input_row.fk_supplier_id;
output_row.id_by_datasupplier = input_row.id_by_datasupplier;
//
String url = input_row.url;
System.out.println(url);
if (StringHandling.INDEX(url,".html")>0){
    java.net.HttpURLConnection con = (java.net.HttpURLConnection) new java.net.URL(url).openConnection();
    con.setInstanceFollowRedirects(false);
    con.connect();
    String realURL = con.getHeaderField("Location");
    System.out.println(realURL);
    output_row.url = realURL;
    // release the connection before the next row is processed
    con.disconnect();
}
else {
    // pass urls that do not contain .html through unchanged
    output_row.url = input_row.url;
}

Shong, or anyone else who would like to comment on this, I would love to hear from you; otherwise I'll mark it resolved later today.
Community Manager

Re: [resolved] getting destination url

Hi
1) This method is OK. You can also use the String.endsWith() method to check whether the url ends with .html:
String url = input_row.url;
System.out.println(url);
if (url.endsWith(".html")){
..}

2) You need to close the connection at the end, for example:
if (StringHandling.INDEX(url,".html")>0){
java.net.HttpURLConnection con = (java.net.HttpURLConnection) new java.net.URL(url).openConnection();
con.setInstanceFollowRedirects(false);
con.connect();
String realURL = con.getHeaderField("Location");
System.out.println(realURL);
con.disConnect();
output_row.url = realURL;
}
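
A side note on point 1): the two checks are not quite equivalent. The sketch below assumes that StringHandling.INDEX behaves like String.indexOf (an assumption, not verified against the Talend routine), and the class name and example.com urls are hypothetical. It only illustrates that a "contains" test also matches urls where .html appears in the middle, while endsWith does not.

```java
final class HtmlSuffixCheck {

    // Stand-in for StringHandling.INDEX(url, ".html") > 0, assuming INDEX
    // returns the same position as String.indexOf.
    static boolean containsHtml(String url) {
        return url != null && url.indexOf(".html") > 0;
    }

    // The stricter check suggested in this thread.
    static boolean endsWithHtml(String url) {
        return url != null && url.endsWith(".html");
    }

    public static void main(String[] args) {
        // Hypothetical sample urls, for illustration only.
        String[] samples = {
            "http://example.com/objects/a.html",
            "http://example.com/objects/a.html.pdf",
            "http://example.com/objects/a.pdf"
        };
        for (String s : samples) {
            System.out.println(s + " contains=" + containsHtml(s) + " endsWith=" + endsWithHtml(s));
        }
    }
}
```

Neither check handles a url with a query string after .html; whether that matters depends on the input data.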

Shong
Six Stars

Re: [resolved] getting destination url

Many thanks. I'm using endsWith now and kept disconnect() (with a lowercase c). I'll mark it resolved now.
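
For later readers, the pieces discussed in this thread can be pulled together roughly as follows. This is only a sketch under stated assumptions, not the code the poster ended up with: the class and method names (RedirectResolver, needsLookup, resolveLocation, destinationOf) are hypothetical, the 5 second timeouts are arbitrary, and treating only 3xx responses as redirects and resolving a relative Location header against the request url are additions that were not discussed above. The disconnect() call sits in a finally block so the connection is released even when the request fails.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URL;

final class RedirectResolver {

    // Only urls ending in .html are looked up, as agreed in the thread.
    static boolean needsLookup(String url) {
        return url != null && url.endsWith(".html");
    }

    // Turns a Location header into an absolute url (assumption: relative
    // Location values should be resolved against the request url).
    static String resolveLocation(String requestUrl, String location) {
        if (location == null || location.isEmpty()) {
            return requestUrl;
        }
        try {
            return URI.create(requestUrl).resolve(location).toString();
        } catch (IllegalArgumentException e) {
            return location; // not a valid URI: keep the raw header value
        }
    }

    // Returns the redirect target of url, or url itself if no lookup is
    // needed or the server does not answer with a redirect.
    static String destinationOf(String url) throws IOException {
        if (!needsLookup(url)) {
            return url;
        }
        HttpURLConnection con = (HttpURLConnection) new URL(url).openConnection();
        try {
            con.setInstanceFollowRedirects(false);
            con.setConnectTimeout(5000); // arbitrary values, in milliseconds
            con.setReadTimeout(5000);
            int status = con.getResponseCode(); // connects implicitly
            boolean redirect = status >= 300 && status < 400;
            return redirect ? resolveLocation(url, con.getHeaderField("Location")) : url;
        } finally {
            con.disconnect(); // always release the connection
        }
    }
}
```

If the class were made available to the job (for example as a user routine, which is an assumption about the setup), the tJavaRow body could shrink to output_row.url = RedirectResolver.destinationOf(input_row.url);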