Get HTML content of a redirected URL?

Kaidi

Hi,
I am trying to write a spider-like program. The first step is, of course,
to be able to get the correct content for a given URL.
All went fine until I found that Java's URL class doesn't handle
URL redirection automatically. For example, if the HTML page the current
URL points to is:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>This Page has Moved</title>
<meta HTTP-EQUIV="Refresh" CONTENT="0;
URL=http://www1.cs.uic.edu/CSweb/page.php?page=root&audience=public">
<h1> </h1>
</head>
</HTML>

Then my program will not go to the alternative URL specified in the
HTML code above.

Can anyone kindly tell me how to make the program follow the URL
redirection / refresh in this situation? Thanks a lot and happy Xmas.

The code I am currently using looks like:

// begin of code =============================================
public static void main(String[] args) {
    Reader myreader;
    // check the input parameter.
    if (args.length == 0) {
        System.err.println("Usage: java HTMLParseDemo [url | file]");
        System.exit(0);
    }
    // variables
    URL input_url, new_url;
    String input_url_string, temp_s;
    input_url_string = args[0];
    try {
        input_url = new URL("http://www.cnn.com"); // this line test only

        if (input_url_string.indexOf("://") > 0) {
            // looks like a URL: fetch its content
            input_url = new URL(input_url_string);
            Object content = input_url.getContent();
            if (content instanceof InputStream) {
                myreader = new InputStreamReader((InputStream) content);
            }
            else if (content instanceof Reader) {
                myreader = (Reader) content;
            }
            else {
                throw new Exception("Bad URL content type.");
            }
        }
        else {
            // otherwise treat the argument as a local file name
            myreader = new FileReader(input_url_string);
        }
        //
        HTMLEditorKit.Parser parser;
        System.out.println("About to parse " + input_url_string);
        parser = new ParserDelegator();
        TagTreeCallBack cb = new TagTreeCallBack(input_url);
        parser.parse(myreader, cb, true);
        myreader.close();
    }
    catch (Exception e) {
        System.err.println("Error: " + e);
        e.printStackTrace(System.err);
    }
}
// end of code ==========================================
 
Real Gagnon

All went fine until I found that Java's URL class doesn't handle
URL redirection automatically.

You can try using HttpURLConnection and then
HttpURLConnection.setFollowRedirects(true);
but in your case I don't think it will work,
since the redirection is not coming from the server
but from the HTML code received by the client.
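
For server-side (HTTP 3xx) redirects, the suggestion above could be sketched roughly as follows. The class name RedirectFollower and the method resolveFinalUrl are made up for illustration. The sketch assumes the JDK's standard HTTP handler, where getURL() reports the last redirect target once the request has run; it reads the body from the same connection object that was configured, and it will not follow a meta refresh inside the HTML.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper, not part of the original posts.
final class RedirectFollower {

    /**
     * Fetches urlString, letting HttpURLConnection follow HTTP 3xx redirects,
     * and returns the URL the response was finally served from.
     */
    static URL resolveFinalUrl(String urlString) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(urlString).openConnection();
        // Per-connection switch; setFollowRedirects(boolean) is the static,
        // JVM-wide equivalent.
        conn.setInstanceFollowRedirects(true);
        try (InputStream in = conn.getInputStream()) {
            // Drain the body from this same connection object.
            byte[] buf = new byte[4096];
            while (in.read(buf) != -1) {
                // A real spider would collect the bytes here.
            }
            // Assumption: after the request has run, getURL() reflects the
            // last redirect target (behaviour of the JDK's HTTP handler).
            return conn.getURL();
        } finally {
            conn.disconnect();
        }
    }
}
```

Comparing the returned URL with the one requested shows whether a redirect happened. Note that HttpURLConnection does not follow redirects that switch protocol (for example http to https), so those would still need handling by hand.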

Bye.
 
Silvio Bierman

URL handles HTTP redirects, which are a specific type of HTTP response
(a 3xx status code with a Location header). Returning an HTML document
containing a meta tag is not an HTTP redirect; it is a messy way to tell a
browser to look somewhere else. It was conceived because people writing HTML
have no way to control HTTP server responses without doing CGI-like stuff.
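
Since a meta refresh lives in the document body rather than in the HTTP response, a spider has to detect it itself after downloading the page. A minimal sketch of one way to do that is below; the class name MetaRefresh, the method findRefreshTarget, and the regular expression are illustrative assumptions. The pattern only covers the common form where http-equiv comes before content and the target is not wrapped in extra quotes, so a real HTML parser would be more robust.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original posts.
final class MetaRefresh {

    // Matches <meta http-equiv="refresh" content="N; url=TARGET">,
    // case-insensitively, allowing whitespace (including a line break)
    // between "N;" and "url=". Group 1 captures TARGET.
    private static final Pattern META_REFRESH = Pattern.compile(
            "<meta\\s+http-equiv\\s*=\\s*[\"']?refresh[\"']?\\s+"
            + "content\\s*=\\s*[\"']\\s*\\d+\\s*;\\s*url\\s*=\\s*([^\"'>]+)[\"']",
            Pattern.CASE_INSENSITIVE);

    /**
     * Returns the refresh target found in html, resolved against base so
     * that relative targets work too, or null if there is no meta refresh.
     */
    static URL findRefreshTarget(String html, URL base)
            throws MalformedURLException {
        Matcher m = META_REFRESH.matcher(html);
        if (!m.find()) {
            return null;
        }
        return new URL(base, m.group(1).trim());
    }
}
```

A spider could call findRefreshTarget on each downloaded page and, when the result is not null, fetch that URL instead, with a cap on the number of hops so that two pages refreshing to each other cannot loop forever.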

Silvio Bierman
 
Kaidi

Thanks a lot, guys.
However, I have run into another redirect problem that I can't figure out.
(I have tried setFollowRedirects(true), but it is not working.)

This URL:
http://www.buy.com/basket/additem.asp?loc=18133&sku=10343367
in IE redirects to another one:
http://www.buy.com/basket/basket.asp?dclksa=10343367^1

How do I find that out in my Java program?
My code is below; it ends up in the catch block for that URL. :-(

-------code begin -------------
public String getpagesource(String url_string)
{
    int MAXSIZE = 9000000; // max page size 9M
    String n_string = "";
    try {
        // try opening the URL
        URL url = new URL(url_string);
        HttpURLConnection urlConnection = (HttpURLConnection) url.openConnection();
        urlConnection.setFollowRedirects(true);
        urlConnection.setAllowUserInteraction(false);
        InputStream urlStream = url.openStream();
        // search the input stream for links
        // first, read in the entire URL
        byte b[] = new byte[1000];
        int numRead = urlStream.read(b);
        String content;
        if (numRead > 0)
            content = new String(b, 0, numRead);
        else
            content = new String(""); // so it will be ignored by later program.
        while ((numRead != -1) && (content.length() < MAXSIZE))
        {
            numRead = urlStream.read(b);
            if (numRead != -1) {
                String newContent = new String(b, 0, numRead);
                content += newContent;
            }
        }
        n_string = urlConnection.getURL().toString();
        if (n_string.compareToIgnoreCase(url_string) != 0)
        {
            System.out.println("URL redirect detected, org: " + url_string + "\rnew :" + n_string);
        }
        return content;
    }
    catch (IOException e)
    {
        return "";
    }
}

------- code end ---------------
 
