Kaidi
Hi,
I am trying to write a spider-like program. The first step is, of course,
to be able to get the correct content for a given URL.
Everything went fine until I found that Java's URL class doesn't handle
this kind of URL redirection automatically. For example, if the HTML page
the current URL points to is:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>This Page has Moved</title>
<meta HTTP-EQUIV="Refresh" CONTENT="0;
URL=http://www1.cs.uic.edu/CSweb/page.php?page=root&audience=public">
<h1> </h1>
</head>
</HTML>
then my program will not go to the alternative URL specified in the
HTML code above.
Can anyone kindly tell me how to make the program follow the URL
redirection / refresh in this situation? Thanks a lot and happy Xmas.
The code I am currently using looks like:
// begin of code =============================================
import java.io.FileReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class HTMLParseDemo {
    // TagTreeCallBack is my own HTMLEditorKit.ParserCallback subclass (not shown here).
    public static void main(String[] args) {
        Reader myreader;
        // Check the input parameter.
        if (args.length == 0) {
            System.err.println("Usage: java HTMLParseDemo [url | file]");
            System.exit(1);
        }
        // Variables
        URL input_url;
        String input_url_string = args[0];
        try {
            input_url = new URL("http://www.cnn.com"); // this line is for testing only
            if (input_url_string.indexOf("://") > 0) {
                // The argument looks like a URL: fetch its content.
                input_url = new URL(input_url_string);
                Object content = input_url.getContent();
                if (content instanceof InputStream) {
                    myreader = new InputStreamReader((InputStream) content);
                } else if (content instanceof Reader) {
                    myreader = (Reader) content;
                } else {
                    throw new Exception("Bad URL content type.");
                }
            } else {
                // Otherwise treat the argument as a local file name.
                myreader = new FileReader(input_url_string);
            }
            System.out.println("About to parse " + input_url_string);
            HTMLEditorKit.Parser parser = new ParserDelegator();
            TagTreeCallBack cb = new TagTreeCallBack(input_url);
            parser.parse(myreader, cb, true);
            myreader.close();
        } catch (Exception e) {
            System.err.println("Error: " + e);
            e.printStackTrace(System.err);
        }
    }
}
// end of code ==========================================
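To make the problem concrete: the "Refresh" above is not an HTTP-level redirect but an instruction inside the HTML itself, so the page has to be parsed before the new location is known. A rough sketch of how the target could be pulled out with the same Swing parser follows. It rests on assumptions: the class MetaRefreshFinder and its methods parseRefreshTarget and findRefreshTarget are hypothetical names made up for illustration, the example.com URLs are placeholders, and only the first <meta http-equiv="Refresh"> tag is considered.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.net.URL;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, not part of the original program.
public class MetaRefreshFinder {

    // Extracts the target from a Refresh CONTENT value such as
    // "0; URL=http://host/page". Returns null if no target is given.
    public static String parseRefreshTarget(String content) {
        if (content == null) {
            return null;
        }
        int semi = content.indexOf(';');
        if (semi < 0) {
            return null; // only a delay: the page reloads itself
        }
        String rest = content.substring(semi + 1).trim();
        // Drop an optional, case-insensitive "URL=" prefix.
        if (rest.regionMatches(true, 0, "url", 0, 3)) {
            String afterUrl = rest.substring(3).trim();
            if (afterUrl.startsWith("=")) {
                rest = afterUrl.substring(1).trim();
            }
        }
        // Drop one pair of surrounding quotes, if present.
        if (rest.length() >= 2) {
            char first = rest.charAt(0);
            char last = rest.charAt(rest.length() - 1);
            if ((first == '\'' || first == '"') && first == last) {
                rest = rest.substring(1, rest.length() - 1).trim();
            }
        }
        return rest.isEmpty() ? null : rest;
    }

    // Parses the HTML and returns the target of the first meta refresh
    // tag, or null if the page contains none.
    public static String findRefreshTarget(Reader html) throws IOException {
        final String[] target = new String[1];
        HTMLEditorKit.ParserCallback cb = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag != HTML.Tag.META || target[0] != null) {
                    return;
                }
                Object equiv = attrs.getAttribute(HTML.Attribute.HTTPEQUIV);
                Object content = attrs.getAttribute(HTML.Attribute.CONTENT);
                if (equiv != null && content != null
                        && "refresh".equalsIgnoreCase(equiv.toString())) {
                    target[0] = parseRefreshTarget(content.toString());
                }
            }
        };
        new ParserDelegator().parse(html, cb, true);
        return target[0];
    }

    public static void main(String[] args) throws Exception {
        // Placeholder page and base URL, for illustration only.
        String page = "<html><head><title>This Page has Moved</title>"
                + "<meta http-equiv=\"Refresh\" content=\"0; URL=http://example.com/new\">"
                + "</head><body></body></html>";
        URL base = new URL("http://example.com/old");
        String target = findRefreshTarget(new StringReader(page));
        if (target != null) {
            // Resolving against the base URL also covers relative targets.
            URL next = new URL(base, target);
            System.out.println("Refresh points to " + next);
        }
    }
}
```

In a spider, the resolved URL would then be fetched and parsed in turn, ideally with a limit on the number of hops so that two pages refreshing to each other cannot loop forever.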