unbending
I'm having trouble with the example method I found on Sun's site for
extracting text from an HTML document. It works fine for standard
ANSI-encoded files, but when I convert them to Unicode or UTF-8, the
extracted text contains a bunch of strange characters.
I think the problem has to do with 2-byte vs. 1-byte encoding,
but I have no idea how to fix it. Any ideas?
Here's my code:
    final StringBuffer buf = new StringBuffer(1000);
    try {
        // Create an HTML document that appends all text to buf
        HTMLDocument doc = new HTMLDocument() {
            public HTMLEditorKit.ParserCallback getReader(int pos) {
                return new HTMLEditorKit.ParserCallback() {
                    // This method is called whenever text is encountered
                    // in the HTML file
                    public void handleText(char[] data, int pos) {
                        // Append the characters themselves; data + "\n" would
                        // append the array's toString() value, not its contents
                        buf.append(data).append('\n');
                    }
                };
            }
        };
        // Create a reader on the HTML content
        // URL url = new URI(location).toURL();
        URL url = location.toURL();
        URLConnection conn = url.openConnection();
        Reader rd = new InputStreamReader(conn.getInputStream());
        // Parse the HTML
        HTMLEditorKit kit = new HTMLEditorKit();
        kit.read(rd, doc, 0);
    } catch (MalformedURLException mue) {
        System.out.println(mue.getLocalizedMessage());
    } catch (BadLocationException ble) {
        System.out.println(ble.getLocalizedMessage());
    } catch (IOException ioe) {
        System.out.println(ioe.getLocalizedMessage());
    }
    parsed = buf.toString();
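
For anyone hitting the same symptom, here is a minimal sketch of what the "strange characters" may be. An InputStreamReader created without a charset argument decodes bytes using the platform default encoding, so UTF-8 bytes read on an ANSI-default system can come out as mojibake. The class name CharsetDemo and the helper decode are hypothetical names made up for this illustration, and the sketch assumes the file really is UTF-8; it is not taken from the Sun example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetDemo {
    // Hypothetical helper: decodes the bytes through an InputStreamReader
    // that is told explicitly which charset to use.
    static String decode(byte[] bytes, Charset cs) {
        try (Reader rd = new InputStreamReader(new ByteArrayInputStream(bytes), cs)) {
            StringBuilder sb = new StringBuilder();
            char[] chunk = new char[256];
            int n;
            while ((n = rd.read(chunk)) != -1) {
                sb.append(chunk, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "caf\u00e9" is "cafe" with an accented e; in UTF-8 the last
        // character takes two bytes (0xC3 0xA9).
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

        String right = decode(utf8, StandardCharsets.UTF_8);
        String wrong = decode(utf8, StandardCharsets.ISO_8859_1);

        // Decoding with the matching charset gives back 4 characters;
        // decoding the same bytes as a 1-byte encoding gives 5.
        System.out.println(right.length());
        System.out.println(wrong.length());
    }
}
```

If that is what is happening here, one thing to try is passing the charset to the reader in the snippet above, e.g. new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8) for UTF-8 files (or StandardCharsets.UTF_16 for files saved as "Unicode"), instead of relying on the default.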