Extract text from HTML (unicode)

unbending

I'm having trouble using the example method (to extract text from an
HTML document I found on Sun's site). It works fine for standard
ANSI-based files, but when I convert them to Unicode or UTF-8, it
doesn't work right (it includes a bunch of strange characters).

I think the reason it's not working has to do with the 2-byte vs.
1-byte encoding, but I have no idea how to fix it. Any ideas?

Here's my code:
final StringBuffer buf = new StringBuffer(1000);
try {
    // Create an HTML document that appends all text to buf
    HTMLDocument doc = new HTMLDocument() {
        public HTMLEditorKit.ParserCallback getReader(int pos) {
            return new HTMLEditorKit.ParserCallback() {
                // This method is called whenever text is encountered
                // in the HTML file
                public void handleText(char[] data, int pos) {
                    // Append the characters themselves, then a line break
                    buf.append(data).append('\n');
                }
            };
        }
    };

    // Create a reader on the HTML content
    // URL url = new URI(location).toURL();
    URL url = location.toURL();
    URLConnection conn = url.openConnection();
    Reader rd = new InputStreamReader(conn.getInputStream());

    // Parse the HTML
    HTMLEditorKit kit = new HTMLEditorKit();
    kit.read(rd, doc, 0);
} catch (MalformedURLException mue) {
    System.out.println(mue.getLocalizedMessage());
} catch (BadLocationException ble) {
    System.out.println(ble.getLocalizedMessage());
} catch (IOException ioe) {
    System.out.println(ioe.getLocalizedMessage());
}
parsed = buf.toString();
 
Chris Smith

unbending said:
I'm having trouble using the example method (to extract text from an
HTML document I found on Sun's site). It works fine for standard
ANSI-based files, but when I convert them to Unicode or UTF-8, it
doesn't work right (it includes a bunch of strange characters).

There is no such thing as a "standard ANSI-based file". ANSI
standardizes (or jointly standardizes) a lot of things, including a good
number of very different character encodings. If you mean ASCII, then
say ASCII. If you mean something else, then say what you mean.

There is also no such character encoding as "Unicode". I'll assume you
mean one of UCS-2BE, UCS-2LE, UTF-16LE or UTF-16BE. The difference
between UCS-2 and UTF-16 is probably not critical for you, unless you're
using characters outside of Unicode's Basic Multilingual Plane. The difference
between big-endian and little-endian is very important, though, and
you'll need to know which one you are using.
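
To make the byte-order point concrete, here is a small sketch (the class name Utf16EndianDemo and the hex helper are made up for illustration, and StandardCharsets assumes Java 7 or later). It encodes the letter "A" both ways, then decodes the big-endian bytes as little-endian to show how the character changes. It also shows a character outside the Basic Multilingual Plane, which takes two Java chars (a surrogate pair) in UTF-16.

```java
import java.nio.charset.StandardCharsets;

public class Utf16EndianDemo {

    // Hypothetical helper: render bytes as space-separated hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The same character, encoded with each byte order.
        byte[] be = "A".getBytes(StandardCharsets.UTF_16BE);
        byte[] le = "A".getBytes(StandardCharsets.UTF_16LE);
        System.out.println(hex(be)); // 00 41
        System.out.println(hex(le)); // 41 00

        // Decoding big-endian bytes as little-endian swaps the two bytes,
        // so U+0041 is read as U+4100, a completely different character.
        String misread = new String(be, StandardCharsets.UTF_16LE);
        System.out.println(Integer.toHexString(misread.charAt(0))); // 4100

        // U+1F600 lies outside the Basic Multilingual Plane, so UTF-16
        // needs a surrogate pair (two chars, four bytes) to represent it.
        String astral = new String(Character.toChars(0x1F600));
        System.out.println(astral.length()); // 2
        System.out.println(astral.getBytes(StandardCharsets.UTF_16BE).length); // 4
    }
}
```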

You said:
Reader rd = new InputStreamReader(conn.getInputStream());

If you're having character encoding problems, this is almost certainly
the source. The constructor you've used for InputStreamReader uses the
platform default encoding. Because I don't know what platform you're
working on, I can't tell you what that is. Apparently, though, it is
(or is a superset of) the same encoding you used in the first document,
but is not compatible with UTF-8 or whatever other Unicode encoding you
tried.
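
If you want to see what the platform default actually is on your machine, something like the following sketch should tell you (the class and method names are made up for illustration). The output depends on the JVM version and the operating system settings, so no expected output is shown. On Java 18 and later the default is UTF-8 unless it has been overridden.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

public class DefaultCharsetDemo {

    // Hypothetical helper: report the encoding that the one-argument
    // InputStreamReader constructor picks up from the platform.
    static String defaultReaderEncoding() throws IOException {
        try (InputStreamReader rd =
                 new InputStreamReader(new ByteArrayInputStream(new byte[0]))) {
            // getEncoding() must be called before the reader is closed.
            return rd.getEncoding();
        }
    }

    public static void main(String[] args) throws IOException {
        // The charset the JVM uses when none is specified.
        System.out.println("Default charset: " + Charset.defaultCharset());
        // The same information as seen by InputStreamReader; this may be
        // printed under a historical alias of the same charset.
        System.out.println("Reader encoding: " + defaultReaderEncoding());
    }
}
```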

There is another constructor for InputStreamReader which allows you to
specify an encoding for the file. You should use that instead.
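
As a rough sketch of that second constructor (ExplicitCharsetDemo and readAll are names made up for illustration, and StandardCharsets assumes Java 7 or later), the example below decodes the same UTF-8 bytes twice: once with the right charset and once with ISO-8859-1, which is the kind of mismatch that produces the "strange characters". It uses an in-memory byte array in place of your URL connection so that it runs on its own.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetDemo {

    // Hypothetical helper: decode an entire stream with the given charset.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder out = new StringBuilder();
        try (Reader rd = new InputStreamReader(in, charset)) {
            char[] chunk = new char[1024];
            int n;
            while ((n = rd.read(chunk)) != -1) {
                out.append(chunk, 0, n);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        // "caf" plus e-acute, encoded as UTF-8: the accented letter
        // becomes the two bytes 0xC3 0xA9.
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

        String right = readAll(new ByteArrayInputStream(utf8), StandardCharsets.UTF_8);
        String wrong = readAll(new ByteArrayInputStream(utf8), StandardCharsets.ISO_8859_1);

        // Decoded correctly, the text has four characters.
        System.out.println(right.length()); // 4
        // Decoded as ISO-8859-1, each of the two bytes becomes its own
        // character, so one stray character appears.
        System.out.println(wrong.length()); // 5
    }
}
```

In the code from the question, the equivalent change would be new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8), assuming the document really is UTF-8. On older JVMs without StandardCharsets, the constructor that takes a charset name, new InputStreamReader(in, "UTF-8"), does the same job.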

--
www.designacourse.com
The Easiest Way To Train Anyone... Anywhere.

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation
 
