Extract text from HTML (unicode)

unbending · Jan 29, 2005

I'm having trouble using the example method I found on Sun's site for
extracting text from an HTML document. It works fine for standard
ANSI-based files, but when I convert them to Unicode or UTF-8, it
doesn't work right (the output includes a bunch of strange characters).

I think the reason it's not working has to do with the 2-byte vs.
1-byte encoding, but I have no idea how to fix it. Any ideas?

Here's my code:
final StringBuffer buf = new StringBuffer(1000);
try {
    // Create an HTML document that appends all text to buf
    HTMLDocument doc = new HTMLDocument() {
        public HTMLEditorKit.ParserCallback getReader(int pos) {
            return new HTMLEditorKit.ParserCallback() {
                // This method is called whenever text is encountered
                // in the HTML file
                public void handleText(char[] data, int pos) {
                    buf.append(data).append('\n');
                }
            };
        }
    };

    // Create a reader on the HTML content
    // URL url = new URI(location).toURL();
    URL url = location.toURL();
    URLConnection conn = url.openConnection();
    Reader rd = new InputStreamReader(conn.getInputStream());

    // Parse the HTML
    HTMLEditorKit kit = new HTMLEditorKit();
    kit.read(rd, doc, 0);
} catch (MalformedURLException mue) {
    System.out.println(mue.getLocalizedMessage());
} catch (BadLocationException ble) {
    System.out.println(ble.getLocalizedMessage());
} catch (IOException ioe) {
    System.out.println(ioe.getLocalizedMessage());
}
parsed = buf.toString();

Chris Smith · Jan 29, 2005

unbending said:
I'm having trouble using the example method I found on Sun's site for
extracting text from an HTML document. It works fine for standard
ANSI-based files, but when I convert them to Unicode or UTF-8, it
doesn't work right (the output includes a bunch of strange characters).

There is no such thing as a "standard ANSI-based file". ANSI
standardizes (or jointly standardizes) a lot of things, including a good
number of very different character encodings. If you mean ASCII, then
say ASCII. If you mean something else, then say what you mean.

There is also no such character encoding as "Unicode". I'll assume you
mean one of UCS-2BE, UCS-2LE, UTF-16LE or UTF-16BE. The difference
between UCS-2 and UTF-16 is probably not critical for you, unless you're
using characters outside of the Unicode Basic Multilingual Plane. The
difference between big-endian and little-endian is very important,
though, and you'll need to know which one you are using.
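
To illustrate why byte order matters, here is a minimal sketch that decodes the same four bytes as UTF-16BE and as UTF-16LE. The class name EndianDemo and its helper methods are made up for this example, and it assumes Java 7 or later for StandardCharsets.

```java
import java.nio.charset.StandardCharsets;

public class EndianDemo {
    // Decode the bytes as big-endian UTF-16.
    static String decodeBE(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16BE);
    }

    // Decode the same bytes as little-endian UTF-16.
    static String decodeLE(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        // "Hi" encoded as UTF-16BE is the byte sequence 00 48 00 69.
        byte[] bytes = "Hi".getBytes(StandardCharsets.UTF_16BE);

        System.out.println(decodeBE(bytes)); // Hi

        // Read with the wrong byte order, each pair is swapped:
        // 00 48 becomes U+4800 and 00 69 becomes U+6900.
        for (char c : decodeLE(bytes).toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```

Both results are valid strings, so nothing fails; the text is simply wrong when the byte order is guessed incorrectly.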

You said:
Reader rd = new InputStreamReader(conn.getInputStream());

If you're having character encoding problems, this is almost certainly
the source. The constructor you've used for InputStreamReader uses the
platform default encoding. Because I don't know what platform you're
working on, I can't tell you what that is. Apparently, though, it is
(or is a superset of) the same encoding you used in the first document,
but is not compatible with UTF-8 or whatever other Unicode encoding you
tried.
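
A small sketch of that failure mode, assuming Java 7 or later: it prints the platform default and then decodes UTF-8 bytes with ISO-8859-1, which merely stands in here for a default that is not UTF-8. The class name DefaultCharsetMismatch and the decode helper are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DefaultCharsetMismatch {
    // Decode bytes with an explicitly named charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // This is what the one-argument InputStreamReader constructor falls back to.
        System.out.println("Platform default: " + Charset.defaultCharset());

        // U+00E9 (e with acute accent) is two bytes in UTF-8: C3 A9.
        byte[] utf8 = "\u00E9".getBytes(StandardCharsets.UTF_8);

        String right = decode(utf8, StandardCharsets.UTF_8);
        String wrong = decode(utf8, StandardCharsets.ISO_8859_1);

        System.out.println(right.length()); // 1
        System.out.println(wrong.length()); // 2
    }
}
```

The two-character result is the kind of "strange characters" output a mismatched decoder produces: each byte of a multi-byte sequence is turned into its own character.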

There is another constructor for InputStreamReader which allows you to
specify an encoding for the file. You should use that instead.
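
As a rough sketch of that, the following reads a stream through InputStreamReader with an explicit charset. The class ExplicitCharsetReader and its readAll helper are invented for illustration, a ByteArrayInputStream stands in for conn.getInputStream(), and the code assumes Java 8 or later and that the document really is UTF-8.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetReader {
    // Hypothetical helper: read the whole stream as text in the given charset.
    static String readAll(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        // The two-argument constructor bypasses the platform default encoding.
        try (Reader rd = new InputStreamReader(in, charset)) {
            char[] chunk = new char[1024];
            int n;
            while ((n = rd.read(chunk)) != -1) {
                sb.append(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the bytes of a UTF-8 encoded HTML document.
        byte[] utf8 = "caf\u00E9".getBytes(StandardCharsets.UTF_8);
        String text = readAll(new ByteArrayInputStream(utf8), StandardCharsets.UTF_8);
        System.out.println(text.equals("caf\u00E9")); // true
    }
}
```

In the posted code, the equivalent change would be along the lines of new InputStreamReader(conn.getInputStream(), "UTF-8"), with the charset name adjusted to whatever encoding the document actually uses.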

--
Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation
