The right way to detect the encoding used in an InputStream carrying HTML or XML

HK · May 26, 2005

Suppose you are faced with a java.io.InputStream
that is supposed to carry either HTML or XML.
Ultimately you want to read it with a Reader using the
correct encoding, of course.

Is the following a correct strategy:

1) Wrap the InputStream into a BufferedInputStream
to make sure mark() and reset() work.

2) Read single bytes from it up to some reasonable limit
and convert them to characters by simple casting:

char ch = (char)the_byte_I_read;

3) Check for the encoding declaration, e.g. with a regexp
4) call reset() on the BufferedInputStream
5) wrap the BufferedInputStream into a Reader
with the determined encoding
6) Start reading.

What bothers me a bit is the additional
BufferedInputStream in between when the
Reader later has another buffer. I am also
not sure if the cast is the right way to
convert bytes to chars before you know the
encoding.

Comments?
Harald.

John C. Bollinger · May 26, 2005

HK said:
Suppose you are faced with a java.io.InputStream
and it is supposed to carry either HTML or XML.
Ultimately you want to read with a Reader and the
correct encoding, of course.

Is the following a correct strategy:

1) Wrap the InputStream into a BufferedInputStream
to make sure mark() and reset() work.

2) Read single bytes from it up to some reasonable limit
and convert them to characters by simple casting:

char ch = (char)the_byte_I_read;

3) check for encoding, e.g. with regexp
4) call reset() on the BufferedInputStream
5) wrap the BufferedInputStream into a Reader
with the determined encoding
6) Start reading.

What bothers me a bit is the additional
BufferedInputStream in between when the
Reader later has another buffer. I am also
not sure if the cast is the right way to
convert bytes to chars before you know the
encoding.

The Reader does not necessarily have another buffer. As far as I know,
in fact, the only ones that do (in the platform library) are
BufferedReader and its subclass, LineNumberReader. It is generally best
to buffer as close to the source as possible, which is just what you
propose to do.

If encoding information is not provided externally (i.e. in an HTTP
header, or a protocol-dependent default), then determining the encoding
from the content itself is tricky, and differs between XML and HTML.
The details are off-topic for this group, but all involve examining the
initial portion of the byte stream. Some encodings are difficult or
impossible to determine in this way.

I see these problems with your strategy:

(1) Relying on a BufferedInputStream to provide the ability to reset()
the stream puts a fixed upper limit (the buffer size) on how far into
the file the encoding information can be sought. If you get it from a
<meta> tag in an HTML document, for instance, then it is impossible to
place an absolute bound on how far into the file the relevant tag can
occur (though you could probably choose a bound that in practice meets
your needs).
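
To make the bound in (1) concrete, here is a minimal sketch of the mark()/reset() approach with an explicit limit. The class name PrefixPeek and the 4096-byte constant are hypothetical, picked only for illustration; nothing in this thread prescribes them.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

final class PrefixPeek {
    // Hypothetical bound, chosen for illustration only.
    static final int SNIFF_LIMIT = 4096;

    /**
     * Reads up to 'limit' bytes from a mark-capable stream and then rewinds
     * it, so a Reader created afterwards sees the stream from the start.
     * Encoding information located beyond 'limit' bytes cannot be found.
     */
    static byte[] peek(InputStream in, int limit) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(limit);                 // mark stays valid for 'limit' bytes
        byte[] buf = new byte[limit];
        int total = 0;
        while (total < limit) {
            int n = in.read(buf, total, limit - total);
            if (n < 0) break;           // end of stream before the limit
            total += n;
        }
        in.reset();                     // rewind to the marked position
        return Arrays.copyOf(buf, total);
    }
}
```

Typical use would be something like PrefixPeek.peek(new BufferedInputStream(raw), PrefixPeek.SNIFF_LIMIT), followed by whatever inspection of the returned bytes you settle on.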

(2) Casting bytes to chars cannot be relied upon to work correctly for
any multibyte or variable-length encoding (e.g. UTF-8, and especially
UTF-16). For UTF-16 with a byte-order mark, you may be able to guess
the encoding from the first two bytes, without worrying about chars,
though you would then want to _discard_ those bytes. UTF-8
coincides with ASCII and all the ISO-8859-X encodings over the first
128 code points, so as long as no encoded non-ASCII characters appear
in the stream before whatever information you will use to
determine the encoding, the cast might nevertheless work for UTF-8.
There is NO correct way to convert bytes to chars without knowing
anything about the encoding.
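
As an illustration of the byte-order-mark case in (2), the following is a rough sketch that checks the leading bytes and discards a recognized BOM. The class and method names are made up for this example, and it deliberately ignores UTF-32 (so a UTF-32LE BOM would be misread as UTF-16LE); treat it as a starting point under those assumptions, not a complete detector.

```java
import java.io.IOException;
import java.io.PushbackInputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class BomSniffer {
    /**
     * Returns the charset implied by a leading byte-order mark, or null if
     * none is recognized. A recognized BOM is consumed; every other byte is
     * pushed back. The stream needs a pushback capacity of at least 3 bytes.
     */
    static Charset consumeBom(PushbackInputStream in) throws IOException {
        byte[] b = new byte[3];
        int n = 0;
        while (n < b.length) {
            int r = in.read(b, n, b.length - n);
            if (r < 0) break;
            n += r;
        }
        // Mask with 0xFF so bytes compare as unsigned values.
        int b0 = n > 0 ? b[0] & 0xFF : -1;
        int b1 = n > 1 ? b[1] & 0xFF : -1;
        int b2 = n > 2 ? b[2] & 0xFF : -1;
        if (b0 == 0xEF && b1 == 0xBB && b2 == 0xBF) {
            return StandardCharsets.UTF_8;          // all 3 bytes discarded
        }
        Charset found = null;
        if (b0 == 0xFE && b1 == 0xFF) found = StandardCharsets.UTF_16BE;
        else if (b0 == 0xFF && b1 == 0xFE) found = StandardCharsets.UTF_16LE;
        int discard = (found == null) ? 0 : 2;      // BOM bytes to drop
        if (n > discard) in.unread(b, discard, n - discard);
        return found;
    }
}
```

If a charset comes back, the same stream can be wrapped with new InputStreamReader(in, charset); if null comes back, nothing has been consumed and some other method is needed.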

Wibble · May 27, 2005

XML parsers do something like that.

Firstly, don't cast to char; leave it as bytes. Otherwise
you may get into trouble with sign extension unless you're
careful.
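
The sign-extension trap can be shown in a few lines. This is only a demonstration with made-up method names; masking gives the ISO-8859-1 reading of the byte, which is still just a guess until the real encoding is known.

```java
final class ByteToChar {
    /** Plain cast: the byte is sign-extended before being narrowed to char. */
    static char naive(byte b) {
        return (char) b;
    }

    /** Masking first keeps the value in the range 0..255. */
    static char masked(byte b) {
        return (char) (b & 0xFF);
    }
}
```

For the byte 0xE9, naive() gives the char 0xFFE9 while masked() gives 0x00E9. Bytes below 0x80 come out the same either way, which is why the problem is easy to miss when testing with ASCII only.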

XML parsers look at the <?xml prefix of every message
and see if it's 8- or 16-bit encoded. Then they scan
for the specific encoding in the header, which will not
have any non-8-bit chars up to that point. Once the
encoding is parsed, the rest of the document may be
read. Be careful, because XML and Java name encodings
differently.
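
A rough sketch of that scanning step, assuming the 8-bit case has already been established and the prefix is ASCII-compatible. The class name and the regular expression are my own illustration and far looser than a real XML parser. For the naming difference, Charset.forName() accepts IANA-style names and their registered aliases, so that is what the sketch leans on; names the JVM does not know come back as null.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class XmlDeclEncoding {
    // Simplified pattern for the encoding pseudo-attribute of an XML declaration.
    private static final Pattern ENCODING = Pattern.compile(
        "^<\\?xml[^>]*?\\sencoding\\s*=\\s*([\"'])([A-Za-z][A-Za-z0-9._-]*)\\1");

    /** Returns the declared encoding name, or null if none is found. */
    static String declaredName(byte[] prefix) {
        // ISO-8859-1 maps each byte to one char, so no byte is lost or rejected.
        String head = new String(prefix, StandardCharsets.ISO_8859_1);
        Matcher m = ENCODING.matcher(head);
        return m.find() ? m.group(2) : null;
    }

    /** Maps a declared name to a Charset, or null if this JVM does not support it. */
    static Charset toCharset(String name) {
        if (name == null) return null;
        try {
            return Charset.isSupported(name) ? Charset.forName(name) : null;
        } catch (IllegalArgumentException e) {   // illegal charset name
            return null;
        }
    }
}
```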

You can probably generalize the 8-bit versus 16-bit check to HTML.
You then have to scan a bit for the encoding declaration in the header,
which is not mandatory.

HK · May 27, 2005

HK said:
Suppose you are faced with a java.io.InputStream
and it is supposed to carry either HTML or XML.
Ultimately you want to read with a Reader and the
correct encoding, of course.

[...]

Thanks for the answers which showed me that I did
not fully understand the complexity of the
problem. I actually thought that up until the
encoding information the stream had to be ASCII
or UTF-8 anyway. Now I read the fine manual:

http://www.w3.org/TR/REC-xml/#sec-guessing-no-ext-info

It has all that is needed for XML, at least.
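
For the record, here is how I read the no-byte-order-mark part of that appendix, as a small sketch. The class and enum names are mine, and the result is only an encoding family: the XML declaration still has to be read afterwards to get the exact encoding.

```java
final class XmlFamilyGuess {
    enum Family { UCS4_BE, UCS4_LE, UTF16_BE, UTF16_LE, ASCII_COMPATIBLE, EBCDIC, UNKNOWN }

    /** Classifies the first four bytes of a document that starts with "<?xm" or "<". */
    static Family guess(byte[] b) {
        if (b.length < 4) return Family.UNKNOWN;
        // Pack the four bytes into one int, masking to avoid sign extension.
        int v = ((b[0] & 0xFF) << 24) | ((b[1] & 0xFF) << 16)
              | ((b[2] & 0xFF) << 8) | (b[3] & 0xFF);
        switch (v) {
            case 0x0000003C: return Family.UCS4_BE;          // "<" as 4 bytes
            case 0x3C000000: return Family.UCS4_LE;
            case 0x003C003F: return Family.UTF16_BE;         // "<?" as 2x2 bytes
            case 0x3C003F00: return Family.UTF16_LE;
            case 0x3C3F786D: return Family.ASCII_COMPATIBLE; // "<?xm"
            case 0x4C6FA794: return Family.EBCDIC;           // "<?xm" in EBCDIC
            default:         return Family.UNKNOWN;          // spec: assume UTF-8
        }
    }
}
```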

Harald.

Dale King · May 31, 2005

HK said:
Suppose you are faced with a java.io.InputStream
and it is supposed to carry either HTML or XML.
Ultimately you want to read with a Reader and the
correct encoding, of course.

I believe XML is supposed to be UTF-8 unless it specifies otherwise
using an encoding attribute. But your XML parser should handle all of
that for you.
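
A minimal sketch of what I mean, assuming the JAXP parser that ships with the JDK: hand the parser the raw InputStream rather than a Reader, so it does the detection itself. The wrapper class name is just for illustration.

```java
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

final class ParserDetects {
    /**
     * Parses XML from raw bytes. Because the parser receives bytes rather
     * than chars, it inspects the byte-order mark and the XML declaration
     * on its own to choose the decoding.
     */
    static Document parse(InputStream in) throws Exception {
        return DocumentBuilderFactory.newInstance()
                                     .newDocumentBuilder()
                                     .parse(in);
    }
}
```

This only covers XML; HTML has no equivalent guarantee.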

John C. Bollinger · Jun 6, 2005

Dale said:
I believe XML is supposed to be UTF-8 unless it specifies otherwise
using an encoding attribute. But your XML parser should handle all of
that for you.

If an XML document is not encoded in UTF-8, then its encoding must be
specified in the XML declaration, true. If you don't know from some
external source what the encoding is, however, then you may not be able
to decode the XML declaration to find out. Many common cases can be
handled without too much trouble, but I don't know any universal solution.

Dale King · Jun 7, 2005

John said:
If an XML document is not encoded in UTF-8, then its encoding must be
specified in the XML declaration, true. If you don't know from some
external source what the encoding is, however, then you may not be able
to decode the XML declaration to find out. Many common cases can be
handled without too much trouble, but I don't know any universal solution.

See appendix F of the XML spec.:
http://www.w3.org/TR/2000/REC-xml-20001006#sec-guessing

John C. Bollinger · Jun 7, 2005

Dale said:
See appendix F of the XML spec.:
http://www.w3.org/TR/2000/REC-xml-20001006#sec-guessing

I am well aware of that; I was alluding to it when I wrote that many
common cases can be handled. It doesn't even come close to covering
_all_ the infinitely many possibilities, however. For the sake of
argument only, I point out that no matter what autodetection algorithm
you devise, I can produce an encoding that breaks it. In practice, such
intentionally perverse encodings are less of an issue than possible real
encodings that accidentally happen to confound existing algorithms. It
may be that the procedure described in appendix F suffices for any
particular purpose, but no one should be fooled into thinking that it is
universal.
