the right way to detect encoding used in InputStream carrying HTML or XML

HK

Suppose you are faced with a java.io.InputStream
and it is supposed to carry either HTML or XML.
Ultimately you want to read with a Reader and the
correct encoding, of course.

Is the following a correct strategy:

1) Wrap the InputStream into a BufferedInputStream
to make sure mark() and reset() work.

2) Read single bytes from it up to some reasonable limit
and convert them to characters by simple casting:

char ch = (char)the_byte_I_read;

3) check for encoding, e.g. with regexp
4) call reset() on the BufferedInputStream
5) wrap the BufferedInputStream into a Reader
with the determined encoding
6) Start reading.

What bothers me a bit is the additional
BufferedInputStream in between when the
Reader later has another buffer. I am also
not sure if the cast is the right way to
convert bytes to chars before you know the
encoding.

Comments?
Harald.
 
John C. Bollinger

The Reader does not necessarily have another buffer. As far as I know,
in fact, the only ones that do (in the platform library) are
BufferedReader and its subclass, LineNumberReader. It is generally best
to buffer as close to the source as possible, which is just what you
propose to do.

If encoding information is not provided externally (i.e. in an HTTP
header, or a protocol-dependent default), then determining the encoding
from the content itself is tricky, and differs between XML and HTML.
The details are off-topic for this group, but all involve examining the
initial portion of the byte stream. Some encodings are difficult or
impossible to determine in this way.

I see these problems with your strategy:

(1) Relying on a BufferedInputStream to provide the ability to reset()
the stream puts a fixed upper limit (the buffer size) on how far into
the file the encoding information can be sought. If you get it from a
<meta> tag in an HTML document, for instance, then it is impossible to
place an absolute bound on how far into the file the relevant tag can
occur (though you could probably choose a bound that in practice meets
your needs).

(2) Casting bytes to chars cannot be relied upon to work correctly for
any multibyte or variable-length encoding (e.g. UTF-8, and especially
UTF-16). For UTF-16 with a byte-order mark, you may be able to guess
the encoding from the first two bytes, without worrying about chars,
though there you would thereafter want to _discard_ those bytes. UTF-8
corresponds with ASCII and all the ISO-8859-X encodings over the first
128 code points, so as long as you don't have any encoded, non-ASCII
characters in the stream before whatever information you will use to
determine the encoding, UTF-8 might nevertheless work OK. There is NO
correct way to convert bytes to chars without knowing anything about the
encoding.
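
As an illustration of the byte-order-mark case in point (2), here is a minimal sketch. The class name BomSniffer and the method open() are made up for this example, the fallback charset is something the caller has to supply, and UTF-32 BOMs are deliberately ignored. It inspects at most three raw bytes, never casts them to char, and discards the BOM before handing the stream to a Reader.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class BomSniffer {
    /** Hypothetical helper: returns a Reader positioned just after any BOM. */
    static Reader open(InputStream raw, Charset fallback) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw);
        in.mark(3);                 // we look at no more than 3 bytes
        int b0 = in.read();         // read() yields 0..255, or -1 at end of stream
        int b1 = in.read();
        int b2 = in.read();

        Charset cs;
        int bomLength;
        if (b0 == 0xEF && b1 == 0xBB && b2 == 0xBF) {
            cs = StandardCharsets.UTF_8;
            bomLength = 3;
        } else if (b0 == 0xFE && b1 == 0xFF) {
            cs = StandardCharsets.UTF_16BE;
            bomLength = 2;
        } else if (b0 == 0xFF && b1 == 0xFE) {
            cs = StandardCharsets.UTF_16LE;   // assumption: not a UTF-32LE BOM
            bomLength = 2;
        } else {
            cs = fallback;                    // no BOM: caller's guess is used
            bomLength = 0;
        }

        in.reset();                           // rewind to the marked position
        for (int i = 0; i < bomLength; i++) {
            in.read();                        // discard the BOM bytes
        }
        return new InputStreamReader(in, cs);
    }
}
```

When no BOM is present the fallback is only a guess, which is exactly the limitation described above.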
 
Wibble

XML parsers do something like that.

Firstly, don't cast to char; leave the data as bytes. Otherwise
you may get into trouble with sign extension unless you're
careful.
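
To make the sign-extension point concrete, here is a small sketch (the class and method names are invented for the example). A Java byte is signed, so casting a byte with the high bit set directly to char first widens it to a negative int and then truncates that to 16 bits. Masking with 0xFF avoids this, although the result is still only meaningful if the data really is ISO-8859-1.

```java
class ByteToCharDemo {
    /** The cast from the original question: sign-extends bytes >= 0x80. */
    static char naiveCast(byte b) {
        return (char) b;
    }

    /** Masking first keeps the value in the range 0..255. */
    static char maskedCast(byte b) {
        return (char) (b & 0xFF);
    }

    public static void main(String[] args) {
        byte b = (byte) 0xE9;   // the byte for 'é' in ISO-8859-1
        System.out.printf("naive:  U+%04X%n", (int) naiveCast(b));   // prints U+FFE9
        System.out.printf("masked: U+%04X%n", (int) maskedCast(b));  // prints U+00E9
    }
}
```

Note that InputStream.read() with no arguments already returns an int in the range 0..255, so the problem mainly shows up when you read into a byte[] and cast the array elements.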

XML parsers look at the <?xml prefix of every document
and see whether it is 8- or 16-bit encoded. Then they scan
for the specific encoding in the declaration, which will not
have any non-8-bit chars up to that point. Once the
encoding is parsed, the rest of the document may be
read. Be careful, because XML and Java name encodings
differently.

You can probably generalize this to HTML for 8 vs 16 bit.
You then have to scan a bit for the encoding in the header,
which is not mandatory.
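
A rough sketch of the "scan for the specific encoding" step, assuming you have already established that the stream is in some ASCII-compatible 8-bit encoding. The class XmlDeclScanner and its method are hypothetical, and the regular expression is a simplification, not a full XML declaration parser. The head bytes are decoded as ISO-8859-1 only because that maps every byte to exactly one char without loss. Charset.forName() accepts many IANA names and aliases, but not necessarily every name an XML document may use.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class XmlDeclScanner {
    // Simplified pattern for: <?xml ... encoding="NAME" ... ?>
    private static final Pattern ENCODING = Pattern.compile(
        "^<\\?xml\\s[^>]*?encoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    /**
     * Looks for an encoding name in the first 'length' bytes of a document.
     * Returns UTF-8 (the XML default) when no declaration is found.
     * Charset.forName() throws an IllegalArgumentException subclass if the
     * declared name is malformed or unknown to this JVM.
     */
    static Charset declaredCharset(byte[] head, int length) {
        // ISO-8859-1 is used purely as a lossless byte-to-char mapping here.
        String text = new String(head, 0, length, StandardCharsets.ISO_8859_1);
        Matcher m = ENCODING.matcher(text);
        if (!m.find()) {
            return StandardCharsets.UTF_8;
        }
        return Charset.forName(m.group(1));
    }
}
```

In practice you would fill 'head' from a marked BufferedInputStream, call reset(), and then wrap the stream in an InputStreamReader using the returned charset.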
 
HK

HK said:
Suppose you are faced with a java.io.InputStream
and it is supposed to carry either HTML or XML.
Ultimately you want to read with a Reader and the
correct encoding, of course.
[...]

Thanks for the answers which showed me that I did
not fully understand the complexity of the
problem. I actually thought that up until the
encoding information the stream had to be ASCII
or UTF-8 anyway. Now I read the fine manual:

http://www.w3.org/TR/REC-xml/#sec-guessing-no-ext-info

It has all that is needed for XML, at least.

Harald.
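
For reference, the first-four-bytes check described in that section of the XML spec can be sketched roughly as follows. This is an abbreviated version: the class name and the returned labels are made up for the example, and the unusual UCS-4 byte orders listed in the spec are left out. The result only identifies an encoding family; for the ASCII-compatible case you still have to read the encoding declaration itself.

```java
class XmlEncodingGuess {
    /** Hypothetical helper: classifies a document by its first four bytes. */
    static String guessFamily(byte[] b) {
        if (b.length < 4) {
            return "unknown";
        }
        int b0 = b[0] & 0xFF, b1 = b[1] & 0xFF, b2 = b[2] & 0xFF, b3 = b[3] & 0xFF;
        int first4 = (b0 << 24) | (b1 << 16) | (b2 << 8) | b3;

        switch (first4) {
            case 0x0000FEFF: return "UCS-4BE";           // BOM
            case 0xFFFE0000: return "UCS-4LE";           // BOM
            case 0x0000003C: return "UCS-4BE";           // '<' without BOM
            case 0x3C000000: return "UCS-4LE";
            case 0x003C003F: return "UTF-16BE";          // "<?" without BOM
            case 0x3C003F00: return "UTF-16LE";
            case 0x3C3F786D: return "ASCII-compatible";  // "<?xm": read the declaration
            case 0x4C6FA794: return "EBCDIC";            // "<?xm" in EBCDIC
            default: break;
        }
        // Shorter BOMs, checked after the four-byte patterns above.
        if (b0 == 0xFE && b1 == 0xFF) return "UTF-16BE";
        if (b0 == 0xFF && b1 == 0xFE) return "UTF-16LE";
        if (b0 == 0xEF && b1 == 0xBB && b2 == 0xBF) return "UTF-8";
        return "unknown";  // per the spec, UTF-8 is assumed absent other information
    }
}
```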
 
Dale King

HK said:
Suppose you are faced with a java.io.InputStream
and it is supposed to carry either HTML or XML.
Ultimately you want to read with a Reader and the
correct encoding, of course.

I believe XML is supposed to be UTF-8 unless it specifies otherwise
using an encoding attribute. But your XML parser should handle all of
that for you.
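
As a sketch of what "let the parser handle it" looks like in practice: hand the parser the raw InputStream rather than a Reader, so that it can inspect the bytes and the XML declaration itself. The class name, method name, and sample document below are invented for the example; the JAXP calls are the standard ones.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class ParserDetectsEncoding {
    /** Parses the byte stream directly; the parser works out the encoding. */
    static String rootText(InputStream in) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(in);
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // A document that declares ISO-8859-1 and is really encoded that way.
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><a>caf\u00E9</a>";
        byte[] bytes = xml.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(rootText(new ByteArrayInputStream(bytes)));
    }
}
```

This only helps for XML, of course; for HTML you would need an HTML parser that does the equivalent.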
 
John C. Bollinger

Dale said:
I believe XML is supposed to be UTF-8 unless it specifies otherwise
using an encoding attribute. But your XML parser should handle all of
that for you.

If an XML document is not encoded in UTF-8, then its encoding must be
specified in the XML declaration, true. If you don't know from some
external source what the encoding is, however, then you may not be able
to decode the XML declaration to find out. Many common cases can be
handled without too much trouble, but I don't know any universal solution.
 
Dale King

John said:
If an XML document is not encoded in UTF-8, then its encoding must be
specified in the XML declaration, true. If you don't know from some
external source what the encoding is, however, then you may not be able
to decode the XML declaration to find out. Many common cases can be
handled without too much trouble, but I don't know any universal solution.

See appendix F of the XML spec.:
http://www.w3.org/TR/2000/REC-xml-20001006#sec-guessing
 
John C. Bollinger

Dale said:
See appendix F of the XML spec.:
http://www.w3.org/TR/2000/REC-xml-20001006#sec-guessing

I am well aware of that; I was alluding to it when I wrote that many
common cases can be handled. It doesn't even come close to covering
_all_ the infinitely many possibilities, however. For the sake of
argument only, I point out that no matter what autodetection algorithm
you devise, I can produce an encoding that breaks it. In practice, such
intentionally perverse encodings are less of an issue than possible real
encodings that accidentally happen to confound existing algorithms. It
may be that the procedure described in appendix F suffices for any
particular purpose, but no one should be fooled into thinking that it is
universal.
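
A tiny illustration of why content-based detection cannot be universal, using nothing more exotic than two charsets every JVM is required to support (the class and method names are invented for the example). The same two bytes are perfectly valid in both encodings and simply mean different things, so the bytes alone cannot tell you which one the author intended.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class AmbiguousBytes {
    static String decode(byte[] bytes, Charset cs) {
        return new String(bytes, cs);
    }

    public static void main(String[] args) {
        byte[] bytes = {(byte) 0xC3, (byte) 0xA9};

        String asUtf8 = decode(bytes, StandardCharsets.UTF_8);         // one char, U+00E9
        String asLatin1 = decode(bytes, StandardCharsets.ISO_8859_1);  // two chars, U+00C3 U+00A9

        System.out.println("UTF-8 length:      " + asUtf8.length());   // prints 1
        System.out.println("ISO-8859-1 length: " + asLatin1.length()); // prints 2
    }
}
```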
 
