Question about Character Set

ssk

Hello!

This might be a dumb question.

An XML file starts with a line like the following line.
<?xml version="1.0" encoding="ISO-8859-1"?>
So an application knows what encoding the file is.
However, how does an application read the first line without knowing
what encoding it is?

That is...
To know what encoding it is, it should read the first line.
To read the first line, it should know what encoding it is.

Isn't this a chicken and egg issue?
Am I missing an important point?

TIA.
Sam
 

Jürgen Kahrs

> This might be a dumb question.

No, this is not a dumb question.

> To know what encoding it is, it should read the first line.
> To read the first line, it should know what encoding it is.
>
> Isn't this a chicken and egg issue?

Yes, this is a chicken and egg problem.
The problem goes even deeper when you
consider files which are encoded in UTF-16.
This is a very readable explanation:

http://safari.oreilly.com/?x=1&mode...&t=1&c=1&u=1&r=&o=1&n=1&d=1&p=1&a=0&srchText=
 

Toni Uusitalo

> Hello!
>
> This might be a dumb question.
>
> An XML file starts with a line like the following line.
> <?xml version="1.0" encoding="ISO-8859-1"?>
> So an application knows what encoding the file is.
> However, how does an application read the first line without knowing
> what encoding it is?
>
> That is...
> To know what encoding it is, it should read the first line.
> To read the first line, it should know what encoding it is.

quote from:
http://www.w3c.org/TR/2004/REC-xml-20040204/#sec-guessing
"Because the contents of the encoding declaration are restricted to
characters from the ASCII repertoire (however encoded), a processor can
reliably read the entire encoding declaration as soon as it has detected
which family of encodings is in use."

"(however encoded)" means the characters may still be stored as 16-bit or
32-bit code units (see also the "Without a Byte Order Mark" table), but every
character in the declaration is in the ASCII range.

> Isn't this a chicken and egg issue?
> Am I missing an important point?

It's not that complicated. Clever stuff from the W3C XML Working Group, though.
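
To make the "(however encoded)" point concrete, here is a rough Python sketch, not anything taken from the spec. It assumes the encoding family has already been detected and that a provisional codec from that family is passed in. The names `declared_encoding` and `provisional_codec` are made up for illustration, and the 400-byte limit is an arbitrary assumption.

```python
import re

# Matches the encoding pseudo-attribute inside an XML declaration.
# The name pattern follows the spec's EncName rule: a letter, then
# letters, digits, '.', '_' or '-'.
_DECL = re.compile(
    r'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def declared_encoding(data: bytes, provisional_codec: str):
    """Return the declared encoding name, or None if there is none.

    provisional_codec only has to belong to the right family. The
    declaration is pure ASCII, so any byte the codec cannot decode
    must lie outside the declaration and is simply replaced.
    """
    # 400 bytes is an arbitrary guess at "long enough for a declaration".
    head = data[:400].decode(provisional_codec, errors="replace")
    match = _DECL.match(head)
    return match.group(1) if match else None

latin1_doc = '<?xml version="1.0" encoding="ISO-8859-1"?><a>\xe9</a>'.encode("iso-8859-1")
print(declared_encoding(latin1_doc, "ascii"))  # ISO-8859-1
```

Plain `ascii` is enough to read the declaration of a Latin-1 file. After that the processor would start over (or carry on) with the codec the declaration names.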
with respect,
Toni Uusitalo
 

ssk

Thank you for the answer.
I have a question.
See in-line.

Toni said:
> quote from:
> http://www.w3c.org/TR/2004/REC-xml-20040204/#sec-guessing
> "Because the contents of the encoding declaration are restricted to
> characters from the ASCII repertoire (however encoded), a processor can
> reliably read the entire encoding declaration as soon as it has detected
> which family of encodings is in use."
>
> (however encoded) means that it still can be 16-bit or 32-bit character (see
> also "Without a Byte Order Mark" table) but all the characters in the
> declaration are in the ascii range of course.

I understand it.
But what about UCS-2?
It uses 2 bytes for every character, including ASCII characters.
Well, I don't think I've seen UCS-2 used as an encoding yet.
But it's one of the encodings, right?

> Not that complicated. Clever stuff from W3C XML Working Group however.
> with respect,
> Toni Uusitalo

Thanks again.
Sam
 

Richard Tobin

> However, how does an application read the first line without knowing
> what encoding it is?

Very, very carefully...

Since there must be an encoding declaration unless it's UTF-8 (or
UTF-16, which has to start with a byte-order mark), the first bytes
must either be a byte-order mark or the characters "<?xml ", or else
you can assume UTF-8. So you can look at those first few bytes and
determine the possibilities. Since the encoding declaration is limited
to characters in the ASCII set, you don't have to know whether it's
Latin-1 or Latin-5 or some proprietary Microsoft encoding to read it.
Likewise, it won't matter which version of EBCDIC it is if you have to
deal with that.
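
As a rough illustration of that "look at the first few bytes" step, here is a minimal Python sketch. The byte patterns are the common ones from the detection tables in the XML 1.0 spec appendix on autodetection. The function name `sniff_family` and the family labels it returns are made up for this example, and the unusual UCS-4 byte orders the spec also lists are left out.

```python
# Patterns for entities that start with a byte-order mark. The 4-byte
# UTF-32 marks come first because the UTF-32LE mark begins with the
# same two bytes as the UTF-16LE mark.
WITH_BOM = [
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
    (b"\xef\xbb\xbf", "utf-8"),
]

# Patterns for entities without a mark: "<" or "<?" or "<?xm" as it
# would appear in each family.
WITHOUT_BOM = [
    (b"\x00\x00\x00\x3c", "utf-32-be"),
    (b"\x3c\x00\x00\x00", "utf-32-le"),
    (b"\x00\x3c\x00\x3f", "utf-16-be"),
    (b"\x3c\x00\x3f\x00", "utf-16-le"),
    (b"\x3c\x3f\x78\x6d", "ascii-compatible"),
    (b"\x4c\x6f\xa7\x94", "ebcdic"),
]

def sniff_family(head: bytes) -> str:
    """Guess the encoding family from the first bytes of an XML entity."""
    for prefix, family in WITH_BOM + WITHOUT_BOM:
        if head.startswith(prefix):
            return family
    # No mark and no "<?xml": the entity has no declaration, so UTF-8.
    return "utf-8"

print(sniff_family(b'<?xml version="1.0" encoding="ISO-8859-1"?>'))
# ascii-compatible
```

The result only names a family. The exact encoding still comes from the declaration, which can now be read because every character in it is ASCII.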

-- Richard
 

Toni Uusitalo

> But what about UCS-2?
> It uses 2 bytes for all characters including ASCII characters.
> Well, I don't think I've seen UCS-2 used for encoding yet.
> But that's one of encodings, right?

It's mentioned in the spec's "Without a Byte Order Mark" table as
ISO-10646-UCS-2. The encoding detection process is essentially the same
as for UTF-16 without a BOM. Quote from that table:
"UTF-16BE or big-endian ISO-10646-UCS-2 or other encoding with a 16-bit code
unit in big-endian order and ASCII characters encoded as ASCII values (the
encoding declaration must be read to determine which)"
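
A small Python sketch of what "ASCII characters encoded as ASCII values" looks like with a 16-bit code unit. Python has no codec literally named UCS-2, so this assumes `utf-16-be` as a stand-in, which gives the same bytes for these characters. The helper name `encode_16bit_be` is made up for the example.

```python
def encode_16bit_be(text: str) -> bytes:
    """Encode text as big-endian 16-bit code units, without a BOM.

    Stand-in for big-endian UCS-2: for characters in the Basic
    Multilingual Plane, UTF-16BE produces identical bytes.
    """
    return text.encode("utf-16-be")

decl = '<?xml version="1.0" encoding="ISO-10646-UCS-2"?>'
raw = encode_16bit_be(decl)

high_bytes = raw[0::2]  # first byte of every 16-bit unit
low_bytes = raw[1::2]   # second byte of every 16-bit unit

print(raw[:4].hex(" "))                   # 00 3c 00 3f
print(set(high_bytes) == {0})             # True
print(low_bytes == decl.encode("ascii"))  # True
```

So every character of the declaration does take 2 bytes, but one of them is always zero and the other is the plain ASCII value. That is why `00 3C 00 3F` is enough to recognise the family before the encoding name is known.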

with respect,
Toni Uusitalo
 
