Expanded Entities Not In Document Encoding - Shouldn't This Be A Parse Error?

MaggotChild

Parsers usually raise an error if the input contains a byte that is not
valid in the document's stated encoding.
I have an ISO-8859-1 document that contains entities representing
several non-8859-1 chars. When these entities are expanded, the
document is no longer in the given encoding, but there is no error
from the parser. Is this in accordance with the XML spec?

The parser is libxml2 (via Perl interface).
 
Richard Tobin

MaggotChild said:
I have an ISO-8859-1 document that contains entities representing
several non-8859-1 chars. When these entities are expanded, the
document is no longer in the given encoding

Once the entities are expanded, the document isn't in an encoding at
all. It's just Unicode characters.

An XML document can contain any (legal) Unicode characters, regardless
of the encoding it's written in. One of the main purposes of character
references is so that you aren't limited to the characters in your
encoding.
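
To make that concrete, here is a minimal Python sketch. It assumes the standard-library xml.etree.ElementTree parser rather than the libxml2/Perl setup from the question, and the sample document and the names raw, root, text and out are made up for illustration. It should show the same behaviour in outline: the parse succeeds, and the tree holds the code point U+20AC.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the bytes are ISO-8859-1 (the e-acute is the single
# byte 0xE9), while the euro sign, which ISO-8859-1 cannot encode, is written
# as the character reference &#x20AC;.
raw = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>'
    "<price>caf\u00e9 &#x20AC;5</price>"
).encode("iso-8859-1")

root = ET.fromstring(raw)  # parses without error
text = root.text           # a Unicode string, no longer tied to ISO-8859-1

# In the tree, the expanded reference is simply the code point U+20AC.
has_euro = any(ord(ch) == 0x20AC for ch in text)

# The expanded text can no longer be encoded directly as ISO-8859-1...
try:
    text.encode("iso-8859-1")
    encodable = True
except UnicodeEncodeError:
    encodable = False

# ...so a serializer targeting ISO-8859-1 has to emit a character reference again.
out = ET.tostring(root, encoding="iso-8859-1")
```

The encoding only matters at the edges (reading bytes in, writing bytes out); in between, the tree is encoding-free.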

-- Richard
 
MaggotChild

Richard Tobin said:
Once the entities are expanded, the document isn't in an encoding at
all.  It's just Unicode characters.

So, in general, this means that if my language does not support wide
chars, and I want to check for such a char in the parse tree, I need to
look for the Unicode code point?

Thanks
 
Joe Kesselman

MaggotChild said:
So, in general, this means that if my language does not support wide
chars, and I want to check for such a char in the parse tree, I need to
look for the Unicode code point?

That's correct. The XML APIs (DOM, SAX, and as far as I know all the
others) use Unicode internally, usually as strings of UTF-16 code units
(libxml2 itself uses UTF-8).

Conceptually, the first thing the parser does is convert from your
actual source encoding to Unicode. Then, as it scans through the Unicode
representation of your document, it handles any <![CDATA[]]> sections,
entity references and numeric character references. (The latter simply
turn into the corresponding Unicode character, of course.)
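
The two conceptual stages can be sketched in a few lines of Python. This is only an illustration of the order of operations, not how any real parser is implemented: decode_then_expand and _CHAR_REF are hypothetical names, and the sketch handles numeric character references only (no named entities, CDATA sections or markup).

```python
import re

# Matches &#xHHHH; (hexadecimal) or &#DDDD; (decimal) character references.
_CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def decode_then_expand(raw: bytes, encoding: str) -> str:
    """Hypothetical helper: decode bytes first, then expand character references."""
    # Stage 1: source encoding -> Unicode. Bytes that are invalid in the
    # stated encoding would fail here, before any reference is looked at.
    text = raw.decode(encoding)

    # Stage 2: each numeric reference becomes the code point it names,
    # whether or not the source encoding could have represented it.
    def expand(match: re.Match) -> str:
        hex_digits, dec_digits = match.groups()
        return chr(int(hex_digits, 16) if hex_digits else int(dec_digits))

    return _CHAR_REF.sub(expand, text)


expanded = decode_then_expand(b"caf\xe9 &#x20AC;5 &#8364;", "iso-8859-1")

# Checking for a character afterwards means comparing code points...
found_by_code_point = any(ord(ch) == 0x20AC for ch in expanded)
# ...or, in a byte-oriented language, searching for its UTF-8 byte sequence.
found_by_utf8_bytes = b"\xe2\x82\xac" in expanded.encode("utf-8")
```

Because the reference is expanded after decoding, the source encoding never gets a say in which characters a reference may name.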

--
Joe Kesselman,
http://www.love-song-productions.com/people/keshlam/index.html

{} ASCII Ribbon Campaign | "may'ron DaroQbe'chugh vaj bIrIQbej" --
/\ Stamp out HTML mail! | "Put down the squeezebox & nobody gets hurt."