If no encoding what then ?

Asger Jørgensen

Hi there

If the XML declaration only contains:

<?xml version="1.0"?>

No encoding is specified, and there is no BOM at the
beginning of the file.

What should the XmlParser do?
Should it report an error, or should it parse the
document in a certain default encoding? If so,
which?

What if the BOM is different from the encoding specified
in the XML declaration?

Thanks in advance
Asger
 
Richard Tobin

Asger Jørgensen said:
If the XML declaration only contains:

<?xml version="1.0"?>

No encoding is specified, and there is no BOM at the
beginning of the file.

What should the XmlParser do?

If there is externally-specified coding information (e.g. in the HTTP
Content-Type header if the document is retrieved by HTTP), then it
should use that.

Otherwise it should assume UTF-8.

Asger Jørgensen said:
What if the BOM is different from the encoding specified
in the XML declaration?

In the absence of external information, it's a well-formedness error
(because the document is not in the encoding specified by the
declaration). If there is external information, then the BOM must
be consistent with it, but the encoding declaration is ignored.

-- Richard
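
The rules above can be sketched as a small Python function. This is only an illustration under stated assumptions, not how any particular parser does it: `pick_encoding`, `BOMS` and `DECL_RE` are names made up for this example, only UTF-8 and UTF-16 BOMs are recognised, and the check that a BOM agrees with external information is left out.

```python
import codecs
import re
from typing import Optional

# BOMs recognised by this sketch (UTF-32 is deliberately left out).
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

# Finds the encoding name inside an XML declaration at the start of the text.
DECL_RE = re.compile(
    r'<\?xml[^>]*?\sencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def pick_encoding(data: bytes, external: Optional[str] = None) -> str:
    """Hypothetical helper: choose the encoding for the raw bytes of a document."""
    # 1. External information (e.g. an HTTP charset) wins; the declaration is ignored.
    if external is not None:
        return external.lower()

    # 2. Look for a BOM and strip it before reading the declaration.
    bom_encoding = None
    rest = data
    for prefix, name in BOMS:
        if data.startswith(prefix):
            bom_encoding, rest = name, data[len(prefix):]
            break

    # Latin-1 maps every byte to a character, so it is safe for peeking
    # at an ASCII-compatible declaration when there is no BOM.
    head = rest[:200].decode(bom_encoding or "latin-1", errors="replace")
    match = DECL_RE.match(head)
    declared = match.group(1).lower() if match else None

    # 3. No BOM: use the declaration, or fall back to UTF-8.
    if bom_encoding is None:
        return declared or "utf-8"

    # 4. BOM present: a declaration that names something else is an error.
    bom_key = bom_encoding.replace("-", "")
    if declared is not None and not bom_key.startswith(declared.replace("-", "")):
        raise ValueError(f"BOM says {bom_encoding}, declaration says {declared}")
    return bom_encoding


print(pick_encoding(b'<?xml version="1.0"?><doc/>'))  # utf-8
```

With the declaration from the original question and no BOM, the function falls through to the UTF-8 default. A UTF-8 BOM followed by a declaration naming ISO-8859-1 raises `ValueError` instead.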
 
Asger Jørgensen

Hi Richard
Thanks for explaining

Richard Tobin said:
In the absence of external information, it's a well-formedness error
(because the document is not in the encoding specified by the
declaration). If there is external information, then the BOM must
be consistent with it, but the encoding declaration is ignored.

Let's see if I understand you correctly.
If the BOM is there, the BOM is the winner and the file should be
encoded according to the BOM?

Thanks again
Kind regards
Asger
 
Richard Tobin

Asger Jørgensen said:
Let's see if I understand you correctly.
If the BOM is there, the BOM is the winner and the file should be
encoded according to the BOM?

In effect yes, because if the BOM was incorrect, then those bytes
wouldn't be a BOM, but some other characters (or illegal for the
encoding), so the file wouldn't be syntactically correct.

-- Richard
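
A quick way to see this, as a sketch using Python's standard `codecs` module: the bytes that form a UTF-8 BOM are just three ordinary characters when read as Latin-1, and the UTF-16 BOM bytes are not legal UTF-8 at all. Either way the document no longer starts with `<?xml`.

```python
import codecs

# The UTF-8 BOM is the byte sequence EF BB BF.
bom = codecs.BOM_UTF8

# Read as UTF-8, the three bytes are one character, U+FEFF.
print(ascii(bom.decode("utf-8")))    # '\ufeff'

# Read as Latin-1, the same bytes are three separate characters
# (U+00EF, U+00BB, U+00BF), which would then sit in front of "<?xml".
print(ascii(bom.decode("latin-1")))  # '\xef\xbb\xbf'

# The UTF-16 little-endian BOM (FF FE) is not valid UTF-8 at all.
try:
    codecs.BOM_UTF16_LE.decode("utf-8")
except UnicodeDecodeError as err:
    print("not UTF-8:", err.reason)
```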
 
Joe Kesselman

http://www.w3.org/TR/REC-xml/#charencoding, inter alia

"In the absence of information provided by an external transport
protocol (e.g. HTTP or MIME), it is a fatal error for an entity
including an encoding declaration to be presented to the XML processor
in an encoding other than that named in the declaration, or for an
entity which begins with neither a Byte Order Mark nor an encoding
declaration to use an encoding other than UTF-8. Note that since ASCII
is a subset of UTF-8, ordinary ASCII entities do not strictly need an
encoding declaration."

However, see also http://www.w3.org/TR/REC-xml/#sec-guessing
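
That appendix describes how a parser can guess the encoding family from the first four bytes when there is no BOM, because every document with a declaration starts with `<?xml`. A rough Python sketch of the idea follows. `guess_family` and the family labels are made up for this example, and the result is only a guess that lets the parser read the declaration.

```python
# First four bytes of "<?xm" in different encoding families, no BOM.
SIGNATURES = [
    (bytes.fromhex("0000003C"), "UCS-4, big-endian"),
    (bytes.fromhex("3C000000"), "UCS-4, little-endian"),
    (bytes.fromhex("003C003F"), "16-bit, big-endian (e.g. UTF-16BE)"),
    (bytes.fromhex("3C003F00"), "16-bit, little-endian (e.g. UTF-16LE)"),
    (bytes.fromhex("3C3F786D"), "ASCII-compatible (e.g. UTF-8, ISO 8859)"),
    (bytes.fromhex("4C6FA794"), "EBCDIC"),
]


def guess_family(data: bytes) -> str:
    """Hypothetical helper: guess the encoding family of a BOM-less document."""
    for signature, family in SIGNATURES:
        if data.startswith(signature):
            return family
    # No "<?xml" recognised: without a declaration the document must be UTF-8.
    return "no declaration found, assume UTF-8"


print(guess_family(b'<?xml version="1.0"?><doc/>'))
print(guess_family('<?xml version="1.0"?><doc/>'.encode("utf-16-be")))
```

Once the family is known, the parser can read far enough to find the encoding declaration and switch to the exact encoding it names.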


You may be starting to realize why we tell folks to use
off-the-shelf parsers rather than rolling their own. Writing something
that seems to work on a trivial case is trivial. Writing something that
actually covers all the common cases is a lot more work. Writing
something that covers all the edge cases is not _BAD_, but it's on the
order of an undergraduate term project rather than a quick hack... and
making it efficient is harder, and adding validation significantly
harder yet.
 
Asger Jørgensen

Hi Joe

Thanks for explaining

Joe Kesselman said:
You may be starting to realize why we tell folks to use
off-the-shelf parsers rather than rolling their own.

I have no problem understanding that you recommend the use of
off-the-shelf parsers, if asked what a person should use.

Joe Kesselman said:
Writing something that seems to work on a trivial case is trivial.

That's what I have done, and it works great. The question I asked this
time was just so that I MIGHT be able to use my parser on other files.

Joe Kesselman said:
Writing something that actually covers all the common cases is a lot more
work.

I have never had any doubt about that, and it was never my goal,
as I also mentioned.

I have learned a lot from this project, thanks to you and others,
and also thanks to the fact that I wasn't scared away by people telling me
that it was close to impossible.

In the thread
"Would a lack of line breaks in a doc cause parsing problems"
you can see that PHP's XmlParser has a bug, and the guy
is stuck with that off-the-shelf parser.
When I experience a bug, I can usually fix it: I have the source. ;-)

Thanks again for your help
Kind regards
Asger

Nothing is wrong and nothing is right.
Things have a funny way of always working out for the best.
 
Andreas Prilop

From: "Asger J?rgensen" <[email protected]>
Subject: If no encoding what then ?
X-Newsreader: Microsoft Outlook Express 6.00.2900.3138

Quite right! Missing encoding (charset)!

You need to set up your newsreader^W Outlook Express correctly

Tools > Options > Send
Mail Sending Format > Plain Text Settings > Message format MIME
News Sending Format > Plain Text Settings > Message format MIME
Encode text using: None

in order to send special, non-ASCII characters such as
1 ¤ = 100 ¢
Æ æ Å å Ø ø
 
Asger Jørgensen

Hi Andreas

Andreas Prilop said:
Quite right! Missing encoding (charset)!

You need to set up your newsreader^W Outlook Express correctly

Tools > Options > Send
Mail Sending Format > Plain Text Settings > Message format MIME
News Sending Format > Plain Text Settings > Message format MIME
Encode text using: None

in order to send special, non-ASCII characters such as
1 ¤ = 100 ¢
Æ æ Å å Ø ø

What you suggest doesn't make much sense to me. I'm not saying that you are wrong,
but removing any encoding would suggest to me that the message is sent in
the local codepage, and if you live in Europe that would be OK for you
but not for people in the rest of the world.
My newsreader is set up to send in Unicode, and it seems to work for most
of the people I communicate with.
I have noticed that my name is wrong in your reply, but that's not the case
with the other posters.

Maybe you should try Unicode instead of None in the encoding settings.

But if others report difficulties reading my posts (my name),
I will be happy to change to no encoding.

Kind regards
Asger
 
Andreas Prilop

Asger said:
My newsreader is set up to send in Unicode, and it seems to work for most
of the people I communicate with.

No, it isn't:
Æ AE
æ ae
Å A ring
å a ring
Ø O slash
ø o slash
€ euro
¢ cent

Quote this!
 
Richard Tobin

Asger said:
But if others report difficulties reading my posts (my name)

Your post appears to be in Latin-1, but there's no encoding specified
in the headers. That works fine for me, since my newsreader knows
nothing of encodings, and my terminal assumes Latin-1.

-- Richard
 
Joseph Kesselman

Uhm... Newsreader configuration is sorta offtopic, right? I'd suggest
folks take it offline or to a newsgroup dedicated to that newsreader,
unless it's germane to understanding a particular post.
 
Guest

Hi Andreas

Sorry about that, I misread it:
it said Uuencoding, which I read as Unicode.. ;-)

And yes, Joseph, it is off topic, but please cut your fellow man a little
slack..

Kind regards
Asger
 
Joseph Kesselman

Asger said:
And yes, Joseph, it is off topic, but please cut your fellow man a little
slack..

I did. I waited a while before I posted the suggestion to take it
elsewhere. It didn't look like it was winding down, so...

Seriously: We all get most value out of the newsgroups when they're all
kept on topic. Nobody's enforcing that (except on an individual level by
ignoring people who they think are abusing the system), but it really is
in everyone's best interest to make an effort to avoid digressing too
far, too long.
 
