If no encoding what then ?

Asger Jørgensen · Sep 29, 2007

Hi there

If the XML declaration contains only:

<?xml version="1.0"?>

No encoding is specified, and there is no BOM at the
beginning of the file.

What should the XmlParser do?
Should it report an error, or should it parse the
document in a certain default encoding? If so,
which?

What if the BOM is different from the encoding specified
in the XML declaration?

Thanks in advance
Asger

Richard Tobin · Sep 29, 2007

Asger Jørgensen said:
If the XML declaration contains only:

<?xml version="1.0"?>

No encoding is specified, and there is no BOM at the
beginning of the file.

What should the XmlParser do?

If there is externally specified encoding information (e.g. in the HTTP
Content-Type header if the document is retrieved by HTTP), then it
should use that.

Otherwise it should assume UTF-8.

What if the BOM is different from the encoding specified
in the XML declaration?

In the absence of external information, it's a well-formedness error
(because the document is not in the encoding specified by the
declaration). If there is external information, then the BOM must
be consistent with it, but the encoding declaration is ignored.

-- Richard
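
The precedence described above (external information first, then a BOM, then the encoding declaration, then UTF-8) can be sketched in Python. This is only an illustration under stated assumptions, not a conforming implementation: the function name `choose_encoding` and its `external` parameter are hypothetical, the declaration lookup only works when the bytes are ASCII-compatible, and no check is made that a BOM and a declaration agree.

```python
import codecs
import re
from typing import Optional

# BOMs are checked longest-first so that the UTF-32 LE BOM (FF FE 00 00)
# is not mistaken for the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

# Simplified pattern for encoding="..." inside the XML declaration.
# Assumption: the document is in an ASCII-compatible encoding.
_DECL = re.compile(
    rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def choose_encoding(data: bytes, external: Optional[str] = None) -> str:
    """Pick an encoding name for the raw bytes of an XML entity."""
    if external:                      # e.g. charset from an HTTP Content-Type
        return external.lower()
    for bom, name in _BOMS:           # a BOM, if present
        if data.startswith(bom):
            return name
    match = _DECL.match(data)         # the encoding declaration, if present
    if match:
        return match.group(1).decode("ascii").lower()
    return "utf-8"                    # no BOM, no declaration: assume UTF-8


print(choose_encoding(b'<?xml version="1.0"?><a/>'))
```

For the document in the original question (no BOM, no encoding declaration, no external information) the sketch falls through to the last line and returns UTF-8.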

Asger Jørgensen · Sep 29, 2007

Hi Richard
Thanks for explaining

Richard Tobin said:
In the absence of external information, it's a well-formedness error
(because the document is not in the encoding specified by the
declaration). If there is external information, then the BOM must
be consistent with it, but the encoding declaration is ignored.

Let's see if I understand you correctly.
If the BOM is there, the BOM is the winner and the file should be
encoded according to the BOM?

Thanks again
Kind regards
Asger

Richard Tobin · Sep 29, 2007

Let's see if I understand you correctly.
If the BOM is there, the BOM is the winner and the file should be
encoded according to the BOM?

In effect yes, because if the BOM was incorrect, then those bytes
wouldn't be a BOM, but some other characters (or illegal for the
encoding), so the file wouldn't be syntactically correct.

-- Richard
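
A small Python sketch can illustrate this point. The helper name `starts_like_xml` is hypothetical, and the check (does the decoded text begin with `<`?) is a deliberately crude stand-in for real well-formedness checking. Decoding the bytes in an encoding that does not match the BOM either turns the BOM into ordinary characters ahead of the declaration or fails outright.

```python
import codecs


def starts_like_xml(data: bytes, encoding: str) -> bool:
    """Decode `data` as `encoding`; report whether the text begins with '<'
    once a single leading U+FEFF (a decoded BOM) has been skipped."""
    try:
        text = data.decode(encoding)
    except UnicodeDecodeError:
        return False                  # bytes are illegal for this encoding
    if text.startswith("\ufeff"):
        text = text[1:]
    return text.startswith("<")


# UTF-8 BOM, but the declaration claims ISO-8859-1.
doc8 = codecs.BOM_UTF8 + b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>'
print(starts_like_xml(doc8, "utf-8"))        # BOM is recognised and skipped
print(starts_like_xml(doc8, "iso-8859-1"))   # BOM bytes decode to three
                                             # ordinary characters before '<'

# UTF-16 LE BOM, but the declaration claims UTF-8.
doc16 = codecs.BOM_UTF16_LE + '<?xml version="1.0" encoding="UTF-8"?><a/>'.encode("utf-16-le")
print(starts_like_xml(doc16, "utf-16"))      # BOM selects the byte order
print(starts_like_xml(doc16, "utf-8"))       # 0xFF is not valid UTF-8
```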

Joe Kesselman · Sep 29, 2007

http://www.w3.org/TR/REC-xml/#charencoding, inter alia

"In the absence of information provided by an external transport
protocol (e.g. HTTP or MIME), it is a fatal error for an entity
including an encoding declaration to be presented to the XML processor
in an encoding other than that named in the declaration, or for an
entity which begins with neither a Byte Order Mark nor an encoding
declaration to use an encoding other than UTF-8. Note that since ASCII
is a subset of UTF-8, ordinary ASCII entities do not strictly need an
encoding declaration."

However, see also http://www.w3.org/TR/REC-xml/#sec-guessing
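
As a rough illustration of the kind of guessing that section describes for documents without a BOM, the first four bytes of `<?xm` look different in each encoding family. The Python sketch below is an assumption-laden simplification: the name `guess_family` and the family labels are hypothetical, BOM handling is omitted, and a real parser would still go on to read the encoding declaration once the family is known.

```python
# Byte patterns for the start of "<?xml" when there is no BOM.
_PATTERNS = [
    (b"\x00\x00\x00<", "utf-32-be"),
    (b"<\x00\x00\x00", "utf-32-le"),
    (b"\x00<\x00?", "utf-16-be"),
    (b"<\x00?\x00", "utf-16-le"),
    (b"<?xm", "ascii-compatible"),    # UTF-8, ISO 8859-x, etc.
    (b"\x4c\x6f\xa7\x94", "ebcdic"),
]


def guess_family(data: bytes) -> str:
    """Guess the encoding family from the first four bytes of an entity."""
    head = data[:4]
    for pattern, family in _PATTERNS:
        if head == pattern:
            return family
    # No declaration and no recognisable pattern: UTF-8 is the default.
    return "utf-8"


print(guess_family('<?xml version="1.0"?>'.encode("utf-16-be")))
```

An "ascii-compatible" result only says that the declaration can be read byte by byte; the actual encoding still comes from the declaration, or is UTF-8 if none is given.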

You may be starting to realize why we tell folks to use
off-the-shelf parsers rather than rolling their own. Writing something
that seems to work on a trivial case is trivial. Writing something that
actually covers all the common cases is a lot more work. Writing
something that covers all the edge cases is not _BAD_, but it's on the
order of an undergraduate term project rather than a quick hack... and
making it efficient is harder, and adding validation significantly
harder yet.

Asger Jørgensen · Sep 30, 2007

Hi Joe

Thanks for explaining

Joe Kesselman said:
You may be starting to realize why we tell folks to use
off-the-shelf parsers rather than rolling their own.

I have no problem understanding that you recommend the use of
off-the-shelf parsers when asked what a person should use.

Writing something that seems to work on a trivial case is trivial.

That's what I have done, and it works great. The question I asked this
time was just so that I MIGHT be able to use my parser on other files.

Writing something that actually covers all the common cases is a lot more
work.

I have never had any doubt about that, and it was never my goal,
as I also mentioned.

I have learned a lot from this project, thanks to you and others,
and also thanks to the fact that I wasn't scared away by people telling me
that it was close to impossible.

In the thread
"Would a lack of line breaks in a doc cause parsing problems"
you can see that PHP's XmlParser has a bug and the guy
is stuck with that off-the-shelf parser.
When I experience a bug, I can usually fix it; I have the source. ;-)

Thanks again for your help
Kind regards
Asger

Nothing is wrong and nothing is right.
Things have a funny way of always working out for the best.

Andreas Prilop · Oct 1, 2007

From: "Asger J?rgensen" <[email protected]>
Subject: If no encoding what then ?
X-Newsreader: Microsoft Outlook Express 6.00.2900.3138

Quite right! Missing encoding (charset)!

You need to set up your newsreader^W Outlook Express correctly

Tools > Options > Send
Mail Sending Format > Plain Text Settings > Message format MIME
News Sending Format > Plain Text Settings > Message format MIME
Encode text using: None

in order to send special, non-ASCII characters such as
1 ¤ = 100 ¢
Æ æ Å å Ø ø

Asger Jørgensen · Oct 1, 2007

Hi Andreas

Andreas Prilop said:
Quite right! Missing encoding (charset)!

You need to set up your newsreader^W Outlook Express correctly

Tools > Options > Send
Mail Sending Format > Plain Text Settings > Message format MIME
News Sending Format > Plain Text Settings > Message format MIME
Encode text using: None

in order to send special, non-ASCII characters such as
1 ¤ = 100 ¢
Æ æ Å å Ø ø

What you suggest doesn't make much sense. I'm not saying that you are wrong,
but removing any encoding would suggest to me that the message is sent in
the local code page, and if you live in Europe that would be OK for you
but not for people in the rest of the world.
My newsreader is set up to send in Unicode, and it seems to work for most
of the people I communicate with.
I have noticed that my name is wrong in your reply, but that's not the case
with the other posters.

Maybe you should try Unicode instead of None in the encoding settings.

But if others report difficulties reading my posts (my name),
I will be happy to change to no encoding.

Kind regards
Asger

Andreas Prilop · Oct 1, 2007

Asger said:
My newsreader is set up to send in Unicode, and it seems to work for most
of the people I communicate with.

No, it isn't:
Ã† AE
Ã¦ ae
Ã… A ring
Ã¥ a ring
Ã˜ O slash
Ã¸ o slash
â‚¬ euro
Â¢ cent

Quote this!
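
The garbled pairs above are what UTF-8 bytes look like when they are read in a single-byte Western encoding. A short Python sketch reproduces them, assuming the reading side uses Windows-1252; that code page is an assumption, since the thread does not name it, and the helper name `mojibake` is hypothetical.

```python
def mojibake(text: str) -> str:
    """Encode as UTF-8, then misread the bytes as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


for ch in "ÆæÅåØø€¢":
    print(mojibake(ch), ch)

# The damage is reversible as long as no bytes were lost along the way.
repaired = mojibake("Jørgensen").encode("cp1252").decode("utf-8")
print(repaired)
```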

Richard Tobin · Oct 1, 2007

But if others report difficulties reading my posts (my name)

Your post appears to be in Latin-1, but there's no encoding specified
in the headers. That works fine for me, since my newsreader knows
nothing of encodings, and my terminal assumes Latin-1.

-- Richard

Joseph Kesselman · Oct 1, 2007

Uhm... Newsreader configuration is sorta offtopic, right? I'd suggest
folks take it offline or to a newsgroup dedicated to that newsreader,
unless it's germane to understanding a particular post.

Guest · Oct 1, 2007

Hi Andreas

Sorry about that, I misread the setting:
it said Uuencoding, which I read as Unicode. ;-)

And yes, Joseph, it is off topic, but please cut your fellow man a little
slack.

Kind regards
Asger

Joseph Kesselman · Oct 1, 2007

Asger said:
And yes, Joseph, it is off topic, but please cut your fellow man a little
slack.

I did. I waited a while before I posted the suggestion to take it
elsewhere. It didn't look like it was winding down, so...

Seriously: We all get most value out of the newsgroups when they're all
kept on topic. Nobody's enforcing that (except on an individual level by
ignoring people who they think are abusing the system), but it really is
in everyone's best interest to make an effort to avoid digressing too
far, too long.
