UTF-8 Character Encodings and "NO-BREAK SPACE" (dec: 202, hex: CA) Character

mrdecav

Hey all,
I have a bizarre problem with a piece of mail (most likely sent by
Outlook) that is in UTF-8 format.

There is a character, coming after spaces, which from looking at a
hexdump of the file, seems to be a CA (decimal: 202). From most UTF-8
documentation I can find, this is an accent circumflex.

In browsers (IE, FF, Safari), this character shows up as an unknown
character, or as the accent circumflex. In a mail browser, however
(Outlook, Apple Mail), the character appears as a "NO-BREAK
WHITESPACE" (just a space visually), or the equivalent of an "&nbsp;".

Some documentation I have found shows this is a NO-BREAK WHITESPACE,
and it is clearly what the intent is. The HTML header and MIME type
of the body part both claim UTF-8 encoding.

Is there something I am missing here? Why does this show up
incorrectly in browsers, or why do mail clients feel compelled to
replace this character, but browsers don't? Is there an easy fix to
this? I am concerned that if I actually strip the CA, I'll break
emails that actually are supposed to have the accent.

The following hex is an example of the issue:
00000250  20 64 65 73 69 67 6e 2e  20 ca 49 0d 0a 68 61 76  | design. ?I..hav|
00000260  65 20 61 20 66 65 77 20  6d 69 6e 6f 72 20 64 65  |e a few minor de|

design. <offending character>I have


Thanks in advance,
Andre de Cavaignac
 
Jukka K. Korpela

I have a bizarre problem with a piece of mail (most likely sent by
Outlook) that is in UTF-8 format.

This sounds like an e-mail problem, not an HTML issue. If the e-mail is in
HTML format or contains an HTML part, then that side of the matter could
relate to HTML, but it can hardly be the primary problem.

To solve the e-mail problem, it's best to consult someone who knows the
e-mail program you are using and give him full access to the e-mail. Of
course he should be someone you really trust, if the message may contain
confidential information.

Without primary data, one can only present speculations.
There is a character, coming after spaces, which from looking at a
hexdump of the file, seems to be a CA (decimal: 202). From most UTF-8
documentation I can find, this is an accent circumflex.

It seems that the secondary data, namely your conclusions drawn from some
work on something that might be primary data, is inherently unreliable. Your
understanding of UTF-8 is all wrong. In UTF-8, no octet > 7F as such means
any character; such octets only appear as part of a multi-octet
representation of a character.
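
For anyone following along, here is a minimal Python sketch of that rule (not part of the original exchange; the helper name `classify_octet` is made up for illustration). It sorts a single octet by the role it can play in UTF-8 data, using the ranges from the UTF-8 definition.

```python
def classify_octet(octet: int) -> str:
    """Return the role a single octet can play in UTF-8 data."""
    if octet <= 0x7F:
        return "single-octet character (ASCII range)"
    if 0x80 <= octet <= 0xBF:
        return "continuation octet"
    if 0xC2 <= octet <= 0xDF:
        return "lead octet of a 2-octet sequence"
    if 0xE0 <= octet <= 0xEF:
        return "lead octet of a 3-octet sequence"
    if 0xF0 <= octet <= 0xF4:
        return "lead octet of a 4-octet sequence"
    return "never valid in UTF-8"  # 0xC0, 0xC1 and 0xF5..0xFF


# The three octets from the hexdump: 20 ca 49
for octet in (0x20, 0xCA, 0x49):
    print(f"{octet:02X}: {classify_octet(octet)}")
```

CA comes out as the lead octet of a two-octet sequence, so the octet after it would have to fall in 80..BF. The dump shows 49 there, which is why CA on its own cannot stand for any character in UTF-8.
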
In browsers (IE, FF, Safari), this character shows up as an unknown
character, or as the accent circumflex.

Why would you use a web browser to display an e-mail? Anyway, it seems that
you used them so that they interpreted the data as ISO-8859-1 encoded, or
something like that.
In a mail browser, however
(Outlook, Apple Mail), the character appears as a "NO-BREAK
WHITESPACE" (just a space visually), or the equivalent of an "&nbsp;".

It's NO-BREAK SPACE. But how can you distinguish it from SPACE just by
looking at it?
The HTML header and MIME type
of the body part both claim UTF-8 encoding.

So what?
Is there something I am missing here?

Yes. And we are missing a description of the real situation, the primary
data.
The following hex is an example of the issue:
00000250  20 64 65 73 69 67 6e 2e  20 ca 49 0d 0a 68 61 76  | design. ?I..hav|

It looks like the data is e.g. ISO-8859-1 encoded. But you are not
describing how you got that dump. It's quite possible that some software you
used performed a character encoding conversion. This means you would not be
looking at the primary data.
 
Andre de Cavaignac

This sounds like an e-mail problem, not an HTML issue. If the e-mail is in
HTML format or contains an HTML part, then that side of the matter could
relate to HTML, but it can hardly be the primary problem.

To solve the e-mail problem, it's best to consult someone who knows the
e-mail program you are using and give him full access to the e-mail. Of
course he should be someone you really trust, if the message may contain
confidential information.

Without primary data, one can only present speculations.


It seems that the secondary data, namely your conclusions drawn from some
work on something that might be primary data, is inherently unreliable. Your
understanding of UTF-8 is all wrong. In UTF-8, no octet > 7F as such means
any character; such octets only appear as part of a multi-octet
representation of a character.


Why would you use a web browser to display an e-mail? Anyway, it seems that
you used them so that they interpreted the data as ISO-8859-1 encoded, or
something like that.


It's NO-BREAK SPACE. But how can you distinguish it from SPACE just by
looking at it?


So what?


Yes. And we are missing a description of the real situation, the primary
data.


It looks like the data is e.g. ISO-8859-1 encoded. But you are not
describing how you got that dump. It's quite possible that some software you
used performed a character encoding conversion. This means you would not be
looking at the primary data.

Hi Yucca,
I appreciate the response.

The email body is in fact in HTML, and although HTML is not in itself
the problem, the way it is interpreted by clients (such as a browser)
is the issue.

I am using the web browser to display the email because I am writing
an application that supports email integration, and embedding a
browser in my application was the easiest way to render an HTML
formatted message.

I understand that the first octet in a UTF-8 formatted message can
describe the length of the data for the entire character, and did some
reading in the UTF-8 RFC. It appears, from the hex in the previous
email, that the character is a space (20) followed by a NO-BREAK SPACE
(CA, or E with a circumflex, depending on who you consult), followed
by an I. This happens in every instance there is more than one space
after a space (20). It makes sense, because two consecutive spaces
(20 20) in HTML would only render as one space. (20 &nbsp;) would
render as two spaces. It appears that the &nbsp; was encoded as a
character.

I've consulted many UTF-8 and ASCII format guides. One that I found
claims that the ASCII equivalent of 202 is "NO-BREAK SPACE". This is
how both Outlook and Apple Mail (Mail.app) render 202. Web browsers
render it as the accented E.

I considered the ISO 8859-1 character set. This character set
reference also states that it is the accented E:
http://htmlhelp.com/reference/charset/iso192-223.html
In this UTF-8 reference, 202 is also the accented E:
http://www.tony-franks.co.uk/UTF-8.htm
This reference mentions 202 as being NO-BREAK SPACE in, from what I
can tell, ASCII: http://www1.tip.nl/~t876506/utf8tbl.html
But this says ASCII 202 is not a NO-BREAK SPACE: http://www.asciitable.com/

My confusion here is not with a single message, but a whole suite of
messages from different sources.

The hex above was found by taking the raw, base-64 encoded MIME part,
and decoding it -- into HTML. That HTML, according to the MIME header
and the HTML header is UTF-8 formatted. I have used two base64
decoders (.NET on Windows and Java on OSX) to decode it -- same
result. From there, I saved the output and ran "hexdump -C file.txt"
to get the hex values. The data has been pulled by both JavaMail and
the Apple Mail client (Apple mail renders it correctly). There is no
doubt that the message in question is correct, and has not been
corrupted by the code used to retrieve it.
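
The decoding step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the real message is not available, so the sample octets are copied from the hexdump in the first post and base64-encoded here as a stand-in for the MIME body.

```python
import base64

# Sample octets copied from the hexdump in the first post (offset 0x250).
raw = bytes.fromhex("20 64 65 73 69 67 6e 2e 20 ca 49 0d 0a 68 61 76")

# Stand-in for the base64 body of the MIME part (the real one is not available).
transfer_encoded = base64.b64encode(raw)

# Decoding base64 yields octets, not text: no charset is applied at this step.
decoded = base64.b64decode(transfer_encoded)

print(decoded.hex(" "))  # bytes.hex with a separator needs Python 3.8+
print(decoded == raw)
```

The point of the sketch is that base64 decoding hands back exactly the octets that were sent. Which characters those octets mean is a separate question, answered only when a charset is applied to them.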
 
Jukka K. Korpela

Andre said:
I appreciate the response.

Before that statement, you quoted my entire message, even including the sig,
and then quoted it again.
I am using the web browser to display the email because I am writing
an application that supports email integration,

Seriously, stop doing that. You lack the prerequisites. You can't even use a
newsreader decently, and you are totally confused with character encoding
issues.
I understand that the first octet in a UTF-8 formatted message can
describe the length of the data for the entire character,

At best, that's a very odd way of describing things. If you replace "can
describe" by "implies", it makes much better sense.
I've consulted many UTF-8 and ASCII format guides.

But you obviously cannot distinguish the rubbish from reliable sources.
One that I found
claims that the ASCII equivalent of 202 is "NO-BREAK SPACE".

That's nonsense. ASCII has nothing corresponding to 202 decimal, and ASCII
does not contain NO-BREAK SPACE at all.
The hex above was found by taking the raw, base-64 encoded MIME part,
and decoding it -- into HTML.

"Into HTML"? Base64 is a transfer encoding of characters and has nothing to
do with any markup.
There is no
doubt that the message in question is correct, and has not been
corrupted by the code used to retrieve it.

It surely isn't correct, in the very technical sense of the word, if it
claims to be UTF-8 encoded and yet isn't and specifically contains octet
sequences that are not allowed in UTF-8 data. But lacking the primary data,
we have a big "if" here.

ObHTML: Your conjecture that the data contains instances of a space followed
by a no-break space in order to create two visible spaces is plausible, but
we have no way of actually testing whether it is actually true. People have
been observed to do such things, and the method works for some values of
"work". It sounds odd that someone would write e-mail that way, but perhaps
some software used to compose e-mail creates such data by default.
 
mrdecav

The following hex is an example of the issue:
00000250  20 64 65 73 69 67 6e 2e  20 ca 49 0d 0a 68 61 76  | design. ?I..hav|
[...]
I understand that the first octet in a UTF-8 formatted message can
describe the length of the data for the entire character, and did some
reading in the UTF-8 RFC.  It appears, from the hex in the previous
email, that the character is a space (20) followed by a NO-BREAK SPACE
(CA, or E with a circumflex, depending on who you consult), followed
by an I.  This happens in every instance there is more than one space
after a space (20).  It makes sense, because two consecutive spaces
(20 20) in HTML would only render as one space.  (20 &nbsp;) would
render as two spaces.  It appears that the &nbsp; was encoded as a
character.

In UTF-8, NO-BREAK SPACE should appear as 0xC2 0xA0. E with circumflex
should appear as 0xC3 0x8A.
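
These octet values are easy to check. A small Python sketch (added for illustration, using only the standard library) prints both characters with their UTF-8 and ISO-8859-1 encodings side by side:

```python
import unicodedata

for ch in ("\u00A0", "\u00CA"):
    name = unicodedata.name(ch)
    utf8 = ch.encode("utf-8").hex(" ")          # hex(" ") needs Python 3.8+
    latin1 = ch.encode("iso-8859-1").hex(" ")
    print(f"U+{ord(ch):04X} {name}: UTF-8 = {utf8}, ISO-8859-1 = {latin1}")
```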

0xCA is what E with circumflex looks like in ISO-8859-1.

0xCA 0x49 is invalid as UTF-8. So it looks to me like the program
displaying this is trying to treat it as UTF-8, but then falling back to
ISO-8859-1 when it finds to its disappointment that it isn't actually
UTF-8. Lots of data incorrectly identifies itself so many programs
employ a bit of guesswork. If it did do that, you'd see the E with a
circumflex.
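
A rough Python sketch of that guesswork follows. It is an assumption about how such a program might behave, not a description of any particular browser or mail client: decode strictly as UTF-8, and fall back to ISO-8859-1 only when that fails.

```python
data = bytes.fromhex("20 ca 49")  # the octets from the hexdump

try:
    text = data.decode("utf-8")        # strict decoding, as the headers claim
    used = "utf-8"
except UnicodeDecodeError as err:
    # err.start is the offset of the offending octet, err.reason says why
    print(f"not UTF-8: offset {err.start}, {err.reason}")
    text = data.decode("iso-8859-1")   # assumed fallback; every octet maps to something
    used = "iso-8859-1"

print(used, repr(text))
```
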
I've consulted many UTF-8 and ASCII format guides.  One that I found
claims that the ASCII equivalent of 202 is "NO-BREAK SPACE". This is
how both Outlook and Apple Mail (Mail.app) render 202.  Web browsers
render it as the accented E.

202 is definitely the circumflexed E in ISO-8859-1, and the unicode
character 202 is also the circumflexed E. But it may be the NO-BREAK
SPACE in some other encoding. If so I don't know which one. But this is
one way to explain what is happening.
I considered the ISO 8859-1 character set.  This character set
reference also states that it is the accented E:
http://htmlhelp.com/reference/charset/iso192-223.html
In this UTF-8 reference, 202 is also the accented E:
http://www.tony-franks.co.uk/UTF-8.htm
This reference mentions 202 as being NO-BREAK SPACE in, from what I
can tell, ASCII:http://www1.tip.nl/~t876506/utf8tbl.html

Not ASCII-- ASCII only goes up to 127. But it may be that 202 is the
NO-BREAK SPACE in _something_. That guide may just be wrong, but it's a
bit of a coincidence if you're sure Apple Mail and Outlook are rendering
a no-break space. Maybe they're just rendering a gap because they don't
know what to do with the error.

Thank you Ben for a useful, productive response.

Unfortunately, some people on this board haven't seen daylight from
their mother's basement in a while and have the need to show off their
1337 knowledge of character sets by insulting others :).


**I actually found the cause of the problem I was having; a brief
description is below:**

Clearly, from what I described, the input data looked to be corrupt.
Given that I don't have intricate knowledge of character sets (just
know the basics), I figured I may have been missing something.

As it turns out, the problem is not with the encoding, but with the
headers that define the character set. Both headers (MIME and HTML)
define the character set as UTF-8, however the document is actually
encoded in Mac-Roman. In the Mac-Roman character set, 202 (0xCA) is
in fact the "NO-BREAK SPACE".
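
This is easy to confirm with Python's built-in `mac_roman` codec. The sketch below was added for illustration; it decodes the single octet and shows what the same character looks like once it is properly encoded as UTF-8.

```python
import unicodedata

octet = b"\xca"
as_mac = octet.decode("mac_roman")

print(unicodedata.name(as_mac))          # name of the character 0xCA maps to
print(as_mac.encode("utf-8").hex(" "))   # the same character as real UTF-8 octets
```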

When opened in a normal text editor, which tries to determine the type
of encoding from the byte stream itself (rather than a header), it is
properly opened as Mac-Roman. Browsers are looking at the HTML header
(<meta content="text/html;charset=UTF-8" http-equiv="Content-Type">),
while normal text editors look at the raw file. I suppose mail
clients are determining the encoding from the raw file, before
rendering it as HTML, and that is why it renders properly there.

There is undoubtedly a bug in one or more mail clients, which mark
text bodies as UTF-8, rather than their real encoding, Mac-Roman.
 
mrdecav

[...]
The following hex is an example of the issue:
00000250  20 64 65 73 69 67 6e 2e  20 ca 49 0d 0a 68 61 76  | design. ?I..hav|
[...]
202 is definitely the circumflexed E in ISO-8859-1, and the unicode
character 202 is also the circumflexed E. But it may be the NO-BREAK
SPACE in some other encoding. If so I don't know which one. But this is
one way to explain what is happening.
[...]
As it turns out, the problem is not with the encoding, but with the
headers that define the character set.  Both headers (MIME and HTML)
define the character set as UTF-8, however the document is actually
encoded in Mac-Roman.  In the Mac-Roman character set, 202 (0xCA) is
in fact the "NO-BREAK SPACE".

Ah, that explains it. The headers say it's UTF-8, but the bytes are not
valid UTF-8. So the text editor falls back on its default. You would
expect the default to be ISO-8859-1 for most tools (giving you an E with
a circumflex), but evidently it's Mac-Roman for some.

You're probably using a Mac. Actually I can tell you are from the
headers on your message:

    X-HTTP-UserAgent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_6;
    en-us)
When opened in a normal text editor, which tries to determine the type
of encoding from the byte stream itself (rather than a header), it is
properly opened as Mac-Roman.

I would think it's practically impossible in most cases to guess that
something is Mac-Roman rather than one of the other 8-bit encodings.
Your editor is just falling back on its default.
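
To illustrate why guessing is so hard, here is a short Python sketch (added for illustration; the list of candidate codecs is an arbitrary choice). Every single-byte encoding in the list accepts the octets without complaint, so validity alone cannot tell them apart.

```python
data = bytes.fromhex("20 ca 49")

candidates = ("iso-8859-1", "iso-8859-15", "cp1252", "mac_roman")
results = {}
for codec in candidates:
    results[codec] = data.decode(codec)   # none of these raises for 0xCA
    print(f"{codec:12} -> {results[codec]!r}")
```

Only the meaning of the result differs between codecs, so a tool would need some knowledge of the language or the sender to prefer one over another.
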
Browsers are looking at the HTML header
(<meta content="text/html;charset=UTF-8" http-equiv="Content-Type">),
while normal text editors look at the raw file.  I suppose mail
clients are determining the encoding from the raw file, before
rendering it as HTML, and that is why it renders properly there.
There is undoubtedly a bug in one or more mail clients, which mark
text bodies as UTF-8, rather than their real encoding, Mac-Roman.

Certainly. Mac-Roman is rather a strange encoding to be using anyway. If
I were fixing that bug I'd make the contents UTF-8 rather than change
the header to Mac-Roman.

Yeah, originally I was saving the raw bytes of the message to storage
and then pulling it back out. I'm going to convert any text-based
body I get to UTF-8 before saving.
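
A minimal sketch of that conversion step in Python (the application itself is .NET/Java, so this only shows the idea). The function name `to_utf8` and the choice of Mac-Roman as the fallback are assumptions made for illustration; a real application would need its own policy for mislabelled mail.

```python
def to_utf8(body: bytes, declared: str, fallback: str = "mac_roman") -> bytes:
    """Re-encode a text body as UTF-8.

    `declared` is the charset named in the headers; `fallback` is an
    assumption about what mislabelled mail really contains.
    """
    try:
        text = body.decode(declared)   # strict: raises on invalid octets
    except (UnicodeDecodeError, LookupError):
        # LookupError covers charset names Python does not recognise
        text = body.decode(fallback)
    return text.encode("utf-8")


print(to_utf8(bytes.fromhex("20 ca 49"), "utf-8").hex(" "))
```

Because the declared charset is tried first and strictly, a body that really is valid UTF-8 (including one with a genuine accented E) passes through unchanged. Only bodies that fail that check are reinterpreted with the fallback.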

Thanks again,
Andre
 
mrdecav

[...]

Interestingly, Windows Mail and Outlook also render it
"correctly" (I'm guessing using Mac-Roman). There must be a bit more
to it than a default fallback...
 
mrdecav

[...]
Interestingly, Windows Mail and Outlook also render it
"correctly" (I'm guessing using Mac-Roman).  There must be a bit more
to it than a default fallback...

They may just be displaying nothing at all. They try to decode UTF-8,
find an octet sequence they don't like, and just move on. Are you sure
they're really showing a no-break space?

Well, they should be showing an E with an accent circumflex if they
are truly following UTF-8, so they must be handling that 0xCA
somehow...

Oddly enough, both Notepad and some simple .NET code
(File.ReadAllText) will try to use UTF-8, so it's not platform-specific
behavior.

If you look at the hex I displayed earlier, which is the raw text,
taken using different methods, you see this:
20 ca 49
which corresponds to:
<space>?I

This is clear both from the hexdump output above and from manually
looking it up in the UTF-8 character tables. 20 is a space,
49 is an "I" and CA is most certainly between them. If mail was
decoding as UTF-8, you would expect an accent circumflex.

They may just be ignoring it (they shouldn't if they are just decoding
as UTF-8), but they are definitely adding space where the character
belongs. A single "20" looks different than "20 CA" in the mail
readers.
 
