Encoding of surrogate code points to UTF-8

  • Thread starter Steven D'Aprano

Steven D'Aprano

I think this is a bug in Python's UTF-8 handling, but I'm not sure.

If I've read the Unicode FAQs correctly, you cannot encode *lone*
surrogate code points into UTF-8:

http://www.unicode.org/faq/utf_bom.html#utf8-5

Sure enough, using Python 3.3:

py> surr = '\udc80'
py> surr.encode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udc80' in
position 0: surrogates not allowed


But reading the previous entry in the FAQs:

http://www.unicode.org/faq/utf_bom.html#utf8-4

I interpret this as meaning that I should be able to encode valid pairs
of surrogates. So if I find a code point that encodes to a surrogate pair
in UTF-16:

py> c = '\N{LINEAR B SYLLABLE B038 E}'
py> surr_pair = c.encode('utf-16be')
py> print(surr_pair)
b'\xd8\x00\xdc\x01'


and then use those same values as the code points, I ought to be able to
encode to UTF-8, as if it were the same \N{LINEAR B SYLLABLE B038 E} code
point. But I can't:


py> s = '\ud800\udc01'
py> s.encode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in
position 0: surrogates not allowed


Have I misunderstood? I think that Python is being too strict about
rejecting surrogate code points. It should only reject lone surrogates,
or invalid pairs, not valid pairs. Have I misunderstood the Unicode FAQs,
or is this a bug in Python's handling of UTF-8?
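For reference, the pair `D800 DC01` shown above follows from the UTF-16 arithmetic for code points above U+FFFF. The sketch below spells that arithmetic out; the helper names `to_surrogate_pair` and `from_surrogate_pair` are made up for illustration and are not part of Python.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point into UTF-16 high/low code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF use a surrogate pair")
    offset = code_point - 0x10000        # a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Combine a high/low surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


c = '\N{LINEAR B SYLLABLE B038 E}'
high, low = to_surrogate_pair(ord(c))
print(hex(ord(c)), hex(high), hex(low))
```

The two 16-bit values are code units of the UTF-16 encoding form; the rest of the thread is about whether they may also stand as code points in a string that is then encoded to UTF-8.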
 
Neil Cerutti

py> c = '\N{LINEAR B SYLLABLE B038 E}'
py> surr_pair = c.encode('utf-16be')
py> print(surr_pair)
b'\xd8\x00\xdc\x01'

and then use those same values as the code points, I ought to be able to
encode to UTF-8, as if it were the same \N{LINEAR B SYLLABLE B038 E} code
point. But I can't:

py> s = '\ud800\udc01'
py> s.encode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in
position 0: surrogates not allowed

Have I misunderstood? I think that Python is being too strict
about rejecting surrogate code points. It should only reject
lone surrogates, or invalid pairs, not valid pairs. Have I
misunderstood the Unicode FAQs, or is this a bug in Python's
handling of UTF-8?

From RFC 3629:

The definition of UTF-8 prohibits encoding character numbers
between U+D800 and U+DFFF, which are reserved for use with the
UTF-16 encoding form (as surrogate pairs) and do not directly
represent characters. When encoding in UTF-8 from UTF-16 data,
it is necessary to first decode the UTF-16 data to obtain
character numbers, which are then encoded in UTF-8 as described
above. This contrasts with CESU-8 [CESU-8], which is a
UTF-8-like encoding that is not meant for use on the Internet.
CESU-8 operates similarly to UTF-8 but encodes the UTF-16 code
values (16-bit quantities) instead of the character number
(code point). This leads to different results for character
numbers above 0xFFFF; the CESU-8 encoding of those characters
is NOT valid UTF-8.
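The procedure the RFC describes (decode the UTF-16 data first, then encode the resulting character number) can be sketched in Python as follows, starting from the UTF-16BE bytes in the original post. The variable names are only illustrative.

```python
# The surrogate pair from the original post, as UTF-16BE *bytes*.
utf16_bytes = b'\xd8\x00\xdc\x01'

# Step 1: decode the UTF-16 data to obtain the character number.
text = utf16_bytes.decode('utf-16be')

# Step 2: encode that single code point as UTF-8.
utf8_bytes = text.encode('utf-8')

print(len(text), hex(ord(text)), utf8_bytes)
```

The surrogates only ever appear at the byte level here; the intermediate string holds the single code point U+10001.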

The Wikipedia article points out:

Whether an actual application should [refuse to encode these
character numbers] is debatable, as it makes it impossible to
store invalid UTF-16 (that is, UTF-16 with unpaired surrogate
halves) in a UTF-8 string. This is necessary to store unchecked
UTF-16 such as Windows filenames as UTF-8. It is also
incompatible with CESU encoding (described below).

So Python's interpretation is conformant, though not without some
disadvantages.
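For the "unchecked UTF-16" case the Wikipedia passage mentions, Python offers the `surrogatepass` error handler, which lets the UTF-8 codec write surrogate code points as three-byte sequences. This is a minimal sketch of that handler; the output is not strict UTF-8, and for a valid pair it resembles what the RFC describes for CESU-8 rather than the four-byte UTF-8 form.

```python
lone = '\udc80'
raw = lone.encode('utf-8', 'surrogatepass')       # three bytes, rejected by strict UTF-8
restored = raw.decode('utf-8', 'surrogatepass')   # round-trips to the same string

pair = '\ud800\udc01'
cesu_like = pair.encode('utf-8', 'surrogatepass')  # six bytes: each surrogate separately
proper = '\U00010001'.encode('utf-8')              # four bytes: the real code point

print(raw, cesu_like, proper)

try:
    raw.decode('utf-8')                            # the strict decoder refuses these bytes
except UnicodeDecodeError as err:
    print('strict decode failed:', err.reason)
```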

In any case, "\ud800\udc01" isn't a valid unicode string. In a
perfect world it would automatically get converted to
'\u00010001' without intervention.
 
Pete Forman

Steven D'Aprano said:
I think this is a bug in Python's UTF-8 handling, but I'm not sure. [snip]
py> s = '\ud800\udc01'
py> s.encode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in
position 0: surrogates not allowed


Have I misunderstood? I think that Python is being too strict about
rejecting surrogate code points. It should only reject lone surrogates,
or invalid pairs, not valid pairs. Have I misunderstood the Unicode FAQs,
or is this a bug in Python's handling of UTF-8?

http://www.unicode.org/versions/Unicode6.2.0/ch03.pdf

D75 Surrogate pair: A representation for a single abstract character
that consists of a sequence of two 16-bit code units, where the first
value of the pair is a high-surrogate code unit and the second value
is a low-surrogate code unit.

* Surrogate pairs are used only in UTF-16. (See Section 3.9, Unicode
Encoding Forms.)

* Isolated surrogate code units have no interpretation on their own.
Certain other isolated code units in other encoding forms also have no
interpretation on their own. For example, the isolated byte [\x80] has
no interpretation in UTF-8; it can be used only as part of a multibyte
sequence. (See Table 3-7). It could be argued that this line by itself
should raise an error.


That first bullet indicates that it is indeed illegal to use surrogate
pairs in UTF-8 or UTF-32.
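The same rule can be observed from the decoding side. The sketch below feeds a few byte strings to Python's strict UTF-8 decoder, including the isolated byte `\x80` from the quoted bullet and the three-byte form of U+D800; `try_decode` is a hypothetical helper written for this example.

```python
def try_decode(data):
    """Return the decoded text, or an error note if strict UTF-8 decoding fails."""
    try:
        return data.decode('utf-8')
    except UnicodeDecodeError as err:
        return 'error: ' + err.reason


samples = (
    b'\x80',                # isolated continuation byte
    b'\xc2\x80',            # the same byte inside a valid two-byte sequence
    b'\xed\xa0\x80',        # what U+D800 would look like as three bytes
    b'\xf0\x90\x80\x81',    # U+10001 as a four-byte sequence
)
for data in samples:
    print(data, '->', ascii(try_decode(data)))
```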
 
Neil Cerutti

In any case, "\ud800\udc01" isn't a valid unicode string. In a
perfect world it would automatically get converted to
'\u00010001' without intervention.

This last paragraph is erroneous. I must have had a typo in my
testing.
 
MRAB

Steven D'Aprano said:
I think this is a bug in Python's UTF-8 handling, but I'm not sure. [snip]
py> s = '\ud800\udc01'
py> s.encode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in
position 0: surrogates not allowed


Have I misunderstood? I think that Python is being too strict about
rejecting surrogate code points. It should only reject lone surrogates,
or invalid pairs, not valid pairs. Have I misunderstood the Unicode FAQs,
or is this a bug in Python's handling of UTF-8?

http://www.unicode.org/versions/Unicode6.2.0/ch03.pdf

D75 Surrogate pair: A representation for a single abstract character
that consists of a sequence of two 16-bit code units, where the first
value of the pair is a high-surrogate code unit and the second value
is a low-surrogate code unit.

* Surrogate pairs are used only in UTF-16. (See Section 3.9, Unicode
Encoding Forms.)

* Isolated surrogate code units have no interpretation on their own.
Certain other isolated code units in other encoding forms also have no
interpretation on their own. For example, the isolated byte [\x80] has
no interpretation in UTF-8; it can be used only as part of a multibyte
sequence. (See Table 3-7). It could be argued that this line by itself
should raise an error.


That first bullet indicates that it is indeed illegal to use surrogate
pairs in UTF-8 or UTF-32.
The only time you should get a surrogate pair in a Unicode string is in
a narrow build, which doesn't exist in Python 3.3 and later.
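A quick way to check the narrow-build point on a given interpreter, assuming Python 3.3 or later, is to look at `sys.maxunicode` and at how a supplementary character is stored:

```python
import sys

astral = '\U00010001'

# On Python 3.3+ there is no narrow build, so this is always 0x10ffff.
print(hex(sys.maxunicode))

# The string holds one code point, not a pair of surrogates.
print(len(astral))
print([hex(ord(ch)) for ch in astral])
```

This only shows that Python does not create surrogate pairs on its own; it does not stop a program from writing surrogate escapes into a literal by hand.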
 
wxjmfauth

--------
sys.version '3.3.2 (v3.3.2:d047928ae3f6, May 16 2013, 00:03:43) [MSC v.1600 32 bit (Intel)]'
'\ud800'.encode('utf-8')
Traceback (most recent call last):
File "<eta last command>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0:
surrogates not allowed
b'\xff\xfe\x00\x00\x00\xd8\x00\x00'


jmf
 
Terry Reedy

I think this is a bug in Python's UTF-8 handling, but I'm not sure.

If I've read the Unicode FAQs correctly, you cannot encode *lone*
surrogate code points into UTF-8:

http://www.unicode.org/faq/utf_bom.html#utf8-5

Sure enough, using Python 3.3:

py> surr = '\udc80'

I am pretty sure that if Python were being strict, that would raise an
error, as the result is not a valid unicode string. Allowing the above
or not was debated and laxness was allowed for at least the following
practical reasons.

1. Python itself uses the invalid surrogate codepoints for
surrogateescape error-handling.
http://www.python.org/dev/peps/pep-0383/

2. Invalid strings are needed for tests ;-)
-- like the one you do next.

3. Invalid strings may be needed for interfacing with other C APIs.
py> surr.encode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udc80' in
position 0: surrogates not allowed

Default strict encoding (utf-8 or otherwise) will only encode valid
unicode strings. Encode invalid strings with surrogate codepoints with
surrogateescape error handling.
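As a rough illustration of the `surrogateescape` handler from PEP 383: undecodable bytes are mapped to lone surrogates in the range U+DC80..U+DCFF on decoding, and mapped back to the original bytes on encoding. The sample bytes below are invented for the example.

```python
raw = b'abc\x80\xff'                      # not valid UTF-8

# Each undecodable byte 0xNN becomes the lone surrogate U+DCNN.
text = raw.decode('utf-8', 'surrogateescape')
print(ascii(text))

# Encoding with the same handler restores the original bytes.
assert text.encode('utf-8', 'surrogateescape') == raw

# The handler covers only U+DC80..U+DCFF; other surrogates are still rejected.
try:
    '\ud800'.encode('utf-8', 'surrogateescape')
except UnicodeEncodeError as err:
    print('still rejected:', err.reason)
```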
But reading the previous entry in the FAQs:

http://www.unicode.org/faq/utf_bom.html#utf8-4

I interpret this as meaning that I should be able to encode valid pairs
of surrogates.

It says you should be able to 'convert' them, and that the result for
utf-8 encoding must be a single 4-byte code for the corresponding
supplementary codepoint.
So if I find a code point that encodes to a surrogate pair
in UTF-16:

py> c = '\N{LINEAR B SYLLABLE B038 E}'
py> surr_pair = c.encode('utf-16be')
py> print(surr_pair)
b'\xd8\x00\xdc\x01'

and then use those same values as the code points, I ought to be able to
encode to UTF-8, as if it were the same \N{LINEAR B SYLLABLE B038 E} code
point. But I can't:

py> s = '\ud800\udc01'

This is now a string with two invalid codepoints instead of one ;-).
As above, it would be rejected if Python were being strict.
py> s.encode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in
position 0: surrogates not allowed


Have I misunderstood? I think that Python is being too strict about
rejecting surrogate code points.

No, it is being too lax about allowing them at all.

I believe there is an issue on the tracker (maybe closed) about the doc
for unicode escapes in string literals. Perhaps it should say more
clearly that inserting surrogates is allowed but results in an invalid
string that cannot be normally encoded.
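To make the escape behaviour concrete: `\u` takes exactly four hex digits and `\U` takes exactly eight, so the three literals below are three different strings. Note that `'\u00010001'`, as written earlier in the thread, is parsed as U+0001 followed by the four characters `0001`.

```python
two_surrogates = '\ud800\udc01'   # two code points, both surrogates
one_code_point = '\U00010001'     # one supplementary code point
short_escape = '\u00010001'       # '\u0001' followed by the text '0001'

print(len(two_surrogates), len(one_code_point), len(short_escape))
print(two_surrogates == one_code_point)
print(ascii(short_escape))
```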
 
Terry Reedy

On 10/8/2013 9:52 AM, Steven D'Aprano wrote:

It says you should be able to 'convert' them, and that the result for
utf-8 encoding must be a single 4-byte code for the corresponding
supplementary codepoint.

To expand on this: The FAQ question is "How do I convert a UTF-16
surrogate pair such as <D800 DC00> to UTF-8?" utf-16 and utf-8 are both
byte (or double byte) encodings of codepoints. Direct conversion would
be 'transcoding', not encoding. Python has a few bytes transcoders and
one string transcoder (rot_13), listed at the end of
http://docs.python.org/3/library/codecs.html#python-specific-encodings
But in general, one must decode bytes to string and encode back to bytes.
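A small sketch of the distinction, using `codecs.encode` for the built-in transcoders and a decode-then-encode round trip for everything else. The `transcode` helper and the Latin-1 sample are assumptions made for the example, not something from the codecs module.

```python
import codecs

# str -> str transcoder
print(codecs.encode('Hello', 'rot_13'))

# bytes -> bytes transcoder
print(codecs.encode(b'hi', 'hex_codec'))


def transcode(data, source, target):
    """Re-encode bytes by decoding to str first, then encoding again."""
    return data.decode(source).encode(target)


# There is no direct Latin-1 -> UTF-8 codec; go through a str.
print(transcode(b'caf\xe9', 'latin-1', 'utf-8'))
```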

I believe the utf encodings are defined as 1 to 1. If the above worked,
utf-8 would not be.
 
Steven D'Aprano

The only time you should get a surrogate pair in a Unicode string is in
a narrow build, which doesn't exist in Python 3.3 and later.

Incorrect.

py> sys.version
'3.3.0rc3 (default, Sep 27 2012, 18:44:58) \n[GCC 4.1.2 20080704 (Red Hat
4.1.2-52)]'
py> s = '\ud800\udc01'
py> print(len(s))
2
py> import unicodedata as ud
py> for c in s:
...     print(ud.category(c))
...
Cs
Cs

s is a string containing two code points making up a surrogate pair.
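If the goal is to find such code points before attempting to encode, a range check on `ord()` is enough. This is a minimal sketch; `surrogate_positions` and `is_encodable_utf8` are hypothetical helper names.

```python
def surrogate_positions(text):
    """Return the indexes of code points in the range U+D800..U+DFFF."""
    return [i for i, ch in enumerate(text) if 0xD800 <= ord(ch) <= 0xDFFF]


def is_encodable_utf8(text):
    """True if the strict UTF-8 codec accepts the string."""
    try:
        text.encode('utf-8')
    except UnicodeEncodeError:
        return False
    return True


for s in ('\ud800\udc01', '\U00010001', 'a\ud800b'):
    print(ascii(s), surrogate_positions(s), is_encodable_utf8(s))
```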


It is very frustrating that the Unicode FAQs don't always clearly
distinguish between when they are talking about bytes and when they are
talking about code points. This area about surrogates is one of the places
where they conflate the two.
 
Steven D'Aprano

In any case, "\ud800\udc01" isn't a valid unicode string.

I don't think this is correct. Can you show me where the standard says
that Unicode strings[1] may not contain surrogates? I think that is a
critical point, and the FAQ conflates *encoded strings* (i.e. bytes using
one of the UTFs) with *Unicode strings*.

The string you give above is a Unicode string containing two code
points, the surrogates U+D800 U+DC01, which as far as I am concerned is a
legal string (subject to somebody pointing me to a definitive source that
proves it is not). However, it *may or may not* be encodable to bytes
using UTF-8, -16 or -32.

Just as there are byte sequences that cannot be generated by the UTFs,
possibly there are code point sequences that cannot be converted to bytes
using the UTFs.

In a perfect
world it would automatically get converted to '\u00010001' without
intervention.

I certainly hope not, because Unicode string != UTF-16. This is
equivalent to saying:

When encoding the sequence of code points '\ud800\udc01' to UTF-8 bytes,
you should get the same result as if you treated the sequence of code
points as if it were bytes, decoded it using UTF-16, and then encoded
using UTF-8.

That would be a horrible, horrible design, since it privileges UTF-16 in
a completely inappropriate way. I *really* hope I am wrong, but I fear
that is my interpretation of the FAQ.
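For anyone who does want that reinterpretation, it can be requested explicitly rather than happening behind the scenes. The sketch below assumes Python 3.4 or later, where the UTF-16 codecs accept the `surrogatepass` handler; `join_surrogate_pairs` is a made-up name.

```python
def join_surrogate_pairs(text):
    """Reinterpret surrogate code points in a str as UTF-16 code units.

    Valid high/low pairs become supplementary code points; a lone
    surrogate makes the UTF-16 decoding step raise UnicodeDecodeError.
    """
    return text.encode('utf-16-le', 'surrogatepass').decode('utf-16-le')


s = '\ud800\udc01'
joined = join_surrogate_pairs(s)
print(len(s), len(joined), joined.encode('utf-8'))
```

Keeping this as an opt-in step means the string type itself stays neutral between the three encoding forms.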



[1] Sequences of Unicode code points.
 
Terry Reedy

In any case, "\ud800\udc01" isn't a valid unicode string.

I don't think this is correct. Can you show me where the standard says
that Unicode strings[1] may not contain surrogates? I think that is a

see below.
critical point, and the FAQ conflates *encoded strings* (i.e. bytes using
one of the UTFs) with *Unicode strings*.

The string you give above is a Unicode string containing two code
points, the surrogates U+D800 U+DC01, which as far as I am concerned is a
legal string (subject to somebody pointing me to a definitive source that
proves it is not). However, it *may or may not* be encodable to bytes
using UTF-8, -16 or -32.

From chapter two of the standard.

"Plain text is a pure sequence of character codes; plain Unicode-encoded
text is therefore a sequence of Unicode character codes."

http://www.unicode.org/versions/Unicode6.2.0/ch02.pdf#G13708
"All three encoding forms can be used to represent the full range of
encoded characters in the Unicode Standard; ... Each of the three
Unicode encoding forms can be efficiently transformed into either
of the other two without any loss of data."

"Surrogates Area. The Surrogates Area contains only surrogate code
points and no encoded characters. See Section 16.6, Surrogates Area, for
more detail."

Before utf-16, the surrogates area was, I believe, part of the Private
Use Area (which now starts where surrogates end). I think it would have
been better if they were no longer called code points, but simply utf-16
code units.
Just as there are byte sequences that cannot be generated by the UTFs,
possibly there are code point sequences that cannot be converted to bytes
using the UTFs.

True, but not to the point. You switched from sequences of characters
(unicode text), which is what both I and Neil are talking about, to
sequences of codepoints which is a larger set when you include the
non-character surrogate 'code points' that are not allowed in unicode text.

http://www.unicode.org/versions/Unicode6.2.0/ch03.pdf#G7404

"The Unicode Standard supports three character encoding forms: UTF-32,
UTF-16, and UTF-8. Each encoding form maps the Unicode code points
U+0000..U+D7FF and U+E000..U+10FFFF to unique code unit sequences."
[1] Sequences of Unicode code points.

This is not the Standard's definition of 'unicode text'. It is also not
its definition of 'unicode string'.

"D80 Unicode string: A code unit sequence containing code units of a
particular Unicode encoding form."

In other words, a Unicode string is a utf encoding of unicode text. The
FSR adaptively uses a subset of possible sequences from all three,
though only one utf is used for any particular string.
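The adaptive storage can be observed indirectly on CPython 3.3 or later. PEP 393 describes the three widths as 1, 2, or 4 bytes per code point, chosen per string from its widest code point. The measurement below is a CPython-specific sketch that relies on `sys.getsizeof`; `bytes_per_code_point` is a hypothetical helper.

```python
import sys


def bytes_per_code_point(ch, n=1000):
    """Estimate storage per code point by comparing two string lengths."""
    return (sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch)) / n


for ch in ('a', '\u0100', '\U00010001'):
    print('U+%04X' % ord(ch), bytes_per_code_point(ch))
```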
 
Steven D'Aprano

In any case, "\ud800\udc01" isn't a valid unicode string.

I don't think this is correct. Can you show me where the standard says
that Unicode strings[1] may not contain surrogates? I think that is a

see below.
critical point, and the FAQ conflates *encoded strings* (i.e. bytes
using one of the UTFs) with *Unicode strings*.

The string you give above is a Unicode string containing two code
points, the surrogates U+D800 U+DC01, which as far as I am concerned is
a legal string (subject to somebody pointing me to a definitive source
that proves it is not). However, it *may or may not* be encodable to
bytes using UTF-8, -16 or -32.

From chapter two of the standard.

"Plain text is a pure sequence of character codes; plain Unicode-encoded
text is therefore a sequence of Unicode character codes."

Also there are many valid non-characters in Unicode, including 66
explicitly defined non-characters, plus the many surrogates. So defining
Unicode strings in terms of characters is less than helpful, since it
excludes a whole bunch of strings which aren't "text" since they include
non-characters.

Also, "character" in the context of Unicode is ambiguous, due to
normalization and decomposition: a single character can have up to four
distinct forms.

http://www.macchiato.com/unicode/nfc-faq
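The four forms referred to here are the normalization forms NFC, NFD, NFKC and NFKD, available through `unicodedata.normalize`. A short sketch with sample strings chosen for illustration:

```python
import unicodedata as ud

composed = '\u00e9'      # LATIN SMALL LETTER E WITH ACUTE
decomposed = 'e\u0301'   # 'e' followed by COMBINING ACUTE ACCENT
ligature = '\ufb01'      # LATIN SMALL LIGATURE FI

for form in ('NFC', 'NFD', 'NFKC', 'NFKD'):
    results = [ascii(ud.normalize(form, s)) for s in (composed, decomposed, ligature)]
    print(form, results)
```

The composed and decomposed strings compare unequal as sequences of code points even though they display as the same character, which is the ambiguity being described.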

*Code points* are rigorously defined, not characters, which is why I have
tried very hard to only refer to code points and bytes, not characters.

http://www.unicode.org/versions/Unicode6.2.0/ch02.pdf#G13708 "All three
encoding forms can be used to represent the full range of encoded
characters in the Unicode Standard; ... Each of the three Unicode
encoding forms can be efficiently transformed into either of the other
two without any loss of data."

This merely says "encodings encode characters". We know that encodings
can also encode non-characters, at least *some* non-characters. The
question is, can they encode surrogates?
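The noncharacter half of that statement is easy to check. The 66 noncharacters are U+FDD0..U+FDEF plus the last two code points of each of the 17 planes, and Python's UTF-8 codec round-trips all of them. The predicate `is_noncharacter` below is a hypothetical helper written from that definition.

```python
def is_noncharacter(cp):
    """True for the code points the standard designates as noncharacters."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


noncharacters = [cp for cp in range(0x110000) if is_noncharacter(cp)]
print(len(noncharacters))

# Every noncharacter survives a strict UTF-8 round trip.
round_trips = all(
    chr(cp).encode('utf-8').decode('utf-8') == chr(cp) for cp in noncharacters
)
print(round_trips)
```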

"Surrogates Area. The Surrogates Area contains only surrogate code
points and no encoded characters. See Section 16.6, Surrogates Area, for
more detail."

Before utf-16, the surrogates area was, I believe, part of the Private
Use Area (which now starts where surrogates end). I think it would have
been better if they were no longer called code points, but simply utf-16
code units.

Private Use is irrelevant, since strings certainly can contain Private
Use code-points, and UTF encodings can encode them.

True, but not to the point. You switched from sequences of characters
(unicode text), which is what both I and Neil are talking about, to
sequences of codepoints which is a larger set when you include the
non-character surrogate 'code points' that are not allowed in unicode
text.

I never mentioned sequences of characters. I've always talked about code
points.

http://www.unicode.org/versions/Unicode6.2.0/ch03.pdf#G7404

"The Unicode Standard supports three character encoding forms: UTF-32,
UTF-16, and UTF-8. Each encoding form maps the Unicode code points
U+0000..U+D7FF and U+E000..U+10FFFF to unique code unit sequences."

Ah! Now we're getting somewhere! I think you've hit the nail on the head:
the three UTF forms explicitly exclude the surrogates. So I think we now
have an answer:

Surrogate code points can exist in Unicode strings, but cannot be encoded
to bytes using the standard UTF-8, UTF-16 and UTF-32 encodings.
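That conclusion can be spot-checked at the edges of the surrogate range. The sketch assumes Python 3.4 or later, where the UTF-16 and UTF-32 encoders reject surrogates as the UTF-8 encoder does (on 3.3 they still accepted them); `strict_results` is an illustrative helper.

```python
UTFS = ('utf-8', 'utf-16-le', 'utf-32-le')


def strict_results(ch):
    """Encode one code point with each UTF; None marks a rejected encoding."""
    results = {}
    for name in UTFS:
        try:
            results[name] = ch.encode(name)
        except UnicodeEncodeError:
            results[name] = None
    return results


# Code points just outside and just inside U+D800..U+DFFF, plus U+10001.
for ch in ('\ud7ff', '\ud800', '\udfff', '\ue000', '\U00010001'):
    print('U+%04X' % ord(ch), strict_results(ch))
```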

There may be other encodings, or error handlers, which are capable of
handling surrogates, but they aren't UTF-8. So I think this answers my
question. (I reserve the right to change my mind after reading more of
the standard.)

Thank you to everyone who replied.
 
wxjmfauth

On Wednesday, 9 October 2013 08:20:05 UTC+2, Steven D'Aprano wrote:
Yes,

and what Unicode.org does not say is that these coding
schemes (like any coding scheme) should be used in an
exclusive way.

Probably, because it is too obvious to understand.

jmf
 
Ned Batchelder

On Wednesday, 9 October 2013 08:20:05 UTC+2, Steven D'Aprano wrote:
Yes,

and what Unicode.org does not say is that these coding
schemes (like any coding scheme) should be used in an
exclusive way.

Can you clarify what you mean by "in an exclusive way"?

--Ned.
 
