M
Martin v. Löwis
I'm proposing the following PEP for inclusion into Python 3.1.
Please comment.
Regards,
Martin
PEP: 383
Title: Non-decodable Bytes in System Character Interfaces
Version: $Revision: 71793 $
Last-Modified: $Date: 2009-04-22 08:42:06 +0200 (Mi, 22. Apr 2009) $
Author: Martin v. Löwis <[email protected]>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 22-Apr-2009
Python-Version: 3.1
Post-History:
Abstract
========
File names, environment variables, and command line arguments are
defined as being character data in POSIX; the C APIs however allow
passing arbitrary bytes - whether these conform to a certain encoding
or not. This PEP proposes a means of dealing with such irregularities
by embedding the bytes in character strings in such a way that allows
recreation of the original byte string.
Rationale
=========
The C char type is a data type that is commonly used to represent both
character data and bytes. Certain POSIX interfaces are specified and
widely understood as operating on character data, however, the system
call interfaces make no assumption on the encoding of these data, and
pass them on as-is. With Python 3, character strings use a
Unicode-based internal representation, making it difficult to ignore
the encoding of byte strings in the same way that the C interfaces can
ignore the encoding.
On the other hand, Microsoft Windows NT has correct the original
design limitation of Unix, and made it explicit in its system
interfaces that these data (file names, environment variables, command
line arguments) are indeed character data, by providing a
Unicode-based API (keeping a C-char-based one for backwards
compatibility).
For Python 3, one proposed solution is to provide two sets of APIs: a
byte-oriented one, and a character-oriented one, where the
character-oriented one would be limited to not being able to represent
all data accurately. Unfortunately, for Windows, the situation would
be exactly the opposite: the byte-oriented interface cannot represent
all data; only the character-oriented API can. As a consequence,
libraries and applications that want to support all user data in a
cross-platform manner have to accept mish-mash of bytes and characters
exactly in the way that caused endless troubles for Python 2.x.
With this PEP, a uniform treatment of these data as characters becomes
possible. The uniformity is achieved by using specific encoding
algorithms, meaning that the data can be converted back to bytes on
POSIX systems only if the same encoding is used.
Specification
=============
On Windows, Python uses the wide character APIs to access
character-oriented APIs, allowing direct conversion of the
environmental data to Python str objects.
On POSIX systems, Python currently applies the locale's encoding to
convert the byte data to Unicode. If the locale's encoding is UTF-8,
it can represent the full set of Unicode characters, otherwise, only a
subset is representable. In the latter case, using private-use
characters to represent these bytes would be an option. For UTF-8,
doing so would create an ambiguity, as the private-use characters may
regularly occur in the input also.
To convert non-decodable bytes, a new error handler "python-escape" is
introduced, which decodes non-decodable bytes using into a private-use
character U+F01xx, which is believed to not conflict with private-use
characters that currently exist in Python codecs.
The error handler interface is extended to allow the encode error
handler to return byte strings immediately, in addition to returning
Unicode strings which then get encoded again.
If the locale's encoding is UTF-8, the file system encoding is set to
a new encoding "utf-8b". The UTF-8b codec decodes non-decodable bytes
(which must be >= 0x80) into half surrogate codes U+DC80..U+DCFF.
Discussion
==========
While providing a uniform API to non-decodable bytes, this interface
has the limitation that chosen representation only "works" if the data
get converted back to bytes with the python-escape error handler
also. Encoding the data with the locale's encoding and the (default)
strict error handler will raise an exception, encoding them with UTF-8
will produce non-sensical data.
For most applications, we assume that they eventually pass data
received from a system interface back into the same system
interfaces. For example, and application invoking os.listdir() will
likely pass the result strings back into APIs like os.stat() or
open(), which then encodes them back into their original byte
representation. Applications that need to process the original byte
strings can obtain them by encoding the character strings with the
file system encoding, passing "python-escape" as the error handler
name.
Copyright
=========
This document has been placed in the public domain.
Please comment.
Regards,
Martin
PEP: 383
Title: Non-decodable Bytes in System Character Interfaces
Version: $Revision: 71793 $
Last-Modified: $Date: 2009-04-22 08:42:06 +0200 (Mi, 22. Apr 2009) $
Author: Martin v. Löwis <[email protected]>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 22-Apr-2009
Python-Version: 3.1
Post-History:
Abstract
========
File names, environment variables, and command line arguments are
defined as being character data in POSIX; the C APIs however allow
passing arbitrary bytes - whether these conform to a certain encoding
or not. This PEP proposes a means of dealing with such irregularities
by embedding the bytes in character strings in such a way that allows
recreation of the original byte string.
Rationale
=========
The C char type is a data type that is commonly used to represent both
character data and bytes. Certain POSIX interfaces are specified and
widely understood as operating on character data, however, the system
call interfaces make no assumption on the encoding of these data, and
pass them on as-is. With Python 3, character strings use a
Unicode-based internal representation, making it difficult to ignore
the encoding of byte strings in the same way that the C interfaces can
ignore the encoding.
On the other hand, Microsoft Windows NT has correct the original
design limitation of Unix, and made it explicit in its system
interfaces that these data (file names, environment variables, command
line arguments) are indeed character data, by providing a
Unicode-based API (keeping a C-char-based one for backwards
compatibility).
For Python 3, one proposed solution is to provide two sets of APIs: a
byte-oriented one, and a character-oriented one, where the
character-oriented one would be limited to not being able to represent
all data accurately. Unfortunately, for Windows, the situation would
be exactly the opposite: the byte-oriented interface cannot represent
all data; only the character-oriented API can. As a consequence,
libraries and applications that want to support all user data in a
cross-platform manner have to accept mish-mash of bytes and characters
exactly in the way that caused endless troubles for Python 2.x.
With this PEP, a uniform treatment of these data as characters becomes
possible. The uniformity is achieved by using specific encoding
algorithms, meaning that the data can be converted back to bytes on
POSIX systems only if the same encoding is used.
Specification
=============
On Windows, Python uses the wide character APIs to access
character-oriented APIs, allowing direct conversion of the
environmental data to Python str objects.
On POSIX systems, Python currently applies the locale's encoding to
convert the byte data to Unicode. If the locale's encoding is UTF-8,
it can represent the full set of Unicode characters, otherwise, only a
subset is representable. In the latter case, using private-use
characters to represent these bytes would be an option. For UTF-8,
doing so would create an ambiguity, as the private-use characters may
regularly occur in the input also.
To convert non-decodable bytes, a new error handler "python-escape" is
introduced, which decodes non-decodable bytes using into a private-use
character U+F01xx, which is believed to not conflict with private-use
characters that currently exist in Python codecs.
The error handler interface is extended to allow the encode error
handler to return byte strings immediately, in addition to returning
Unicode strings which then get encoded again.
If the locale's encoding is UTF-8, the file system encoding is set to
a new encoding "utf-8b". The UTF-8b codec decodes non-decodable bytes
(which must be >= 0x80) into half surrogate codes U+DC80..U+DCFF.
Discussion
==========
While providing a uniform API to non-decodable bytes, this interface
has the limitation that chosen representation only "works" if the data
get converted back to bytes with the python-escape error handler
also. Encoding the data with the locale's encoding and the (default)
strict error handler will raise an exception, encoding them with UTF-8
will produce non-sensical data.
For most applications, we assume that they eventually pass data
received from a system interface back into the same system
interfaces. For example, and application invoking os.listdir() will
likely pass the result strings back into APIs like os.stat() or
open(), which then encodes them back into their original byte
representation. Applications that need to process the original byte
strings can obtain them by encoding the character strings with the
file system encoding, passing "python-escape" as the error handler
name.
Copyright
=========
This document has been placed in the public domain.