[RELEASED] Python 3.1 final


Antoine Pitrou

Nobody said:
This results in an internal error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
SystemError: Objects/bytesobject.c:3182: bad argument to internal function

Please report a bug on http://bugs.python.org

As for a bytes version of sys.argv and os.environ, you're welcome to propose a
patch (this would be a separate issue on the aforementioned issue tracker).



Hallvard B Furuseth

Nobody said:
Okay, that's useful, except that it may have some bugs:
Assuming that this gets fixed, it should make most of the problems with
3.0 solvable. OTOH, it wouldn't have killed them to have added e.g.
sys.argv_bytes and os.environ_bytes.

That's hopeless to keep track of across modules if something modifies
sys.argv or os.environ.

If the current scheme for recovering the original bytes proves
insufficient, what could work is a string type which can have an
attribute with the original bytes (if the source was bytes). And/or
sys.argv and os.environ maintaining the correspondence when feasible.

Anyway, I haven't looked at whether any of this is a problem, so don't
mind me :) As long as it's definitely possible to tell Python once
and for all not to apply locales and string conversions, instead of
having to keep track of an ever-expanding list of variables to tame
its bytes->character conversions (as happened with Emacs).

Paul Moore

2009/6/29 Antoine Pitrou said:
As for a bytes version of sys.argv and os.environ, you're welcome to propose a
patch (this would be a separate issue on the aforementioned issue tracker).

But please be aware that such a proposal would have to consider:

1. That on Windows, the native form is the character version, and the
bytes version would have to address all the same sorts of encoding
issues that the OP is complaining about in the character versions. [1]

2. That the proposal address the question of how to write portable,
robust, code (given that choosing argv vs argv_bytes based on
sys.platform is unlikely to count as a good option...)

3. Why defining your own argv_bytes as argv_bytes =
[a.encode("iso-8859-1", "surrogateescape") for a in sys.argv] is
insufficient (excluding issues with bugs, which will be fixed
regardless) for the occasional cases where it's needed.
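
A minimal sketch of the round trip that item 3 relies on, assuming the argument was originally decoded as ASCII with the surrogateescape handler (roughly the LANG=C case). The file name and the variable `raw` are made up for illustration, and the list `[arg]` stands in for sys.argv:

```python
# Hypothetical stand-in for one sys.argv entry: a file name holding
# the Latin-1 byte 0xE9, which is not valid ASCII.
raw = b"caf\xe9.txt"

# Assumption: the interpreter decoded the argument as ASCII with the
# surrogateescape handler, so 0xE9 became the lone surrogate U+DCE9.
arg = raw.decode("ascii", "surrogateescape")

# The expression from item 3: the surrogate is turned back into 0xE9.
argv_bytes = [a.encode("iso-8859-1", "surrogateescape") for a in [arg]]
```

Under that assumption, argv_bytes holds the same bytes the process was started with.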

Before writing the proposal, the OP should probably review the
extensive discussions which can be found in the python-dev archives.
It would be wrong for people reading this thread to think that the
implemented approach is in any sense a "quick fix" - it's certainly a
compromise (and no-one likes all aspects of any compromise!) but it's
one made after a lot of input from people with widely differing


[1] And my understanding, from the PEP, is that even on POSIX, the
argv and environ data is intended to be character data, even though
the native C APIs expose a byte-oriented interface. So conceptually,
character format is "correct" on POSIX as well... (But I don't write
code for POSIX systems, so I'll leave it to the POSIX users to debate
this point further).


That's hopeless to keep track of across modules if something modifies
sys.argv or os.environ.

Oh, I wasn't suggesting that they should be updated. Just that there
should be some way to get at the original data.

The mechanism used in 3.1 is sufficient. I'm mostly concerned that it's
*possible* to recover the data; convenience is of secondary importance.

Calling sys.setfilesystemencoding('iso-8859-1') right at the start of the
code eliminates most of the issues. It's just the stuff which happens
before the first line of code is executed (sys.argv, os.environ, sys.stdin
etc) which was problematic.

[BTW, it isn't just Python that has problems. The directory where I was
performing tests happened to be an svn checkout. A subsequent "svn update"
promptly crapped out because I'd left behind a file whose name wasn't
valid ASCII.]


Please report a bug on http://bugs.python.org

As for a bytes version of sys.argv and os.environ, you're welcome to propose a
patch (this would be a separate issue on the aforementioned issue tracker).

Assuming that the above bug gets fixed, it isn't really necessary. In
particular, maintaining bytes/string versions in the presence of updates
is likely to be more trouble than it's worth.


As for a bytes version of sys.argv and os.environ, you're welcome to
propose a patch (this would be a separate issue on the aforementioned
issue tracker).

But please be aware that such a proposal would have to consider:

1. That on Windows, the native form is the character version, and the
bytes version would have to address all the same sorts of encoding
issues that the OP is complaining about in the character versions. [1]

A bytes version doesn't make sense on Windows (at least, not on the
NT-based versions, and the DOS-based branch isn't worth bothering about,

Also, Windows *needs* to deal with characters because filenames,
environment variables, etc. are case-insensitive.

2. That the proposal address the question of how to write portable,
robust, code (given that choosing argv vs argv_bytes based on
sys.platform is unlikely to count as a good option...)

There is a tension here between robustness and portability. In my
situation, robustness means getting the "unadulterated" data. I can always
adulterate it myself if I need to.

3. Why defining your own argv_bytes as argv_bytes =
[a.encode("iso-8859-1", "surrogateescape") for a in sys.argv] is
insufficient (excluding issues with bugs, which will be fixed
regardless) for the occasional cases where it's needed.

Other than the bug, it appears to be sufficient. I don't need to support
a locale where nl_langinfo(CODESET) is ISO-2022 (I *do* need to support
lossless round-trip of ISO-2022 filenames, possibly stored in argv and
maybe even in environ, but that's a different matter; the code only
really needs to run with LANG=C).
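
To illustrate why the locale matters here, the following sketch shows the iso-8859-1 expression failing once an argument has been decoded as UTF-8 and contains a character outside Latin-1. The helper name `argv_bytes` and the sample file name are hypothetical; the sketch assumes arguments are decoded with the surrogateescape handler:

```python
def argv_bytes(args, encoding="iso-8859-1"):
    """Re-encode decoded arguments; 'encoding' must match the one
    that was used to decode them for the round trip to be lossless."""
    return [a.encode(encoding, "surrogateescape") for a in args]

raw = "\u20ac.txt".encode("utf-8")  # euro sign, three bytes in UTF-8

# Assumption: a UTF-8 locale, so the bytes decode to a real U+20AC.
arg = raw.decode("utf-8", "surrogateescape")

try:
    argv_bytes([arg])            # U+20AC has no Latin-1 encoding
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# Re-encoding with the encoding that was used for decoding works.
recovered = argv_bytes([arg], "utf-8")
```

The surrogateescape handler only converts the lone surrogates it produced while decoding; any other unencodable character still raises UnicodeEncodeError.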

[1] And my understanding, from the PEP, is that even on POSIX, the
argv and environ data is intended to be character data, even though
the native C APIs expose a byte-oriented interface. So conceptually,
character format is "correct" on POSIX as well... (But I don't write
code for POSIX systems, so I'll leave it to the POSIX users to debate
this point further).

Even if it's "intended" to be character data, it isn't *required* to be.
In particular, it's not required to be in the locale's encoding.

A common example of what I need to handle is:

find /www ... -print0 | xargs -0 myscript

where the filenames can be in a wide variety of different encodings
(sometimes even within a single directory).
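
As a rough sketch of how such a script could treat its arguments as opaque bytes: later Python versions (3.2 and up, so not the 3.1 release discussed here) provide os.fsencode and os.fsdecode, which apply the filesystem encoding and, on POSIX, the surrogateescape handler. The helper names below are made up, and the sketch assumes a POSIX system:

```python
import os

def original_names(args):
    """Return each argument as the bytes the OS handed to the process."""
    return [os.fsencode(a) for a in args]

def printable(name):
    """Render a bytes file name as ASCII text, escaping other bytes."""
    return name.decode("ascii", "backslashreplace")

# Hypothetical stand-in for sys.argv[1:]: one name with a Latin-1 byte.
raw = b"caf\xe9.txt"
args = [os.fsdecode(raw)]

names = original_names(args)
labels = [printable(n) for n in names]
```

The bytes in names can be passed directly to open() or os.stat(), so the script never needs to know which encoding a given file name uses.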
