harmful str(bytes)

  • Thread starter Hallvard B Furuseth

Hallvard B Furuseth

I've been playing a bit with Python3.2a2, and frankly its charset
handling looks _less_ safe than in Python 2.

The offender is bytes.__str__: str(b'foo') == "b'foo'".
It's often not clear from looking at a piece of code whether
some data is treated as strings or bytes, particularly when
translating from old code. Which means one cannot see from
context if str(s) or "%s" % s will produce garbage.
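
A minimal sketch of the behaviour described above, runnable on any Python 3. The variable names are only for illustration:

```python
# Python 3: str(), "%s" and str.format() on bytes yield the repr,
# not the contents.
data = b"foo"

via_str = str(data)             # "b'foo'"
via_percent = "%s" % data       # "b'foo'"
via_format = "{}".format(data)  # "b'foo'"

# An explicit decode is what recovers the text.
decoded = data.decode("ascii")  # "foo"

print(via_str, via_percent, via_format, decoded)
```

None of the first three calls raises, which is the complaint: the wrong result looks like ordinary text.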

With late 2.x Unicode <-> string conversion, the equivalent operation did
not silently produce garbage: it raised UnicodeError instead. With old
raw Python strings that was not a problem in applications which did not
need to convert any charsets; with Python 3 they can break.

I really wish bytes.__str__ would at least by default fail.
 

Arnaud Delobelle

Hallvard B Furuseth said:
I've been playing a bit with Python3.2a2, and frankly its charset
handling looks _less_ safe than in Python 2.

The offender is bytes.__str__: str(b'foo') == "b'foo'".
It's often not clear from looking at a piece of code whether
some data is treated as strings or bytes, particularly when
translating from old code. Which means one cannot see from
context if str(s) or "%s" % s will produce garbage.

With 2.<late> conversion Unicode <-> string the equivalent operation did
not silently produce garbage: it raised UnicodeError instead. With old
raw Python strings that was not a problem in applications which did not
need to convert any charsets, with python3 they can break.

I really wish bytes.__str__ would at least by default fail.

I think you misunderstand the purpose of str(). It is to provide a
(unicode) string representation of an object and has nothing to do with
converting it to unicode:

>>> str(b'\xc2\xa3')
"b'\\xc2\\xa3'"


If you want to *decode* a bytes string, use its decode method and you
get a unicode string (if your bytes string is a valid encoding):
>>> b'\xc2\xa3'.decode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)


If you want to *encode* a (unicode) string, use its encode method and you
get a bytes string (provided your string can be encoded using the given
encoding):
>>> '\u20ac'.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\u20ac' in position 0: ordinal not in range(128)
 

Antoine Pitrou

The offender is bytes.__str__: str(b'foo') == "b'foo'".
It's often not clear from looking at a piece of code whether
some data is treated as strings or bytes, particularly when
translating from old code. Which means one cannot see from
context if str(s) or "%s" % s will produce garbage.

This probably comes from overuse of str(s) and "%s". They can be useful
to produce human-readable messages, but you shouldn't have to use them
very often.

I really wish bytes.__str__ would at least by default fail.

Actually, the implicit contract of __str__ is that it never fails, so
that everything can be printed out (for debugging purposes, etc.).

Regards

Antoine.
 

Hallvard B Furuseth

Arnaud said:
I think you misunderstand the purpose of str(). It is to provide a
(unicode) string representation of an object and has nothing to do with
converting it to unicode:

That's not the point - the point is that for 2.* code which _uses_ str
vs unicode, the equivalent 3.* code uses str vs bytes. Yet not the
same way - a 2.* 'str' will sometimes be 3.* bytes, sometimes str. So
upgraded old code will have to expect both str and bytes.

In 2.*, str<->unicode conversion failed or produced the equivalent
character/byte data. Yes, there could be charset problems if the
defaults were set up wrong, but that's a smaller problem than in 3.*.
In 3.*, the bytes->str conversion always _silently_ produces garbage.

And lots of code uses both, and needs to convert back and forth. In
particular 3.* code converted from 2.*, or using modules converted
from 2.*. There's a lot of such code, and will be for a long time.
 

Hallvard B Furuseth

Antoine said:
This probably comes from overuse of str(s) and "%s". They can be useful
to produce human-readable messages, but you shouldn't have to use them
very often.

Maybe Python 3 has something better, but they could be hard to avoid in
Python 2. And certainly our site has plenty of code using them, whether
we should have avoided them or not.

Actually, the implicit contract of __str__ is that it never fails, so
that everything can be printed out (for debugging purposes, etc.).

Nope:

$ python2 -c 'str(u"\u1000")'
Traceback (most recent call last):
File "<string>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\u1000' in position 0: ordinal not in range(128)

And the equivalent:

$ python2 -c 'unicode("\xA0")'
Traceback (most recent call last):
File "<string>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)

In Python 2, these two UnicodeErrors made our data safe from code
which used str and unicode objects without checking too carefully which
was which. Code which did not sort the types out carefully enough would fail.

In Python 3, that safety only exists for bytes(str), not str(bytes).
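
The asymmetry can be seen directly in Python 3. This is only a sketch of the two calls named above, with illustrative variable names:

```python
# Python 3: the two directions are not symmetric.
text, raw = "bar", b"bar"

try:
    bytes(text)                  # no encoding given
    bytes_from_str = "no error"
except TypeError as exc:
    bytes_from_str = "TypeError: %s" % exc

str_from_bytes = str(raw)        # succeeds, returning the repr

print(bytes_from_str)
print(str_from_bytes)
```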
 

Steven D'Aprano

That's not the point - the point is that for 2.* code which _uses_ str
vs unicode, the equivalent 3.* code uses str vs bytes. Yet not the same
way - a 2.* 'str' will sometimes be 3.* bytes, sometimes str. So
upgraded old code will have to expect both str and bytes.

I'm sorry, this makes no sense to me. I've read it repeatedly, and I
still don't understand what you're trying to say.

In 2.*, str<->unicode conversion failed or produced the equivalent
character/byte data. Yes, there could be charset problems if the
defaults were set up wrong, but that's a smaller problem than in 3.*. In
3.*, the bytes->str conversion always _silently_ produces garbage.

So you say, but I don't see it. Why is this garbage?

>>> str(b'abc\xff')
"b'abc\\xff'"

That's what I would expect from the str() function called with a bytes
argument. Since decoding bytes requires a codec, which you haven't given,
it can only return a string representation of the bytes.

If you want to decode bytes into a string, you need to specify a codec:

>>> str(b'abc\xff', 'latin-1')
'abcÿ'
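
Which codec is named matters. The following sketch, using the same byte string, contrasts a codec that accepts every byte with a stricter one; the variable names are illustrative only:

```python
raw = b"abc\xff"

# latin-1 maps every byte to a code point, so decoding never fails.
as_latin1 = str(raw, "latin-1")      # same result as raw.decode("latin-1")

# utf-8 is stricter: 0xff is not valid there, so decoding raises.
try:
    as_utf8 = raw.decode("utf-8")
except UnicodeDecodeError as exc:
    as_utf8 = None
    print("utf-8 failed:", exc.reason)

# An explicit error handler trades the exception for U+FFFD markers.
lossy = raw.decode("utf-8", errors="replace")

# ascii() keeps the output printable on any terminal encoding.
print(ascii(as_latin1), ascii(lossy))
```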
 

Hallvard B Furuseth

Steven said:
I'm sorry, this makes no sense to me. I've read it repeatedly, and I
still don't understand what you're trying to say.

OK, here is a simplified example after 2to3:

try:
    from urlparse import urlparse, urlunparse      # Python 2.6
except ImportError:
    from urllib.parse import urlparse, urlunparse  # Python 3.2a

foo, bar = b"/foo", b"bar"  # Data from network, bar normally empty

# Statement inserted for 3.2 when urlparse below raised TypeError
if isinstance(foo, bytes):
    foo = foo.decode("ASCII")

p = list(urlparse(foo))
if bar:
    p[3] = bar  # bar is still bytes here
print(urlunparse(p))

2.6 prints "/foo;bar", 3.2a prints "/foo;b'bar'"
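
To check what a given interpreter does with the mixed call, a small probe like the following may help. The result reported above for 3.2a is "/foo;b'bar'". I believe later Python 3 releases reject mixed str/bytes arguments with a TypeError instead, but that is an assumption to verify on your own version:

```python
from urllib.parse import urlunparse

# Mixed str/bytes components, as in the example above.
parts = ("", "", "/foo", b"bar", "", "")

try:
    outcome = urlunparse(parts)
except TypeError as exc:
    outcome = "TypeError: %s" % exc

print(outcome)
```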

You have a module which receives some strings/bytes, maybe data which
originates on the net or in a database. The module _and its callers_
may date back to before the 'bytes' type, maybe before 'unicode'.
The module is supposed to work with this data and produce some 'str's
or bytes to output. _Not_ a Python representation like "b'bar'".

The module doesn't always know which input is 'bytes' and which is
'str'. Or the callers don't know what it expects, or haven't kept
track. Maybe the input originated as bytes and was converted to
str at some point, maybe not.
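
One way a module in this position could protect itself is to normalize its inputs at the boundary. The helper below is hypothetical (the name to_text and the ASCII default are assumptions, not part of any library); it is a sketch of the idea rather than a recommended API:

```python
def to_text(value, encoding="ascii"):
    """Return value as str, decoding bytes explicitly (hypothetical helper)."""
    if isinstance(value, str):
        return value
    if isinstance(value, (bytes, bytearray)):
        # Raises UnicodeDecodeError on bytes outside the chosen encoding.
        return bytes(value).decode(encoding)
    raise TypeError("expected str or bytes, got %s" % type(value).__name__)


print(to_text("bar"), to_text(b"bar"))
```

With this, a stray bytes value either becomes the intended text or raises, and never turns into "b'bar'".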

Look at urllib/parse.py and its isinstance(<data>, <str or bytes>)
calls. urlencode() looks particularly gross, though that one has code
which could be factored out. They didn't catch everything either, I
posted this when a 2to3'ed module of mine produced URLs with "b'bar'".

In the pre-'unicode type' Python (was that early Python 2, or should
I have said Python 1?) that was a non-issue - it Just Worked, sans
possible charset issues.

In Python 2 with unicode, the module would get it right or raise an
exception. Which helps the programmer fix any charset issues.

In Python 3, the module does not raise an exception, it produces
"b'bar'" when it was supposed to produce "bar".
So you say, but I don't see it. Why is this garbage?

To the user of the module, stuff with Python syntax is garbage. It
was supposed to be text/string data.
"b'abc\\xff'"

That's what I would expect from the str() function called with a bytes
argument. Since decoding bytes requires a codec, which you haven't given,
it can only return a string representation of the bytes.

If you want to decode bytes into a string, you need to specify a codec:

Except I didn't intend to decode anything - I just intended to output
the contents of the string - which was stored in a 'bytes' object.
But __str__ got called because a lot of code does that. It wasn't
even my code which did it.

There's often no obvious place to decide when to consider a stream of
data as raw bytes and when to consider it text, and no obvious time
to convert between bytes and str. When writing a program, one simply
has to decide. Such as network data (bytes) vs urllib URLs (str)
in my program. And the decision is different from what one would
decide for when to use str and when to use unicode in Python 2.

In this case I'll report urlunparse as a bug at python.org, but there'll be
a _lot_ of such code around. And without an Exception getting raised,
it'll take time to find it. So it looks like it'll be a long time
before I dare entrust my data to Python 3, except maybe with modules
written from scratch.
 

Antoine Pitrou

Maybe Python 3 has something better, but they could be hard to avoid in
Python 2. And certainly our site has plenty of code using them, whether
we should have avoided them or not.

It's difficult to answer more precisely without knowing what you're
doing precisely. But if you already have str objects, you don't have to
call str() or format them using "%s", so implicit __str__ calls are
avoided.

Actually, the implicit contract of __str__ is that it never fails, so
that everything can be printed out (for debugging purposes, etc.).

Nope:

$ python2 -c 'str(u"\u1000")'
Traceback (most recent call last): [...]
$ python2 -c 'unicode("\xA0")'
Traceback (most recent call last):

Sure, but so what? This mainly shows that unicode support was broken in
Python 2, because:
1) it tried to do implicit bytes<->unicode coercion by using some
process-wide default encoding
2) some unicode objects didn't have a successful str()

Python 3 fixes both these issues. Fixing 1) means there's no automatic
coercion when trying to mix bytes and unicode. Try for example:

[Python 2] >>> u"a" + "b"
u'ab'
[Python 3] >>> "a" + b"b"
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: Can't convert 'bytes' object to str implicitly


And fixing 2) means bytes objects get a meaningful str() in all
circumstances, which is much better for debug output.

If you don't think that 2) is important, then perhaps you don't deal
with non-ASCII data a lot. Failure to print out exception messages (or
log entries, etc.) containing non-ASCII characters is a big annoyance
with Python 2 for many people (including me).

In Python 2, these two UnicodeErrors made our data safe from code
which used str and unicode objects without checking too carefully which
was which.

That's false, since implicit coercion can actually happen everywhere.
And it only fails when there's non-ASCII data involved, meaning the
unsuspecting Anglo-Saxon developer doesn't understand why his/her users
complain.

Regards

Antoine.
 

Terry Reedy

Nope:

$ python2 -c 'str(u"\u1000")'
Traceback (most recent call last):
File "<string>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\u1000' in position 0: ordinal not in range(128)

This could be considered a design bug due to 'str' being used both to
produce readable string representations of objects (perhaps one that
could be eval'ed) and to convert unicode objects to equivalent string
objects, which is not the same operation!

The above really should have produced '\u1000'! (the equivalent of what
str(bytes) does today). The 'conversion to equivalent str object' option
should have required an explicit encoding arg rather than defaulting to
the ascii codec. This mistake has been corrected in 3.x, so Yep.

And the equivalent:

$ python2 -c 'unicode("\xA0")'
Traceback (most recent call last):
File "<string>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)

This is an application bug: either bad string or missing decoding arg.

In Python 2, these two UnicodeErrors made our data safe from code
which used str and unicode objects without checking too carefully which
was which. Code which did not sort the types out carefully enough would fail.

In Python 3, that safety only exists for bytes(str), not str(bytes).

If you prefer the buggy 2.x design (and there are *many* tracker bug
reports that were fixed by the 3.x change), stick with it.
 

Terry Reedy

That's not the point - the point is that for 2.* code which _uses_ str
vs unicode, the equivalent 3.* code uses str vs bytes. Yet not the
same way - a 2.* 'str' will sometimes be 3.* bytes, sometimes str. So
upgraded old code will have to expect both str and bytes.

If you want to interconvert code between 2.6/7 and 3.x, use unicode and
bytes in the 2.x code. Bytes was added to 2.6/7 as a synonym for str
explicitly and only for conversion purposes.
 

Hallvard B Furuseth

Terry said:
If you want to interconvert code between 2.6/7 and 3.x, use unicode and
bytes in the 2.x code. Bytes was added to 2.6/7 as a synonym for str
explicitly and only for conversion purposes.

That's what I did, see article <[email protected]>.
And that's exactly what broke as described, because bytes.__str__
has different meanings in 2.x and 3.x: the raw contents vs the repr.
So a library function which used %s produced a different result.
 

Hallvard B Furuseth

Antoine said:
It's difficult to answer more precisely without knowing what you're
doing precisely.

I'd just posted an example in article <[email protected]>:

urllib.parse.urlunparse(('', '', '/foo', b'bar', '', '')) returns
"/foo;b'bar'" instead of raising an exception or returning 2.6's correct
"/foo;bar".

But if you already have str objects, you don't have to
call str() or format them using "%s", so implicit __str__ calls are
avoided.

Except it's quite normal to output strings with %s. Above, a library
did it for me. Maybe also to depend on the fact that str.__str__() is a
noop, so one can call str() just in case some variable needs to be
unpacked to a plain string. urllib.parse is an example of that too.

Actually, the implicit contract of __str__ is that it never fails, so
that everything can be printed out (for debugging purposes, etc.).

Nope:

$ python2 -c 'str(u"\u1000")'
Traceback (most recent call last): [...]
$ python2 -c 'unicode("\xA0")'
Traceback (most recent call last):

Sure, but so what?

So your statement above was wrong, which you made in response to my
suggested solution.

This mainly shows that unicode support was broken in
Python 2, because:

...because Python 2 was designed so there was no way to avoid poor
unicode support one way or other. Python 3 has not fixed this, it has
just moved the problem elsewhere.

1) it tried to do implicit bytes<->unicode coercion by using some
process-wide default encoding

I had completely forgotten that. I've been lucky (with my sysadmins
maybe:) and lived with ASCII default encoding. Checking around I see
now Python2 site.py used my locale for the encoding, as if that had any
relevance for my data...

2) some unicode objects didn't have a successful str()

Python 3 fixes both these issues. Fixing 1) means there's no automatic
coercion when trying to mix bytes and unicode.

Fine, so programs will have to do it themselves...
(...)

And fixing 2) means bytes objects get a meaningful str() in all
circumstances, which is much better for debug output.

Except str() on such data has a different meaning than it did before, so
equivalent programs *silently* produce different results. Which is why
I started this thread.

If you don't think that 2) is important, then perhaps you don't deal
with non-ASCII data a lot. Failure to print out exception messages (or
log entries, etc.) containing non-ASCII characters is a big annoyance
with Python 2 for many people (including me).

I'm Norwegian. I do deal with non-ASCII and I agree failures in error
messages are annoying.

OTOH if the same bug that previously caused an error in an error,
instead quietly munges my data, that's worse than annoying. I've dealt
with that too, and the fix is to use another tool. (Ironically, in one
case it meant moving from Perl to Python, and now Python has followed
Perl...)

That's false, since implicit coercion can actually happen everywhere.

Right, it was true as long as my encoding was ASCII.
 

Stefan Behnel

Hallvard B Furuseth, 11.10.2010 21:50:
Fine, so programs will have to do it themselves...

Yes, they can finally handle bytes and Unicode data correctly and safely.
Having byte data turn into Unicode strings unexpectedly makes the behaviour
of your code hardly predictable and fairly error prone. In Python 3, it's
now possible to do the conversion safely at well defined points in your
code and rely on the runtime to bark at you when something slips through or
is mistreated. Detecting errors early makes your code better.
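
A sketch of what "conversion at well defined points" can look like. The function handle_request and its UTF-8 default are hypothetical, chosen only to illustrate the pattern of decoding on input, working on str, and encoding on output:

```python
def handle_request(raw_path: bytes, encoding: str = "utf-8") -> bytes:
    """Hypothetical handler: bytes in, str inside, bytes out."""
    path = raw_path.decode(encoding)        # decode once, on input
    reply = "requested: " + path.upper()    # all processing happens on str
    return reply.encode(encoding)           # encode once, on output


print(handle_request(b"/foo"))

# Mixing the two types by accident is rejected by the runtime.
try:
    "requested: " + b"/foo"
    mixing_result = "no error"
except TypeError:
    mixing_result = "TypeError"

print(mixing_result)
```

Note that this check covers concatenation. It does not cover str() or "%s", which is the gap discussed in this thread.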

That's a huge improvement. It didn't come for free and the current Python 3
releases still have their rough edges. But there are few left and the
situation is constantly improving. You can help out if you want.

Stefan
 

Hallvard B Furuseth

Terry said:
This could be considered a design bug due to 'str' being used both to
produce readable string representations of objects (perhaps one that
could be eval'ed) and to convert unicode objects to equivalent string
objects, which is not the same operation!

Indeed, the eager str() and the lack of a more narrow str function is
one root of the problem. I'd put it more generally: Converting an
object which represents a string, to an actual str. *And* __str__ may
be intended for Python-independent representations like 23 -> "23".

I expect that's why quite a bit of code calls str() just in case, which
is another root of the problem. E.g. urlencode(), as I said. The code
might not need to, but str('string') is a noop so it doesn't hurt.
Maybe that's why %s does too, instead of demanding that the user calls
str() if needed.

The above really should have produced '\u1000'! (the equivalent of what
str(bytes) does today). The 'conversion to equivalent str object' option
should have required an explicit encoding arg rather than defaulting to
the ascii codec. This mistake has been corrected in 3.x, so Yep.

If there were a __plain_str__() method which was supposed to fail rather
than start to babble Python syntax, and if there were not plenty of
Python code around which invoked __str__, I'd agree.
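
No such method exists in Python. As a rough illustration of the idea at function level, a strict wrapper could look like the sketch below; plain_str is a hypothetical name, and code would have to call it explicitly in place of str():

```python
def plain_str(obj):
    """Like str(), but refuse byte strings (hypothetical, not a real API)."""
    if isinstance(obj, (bytes, bytearray, memoryview)):
        raise TypeError("plain_str() refuses %s; decode it explicitly"
                        % type(obj).__name__)
    return str(obj)


print(plain_str(23), plain_str("bar"))
```

This only helps code that opts in. It does nothing for "%s" or for libraries that already call str().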

As it is, this "correction" instead is causing code which previously
produced the expected non-Python-related string output, to instead
produce Pythonesque repr() stuff. See below.

This is an application bug: either bad string or missing decoding arg.

Exactly. And Python 2 caught the bug. (Since I had Ascii default
decoding, I'd forgotten Python could pick another default.)

For an app which handles Unicode vs. raw bytes, the equivalent Python 3
code is str(b"\xA0"). That's the *same* application bug, in equivalent
application code, and Python 3 does not catch it. This time the bug is
spelled str() instead, which is much more likely than old unicode() to
happen somewhere thanks to the str()-related misdesign discussed above.

Article <[email protected]> in this thread has an example.


And that's the third root of the problem above. Technically it's the
same problem that an application bug can do str(None) where it should be
using a string, and produce garbage text. The difference is that Python
forces programs to deal with these two different character/octet string
types, sometimes swapping back and forth between them. And it's not
necessarily obvious from the code which type is in use where. Python 3
has not changed that, it has strengthened it by removing the default
conversion.

Yet while the programmer now needs to be _more_ careful about this than
before, Python 3 has removed the exception which caught this particular
bug instead of doing something to make it easier to find such bugs.

That's why I suggested making bytes.__str__ fail by default, annoying
as it would be. But I don't know how annoying it'd be. Maybe there
could be an option to disable it.

If you prefer the buggy 2.x design (and there are *many* tracker bug
reports that were fixed by the 3.x change), stick with it.

Bugs even with ASCII default encoding? Looking closer at setencoding()
in site.py, it doesn't seem to do anything, it's "if 0"ed out.

As I think I've made clear, I certainly don't feel like entrusting
Python 3 with my raw string data just yet.
 

Hallvard B Furuseth

Stefan said:
Hallvard B Furuseth, 11.10.2010 21:50:

Yes, they can finally handle bytes and Unicode data correctly and
safely. Having byte data turn into Unicode strings unexpectedly makes
the behaviour of your code hardly predictable and fairly error prone. In
Python 3, it's now possible to do the conversion safely at well defined
points in your code and rely on the runtime to bark at you when
something slips through or is mistreated. Detecting errors early makes
your code better.

That's a huge improvement. It didn't come for free and the current
Python 3 releases still have their rough edges. But there are few left
and the situation is constantly improving. You can help out if you want.

I quite agree with most of that - just not about it being safe, see my
reply to Terry Reedy. Hence my suggestion to change or disable
bytes.__str__. And yes, I'll be submitting some fixes or bug reports.
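
For what it is worth, CPython has an opt-in switch in this direction: the -b command-line option warns on str() of a bytes instance, and -bb turns that warning into an error. It is not the default being asked for here, and as far as I know it applies per process rather than per module. A sketch that compares the two modes by launching child interpreters:

```python
import subprocess
import sys

code = "print(str(b'bar'))"

# Default mode: str(bytes) quietly returns the repr.
plain = subprocess.run([sys.executable, "-c", code],
                       capture_output=True, text=True)

# -bb: the same call fails, with BytesWarning raised as an error.
strict = subprocess.run([sys.executable, "-bb", "-c", code],
                        capture_output=True, text=True)

print(plain.returncode, plain.stdout.strip())
print(strict.returncode, strict.stderr.strip().splitlines()[-1])
```

Running a test suite under -bb is one way to flush out accidental str(bytes) calls in converted code.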
 

Antoine Pitrou

I'd just posted an example in article <[email protected]>:

urllib.parse.urlunparse(('', '', '/foo', b'bar', '', '')) returns
"/foo;b'bar'" instead of raising an exception or returning 2.6's correct
"/foo;bar".

Oh, this looks like a bug in urlparse. Could you report it at
http://bugs.python.org ? Thanks.

Except it's quite normal to output strings with %s.

"%s" will take the string representation of anything you give it:
bytes, but also, files, sockets, dicts, tuples, etc. So, if you're
using "%s" somewhere, it's your job to ensure that you give it the
desired type.

Maybe also to depend on the fact that str.__str__() is a
noop, so one can call str() just in case some variable needs to be
unpacked to a plain string.

Well, if you don't know what types you are currently handling and
convert them to strings "just in case", chances are you're doing
something wrong.

Fine, so programs will have to do it themselves...

That's exactly the point, yes :) It's not Python's job to guess how some
bytes you got e.g. on a socket should be decoded.

Except str() on such data has a different meaning than it did before,

Yes, it's Python 3 and it's incompatible with Python 2... !

Regards

Antoine.
 

Stefan Behnel

Hallvard B Furuseth, 11.10.2010 23:45:
If there were a __plain_str__() method which was supposed to fail rather
than start to babble Python syntax, and if there were not plenty of
Python code around which invoked __str__, I'd agree.

Yes, calling str() "just in case" has a clear code smell. I think that's
one of the reasons why b'abc' was chosen as output of bytes.__str__, to
make it clearly visible a) what the type of the value is, e.g. in an
interactive session, and b) that this wasn't the intended operation if it
happened during string interpolation etc. and that the user code needs
fixing. After all, you were complaining about a clearly visible problem (in
urlunparse) that was easy to find given the incorrect output.

I think raising an exception in bytes.__str__ would be a horrible thing to
do. That would make it really hard and dangerous to look at bytes objects
in a debugger or interpreter session. I think the current way bytes.__str__
behaves is a good tradeoff between safety and usability, and the output is
also very clear and readable.

Stefan
 

Hrvoje Niksic

Stefan Behnel said:
Hallvard B Furuseth, 11.10.2010 23:45:

Yes, calling str() "just in case" has a clear code smell. I think
that's one of the reasons why b'abc' was chosen as output of
bytes.__str__, to make it clearly visible a) what the type of the
value is, e.g. in an interactive session

Isn't that the point of repr()?

I think raising an exception in bytes.__str__ would be a horrible
thing to do. That would make it really hard and dangerous to look at
bytes objects in a debugger or interpreter session.

Again, the interactive interpreter prints out the repr, and so should
debuggers, etc. In fact, when the object is embedded in a container,
all you get is the repr anyway.
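
The container point is easy to confirm. In this sketch the class Tag is made up purely to tell __str__ and __repr__ apart:

```python
class Tag:
    def __str__(self):
        return "plain"

    def __repr__(self):
        return "Tag()"


item = Tag()
direct = str(item)                    # uses __str__
in_list = str([item])                 # list elements are shown via __repr__
bytes_in_list = str([b"abc", "abc"])  # both elements appear as reprs

print(direct, in_list, bytes_in_list)
```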
 
