Totally confused by the str/bytes/unicode differences introduced in Python 3.x

Giampaolo Rodola' · Jan 17, 2009

Hi,
I'm sure the message I'm going to write will seem quite dumb to most
people, but I really don't understand the str/bytes/unicode
differences introduced in Python 3.0, so please be patient.
What I'm trying to do is port pyftpdlib to Python 3.x.
I don't want to support Unicode. I don't want pyftpdlib for py 3k to
do anything new or different.
I just want it to behave exactly the same as in the 2.x version and
I'd like to know if that's possible with Python 3.x.

Now. The basic difference is that socket.recv() returns a bytes object
instead of a string object and that's the thing which confuses me
mainly.
My question is: is there a way to convert that bytes object into
exactly *the same thing* returned by socket.recv() in Python 2.x (a
string)?

I know I can do:

data = socket.recv(1024)
data = data.decode(encoding)

...to convert bytes into a string, but that's not exactly the same
thing.
In Python 2.x I didn't have to care about the encoding. What
socket.recv() returned was just a string. That was all.
Now doing something like b''.decode(encoding) puts me in serious
trouble, since it can raise an exception if the client and server
use different encodings.

As far as I've understood the basic difference I see now is that a
Python 2.x based FTP server could handle a 3.x based FTP client using
"latin1" encoding or "utf-8" or anything else while with Python 3.x
I'm forced to tell my server which encoding to use and I don't know
how to deal with that.

--- Giampaolo
http://code.google.com/p/pyftpdlib

MRAB · Jan 17, 2009

Giampaolo said:
> Hi, I'm sure the message I'm going to write will seem quite dumb to
> most people but I really don't understand the str/bytes/unicode
> differences introduced in Python 3.0 so be patient. What I'm trying
> to do is porting pyftpdlib to Python 3.x. I don't want to support
> Unicode. I don't want pyftpdlib for py 3k to do anything new or
> different. I just want it to behave exactly the same as in the 2.x
> version and I'd like to know if that's possible with Python 3.x.
>
> Now. The basic difference is that socket.recv() returns a bytes
> object instead of a string object and that's the thing which confuses
> me mainly. My question is: is there a way to convert that bytes
> object into exactly *the same thing* returned by socket.recv() in
> Python 2.x (a string)?
>
> I know I can do:
>
> data = socket.recv(1024)
> data = data.decode(encoding)
>
> ...to convert bytes into a string but that's not exactly the same
> thing. In Python 2.x I didn't have to care about the encoding. What
> socket.recv() returned was just a string. That was all. Now doing
> something like b''.decode(encoding) puts me in serious troubles since
> that can raise an exception in case client and server use a different
> encoding.
>
> As far as I've understood the basic difference I see now is that a
> Python 2.x based FTP server could handle a 3.x based FTP client using
> "latin1" encoding or "utf-8" or anything else while with Python 3.x
> I'm forced to tell my server which encoding to use and I don't know
> how to deal with that.
>

Originally Python had a single string type 'str' with 8 bits per
character. That was a bit limiting for international use. Then a new
string type 'unicode' was introduced.

Now, in Python 3.x, it's time to tidy things up.

The 'str' type has been renamed 'bytes' and the 'unicode' type has been
renamed 'str'. If you're truly working with strings of _characters_ then
'str' is what you need, but if you're working with strings of _bytes_
then 'bytes' is what you need.

socket.send() and socket.recv() are still the same, it's just that it's
now clearer that they work with bytes and not strings.
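
As a minimal illustration of that distinction (Python 3 only; the byte values below are invented for the example and are not taken from pyftpdlib): received data is bytes, and turning it into str is an explicit decode that can fail for some encodings. Latin-1 maps every byte value 0-255 to exactly one code point, so decoding with it never raises and round-trips losslessly, which is probably the closest thing to the 2.x behaviour asked about.

```python
# Pretend this came from sock.recv(); 0xE8 is "è" in Latin-1 but is not
# valid UTF-8 on its own.
data = b"STOR caff\xe8.txt\r\n"

# Strict UTF-8 decoding fails on this input.
try:
    data.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Latin-1 assigns a character to every possible byte, so this cannot fail,
# and encoding back yields the original bytes unchanged.
text = data.decode("latin-1")
round_tripped = text.encode("latin-1")

print(utf8_ok)                # False
print(round_tripped == data)  # True
```

Whether treating everything as Latin-1 is appropriate depends on what the text is later used for; it preserves the bytes, but the characters shown may not be the ones the client intended.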

Giampaolo Rodola' · Jan 17, 2009

MRAB said:
> If you're truly working with strings of _characters_ then
> 'str' is what you need, but if you're working with strings of _bytes_
> then 'bytes' is what you need.

I work with strings of characters, but to convert bytes into a string I
need to specify an encoding, and that's what confuses me.
Before, there was no need to deal with that.

--- Giampaolo
http://code.google.com/p/pyftpdlib

Steven D'Aprano · Jan 17, 2009

Giampaolo said:
> I work with string of characters but to convert bytes into string I need
> to specify an encoding and that's what confuses me. Before there was no
> need to deal with that.

In Python 2.x, str means "string of bytes". This has been renamed "bytes"
in Python 3.

In Python 2.x, unicode means "string of characters". This has been
renamed "str" in Python 3.

If you do this in Python 2.x:

my_string = str(bytes_from_socket)

then you don't need to convert anything, because you are going from a
string of bytes to a string of bytes.

If you do this in Python 3:

my_string = str(bytes_from_socket)

then you *do* have to convert, because you are going from a string of
bytes to a string of characters (unicode). The Python 2.x equivalent code
would be:

my_string = unicode(bytes_from_socket)

and when you convert to unicode, you can get encoding errors. A better
way to do this would be some variation on:

my_str = bytes_from_socket.decode('utf-8')

You should read this:

http://www.joelonsoftware.com/articles/Unicode.html
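
One caveat worth a short sketch (Python 3 behaviour; the sample bytes are invented): calling str() on a bytes object without an encoding does not decode it. It returns the printable representation, including the b'...' prefix. Decoding needs an explicit encoding, either via str(data, encoding) or data.decode(encoding), and the optional errors argument controls what happens to undecodable bytes.

```python
bytes_from_socket = b"USER anonymous\r\n"

# No encoding given: this is the repr of the bytes object, not decoded text.
as_repr = str(bytes_from_socket)

# Explicit encoding: both spellings decode the bytes.
decoded_a = str(bytes_from_socket, "utf-8")
decoded_b = bytes_from_socket.decode("utf-8")

# "replace" substitutes U+FFFD for undecodable input instead of raising.
lenient = b"caf\xe9".decode("utf-8", "replace")

print(as_repr)                 # b'USER anonymous\r\n'
print(decoded_a == decoded_b)  # True
```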

Giampaolo Rodola' · Jan 17, 2009

Steven said:
> In Python 2.x, str means "string of bytes". This has been renamed "bytes"
> in Python 3.
>
> In Python 2.x, unicode means "string of characters". This has been
> renamed "str" in Python 3.
>
> If you do this in Python 2.x:
>
> my_string = str(bytes_from_socket)
>
> then you don't need to convert anything, because you are going from a
> string of bytes to a string of bytes.
>
> If you do this in Python 3:
>
> my_string = str(bytes_from_socket)
>
> then you *do* have to convert, because you are going from a string of
> bytes to a string of characters (unicode). The Python 2.x equivalent code
> would be:
>
> my_string = unicode(bytes_from_socket)
>
> and when you convert to unicode, you can get encoding errors. A better
> way to do this would be some variation on:
>
> my_str = bytes_from_socket.decode('utf-8')
>
> You should read this:
>
> http://www.joelonsoftware.com/articles/Unicode.html

Thanks, that clarifies a bit even if I still have a lot of doubts.
I wish I could do:

my_str = bytes_from_socket.decode('utf-8')

That would mean not having to replace "" with b"" almost everywhere in
my code, but I doubt it would actually be a good idea.
RFC 2640 states that UTF-8 is the preferred encoding for both
clients and servers, but I see that Python 3.x's ftplib uses latin1,
for example (bug?). How is my server supposed to deal with that?
I think that using bytes everywhere, as Christian recommended, would
be the only way to behave exactly like the 2.x version, but that's not
easy at all.

--- Giampaolo
http://code.google.com/p/pyftpdlib
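
One possible way to cope with clients that send either encoding is to try UTF-8 first and fall back to Latin-1. The sketch below is only an assumption about how that could look on Python 3: decode_command is a hypothetical helper, not part of pyftpdlib or ftplib, and the fallback order is a design choice rather than something the RFC prescribes.

```python
def decode_command(raw):
    """Decode a raw FTP command line, preferring UTF-8 (hypothetical helper).

    Returns the decoded text together with the encoding that was used.
    """
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Latin-1 accepts any byte sequence, so this branch cannot raise.
        return raw.decode("latin-1"), "latin-1"


utf8_result = decode_command(b"CWD caf\xc3\xa9\r\n")
latin1_result = decode_command(b"CWD caf\xe9\r\n")
print(utf8_result[1], latin1_result[1])  # utf-8 latin-1
```

Returning the encoding alongside the text lets the caller encode replies the same way. The heuristic is not foolproof: a Latin-1 byte sequence that happens to be valid UTF-8 would be decoded as UTF-8.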

Steve Holden · Jan 17, 2009

Giampaolo said:
> I work with string of characters but to convert bytes into string I
> need to specify an encoding and that's what confuses me.
> Before there was no need to deal with that.

I don't yet understand why you feel you have to convert what you receive
to a string. In Python 3.0 bytes is the same as a string in 2.6, for
most practical purposes.

regards
Steve
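
To make "bytes behaves like a 2.x string" concrete, here is a small sketch that parses a command line without ever decoding it. It assumes Python 3; parse_command is a hypothetical name used only for illustration and is not pyftpdlib's API.

```python
def parse_command(line):
    """Split a raw FTP command line into (verb, argument), all in bytes.

    Hypothetical helper for illustration only.
    """
    line = line.rstrip(b"\r\n")
    verb, _, arg = line.partition(b" ")
    return verb.upper(), arg


verb, arg = parse_command(b"retr caf\xe9.txt\r\n")
print(verb, arg)

# One difference from the 2.x str type: indexing gives an int,
# while slicing still gives bytes.
first_item = verb[0]
first_slice = verb[:1]
```

Methods such as split, partition, upper and startswith all exist on bytes, but every literal passed to them has to be a bytes literal too.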

Giampaolo Rodola' · Jan 17, 2009

Steve said:
> I don't yet understand why you feel you have to convert what you receive
> to a string. In Python 3.0 bytes is the same as a string in 2.6, for
> most practical purposes.
>
> regards
> Steve

That would help to avoid replacing "" with b"" almost everywhere in my
code.

--- Giampaolo
http://code.google.com/p/pyftpdlib

Terry Reedy · Jan 17, 2009

Giampaolo said:
> That would help to avoid replacing "" with b"" almost everywhere in my
> code.

Won't 2to3 do that for you?

Giampaolo Rodola' · Jan 17, 2009

Terry said:
> Won't 2to3 do that for you?

I used 2to3 against my code but it didn't cover the "" -> b""
conversion (and I doubt it is able to do so, anyway).

--- Giampaolo
http://code.google.com/p/pyftpdlib

Steve Holden · Jan 17, 2009

Giampaolo said:
> I used 2to3 against my code but it didn't cover the "" -> b""
> conversion (and I doubt it is able to do so, anyway).

Note that if you are using 2.6 you should first convert your "" quotes
to b"" - this won't make any practical difference, but then you will be
able to run 2to3 on your code and (one hopes) convert for 3.0 automatically.

regards
Steve
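
A short sketch of why the literals need marking at all (Python 3 behaviour; the reply strings are made up for the example): bytes and str no longer mix, so an unmarked "" literal next to socket data raises TypeError. On 2.6 the b prefix is accepted and simply produces an ordinary str, which is why adding it there changes nothing.

```python
greeting = b"220 ready\r\n"

# bytes + bytes is fine.
reply = greeting + b"331 password required\r\n"

# bytes + str raises TypeError on Python 3
# (on 2.6 both operands are str, so the same line works).
try:
    greeting + "oops"
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(mixed_ok)  # False on Python 3
```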

John Machin · Jan 17, 2009

Steve said:
> Note that if you are using 2.6 you should first convert your "" quotes
> to b"" - this won't make any practical difference, but then you will be
> able to run 2to3 on your code and (one hopes) convert for 3.0 automatically.

Perhaps before we get too far down the track of telling the OP what he
should do, we should ask him a little about his intentions:

Is he porting to 3.0 and abandoning 2.x support completely?
[presumably unlikely]
So then what is the earliest 2.x that he wants to support at the same
time as 3.x? [presumably at least 2.5]
Does he intend to maintain two separate codebases, one 2.x and the
other 3.x?
Else does he intend to maintain just one codebase written in some 2.x
dialect and using 2to3 plus sys.version-dependent code for the things
that 2to3 can't/doesn't handle?

Cheers,
John

Giampaolo Rodola' · Jan 17, 2009

John said:
>> Note that if you are using 2.6 you should first convert your "" quotes
>> to b"" - this won't make any practical difference, but then you will be
>> able to run 2to3 on your code and (one hopes) convert for 3.0 automatically.
>
> Perhaps before we get too far down the track of telling the OP what he
> should do, we should ask him a little about his intentions:
>
> Is he porting to 3.0 and abandoning 2.x support completely?
> [presumably unlikely]

No.

> So then what is the earliest 2.x that he wants to support at the same
> time as 3.x? [presumably at least 2.5]

I currently support Python versions from 2.3 to 2.6 using a single
codebase.
My idea is to support 3.x starting from the upcoming release.

> Does he intend to maintain two separate codebases, one 2.x and the
> other 3.x?

I think I have no other choice.
Why? Is it theoretically possible to maintain a single code base for
both 2.x and 3.x?

> Else does he intend to maintain just one codebase written in some 2.x
> dialect and using 2to3 plus sys.version-dependent code for the things
> that 2to3 can't/doesn't handle?

I don't think it would be worth the effort.

> Cheers,
> John

Thanks a lot

Thanks a lot

--- Giampaolo
http://code.google.com/p/pyftpdlib

Martin v. Löwis · Jan 17, 2009

Giampaolo said:
>> Does he intend to maintain two separate codebases, one 2.x and the
>> other 3.x?
>
> I think I have no other choice.
> Why? Is theoretically possible to maintain an unique code base for
> both 2.x and 3.x?

That is certainly possible! One might have to make tradeoffs wrt.
readability sometimes, but I found that this approach works quite
well for Django. I think Mark Hammond is also working on maintaining
a single code base for both 2.x and 3.x, for PythonWin.

Regards,
Martin

Terry Reedy · Jan 17, 2009

Martin said:
> That is certainly possible! One might have to make tradeoffs wrt.
> readability sometimes, but I found that this approach works quite
> well for Django. I think Mark Hammond is also working on maintaining
> a single code base for both 2.x and 3.x, for PythonWin.

Where 'single codebase' means that the code runs as is in 2.x and as
autoconverted by 2to3 (or possibly a custom converter) in 3.x.

One barrier to doing this is when the 2.x code has a mix of string
literals, with some being character strings that should not have 'b'
prepended and some being true byte strings that should have 'b'
prepended. (Many programs do not have such a mix.)

One approach to dealing with string constants I have not yet seen
discussed here is to put them all in separate file(s) to be imported.
Group the text and bytes separately. Then marking the bytes with a 'b',
either by hand or by program, would be easy.

tjr

John Machin · Jan 18, 2009

Terry said:
> Where 'single codebase' means that the code runs as is in 2.x and as
> autoconverted by 2to3 (or possibly a custom converter) in 3.x.
>
> One barrier to doing this is when the 2.x code has a mix of string
> literals with some being character strings that should not have 'b'
> prepended and some being true byte strings that should have 'b'
> prepended. (Many programs do not have such a mix.)
>
> One approach to dealing with string constants I have not yet seen
> discussed here is to put them all in separate file(s) to be imported.
> Group the text and bytes separately. Then marking the bytes with a 'b',
> either by hand or program would be easy.

(1) How would this work for somebody who wanted/needed to support 2.5
and earlier?

(2) Assuming supporting only 2.6 and 3.x:

Suppose you have this line:
if binary_data[:4] == "PK\x03\x04": # signature of ZIP file

Plan A:
Change original to:
if binary_data[:4] == ZIPFILE_SIG: # "PK\x03\x04"
Add this to the bytes section of the separate file:
ZIPFILE_SIG = "PK\x03\x04"
[somewhat later]
Change the above to:
ZIPFILE_SIG = b"PK\x03\x04"
[once per original file]
Add near the top:
from separatefile import *

Plan B:
Change original to:
if binary_data[:4] == ZIPFILE_SIG: # "PK\x03\x04"
Add this to the separate file:
ZIPFILE_SIG = b"PK\x03\x04"
[once per original file]
Add near the top:
from separatefile import *

Plan C:
Change original to:
if binary_data[:4] == b"PK\3\4": # signature of ZIP file

Unless I'm gravely mistaken, you seem to be suggesting Plan A or some
variety thereof -- what advantages do you see in this over Plan C?

Terry Reedy · Jan 18, 2009

John said:
> (1) How would this work for somebody who wanted/needed to support 2.5
> and earlier?

See reposts in python wiki, one by Martin.

> (2) Assuming supporting only 2.6 and 3.x:
>
> Suppose you have this line:
> if binary_data[:4] == "PK\x03\x04": # signature of ZIP file
>
> Plan A:
> Change original to:
> if binary_data[:4] == ZIPFILE_SIG: # "PK\x03\x04"
> Add this to the bytes section of the separate file:
> ZIPFILE_SIG = "PK\x03\x04"
> [somewhat later]
> Change the above to:
> ZIPFILE_SIG = b"PK\x03\x04"
> [once per original file]
> Add near the top:
> from separatefile import *
>
> Plan B:
> Change original to:
> if binary_data[:4] == ZIPFILE_SIG: # "PK\x03\x04"
> Add this to the separate file:
> ZIPFILE_SIG = b"PK\x03\x04"
> [once per original file]
> Add near the top:
> from separatefile import *
>
> Plan C:
> Change original to:
> if binary_data[:4] == b"PK\3\4": # signature of ZIP file
>
> Unless I'm gravely mistaken, you seem to be suggesting Plan A or some
> variety thereof -- what advantages do you see in this over Plan C?

For 2.6 only (which is much easier than 2.x), do C. Plan A is for 2.x
where C does not work.

tjr

John Machin · Jan 18, 2009

Terry said:
> See reposts in python wiki, one by Martin.

Most relevant of these is Martin's article on porting Django, using a
single codebase. The """goal is to support all versions that Django
supports, plus 3.0""" -- indicating that it supports at least 2.5,
which won't eat b"blah" syntax. He is using 2to3, and handles bytes
constants by """django.utils.py3.b, which is a function that converts
its argument to an ASCII-encoded byte string. In 2.x, it is another
alias for str; in 3.x, it leaves byte strings alone, and encodes
regular (unicode) strings as ASCII. This function is used in all
places where string literals are meant as bytes, plus all cases where
str() was used to invoke the default conversion of 2.x."""

Very similar to what I expected. However it doesn't answer my question
about how your "move byte strings to a separate file, prepend 'b', and
import the separate file" strategy would help ... and given that 2.5
and earlier will barf on b"arf", I don't expect it to.
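
Going only by the description quoted above, a helper of that kind might look roughly like the sketch below. This is an illustration written from that description, not Django's actual code; the version check and the name ZIPFILE_SIG (borrowed from the plans in this thread) are assumptions.

```python
import sys

if sys.version_info[0] >= 3:
    def b(s):
        """Return s as bytes: bytes pass through, text is ASCII-encoded."""
        if isinstance(s, bytes):
            return s
        return s.encode("ascii")
else:
    # On 2.x, str already is the byte-string type.
    b = str

# The call syntax is valid on every 2.x version, unlike a b"..." literal.
ZIPFILE_SIG = b("PK\x03\x04")
print(ZIPFILE_SIG == b("PK\3\4"))  # True
```

Because the text is encoded as ASCII, a literal containing characters above 0x7f would raise UnicodeEncodeError on 3.x; a variant using Latin-1 would be needed for arbitrary byte values.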

Terry said:
>> (2) Assuming supporting only 2.6 and 3.x:
>>
>> Suppose you have this line:
>> if binary_data[:4] == "PK\x03\x04": # signature of ZIP file
>>
>> Plan A:
>> Change original to:
>> if binary_data[:4] == ZIPFILE_SIG: # "PK\x03\x04"
>> Add this to the bytes section of the separate file:
>> ZIPFILE_SIG = "PK\x03\x04"
>> [somewhat later]
>> Change the above to:
>> ZIPFILE_SIG = b"PK\x03\x04"
>> [once per original file]
>> Add near the top:
>> from separatefile import *
>>
>> Plan B:
>> Change original to:
>> if binary_data[:4] == ZIPFILE_SIG: # "PK\x03\x04"
>> Add this to the separate file:
>> ZIPFILE_SIG = b"PK\x03\x04"
>> [once per original file]
>> Add near the top:
>> from separatefile import *
>>
>> Plan C:
>> Change original to:
>> if binary_data[:4] == b"PK\3\4": # signature of ZIP file
>>
>> Unless I'm gravely mistaken, you seem to be suggesting Plan A or some
>> variety thereof -- what advantages do you see in this over Plan C?
>
> For 2.6 only (which is much easier than 2.x), do C. Plan A is for 2.x
> where C does not work.

Excuse me? I'm with the OP now, I'm totally confused. Plan C is *not*
what you were proposing; you were proposing something like Plan A
which definitely involved a separate file.

Why won't Plan C work on 2.x (x <= 5)? Because the 2.X will b"arf".
But you say Plan A is for 2.x -- but Plan A involves importing the
separate file which contains and causes b"arf" also!

To my way of thinking, one obvious DISadvantage of a strategy that
actually moves the strings to another file (requiring invention of a
name for each string that doesn't have one already, so that it can be
imported) is the amount of effort and exposure to error required to get
the same functional result as a strategy that keeps the string in the
same file ... and this disadvantage applies irrespective of what one
does to the string: b"arf", Martin's b("arf"), somebody else's
_b("arf") [IIRC] or my you-aint-gonna-miss-noticing-this-in-the-code
BYTES_LITERAL("arf").

Cheers,
John
