Totally confused by the str/bytes/unicode differences introduced in Python 3.x

Giampaolo Rodola'

Hi,
I'm sure the message I'm going to write will seem quite dumb to most
people but I really don't understand the str/bytes/unicode
differences introduced in Python 3.0 so be patient.
What I'm trying to do is port pyftpdlib to Python 3.x.
I don't want to support Unicode. I don't want pyftpdlib for py 3k to
do anything new or different.
I just want it to behave exactly the same as in the 2.x version and
I'd like to know if that's possible with Python 3.x.

Now. The basic difference is that socket.recv() returns a bytes object
instead of a string object and that's the thing which confuses me
mainly.
My question is: is there a way to convert that bytes object into
exactly *the same thing* returned by socket.recv() in Python 2.x (a
string)?

I know I can do:

data = socket.recv(1024)
data = data.decode(encoding)

...to convert bytes into a string but that's not exactly the same
thing.
In Python 2.x I didn't have to care about the encoding. What
socket.recv() returned was just a string. That was all.
Now doing something like b''.decode(encoding) puts me in serious
trouble since that can raise an exception if the client and server
use different encodings.

As far as I understand, the basic difference is that a
Python 2.x based FTP server could handle a 3.x based FTP client using
"latin1" encoding or "utf-8" or anything else, while with Python 3.x
I'm forced to tell my server which encoding to use and I don't know
how to deal with that.


--- Giampaolo
http://code.google.com/p/pyftpdlib
 
MRAB

Giampaolo said:
> Hi, I'm sure the message I'm going to write will seem quite dumb to
> most people but I really don't understand the str/bytes/unicode
> differences introduced in Python 3.0 so be patient. What I'm trying
> to do is porting pyftpdlib to Python 3.x. I don't want to support
> Unicode. I don't want pyftpdlib for py 3k to do anything new or
> different. I just want it to behave exactly the same as in the 2.x
> version and I'd like to know if that's possible with Python 3.x.
>
> Now. The basic difference is that socket.recv() returns a bytes
> object instead of a string object and that's the thing which confuses
> me mainly. My question is: is there a way to convert that bytes
> object into exactly *the same thing* returned by socket.recv() in
> Python 2.x (a string)?
>
> I know I can do:
>
> data = socket.recv(1024)
> data = data.decode(encoding)
>
> ...to convert bytes into a string but that's not exactly the same
> thing. In Python 2.x I didn't have to care about the encoding. What
> socket.recv() returned was just a string. That was all. Now doing
> something like b''.decode(encoding) puts me in serious troubles since
> that can raise an exception in case client and server use a different
> encoding.
>
> As far as I've understood the basic difference I see now is that a
> Python 2.x based FTP server could handle a 3.x based FTP client using
> "latin1" encoding or "utf-8" or anything else while with Python 3.x
> I'm forced to tell my server which encoding to use and I don't know
> how to deal with that.
>
Originally Python had a single string type 'str' with 8 bits per
character. That was a bit limiting for international use. Then a new
string type 'unicode' was introduced.

Now, in Python 3.x, it's time to tidy things up.

The 'str' type has been renamed 'bytes' and the 'unicode' type has been
renamed 'str'. If you're truly working with strings of _characters_ then
'str' is what you need, but if you're working with strings of _bytes_
then 'bytes' is what you need.

socket.send() and socket.recv() are still the same, it's just that it's
now clearer that they work with bytes and not strings.
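
To illustrate the point about staying in bytes, here is a minimal sketch of handling a command line exactly as recv() delivers it, with no decoding at all. The function name parse_command and the sample input are made up for illustration; this is not pyftpdlib code.

```python
def parse_command(line):
    """Split a raw command line (bytes) into (verb, argument).

    Hypothetical helper: everything stays as bytes, so no encoding
    has to be chosen and nothing can raise a decode error.
    """
    line = line.rstrip(b"\r\n")
    verb, _, arg = line.partition(b" ")
    # bytes.upper() only touches ASCII letters; other bytes pass through.
    return verb.upper(), arg


# A line as it might arrive from socket.recv(); the \xe9 byte is not
# valid UTF-8 on its own, but that does not matter here.
raw = b"retr caf\xe9.txt\r\n"
verb, arg = parse_command(raw)
print(verb, arg)
```

The argument can then be passed on unchanged, which is about as close as Python 3 gets to the 2.x behaviour.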
 
Giampaolo Rodola'

MRAB said:
> If you're truly working with strings of _characters_ then
> 'str' is what you need, but if you're working with strings of _bytes_
> then 'bytes' is what you need.

I work with strings of characters, but to convert bytes into a string I
need to specify an encoding and that's what confuses me.
Before there was no need to deal with that.


--- Giampaolo
http://code.google.com/p/pyftpdlib
 
Steven D'Aprano

Giampaolo said:
> I work with string of characters but to convert bytes into string I need
> to specify an encoding and that's what confuses me. Before there was no
> need to deal with that.

In Python 2.x, str means "string of bytes". This has been renamed "bytes"
in Python 3.

In Python 2.x, unicode means "string of characters". This has been
renamed "str" in Python 3.

If you do this in Python 2.x:

    my_string = str(bytes_from_socket)

then you don't need to convert anything, because you are going from a
string of bytes to a string of bytes.

If you do this in Python 3:

    my_string = str(bytes_from_socket)

then you *do* have to convert, because you are going from a string of
bytes to a string of characters (unicode). The Python 2.x equivalent code
would be:

    my_string = unicode(bytes_from_socket)

and when you convert to unicode, you can get encoding errors. A better
way to do this would be some variation on:

    my_str = bytes_from_socket.decode('utf-8')

You should read this:

http://www.joelonsoftware.com/articles/Unicode.html
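
One detail worth checking on Python 3: calling str() on a bytes object with no encoding argument does not decode anything, it returns the printable representation of the bytes object. Passing an encoding, or calling .decode(), is what actually converts. A small sketch, with made-up sample data:

```python
raw = b"caf\xc3\xa9"            # UTF-8 encoded bytes for "café"

as_repr = str(raw)              # no decoding: the repr of the bytes object
as_text = str(raw, "utf-8")     # decodes; equivalent to raw.decode("utf-8")

print(as_repr)
print(as_text)

# Decoding with the wrong assumption is where the exception comes from:
# a lone Latin-1 byte \xe9 is not valid UTF-8.
try:
    b"caf\xe9".decode("utf-8")
    decode_ok = True
except UnicodeDecodeError:
    decode_ok = False
print(decode_ok)
```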
 
Giampaolo Rodola'

Steven said:
> In Python 2.x, str means "string of bytes". This has been renamed "bytes"
> in Python 3.
>
> In Python 2.x, unicode means "string of characters". This has been
> renamed "str" in Python 3.
>
> If you do this in Python 2.x:
>
>     my_string = str(bytes_from_socket)
>
> then you don't need to convert anything, because you are going from a
> string of bytes to a string of bytes.
>
> If you do this in Python 3:
>
>     my_string = str(bytes_from_socket)
>
> then you *do* have to convert, because you are going from a string of
> bytes to a string of characters (unicode). The Python 2.x equivalent code
> would be:
>
>     my_string = unicode(bytes_from_socket)
>
> and when you convert to unicode, you can get encoding errors. A better
> way to do this would be some variation on:
>
>     my_str = bytes_from_socket.decode('utf-8')
>
> You should read this:
>
> http://www.joelonsoftware.com/articles/Unicode.html

Thanks, that clarifies things a bit, even if I still have a lot of doubts.
I wish I could do:

    my_str = bytes_from_socket.decode('utf-8')

That would avoid having to replace "" with b"" almost everywhere in
my code, but I doubt it would actually be a good idea.
RFC 2640 states that UTF-8 is the preferable encoding to use for both
clients and servers but I see that Python 3.x's ftplib uses latin1,
for example (bug?). How is my server supposed to deal with that?
I think that using bytes everywhere, as Christian recommended, would
be the only way to behave exactly like the 2.x version, but that's not
easy at all.
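
For anyone who does want str objects anyway, one possible compromise is sketched below. It relies on the fact that Latin-1 maps every byte value 0-255 to a code point, so decoding with it cannot fail and the original bytes can always be recovered. The decode_path name and the UTF-8-first fallback policy are assumptions for illustration only, not something taken from pyftpdlib or ftplib.

```python
def decode_path(raw):
    """Hypothetical helper: try UTF-8 first, fall back to Latin-1.

    Latin-1 accepts any byte sequence, so the fallback never raises.
    Note the caller no longer knows which encoding was used, so this
    is only safe when the text does not have to be re-encoded later.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")


# Latin-1 round-trips every possible byte value without loss.
every_byte = bytes(range(256))
round_trip = every_byte.decode("latin-1").encode("latin-1")
print(round_trip == every_byte)

print(decode_path(b"caf\xc3\xa9"))   # valid UTF-8 input
print(decode_path(b"caf\xe9"))       # Latin-1 input, handled by the fallback
```

Decoding everything as Latin-1 and encoding it back on the way out gives byte-for-byte the same data as 2.x, at the cost of the str not being "real" text for UTF-8 clients.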


--- Giampaolo
http://code.google.com/p/pyftpdlib
 
Steve Holden

Giampaolo said:
> I work with string of characters but to convert bytes into string I
> need to specify an encoding and that's what confuses me.
> Before there was no need to deal with that.

I don't yet understand why you feel you have to convert what you receive
to a string. In Python 3.0 bytes is the same as a string in 2.6, for
most practical purposes.

regards
Steve
 
Steve Holden

Giampaolo said:
> I used 2to3 against my code but it didn't cover the "" -> b""
> conversion (and I doubt it is able to do so, anyway).

Note that if you are using 2.6 you should first convert your "" quotes
to b"" - this won't make any practical difference, but then you will be
able to run 2to3 on your code and (one hopes) convert for 3.0 automatically.

regards
Steve
 
John Machin

Steve said:
> Note that if you are using 2.6 you should first convert your "" quotes
> to b"" - this won't make any practical difference, but then you will be
> able to run 2to3 on your code and (one hopes) convert for 3.0 automatically.

Perhaps before we get too far down the track of telling the OP what he
should do, we should ask him a little about his intentions:

Is he porting to 3.0 and abandoning 2.x support completely?
[presumably unlikely]
So then what is the earliest 2.x that he wants to support at the same
time as 3.x? [presumably at least 2.5]
Does he intend to maintain two separate codebases, one 2.x and the
other 3.x?
Else does he intend to maintain just one codebase written in some 2.x
dialect and using 2to3 plus sys.version-dependent code for the things
that 2to3 can't/doesn't handle?

Cheers,
John
 
Giampaolo Rodola'

John said:
> > Note that if you are using 2.6 you should first convert your "" quotes
> > to b"" - this won't make any practical difference, but then you will be
> > able to run 2to3 on your code and (one hopes) convert for 3.0 automatically.
>
> Perhaps before we get too far down the track of telling the OP what he
> should do, we should ask him a little about his intentions:
>
> Is he porting to 3.0 and abandoning 2.x support completely?
> [presumably unlikely]

No.

> So then what is the earliest 2.x that he wants to support at the same
> time as 3.x? [presumably at least 2.5]

I currently support Python versions from 2.3 to 2.6 using a single
codebase.
My idea is to support 3.x starting from the upcoming release.

> Does he intend to maintain two separate codebases, one 2.x and the
> other 3.x?

I think I have no other choice.
Why? Is it theoretically possible to maintain a single codebase for
both 2.x and 3.x?

> Else does he intend to maintain just one codebase written in some 2.x
> dialect and using 2to3 plus sys.version-dependent code for the things
> that 2to3 can't/doesn't handle?

I don't think it would be worth the effort.

> Cheers,
> John

Thanks a lot


--- Giampaolo
http://code.google.com/p/pyftpdlib
 
Martin v. Löwis

Giampaolo said:
> > Does he intend to maintain two separate codebases, one 2.x and the
> > other 3.x?
>
> I think I have no other choice.
> Why? Is it theoretically possible to maintain a single codebase for
> both 2.x and 3.x?

That is certainly possible! One might have to make tradeoffs wrt.
readability sometimes, but I found that this approach works quite
well for Django. I think Mark Hammond is also working on maintaining
a single code base for both 2.x and 3.x, for PythonWin.

Regards,
Martin
 
Terry Reedy

Martin said:
> That is certainly possible! One might have to make tradeoffs wrt.
> readability sometimes, but I found that this approach works quite
> well for Django. I think Mark Hammond is also working on maintaining
> a single code base for both 2.x and 3.x, for PythonWin.

Where 'single codebase' means that the code runs as is in 2.x and as
autoconverted by 2to3 (or possibly a custom converter) in 3.x.

One barrier to doing this is when the 2.x code has a mix of string
literals with some being character strings that should not have 'b'
prepended and some being true byte strings that should have 'b'
prepended. (Many programs do not have such a mix.)

One approach to dealing with string constants I have not yet seen
discussed here is to put them all in separate file(s) to be imported.
Group the text and bytes separately. Then marking the bytes with a 'b',
either by hand or by program, would be easy.

tjr
 
John Machin

Terry said:
> Where 'single codebase' means that the code runs as is in 2.x and as
> autoconverted by 2to3 (or possibly a custom converter) in 3.x.
>
> One barrier to doing this is when the 2.x code has a mix of string
> literals with some being character strings that should not have 'b'
> prepended and some being true byte strings that should have 'b'
> prepended. (Many programs do not have such a mix.)
>
> One approach to dealing with string constants I have not yet seen
> discussed here is to put them all in separate file(s) to be imported.
> Group the text and bytes separately. Then marking the bytes with a 'b',
> either by hand or by program, would be easy.

(1) How would this work for somebody who wanted/needed to support 2.5
and earlier?

(2) Assuming supporting only 2.6 and 3.x:

Suppose you have this line:
    if binary_data[:4] == "PK\x03\x04": # signature of ZIP file

Plan A:
Change original to:
    if binary_data[:4] == ZIPFILE_SIG: # "PK\x03\x04"
Add this to the bytes section of the separate file:
    ZIPFILE_SIG = "PK\x03\x04"
[somewhat later]
Change the above to:
    ZIPFILE_SIG = b"PK\x03\x04"
[once per original file]
Add near the top:
    from separatefile import *

Plan B:
Change original to:
    if binary_data[:4] == ZIPFILE_SIG: # "PK\x03\x04"
Add this to the separate file:
    ZIPFILE_SIG = b"PK\x03\x04"
[once per original file]
Add near the top:
    from separatefile import *

Plan C:
Change original to:
    if binary_data[:4] == b"PK\3\4": # signature of ZIP file

Unless I'm gravely mistaken, you seem to be suggesting Plan A or some
variety thereof -- what advantages do you see in this over Plan C?
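
As a concrete illustration of Plan C, here is a small sketch that runs on Python 3 (and on 2.6+, where b"" is accepted and simply means str). The looks_like_zip name and the sample data are invented for the example. It also shows why the unported line is dangerous on 3.x: comparing bytes with a str does not raise, it just never matches.

```python
ZIP_SIGNATURE = b"PK\x03\x04"   # the same four bytes as b"PK\3\4"


def looks_like_zip(binary_data):
    """Return True if the data starts with the ZIP file signature."""
    return binary_data[:4] == ZIP_SIGNATURE


print(looks_like_zip(b"PK\x03\x04rest of archive"))
print(b"PK\x03\x04" == b"PK\3\4")

# On Python 3 a bytes object never compares equal to a str, so the
# original 2.x comparison quietly evaluates to False after porting.
unported_matches = (b"PK\x03\x04" == "PK\x03\x04")
print(unported_matches)
```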
 
Terry Reedy

John said:
> (1) How would this work for somebody who wanted/needed to support 2.5
> and earlier?

See reposts in python wiki, one by Martin.

> (2) Assuming supporting only 2.6 and 3.x:
>
> Suppose you have this line:
> if binary_data[:4] == "PK\x03\x04": # signature of ZIP file
>
> Plan A:
> Change original to:
> if binary_data[:4] == ZIPFILE_SIG: # "PK\x03\x04"
> Add this to the bytes section of the separate file:
> ZIPFILE_SIG = "PK\x03\x04"
> [somewhat later]
> Change the above to:
> ZIPFILE_SIG = b"PK\x03\x04"
> [once per original file]
> Add near the top:
> from separatefile import *
>
> Plan B:
> Change original to:
> if binary_data[:4] == ZIPFILE_SIG: # "PK\x03\x04"
> Add this to the separate file:
> ZIPFILE_SIG = b"PK\x03\x04"
> [once per original file]
> Add near the top:
> from separatefile import *
>
> Plan C:
> Change original to:
> if binary_data[:4] == b"PK\3\4": # signature of ZIP file
>
> Unless I'm gravely mistaken, you seem to be suggesting Plan A or some
> variety thereof -- what advantages do you see in this over Plan C?

For 2.6 only (which is much easier than 2.x), do C. Plan A is for 2.x
where C does not work.

tjr
 
John Machin

Terry said:
> See reposts in python wiki, one by Martin.

Most relevant of these is Martin's article on porting Django, using a
single codebase. The """goal is to support all versions that Django
supports, plus 3.0""" -- indicating that it supports at least 2.5,
which won't eat b"blah" syntax. He is using 2to3, and handles bytes
constants by """django.utils.py3.b, which is a function that converts
its argument to an ASCII-encoded byte string. In 2.x, it is another
alias for str; in 3.x, it leaves byte strings alone, and encodes
regular (unicode) strings as ASCII. This function is used in all
places where string literals are meant as bytes, plus all cases where
str() was used to invoke the default conversion of 2.x."""

Very similar to what I expected. However it doesn't answer my question
about how your "move byte strings to a separate file, prepend 'b', and
import the separate file" strategy would help ... and given that 2.5
and earlier will barf on b"arf", I don't expect it to.
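
Going only by the description quoted above, a helper along the lines of b() might look roughly like the sketch below. This is a guess at the shape of such a function, not the actual Django code, and it has only been reasoned through for the 3.x branch.

```python
import sys

if sys.version_info[0] >= 3:
    def b(s):
        """Return s as a byte string (3.x branch).

        Byte strings are left alone; regular (unicode) strings are
        encoded as ASCII, so non-ASCII literals fail loudly.
        """
        if isinstance(s, bytes):
            return s
        return s.encode("ascii")
else:
    # On 2.x a plain string literal already is a byte string.
    b = str

# Usable on old 2.x versions too, because no b"" literal syntax is needed.
ZIPFILE_SIG = b("PK\x03\x04")
print(ZIPFILE_SIG)
```

The point is that b("...") is ordinary function-call syntax, so the same source line parses on 2.5 and earlier.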
Terry said:
> > (2) Assuming supporting only 2.6 and 3.x:
> > Suppose you have this line:
> > if binary_data[:4] == "PK\x03\x04": # signature of ZIP file
> > Plan A:
> > Change original to:
> > if binary_data[:4] == ZIPFILE_SIG: # "PK\x03\x04"
> > Add this to the bytes section of the separate file:
> > ZIPFILE_SIG = "PK\x03\x04"
> > [somewhat later]
> > Change the above to:
> > ZIPFILE_SIG = b"PK\x03\x04"
> > [once per original file]
> > Add near the top:
> > from separatefile import *
> > Plan B:
> > Change original to:
> > if binary_data[:4] == ZIPFILE_SIG: # "PK\x03\x04"
> > Add this to the separate file:
> > ZIPFILE_SIG = b"PK\x03\x04"
> > [once per original file]
> > Add near the top:
> > from separatefile import *
> > Plan C:
> > Change original to:
> > if binary_data[:4] == b"PK\3\4": # signature of ZIP file
> > Unless I'm gravely mistaken, you seem to be suggesting Plan A or some
> > variety thereof -- what advantages do you see in this over Plan C?
>
> For 2.6 only (which is much easier than 2.x), do C.  Plan A is for 2.x
> where C does not work.

Excuse me? I'm with the OP now, I'm totally confused. Plan C is *not*
what you were proposing; you were proposing something like Plan A
which definitely involved a separate file.

Why won't Plan C work on 2.x (x <= 5)? Because the 2.X will b"arf".
But you say Plan A is for 2.x -- but Plan A involves importing the
separate file which contains and causes b"arf" also!

To my way of thinking, one obvious DISadvantage of a strategy that
actually moves the strings to another file (requiring invention of a
name for each string that doesn't have one already, so that it can be
imported) is the amount of effort and exposure to error required to get
the same functional result as a strategy that keeps the string in the
same file ... and this disadvantage applies irrespective of what one
does to the string: b"arf", Martin's b("arf"), somebody else's
_b("arf") [IIRC] or my you-aint-gonna-miss-noticing-this-in-the-code
BYTES_LITERAL("arf").

Cheers,
John
 
