Python 3.2 has some deadly infection

Chris Angelico

| All character strings, including filenames, are treated by the kernel
| in such a way that THEY APPEAR TO IT ONLY AS STRINGS OF BYTES.

Yep, the real issue here is file systems, not the kernel. But yes,
this is one of the very few places where the kernel deals with a
string - and because of the hairiness of having to handle myriad file
systems in a single path (imagine multiple levels of remote mounts -
I've had a case where I mount via sshfs a tree that includes a Samba
mount point, and you can go a lot deeper than that), the only thing it
can do is pass the bytes on unchanged. Which means, in reality, the
kernel doesn't actually do *anything* with the string, it just passes
it right along to the file system.
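
You can watch Python do the same pass-through, for what it's worth: hand
the os module bytes and it stays out of the way entirely. A minimal
sketch (the example file names in the comment are invented):

import os

# Ask for directory entries as bytes: Python hands the kernel's bytes
# straight back, no decoding applied.
print(os.listdir(b'.'))    # e.g. [b'notes.txt', ...] -- whatever bytes the fs returned

# Ask with str instead and Python decodes using the filesystem encoding
# (with the surrogateescape handler, so even undecodable bytes round-trip).
print(os.listdir('.'))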

ChrisA
 
Rustom Mody

Chris Angelico said:
kernel doesn't actually do *anything* with the string, it just passes
it right along to the file system.

Which is what Marko (and others like Armin) are asking of python
(treated as a processing 'kernel'):

"I know what I am doing with my bytes -- please channel/funnel them
around as requested without being unnecessarily and unrequestedly
'intelligent'"
 
Ethan Furman

Marko Rauhamaa said:
What I'm afraid of is that the Python developers are reserving the right
to remove the buffer and detach attributes from the standard streams in
a future version.

Being afraid is silly. If you have a question, ask it.
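
Those attributes are documented public API on the standard streams
today; a minimal sketch of what they give you:

import sys

# sys.stdout is a text stream: str in, encoded bytes out.
sys.stdout.write('text\n')

# Its .buffer attribute is the underlying binary stream: bytes pass
# through untouched (these three bytes happen to spell ABC).
sys.stdout.buffer.write(b'\x41\x42\x43\n')

# .detach() removes the text layer and returns the binary stream;
# after that, the original text stream must not be used again.
# binary_stdout = sys.stdout.detach()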
 
Akira Li

Marko Rauhamaa said:
That linux text is not the same thing as Python's text. Conceptually,
Python text is a sequence of 32-bit integers. Linux text is a sequence
of 8-bit integers.

_A Unicode string in Python is a sequence of Unicode codepoints_. It is
correct that a 32-bit integer is enough to represent any Unicode
codepoint: \u0000...\U0010FFFF

It says *nothing* about how Unicode strings are represented
*internally* in Python. That may vary from version to version, with
build options, and may even depend on the content of a string at runtime.

In the past, "narrow builds" could break the abstraction in some cases,
which is why Linux distributions used wide Python builds.
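
For instance, under CPython 3.3+'s PEP 393 representation the storage
per code point depends on the widest character in the string; the exact
sizes are implementation details of one build and shouldn't be relied on:

import sys

# Same length, three different internal widths:
print(sys.getsizeof('abcd'))            # all code points < 256: 1 byte each
print(sys.getsizeof('abc\u0101'))       # a BMP char forces 2 bytes each
print(sys.getsizeof('abc\U0001F3A7'))   # an astral char forces 4 bytes each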


_Unicode codepoint is not a Python concept_. There is the Unicode
standard http://unicode.org Though instead of following the web of
self-referential definitions, I find it easier to learn from
examples such as http://codepoints.net/U+0041 (A) or
http://codepoints.net/U+1F3A7 (🎧)

_There is no such thing as 8-bit text_
http://www.joelonsoftware.com/articles/Unicode.html

If you insert a space after each byte (8-bit) in the input text, then you
may get garbage, i.e., you can't assume that a character is a byte:

$ echo "Hyvää yötä" | perl -pe's/.\K/ /g'
H y v a � � � � y � � t � �

In general, you can't assume that a character is a Unicode codepoint:

$ echo "Hyvää yötä" | perl -C -pe's/.\K/ /g'
H y v a ̈ ä y ö t ä

The eXtended grapheme clusters (user-perceived characters) may be useful
in this case:

$ echo "Hyvää yötä" | perl -C -pe's/\X\K/ /g'
H y v ä ä y ö t ä

The \X pattern is supported by the `regex` module, not the stdlib `re`
module, i.e., you can't even iterate over characters (as they are seen
by a user) in Python using only the stdlib. The \w+ pattern is also
broken for Unicode text http://bugs.python.org/issue1693050 (it is
fixed in the `regex` module), i.e., you can't select a word in Unicode
text using only the stdlib.
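
For example, with the third-party `regex` module a decomposed ä is two
code points but one user-perceived character:

import regex  # third-party: pip install regex

s = 'Hyva\u0308a\u0308 yo\u0308ta\u0308'   # "Hyvää yötä", fully decomposed

print(len(s))                    # 14 code points
print(regex.findall(r'\X', s))   # 10 grapheme clusters, marks attached
print(regex.findall(r'\w+', s))  # ['Hyvää', 'yötä'] -- \w matches the
                                 # combining marks too (the issue1693050 fix)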

\X alone is not enough in some cases, e.g., "'ch' may be considered a
grapheme cluster in Slovak, for processes such as collation" [1]
(sorting order). The `PyICU` module might be useful here.

Knowing about the concepts of Unicode normalization forms (NFC, NFKD,
etc.) http://unicode.org/reports/tr15/, Unicode text segmentation [1],
and the Unicode collation algorithm http://www.unicode.org/reports/tr10/
is also useful if you want to work with text.

[1]: http://www.unicode.org/reports/tr29/
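
The normalization forms, at least, are covered by the stdlib
`unicodedata` module; a quick illustration of why they matter when
comparing text:

import unicodedata

composed = 'Hyv\u00e4\u00e4 y\u00f6t\u00e4'          # precomposed ä and ö
decomposed = 'Hyva\u0308a\u0308 yo\u0308ta\u0308'    # base letter + combining mark

print(composed == decomposed)                                # False
print(unicodedata.normalize('NFC', decomposed) == composed)  # True
print(len(composed), len(decomposed))                        # 10 14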
 
Robin Becker

On 05/06/2014 18:16, Ian Kelly wrote:
..........
How should e.g. bytes.upper() be implemented then? The correct
behavior is entirely dependent on the encoding. Python 2 just assumes
ASCII, which at best will correctly upper-case some subset of the
string and leave the rest unchanged, and at worst could corrupt the
string entirely. There are some things that were dropped that should
not have been, but my impression is that those are being worked on,
for example % formatting in PEP 461.
bytes.upper should have done exactly what str.upper in Python 2 did; that
way we could have at least continued to do the wrong thing :)
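
Python 3's bytes.upper() does keep the ASCII-only behaviour Ian
describes, which makes the difference easy to demonstrate; going through
str is the correct route:

raw = 'Hyvää yötä'.encode('utf-8')

# bytes.upper() touches only the ASCII letters; the UTF-8 bytes of
# ä and ö are left alone, silently producing mixed-case mojibake.
print(raw.upper())

# Decode, upper-case in the text domain, re-encode: correct for any script.
print(raw.decode('utf-8').upper().encode('utf-8'))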
 
Steven D'Aprano

Marko Rauhamaa said:
Only in a very dull sense.

I agree with you that this is a very dull, unimportant sense. And I think
its dullness applies equally to the situation you somehow think is
meaningfully exciting: text is made of bytes! If you squint, you can see
those bytes! Therefore text is not a first class data type!!!

To which my answer is: yes, text is made of bytes; yes, you can expose
those bytes; and no, your conclusion doesn't follow.

I can't see the bytes inside Python objects, including strings, and
that's how it is supposed to be.

That's because Python the language doesn't allow you to coerce types to
other types, except possibly through its interface to the underlying C
implementation, ctypes. But Python allows you to write extensions in C,
and that gives you the full power to take any data structure and turn it
into any other data structure. Even bytes.

Similarly, I can't (easily) see how files are laid out on hard disks.
That's a true abstraction. Nothing in linux presents data, though,
except through bytes.

Incorrect. Linux presents data as text all the time. Look at the prompt:
it's treated as text, not numbers. You type commands using a text
interface. The commands are made of words like ls, dd and ps, not numbers
like 0x6C73, 0x6464 and 0x7073. Applications like grep are based on line-
based files, and "line" is a text concept, not a byte concept.

Consider:

[steve@ando ~]$ echo -e '\x41\x42\x43'
ABC


The assumption of *text* is so strong in the echo application that by
default you cannot enter numeric escapes at all. Without the -e switch,
echo assumes that numeric escapes represent themselves as character
literals:

[steve@ando ~]$ echo '\x41\x42\x43'
\x41\x42\x43
 
Marko Rauhamaa

Steven D'Aprano said:
Incorrect. Linux presents data as text all the time. Look at the prompt:
it's treated as text, not numbers.

Of course there is a textual human interface. However, from the point of
view of virtually every OS component, it's bytes.

Consider:

[steve@ando ~]$ echo -e '\x41\x42\x43'
ABC

"echo" doesn't know it's emitting text. It would be perfectly happy to
emit binary gibberish. The output goes to the pty which doesn't care
about the textual interpretation, either. Finally, the terminal
(emulation program) translates the incoming bytes to textual glyphs to
the best of its capabilities.
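
The same byte-level path is available from Python, bypassing every text
layer; whether the result looks like text is entirely the terminal's
business:

import os
import sys

# Straight to the file descriptor: no str, no codec, just bytes that
# the terminal happens to render as ABC.
os.write(sys.stdout.fileno(), b'\x41\x42\x43\x0a')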

Anyway, what interests me mostly is that I routinely build programs and
systems that talk to each other over files, pipes, sockets and devices.
I really need to micromanage that data. I'm fine with encoding text if
that's the suitable interpretation. I just think Python is overreaching
by making the text interpretation the default for the standard streams
and files and guessing the correct encoding.

Note that subprocess.Popen() wisely assumes binary pipes. Unfortunately
the subprocess might be a Python program that opens the standard streams
in text mode...
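
That default is easy to verify; the decoded variant below uses
universal_newlines, which guesses an encoding from the locale, exactly
the kind of guessing being objected to:

import subprocess

# Pipes are binary by default: you get bytes.
out = subprocess.check_output(['echo', 'ABC'])
print(type(out), out)          # <class 'bytes'> b'ABC\n'

# Opting in to text mode decodes with a locale-guessed encoding.
out = subprocess.check_output(['echo', 'ABC'], universal_newlines=True)
print(type(out), repr(out))    # <class 'str'> 'ABC\n'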


Marko
 
Ethan Furman

Marko Rauhamaa said:
How text is represented is very different from whether text is a
fundamental data type. A fundamental text file is such that ordinary
operating system facilities can't see inside the black box (that is,
they are *not* encoded as far as the applications go).

Of course they are. It may be an ASCII encoding of some flavor or
other, or something really (to me) strange -- but an encoding is most
assuredly in effect.

ASCII is *not* the state of "this string has no encoding" -- that would
be Unicode; a Unicode string, as a data type, has no encoding. To
transport it, store it, etc., it must (usually?) be encoded into
something -- utf-8, ASCII, turkish, or whatever subset is agreed upon
and will hopefully contain all the Unicode characters needed for the
string to be properly represented.

The realization that ASCII was, in fact, an encoding was a big paradigm
shift for me, but a necessary one.
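
Put concretely: the string itself is only code points, and picking a
transport representation is a separate, explicit, and fallible step
(the choice of encodings below is just for illustration):

s = 'Zürich \U0001F3A7'

print([hex(ord(c)) for c in s])   # the code points; no bytes in sight yet
print(s.encode('utf-8'))          # one possible byte serialization
print(s.encode('utf-16-le'))      # a different one, same text

try:
    s.encode('iso-8859-9')        # Latin-5 (Turkish): too small a repertoire
except UnicodeEncodeError as e:
    print(e)                      # U+1F3A7 has no byte sequence there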
 
Marko Rauhamaa

Ethan Furman said:
Of course they are.

How would you know?
It may be an ASCII encoding of some flavor or other, or something
really (to me) strange -- but an encoding is most assuredly in effect.

Outside metaphysics, that statement is only meaningful if you have
access to the encoding.
ASCII is *not* the state of "this string has no encoding" -- that
would be Unicode; a Unicode string, as a data type, has no encoding.

Huh?


Marko
 
Michael Torrie


It's this very fact that trips up JMF in his rants about FSR. Thank you
to Ethan for putting it so succinctly.

What part of his statement are you saying "Huh?" about?
 
Chris Angelico

Ethan Furman said:
Of course they are. It may be an ASCII encoding of some flavor or other, or
something really (to me) strange -- but an encoding is most assuredly in
effect.

Allow me to explain what I think Marko's getting at here.

In most file systems, a file exists on the disk as a set of sectors of
data, plus some metadata including the file's actual size. When you
ask the OS to read you that file, it goes to the disk, reads those
sectors, truncates the data to the real size, and gives you those
bytes.

It's possible to mount a file as a directory, in which case the
physical representation is very different, but the file still appears
the same. In that case, the OS goes reading some part of the file,
maybe decompresses it, and gives it to you. Same difference. These
files still contain bytes.

A "fundamental text file" would be one where, instead of reading and
writing bytes, you read and write Unicode text. Since the hard disk
still works with sectors and bytes, it'll still be stored as such, but
that's an implementation detail; and you could format your disk UTF-8
or UTF-16 or FSR or anything you like, and the only difference you'd
see is performance.

This could certainly be done, in theory. I don't know how well it'd
fit with any of the popular OSes of today, but it could be done. And
these files would not have an encoding; their on-platter
representations would, but that's purely implementation - the text
that you wrote out and the text that you read in are the same text,
and there's been no encoding visible.
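
Python's io stack already models this in miniature: io.TextIOWrapper
hands the application a pure-text endpoint and keeps the byte
representation as a swappable implementation detail. A sketch:

import io

# The "platter": a byte store the application never inspects directly.
raw = io.BytesIO()

# The text endpoint: reads and writes str; the encoding lives inside.
f = io.TextIOWrapper(raw, encoding='utf-16-le')
f.write('Hyvää yötä')
f.flush()

# Swap 'utf-16-le' for 'utf-8' above and only these stored bytes change;
# the text API, and the text itself, stay identical.
print(raw.getvalue())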

ChrisA
 
wxjmfauth

On Friday, June 6, 2014 5:25:47 PM UTC+2, Chris Angelico wrote:
Allow me to explain what I think Marko's getting at here.

In most file systems, a file exists on the disk as a set of sectors of
data, plus some metadata including the file's actual size. When you
ask the OS to read you that file, it goes to the disk, reads those
sectors, truncates the data to the real size, and gives you those
bytes.

It's possible to mount a file as a directory, in which case the
physical representation is very different, but the file still appears
the same. In that case, the OS goes reading some part of the file,
maybe decompresses it, and gives it to you. Same difference. These
files still contain bytes.

A "fundamental text file" would be one where, instead of reading and
writing bytes, you read and write Unicode text. Since the hard disk
still works with sectors and bytes, it'll still be stored as such, but
that's an implementation detail; and you could format your disk UTF-8
or UTF-16 or FSR or anything you like, and the only difference you'd
see is performance.

This could certainly be done, in theory. I don't know how well it'd
fit with any of the popular OSes of today, but it could be done. And
these files would not have an encoding; their on-platter
representations would, but that's purely implementation - the text
that you wrote out and the text that you read in are the same text,
and there's been no encoding visible.
----------

Of the three, you can already eliminate one.
It's not good news.

sys.getsizeof('Gödel'.encode('utf-8'))
23
sys.getsizeof('Gödel'.encode('utf-16-le'))
27
sys.getsizeof('Gödel')
42
os.listdir(r'D:\jm\Москва\Zürich\Αθήνα\œdipe')
['a.txt', 'kk.bat', 'kk.cmd', 'kk.py', '__pycache__']
sys.getsizeof(r'D:\jm\Москва\Zürich\Αθήνα\œdipe'.encode('utf-8'))
61
sys.getsizeof(r'D:\jm\Москва\Zürich\Αθήνα\œdipe'.encode('utf-16-le'))
79
sys.getsizeof(r'D:\jm\Москва\Zürich\Αθήνα\œdipe')
100

jmf
 
wxjmfauth

On Friday, June 6, 2014 5:44:57 PM UTC+2, (e-mail address removed) wrote:
[previous post quoted in full; snipped]

Sorry, wrong copy/paste:

sys.getsizeof('Gödel'.encode('utf-8'))
23
sys.getsizeof('Gödel'.encode('utf-16-le'))
27
sys.getsizeof('Gödel')
42
os.listdir(r'D:\jm\Москва\Zürich\Αθήνα\œdipe')
['a.txt', 'kk.bat', 'kk.cmd', 'kk.py', '__pycache__']
sys.getsizeof(r'D:\jm\Москва\Zürich\Αθήνα\œdipe'.encode('utf-8'))
61
sys.getsizeof(r'D:\jm\Москва\Zürich\Αθήνα\œdipe'.encode('utf-16-le'))
79
sys.getsizeof(r'D:\jm\Москва\Zürich\Αθήνα\œdipe')
100

jmf
 
Chris Angelico

Michael Torrie said:
Ethan Furman <[email protected]>:
ASCII is *not* the state of "this string has no encoding" -- that
would be Unicode; a Unicode string, as a data type, has no encoding.

Huh?

[...]

What part of his statement are you saying "Huh?" about?

Unicode, like ASCII, is a code. Representing text in unicode is
encoding.

Yes and no. "ASCII" means two things: Firstly, it's a mapping from the
letter A to the number 65, from the exclamation mark to 33, from the
backslash to 92, and so on. And secondly, it's an encoding of those
numbers into the lowest seven bits of a byte, with the high bit left
clear. Between those two, you get a means of representing the letter
'A' as the byte 0x41, and one of them is an encoding.

"Unicode", on the other hand, is only the first part. It maps all the
same characters to the same numbers that ASCII does, and then adds a
few more... a few followed by a few, followed by... okay, quite a lot
more. Unicode specifies that the character OK HAND SIGN, which looks
like 👌 if you have the right font, is number 1F44C in hex (128076
decimal). This is the "Universal Character Set" or UCS.

ASCII could specify a single encoding, because that encoding makes
sense for nearly all purposes. (There are times when you transmit
ASCII text and use the high bit to mean something else, like parity or
"this is the end of a word" or something, but even then, you follow
the same convention of packing a number into the low seven bits of a
byte.) Unicode can't, because there are many different pros and cons
to the different encodings, and so we have UCS Transformation Formats
like UTF-8 and UTF-32. Each one is an encoding that maps a codepoint
to a sequence of bytes.

You can't represent text in "Unicode" in a computer. Somewhere along
the way, you have to figure out how to store those codepoints as
bytes, or something more concrete (you could, for instance, use a
Python list of Python integers; I can't say that it would be in any
way more efficient than alternatives, but it would be plausible); and
that's the encoding.
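
Python exposes both layers separately, which makes the distinction
concrete: ord() is the character-to-number mapping, encode() is the
numbers-to-bytes step:

# Layer 1: the mapping from characters to numbers.
print(ord('A'))                           # 65
print(ord('\U0001F44C'))                  # 128076, OK HAND SIGN

# Layer 2: the encoding from numbers to bytes. ASCII has one obvious
# encoding; Unicode has several, with different trade-offs.
print('A'.encode('ascii'))                # b'A' (0x41)
print('\U0001F44C'.encode('utf-8'))       # b'\xf0\x9f\x91\x8c' -- 4 bytes
print('\U0001F44C'.encode('utf-16-le'))   # b'=\xd8L\xdc' -- a surrogate pair
print('\U0001F44C'.encode('utf-32-le'))   # b'L\xf4\x01\x00' -- fixed width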

ChrisA
 
Steven D'Aprano

Michael Torrie said:
Ethan Furman <[email protected]>:
ASCII is *not* the state of "this string has no encoding" -- that
would be Unicode; a Unicode string, as a data type, has no encoding.

Huh?

[...]

What part of his statement are you saying "Huh?" about?

Unicode, like ASCII, is a code. Representing text in unicode is
encoding.

A Unicode string as an abstract data type has no encoding. It is a
Platonic ideal, a pure form like the real numbers. There are no bytes, no
bits, just code points. That is what Ethan means. A Unicode string like
this:

s = u"NOBODY expects the Spanish Inquisition!"

should not be thought of as a bunch of bytes in some encoding, but as an
array of code points. Eventually the abstraction will leak, all
abstractions do, but not for a very long time.
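
In code-shaped terms: the string is the sequence of code points, and
bytes appear only once you explicitly choose a representation:

s = u"NOBODY expects the Spanish Inquisition!"

print([ord(c) for c in s[:6]])    # [78, 79, 66, 79, 68, 89] -- code points
print(len(s))                     # 39, measured in code points

# Bytes exist only per encoding, and their count varies with the choice:
print(len(s.encode('utf-8')))     # 39 (all ASCII here)
print(len(s.encode('utf-16')))    # 80 (BOM plus two bytes per code point)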
 
Rustom Mody

On 06/06/2014 08:10 AM, Marko Rauhamaa wrote:
Ethan Furman :
ASCII is *not* the state of "this string has no encoding" -- that
would be Unicode; a Unicode string, as a data type, has no encoding.
Huh?
[...]
What part of his statement are you saying "Huh?" about?
Unicode, like ASCII, is a code. Representing text in unicode is
encoding.
A Unicode string as an abstract data type has no encoding. It is a
Platonic ideal, a pure form like the real numbers. There are no bytes, no
bits, just code points. That is what Ethan means. A Unicode string like
this:
s = u"NOBODY expects the Spanish Inquisition!"
should not be thought of as a bunch of bytes in some encoding, but as an
array of code points. Eventually the abstraction will leak, all
abstractions do, but not for a very long time.

"Should not be thought of" yes thats the Python3 world view
Not even the Python2 world view
And very far from the classic Unix world view.

As Ned Batchelder says in Unipain: http://nedbatchelder.com/text/unipain.html
Programmers should use the 'unicode sandwich' to avoid 'unipain':

Bytes on the outside, Unicode on the inside, encode/decode at the edges.

The discussion here is precisely about these edges
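
A sandwich in miniature, written as a stdin-to-stdout filter; the choice
of UTF-8 at both edges is just for illustration:

import sys

# Edge 1: bytes in, decode once.
text = sys.stdin.buffer.read().decode('utf-8')

# Middle: pure str processing, no bytes anywhere.
text = text.upper()

# Edge 2: encode once, bytes out.
sys.stdout.buffer.write(text.encode('utf-8'))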

Combine that with Chris':
Yes and no. "ASCII" means two things: Firstly, it's a mapping from the
letter A to the number 65, from the exclamation mark to 33, from the
backslash to 92, and so on. And secondly, it's an encoding of those
numbers into the lowest seven bits of a byte, with the high bit left
clear. Between those two, you get a means of representing the letter
'A' as the byte 0x41, and one of them is an encoding.

and the situation appears quite the opposite of Ethan's description:

In the 'old world', ASCII was both mapping and encoding, so there was
never a justification to distinguish encoding from codepoint.

It is Unicode that demands these distinctions.

If we could magically go to a world where the number of bits in a byte
were 32, all this headache would go away. [Actually just 21 is enough!]
 
Chris Angelico

Combine that with Chris':
Yes and no. "ASCII" means two things: Firstly, it's a mapping from the
letter A to the number 65, from the exclamation mark to 33, from the
backslash to 92, and so on. And secondly, it's an encoding of those
numbers into the lowest seven bits of a byte, with the high bit left
clear. Between those two, you get a means of representing the letter
'A' as the byte 0x41, and one of them is an encoding.

and the situation appears quite the opposite of Ethan's description:

In the 'old world' ASCII was both mapping and encoding and so there was
never a justification to distinguish encoding from codepoint.

It is unicode that demands these distinctions.

If we could magically go to a world where the number of bits in a byte was 32
all this headache would go away. [Actually just 21 is enough!]

An ASCII mentality lets you be sloppy. That doesn't mean the
distinction doesn't exist. When I first started programming in C, int
was *always* 16 bits long and *always* little-endian (because I used
only one compiler). I could pretend that those bits in memory actually
were that integer, that there were no other ways that integer could be
encoded. That doesn't mean that encodings weren't important. And as
soon as I started working on a 32-bit OS/2 system, and my ints became
bigger, I had to concern myself with that. Even more so when I got
into networking, and byte order became important to me. And of course,
these days I work with integers that are encoded in all sorts of
different ways (a Python integer isn't just a puddle of bytes in
memory), and I generally let someone else take care of the details,
but the encodings are still there.
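
The struct module makes the integer analogy concrete: one abstract
number, several byte encodings, chosen explicitly, just as one str has
several UTF serializations:

import struct

n = 0x1F44C                     # same abstract integer throughout

print(struct.pack('<i', n))     # b'L\xf4\x01\x00' -- little-endian, 32-bit
print(struct.pack('>i', n))     # b'\x00\x01\xf4L' -- big-endian, 32-bit
print(struct.pack('<h', 513))   # b'\x01\x02' -- the 16-bit world of old compilers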

ASCII was once your one companion, it was all that mattered. ASCII was
once a friendly encoding, then your world was shattered. Wishing it
were somehow here again, wishing it were somehow near... sometimes it
seemed, if you just dreamed, somehow it would be here! Wishing you
could use just bytes again, knowing that you never would... dreaming
of it won't help you to do all that you dream you could!

It's time to stop chasing the phantom and start living in the Raoul
world... err, the real world. :)

ChrisA
 
Marko Rauhamaa

Chris Angelico said:
"ASCII" means two things: Firstly, it's a mapping from the letter A to
the number 65, from the exclamation mark to 33, from the backslash to
92, and so on. And secondly, it's an encoding of those numbers into
the lowest seven bits of a byte, with the high bit left clear.
Between those two, you get a means of representing the letter 'A' as
the byte 0x41, and one of them is an encoding.

The American Standard Code for Information Interchange [...] is a
character-encoding scheme [...] <URL:
http://en.wikipedia.org/wiki/ASCII>
"Unicode", on the other hand, is only the first part. It maps all the
same characters to the same numbers that ASCII does, and then adds a
few more... a few followed by a few, followed by... okay, quite a lot
more. Unicode specifies that the character OK HAND SIGN, which looks
like 👌 if you have the right font, is number 1F44C in hex (128076
decimal). This is the "Universal Character Set" or UCS.

Unicode is a computing industry standard for the consistent encoding,
representation and handling of text [...] <URL:
http://en.wikipedia.org/wiki/Unicode>

Each standard assigns numbers to letters and other symbols. In a word,
each is a code. That's what their names say, too.


Marko
 
