Python 3.2 has some deadly infection

Rustom Mody

Combine that with Chris':
Yes and no. "ASCII" means two things: Firstly, it's a mapping from the
letter A to the number 65, from the exclamation mark to 33, from the
backslash to 92, and so on. And secondly, it's an encoding of those
numbers into the lowest seven bits of a byte, with the high byte left
clear. Between those two, you get a means of representing the letter
'A' as the byte 0x41, and one of them is an encoding.
and the situation appears quite the opposite of Ethan's description:
In the 'old world' ASCII was both mapping and encoding and so there was
never a justification to distinguish encoding from codepoint.
It is unicode that demands these distinctions.
If we could magically go to a world where the number of bits in a byte was 32,
all this headache would go away. [Actually just 21 is enough!]
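
(Both halves of that distinction, and the 21-bit figure, are easy to check
at a Python 3 prompt; the highest codepoint Unicode defines is U+10FFFF:)

>>> ord('A')                 # the mapping: character -> number
65
>>> 'A'.encode('ascii')      # the encoding: number -> the byte 0x41
b'A'
>>> (0x10FFFF).bit_length()  # the largest Unicode codepoint fits in 21 bits
21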
An ASCII mentality lets you be sloppy. That doesn't mean the
distinction doesn't exist. When I first started programming in C, int
was *always* 16 bits long and *always* little-endian (because I used
only one compiler). I could pretend that those bits in memory actually
were that integer, that there were no other ways that integer could be
encoded. That doesn't mean that encodings weren't important. And as
soon as I started working on a 32-bit OS/2 system, and my ints became
bigger, I had to concern myself with that. Even more so when I got
into networking, and byte order became important to me. And of course,
these days I work with integers that are encoded in all sorts of
different ways (a Python integer isn't just a puddle of bytes in
memory), and I generally let someone else take care of the details,
but the encodings are still there.
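
A minimal sketch of that point with Python's struct module: the same small
integer, three different byte-level encodings.

>>> import struct
>>> struct.pack('<i', 1)   # 32-bit little-endian
b'\x01\x00\x00\x00'
>>> struct.pack('>i', 1)   # 32-bit big-endian (network byte order)
b'\x00\x00\x00\x01'
>>> struct.pack('<h', 1)   # 16-bit little-endian, like that old compiler's int
b'\x01\x00'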
ASCII was once your one companion, it was all that mattered. ASCII was
once a friendly encoding, then your world was shattered. Wishing it
were somehow here again, wishing it were somehow near... sometimes it
seemed, if you just dreamed, somehow it would be here! Wishing you
could use just bytes again, knowing that you never would... dreaming
of it won't help you to do all that you dream you could!
It's time to stop chasing the phantom and start living in the Raoul
world... err, the real world. :)

I thought that "If only bytes were 21+ bits wide" would sound sufficiently
nonsensical that I did not need to explicitly qualify it as a utopian dream!
 

Marko Rauhamaa

Steven D'Aprano said:
A Unicode string as an abstract data type has no encoding.

Unicode itself is an encoding. See it in action here:

72 101 108 108 111 44 32 119 111 114 108 100
It is a Platonic ideal, a pure form like the real numbers.

Far from it. It is a mapping from symbols to integers. The symbols are
the Platonic ones.

The Unicode/ASCII encoding above represents the same "Platonic" string
as this EBCDIC one:

200 133 147 147 150 107 64 166 150 153 147 132
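
(Both byte sequences can be reproduced at a Python 3 prompt; cp037,
EBCDIC US/Canada, is one of the codecs that ships with Python:)

>>> list("Hello, world".encode('ascii'))
[72, 101, 108, 108, 111, 44, 32, 119, 111, 114, 108, 100]
>>> list("Hello, world".encode('cp037'))
[200, 133, 147, 147, 150, 107, 64, 166, 150, 153, 147, 132]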
A Unicode string like this:

s = u"NOBODY expects the Spanish Inquisition!"

should not be thought of as a bunch of bytes in some encoding,

Encoding is not tied to bytes or even computers. People can speak in
code, after all.


Marko
 

Chris Angelico

I thought that "If only bytes were 21+ bits wide" would sound sufficiently
nonsensical that I did not need to explicitly qualify it as a utopian dream!

Humour never dies!

ChrisA
(In case it's not obvious, by the way, everything I said above is a
reference to the Phantom of the Opera.)
 

Marko Rauhamaa

Marko Rauhamaa said:
Far from it. It is a mapping from symbols to integers. The symbols are
the Platonic ones.

Well, of course, even the symbols are a code. Letters code sounds and
digits code numbers.

And the sounds and numbers code ideas. Now we are getting close to being
truly Platonic.


Marko
 

Ned Batchelder

Unicode itself is an encoding. See it in action here:

72 101 108 108 111 44 32 119 111 114 108 100


Far from it. It is a mapping from symbols to integers. The symbols are
the Platonic ones.

The Unicode/ASCII encoding above represents the same "Platonic" string
as this EBCDIC one:

200 133 147 147 150 107 64 166 150 153 147 132


Encoding is not tied to bytes or even computers. People can speak in
code, after all.

Marko, you are right about the broader English meaning of the word
"encoding". The original point here was that "Unicode text" provides no
information about what sequence of bytes is at work.

In the Unicode ecosystem, an encoding is a specification of how the text
will be represented in a byte stream. Saying something is "Unicode"
doesn't provide that information. You have to say, "UTF8" or "UTF16" or
"UCS2", etc, in order to know how bytes will be involved.

When Ethan said, "a Unicode string, as a data type, has no encoding," he
meant (as he explained) that a Unicode string doesn't require or imply
any particular mapping to bytes.
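
A short interpreter session shows the difference; the Greek letter here is
just an arbitrary non-ASCII example:

>>> s = '\u039e'            # GREEK CAPITAL LETTER XI
>>> s.encode('utf-8')
b'\xce\x9e'
>>> s.encode('utf-16-le')
b'\x9e\x03'
>>> s.encode('utf-32-le')
b'\x9e\x03\x00\x00'

One string, three different byte sequences; the string itself carries none
of them.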

I'm sure you understand this, I'm just trying to clarify the different
meanings of the word "encoding".
 

wxjmfauth

On Friday, June 6, 2014 at 5:50:50 PM UTC+2, Chris Angelico wrote:
byte.) Unicode can't, because there are many different pros and cons
to the different encodings, and so we have UCS Transformation Formats
like UTF-8 and UTF-32. Each one is an encoding that maps a codepoint
to a sequence of bytes.

A big NO.

jmf
 

Chris Angelico

high BIT left clear.

That thing. Unless you have bytes inside bytes (byteception?), you'll
only have room for one high bit. Some day I'll get my brain and my
fingers to agree on everything we do... but that day is not today.

ChrisA
 

rurpy

[...]
But Linux Unicode support is much better than Windows. Unicode support in
Windows is crippled by continued reliance on legacy code pages, and by
the assumption deep inside the Windows APIs that Unicode means "16 bit
characters". See, for example, the amount of space spent on fixing
Windows Unicode handling here:

http://www.utf8everywhere.org/
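
(The "16 bit characters" assumption is easy to watch break in Python 3:
any codepoint above U+FFFF cannot fit in a single 16-bit unit, so UTF-16
has to split it into a surrogate pair:)

>>> ch = '\U0001F600'        # a codepoint outside the BMP
>>> ord(ch) > 0xFFFF         # too big for one 16-bit unit
True
>>> ch.encode('utf-16-le')   # encoded as a surrogate pair, four bytes
b'=\xd8\x00\xde'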

While not disagreeing with the general premise of that page, it
has some problems that raise doubts in my mind about taking everything
the author says at face value.

For example

"Q: Why would the Asians give up on UTF-16 encoding, which saves
them 50% the memory per character?"
[...] in fact UTF-8 is used just as often in those [Asian] countries.

That is not my experience, at least for Japan. See my comments in
https://mail.python.org/pipermail/python-ideas/2012-June/015429.html
where I show that utf8 files are a tiny minority of the text files
found by Google.

He then gives a table with the sizes of the utf8 and utf16 encoded contents
(ie stripped of html stuff) of an unnamed Japanese wikipedia page, to show
that even without a lot of (html-mandated) ascii, the space savings are not
very much compared to the theoretical "50%" savings he stated:

    Dense text    Size      Δ vs UTF-8
    UTF-8         222 KB    0%
    UTF-16        176 KB    -21%

Note that he calculates the space saving as (utf8-utf16)/utf8.
Yet by that metric the theoretical saving is *NOT* 50%, it is 33%:
1000 Japanese characters will use 2000 bytes in utf16 and 3000 in
utf8, and (3000-2000)/3000 is 33%.

I did the same test using
http://ja.wikipedia.org/wiki/織田信長
I stripped html tags, javascript and redundant ascii whitespace characters.
The stripped utf-8 file was 164946 bytes; the utf-16 encoded version of
the same was 117756 bytes. That gives (using the (utf8-utf16)/utf16 metric
he used to claim 50% idealized savings) 40%, which is quite a bit closer
to the idealized 50% than his 21%.
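
The arithmetic, spelled out (Python 3 division):

utf8_size, utf16_size = 164946, 117756
(utf8_size - utf16_size) / utf8_size    # 0.286..., saving relative to utf8
(utf8_size - utf16_size) / utf16_size   # 0.400..., saving relative to utf16
# The theoretical case: 1000 Japanese characters at 3 bytes each in utf8
# vs 2 bytes each in utf16:
(3000 - 2000) / 3000                    # 0.333..., the "33%" figure
(3000 - 2000) / 2000                    # 0.5, the "50%" figure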

I would have more faith in his opinions about things I don't know
about (such as unicode programming on Windows) if his other info
were more trustworthy. IOW, just because it's on the internet doesn't
mean it's true.
 
