Blog "about python 3"


Terry Reedy

On Saturday 4 January 2014 23:46:49 UTC+1, Terry Reedy wrote: ...

My examples are ONLY ILLUSTRATING that this FSR
is wrong by design, whether on the side of
memory, performance, linguistics or even
typography.

Let me expand on 3 of my points. First, performance == time:

Point 3. You correctly identified a time regression in finding a
character in a string. I saw that the slowdown was *not* inherent in the
FSR but had to be a glitch in the code, and reported it on pydev with
the hope that someone would fix it even if it were not too important in
real use cases. Someone did.

Point 1. You incorrectly generalized that extreme case. I reported (a
year ago last September) that the overall stringbench results were about
the same. I also pointed out that there is an equally non-representative
extreme case in the opposite direction, and that it would equally be
wrong of me to use that to claim that FSR is faster. (It turns out that
this FSR speed advantage *is* inherent in the design.)

Memory: Point 2. A *design goal* of FSR was to save memory relative to
UTF-32, which is what you apparently prefer. Your examples show that the FSR
successfully met its design goal. But you call that success, saving
memory, 'wrong'. On what basis?

You *claim* the FSR is 'wrong by design', but your examples only show
that it was temporarily wrong in implementation as far as speed goes and
correct by design as far as memory goes.
 

Terry Reedy

My examples are ONLY ILLUSTRATING that this FSR
is wrong by design,

Let me answer you a different way. If FSR is 'wrong by design', so are
the alternatives. Hence, the claim is, in itself, useless as a guide to
choosing. The choices:

* Keep the previous complicated system of buggy narrow builds on some
systems and space-wasting wide builds on other systems, with Python code
potentially acting differently on the different builds. I am sure that
you agree that this is a bad design.

* Improve the dual-build system by de-bugging narrow builds. I proposed
to do this (and gave Python code proving the idea; a sketch follows at
the end of this list) by adding the complication of an auxiliary array
of the indexes of astral chars in a UTF-16 string. I suspect you would
call this design 'wrong' also.

* Use the memory-wasting UTF-32 (wide) build on all systems. I know you
do not consider this 'wrong', but come on. From an information theoretic
and coding viewpoint, it clearly is. The top (4th) byte is *never* used.
The 3rd byte is *almost never* used. The 2nd byte usage ranges from
common to almost never for different users.

Memory waste is also time waste, as moving information-free 0 bytes
takes the same time as moving informative bytes.
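
A quick sketch that makes the unused bytes visible ('utf-32-le' is used
to avoid the BOM; the sample text is just an illustration):

text = "Hello, monde! \U0001d11e"      # ASCII plus one astral char
data = text.encode("utf-32-le")        # 4 bytes per character
units = [data[i:i+4] for i in range(0, len(data), 4)]
assert all(u[3] == 0 for u in units)   # the top (4th) byte is *never* used
print(sum(u[2] != 0 for u in units), "of", len(units),
      "characters use the 3rd byte")   # here, only the astral char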

Here is the beginning of the rationale for the FSR (from
http://www.python.org/dev/peps/pep-0393/ -- have you ever read it?).

"There are two classes of complaints about the current implementation of
the unicode type: on systems only supporting UTF-16, users complain that
non-BMP characters are not properly supported. On systems using UCS-4
internally (and also sometimes on systems using UCS-2), there is a
complaint that Unicode strings take up too much memory - especially
compared to Python 2.x, where the same code would often use ASCII
strings...".

The memory waste was a reason for some to stick with 2.7: the extra
memory use could break code that had worked in 2.x. By removing the
waste, the FSR makes switching to
Python 3 more feasible for some people. It was a response to real
problems encountered by real people using Python. It fixed both classes
of complaint about the previous system.

* Switch to the time-wasting UTF-8 for text storage, as some have done.
This is different from using UTF-8 for text transmission, which I hope
becomes the norm soon.
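
A minimal sketch of the auxiliary-array idea from the second option
above (illustrative only, not the actual code I posted): the character
index into the UTF-16 code units only needs correcting by the number of
earlier astral characters, found by bisection.

import bisect

class UTF16String:
    # Store UTF-16 code units plus a sorted list of the character
    # indexes of astral (non-BMP) characters, so indexing costs
    # O(log k) for k astral chars instead of O(len(s)).
    def __init__(self, s):
        self.units = []    # UTF-16 code units as ints
        self.astral = []   # character index of each astral character
        for i, ch in enumerate(s):
            cp = ord(ch)
            if cp >= 0x10000:          # astral char: surrogate pair
                self.astral.append(i)
                cp -= 0x10000
                self.units.append(0xD800 + (cp >> 10))    # high surrogate
                self.units.append(0xDC00 + (cp & 0x3FF))  # low surrogate
            else:
                self.units.append(cp)

    def __getitem__(self, i):
        # Each astral char before position i adds one extra code unit.
        off = i + bisect.bisect_left(self.astral, i)
        unit = self.units[off]
        if 0xD800 <= unit < 0xDC00:    # high surrogate: join with the low
            low = self.units[off + 1]
            return chr(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
        return chr(unit)

s = UTF16String("a\U0001d11eb")
assert s[0] == "a" and s[1] == "\U0001d11e" and s[2] == "b"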
 

Terry Reedy

If it stopped there, it would be mildly annoying ("1% of our shipments
will need to be replaced, that's a 1% cost for free replacements").
The trouble is that they don't care about the replacement either, so
it's really that 100% (or some fairly large proportion) of their
shipments will arrive with some measure of damage, and they're hoping
that their customers' threshold for complaining is often higher than
the damage sustained. Which it probably is, a lot of the time.

My wife has gotten several books from Amazon and partners and we have
never gotten one loose enough in a big enough box to be damaged. Either
the box is tight or has bubble packing. Leaving aside partners, maybe
distribution centers have different rules.
 

Chris Angelico

My wife has gotten several books from Amazon and partners and we have never
gotten one loose enough in a big enough box to be damaged. Either the box is
tight or has bubble packing. Leaving aside partners, maybe distribution
centers have different rules.

Or possibly (my personal theory) the CS rep I was talking to just
couldn't be bothered solving the problem. Way way too much work to
make the customer happy, much easier and cheaper to give a 30% refund
and hope that shuts him up.

But they managed to ship two books (the original and the replacement)
with insufficient packaging. Firstly, the chance of that is the square of
the probability of a single failure; and secondly, if you care even a little bit
about making your customers happy, put a little note on the second
order instructing people to be particularly careful of this one! Get
someone to check it before it's sent out. Make sure it's right this
time. I know that's what we used to do in the family business whenever
anything got mucked up.

(BTW, I had separately confirmed that the problem was with Amazon, and
not - as has happened to me with other shipments - caused by
Australian customs officials opening the box, looking through it, and
then packing it back in without its protection. No, it was shipped
that way.)

Anyway, this is veering so far off topic that we're at no risk of
meeting any Python Alliance ships - as Mal said, we're at the corner
of No and Where. But maybe someone can find an on-topic analogy to put
some tentative link back into this thread...

ChrisA
 

Steven D'Aprano

Chris Angelico wrote about Amazon:
And yet.... I can't disagree with your final conclusion. Empirical
evidence goes against my incredulous declaration that "surely this is
a bad idea" - according to XKCD 1165, they're kicking out nearly a
cubic meter a SECOND of packages.

Yes, but judging by what you described as their packing algorithm, that's
probably only a tenth of a cubic metre of *books*, the rest being empty box
for the book to rattle around in and get damaged.
 

Steven D'Aprano

Roy said:
You're missing my point.

Amazon's (short-term) goal is to increase their market share by
undercutting everybody on price. They have implemented a box-packing
algorithm which clearly has a bug in it. You are complaining that they
failed to deliver your purchase in good condition, and apparently don't
care. You're right, they don't. The cost to them to manually correct
this situation exceeds the value. This is one shipment. It doesn't
matter. You are one customer, you don't matter either. Seriously.
This may be annoying to you, but it's good business for Amazon. For
them, fast and cheap is absolutely better than correct.

One, you're missing my point that to Amazon, "fast and cheap" *is* correct.
They would not agree with you that their box-packing algorithm is buggy, so
long as their customers don't punish them for it. It meets their
requirements: ship parcels as quickly as possible, and push as many of the
costs (damaged books) onto the customer as they can get away with. If they
thought it was buggy, they would be trying to fix it.

Two, nobody is arguing against the concept that different parties have
different concepts of what's correct. To JMF, the flexible string
representation is buggy, because he's detected a trivially small slowdown
in some artificial benchmarks. To everyone else, it is not buggy, because
it does what it sets out to do: save memory while still complying with the
Unicode standard. A small slowdown on certain operations is a cost worth
paying.

Normally, the definition of "correct" that matters is that belonging to the
paying customer, or failing that, the programmer who is giving his labour
away for free. (Extend this out to more stakeholders if you wish, but the
more stakeholders you include, the harder it is to get consensus on what's
correct and what isn't.) From the perspective of Amazon's customers,
presumably so long as the cost of damaged and lost books isn't too high,
they too are willing to accept Amazon's definition of "correct" in order to
get cheap books, or else they would buy from someone else.

(However, to the extent that Amazon has gained monopoly power over the book
market, that reasoning may not apply. Amazon is not *technically* a
monopoly, but they are clearly well on the way to becoming one, at which
point the customer has no effective choice and the market is no longer
free.)

The Amazon case is an interesting example of market failure, in the sense
that the free market provides a *suboptimal solution* to a problem. We'd
all like reasonably-priced books AND reliable delivery, but maybe we can't
have both. Personally, I'm not so sure about that. Maybe Jeff Bezos could
make do with only five solid gold Mercedes instead of ten[1], for the sake
of improved delivery? But apparently not.

But I digress... ultimately, you are trying to argue that there is a single
absolute source of truth for what counts as "correct". I don't believe
there is. We can agree that some things are clearly not correct -- Amazon
takes your money and sets the book on fire, or hires an armed military
escort costing $20 million a day to deliver your book of funny cat
pictures. We might even agree on what we'd all like in a perfect world:
cheap books, reliable delivery, and a pony. But in practice we have to
choose some features over others, and compromise on requirements, and
ultimately we have to make a *pragmatic* choice on what counts as correct
based on the functional requirements, not on a wishlist of things we'd like
with infinite time and money.

Sticking to the Amazon example, what percentage of books damaged in delivery
ceases to be a bug in the packing algorithm and becomes "just one of those
things"? One in ten? One in ten thousand? One in a hundred billion billion?
I do not accept that "book gets damaged in transit" counts as a bug. "More
than x% of books get damaged", that's a bug. "Average cost to ship a book
is more than $y" is a bug. And Amazon gets to decide what the values of x%
and $y are.

I'm not saying this is always the case. Clearly, there are companies
which have been very successful at producing a premium product (Apple,
for example). I'm not saying that fast is always better than correct.
I'm just saying that correct is not always better than fast.

In the case of Amazon, "correct" in the sense of "books are packed better"
is not better than fast. It's better for the customer, and better for
society as a whole (less redundant shipping and less ecological harm), but
not better for Amazon. Since Amazon gets to decide what's better, their
greedy, short-term outlook wins, at least until such time as customers find
an alternative. Amazon would absolutely not agree with you that packing the
books more securely is "better"; if they did, they would do it. They're not
stupid, just focused on short-term gain for themselves at the expense of
everyone else. (Perhaps a specialised, and common, form of stupidity.)

By the way, this whole debate is known as "Worse is better", and bringing it
back to programming languages and operating systems, you can read more
about it here:

http://www.jwz.org/doc/worse-is-better.html



[1] Figuratively speaking.
 

Chris Angelico

(However, to the extent that Amazon has gained monopoly power over the book
market, that reasoning may not apply. Amazon is not *technically* a
monopoly, but they are clearly well on the way to becoming one, at which
point the customer has no effective choice and the market is no longer
free.)

They don't need a monopoly on the whole book market, just on specific
books - which they did have, in the cited case. I actually asked the
author (translator, really - it's a translation of "Alice in
Wonderland") how he would prefer me to buy, as there are some who sell
on Amazon and somewhere else. There was no alternative to Amazon, ergo
no choice and the market was not free. Like so many things, one choice
("I want to buy Ailice's Anters in Ferlielann") mandates another
("Must buy through Amazon").

I don't know what it cost Amazon to ship me two copies of a book, but
still probably less than they got out of me, so they're still ahead.
Even if they lost money on this particular deal, they're still way
ahead because of all the people who decide it's not worth their time
to spend an hour or so trying to get a replacement. So yep, this
policy is serving Amazon fairly well.

ChrisA
 

Mark Lawrence

They don't need a monopoly on the whole book market, just on specific
books - which they did have, in the cited case. I actually asked the
author (translator, really - it's a translation of "Alice in
Wonderland") how he would prefer me to buy, as there are some who sell
on Amazon and somewhere else. There was no alternative to Amazon, ergo
no choice and the market was not free. Like so many things, one choice
("I want to buy Ailice's Anters in Ferlielann") mandates another
("Must buy through Amazon").

I don't know what it cost Amazon to ship me two copies of a book, but
still probably less than they got out of me, so they're still ahead.
Even if they lost money on this particular deal, they're still way
ahead because of all the people who decide it's not worth their time
to spend an hour or so trying to get a replacement. So yep, this
policy is serving Amazon fairly well.

ChrisA

So much for my "You never know, we might even end up with a thread
whereby the discussion is Python, the whole Python and nothing but the
Python." :)
 

wxjmfauth

On Sunday 5 January 2014 23:14:07 UTC+1, Terry Reedy wrote:

Let me expand on 3 of my points. First, performance == time:

Point 3. You correctly identified a time regression in finding a
character in a string. I saw that the slowdown was *not* inherent in the
FSR but had to be a glitch in the code, and reported it on pydev with
the hope that someone would fix it even if it were not too important in
real use cases. Someone did.

Point 1. You incorrectly generalized that extreme case. I reported (a
year ago last September) that the overall stringbench results were about
the same. I also pointed out that there is an equally non-representative
extreme case in the opposite direction, and that it would equally be
wrong of me to use that to claim that FSR is faster. (It turns out that
this FSR speed advantage *is* inherent in the design.)

Memory: Point 2. A *design goal* of FSR was to save memory relative to
UTF-32, which is what you apparently prefer. Your examples show that the
FSR successfully met its design goal. But you call that success, saving
memory, 'wrong'. On what basis?

You *claim* the FSR is 'wrong by design', but your examples only show
that it was temporarily wrong in implementation as far as speed goes and
correct by design as far as memory goes.

Point 3: You are right. I'm very happy to agree.

Point 2: This Flexible String Representation does not
"effectuate" any memory optimization. It only succeeds
in doing the opposite of what a correct usage of utf*
does.

Ned: this has already been explained and illustrated.

jmf
 

Terry Reedy

On Sunday 5 January 2014 23:14:07 UTC+1, Terry Reedy wrote:
Point 2: This Flexible String Representation does not
"effectuate" any memory optimization. It only succeeds
in doing the opposite of what a correct usage of utf*
does.

Since the FSR *was* successful in saving memory, and indeed shrank the
Python binary by about a megabyte, I have no idea what you mean.
 

Tim Delaney

Point 2: This Flexible String Representation does not
"effectuate" any memory optimization. It only succeeds
in doing the opposite of what a correct usage of utf*
does.

UTF-8 is a variable-width encoding that uses less memory to encode code
points with lower numerical values, on a per-character basis e.g. if a code
point is <= U+007F it will use a single byte to encode; if <= U+07FF, two
bytes will be used; ... up to a maximum of 6 bytes for code points >=
U+4000000 under the original scheme (in practice 4 bytes, since Unicode
code points stop at U+10FFFF).
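
This is easy to check from Python itself (a quick sketch):

for cp in (0x41, 0x7FF, 0xFFFF, 0x10FFFF):
    print(hex(cp), len(chr(cp).encode("utf-8")))
# prints 1, 2, 3 and 4 bytes respectively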

FSR is a variable-width memory structure that uses the width of the code
point with the highest numerical value in the string e.g. if all code
points in the string are <= U+00FF a single byte will be used per
character; if all code points are <= U+FFFF two bytes will be used per
character; and in all other cases 4 bytes will be used per character.
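
The per-string width is easy to observe (a sketch; exact sys.getsizeof
values are implementation details that vary by build):

import sys

# Growing a string by one character adds 1, 2 or 4 bytes, depending
# on the widest code point anywhere in the string.
for ch in ("a", "\u00ff", "\uffff", "\U0001d11e"):
    n1000 = sys.getsizeof(ch * 1000)
    n1001 = sys.getsizeof(ch * 1001)
    print(hex(ord(ch)), n1001 - n1000)   # expect 1, 1, 2, 4 on CPython 3.3+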

In terms of memory usage the difference is that UTF-8 varies its width
per-character, whereas the FSR varies its width per-string. For any
particular string, UTF-8 may well result in using less memory than the FSR,
but in other (quite common) cases the FSR will use less memory than UTF-8
e.g. if the string contains only code points <= U+00FF, but some
are between U+0080 and U+00FF (inclusive).

In most cases the FSR uses the same or less memory than earlier versions of
Python 3 and correctly handles all code points (just like UTF-8). In the
cases where the FSR uses more memory than previously, the previous
behaviour was incorrect.

No matter which representation is used, there will be a certain amount of
overhead (which is the majority of what most of your examples have shown).
Here are examples which demonstrate cases where UTF-8 uses less memory,
cases where the FSR uses less memory, and cases where they use the same
amount of memory (accounting for the minimum amount of overhead required
for each).

[Interactive session from Python 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012)
on win32 elided.]

Indexing a character in UTF-8 is O(N) - you have to traverse the string
up to the character being indexed. Indexing a character in the FSR is O(1).
In all cases the FSR has better performance characteristics for indexing
and slicing than UTF-8.
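
A sketch of why UTF-8 indexing is O(N): the byte offset of character n
cannot be computed directly, only found by scanning the lead bytes of
all n preceding characters.

def utf8_index(data, n):
    # Return the byte offset of character n in UTF-8 bytes.
    offset = 0
    for _ in range(n):
        lead = data[offset]
        if lead < 0x80:
            offset += 1        # 1-byte character (ASCII)
        elif lead < 0xE0:
            offset += 2        # 2-byte character (lead 110xxxxx)
        elif lead < 0xF0:
            offset += 3        # 3-byte character (lead 1110xxxx)
        else:
            offset += 4        # 4-byte character (lead 11110xxx)
    return offset

data = "a\u00e9\u20ac\U0001d11e".encode("utf-8")
assert utf8_index(data, 3) == 1 + 2 + 3   # skip 1-, 2- and 3-byte chars
# The FSR instead computes offset = n * character_width: O(1).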

There are tradeoffs with both UTF-8 and the FSR. The Python developers
decided the priorities for Unicode handling in Python were:

1. Correctness
a. all code points must be handled correctly;
b. it must not be possible to obtain part of a code point (e.g. the
first byte only of a multi-byte code point);

2. No change in the Big O characteristics of string operations e.g.
indexing must remain O(1);

3. Reduced memory use in most cases.

It is impossible for UTF-8 to meet both criteria 1b and 2 without
additional auxiliary data (which uses more memory and increases complexity
of the implementation). The FSR meets all 3 criteria.

Tim Delaney
 

Terry Reedy

Since the FSR *was* successful in saving memory, and indeed shrank the
Python binary by about a megabyte, I have no idea what you mean.

Tim Delaney apparently did, and answered on the basis of his
understanding. Note that I said that the design goal was 'save memory
RELATIVE TO UTF-32', not 'optimize memory'. UTF-8 was not considered an
option. Nor was any form of arithmetic coding
https://en.wikipedia.org/wiki/Arithmetic_coding
to truly 'optimize memory'.
 

wxjmfauth

On Wednesday 8 January 2014 01:02:22 UTC+1, Terry Reedy wrote:

Tim Delaney apparently did, and answered on the basis of his
understanding. Note that I said that the design goal was 'save memory
RELATIVE TO UTF-32', not 'optimize memory'. UTF-8 was not considered an
option. Nor was any form of arithmetic coding
https://en.wikipedia.org/wiki/Arithmetic_coding
to truly 'optimize memory'.


The FSR acts more as a coding scheme selector than
as a code point optimizer.

Claiming that it saves memory is some kind of illusion;
a little like saying "Py2.7 uses "relatively" less memory than
Py3.2 (UCS-2)".
40044

jmf
 

Terry Reedy

On 1/8/2014 4:59 AM, (e-mail address removed) wrote:
[responding to me]
> The FSR acts more as a coding scheme selector

That is what PEP 393 describes and what I and many others have said. The
FSR saves memory by selecting from three choices the most compact coding
scheme for each string.

I ask again, have you read PEP 393? If you are going to critique the
FSR, you should read its basic document.

> than as a code point optimizer.

I do not know what you mean by 'code point optimizer'.

> Claiming that it saves memory is some kind of illusion;

Do you really think that the mathematical fact "10026 < 20040 < 40044"
(from your example) is some kind of illusion? If so, please take your
claim to a metaphysics list. If not, please stop trolling.

> a little like saying "Py2.7 uses "relatively" less memory than
> Py3.2 (UCS-2)".

This is inane as 2.7 and 3.2 both use the same two coding schemes.
Saying '1 < 2' is different from saying '2 < 2'.

On 3.3+, the three strings take 10026, 20040, and 40044 bytes. 3.2-
wide (UCS-4) builds use about 40050 bytes for all three unicode
strings. Once again, you have posted examples that show how FSR saves
memory, thus negating your denial of the saving.
 

Mark Lawrence

On Sunday 5 January 2014 23:14:07 UTC+1, Terry Reedy wrote:

Ned: this has already been explained and illustrated.

jmf

This has never been explained and illustrated. Roughly 30 minutes ago
Terry Reedy once again completely shot your argument about memory usage
to pieces. You did not bother to respond to the comments that Tim
Delaney made almost a day ago. Please give up.
 
