Blog "about python 3"


wxjmfauth

It's time to understand the Character Encoding Models
and the math behind them.
Unicode does not differ from any other coding scheme.

How? With a sheet of paper and a pencil.

jmf
 

Robin Becker

Just because it's 3.3 doesn't matter... the main interest is in
compatibility. Secondly, you used just one piece of code, which could be a
fluke; try others, and check the PEP. You need to realize that even the
older versions are being worked on, and they have to be refined. So if you
have a problem, use the older version and import from __future__ would be my
suggestion.

Suggesting that I use another piece of code to test python3 against python2 is a
bit silly. I'm sure I can find stuff which runs faster under python3, but
reportlab is the code I'm porting and that is going the wrong way.
 

Robin Becker

I imagine that this was not fun.

indeed :)
Do you mean 'from __future__ import unicode_literals'?

No, previously we had a default of utf8-encoded strings in the lower levels of the
code and we accepted either unicode or utf8 string literals as inputs to text
functions. As part of the port process we made the decision to change from
default utf8 str (bytes) to default unicode.
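That decision can be sketched with a tiny helper. This is purely illustrative, not ReportLab's actual code: the idea is to accept either utf8 bytes or unicode at the boundary and normalize to unicode internally.

```python
# Hypothetical sketch of boundary normalization after switching the
# internal default from utf8 bytes to unicode: text functions accept
# either type of input and always work with str internally.

def as_unicode(value, enc="utf8"):
    """Return `value` as a unicode string, decoding utf8 bytes if needed."""
    if isinstance(value, bytes):
        return value.decode(enc)
    return value

# Both spellings of the same input normalize to the same str:
assert as_unicode(b'caf\xc3\xa9') == as_unicode('caf\xe9')
```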
Am I correct in thinking that this change increases the capabilities of
reportlab? For instance, easily producing an article with abstracts in English,
Arabic, Russian, and Chinese?
It's made no real difference to what we are able to produce or accept since utf8
or unicode can encode anything in the input and what can be produced depends on
fonts mainly.
The new unicode implementation in 3.3 is faster for some operations and slower
for others. It is definitely more space efficient, especially compared to a wide
build. It is definitely less buggy, especially compared to a narrow build.
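The space claim is easy to check with `sys.getsizeof`: under PEP 393 a string is stored with 1, 2, or 4 bytes per character depending on its widest code point, so three equal-length strings can take very different amounts of memory.

```python
# PEP 393 storage in action: per-character width depends on the widest
# character in the string (1 byte for latin-1, 2 for BMP, 4 for astral),
# rather than a fixed 2 or 4 bytes as on old narrow/wide builds.
import sys

ascii_s  = 'a' * 1000            # fits in 1 byte/char
bmp_s    = '\u20ac' * 1000       # EURO SIGN: 2 bytes/char
astral_s = '\U0001F600' * 1000   # emoji: 4 bytes/char

assert sys.getsizeof(ascii_s) < sys.getsizeof(bmp_s) < sys.getsizeof(astral_s)
```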

Do your tests use any astral (non-BMP) chars? If so, do they pass on narrow 2.7
builds (like on Windows)?

I'm not sure if we have any non-bmp characters in the tests. Simple CJK etc etc
for the most part. I'm fairly certain we don't have any ability to handle
composed glyphs (multi-codepoint) etc etc



.....
For one thing, indexing and slicing just works on all machines for all unicode
strings. Code for 2.7 and 3.3 either a) does not index or slice, b) does not
work for all text on 2.7 narrow builds, or c) has extra conditional code only
for 2.7.
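A quick illustration of the point: on 3.3+ all of the following hold for an astral character, whereas a 2.7 narrow build would store the emoji as a surrogate pair and report a length of 4 for this string.

```python
# Indexing, len() and slicing agree with code points even outside the BMP.
s = 'a\U0001F600z'

assert len(s) == 3
assert s[1] == '\U0001F600'
assert s[::-1] == 'z\U0001F600a'
```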

probably
 

Robin Becker

...........

Running a test suite is a completely broken benchmarking methodology.
You should isolate workloads you are interested in and write a benchmark
simulating them.

I'm certain you're right, but individual bits of code like generating our
reference manual also appear to be slower in 3.3.
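The suggested approach, isolating one workload and timing it with `timeit`, might look like this. The workload here (joining short strings) is just an illustrative stand-in, not one of reportlab's actual hot paths.

```python
# A minimal isolated benchmark: time one specific operation repeatedly,
# instead of inferring performance from a whole test-suite run.
import timeit

setup = "parts = ['word%d' % i for i in range(1000)]"
stmt = "' '.join(parts)"

elapsed = timeit.timeit(stmt, setup=setup, number=10000)
print('%.3fs for 10000 joins of 1000 short strings' % elapsed)
```

Running the same statement under 2.7 and 3.3 interpreters would give a like-for-like comparison for that one operation.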
 

Robin Becker

There was more speedup in 3.3.2 and possibly even more in 3.3.3, so the OP
should run the latter.

python 3.3.3 is what I use on windows. As for astral / non-bmp etc etc, that's
almost irrelevant for the sort of tests we're doing, which are mostly simple
English text.
 

Roy Smith

Robin Becker said:
python 3.3.3 is what I use on windows. As for astral / non-bmp etc etc, that's
almost irrelevant for the sort of tests we're doing, which are mostly simple
English text.

The sad part is, if you're accepting any text from external sources, you
need to be able to deal with astral.

I was doing a project a while ago importing 20-something million records
into a MySQL database. Little did I know that FOUR of those records
contained astral characters (which MySQL, at least the version I was
using, couldn't handle).

My way of dealing with those records was to nuke them. Longer term we
ended up switching to Postgres.
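Those four records could have been caught up front with a simple scan. This is a sketch, not the actual import script: flag any string containing a code point above U+FFFF, which MySQL's 3-byte "utf8" charset cannot store.

```python
# Flag records containing astral (non-BMP) characters before inserting
# them into a store that only handles the BMP.

def has_astral(text):
    """True if `text` contains any code point outside the BMP (> U+FFFF)."""
    return any(ord(ch) > 0xFFFF for ch in text)

records = ['plain ascii', 'caf\xe9', 'emoji \U0001F600 here']
bad = [r for r in records if has_astral(r)]
assert bad == ['emoji \U0001F600 here']
```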
 

Chris Angelico

I was doing a project a while ago importing 20-something million records
into a MySQL database. Little did I know that FOUR of those records
contained astral characters (which MySQL, at least the version I was
using, couldn't handle).

My way of dealing with those records was to nuke them. Longer term we
ended up switching to Postgres.

Look! Postgres means you don't lose data!!

Seriously though, that's a much better long-term solution than
destroying data. But MySQL does support the full Unicode range - just
not in its "UTF8" type. You have to specify "UTF8MB4" - that is,
"maximum bytes 4" rather than the default of 3. According to [1], the
UTF8MB4 encoding is stored as UTF-16, and UTF8 is stored as UCS-2. And
according to [2], it's even possible to explicitly choose the
mindblowing behaviour of UCS-2 for a data type that calls itself
"UTF8", so that a vague theoretical subsequent version of MySQL might
be able to make "UTF8" mean UTF-8, and people can choose to use the
other alias.

To my mind, this is a bug with backward-compatibility concerns. That
means it can't be fixed in a point release. Fine. But the behaviour
change is "this used to throw an error, now it works". Surely that can
be fixed in the next release. Or surely a version or two of
deprecating "UTF8" in favour of the two "MB?" types (and never ever
returning "UTF8" from any query), followed by a reintroduction of
"UTF8" as an alias for MB4, and the deprecation of MB3. Or am I
spoiled by the quality of Python (and other) version numbering, where
I can (largely) depend on functionality not changing in point
releases?

ChrisA

[1] http://dev.mysql.com/doc/refman/5.7/en/charset-unicode-utf8mb4.html
[2] http://dev.mysql.com/doc/refman/5.7/en/charset-unicode-utf8mb3.html
 

Terry Reedy

python 3.3.3 is what I use on windows. As for astral / non-bmp etc etc,
that's almost irrelevant for the sort of tests we're doing, which are
mostly simple English text.

If you do not test the cases where 2.7 is buggy and requires nasty
workarounds, then I can understand why you do not so much appreciate 3.3
;-).
 

Mark Lawrence

If you do not test the cases where 2.7 is buggy and requires nasty
workarounds, then I can understand why you do not so much appreciate 3.3
;-).

Are you crazy? Surely everybody prefers fast but incorrect code in
preference to something that is correct but slow? Except that Python
3.3.3 is often faster. And always (to my knowledge) correct. Upper
Class Twit of the Year anybody? :)
 

Mark Lawrence

We too are using python 2.4 - 2.7 in production. Different clients
migrate at different speeds.

+1

I just spent a large amount of effort porting reportlab to a version
which works with both python2.7 and python3.3. I have a large number of
functions etc which handle the conversions that differ between the two
pythons.

For fairly sensible reasons we changed the internal default to use
unicode rather than bytes. After doing all that and making the tests
compatible etc etc I have a version which runs in both and passes all
its tests. However, for whatever reason the python 3.3 version runs slower:

2.7 Ran 223 tests in 66.578s

3.3 Ran 223 tests in 75.703s

I know some of these tests are fairly variable, but even for simple
things like paragraph parsing 3.3 seems to be slower. Since both use
unicode internally it can't be that can it, or is python 2.7's unicode
faster?

So far the superiority of 3.3 escapes me, but I'm tasked with enjoying
this process so I'm sure there must be some new 'feature' that will
help. Perhaps 'yield from' or 'raise from None' or .......

In any case I think we will be maintaining python 2.x code for at least
another 5 years; the version gap is then a real hindrance.

Of interest
https://mail.python.org/pipermail/python-dev/2012-October/121919.html ?
 

wxjmfauth

On Friday 3 January 2014 12:14:41 UTC+1, Robin Becker wrote:
indeed :)

No, previously we had a default of utf8-encoded strings in the lower levels of the
code and we accepted either unicode or utf8 string literals as inputs to text
functions. As part of the port process we made the decision to change from
default utf8 str (bytes) to default unicode.

It's made no real difference to what we are able to produce or accept since utf8
or unicode can encode anything in the input and what can be produced depends on
fonts mainly.

I'm not sure if we have any non-bmp characters in the tests. Simple CJK etc etc
for the most part. I'm fairly certain we don't have any ability to handle
composed glyphs (multi-codepoint) etc etc

....

----

To Robin Becker

I know nothing about ReportLab except its existence.
Your story is very interesting. As I pointed out, I know
nothing about the internals of ReportLab, the technical
aspects (the "Python part", the API used for the PDF creation).
I have however some experience with the unicode TeX engine,
XeTeX, so I understand a little bit of what's
happening behind the scenes.

The very interesting aspect is the way you are holding
unicode (strings). By comparing Python 2 with Python 3.3,
you are comparing utf-8 with the internal "representation"
of Python 3.3 (the flexible string representation).
In one sense, more than comparing Py2 with Py3.

It would be much more interesting to compare utf-8/Python
internals in the light of Python 3.2 and Python 3.3. Python
3.2 has decent unicode handling; Python 3.3 has an absurd
(in the mathematical sense) unicode handling. This is really
shining with utf-8, where this flexible string representation
is just doing the opposite of what a correct unicode
implementation does!

On the memory side, it is obvious to see it.
10020

On the performance side, it is much more complex,
but qualitatively, you may expect the same results.


The funny aspect is that by working with utf-8 in that
case, you are (or one is) forcing Python to work
properly, but one pays on the side of performance.
And if one wishes to save memory, one has to pay on the
side of performance.

In other words, attempting to do what Python is
not able to do natively is just impossible!


I'm skipping the very interesting composed glyphs subject
(unicode normalization, ...), but I wish to point out that
with the flexible string representation, one reaches
the top level of surrealism. For a tool which is supposed
to handle these very specific unicode tasks...

jmf
 

Roy Smith

Mark Lawrence said:
Surely everybody prefers fast but incorrect code in
preference to something that is correct but slow?

I realize I'm taking this statement out of context, but yes, sometimes
fast is more important than correct. Sometimes the other way around.
 

Chris Angelico

I realize I'm taking this statement out of context, but yes, sometimes
fast is more important than correct. Sometimes the other way around.

More usually, it's sometimes better to be really fast and mostly
correct than really really slow and entirely correct. That's why we
use IEEE floating point instead of Decimal most of the time. Though
I'm glad that Python 3 now deems the default int type to be capable of
representing arbitrary integers (instead of dropping out to a separate
long type as Py2 did), I think it's possibly worth optimizing small
integers to machine words - but mainly, the int type focuses on
correctness above performance, because the cost is low compared to the
benefit. With float, the cost of arbitrary precision is extremely
high, and the benefit much lower.
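The float/Decimal trade-off in miniature:

```python
# Binary floats are fast but cannot represent 0.1 exactly;
# Decimal is exact for decimal fractions, at a performance cost.
from decimal import Decimal

assert 0.1 + 0.2 != 0.3                                   # IEEE rounding error
assert Decimal('0.1') + Decimal('0.2') == Decimal('0.3')  # exact arithmetic
```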

With Unicode, the cost of perfect support is normally seen to be a
doubling of internal memory usage (UTF-16 vs UCS-4). Pike and Python
decided that the cost could, instead, be a tiny measure of complexity
and actually *less* memory usage (compared to UTF-16, when lots of
identifiers are ASCII). It's a system that works only when strings are
immutable, but works beautifully there. Fortunately Pike doesn't have
any, and Python has only one, idiot like jmf who completely
misunderstands what's going on and uses microbenchmarks to prove
obscure points... and then uses nonsense to try to prove... uhh...
actually I'm not even sure what, sometimes. I wouldn't dare try to
read his posts except that my mind's already in a rather broken state,
as a combination of programming and Alice in Wonderland.

ChrisA
 

Ned Batchelder

More usually, it's sometimes better to be really fast and mostly
correct than really really slow and entirely correct. That's why we
use IEEE floating point instead of Decimal most of the time. Though
I'm glad that Python 3 now deems the default int type to be capable of
representing arbitrary integers (instead of dropping out to a separate
long type as Py2 did), I think it's possibly worth optimizing small
integers to machine words - but mainly, the int type focuses on
correctness above performance, because the cost is low compared to the
benefit. With float, the cost of arbitrary precision is extremely
high, and the benefit much lower.

With Unicode, the cost of perfect support is normally seen to be a
doubling of internal memory usage (UTF-16 vs UCS-4). Pike and Python
decided that the cost could, instead, be a tiny measure of complexity
and actually *less* memory usage (compared to UTF-16, when lots of
identifiers are ASCII). It's a system that works only when strings are
immutable, but works beautifully there. Fortunately Pike doesn't have
any, and Python has only one, idiot like jmf who completely
misunderstands what's going on and uses microbenchmarks to prove
obscure points... and then uses nonsense to try to prove... uhh...
actually I'm not even sure what, sometimes. I wouldn't dare try to
read his posts except that my mind's already in a rather broken state,
as a combination of programming and Alice in Wonderland.

ChrisA

I really wish we could discuss these things without baiting trolls.
 

wxjmfauth

On Saturday 4 January 2014 15:17:40 UTC+1, Chris Angelico wrote:
More usually, it's sometimes better to be really fast and mostly
correct than really really slow and entirely correct. That's why we
use IEEE floating point instead of Decimal most of the time. Though
I'm glad that Python 3 now deems the default int type to be capable of
representing arbitrary integers (instead of dropping out to a separate
long type as Py2 did), I think it's possibly worth optimizing small
integers to machine words - but mainly, the int type focuses on
correctness above performance, because the cost is low compared to the
benefit. With float, the cost of arbitrary precision is extremely
high, and the benefit much lower.

With Unicode, the cost of perfect support is normally seen to be a
doubling of internal memory usage (UTF-16 vs UCS-4). Pike and Python
decided that the cost could, instead, be a tiny measure of complexity
and actually *less* memory usage (compared to UTF-16, when lots of
identifiers are ASCII). It's a system that works only when strings are
immutable, but works beautifully there. Fortunately Pike doesn't have
any, and Python has only one, idiot like jmf who completely
misunderstands what's going on and uses microbenchmarks to prove
obscure points... and then uses nonsense to try to prove... uhh...
actually I'm not even sure what, sometimes. I wouldn't dare try to
read his posts except that my mind's already in a rather broken state,
as a combination of programming and Alice in Wonderland.

I do not mind being considered an idiot, but
I'm definitely not blind.

And I could add, I have *never* seen a single soul
explain what I'm doing wrong in the gazillion
of examples I gave on this list.

---

Back to ReportLab. Technically I would be really
interested to see what could happen at the light
of my previous post.

jmf
 

Terry Reedy

On Saturday 4 January 2014 15:17:40 UTC+1, Chris Angelico wrote:

Chris, I appreciate the many contributions you make to this list, but
that does not exempt you from our standard of conduct.

Troll baiting is a form of trolling. I think you are intelligent enough
to know this. Please stop.
I do not mind being considered an idiot, but
I'm definitely not blind.

And I could add, I have *never* seen a single soul
explain what I'm doing wrong in the gazillion
of examples I gave on this list.

If this is true, it is because you have ignored and not read my
numerous, relatively polite posts. To repeat very briefly:

1. Cherry picking (presenting the most extreme case as representative).

2. Calling space saving a problem (repeatedly).

3. Ignoring bug fixes.

4. Repetition (of the 'gazillion example' without new content).

Have you ever acknowledged, let alone thanked people for, the fix for the
one bad regression you did find? The FSR is still a work in progress.
Just today, Serhiy pushed a patch speeding up the UTF-32 encoder, after
previously speeding up the UTF-32 decoder.
 

Chris Angelico

Chris, I appreciate the many contributions you make to this list, but that
does not exempt you from our standard of conduct.




Troll baiting is a form of trolling. I think you are intelligent enough to
know this. Please stop.

My apologies. I withdraw the aforequoted post. You and Ned are
correct, those comments were inappropriate. Sorry.

ChrisA
 

Steven D'Aprano

Roy said:
I realize I'm taking this statement out of context, but yes, sometimes
fast is more important than correct.

I know somebody who was once touring in the States, and ended up travelling
cross-country by road with the roadies rather than flying. She tells me of
the time someone pointed out that they were travelling in the wrong
direction, away from their destination. The roadie driving replied "Who
cares? We're making fantastic time!"

(Ah, the seventies. So many drugs...)

Fast is never more important than correct. It's just that sometimes you
might compromise a little (or a lot) on what counts as correct in order to
gain some speed.

To give an example, say you want to solve the Travelling Salesman Problem,
and find the shortest path through a whole lot of cities A, B, C, ..., Z.
That's a Hard Problem, expensive to solve correctly.

But if you loosen the requirements so that a correct solution no longer has
to be the absolutely shortest path, and instead accept solutions which are
nearly always close to the shortest (but without any guarantee of how
close), then you can make the problem considerably easier to solve.

But regardless of how fast your path-finder algorithm might become, you're
unlikely to be satisfied with a solution that travels around in a circle
from A to B a million times then shoots off straight to Z without passing
through any of the other cities.
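The "loosened" approach can be sketched as the classic nearest-neighbour heuristic: fast, with no optimality guarantee, but it visits each city exactly once, so it can never produce the A-B-loop pathology.

```python
# Nearest-neighbour heuristic for TSP: from the current city, always hop
# to the closest unvisited city. Quick, usually decent, never optimal in
# general - but every city appears in the tour exactly once.
import math

def nearest_neighbour_tour(cities, start):
    """cities: dict of name -> (x, y). Returns a tour visiting each city once."""
    remaining = dict(cities)
    pos = remaining.pop(start)
    tour = [start]
    while remaining:
        # greedily pick the closest unvisited city
        name, pos = min(remaining.items(),
                        key=lambda kv: math.dist(pos, kv[1]))
        del remaining[name]
        tour.append(name)
    return tour

print(nearest_neighbour_tour({'A': (0, 0), 'B': (0, 1), 'C': (0, 3)}, 'A'))
```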
 

Chris Angelico

But regardless of how fast your path-finder algorithm might become, you're
unlikely to be satisfied with a solution that travels around in a circle
from A to B a million times then shoots off straight to Z without passing
through any of the other cities.

On the flip side, that might be the best salesman your company has
ever known, if those three cities have the most customers!

ChrisA
wondering why nobody cares about the customers in TSP discussions
 
