Flexible string representation, unicode, typography, ...

Serhiy Storchaka · Sep 2, 2012

Indexing is O(0) for any string.

Typo. O(1)

Steven D'Aprano · Sep 3, 2012

- Unfortunately, I got opposite and even much worse results on my win
box, considering
- libfrancais is one of my modules and it does a little bit more than the
std sorting tools.

How do we know that the problem isn't in your module?

My rationale: very simple.

1) I never heard about something better than sticking with one of the
Unicode coding schemes. (general theory)

Your ignorance is not a good reason for abandoning a powerful software
technique.

2) I am not at all convinced by the "new" Py 3.3 algorithm. I'm not the
only one who noticed problems.

That's nice.

Nobody has yet displayed genuine performance problems, only artificial
and platform-dependent slowdowns that are insignificant in practice. If
you can demonstrate genuine problems, people will be interested in fixing
them.

Let me be frank: nobody gives a damn if, for some rare circumstances,
some_string.replace(another_string) takes 0.3μs instead of 0.1μs.
Overall, considering multiple platforms and dozens of different string
operations, PEP 393 is a big win:

- many operations are faster
- a few operations are a LOT faster
- but a very few operations are sometimes slower
- many strings will use less memory
- sometimes a LOT less memory
- no more distinction between wide and narrow builds
- characters in the supplementary planes are now, for the first
time in Python, treated correctly by default

That's six wins versus one loss.

Arguing, "it is fast enough", is not a correct answer.

It is *exactly* the correct answer.

Nobody is going to revert this just because your script now runs in 5.7ms
instead of 5.2ms. Who cares?

If you are *seriously* interested in debugging why string code is slower
for you, you can start by running the full suite of Python string
benchmarks: see the stringbench benchmark in the Tools directory of
source installations, or see here:

http://hg.python.org/cpython/file/8ff2f4634ed8/Tools/stringbench

Steven D'Aprano · Sep 3, 2012

I see that this misconception is widespread.

I am not familiar enough with the C implementation to tell what Python
3.3 actually does, and the PEP assumes a fair amount of familiarity with
the CPython source. So I welcome corrections.

In fact Python 3.3 uses four kinds of ready strings.

* ASCII. All codes <= U+007F.
* UCS1. All codes <= U+00FF, at least one code > U+007F.
* UCS2. All codes <= U+FFFF, at least one code > U+00FF.
* UCS4. All codes <= U+0010FFFF, at least one code > U+FFFF.

Where UCS1 is equivalent to Latin-1, correct?

UCS2 is what Python 3.2 narrow builds use for all strings, including
codes > U+FFFF using surrogate pairs.

UCS4 is what Python 3.2 wide builds use for all strings.

This means that Python 3.3 will no longer have surrogate pairs.

Am I right?

Indexing is O(0) for any string.

I think you mean O(1) for constant-time lookups.

Also the string can optionally cache UTF-8 and wchar_t* representation.

Right, that's the bit that wasn't clear -- the UTF-8 data is a cache, not
the canonical representation.
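
The per-character storage widths described above can be observed from Python itself. The following is a minimal sketch, assuming CPython 3.3 or later; the helper name bytes_per_char is made up for illustration, and the fixed header overhead of each string is an implementation detail that the subtraction cancels out.

```python
import sys


def bytes_per_char(ch, n=1000):
    """Estimate storage per code point for a string made only of `ch`.

    Hypothetical helper for illustration: it compares two strings whose
    lengths differ by n, so the fixed object header cancels out.
    Relies on CPython's sys.getsizeof.
    """
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n


samples = {
    "ASCII": "a",                    # U+0061
    "Latin-1": "\u00e9",             # U+00E9
    "BMP": "\u20ac",                 # U+20AC
    "supplementary": "\U0001F600",   # U+1F600
}

widths = {name: bytes_per_char(ch) for name, ch in samples.items()}
for name, width in widths.items():
    print(name, width)

# On 3.3+ a supplementary-plane character is one code point, not a pair.
astral_length = len("\U0001F600")
print("len of one supplementary character:", astral_length)
```

On CPython 3.3+ this should report 1, 1, 2 and 4 bytes per character. The ASCII and Latin-1 cases differ only in the size of the object header, not in the per-character width.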

Terry Reedy · Sep 3, 2012

I am not familiar enough with the C implementation to tell what Python
3.3 actually does, and the PEP assumes a fair amount of familiarity with
the CPython source. So I welcome corrections.

Where UCS1 is equivalent to Latin-1, correct?

UCS2 is what Python 3.2 narrow builds use for all strings, including
codes > U+FFFF using surrogate pairs.

UCS4 is what Python 3.2 wide builds use for all strings.

This means that Python 3.3 will no longer have surrogate pairs.

Basically, yes. I believe CPython will only use surrogate code points if
one requests errors='surrogateescape' on decoding or explicitly puts them
in a literal (\unnnn or \Ummmmmmmm). The consequences fall under the
'consenting adults' policy.
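
The two routes to a surrogate code point mentioned above can be sketched as follows. This is only an illustration, assuming Python 3.3 or later; the sample bytes are made up.

```python
# Route 1: decoding with the surrogateescape error handler.
# The undecodable byte 0xFF is mapped to the lone surrogate U+DCFF.
raw = b"caf\xff"
text = raw.decode("utf-8", errors="surrogateescape")
has_surrogate = any(0xD800 <= ord(c) <= 0xDFFF for c in text)

# Encoding with the same handler restores the original bytes.
round_trip = text.encode("utf-8", errors="surrogateescape")

# Route 2: writing a surrogate explicitly in a literal.
lone = "\ud800"

# Strict UTF-8 encoding refuses lone surrogates.
try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

print(has_surrogate, round_trip == raw, len(lone), strict_ok)
```

In both cases the surrogate is stored as an ordinary single code point; nothing pairs it up behind the scenes.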

wxjmfauth · Sep 3, 2012

On Sunday, 2 September 2012 14:01:18 UTC+2, Serhiy Storchaka wrote:

Hmm, and with locale.strxfrm Python 3.3 is 20% slower than 3.2.

With a memory gain = 0 since my text contains non-latin-1 characters!

jmf

Mark Lawrence · Sep 3, 2012

On Sunday, 2 September 2012 14:01:18 UTC+2, Serhiy Storchaka wrote:

With a memory gain = 0 since my text contains non-latin-1 characters!

jmf

This is getting really funny. Do you make a living writing comedy for
big film or TV studios? Your response to Steven D'Aprano's "That's six
wins versus one loss." should be hilarious. Or do you not respond to
fact based posts?

Peter Otten · Sep 3, 2012

On Sunday, 2 September 2012 14:01:18 UTC+2, Serhiy Storchaka wrote:

With a memory gain = 0 since my text contains non-latin-1 characters!

I can't confirm this. At least users of wide builds will see a decrease in
memory use:

$ cat strxfrm_getsize.py
import locale
import sys

print("maxunicode:", sys.maxunicode)
locale.setlocale(locale.LC_ALL, "fr_FR.UTF-8")
words = [
    'noël', 'noir', 'nœud', 'noduleux',
    'noétique', 'noèse', 'noirâtre']
print("total size of original strings:",
      sum(sys.getsizeof(s) for s in words))
print(
    "total size of transformed strings:",
    sum(sys.getsizeof(locale.strxfrm(s)) for s in words))

$ python3.2 strxfrm_getsize.py
maxunicode: 1114111
total size of original strings: 584
total size of transformed strings: 980

$ python3.3 strxfrm_getsize.py
maxunicode: 1114111
total size of original strings: 509
total size of transformed strings: 483

The situation is more complex than you suppose -- you need less dogma and
more experiments

Terry Reedy · Sep 3, 2012

At least users of wide builds will see a decrease in memory use:

Everyone saves because everyone uses large parts of the stdlib. When 3.3
starts up in a Windows console, there are 56 modules in sys.modules. With
Idle, there are over 130. All the identifiers, all the global, local,
and attribute names are present as ascii-only strings. Now multiply that
by some reasonable average, keeping in mind that __builtins__ alone has
148 names.

Former narrow build users gain less space but also gain the elimination
of buggy behavior.
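
The point about identifiers can be checked roughly from a running interpreter. This is a minimal sketch, assuming any Python 3; the counts vary by version and platform, and the savings figure is only a back-of-the-envelope estimate against a 4-bytes-per-character wide build.

```python
import builtins
import sys

# How many modules are already loaded in this interpreter.
module_count = len(sys.modules)

# Names exposed by the builtins module, and how many are pure ASCII.
names = dir(builtins)
ascii_names = [n for n in names if all(ord(c) < 128 for c in n)]
ascii_chars = sum(len(n) for n in ascii_names)

# Rough estimate only: 1 byte per character instead of 4.
estimated_saving = 3 * ascii_chars

print("modules loaded:", module_count)
print("builtin names:", len(names), "of which ASCII:", len(ascii_names))
print("estimated bytes saved on these names alone:", estimated_saving)
```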

Roy Smith · Sep 3, 2012

Indexing is O(0) for any string.

I think you mean O(1) for constant-time lookups.

Why settle for constant-time, when you can have zero-time instead?

Serhiy Storchaka · Sep 3, 2012

If you are *seriously* interested in debugging why string code is slower
for you, you can start by running the full suite of Python string
benchmarks: see the stringbench benchmark in the Tools directory of
source installations, or see here:

http://hg.python.org/cpython/file/8ff2f4634ed8/Tools/stringbench

http://hg.python.org/cpython/file/default/Tools/stringbench

However, stringbench is not a good tool to measure the effectiveness of
the new string representation, because it focuses mainly on ASCII strings
and on comparing strings with bytes.

Serhiy Storchaka · Sep 3, 2012

This means that Python 3.3 will no longer have surrogate pairs.

Am I right?

As Terry said, basically, yes. Python 3.3 does not need surrogate
pairs, but does not prevent their creation. You can create a surrogate
code point (U+D800..U+DFFF) intentionally (just as you can create a lone
combining accent or any other code point that is meaningless on its own),
but you are less likely to get one unintentionally.

Serhiy Storchaka · Sep 3, 2012

I can't confirm this. At least users of wide builds will see a decrease in
memory use:

And only users of wide builds will see a 20% decrease in speed for this
data (with longer strings Python 3.3 will outstrip Python 3.2). This
happens because of the inevitable transformation UCS2 -> wchar_t and
wchar_t -> UCS2 on platforms with a 4-byte wchar_t. On Windows there
should be no slowdown.
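
Whether a given platform has the 4-byte wchar_t described above can be checked with ctypes. This is a small sketch, assuming CPython with the standard ctypes module; the variable names are illustrative only.

```python
import ctypes

# Size of the platform's wchar_t in bytes (typically 4 on Linux, 2 on Windows).
wchar_size = ctypes.sizeof(ctypes.c_wchar)

# Per the explanation above, a 2-bytes-per-character string has to be
# converted when wchar_t is wider than that.
conversion_expected = wchar_size == 4

print("sizeof(wchar_t):", wchar_size)
print("UCS2 <-> wchar_t conversion expected:", conversion_expected)
```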

Ian Kelly · Sep 3, 2012

Hmm, and with locale.strxfrm Python 3.3 is 20% slower than 3.2.

Doh! In Python 3.3, strcoll and strxfrm are the same speed, so I
guess that the actual optimization I'm seeing here is that in Python
3.3, cmp_to_key(strcoll) has been optimized to return strxfrm.
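
For reference, the two sort spellings being compared look like this. This is a minimal sketch that uses the "C" collation locale because it is always available; swap in a locale such as "fr_FR.UTF-8" if it is installed. It shows only that the two forms order the words the same way and says nothing about how either is implemented internally.

```python
import functools
import locale

# "C" is always available; a French locale would be closer to the thread's data.
locale.setlocale(locale.LC_COLLATE, "C")

words = ["noir", "noduleux", "noel", "noeud", "noetique"]

# Key function: each word is transformed once, then keys are compared.
by_xfrm = sorted(words, key=locale.strxfrm)

# Comparison function wrapped as a key: strcoll is called per comparison.
by_coll = sorted(words, key=functools.cmp_to_key(locale.strcoll))

print(by_xfrm)
print(by_xfrm == by_coll)
```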

Steven D'Aprano · Sep 4, 2012

http://hg.python.org/cpython/file/default/Tools/stringbench

However, stringbench is not a good tool to measure the effectiveness of
the new string representation, because it focuses mainly on ASCII strings
and on comparing strings with bytes.

But it is a good place to start, so you can develop unicode benchmarks.
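
A Unicode-aware micro-benchmark along these lines could start from something like the following. This is a rough sketch using the standard timeit module; the helper time_replace and the sample strings are made up for illustration and are not part of stringbench.

```python
import timeit


def time_replace(text, old, new, number=1000):
    """Best-of-three wall time for `number` calls of text.replace(old, new).

    Hypothetical helper for illustration only.
    """
    timer = timeit.Timer(lambda: text.replace(old, new))
    return min(timer.repeat(repeat=3, number=number))


# One sample per internal string kind, same length and structure.
cases = {
    "ascii": ("abc" * 1000, "b", "x"),
    "latin-1": ("a\u00e9c" * 1000, "\u00e9", "x"),
    "bmp": ("a\u20acc" * 1000, "\u20ac", "x"),
    "supplementary": ("a\U0001F600c" * 1000, "\U0001F600", "x"),
}

results = {name: time_replace(*args) for name, args in cases.items()}
for name, seconds in results.items():
    print("{:14s} {:.6f} s".format(name, seconds))
```

Running the same script under 3.2 and 3.3 gives a per-kind comparison. Timings depend heavily on platform and build, so they are best read as relative numbers.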
