Flexible string representation, Unicode, typography, ...


Steven D'Aprano

- Unfortunately, I got the opposite and even much worse results on my
Windows box, considering
- libfrancais is one of my modules and it does a little bit more than
the std sorting tools.

How do we know that the problem isn't in your module?

My rationale: very simple.

1) I have never heard of anything better than sticking with one of the
Unicode coding schemes. (general theory)

Your ignorance is not a good reason for abandoning a powerful software
technique.


2) I am not at all convinced by the "new" Py 3.3 algorithm. I'm not the
only one who has noticed problems.

That's nice.

Nobody has yet displayed genuine performance problems, only artificial
and platform-dependent slowdowns that are insignificant in practice. If
you can demonstrate genuine problems, people will be interested in fixing
them.

Let me be frank: nobody gives a damn if, for some rare circumstances,
some_string.replace(another_string) takes 0.3μs instead of 0.1μs.
Overall, considering multiple platforms and dozens of different string
operations, PEP 393 is a big win:

- many operations are faster
- a few operations are a LOT faster
- but a very few operations are sometimes slower
- many strings will use less memory
- sometimes a LOT less memory
- no more distinction between wide and narrow builds
- characters in the supplementary planes are now, for the first
time in Python, treated correctly by default

That's six wins versus one loss.
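
As a concrete illustration of the last bullet above, here is a minimal
sketch (U+1F40D is just an arbitrary example of a character outside the
Basic Multilingual Plane):

# A character outside the Basic Multilingual Plane (U+1F40D).
s = '\U0001F40D'

# Python 3.3 (and 3.2 wide builds): one code point, one character.
print(len(s))     # 1
print(s == s[0])  # True

# On a Python 3.2 narrow build the same literal was stored as a
# surrogate pair: len(s) reported 2 and s[0] was the lone high
# surrogate '\ud83d' -- the behaviour PEP 393 does away with.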

Arguing, "it is fast enough", is not a correct answer.

It is *exactly* the correct answer.

Nobody is going to revert this just because your script now runs in 5.7ms
instead of 5.2ms. Who cares?

If you are *seriously* interested in debugging why string code is slower
for you, you can start by running the full suite of Python string
benchmarks: see the stringbench benchmark in the Tools directory of
source installations, or see here:

http://hg.python.org/cpython/file/8ff2f4634ed8/Tools/stringbench
 

Steven D'Aprano

I see that this misconception is widely spread.

I am not familiar enough with the C implementation to tell what Python
3.3 actually does, and the PEP assumes a fair amount of familiarity with
the CPython source. So I welcome corrections.

In fact Python 3.3 uses four kinds of ready strings.

* ASCII. All codes <= U+007F.
* UCS1. All codes <= U+00FF, at least one code > U+007F.
* UCS2. All codes <= U+FFFF, at least one code > U+00FF.
* UCS4. All codes <= U+0010FFFF, at least one code > U+FFFF.

Where UCS1 is equivalent to Latin-1, correct?

UCS2 is what Python 3.2 narrow builds use for all strings, storing
codes > U+FFFF as surrogate pairs.

UCS4 is what Python 3.2 wide builds use for all strings.

This means that Python 3.3 will no longer have surrogate pairs.

Am I right?

Indexing is O(0) for any string.

I think you mean O(1) for constant-time lookups.

Also the string can optionally cache its UTF-8 and wchar_t* representations.

Right, that's the bit that wasn't clear -- the UTF-8 data is a cache, not
the canonical representation.
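
One rough way to observe those four kinds from pure Python is to compare
sys.getsizeof() for equal-length strings whose widest code point differs.
A minimal sketch (the exact byte counts include per-object overhead and
vary by platform, so read them as relative sizes only):

import sys

samples = [
    ('ASCII', 'abcd'),                 # all code points <= U+007F
    ('UCS1 / Latin-1', 'abc\xe9'),     # highest code point U+00E9
    ('UCS2', 'abc\u20ac'),             # highest code point U+20AC
    ('UCS4', 'abc\U0001F40D'),         # highest code point > U+FFFF
]
for kind, s in samples:
    print(kind, len(s), sys.getsizeof(s))

# The storage per code point grows as 1, 1, 2 and 4 bytes respectively,
# plus a fixed header per string object.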
 

Terry Reedy

I am not familiar enough with the C implementation to tell what Python
3.3 actually does, and the PEP assumes a fair amount of familiarity with
the CPython source. So I welcome corrections.



Where UCS1 is equivalent to Latin-1, correct?

UCS2 is what Python 3.2 narrow builds use for all strings, storing
codes > U+FFFF as surrogate pairs.

UCS4 is what Python 3.2 wide builds use for all strings.

This means that Python 3.3 will no longer have surrogate pairs.

Basically, yes. I believe CPython will only use surrogate code points if
one requests errors="surrogateescape" on decoding or explicitly puts them
in a literal (\unnnn or \Ummmmmmmm). The consequences fall under the
'consenting adults' policy.
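
A minimal sketch of the decoding case (the byte \xff is an arbitrary
example of data that is not valid UTF-8):

data = b'abc\xff'   # \xff cannot be decoded as UTF-8 on its own

# errors='surrogateescape' smuggles the undecodable byte through as the
# lone surrogate U+DCFF instead of raising UnicodeDecodeError.
s = data.decode('utf-8', errors='surrogateescape')
print(ascii(s))     # 'abc\udcff'

# Encoding with the same handler restores the original bytes exactly.
print(s.encode('utf-8', errors='surrogateescape') == data)   # True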
 

wxjmfauth

On Sunday, September 2, 2012 at 14:01:18 UTC+2, Serhiy Storchaka wrote:
Hmm, and with locale.strxfrm Python 3.3 is 20% slower than 3.2.

With a memory gain = 0 since my text contains non-latin-1 characters!

jmf
 

Mark Lawrence

On Sunday, September 2, 2012 at 14:01:18 UTC+2, Serhiy Storchaka wrote:

With a memory gain = 0 since my text contains non-latin-1 characters!

jmf

This is getting really funny. Do you make a living writing comedy for
big film or TV studios? Your response to Steven D'Aprano's "That's six
wins versus one loss." should be hilarious. Or do you not respond to
fact-based posts?
 

Peter Otten

On Sunday, September 2, 2012 at 14:01:18 UTC+2, Serhiy Storchaka wrote:

With a memory gain = 0 since my text contains non-latin-1 characters!

I can't confirm this. At least users of wide builds will see a decrease in
memory use:

$ cat strxfrm_getsize.py
import locale
import sys

print("maxunicode:", sys.maxunicode)
locale.setlocale(locale.LC_ALL, "fr_FR.UTF-8")
words = [
    'noël', 'noir', 'nœud', 'noduleux',
    'noétique', 'noèse', 'noirâtre']
print("total size of original strings:",
      sum(sys.getsizeof(s) for s in words))
print(
    "total size of transformed strings:",
    sum(sys.getsizeof(locale.strxfrm(s)) for s in words))

$ python3.2 strxfrm_getsize.py
maxunicode: 1114111
total size of original strings: 584
total size of transformed strings: 980

$ python3.3 strxfrm_getsize.py
maxunicode: 1114111
total size of original strings: 509
total size of transformed strings: 483

The situation is more complex than you suppose -- you need less dogma and
more experiments ;)
 

Terry Reedy

At least users of wide builds will see a decrease in memory use:

Everyone saves, because everyone uses large parts of the stdlib. When
3.3 starts up in a Windows console, there are 56 modules in sys.modules.
With Idle, there are over 130. All the identifiers, all the global,
local, and attribute names are present as ASCII-only strings. Now
multiply that by some reasonable average number of names per module,
keeping in mind that __builtins__ alone has 148 names.

Former narrow build users gain less space but also gain the elimination
of buggy behavior.
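
A rough way to see this on your own machine (a sketch; the counts differ
between platforms, versions and what has already been imported):

import sys
import builtins

print("modules already imported:", len(sys.modules))
print("names in builtins:", len(dir(builtins)))

# Identifiers are ASCII-only, so under PEP 393 they are stored at one
# byte per character; appending a non-Latin-1 character forces a wider
# representation of the same text.
name = "isinstance"
print(sys.getsizeof(name), sys.getsizeof(name + "\u20ac"))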
 

Roy Smith

Indexing is O(0) for any string.

I think you mean O(1) for constant-time lookups.

Why settle for constant time when you can have zero time instead :)
 

Serhiy Storchaka

If you are *seriously* interested in debugging why string code is slower
for you, you can start by running the full suite of Python string
benchmarks: see the stringbench benchmark in the Tools directory of
source installations, or see here:

http://hg.python.org/cpython/file/8ff2f4634ed8/Tools/stringbench

http://hg.python.org/cpython/file/default/Tools/stringbench

However, stringbench is not a good tool for measuring the effectiveness
of the new string representation, because it focuses mainly on ASCII
strings and on comparing strings with bytes.
 

Serhiy Storchaka

This means that Python 3.3 will no longer have surrogate pairs.

Am I right?

As Terry said, basically, yes. Python 3.3 does not need surrogate
pairs, but it does not prevent their creation. You can create a
surrogate code point (U+D800..U+DFFF) intentionally (just as you can
create a lone combining accent or some other senseless isolated code
point), but you are less likely to get one unintentionally.
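
For example, a small sketch (exception details vary slightly between
versions):

s = '\ud800'      # a lone high surrogate, created deliberately
print(len(s))     # 1 -- it is stored as a single code point

# It is not a valid character, though, so a strict codec refuses it.
try:
    s.encode('utf-8')
except UnicodeEncodeError as exc:
    print('cannot encode:', exc.reason)   # e.g. 'surrogates not allowed'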
 

Serhiy Storchaka

I can't confirm this. At least users of wide builds will see a decrease in
memory use:

And only users of wide builds will see a 20% decrease in speed for this
data (with longer strings Python 3.3 will outstrip Python 3.2). This
happens because of the unavoidable UCS2 -> wchar_t and wchar_t -> UCS2
conversions on platforms with a 4-byte wchar_t. On Windows there should
be no slowdown.
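
To measure this on your own platform, a minimal timing sketch that can be
run unchanged under both interpreters (the word list is borrowed from the
example above; the repeat count is arbitrary):

import timeit

setup = (
    "import locale\n"
    "locale.setlocale(locale.LC_ALL, 'fr_FR.UTF-8')\n"
    "words = ['noël', 'noir', 'nœud', 'noduleux', "
    "'noétique', 'noèse', 'noirâtre']"
)
# Run this file under python3.2 and python3.3 and compare the timings.
print(timeit.timeit("[locale.strxfrm(w) for w in words]",
                    setup=setup, number=100000))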
 

Ian Kelly

Hmm, and with locale.strxfrm Python 3.3 is 20% slower than 3.2.

Doh! In Python 3.3, strcoll and strxfrm are the same speed, so I
guess that the actual optimization I'm seeing here is that in Python
3.3, cmp_to_key(strcoll) has been optimized to return strxfrm.
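
For reference, the two approaches being compared, as a small sketch (both
should produce the same locale-aware ordering; only their speed differs):

import locale
from functools import cmp_to_key

locale.setlocale(locale.LC_ALL, 'fr_FR.UTF-8')
words = ['noël', 'noir', 'nœud', 'noduleux', 'noétique', 'noèse', 'noirâtre']

# key=strxfrm transforms every word once and compares the results.
by_strxfrm = sorted(words, key=locale.strxfrm)

# cmp_to_key(strcoll) calls strcoll for each pairwise comparison.
by_strcoll = sorted(words, key=cmp_to_key(locale.strcoll))

print(by_strxfrm == by_strcoll)   # True
print(by_strxfrm)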
 
