Flexible string representation, unicode, typography, ...

Steven D'Aprano · Aug 31, 2012

Steven D'Aprano said:
Steven D'Aprano said:

Is the implementation smart enough to know that x == y is always
False if x and y are using different internal representations?

Click to expand...

[...] There may be circumstances where two strings have different
internal representations even though their content is the same

Click to expand...

If there is a deterministic algorithm which maps string content to
representation type, then I don't see how it's possible for two strings
with different representation types to have the same content. Could you
give me an example of when this might happen?

There are deterministic algorithms which can result in the same result
with two different internal formats. Here's an example from Python 2:

py> sum([1, 2**30, -2**30, 2**30, -2**30])
1
py> sum([1, 2**30, 2**30, -2**30, -2**30])
1L

The internal representation (int versus long) differs even though the sum
is the same.

A second example: the order of keys in a dict is deterministic but
unpredictable, as it depends on the history of insertions and deletions
into the dict. So two dicts could be equal, and yet have radically
different internal layout.

One final example: list resizing. Here are two lists which are equal but
have different sizes:

py> a = [0]
py> b = range(10000)
py> del b[1:]
py> a == b
True
py> sys.getsizeof(a)
36
py> sys.getsizeof(b)
48

Is PEP 393 another example of this? I have no idea. Somebody who is more
familiar with the details of the implementation would be able to answer
whether or not that is the case. I'm just suggesting that it is possible.

Ian Kelly · Aug 31, 2012

That's one thing that I'm unclear about -- under what circumstances will
a string be in compact versus non-compact form?

I understand it to be entirely dependent on which API is used to
construct. The legacy API generates legacy strings, and the new API
generates compact strings. From the comments in unicodeobject.h:

/* ASCII-only strings created through PyUnicode_New use the PyASCIIObject
structure. state.ascii and state.compact are set, and the data
immediately follow the structure. utf8_length and wstr_length can be found
in the length field; the utf8 pointer is equal to the data pointer. */

....

Legacy strings are created by PyUnicode_FromUnicode() and
PyUnicode_FromStringAndSize(NULL, size) functions. They become ready
when PyUnicode_READY() is called.

....

/* Non-ASCII strings allocated through PyUnicode_New use the
PyCompactUnicodeObject structure. state.compact is set, and the data
immediately follow the structure. */

Since I'm not sure that this is clear, note that compact vs. legacy
does not describe which character width is used (except that
PyASCIIObject strings are always 1 byte wide). Legacy and compact
strings can each use the 1, 2, or 4 byte representations. "Compact"
merely denotes that the character data is stored inline with the
struct (as opposed to being stored somewhere else and pointed at by
the struct), not the relative size of the string data. Again from the
comments:

Compact strings use only one memory block (structure + characters),
whereas legacy strings use one block for the structure and one block
for characters.

Cheers,
Ian

wxjmfauth · Sep 2, 2012

Le jeudi 30 août 2012 17:01:50 UTC+2, Antoine Pitrou a écrit :

I honestly suggest you shut up until you have a clue.

Désolé Antoine,

I have not the knowledge to dive in the Python code,
but I know what is a character.

The coding of the characters is a domain per se,
independent from the os, from the computer languages.

Before spending time to implement a new algorithm,
maybe it is better to ask, if there is something
better than the actual schemes.

I still remember my thoughts when I read the PEP 393
discussion: "this is not logical", "they do no understand
typography", "atomic character ???", ...

Real world exemples.

import libfrancais
li = ['noël', 'noir', 'nœud', 'noduleux', \ .... 'noétique', 'noèse', 'noirâtre']
r = libfrancais.sortfr(li)
r

Click to expand...

Click to expand...

['noduleux', 'noël', 'noèse', 'noétique', 'nœud', 'noir',
'noirâtre']

(cf "Le Petit Robert")

or

The *letters* satisfying the requirements of the
"Imprimerie nationale".

jmf

wxjmfauth · Sep 2, 2012

Le jeudi 30 août 2012 17:01:50 UTC+2, Antoine Pitrou a écrit :

I honestly suggest you shut up until you have a clue.

Désolé Antoine,

I have not the knowledge to dive in the Python code,
but I know what is a character.

The coding of the characters is a domain per se,
independent from the os, from the computer languages.

Before spending time to implement a new algorithm,
maybe it is better to ask, if there is something
better than the actual schemes.

I still remember my thoughts when I read the PEP 393
discussion: "this is not logical", "they do no understand
typography", "atomic character ???", ...

Real world exemples.

import libfrancais
li = ['noël', 'noir', 'nœud', 'noduleux', \ .... 'noétique', 'noèse', 'noirâtre']
r = libfrancais.sortfr(li)
r

Click to expand...

Click to expand...

['noduleux', 'noël', 'noèse', 'noétique', 'nœud', 'noir',
'noirâtre']

(cf "Le Petit Robert")

or

The *letters* satisfying the requirements of the
"Imprimerie nationale".

jmf

Mark Lawrence · Sep 2, 2012

Le jeudi 30 août 2012 17:01:50 UTC+2, Antoine Pitrou a écrit :
Désolé Antoine,

I have not the knowledge to dive in the Python code,
but I know what is a character.

You're a character, and from my observations on this thread you're very
humorous. YMMV.

The coding of the characters is a domain per se,
independent from the os, from the computer languages.

Before spending time to implement a new algorithm,
maybe it is better to ask, if there is something
better than the actual schemes.

Please write a new PEP indicating how you would correct your perceived
deficiencies with PEP 393 and its implementation.

I still remember my thoughts when I read the PEP 393
discussion: "this is not logical", "they do no understand
typography", "atomic character ???", ...

When PEP 393 was first drafted how much input did you give during the
acceptance process, if any?

Real world exemples.

import libfrancais
li = ['noël', 'noir', 'nœud', 'noduleux', \

Click to expand...

Click to expand...

Why the unneeded continuation character, fancy wasting storage space?

... 'noétique', 'noèse', 'noirâtre']['noduleux', 'noël', 'noèse', 'noétique', 'nœud', 'noir',
'noirâtre']

What has sorting foreign words got to do with the internal representaion
of the individual characters?

(cf "Le Petit Robert")

or

The *letters* satisfying the requirements of the
"Imprimerie nationale".

jmf

I've just rechecked my calendar and it's definitly not 1st April today.
Poor old me I'm baffled as always.

Ian Kelly · Sep 2, 2012

I still remember my thoughts when I read the PEP 393
discussion: "this is not logical", "they do no understand
typography", "atomic character ???", ...

That would indicate one of two possibilities. Either:

1) Everybody in the PEP 393 discussion except for you is clueless
about how to implement a Unicode type; or

2) You are clueless about how to implement a Unicode type.

Taking into account Occam's razor, and also that you seem to be unable
or unwilling to offer a solid rationale for those thoughts, I have to
say that I'm currently leaning toward the second possibility.

Real world exemples.

import libfrancais
li = ['noël', 'noir', 'nœud', 'noduleux', \ ... 'noétique', 'noèse', 'noirâtre']
r = libfrancais.sortfr(li)
r

Click to expand...

Click to expand...

['noduleux', 'noël', 'noèse', 'noétique', 'nœud', 'noir',
'noirâtre']

libfrancais does not appear to be publicly available. It's not listed
in PyPI, and googling for "python libfrancais" turns up nothing
relevant.

Rewriting the example to use locale.strcoll instead:

li = ['noël', 'noir', 'nœud', 'noduleux', 'noétique', 'noèse', 'noirâtre']
import locale
locale.setlocale(locale.LC_ALL, 'French_France') 'French_France.1252'
import functools
sorted(li, key=functools.cmp_to_key(locale.strcoll))

Click to expand...

Click to expand...

['noduleux', 'noël', 'noèse', 'noétique', 'nœud', 'noir', 'noirâtre']

# Python 3.2

import timeit
timeit.repeat("sorted(li, key=functools.cmp_to_key(locale.strcoll))","import functools; import locale; li = ['noël', 'noir', 'nœud', 'noduleux', 'noétique', 'noèse', 'noirâtre']", number=10000)

Click to expand...

Click to expand...

[0.5544277025009592, 0.5370117249557325, 0.5551836677925053]

# Python 3.3

import timeit
timeit.repeat("sorted(li, key=functools.cmp_to_key(locale.strcoll))","import functools; import locale; li = ['noël', 'noir', 'nœud', 'noduleux', 'noétique', 'noèse', 'noirâtre']", number=10000)

Click to expand...

Click to expand...

[0.1421166788364303, 0.12389078130001963, 0.13184190553613462]

As you can see, Python 3.3 is about 77% faster than Python 3.2 on this
example. If this was intended to show that the Python 3.3 Unicode
representation is a regression over the Python 3.2 implementation,
then it's a complete failure as an example.

Peter Otten · Sep 2, 2012

Ian said:
Rewriting the example to use locale.strcoll instead:

There is also locale.strxfrm() which you can use directly:

sorted(li, key=locale.strxfrm)

Mark Lawrence · Sep 2, 2012

I've found the white paper which gives the technical basis for the
claims made by jmf so thought I'd better share in order to explain his
rationale.

http://www.montypython.net/scripts/right-think.php

Serhiy Storchaka · Sep 2, 2012

There is also locale.strxfrm() which you can use directly:

sorted(li, key=locale.strxfrm)

Hmm, and with locale.strxfrm Python 3.3 20% slower than 3.2.

Mark Lawrence · Sep 2, 2012

Hmm, and with locale.strxfrm Python 3.3 20% slower than 3.2.

That's it then I'm giving up with Python. In future I'll be writing
everything in machine code to ensure that I get the fastest possible run
times.

Roy Smith · Sep 2, 2012

Mark Lawrence said:
That's it then I'm giving up with Python. In future I'll be writing
everything in machine code to ensure that I get the fastest possible run
times.

Feh. You software guys are always too willing to sacrifice performance
for convenience. If you really want speed, grab yourself a handful of
chips and a soldering iron.

Ramchandra Apte · Sep 2, 2012

That's it then I'm giving up with Python. In future I'll be writing

everything in machine code to ensure that I get the fastest possible run

times.

--

Cheers.

Mark Lawrence.

please make it *heavily optimized* machine code

Ramchandra Apte · Sep 2, 2012

That's it then I'm giving up with Python. In future I'll be writing

everything in machine code to ensure that I get the fastest possible run

times.

--

Cheers.

Mark Lawrence.

please make it *heavily optimized* machine code

Mark Lawrence · Sep 2, 2012

please make it *heavily optimized* machine code

Goes without saying. First thing I'll concentrate on is removing
superfluous newlines sent by crappy mail clients or similar.

wxjmfauth · Sep 2, 2012

Le dimanche 2 septembre 2012 11:07:35 UTC+2, Ian a écrit :

I still remember my thoughts when I read the PEP 393

Click to expand...

discussion: "this is not logical", "they do no understand

Click to expand...

typography", "atomic character ???", ...

Click to expand...

That would indicate one of two possibilities. Either:

1) Everybody in the PEP 393 discussion except for you is clueless

about how to implement a Unicode type; or

2) You are clueless about how to implement a Unicode type.

Taking into account Occam's razor, and also that you seem to be unable

or unwilling to offer a solid rationale for those thoughts, I have to

say that I'm currently leaning toward the second possibility.

Real world exemples.

import libfrancais
li = ['noël', 'noir', 'nœud', 'noduleux', \

Click to expand...

Click to expand...

... 'noétique', 'noèse', 'noirâtre']

r = libfrancais.sortfr(li)
r

Click to expand...

Click to expand...

['noduleux', 'noël', 'noèse', 'noétique', 'nœud', 'noir',

Click to expand...

'noirâtre']

Click to expand...

libfrancais does not appear to be publicly available. It's not listed

in PyPI, and googling for "python libfrancais" turns up nothing

relevant.

Rewriting the example to use locale.strcoll instead:

li = ['noël', 'noir', 'nœud', 'noduleux', 'noétique', 'noèse', 'noirâtre']
import locale
locale.setlocale(locale.LC_ALL, 'French_France')

Click to expand...

'French_France.1252'

import functools
sorted(li, key=functools.cmp_to_key(locale.strcoll))

Click to expand...

Click to expand...

['noduleux', 'noël', 'noèse', 'noétique', 'nœud', 'noir', 'noirâtre']

# Python 3.2

import timeit
timeit.repeat("sorted(li, key=functools.cmp_to_key(locale.strcoll))", "import functools; import locale; li = ['noël', 'noir', 'nœud', 'noduleux', 'noétique', 'noèse', 'noirâtre']", number=10000)

Click to expand...

Click to expand...

[0.5544277025009592, 0.5370117249557325, 0.5551836677925053]

# Python 3.3

import timeit
timeit.repeat("sorted(li, key=functools.cmp_to_key(locale.strcoll))", "import functools; import locale; li = ['noël', 'noir', 'nœud', 'noduleux', 'noétique', 'noèse', 'noirâtre']", number=10000)

Click to expand...

Click to expand...

[0.1421166788364303, 0.12389078130001963, 0.13184190553613462]

As you can see, Python 3.3 is about 77% faster than Python 3.2 on this

example. If this was intended to show that the Python 3.3 Unicode

representation is a regression over the Python 3.2 implementation,

then it's a complete failure as an example.

- Unfortunately, I got opposite and even much worst results on my win box,
considering
- libfrancais is one of my module and it does a little bit more than
the std sorting tools.

My rationale: very simple.

1) I never heard about something better than sticking with one
of the Unicode coding scheme. (genreral theory)
2) I am not at all convinced by the "new" Py 3.3 algorithm. I'm not the
only one guy, who noticed problems. Arguing, "it is fast enough", is not
a correct answer.

jmf

wxjmfauth · Sep 2, 2012

Le dimanche 2 septembre 2012 11:07:35 UTC+2, Ian a écrit :

I still remember my thoughts when I read the PEP 393

Click to expand...

discussion: "this is not logical", "they do no understand

Click to expand...

typography", "atomic character ???", ...

Click to expand...

That would indicate one of two possibilities. Either:

1) Everybody in the PEP 393 discussion except for you is clueless

about how to implement a Unicode type; or

2) You are clueless about how to implement a Unicode type.

Taking into account Occam's razor, and also that you seem to be unable

or unwilling to offer a solid rationale for those thoughts, I have to

say that I'm currently leaning toward the second possibility.

Real world exemples.

import libfrancais
li = ['noël', 'noir', 'nœud', 'noduleux', \

Click to expand...

Click to expand...

... 'noétique', 'noèse', 'noirâtre']

r = libfrancais.sortfr(li)
r

Click to expand...

Click to expand...

['noduleux', 'noël', 'noèse', 'noétique', 'nœud', 'noir',

Click to expand...

'noirâtre']

Click to expand...

libfrancais does not appear to be publicly available. It's not listed

in PyPI, and googling for "python libfrancais" turns up nothing

relevant.

Rewriting the example to use locale.strcoll instead:

li = ['noël', 'noir', 'nœud', 'noduleux', 'noétique', 'noèse', 'noirâtre']
import locale
locale.setlocale(locale.LC_ALL, 'French_France')

Click to expand...

'French_France.1252'

import functools
sorted(li, key=functools.cmp_to_key(locale.strcoll))

Click to expand...

Click to expand...

['noduleux', 'noël', 'noèse', 'noétique', 'nœud', 'noir', 'noirâtre']

# Python 3.2

import timeit
timeit.repeat("sorted(li, key=functools.cmp_to_key(locale.strcoll))", "import functools; import locale; li = ['noël', 'noir', 'nœud', 'noduleux', 'noétique', 'noèse', 'noirâtre']", number=10000)

Click to expand...

Click to expand...

[0.5544277025009592, 0.5370117249557325, 0.5551836677925053]

# Python 3.3

import timeit
timeit.repeat("sorted(li, key=functools.cmp_to_key(locale.strcoll))", "import functools; import locale; li = ['noël', 'noir', 'nœud', 'noduleux', 'noétique', 'noèse', 'noirâtre']", number=10000)

Click to expand...

Click to expand...

[0.1421166788364303, 0.12389078130001963, 0.13184190553613462]

As you can see, Python 3.3 is about 77% faster than Python 3.2 on this

example. If this was intended to show that the Python 3.3 Unicode

representation is a regression over the Python 3.2 implementation,

then it's a complete failure as an example.

- Unfortunately, I got opposite and even much worst results on my win box,
considering
- libfrancais is one of my module and it does a little bit more than
the std sorting tools.

My rationale: very simple.

1) I never heard about something better than sticking with one
of the Unicode coding scheme. (genreral theory)
2) I am not at all convinced by the "new" Py 3.3 algorithm. I'm not the
only one guy, who noticed problems. Arguing, "it is fast enough", is not
a correct answer.

jmf

Michael Torrie · Sep 2, 2012

My rationale: very simple.

1) I never heard about something better than sticking with one
of the Unicode coding scheme. (genreral theory)
2) I am not at all convinced by the "new" Py 3.3 algorithm. I'm not the
only one guy, who noticed problems. Arguing, "it is fast enough", is not
a correct answer.

If this is true, why were you holding ho Google Go as an example of
doing it right? Certainly Google Go doesn't line up with your rational.
Go has both Strings and Runes. But strings are UTF-8-encoded bytes
strings and Runes are 32-bit integers. They are not interchangeable
without a costly encoding and decoding process. Even worse, indexing a
Go string to get a "Rune" involves some very costly decoding that has to
be done starting at the beginning of the string each time.

In the worst case, Python's strings are as slow as Go because Python
does the exact same thing as Go, but chooses between three encodings
instead of just one. Best case scenario, Python's strings could be much
faster than Go's because indexing through 2 of the 3 encodings is O(1)
because they are constant-width encodings. If as you say, the latin-1
subset of UTF-8 is used, then UTF-8 indexing is O(1) too, otherwise it's
probably O(n).

Dave Angel · Sep 2, 2012

<jmfauth snipped>:
In the worst case, Python's strings are as slow as Go because Python
does the exact same thing as Go, but chooses between three encodings
instead of just one. Best case scenario, Python's strings could be
much faster than Go's because indexing through 2 of the 3 encodings is
O(1) because they are constant-width encodings. If as you say, the
latin-1 subset of UTF-8 is used, then UTF-8 indexing is O(1) too,
otherwise it's probably O(n).

I'm afraid you have it backwards. the Utf-8 version of the
latin-1-compatible characters would be variable length. But my
understanding of the pep is that the internal one-byte format is simply
the lowest order byte of each code point, after assuring that all code
points in the particular string are less than 256. That's going to
coincidentally resemble latin-1's encoding, but since it's an internal
form, the resemblance is irrelevant. Anyway, those one-byte values are
going to be O(1), naturally.

No encoding involved, and no searching nor expanding.

Terry Reedy · Sep 2, 2012

In the worst case, Python's strings are as slow as Go because Python
does the exact same thing as Go, but chooses between three encodings
instead of just one. Best case scenario, Python's strings could be much
faster than Go's because indexing through 2 of the 3 encodings is O(1)

In CPython 3.3, indexing of str text string objects is always O(1) and
it is always indexes and counts code points rather than code units. It
was the latter for narrow builds in 3.2 and before. As a result, single
character (code point) strings had a length of 2 rather than 1 for
extended plane characters. 3.3 corrects this.

Serhiy Storchaka · Sep 2, 2012

And Python's solution uses those: UCS-2, UCS-4, and UTF-8.

I see that this misconception widely spread. In fact Python 3.3 uses
four kinds of ready strings.

* ASCII. All codes <= U+007F.
* UCS1. All codes <= U+00FF, at least one code > U+007F.
* UCS2. All codes <= U+FFFF, at least one code > U+00FF.
* UCS4. All codes <= U+0010FFFF, at least one code > U+FFFF.

Indexing is O(0) for any string.

Also the string can optionally cache UTF-8 and wchar_t* representation.

Chardet, file, ... and the Flexible String Representation	17	Sep 6, 2013
Is Unicode support so hard...	12	Apr 20, 2013
Thinking Unicode	0	Aug 8, 2013
Unicode questions	17	Oct 19, 2010
Flexible (liquid) web design	3	Oct 11, 2013
Verbose and flexible args and kwargs syntax	88	Dec 11, 2011
Python Unicode handling wins again -- mostly	67	Nov 30, 2013
Unicode Chars in Windows Path	12	Apr 3, 2014

Flexible string representation, unicode, typography, ...

Steven D'Aprano

Ian Kelly

wxjmfauth

wxjmfauth

Mark Lawrence

Ian Kelly

Peter Otten

Mark Lawrence

Serhiy Storchaka

Mark Lawrence

Roy Smith

Ramchandra Apte

Ramchandra Apte

Mark Lawrence

wxjmfauth

wxjmfauth

Michael Torrie

Dave Angel

Terry Reedy

Serhiy Storchaka

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads