Flexible string representation, unicode, typography, ...

S

Steven D'Aprano

Quite difficult. Even if we avoid having two or three separate
binaries, we would still have separate binary representations of the
string structs. It makes the maintainability of the software go down
instead of up.

In fairness, there are already multiple binary representations of strings
in Python 3.3:

- ASCII-only strings use a 1-byte format (PyASCIIObject);

- Compact Unicode objects (PyCompactUnicodeObject), which, if I'm reading
correctly, appear to use a non-fixed-width UTF-8 format, but are only
used when the string length and maximum character are known ahead of
time;

- Legacy string objects (PyUnicodeObject), which are not compact, and
which may use as their internal format:

* 1-byte characters for Latin1-compatible strings;

* 2-byte UCS-2 characters for strings in the Basic Multilingual Plane;

* 4-byte UCS-4 characters for strings with at least one non-BMP
character.

http://www.python.org/dev/peps/pep-0393/#specification


By my calculations, that makes *five* different internal formats for
strings, at least two of which are capable of representing all Unicode
characters. I don't think it would add that much additional complexity to
have a runtime option --always-wide-strings to always use the UCS-4
format. For, you know, crazy people with more memory than sense.
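
The per-string storage width is easy to observe from Python code. Here is a
rough sketch using sys.getsizeof (CPython-specific; the exact byte counts vary
by platform and version, so only the per-character slope matters):

    import sys

    def bytes_per_char(ch, n=1000):
        # Approximate per-character cost of a string built by repeating ch.
        small = sys.getsizeof(ch * 10)
        large = sys.getsizeof(ch * (10 + n))
        return (large - small) / n

    print(bytes_per_char('a'))           # ~1.0  ASCII, 1-byte form
    print(bytes_per_char('\u00e9'))      # ~1.0  Latin-1, 1-byte form
    print(bytes_per_char('\u20ac'))      # ~2.0  BMP, 2-byte form
    print(bytes_per_char('\U0001F600'))  # ~4.0  non-BMP, 4-byte form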

But I don't think there's any point in exposing further runtime options
to choose the string representation:

- neither the ASCII nor Latin1 representations can store arbitrary
Unicode chars, so they're out;

- the UTF-8 format is only used under restrictive circumstances, and so
is (probably?) unsuitable for all strings;

- the UCS-2 format can store arbitrary Unicode chars by using surrogate
pairs, but that's troublesome to get right; some might even say buggy.

So instead of having just one test for my Unicode-handling code, I'll
now have to run that same test *three times* -- once for each possible
string engine option. Choice isn't always a good thing.

There is that too.
 
W

wxjmfauth

On Monday, August 27, 2012 at 10:37:03 PM UTC+2, (unknown) wrote:
On Monday, August 27, 2012 at 10:14:07 PM UTC+2, Ian wrote:

I know all this. The question is more, why not a uint32, knowing
there are only positive code points. It seems to me more "natural".

Answer found. In short: using negative ints
simplifies internal tasks.
 
W

wxjmfauth

On Wednesday, August 29, 2012 at 6:16:05 AM UTC+2, Ian wrote:
As has been pointed out repeatedly already, this is a microbenchmark.
jmf is focusing on one particular area (string construction) where
Python 3.3 happens to be slower than Python 3.2, ignoring the fact
that real code usually does lots of things other than building
strings, many of which are slower to begin with. In the real-world
benchmarks that I've seen, 3.3 is as fast as or faster than 3.2.

Here's a much more realistic benchmark that nonetheless still focuses
on strings: word counting.

Source: http://pastebin.com/RDeDsgPd

C:\Users\Ian\Desktop>c:\python32\python -m timeit -s "import wc" "wc.wc('unilang8.htm')"
1000 loops, best of 3: 310 usec per loop

C:\Users\Ian\Desktop>c:\python33\python -m timeit -s "import wc" "wc.wc('unilang8.htm')"
1000 loops, best of 3: 302 usec per loop

"unilang8.htm" is an arbitrary UTF-8 document containing a broad swath
of Unicode characters that I pulled off the web. Even though this
program is still mostly string processing, Python 3.3 wins. Of
course, that's not really a very good test -- since it reads the file
on every pass, it probably spends more time in I/O than it does in
actual processing. Let's try it again with prepared string data:

C:\Users\Ian\Desktop>c:\python32\python -m timeit -s "import wc; t =
open('unilang8.htm', 'r', encoding='utf-8').read()" "wc.wc_str(t)"
10000 loops, best of 3: 87.3 usec per loop

C:\Users\Ian\Desktop>c:\python33\python -m timeit -s "import wc; t =
open('unilang8.htm', 'r', encoding='utf-8').read()" "wc.wc_str(t)"
10000 loops, best of 3: 84.6 usec per loop

Nope, 3.3 still wins. And just for the sake of my own curiosity, I
decided to try it again using str.split() instead of a StringIO.
Since str.split() creates more strings, I expect Python 3.2 might
actually win this time.

C:\Users\Ian\Desktop>c:\python32\python -m timeit -s "import wc; t =
open('unilang8.htm', 'r', encoding='utf-8').read()" "wc.wc_split(t)"
10000 loops, best of 3: 88 usec per loop

C:\Users\Ian\Desktop>c:\python33\python -m timeit -s "import wc; t =
open('unilang8.htm', 'r', encoding='utf-8').read()" "wc.wc_split(t)"
10000 loops, best of 3: 76.5 usec per loop

Interestingly, although Python 3.2 performs the splits in about the
same time as the StringIO operation, Python 3.3 is significantly
*faster* using str.split(), at least on this data set.

Quite difficult. Even if we avoid having two or three separate
binaries, we would still have separate binary representations of the
string structs. It makes the maintainability of the software go down
instead of up.

So instead of having just one test for my Unicode-handling code, I'll
now have to run that same test *three times* -- once for each possible
string engine option. Choice isn't always a good thing.

Forget Python and all these benchmarks. The problem
is on another level. Coding schemes, typography,
usage of characters, ...

For a given coding scheme, all code points/characters are
equivalent. Expecting to handle a sub-range in a coding
scheme without shaking that coding scheme is impossible.

If a coding scheme does not give satisfaction, the only
valid solution is to create a new coding scheme, cp1252,
mac-roman, EBCDIC, ... or the interesting "TeX" case, where
the "internal" coding depends on the fonts!

Unicode (utf***), as just another coding scheme, does
not escape this rule.

This "Flexible String Representation" fails. Not only
is it unable to stick with a coding scheme, it is
a mix of coding schemes, the worst of all possible
implementations.

jmf
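
For reference, the pastebin source quoted above is not reproduced in the
thread. The following is only a rough reconstruction of what such a wc module
might look like: the names wc, wc_str and wc_split are taken from the quoted
timeit commands, and the use of collections.Counter, io.StringIO and
str.split() is guessed from Ian's description; everything else is assumption.

    import io
    from collections import Counter

    def wc_str(text):
        # Count words by reading the text line by line through a StringIO.
        counts = Counter()
        for line in io.StringIO(text):
            counts.update(line.split())
        return counts

    def wc_split(text):
        # Count words by splitting the whole string at once.
        return Counter(text.split())

    def wc(filename):
        # Count words in a UTF-8 encoded file.
        with open(filename, 'r', encoding='utf-8') as f:
            return wc_str(f.read())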
 
D

Dave Angel

Forget Python and all these benchmarks. The problem is on another
level. Coding schemes, typography, usage of characters, ... For a
given coding scheme, all code points/characters are equivalent.
Expecting to handle a sub-range in a coding scheme without shaking
that coding scheme is impossible. If a coding scheme does not give
satisfaction, the only valid solution is to create a new coding
scheme, cp1252, mac-roman, EBCDIC, ... or the interesting "TeX" case,
where the "internal" coding depends on the fonts! Unicode (utf***), as
just another coding scheme, does not escape this rule. This
"Flexible String Representation" fails. Not only is it unable to stick
with a coding scheme, it is a mix of coding schemes, the worst of
all possible implementations. jmf

Nonsense. The discussion was not about an encoding scheme, but an
internal representation. That representation does not change the
programmer's interface in any way other than performance (cpu and memory
usage). Most of the rest of your babble is unsupported opinion.

Plonk.
 
C

Chris Angelico

For a given coding scheme, all code points/characters are
equivalent. Expecting to handle a sub-range in a coding
scheme without shaking that coding scheme is impossible.

Not all codepoints are equally likely. That's the whole point behind
variable-length encodings like Huffman compression (eg deflation as
used in zip/gzip), UTF-8, quoted-printable, and Morse code. They
handle a sub-range efficiently and the rest of the range less
efficiently.
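
UTF-8 itself illustrates the point: the lower the code point, the fewer bytes
it takes. A quick check (the sample characters are arbitrary, one from each
size class):

    for ch in ('a', '\u00e9', '\u20ac', '\U0001F600'):
        print(repr(ch), len(ch.encode('utf-8')))   # prints 1, 2, 3 and 4
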
If a coding scheme does not give satisfaction, the only
valid solution is to create a new coding scheme, cp1252,
mac-roman, EBCDIC, ... or the interesting "TeX" case, where
the "internal" coding depends on the fonts!
http://xkcd.com/927/

This "Flexible String Representation" fails. Not only
it is unable to stick with a coding scheme, it is
a mixing of coding schemes, the worst of all possible
implementations.

I propose, then, that we abolish files. Who *knows* how many different
things might be represented in a file! We need a single coding scheme
that can handle everything, without changing representation. This
ridiculous state of affairs must not go on; the same representation
can be used for bitmapped images or raw audio data!

ChrisA
 
W

wxjmfauth

On Wednesday, August 29, 2012 at 2:01:57 PM UTC+2, Dave Angel wrote:
Nonsense. The discussion was not about an encoding scheme, but an
internal representation. That representation does not change the
programmer's interface in any way other than performance (cpu and memory
usage). Most of the rest of your babble is unsupported opinion.

I can hit the nail a little more.
I have an even better idea, and I'm serious.

If "Python" has found a new way to cover the set
of the Unicode characters, why not propose it
to the Unicode consortium?

Unicode already has three schemes covering practically
all cases: memory consumption, maximum flexibility and
an intermediate solution.
It would be too bad not to share it.

What do you think? ;-)

jmf
 
C

Chris Angelico

If "Python" has found a new way to cover the set
of the Unicode characters, why not proposing it
to the Unicode consortium?

Python's open source. If some other language wants to borrow the idea,
they can look at the code, or alternatively, just read PEP 393 and
implement something similar. It's a free world.

By the way, can you please trim the quoted text in your replies? It's
rather lengthy.

ChrisA
 
S

Steven D'Aprano

I can hit the nail a little more.
I have an even better idea, and I'm serious.

If "Python" has found a new way to cover the set of the Unicode
characters, why not propose it to the Unicode consortium?

Because the implementation of the str datatype in a programming language
has nothing to do with the Unicode consortium. You might as well propose
it to the International Union of Railway Engineers.

Unicode already has three schemes covering practically all cases: memory
consumption, maximum flexibility and an intermediate solution.

And Python's solution uses those: UCS-2, UCS-4, and UTF-8.

The only thing which is innovative here is that instead of the Python
compiler declaring that "all strings will be stored in UCS-2", the
compiler chooses an implementation for each string as needed. So some
strings will be stored internally as UCS-4, some as UCS-2, and some as
ASCII (which is a standard, but not the Unicode consortium's standard).

(And possibly some as UTF-8? I'm not entirely sure from reading the PEP.)

There's nothing radical here, honest.
 
W

wxjmfauth

On Thursday, August 30, 2012 at 8:55:01 AM UTC+2, Steven D'Aprano wrote:


You are right.

But as soon as you artificially introduce a "latin-1"
bottleneck, all this machinery just becomes useless.

This flexible representation works absurdly.
It optimizes the characters you are not using (in one
sense), and defaults to a non-optimized form for the
characters you wish to use.

Pick up a random text and see the probability that this
text matches the most optimized case, 1 char / 1 byte:
practically never.

If a user works exclusively with latin-1, she/he is better
served by using a dedicated tool for "latin-1".

If a user works comfortably with Unicode, she/he is
better served by using one of the tools which properly
uses one of the available Unicode schemes.

In a funny way, this is what Python was doing and it
performs better!

(Enough for today; *I* should spend my spare time
toying with Go. This discussion gave *me* the wish
to dive into it again.)

jmf
 
C

Chris Angelico

Pick up a random text and see the probability that this
text matches the most optimized case, 1 char / 1 byte:
practically never.

Only if you talk about a huge document. Try, instead, every string
ever used in a Python script.

Practically always.

But I'm wasting my time saying this again. It's been said by multiple
people multiple times.

ChrisA
 
R

Roy Smith

Steven D'Aprano said:
The only thing which is innovative here is that instead of the Python
compiler declaring that "all strings will be stored in UCS-2", the
compiler chooses an implementation for each string as needed. So some
strings will be stored internally as UCS-4, some as UCS-2, and some as
ASCII (which is a standard, but not the Unicode consortium's standard).

Is the implementation smart enough to know that x == y is always False
if x and y are using different internal representations?
 
S

Steven D'Aprano

Is the implementation smart enough to know that x == y is always False
if x and y are using different internal representations?

But x == y is not necessarily False just because x and y have
different representations. There may be circumstances where two strings
have different internal representations even though their content is the
same, so it's an unsafe optimization to automatically treat them as
unequal.

The closest existing equivalent here is the relationship between ints and
longs in Python 2. 42 == 42L even though they have different internal
representations and take up a different amount of space.
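
For example, the same content can be produced by quite different construction
paths, and equality has to hold by content regardless of whatever internal
form each string ends up with:

    a = 'caf\u00e9'                         # from a literal
    b = b'caf\xc3\xa9'.decode('utf-8')      # decoded from UTF-8 bytes
    c = ''.join(['c', 'a', 'f', '\u00e9'])  # built up incrementally
    assert a == b == c                       # compared by content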


My expectation is that the initial implementation of PEP 393 will be
relatively unoptimized, and over the next few releases it will get more
efficient. That's usually the way these things go.
 
I

Ian Kelly

But as soon as you artificially introduce a "latin-1"
bottleneck, all this machinery just becomes useless.

How is this a bottleneck? If you removed the Latin-1 encoding
altogether and limited the flexible representation to just UCS-2 /
UCS-4, I doubt very much that you would see any significant speed
gains. The flexibility is the part that makes string creation slower,
not the Latin-1 option in particular.
This flexible representation works absurdly.
It optimizes the characters you are not using (in one
sense), and defaults to a non-optimized form for the
characters you wish to use.

I'm sure that if you wanted to you could patch Python to use Latin-9
instead. Just be prepared for it to be slower than UCS-2, since it
would mean having to encode the code points rather than merely
truncating them.
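
Truncation works for Latin-1 because its byte values coincide with the first
256 Unicode code points, which is easy to verify (the string below is just an
arbitrary example):

    s = '\u00c5ngstr\u00f6m'    # "Angstrom" with diacritics; every code point is <= 255
    assert s.encode('latin-1') == bytes(ord(c) for c in s)
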
Pick up a random text and see the probability that this
text matches the most optimized case, 1 char / 1 byte:
practically never.

Pick up a random text and see that this text matches the next most
optimized case, 1 char / 2 bytes: practically always.
If a user works exclusively with latin-1, she/he is better
served by using a dedicated tool for "latin-1".

Speaking as a user who almost exclusively uses Latin-1, I strongly
disagree. What you're describing is Python 2.x. The user is almost
always better served by not having to worry about the full extent of
the character set their program might use. That's why we moved to
Unicode strings in Python 3 in the first place.
If a user works comfortably with Unicode, she/he is
better served by using one of the tools which properly
uses one of the available Unicode schemes.

In a funny way, this is what Python was doing and it
performs better!

Seriously, please show us just one *real world* benchmark in which
Python 3.3 performs demonstrably worse than Python 3.2. All you've
shown so far is this one microbenchmark of string creation that is
utterly irrelevant to actual programs.
 
T

Terry Reedy

Yes, after checking lengths, and in such circumstances, x != y is True. From
http://hg.python.org/cpython/file/ab6ab44921b2/Objects/unicodeobject.c

PyObject *
PyUnicode_RichCompare(PyObject *left, PyObject *right, int op)
{
    int result;

    if (PyUnicode_Check(left) && PyUnicode_Check(right)) {
        PyObject *v;
        if (PyUnicode_READY(left) == -1 ||
            PyUnicode_READY(right) == -1)
            return NULL;
        if (PyUnicode_GET_LENGTH(left) != PyUnicode_GET_LENGTH(right) ||
            PyUnicode_KIND(left) != PyUnicode_KIND(right)) {
            if (op == Py_EQ) {
                Py_INCREF(Py_False);
                return Py_False;
            }
            if (op == Py_NE) {
                Py_INCREF(Py_True);
                return Py_True;
            }
        }
        ....

KIND is 1, 2, or 4 bytes/char.

'a in s' is also False if a chars are wider than s chars.

If s is all ascii, s.encode('ascii') or s.encode('utf-8') is a fast,
constant time operation, as I showed earlier in this discussion. This is
one thing that is much faster in 3.3.

Such things can be tested by timing with different lengths of strings,
where the initial string creation is done in setup code rather than in
the repeated operation code.
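
A sketch of that kind of timing, with the string built in the setup argument
so that only the encode is measured (the sizes chosen here are arbitrary):

    import timeit

    for n in (100, 10000, 1000000):
        t = timeit.timeit("s.encode('utf-8')",
                          setup="s = 'a' * %d" % n,
                          number=10000)
        print(n, t)
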
But x == y is not necessarily False just because x and y have
different representations. There may be circumstances where two strings
have different internal representations even though their content is the
same, so it's an unsafe optimization to automatically treat them as
unequal.

I am sure that str objects are always in canonical form once visible to
Python code. Note that unready (non-canonical) objects are rejected by
the rich comparison function.
My expectation is that the initial implementation of PEP 393 will be
relatively unoptimized,

The initial implementation was a year ago. At least three people have
expended considerable effort improving it since, so that the slowdown
mentioned in the PEP has mostly disappeared. The things that are still
slower are somewhat balanced by things that are faster.
 
S

Steven D'Aprano

Yes, after checking lengths, and in such circumstances, x != y is True.
[snip C code]

Thanks Terry for looking that up.
'a in s' is also False if a chars are wider than s chars.

Now that's a nice optimization!

[...]
I am sure that str objects are always in canonical form once visible to
Python code. Note that unready (non-canonical) objects are rejected by
the rich comparison function.

That's one thing that I'm unclear about -- under what circumstances will
a string be in compact versus non-compact form? Reading between the
lines, I guess that a lot of the complexity of the implementation only
occurs while a string is being built. E.g. if you have Python code like
this:

''.join(str(x) for x in something) # a generator expression

Python can't tell how much space to allocate for the string -- it doesn't
know either the overall length of the string or the width of the
characters. So I presume that there is string builder code for dealing
with that, and that it involves resizing blocks of memory.

But if you do this:

''.join([str(x) for x in something]) # a list comprehension

Python could scan the list first, find out the widest char, and allocate
exactly the amount of space needed for the string. Even in Python 2,
joining a list comp is much faster than joining a gen expression.
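
That claim is easy to re-check on any given build with timeit; a small sketch
(the data and repeat counts are arbitrary):

    import timeit

    setup = "data = list(range(1000))"
    genexp = timeit.timeit("''.join(str(x) for x in data)", setup=setup, number=1000)
    listcomp = timeit.timeit("''.join([str(x) for x in data])", setup=setup, number=1000)
    print('genexp:', genexp, 'listcomp:', listcomp)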
 
R

Roy Smith

Steven D'Aprano said:
Is the implementation smart enough to know that x == y is always False
if x and y are using different internal representations?

[...] There may be circumstances where two strings have different
internal representations even though their content is the same

If there is a deterministic algorithm which maps string content to
representation type, then I don't see how it's possible for two strings
with different representation types to have the same content. Could you
give me an example of when this might happen?
 
