'Straße' ('Strasse') and Python 2

Chris Angelico · Jan 15, 2014

On Wednesday, 15 January 2014 13:13:36 UTC+1, Ned Batchelder wrote:

Yes.
http://www.unicode.org/faq/char_combmark.html

No

Yes.
http://www.unicode.org/reports/tr17/
Specifically:
"Character Encoding Form: a mapping from a set of nonnegative integers
that are elements of a CCS to a set of sequences of particular code
units of some specified width, such as 32-bit integers"

Or are you saying that www.unicode.org is wrong about the definitions
of Unicode terms?

ChrisA

Travis Griggs · Jan 15, 2014

On 15/01/2014 12:13, Ned Batchelder wrote:
........
Semantics is everything. For me graphemes are the endpoint (or should be); to get a proper rendering of a sequence of graphemes I can use either a sequence of bytes or a sequence of codepoints. They are both encodings of the graphemes; what unicode says is an encoding doesn't define what encodings are ie mappings from some source alphabet to a target alphabet.

But you’re talking about two levels of encoding. One runs on top of the other. So insisting that you be able to call them all encodings makes the term pointless, because now it’s ambiguous as to what you’re referring to. Are you referring to encoding in the sense of representing code points with bytes? Or are you referring to what the unicode guys call “forms”?

For example, the NFC form of ‘ñ’ is ‘\u00F1’. The NFD form represents the exact same grapheme, but is ‘\u006e\u0303’. You can call them encodings if you want, but I echo Ned’s sentiment that you keep that to yourself. Conventionally, they’re different forms, not different encodings. You can encode either form with an encoding, e.g.

'\u00F1'.encode('utf8')
'\u00F1'.encode('utf16')

'\u006e\u0303'.encode('utf8')
'\u006e\u0303'.encode('utf16')
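
The distinction between forms and encodings can be tried out directly. The following is a minimal sketch, assuming Python 3 and the standard `unicodedata` module; the variable names are only illustrative, and `utf-16-le` is chosen here so that no byte order mark appears in the output.

```python
import unicodedata

# Normalization changes the code point sequence, not the byte encoding.
nfc = unicodedata.normalize("NFC", "\u006e\u0303")  # composed: one code point
nfd = unicodedata.normalize("NFD", "\u00f1")        # decomposed: 'n' + combining tilde

# Same grapheme, different numbers of code points.
print(len(nfc), len(nfd))

# Either form can then be encoded to bytes with any encoding.
for form_name, text in (("NFC", nfc), ("NFD", nfd)):
    for codec in ("utf-8", "utf-16-le"):
        print(form_name, codec, text.encode(codec))
```

Normalizing both strings to the same form before comparing them is the usual way to treat the two spellings as equal.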

Robin Becker · Jan 15, 2014

On 15/01/2014 16:28, Travis Griggs wrote:
......... of a sequence of graphemes I can use either a sequence of bytes or a
sequence of codepoints. They are both encodings of the graphemes; what unicode
says is an encoding doesn't define what encodings are ie mappings from some
source alphabet to a target alphabet.

But you’re talking about two levels of encoding. One runs on top of the other. So insisting that you be able to call them all encodings makes the term pointless, because now it’s ambiguous as to what you’re referring to. Are you referring to encoding in the sense of representing code points with bytes? Or are you referring to what the unicode guys call “forms”?

For example, the NFC form of ‘ñ’ is ‘\u00F1’. The NFD form represents the exact same grapheme, but is ‘\u006e\u0303’. You can call them encodings if you want, but I echo Ned’s sentiment that you keep that to yourself. Conventionally, they’re different forms, not different encodings. You can encode either form with an encoding, e.g.

'\u00F1'.encode('utf8')
'\u00F1'.encode('utf16')

'\u006e\u0303'.encode('utf8')
'\u006e\u0303'.encode('utf16')

I think about these as encodings, because that's what they are mathematically,
logically & practically. I can encode the target grapheme sequence as a sequence
of bytes using a particular 'unicode encoding' eg utf8 or a sequence of code points.

The fact that unicoders want to take over the meaning of encoding is not relevant.

In my utf8 bash shell the python print() takes one encoding (python3 str) and
translates that to the stdout encoding which happens to be utf8 and passes that
to the shell which probably does a lot of work to render the result as graphical
symbols (or graphemes).

I'm not anti unicode, that's just an assignment of identity to some symbols.
Coding the values of the ids is a separate issue. It's my belief that we don't
need more than the byte level encoding to represent unicode. One of the claims
made for python3 unicode is that it somehow eliminates the problems associated
with other encodings eg utf8, but in fact they will remain until we force
printers/designers to stop using complicated multi-codepoint graphemes. I
suspect that won't happen.

Chris Angelico · Jan 15, 2014

I think about these as encodings, because that's what they are
mathematically, logically & practically. I can encode the target grapheme
sequence as a sequence of bytes using a particular 'unicode encoding' eg
utf8 or a sequence of code points.

By that definition, you can equally encode it as a bitmapped image, or
as a series of lines and arcs, and those are equally well "encodings"
of the character. This is not the normal use of that word.

http://en.wikipedia.org/wiki/Character_encoding

ChrisA

Robin Becker · Jan 15, 2014

By that definition, you can equally encode it as a bitmapped image, or
as a series of lines and arcs, and those are equally well "encodings"
of the character. This is not the normal use of that word.

http://en.wikipedia.org/wiki/Character_encoding

ChrisA

Actually I didn't use the term 'character encoding', but that doesn't alter the
argument. If I chose to embed the final graphemes as images encoded as bytes or
lists of numbers, that would still be an encoding; it just wouldn't be
very easily usable (lots of typing).

Ian Kelly · Jan 15, 2014

The fact that unicoders want to take over the meaning of encoding is not
relevant.

A virus is a small infectious agent that replicates only inside the
living cells of other organisms. In the context of computing however,
that definition is completely false, and if you insist upon it when
trying to talk about computers, you're only going to confuse people as
to what you mean. Somehow, I haven't seen any biologists complaining
that computer users want to take over the meaning of virus.

Terry Reedy · Jan 16, 2014

The fact that unicoders want to take over the meaning of encoding is not
relevant.

I agree with you that 'encoding' should not be limited to 'byte encoding
of a (subset of) unicode characters'. For instance, .jpg and .png are
byte encodings of images. On the other hand, it is common in human
discourse to omit qualifiers in particular contexts. 'Computer virus'
gets condensed to 'virus' in computer contexts.

The problem with graphemes is that there is no fixed set of unicode
graphemes. Which is to say, the effective set of graphemes is
context-specific. Just limiting ourselves to English, 'fi' is usually 2
graphemes when printing to screen, but often just one when printing to
paper. This is why the Unicode consortium punted 'graphemes' to
'application' code.

I'm not anti unicode, that's just an assignment of identity to some
symbols. Coding the values of the ids is a separate issue. It's my
belief that we don't need more than the byte level encoding to represent
unicode. One of the claims made for python3 unicode is that it somehow
eliminates the problems associated with other encodings eg utf8,

The claim is true for the following problems of the way-too-numerous
unicode byte encodings.

Subsetting: only a subset of characters can be encoded.

Shifting: the meaning of a byte depends on a preceding shift character,
which might be back at the beginning of the sequence.

Varying size: the number of bytes to encode a character depends on the
character.

Both of the last two problems can turn O(1) operations into O(n)
operations. 3.3+ eliminates all these problems.
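
To make the "varying size" point concrete, here is a small sketch, assuming Python 3. The helper `utf8_index` is hypothetical (not a standard library function); it shows why finding the n-th character in UTF-8 bytes means scanning from the start, whereas indexing a str is by code point. That str indexing is constant time is a property of the CPython 3.3+ implementation, not something this snippet proves.

```python
text = "a\u00f1\u20ac\U0001d11e"  # 'a', n-tilde, euro sign, musical G clef
data = text.encode("utf-8")

# Each code point takes a different number of bytes in UTF-8.
widths = [len(ch.encode("utf-8")) for ch in text]
print(widths)


def utf8_index(data, n):
    """Return the n-th code point of UTF-8 bytes by scanning from the start.

    Hypothetical helper for illustration; assumes `data` is valid UTF-8.
    """
    seen = -1
    for start, byte in enumerate(data):
        if (byte & 0xC0) != 0x80:  # not a continuation byte: a code point starts here
            seen += 1
            if seen == n:
                end = start + 1
                while end < len(data) and (data[end] & 0xC0) == 0x80:
                    end += 1
                return data[start:end].decode("utf-8")
    raise IndexError("code point index out of range")


# str indexing addresses code points directly; the byte version has to walk.
print(text[2] == utf8_index(data, 2))
```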

Steven D'Aprano · Jan 16, 2014

Yes.
http://www.unicode.org/reports/tr17/
Specifically:
"Character Encoding Form: a mapping from a set of nonnegative integers
that are elements of a CCS to a set of sequences of particular code
units of some specified width, such as 32-bit integers"

Technically Unicode talks about mapping code points and code *units*, but
since code units are defined in terms of bytes, I think it is fair to cut
out one layer of indirection and talk about mapping code points to bytes.
For instance, UTF-32 uses 4-byte code units, and every code point U+0000
through U+10FFFF is mapped to a single code unit, which is always a four-
byte quantity. UTF-8, on the other hand, uses single-byte code units, and
maps code points to a variable number of code units, so UTF-8 maps code
points to either 1, 2, 3 or 4 bytes.
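
The code unit counts described above can be checked with a short sketch, assuming Python 3. The function name `code_units` is made up for this example, and the `-le` codec variants are used so that no byte order mark is counted.

```python
def code_units(ch, encoding, unit_size):
    """Number of code units one code point occupies in a BOM-free encoding."""
    return len(ch.encode(encoding)) // unit_size


samples = ["A", "\u00df", "\u20ac", "\U0001f600"]
for ch in samples:
    print(
        "U+%04X" % ord(ch),
        "utf-8:", code_units(ch, "utf-8", 1),       # 1-byte code units
        "utf-16:", code_units(ch, "utf-16-le", 2),  # 2-byte code units
        "utf-32:", code_units(ch, "utf-32-le", 4),  # 4-byte code units
    )
```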

Or are you saying that www.unicode.org is wrong about the definitions of
Unicode terms?

No, I think he is saying that he doesn't know Unicode anywhere near as
well as he thinks he does. The question is, will he cherish his
ignorance, or learn from this thread?

Steven D'Aprano · Jan 16, 2014

so two 'characters' are 3 (or 2 or more) codepoints.
Yes.

If I want to isolate so called graphemes I need an algorithm even
for python's unicode

Correct. Graphemes are language dependent, e.g. in Dutch "ij" is usually
a single grapheme, in English it would be counted as two. Likewise, in
Czech, "ch" is a single grapheme. The Latin form of Serbo-Croatian has
two two-letter graphemes, Dž and Nj (it used to have three, but Dj is now
written as Đ).

Worse, linguists sometimes disagree as to what counts as a grapheme. For
instance, some authorities consider the English "sh" to be a separate
grapheme. As a native English speaker, I'm not sure about that. Certainly
it isn't a separate letter of the alphabet, but on the other hand I can't
think of any words containing "sh" that should be considered as two
graphemes "s" followed by "h". Wait, no, that's not true... compound
words such as "glasshouse" or "disheartened" are counter examples.

ie when it really matters, python3 str is just another encoding.

I'm not entirely sure how a programming language data type (str) can be
considered a transformation.
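
On the earlier point that isolating graphemes needs an algorithm even with Python's str: the sketch below, assuming Python 3 and the standard `unicodedata` module, groups each base code point with the combining marks that follow it. The function `combining_clusters` is a hypothetical helper and only a rough approximation; it does not implement the full Unicode text segmentation rules, and it knows nothing about language-specific digraphs such as "ij" or "ch".

```python
import unicodedata


def combining_clusters(text):
    """Group each base code point with any combining marks that follow it.

    Rough approximation for illustration only.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch  # attach the mark to the preceding base
        else:
            clusters.append(ch)
    return clusters


word = "man\u0303ana"  # 'manana' with a decomposed n-tilde
print(len(word))                      # counts code points
print(len(combining_clusters(word)))  # counts base + mark groups
```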

Chris Angelico · Jan 16, 2014

Worse, linguists sometimes disagree as to what counts as a grapheme. For
instance, some authorities consider the English "sh" to be a separate
grapheme. As a native English speaker, I'm not sure about that. Certainly
it isn't a separate letter of the alphabet, but on the other hand I can't
think of any words containing "sh" that should be considered as two
graphemes "s" followed by "h". Wait, no, that's not true... compound
words such as "glasshouse" or "disheartened" are counter examples.

Digression: When I was taught basic English during my school days, my
mum used Spalding's book and the 70 phonograms. 25 of them are single
letters (Q is not a phonogram - QU is), and the others are mostly
pairs (there are a handful of 3- and 4-letter phonograms). Not every
instance of "s" followed by "h" is the phonogram "sh" - only the times
when it makes the single sound "sh" (which it doesn't in "glasshouse"
or "disheartened").

Thing is, you can't define spelling and pronunciation in terms of each
other, because you'll always be bitten by corner cases. Everyone knows
how "Thames" is pronounced... right? Well, no. There are (at least)
two rivers of that name, the famous one in London [1] and another one
further north [2]. The obscure one is pronounced the way the word
looks, the famous one isn't. And don't even get started on English
family names... Marjoribanks, Meux and Cholmondeley, as lampshaded [3]
in this song [4]! Even without names, though, there are the tricky
cases and the ones where different localities pronounce the same word
very differently; Unicode shouldn't have to deal with that by changing
whether something's a single character or two. Considering that
phonograms aren't even ligatures (though there is overlap, eg "Th"),
it's much cleaner to leave them as multiple characters.

ChrisA

[1] https://en.wikipedia.org/wiki/River_Thames
[2] Though it's better known as the Isis. https://en.wikipedia.org/wiki/The_Isis
[3] http://tvtropes.org/pmwiki/pmwiki.php/Main/LampshadeHanging
[4] http://www.stagebeauty.net/plays/th-arca2.html - "Mosh-banks",
"Mow", and "Chumley" are the pronunciations used

Robin Becker · Jan 16, 2014

No, I think he is saying that he doesn't know Unicode anywhere near as
well as he thinks he does. The question is, will he cherish his
ignorance, or learn from this thread?

I assure you that I fully understand my ignorance of unicode. Until recently I
didn't even know that the unicode in python 2.x is considered broken and that
str in python 3.x is considered 'better'.

I can say that having made a lot of reportlab work in both 2.7 & 3.3 I don't
understand why the latter seems slower especially since we try to convert early
to unicode/str as a desirable internal form. Probably I have some horrible error
going on (e.g. one of the C extensions is working in 2.7 and not in 3.3).
-stupidly yrs-
Robin Becker

Chris Angelico · Jan 16, 2014

I assure you that I fully understand my ignorance of unicode. Until recently
I didn't even know that the unicode in python 2.x is considered broken and
that str in python 3.x is considered 'better'.

Your wisdom, if I may paraphrase Master Foo, is that you know you are a fool.

http://catb.org/esr/writings/unix-koans/zealot.html

ChrisA

Frank Millman · Jan 16, 2014

Robin Becker said:
I assure you that I fully understand my ignorance of unicode. Until
recently I didn't even know that the unicode in python 2.x is considered
broken and that str in python 3.x is considered 'better'.

Hi Robin

I am pretty sure that Steven was referring to the original post from
jmfauth, not to anything that you wrote.

May I say that I am delighted that you are putting in the effort to port
ReportLab to python3, and I trust that you will get plenty of support from
the gurus here in achieving this.

Frank Millman

Robin Becker · Jan 16, 2014

On 16/01/2014 12:06, Frank Millman wrote:
...........

Hi Robin

I am pretty sure that Steven was referring to the original post from
jmfauth, not to anything that you wrote.

unfortunately my ignorance remains even in the absence of criticism

May I say that I am delighted that you are putting in the effort to port
ReportLab to python3, and I trust that you will get plenty of support from
the gurus here in achieving this.

.........
I have had a lot of support from the gurus thanks to all of them

Steven D'Aprano · Jan 16, 2014

I assure you that I fully understand my ignorance of unicode.

Robin, while I'm very happy to see that you have a good grasp of what you
don't know, I'm afraid that you're misrepresenting me. You deleted the
part of my post that made it clear that I was referring to our resident

Until
recently I didn't even know that the unicode in python 2.x is considered
broken and that str in python 3.x is considered 'better'.

No need for scare quotes.

The unicode type in Python 2.x is less-good because:

- it is not the default string type (you have to prefix the string
with a u to get Unicode);

- it is missing some functionality, e.g. casefold;

- there are two distinct implementations, narrow builds and wide builds;

- wide builds take up to four times more memory per string than needed;

- narrow builds take up to two times more memory per string than needed;

- worse, narrow builds have very naive (possibly even "broken")
handling of code points in the Supplementary Multilingual Planes.

The unicode string type in Python 3 is better because:

- it is the default string type;

- it includes more functionality;

- starting in Python 3.3, it gets rid of the distinction between
narrow and wide builds;

- which reduces the memory overhead of strings by up to a factor
of four in many cases;

- and fixes the issue of SMP code points.
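
The memory and astral-character points in the list above can be observed with a rough sketch, assuming CPython 3.3 or later. The exact byte counts from `sys.getsizeof` are implementation details and will vary between versions; only the ordering is of interest here.

```python
import sys

n = 1000
ascii_text = "a" * n             # fits in 1 byte per character
bmp_text = "\u20ac" * n          # needs 2 bytes per character
astral_text = "\U0001f600" * n   # needs 4 bytes per character

for label, s in (("ascii", ascii_text), ("bmp", bmp_text), ("astral", astral_text)):
    # len() counts code points in every case; storage grows with the widest one.
    print(label, len(s), sys.getsizeof(s))
```

On a narrow Python 2 build the astral string would report a length of 2000, because each character is stored as a surrogate pair.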

I can say that having made a lot of reportlab work in both 2.7 & 3.3 I
don't understand why the latter seems slower especially since we try to
convert early to unicode/str as a desirable internal form.

*shrug*

Who knows? Is it slower or does it only *seem* slower? Is the performance
regression platform specific? Have you traded correctness for speed, that
is, does 2.7 version break when given astral characters on a narrow build?

Earlier in January, you commented in another thread that

"I'm not sure if we have any non-bmp characters in the tests."

If you don't, you should have some.
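
As one possible shape for such a test, here is a minimal sketch, assuming Python 3.3+. The function name and the sample string are invented for illustration and have nothing to do with the actual reportlab test suite.

```python
ASTRAL = "\U0001d11e"  # MUSICAL SYMBOL G CLEF, outside the BMP
sample = "x" + ASTRAL + "y"


def test_non_bmp_round_trip():
    # One astral character must count as one item, not a surrogate pair.
    assert len(sample) == 3
    assert sample[1] == ASTRAL
    # Reversing must not split the character in two.
    assert sample[::-1] == "y" + ASTRAL + "x"
    # Encoding and decoding must give the original text back.
    for codec in ("utf-8", "utf-16", "utf-32"):
        assert sample.encode(codec).decode(codec) == sample


test_non_bmp_round_trip()
```

The length and reversal checks are the ones expected to fail on a narrow 2.x build.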

There's all sorts of reasons why your code might be slower under 3.3,
including the possibility of a non-trivial performance regression. If you
can demonstrate a test case with a significant slowdown for real-world
code, I'm sure that a bug report will be treated seriously.

Probably I
have some horrible error going on(eg one of the C extensions is working
in 2.7 and not in 3.3).

Well that might explain a slowdown.

But really, one should expect that moving from single byte strings to up
to four-byte strings will have *some* cost. It's exchanging functionality
for time. The same thing happened years ago, people used to be extremely
opposed to using floating point doubles instead of singles because of
performance. And, I suppose it is true that back when 64K was considered
a lot of memory, using eight whole bytes per floating point number (let
alone ten like the IEEE Extended format) might have seemed the height of
extravagance. But today we use doubles by default, and if singles would
be a tiny bit faster, who wants to go back to the bad old days of single
precision?

I believe the same applies to Unicode versus single-byte strings.

Tim Chase · Jan 16, 2014

The unicode type in Python 2.x is less-good because:

- it is missing some functionality, e.g. casefold;

Just for the record, str.casefold() wasn't added until 3.3, so
earlier 3.x versions (such as the 3.2.3 that is the default python3
on Debian Stable) don't have it either.
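
For reference, a short sketch of what casefold does with the word in the thread title, assuming Python 3.3 or later (the snippet raises AttributeError on earlier versions):

```python
word = "Stra\u00dfe"  # 'Strasse' spelled with the sharp s

print(word.lower())     # lowercases, but keeps the sharp s
print(word.casefold())  # folds the sharp s to 'ss' for caseless matching

# Caseless comparison works with casefold() but not with lower().
print(word.casefold() == "STRASSE".casefold())
print(word.lower() == "STRASSE".lower())
```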

-tkc

Travis Griggs · Jan 16, 2014

I assure you that I fully understand my ignorance of ...

Robin, don’t take this personally, I totally got what you meant.

At the same time, I got a real chuckle out of this line. That beats “army intelligence” any day.
