Python Unicode handling wins again -- mostly

Steven D'Aprano · Dec 1, 2013

I should hope so ;-)

I blame my keyboard, where letters A and K are practically right next to
each other, only seven letters apart. An easy typo to make.

Tim Chase · Dec 1, 2013

I blame my keyboard, where letters A and K are practically right
next to each other, only seven letters apart. An easy typo to make.

I suppose I should have modified my attribution-quote to read "Steven
D'Kprano wrote" then

-tkc

Chris Angelico · Dec 1, 2013

I blame my keyboard, where letters A and K are practically right next to
each other, only seven letters apart. An easy typo to make.

“It’s an easy mistake to make” the PFY concurs “Many’s the time I’ve
picked up a cattle prod thinking it was a lint remover as I’ve helped
groom one of your predecessors before an important board meeting about
slashing the IT budget.”

http://www.theregister.co.uk/2010/11/26/bofh_2010_episode_18/

ChrisA

Roy Smith · Dec 1, 2013

Chris Angelico said:
“It’s an easy mistake to make” the PFY concurs “Many’s the time I’ve
picked up a cattle prod thinking it was a lint remover as I’ve helped
groom one of your predecessors before an important board meeting about
slashing the IT budget.”

http://www.theregister.co.uk/2010/11/26/bofh_2010_episode_18/

ChrisA

What means "PFY"? The only thing I can think of is "Poor F---ing
Yankee"

Chris Angelico · Dec 1, 2013

What means "PFY"? The only thing I can think of is "Poor F---ing
Yankee"

In the context of the BOFH, it stands for Pimply-Faced Youth and means
BOFH's assistant.

ChrisA

wxjmfauth · Dec 1, 2013

On Sunday, 1 December 2013 00:07:36 UTC+1, Ned Batchelder wrote:

The fi ligature was created because visually, an f and i wouldn't work
well together: the crossbar of the f was near, but not connected to the
serif of the i, and the terminal bulb of the f was close to, but not
coincident, with the dot of the i.

This article goes into great detail, and has a good illustration of how
an f and i can clash, and how an fi ligature can fix the problem:
http://opentype.info/blog/2012/11/20/whats-a-ligature/ . Note the second
fi illustration, which demonstrates using a ligature to make the letters
appear *less* connected than they would individually!

This is also why "simply spacing the characters" isn't a solution: a
specially designed ligature looks better than a separate f and i, no
matter how minutely kerned.

It's unfortunate that Unicode includes presentation alternatives like
the fi (and ff, fl, ffi, and ffl) ligatures. It was done to be a
superset of existing encodings.

Many typefaces have other non-encoded ligatures as well, especially
display faces, which also have alternate glyphs. Unicode is a funny mix
in that it includes some forms of alternates, but can't include all of
them, so we have to put up with both an ad-hoc Unicode that includes
presentational variants, and also some other way to specify variants
because Unicode can't include all of them.

I'm speaking about those times where the "characters" (some) were
not even built with metal, but with wood (see Garamond, Bodoni).

---------

Unicode is "only" collecting "characters" in the sense "abstract
entities". What is supposed to be a "character" is one problem.
How a tool is supposed to handle these "characters" is a problem
too, but a different one.

"Unicode" is not a coding scheme, it is a "repertoire".

Illustrative examples instead of explanations.

The ffl ligature is a "character" because it has always
existed.

The & and œ are considered today as unique "characters".
They were historically "ligaturated forms".

The Fahrenheit, Kelvin and Celsius are considered as
"characters", despite Fahrenheit, Kelvin are "letters".

Text justification. Calculating the space between "words"
in "rendering units" makes sense. Using a specific "character"
like a thin space to force a predefined space makes sense too.

The miscellaneous zeroes one may see, like uppercase O, O with
a dot in the center or a slashed O are all the same zero, but
with stylistic variants, => a single "character" in the unicode
table.

.... but this medieval "character" existing in two forms (I do not
remember which one) was finally registered as two "characters",
and not as a stylistic variant of a single "character".

There are no "characters" for the symbols of the chemical elements,
a latin script is good enough.

The QPlainTextEdit widget from Qt does not know '\n'. It uses
only the paragraph separator and the line separator. To render
a paragraph separator, it uses yet another "character", the
pilcrow.

The µ "character" in the iso-8859-1 coding scheme is a greek
letter, it must be used or perceived as a SI unit prefix.
Unicode category: Ll, unicode name: micro sign.

How to place an arrow (vector) on top of an ê, if one can't
decompose it?

Related, there are dotless variants of i and j.

STIX fonts with the huge number of math symbols, not
yet in the unicode repertoire but present in the PUA.

etc.

Unicode is quite open. It's a good idea to keep that
openness to the developer. In short, if a coder decomposes
a "character" like "â" into an "a" plus a "^", it's up to
the developer to know what to do when reversing such a
string and to count this sequence as two real "characters".

jmf
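
Several of the examples above (the ffl ligature, the micro sign, the Kelvin sign) can be inspected with the standard `unicodedata` module. The following is only a sketch: the three code points in `samples` are our own choice to mirror the post, not something the poster supplied.

```python
import unicodedata

# Code points picked to mirror the examples in the post (our assumption).
samples = {
    "ffl ligature": "\ufb04",
    "micro sign": "\u00b5",
    "Kelvin sign": "\u212a",
}

for label, ch in samples.items():
    print(
        label,
        unicodedata.name(ch),                       # official character name
        unicodedata.category(ch),                   # general category, e.g. Ll
        ascii(unicodedata.normalize("NFC", ch)),    # canonical composition
        ascii(unicodedata.normalize("NFKC", ch)),   # compatibility composition
    )
```

The ligature and the micro sign survive NFC unchanged and are only replaced under the compatibility form NFKC, whereas the Kelvin sign is canonically equivalent to a plain K and is already replaced by NFC.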

Serhiy Storchaka · Dec 1, 2013

30.11.13 02:44, Steven D'Aprano wrote:

(2) If you reverse that string, does it give "lëon"? The implication of
this question is that strings should operate on grapheme clusters rather
than code points. Python fails this test:

py> print("noe\u0308l"[::-1])
l̈eon

print(unicodedata.normalize('NFC', "noe\u0308l")[::-1])
lëon

(3) What are the first three characters? The author suggests that the
answer should be "noë", in which case Python fails again:

py> print("noe\u0308l"[:3])
noe

print(unicodedata.normalize('NFC', "noe\u0308l")[:3])
noë

(4) Likewise, what is the length of the decomposed string? The author
expects 4, but Python gives 5:

py> len("noe\u0308l")
5

4
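
The three replies above follow one pattern: normalize to NFC first, then apply the ordinary string operation. A minimal self-contained sketch of that pattern, where the variable names `decomposed` and `composed` are ours:

```python
import unicodedata

decomposed = "noe\u0308l"                            # 'e' followed by COMBINING DIAERESIS
composed = unicodedata.normalize("NFC", decomposed)  # 'ë' becomes one code point

print(len(decomposed), len(composed))   # code point counts before and after
print(ascii(composed[::-1]))            # reversal keeps the diaeresis on the e
print(ascii(composed[:3]))              # the first three code points include the ë
```

This only works here because a precomposed ë exists; NFC cannot help with a base letter and mark that have no precomposed form.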

wxjmfauth · Dec 1, 2013

30.11.13 02:44, Steven D'Aprano wrote:

(2) If you reverse that string, does it give "lëon"? The implication of
this question is that strings should operate on grapheme clusters rather
than code points. ...

BTW, a grapheme cluster *is* a code points cluster.

jmf

Tim Delaney · Dec 1, 2013

30.11.13 02:44, Steven D'Aprano wrote:

BTW, a grapheme cluster *is* a code points cluster.

Anyone with a decent level of reading comprehension would have understood
that Steven knows that. The implied word is "individual" i.e. "... rather
than [individual] code points".

Why am I responding to a troll? Probably because out of all his baseless
complaints about the FSR, he *did* have one valid point about performance
that has now been fixed.

Tim Delaney

Tim Delaney · Dec 1, 2013

I don't remember him ever having a valid point, so FTR can we have a
reference please. I do remember Steven D'Aprano showing that there was a
regression which I flagged up here http://bugs.python.org/issue16061. It
was fixed by Serhiy Storchaka, who appears to have forgotten more about
Python than I'll ever know, grrr!!!

From your own bug report (quoting Steven): "Nevertheless, I think there is
something here. The consequences are nowhere near as dramatic as jmf claims
...."

His initial postings did lead to a regression being found.

Tim Delaney

Ethan Furman · Dec 1, 2013

I don't remember him [jmf] ever having a valid point, so FTR can we have a reference please. I do remember Steven D'Aprano
showing that there was a regression which I flagged up here http://bugs.python.org/issue16061. It was fixed by Serhiy
Storchaka, who appears to have forgotten more about Python than I'll ever know, grrr!!!

The initial complaint came, unsurprisingly, from jmf. But don't worry much, even a stopped clock has a better track
record... it's at least right twice a day.

Mark Lawrence · Dec 1, 2013

On 2 December 2013 09:06, Mark Lawrence wrote:

I don't remember him ever having a valid point, so FTR can we have a
reference please. I do remember Steven D'Aprano showing that there
was a regression which I flagged up here
http://bugs.python.org/issue16061. It was fixed by Serhiy
Storchaka, who appears to have forgotten more about Python than I'll
ever know, grrr!!!

From your own bug report (quoting Steven): "Nevertheless, I think there
is something here. The consequences are nowhere near as dramatic as jmf
claims ..."

His initial postings did lead to a regression being found.

Tim Delaney

I'll begrudgingly concede that point, but must state that it was an
edge case that is unlikely to have too much impact in the real world.
Unfortunately he's still making his ridiculous claims about the FSR,
hence my nickname of "Joseph McCarthy". I'll admit to liking that, it
just feels right to me, YMMV.

What also really riles me is that he uses double spaced google crap,
despite repeated requests from various people here for others to fix how
they use it, or get a decent email client.

wxjmfauth · Dec 2, 2013

On Sunday, 1 December 2013 21:54:48 UTC+1, Tim Delaney wrote:

(2) If you reverse that string, does it give "lëon"? The implication of
this question is that strings should operate on grapheme clusters rather
than code points. ...

BTW, a grapheme cluster *is* a code points cluster.

Anyone with a decent level of reading comprehension would have understood
that Steven knows that. The implied word is "individual" i.e. "... rather
than [individual] code points".

Why am I responding to a troll? Probably because out of all his baseless
complaints about the FSR, he *did* have one valid point about performance
that has now been fixed.

Tim Delaney

My English is far from being perfect, but I think I understood
it correctly.

The point is not in the words "grapheme" or "code point",
nor in "individual", ;-), the point is in "rather".

If one wishes to work on a set of graphemes, one can
only work with the set of the corresponding code points.

To complete Serhiy Storchaka's example:
True

is correct.

jmf

PS I did not even speak about the FSR.

Ned Batchelder · Dec 2, 2013

1) Your English is far from perfect as you clearly do not understand the
repeated requests *NOT* to send us double spaced crap via google groups.

2) You can't speak about the FSR as you know precisely nothing about it,
but as they say, ignorance is bliss.

As annoying as baseless claims against the FSR were, wxjmfauth is
right: he didn't even mention the FSR in this thread. There's really no
point dragging this thread into that territory.

--Ned.

Chris Angelico · Dec 2, 2013

He's quite deliberately dragged it up by using p.s. Without doubt he's the
worst loser in the world and I'm *NOT* stopping getting at him. I find his
behaviour, continuously and groundlessly insulting the Python core
developers, quite disgusting.

What he does is make very sure that the awesomeness of Python 3.3+ is
constantly being brought up on python-list. New users of Python who
come here will, within a fairly short time, learn that Python actually
gets Unicode right, unlike most languages out there, and that it's
efficient and high performance.

ChrisA

Ned Batchelder · Dec 2, 2013

He's quite deliberately dragged it up by using p.s. Without doubt he's
the worst loser in the world and I'm *NOT* stopping getting at him. I
find his behaviour, continuously and groundlessly insulting the Python
core developers, quite disgusting.

His PS is in reference to you, Ethan, and Tim reminiscing about his past
complaints against the FSR. He made three posts to this thread before
you started in on him, and none of them mentioned the FSR. Tim first
mentioned it.

There's no need to call him "the worst loser in the world." Nothing
good will come from that kind of attack. It doesn't make this community
better, and it will not change his behavior.

He said nothing in this thread that insulted the Python core developers.
His posts in this thread are not about the FSR, and yet you dragged the
old fights into it. You are being the troll here.

--Ned.

Terry Reedy · Dec 2, 2013

the worst loser in the world

Mark, I consider your continual direct personal attacks on other posters
to be a violation of the PSF Code of Conduct, which *does* apply to
python-list. Please stop.

Ethan Furman · Dec 2, 2013

Out of the nine tests, Python 3.3 passes six, with three tests being
failures or dubious. If you believe that the native string type should
operate on code-points, then you'll think that Python does the right
thing.

I think Python is doing it correctly. If I want to operate on "clusters" I'll normalize the string first.

Thanks for this excellent post.

Mark Lawrence · Dec 2, 2013

Mark, I consider your continual direct personal attacks on other posters
to be a violation of the PSF Code of Conduct, which *does* apply to
python-list. Please stop.

The attacks that "Joseph McCarthy" has been launching on the core
developers for the last 15 months are in my view now perfectly
acceptable. This is excellent news. Everybody can now say what they
like about the core developers and there's no comeback.

You can also stuff the code of conduct, it's quite clearly only brought
into play when it suits. Never, ever aim it at somebody who goes out of
their way to stir things up, always target it at the people who fight
back *IS THE RULE HERE*.

Ned Batchelder · Dec 2, 2013

I think Python is doing it correctly. If I want to operate on
"clusters" I'll normalize the string first.

Thanks for this excellent post.

This is where my knowledge about Unicode gets fuzzy. Isn't it the case
that some grapheme clusters (or whatever the right word is) can't be
normalized down to a single code point? Characters can accept many
accents, for example. In that case, you can't always normalize and use
the existing string methods, but would need more specialized code.

--Ned.
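
The question above can be checked directly. The sketch below assumes "n" plus a combining diaeresis as the test case (a combination with no precomposed code point) and uses a hypothetical helper, `clusters`, that groups each base code point with the combining marks following it. That grouping is a deliberate simplification and not the full grapheme cluster rules of Unicode Standard Annex #29.

```python
import unicodedata

def clusters(text):
    """Group each base code point with the combining marks that follow it.

    Simplified illustration only: real grapheme segmentation (UAX #29)
    also covers Hangul jamo, emoji sequences, and more.
    """
    out = []
    for ch in text:
        if out and unicodedata.combining(ch):
            out[-1] += ch      # attach the mark to the preceding base
        else:
            out.append(ch)     # start a new cluster
    return out

word = unicodedata.normalize("NFC", "Spin\u0308al")

print(len(word))                                  # code points after NFC
print(len(clusters(word)))                        # clusters under the simplified rule
print(ascii("".join(reversed(clusters(word)))))   # reversed cluster by cluster
```

Since NFC leaves the n and its diaeresis as two code points, slicing, reversing, and `len` on the normalized string still split the pair; working on the list returned by `clusters` keeps them together.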
