Python Unicode handling wins again -- mostly

Chris Angelico · Dec 2, 2013

This is where my knowledge about Unicode gets fuzzy. Isn't it the case that
some grapheme clusters (or whatever the right word is) can't be normalized
down to a single code point? Characters can accept many accents, for
example.

You can't normalize everything down to a single code point, but you
can normalize the other way by breaking out everything that can be
broken out.
'a\u0308'

ChrisA

Ethan Furman · Dec 2, 2013

The attacks that "Joseph McCarthy" has been launching on the core developers for the last 15 months are in my view now
perfectly acceptable. This is excellent news. Everybody can now say what they like about the core developers and
there's no comeback.

You can also stuff the code of conduct, it's quite clearly only brought into play when it suits. Never, ever aim it at
somebody who goes out of their way to stir things up, always target it at the people who fight back *IS THE RULE HERE*.

Mark, I sympathize with your feelings. jmf is certainly a troll, and it doesn't feel like anything has been, or is
being, done about that situation (or for that matter, the help vampire situation... although I haven't seen any threads
from that one lately -- did he give up, or has he been moderated away?). However, I would suggest that when you are
venting, you write the email and then just delete it. I personally don't mind the light and humorous posts, but when
the name-calling starts it makes the list an unfriendly place to be. And, to be clear, the coddling of trolls and
help-vampires also makes the list an unfriendly place to be.

Terry, would it be appropriate to share some of what the moderators do do for us on this list and the others? And what
does the Code of Conduct have to say about trolls and help-vampires?

Ethan Furman · Dec 2, 2013

You can't normalize everything down to a single code point, but you
can normalize the other way by breaking out everything that can be
broken out.

'a\u0308'

Well, Stephen was right then! There's room for a library to handle this situation. Or is there one already?

MRAB · Dec 2, 2013

This is where my knowledge about Unicode gets fuzzy. Isn't it the case
that some grapheme clusters (or whatever the right word is) can't be
normalized down to a single code point? Characters can accept many
accents, for example. In that case, you can't always normalize and use
the existing string methods, but would need more specialized code.

A better way of saying it is that there are codepoints for some grapheme
clusters. Those 'precomposed' codepoints exist because some legacy
character sets contained them, and having a one-to-one mapping
encouraged Unicode's adoption.

Ned Batchelder · Dec 2, 2013

The attacks that "Joseph McCarthy" has been launching on the core
developers for the last 15 months are in my view now perfectly
acceptable. This is excellent news. Everybody can now say what they
like about the core developers and there's no comeback.

You can also stuff the code of conduct, it's quite clearly only brought
into play when it suits. Never, ever aim it at somebody who goes out of
their way to stir things up, always target it at the people who fight
back *IS THE RULE HERE*.

The point is that in this thread, no one was making attacks on core
developers. You were bringing up old animosity here for no reason at
all, and making them personal attacks to boot.

I don't see how you think wxjmfauth was "going out of his way to stir
things up" in *this* thread. He made three comments, none of which
mentioned the FSR or any other controversial topic. Can't we respond to
the content of posts, and not to past offenses by the poster?

Additionally, wxjmfauth's past complaints about the flexible string
representation were not personal. He didn't say, "Joe Smith is the
worst loser in the world for writing the FSR". He complained about a
feature of CPython, baselessly, but he never attacked the people doing
the work. His continued complaints were aggravating, I agree. I don't
know that they rose to the level of "disrespectful".

I know that your behavior here is disrespectful.

As to when the code of conduct is brought up, it's only fairly recently
that it has been mentioned in this forum. There have clearly been posts
in recent memory (the last year) which could have been examined in light
of the code of conduct, and were not. I think we are using it more
uniformly now. You helped me realize better how to apply it to this
forum, and I thank you for that. I welcome your help in applying it
better still. But it applies to you as well and I don't think it's too
much to ask that you abide by it.

The way to improve this list is to respectfully point to and demonstrate
community norms and ask people to conform to them. Spewing vitriol
isn't going to fix anything.

--Ned.

Mark Lawrence · Dec 2, 2013

Mark, I sympathize with your feelings. jmf is certainly a troll, and
it doesn't feel like anything has been, or is being, done about that
situation (or for that matter, the help vampire situation... although I
haven't seen any threads from that one lately -- did he give up, or has
he been moderated away?). However, I would suggest that when you are
venting, you write the email and then just delete it. I personally
don't mind the light and humorous posts, but when the name-calling
starts it makes the list an unfriendly place to be. And, to be clear,
the coddling of trolls and help-vampires also makes the list an
unfriendly place to be.

Terry, would it be appropriate to share some of what the moderators do
do for us on this list and the others? And what does the Code of
Conduct have to say about trolls and help-vampires?

I deleted the first really spiteful reply, but the hypocrisy that
continues to be shown gets right up both of my nostrils, hence I
couldn't resist the above, greatly toned down response. This will
surely give an indication of how strongly I feel on issues such as this.
Rules are rules to be applied evenly, not on a pick and choose basis.

Ned Batchelder · Dec 2, 2013

Mark, I sympathize with your feelings. jmf is certainly a troll, and
it doesn't feel like anything has been, or is being, done about that
situation (or for that matter, the help vampire situation... although I
haven't seen any threads from that one lately -- did he give up, or has
he been moderated away?). However, I would suggest that when you are
venting, you write the email and then just delete it. I personally
don't mind the light and humorous posts, but when the name-calling
starts it makes the list an unfriendly place to be. And, to be clear,
the coddling of trolls and help-vampires also makes the list an
unfriendly place to be.

Terry, would it be appropriate to share some of what the moderators do
do for us on this list and the others? And what does the Code of
Conduct have to say about trolls and help-vampires?

We have pointed help-vampires at the Code of Conduct:
https://mail.python.org/pipermail/python-list/2013-November/660343.html

He's also banned from the mailing list, which reduces the number of
people who see his questions, and helps keep threads from exploding. For
example, this message to the newsgroup
https://groups.google.com/d/msg/comp.lang.python/fdhF_Fr4fX0/9B0iK8jGigkJ (sorry
for the groups link, didn't know how else to link to a post) doesn't
appear at all in the mailing list, and therefore, in gmane.

But the mailing list ban isn't why you aren't seeing posts from him: he
hasn't posted again since that linked message, on Nov 21.

I think he's not posting in part because we adopted a uniform stance of
politely refusing to answer his questions, or even completely ignoring
his questions.

Of course, he could be back at any time. I hope we'll continue to
present a calm unified front.

--Ned.

Ned Batchelder · Dec 2, 2013

The point is that in this thread, no one was making attacks on core
developers. You were bringing up old animosity here for no reason at
all, and making them personal attacks to boot.

I don't see how you think wxjmfauth was "going out of his way to stir
things up" in *this* thread. He made three comments, none of which
mentioned the FSR or any other controversial topic. Can't we respond to
the content of posts, and not to past offenses by the poster?

Additionally, wxjmfauth's past complaints about the flexible string
representation were not personal. He didn't say, "Joe Smith is the
worst loser in the world for writing the FSR". He complained about a
feature of CPython, baselessly, but he never attacked the people doing
the work. His continued complaints were aggravating, I agree. I don't
know that they rose to the level of "disrespectful".

I know that your behavior here is disrespectful.

As to when the code of conduct is brought up, it's only fairly recently
that it has been mentioned in this forum. There have clearly been posts
in recent memory (the last year) which could have been examined in light
of the code of conduct, and were not. I think we are using it more
uniformly now. You helped me realize better how to apply it to this
forum, and I thank you for that. I welcome your help in applying it
better still. But it applies to you as well and I don't think it's too
much to ask that you abide by it.

The way to improve this list is to respectfully point to and demonstrate
community norms and ask people to conform to them. Spewing vitriol
isn't going to fix anything.

--Ned.

BTW: I think Mark has kill-filed me, so if anyone agrees enough with me
here to want Mark to see it, someone else will have to respond before he
gets the text.

--Ned.

Mark Lawrence · Dec 2, 2013

BTW: I think Mark has kill-filed me, so if anyone agrees enough with me
here to want Mark to see it, someone else will have to respond before he
gets the text.

--Ned.

I've kill-filed you on my personnal email address which I asked you
specifically *NOT* to message me on. You completely ignored that
request. FTR you're only the second person I've ever done that to, the
other being a pot smoking hippy who thankfully hasn't been seen for years.

Ethan Furman · Dec 2, 2013

... the other being a pot smoking hippy who ...

Please trim your posts. You comment a lot on people sending double-spaced google posts -- not trimming is nearly as bad.

The above is a good example of unnecessary name calling.

I value your good posts. Please keep a light-hearted and respectful tone. When light-hearted doesn't cut it, you can
still be respectful (of the other readers, even if the offender doesn't deserve it).

Ned Batchelder · Dec 2, 2013

I've kill-filed you on my personnal email address which I asked you
specifically *NOT* to message me on. You completely ignored that
request. FTR you're only the second person I've ever done that to, the
other being a pot smoking hippy who thankfully hasn't been seen for years.

Yes, I've apologized for that faux pas. I hope that you can forgive me.
Someday I hope to understand why it angered you so much. Good to hear
that we can communicate here.

--Ned.

Roy Smith · Dec 3, 2013

Mark Lawrence said:
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

"I believe that Pythonistas should commit themselves to achieving the
goal, before this decade is out, of making Python 3 the default version
and having everybody be cool with unicode."

Ethan Furman · Dec 3, 2013

"I believe that Pythonistas should commit themselves to achieving the
goal, before this decade is out, of making Python 3 the default version
and having everybody be cool with unicode."

Hear, Hear!

+1000!

Terry Reedy · Dec 3, 2013

jmf is certainly a troll

No, he is a person who discovered a minor performance regression in the
FSR, which we fixed. Unfortunately, he then continued for a year with a
strange troll-like anti-FSR crusade. But his posts in the Unicode
handling thread were not part of that. It seems to me that continually
beating someone over the head with the past discourages changed
behavior. To me, the point of asking someone to 'stop' is to persuade
them to stop. The reward for stopping should be to let the issue go.

haven't seen any threads from that one lately -- did he give up, or has
he been moderated away?).

Action was taken, including changing the usenet (clr) to mailing-list
gateway. (I already mentioned this twice here.) The was done by one of
the mailman infrastructure people at the request of the list
owner/moderators. The people who stuck their necks out to privately
contact the person in question displeased him and got privately
mail-bombed with repeated insults. I guess he subsequently gave up.

the coddling of trolls and help-vampires also makes the list an
unfriendly place to be.

I agree with the that as a statement, but not the implication. Was I
hallucinating, or did you not recently participate in the discussion and
decision to stop coddling our most obnoxious 'troll' in the community?

Terry, would it be appropriate to share some of what the moderators do
do for us on this list and the others?

Python-list moderators discard perhaps one spam post a day. You already
noticed a recent major benefit.

And what does the Code of
Conduct have to say about trolls and help-vampires?

I need to re-read it to really answer that adequately. The term and
defined concept 'help-vampire' is new to me (as of a month ago) and
probably to the CoC writers. However, the behavior strikes me as
disrespectful of the community, and that *is* generically covered.

Terry Reedy · Dec 3, 2013

This forum doesn't have authorised moderators,

At least some PSF mailing lists have 1 or more PSF-authorized moderators
(currently 4 for python-list) who pretty thanklessly check the initial
posts of new subscribers and posts flagged by the spam detector as
possible spam, or with other problems. We do not have 'every-post'
moderation.

If you perceive uneven application of our code of conduct,

As far as I know, there has been just one non-spam application of CoC
to python-list: Nikos. I do not see how anyone could call that uneven or
unfair.

Ethan Furman · Dec 3, 2013

No, he is a person who discovered a minor performance regression in the FSR, which we fixed. Unfortunately, he then
continued for a year with a strange troll-like anti-FSR crusade. But his posts in the Unicode handling thread were not
part of that. It seems to me that continually beating someone over the head with the past discourages changed behavior.
To me, the point of asking someone to 'stop' is to persuade them to stop. The reward for stopping should be to let the
issue go.

I remember it slightly differently, but you're right -- we should let it drop.

I agree with the that as a statement, but not the implication. Was I hallucinating, or did you not recently participate
in the discussion and decision to stop coddling our most obnoxious 'troll' in the community?

I'm afraid I don't see the point you are trying to make. I'm against coddling those who refuse to learn and participate
with respect to the rest of us, and I did vote to stop such coddling [1] of a certain troll. I don't see the discrepancy.

All that aside, thank you to you and the other moderators for your time and efforts.

--
~Ethan~

[1] Coddling can be an offensive word, and I wish to make clear that initial efforts to educate and help newcomers are
appropriate and warranted. However, after some time has passed and the newcomer is no longer a newcomer and is still
exhibiting rude and ignorant behavior, further attempts to help most likely won't, and that is when I would classify
such attempts as coddling.

Grant Edwards · Dec 3, 2013

"I believe that Pythonistas should commit themselves to achieving the
goal, before this decade is out, of making Python 3 the default version
and having everybody be cool with unicode."

I'm cool with Unicode as long as it "just works" without me ever
having to understand it and I can interact effortlessly with plain old
ASCII files. Evertime I start to read anything about Unicode with any
technical detail at all, I start to get dizzy and bleed from the ears.

Steven D'Aprano · Dec 3, 2013

This is where my knowledge about Unicode gets fuzzy. Isn't it the case
that some grapheme clusters (or whatever the right word is) can't be
normalized down to a single code point? Characters can accept many
accents, for example. In that case, you can't always normalize and use
the existing string methods, but would need more specialized code.

That is correct.

If Unicode had a distinct code point for every possible combination of
base-character plus an arbitrary number of diacritics or accents, the
0x10FFFF code points wouldn't be anywhere near enough.

I see over 300 diacritics used just in the first 5000 code points. Let's
pretend that's only 100, and that you can use up to a maximum of 5 at a
time. That gives 79375496 combinations per base character, much larger
than the total number of Unicode code points in total.

If anyone wishes to check my logic:

# count distinct combining chars
import unicodedata
s = ''.join(chr(i) for i in range(33, 5000))
s = unicodedata.normalize('NFD', s)
t = [c for c in s if unicodedata.combining(c)]
len(set(t))

# calculate the number of combinations
def comb(r, n):
"""Combinations nCr"""
p = 1
for i in range(r+1, n+1):
p *= i
for i in range(1, n-r+1):
p /= i
return p

sum(comb(i, 100) for i in range(6))

I'm not suggesting that all of those accents are necessarily in use in
the real world, but there are languages which construct arbitrary
combinations of accents. (Or so I have been lead to believe.)

Steven D'Aprano · Dec 3, 2013

I'm cool with Unicode as long as it "just works" without me ever having
to understand it

That will never happen. Unicode is a bit like floating point maths:
there's always *some* odd corner case that will lead to annoyance and
confusion and even murder:

http://gizmodo.com/382026/a-cellphones-missing-dot-kills-two-people-puts-three-more-in-jail

And then there are legacy encodings. There are three things in life that
are inevitable: death, taxes, and text with the wrong encoding. Anyone
dealing with text they didn't generate themselves is going to have to
deal with mojibake at some point.

Having said that, if you control the text and always use UTF-8 for
storage and transmission, Unicode isn't that hard. Decode bytes to
Unicode as early as possible, do all your work in text rather than bytes,
then encode back to bytes as late as possible, and you'll be fine.

and I can interact effortlessly with plain old ASCII files.

That at least is easy, provided you can guarantee that what you think if
plain ol' ASCII actually is plain ol' ASCII, which isn't as easy as you
might think given that an awful lot of people think that "extended ASCII"
is a thing and that you ought to be able to deal with it just like ASCII.

Evertime I start to read anything about Unicode with any
technical detail at all, I start to get dizzy and bleed from the ears.

Heh, the standard certainly covers a lot of ground.

joe · Dec 3, 2013

How would a grapheme library work? Basic cluster combination, or would
implementing other algorithms (line break, normalizing to a "canonical"
form) be necessary?

How do people use grapheme clusters in non-rendering situations? Or here's
perhaps here's a better question: does anyone know any non-latin (Japanese
and Arabic come to mind) speakers who use python to process text in their
own language? Who could perhaps tell us what most bugs them about python's
current api and which standard libraries need work.

This is where my knowledge about Unicode gets fuzzy. Isn't it the case
that some grapheme clusters (or whatever the right word is) can't be
normalized down to a single code point? Characters can accept many
accents, for example. In that case, you can't always normalize and use
the existing string methods, but would need more specialized code.

Click to expand...

That is correct.

If Unicode had a distinct code point for every possible combination of
base-character plus an arbitrary number of diacritics or accents, the
0x10FFFF code points wouldn't be anywhere near enough.

I see over 300 diacritics used just in the first 5000 code points. Let's
pretend that's only 100, and that you can use up to a maximum of 5 at a
time. That gives 79375496 combinations per base character, much larger
than the total number of Unicode code points in total.

If anyone wishes to check my logic:

# count distinct combining chars
import unicodedata
s = ''.join(chr(i) for i in range(33, 5000))
s = unicodedata.normalize('NFD', s)
t = [c for c in s if unicodedata.combining(c)]
len(set(t))

# calculate the number of combinations
def comb(r, n):
"""Combinations nCr"""
p = 1
for i in range(r+1, n+1):
p *= i
for i in range(1, n-r+1):
p /= i
return p

sum(comb(i, 100) for i in range(6))

I'm not suggesting that all of those accents are necessarily in use in
the real world, but there are languages which construct arbitrary
combinations of accents. (Or so I have been lead to believe.)

Unicode and Python - how often do you index strings?	33	Jun 4, 2014
Flexible string representation, unicode, typography, ...	94	Aug 23, 2012
Python's handling of unicode surrogates	17	Apr 20, 2007
Python unicode utf-8 characters and MySQL unicode utf-8 characters	2	Jan 18, 2011
unable to print Unicode characters in Python 3	12	Jan 26, 2009
Python beginner, unicode encode/decode Q	1	Jul 14, 2008
Counting unicode graphemes in python	2	Oct 24, 2003
Problem handling a Unicode file	16	Aug 28, 2006

Python Unicode handling wins again -- mostly

Chris Angelico

Ethan Furman

Ethan Furman

MRAB

Ned Batchelder

Mark Lawrence

Ned Batchelder

Ned Batchelder

Mark Lawrence

Ethan Furman

Ned Batchelder

Roy Smith

Ethan Furman

Terry Reedy

Terry Reedy

Ethan Furman

Grant Edwards

Steven D'Aprano

Steven D'Aprano

joe

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads