On 12/2/13 3:38 PM, Ethan Furman wrote:
> This is where my knowledge about Unicode gets fuzzy. Isn't it the case
> that some grapheme clusters (or whatever the right word is) can't be
> normalized down to a single code point? Characters can accept many
> accents, for example. In that case, you can't always normalize and use
> the existing string methods, but would need more specialized code.
That is correct.
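For example (a quick Python 3 sketch; 'q' plus COMBINING DOT BELOW is just
one convenient pairing that happens to have no precomposed code point):

import unicodedata
# 'e' + COMBINING ACUTE ACCENT composes to the single code point U+00E9
print(len(unicodedata.normalize('NFC', 'e\u0301')))   # 1
# 'q' + COMBINING DOT BELOW has no precomposed form, so it stays two code points
print(len(unicodedata.normalize('NFC', 'q\u0323')))   # 2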
If Unicode had a distinct code point for every possible combination of
base character plus an arbitrary number of diacritics or accents, the
1,114,112 available code points (U+0000 through U+10FFFF) wouldn't be
anywhere near enough.
I see over 300 distinct diacritics used just in the first 5000 code
points. Let's pretend there are only 100, and that you can apply at most
5 at a time. That gives 79,375,496 combinations per base character, far
more than the total number of Unicode code points.
If anyone wishes to check my logic:
# Count distinct combining characters among the first 5000 code points.
import unicodedata
s = ''.join(chr(i) for i in range(33, 5000))
s = unicodedata.normalize('NFD', s)            # decompose into base + combining marks
t = [c for c in s if unicodedata.combining(c)]
print(len(set(t)))
# Calculate the number of ways to pick up to 5 of 100 diacritics.
def comb(r, n):
    """Combinations nCr (choose r items out of n)."""
    p = 1
    for i in range(r + 1, n + 1):    # n! / r!
        p *= i
    for i in range(1, n - r + 1):    # divide by (n - r)!
        p //= i                      # integer division keeps the result exact
    return p

print(sum(comb(i, 100) for i in range(6)))     # 79375496
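(On a newer interpreter this can be cross-checked against the standard
library; math.comb was added in Python 3.8, long after this thread:)

import math
print(sum(math.comb(100, i) for i in range(6)))   # 79375496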
I'm not suggesting that all of those accent combinations are necessarily
in use in the real world, but there are languages which construct
arbitrary combinations of accents. (Or so I have been led to believe.)