Python Unicode handling wins again -- mostly

Mark Lawrence

"I believe that Pythonistas should commit themselves to achieving the
goal, before this decade is out, of making Python 3 the default version
and having everybody be cool with unicode."

I like that, thank you.
 
Mark Lawrence

I'm cool with Unicode as long as it "just works" without me ever
having to understand it and I can interact effortlessly with plain old
ASCII files. Every time I start to read anything about Unicode with any
technical detail at all, I start to get dizzy and bleed from the ears.

I'm pleased to see that I'm not the only one who suffers in this way :)
 
Ethan Furman

I think Python is doing it correctly. If I want to operate on "clusters" I'll normalize the string first.

Hrmm, well, after being educated ;) I think I may have to reverse my position. Given that not every cluster can be
normalized to a single code point perhaps Python is doing it the best possible way. On the other hand, we have a
uni*code* type, not a uni*char* type. Maybe 3.5 can have that. ;)

At any rate, definitely good to be aware of the issue.
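
A minimal sketch of the point conceded above, assuming Python 3 and only the standard-library unicodedata module. The two sample strings are illustrative choices, not taken from the thread: "e" plus a combining acute accent has a precomposed form, while "q" plus the same accent, as far as I know, does not, so NFC cannot shrink it to one code point.

```python
import unicodedata

# 'e' + COMBINING ACUTE ACCENT (U+0301): a precomposed form, U+00E9, exists.
e_acute = "e\u0301"
# 'q' + COMBINING ACUTE ACCENT: no precomposed form, so NFC leaves it alone.
q_acute = "q\u0301"

composed_e = unicodedata.normalize("NFC", e_acute)
composed_q = unicodedata.normalize("NFC", q_acute)

print(len(e_acute), len(composed_e))  # 2 1
print(len(q_acute), len(composed_q))  # 2 2
```

Both strings display as a single accented letter, but only the first becomes a single code point, so len() and indexing still see two items for the second.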
 
wxjmfauth

On Tuesday, 3 December 2013 06:06:26 UTC+1, Steven D'Aprano wrote:
On 12/2/13 3:38 PM, Ethan Furman wrote:

This is where my knowledge about Unicode gets fuzzy. Isn't it the case
that some grapheme clusters (or whatever the right word is) can't be
normalized down to a single code point? Characters can accept many
accents, for example. In that case, you can't always normalize and use
the existing string methods, but would need more specialized code.

That is correct.

If Unicode had a distinct code point for every possible combination of
base character plus an arbitrary number of diacritics or accents, the
0x10FFFF code points wouldn't be anywhere near enough.

I see over 300 diacritics used just in the first 5000 code points. Let's
pretend that's only 100, and that you can use up to a maximum of 5 at a
time. That gives 79375496 combinations per base character, far more
than the total number of Unicode code points.

If anyone wishes to check my logic:

    # count distinct combining chars
    import unicodedata
    s = ''.join(chr(i) for i in range(33, 5000))
    s = unicodedata.normalize('NFD', s)
    t = [c for c in s if unicodedata.combining(c)]
    len(set(t))

    # calculate the number of combinations
    def comb(r, n):
        """Combinations nCr"""
        p = 1
        for i in range(r+1, n+1):
            p *= i
        for i in range(1, n-r+1):
            p //= i  # integer division keeps the result an exact int
        return p

    sum(comb(i, 100) for i in range(6))

I'm not suggesting that all of those accents are necessarily in use in
the real world, but there are languages which construct arbitrary
combinations of accents. (Or so I have been led to believe.)

From one of my libs, BMP only:
240


jmf
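
The counting argument quoted above can be cross-checked with the standard library, assuming Python 3.8 or later for math.comb. The constant names are illustrative; the figures 100 and 5 are the deliberately low ones from the quoted post.

```python
import math

DIACRITIC_POOL = 100  # the "pretend that's only 100" figure from the post
MAX_DIACRITICS = 5    # at most five accents per base character

# Choose 0, 1, ..., 5 distinct diacritics from the pool and add up the counts.
combinations = sum(math.comb(DIACRITIC_POOL, k) for k in range(MAX_DIACRITICS + 1))

# Code points run from U+0000 through U+10FFFF inclusive.
code_points = 0x10FFFF + 1

print(combinations)                 # 79375496
print(combinations // code_points)  # 71
```

Under these assumptions a single base character would need roughly 71 times the entire code space.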
 
wxjmfauth

On Tuesday, 3 December 2013 15:26:45 UTC+1, Ethan Furman wrote:
Hrmm, well, after being educated ;) I think I may have to reverse my position. Given that not every cluster can be
normalized to a single code point perhaps Python is doing it the best possible way. On the other hand, we have a
uni*code* type, not a uni*char* type. Maybe 3.5 can have that. ;)

------


You intuitively pointed out a very important feature
of "unicode". However, it is not necessary; this is
exactly what unicode does (when used properly).

jmf
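
For readers wondering what a uni*char* view might look like, here is a rough sketch, assuming Python 3 and the standard unicodedata module. The helper name simple_clusters is hypothetical, and the grouping rule is a simplification: it attaches any code point with a nonzero canonical combining class to the preceding one, and does not implement the full grapheme cluster rules of Unicode text segmentation (ZWJ sequences, Hangul jamo, marks with combining class 0, and so on).

```python
import unicodedata

def simple_clusters(text):
    """Group each base character with the combining marks that follow it.

    Simplified on purpose: only code points with a nonzero canonical
    combining class are treated as attaching to the previous cluster.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch   # attach the mark to the current cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters

word = "q\u0301e\u0301!"
print(len(word))                   # 5
print(len(simple_clusters(word)))  # 3
```

The str type counts five code points where a reader sees three characters, which is the gap the thread is discussing.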
 
Mark Lawrence

On 04/12/2013 13:52, (e-mail address removed) wrote:

[snip all the double spaced stuff]
You intuitively pointed out a very important feature
of "unicode". However, it is not necessary; this is
exactly what unicode does (when used properly).

jmf

Presumably using unicode correctly prevents messages being sent across
the ether with superfluous, extremely irritating double spacing? Or is
that down to poor tools in combination with the ignorance of their users?
 
Neil Cerutti

You intuitively pointed out a very important feature of "unicode".
However, it is not necessary; this is exactly what unicode does
(when used properly).

Unicode only provides character sets. It's not a natural language
parsing facility.
 
