On 12/2/13 3:38 PM, Ethan Furman wrote:
> This is where my knowledge about Unicode gets fuzzy. Isn't it the case
> that some grapheme clusters (or whatever the right word is) can't be
> normalized down to a single code point? Characters can accept many
> accents, for example. In that case, you can't always normalize and use
> the existing string methods, but would need more specialized code.
That is correct.
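For example (a quick Python 3 sketch; 'q' plus COMBINING DOT BELOW is just
one convenient pairing that happens to have no precomposed code point):

import unicodedata
# 'e' + COMBINING ACUTE ACCENT composes to the single code point U+00E9
print(len(unicodedata.normalize('NFC', 'e\u0301')))   # 1
# 'q' + COMBINING DOT BELOW has no precomposed form, so it stays two code points
print(len(unicodedata.normalize('NFC', 'q\u0323')))   # 2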
If Unicode had a distinct code point for every possible combination of
base character plus an arbitrary number of diacritics or accents, the
1,114,112 available code points (U+0000 through U+10FFFF) wouldn't be
anywhere near enough.
I see over 300 distinct diacritics used just in the first 5000 code
points. Let's pretend there are only 100, and that you can apply at most
5 at a time. That gives 79,375,496 combinations per base character, far
more than the total number of Unicode code points.
If anyone wishes to check my logic:
# Count distinct combining characters among the first 5000 code points.
import unicodedata
s = ''.join(chr(i) for i in range(33, 5000))
s = unicodedata.normalize('NFD', s)            # decompose into base + combining marks
t = [c for c in s if unicodedata.combining(c)]
print(len(set(t)))
# Calculate the number of ways to pick up to 5 of 100 diacritics.
def comb(r, n):
    """Combinations nCr (choose r items out of n)."""
    p = 1
    for i in range(r + 1, n + 1):    # n! / r!
        p *= i
    for i in range(1, n - r + 1):    # divide by (n - r)!
        p //= i                      # integer division keeps the result exact
    return p

print(sum(comb(i, 100) for i in range(6)))     # 79375496
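(On a newer interpreter this can be cross-checked against the standard
library; math.comb was added in Python 3.8, long after this thread:)

import math
print(sum(math.comb(100, i) for i in range(6)))   # 79375496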
I'm not suggesting that all of those accent combinations are necessarily
in use in the real world, but there are languages which construct
arbitrary combinations of accents. (Or so I have been led to believe.)