unicodedata implementation

James Abley

Hi,

[Originally posted this to the dev list, but the moderator advised
posting here first]

I'm looking into implementing the unicodedata module for Jython, and
I'm trying to understand the contracts promised by its various
functions. Please bear in mind that this means I'm probably targeting
the CPython implementation as of 2.3, although I would obviously be
quite happy if my implementation doesn't need too much extra to fit
the 2.5 functionality!

As someone has previously posted [1], the documentation is a little
thin and they were pointed at the Unicode specification [2]. I've done
a little reading there, and have a little knowledge now, which is
always dangerous. There are still gaps, and I was hoping someone here
might be able to point out what I'm missing.

My problem is described here [3], but I'll summarise it and add a little.

2468;CIRCLED DIGIT NINE;No;0;EN;<circle> 0039;;9;9;N;;;;;

(UnicodeData.txt [4] for Unicode 3.2.0 [5] entry for code-point 0x2468)

verify(unicodedata.decimal(u'\u2468', None) is None)
verify(unicodedata.digit(u'\u2468') == 9)
verify(unicodedata.numeric(u'\u2468') == 9.0)

That works fine, and I can see in the UnicodeData.txt file (the
mirrored property N towards the end is a fine marker; go back three
fields and then start working forward from there) that the decimal
property isn't defined, the digit property is 9 and the numeric
property is also 9.
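
The field-counting described above can be sketched in a few lines. This is only an illustration written for a current Python 3 interpreter; the helper name numeric_fields and the index constants are my own, not part of any library, and the decomposition field is written with the `<circle>` tag it carries in the published file.

```python
# Field positions in a UnicodeData.txt record (0-based, semicolon-separated).
DECIMAL, DIGIT, NUMERIC, MIRRORED = 6, 7, 8, 9


def numeric_fields(record):
    """Return the (decimal, digit, numeric) fields of one record as strings.

    An empty field means the property is not defined; it is reported as None.
    numeric_fields is a hypothetical helper, not a library function.
    """
    fields = record.strip().split(";")
    # Sanity check: the mirrored flag directly follows the three numeric fields.
    assert fields[MIRRORED] in ("Y", "N")
    return tuple(fields[i] or None for i in (DECIMAL, DIGIT, NUMERIC))


record = "2468;CIRCLED DIGIT NINE;No;0;EN;<circle> 0039;;9;9;N;;;;;"
print(numeric_fields(record))  # (None, '9', '9')
```

The values come back as raw strings on purpose, so that an unset field stays distinguishable from a field whose value is zero.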

However, this next bit is what confuses me:

325F;CIRCLED NUMBER THIRTY FIVE;No;0;ON;<circle> 0033 0035;;;35;N;;;;;

(UnicodeData.txt for Unicode 3.2.0 entry for code-point 0x325F)

verify(unicodedata.decimal(u'\u325F', None) is None)
verify(unicodedata.digit(u'\u325F', None) is None)
verify(unicodedata.numeric(u'\u325F') == 35.0)

The last one fails - ValueError: not a numeric character.

Now, again looking at the UnicodeData.txt entry and the mirrored N
property, working back three fields and going forward from there shows
that the decimal property isn't set, the digit property isn't set and
the numeric property appears to be 35.

So from my understanding of the Unicode (3.2.0) spec, the code point
0x325F has a numeric property with a value of 35, but the Python (2.3
and 2.4 - I haven't put 2.5 onto my box yet) implementation of
unicodedata disagrees, presumably for good reason.
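
For anyone who wants to compare interpreters, a small probe along these lines may help. It is a sketch in Python 3 syntax (on 2.x the literals would need the u prefix and print would be a statement); numeric_properties is a name I made up, while decimal, digit, numeric and unidata_version are the module's documented attributes.

```python
import unicodedata


def numeric_properties(ch):
    """Return (decimal, digit, numeric) for ch, with None where undefined.

    Passing a default as the second argument avoids the ValueError that
    these functions raise when the property is missing.
    numeric_properties is a hypothetical helper, not part of unicodedata.
    """
    return (
        unicodedata.decimal(ch, None),
        unicodedata.digit(ch, None),
        unicodedata.numeric(ch, None),
    )


# The bundled database version differs between interpreter releases.
print("Unicode database version:", unicodedata.unidata_version)
for ch in ("\u2468", "\u325F"):
    print("U+%04X" % ord(ch), numeric_properties(ch))
```

On an interpreter whose numeric table agrees with UnicodeData.txt, the second line should report 35.0 as the numeric value; on one that does not, the numeric slot comes back as None instead of raising.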

I can't see where I'm going wrong.

Cheers,

James

[1] http://groups.google.com/group/comp...lnk=st&q=unicodedata&rnum=10#7dbdda27be118836
[2] http://www.unicode.org/
[3] http://eternusuk.blogspot.com/2007/02/jython-unicodedata-initial-overview.html
[4] http://www.unicode.org/Public/3.2-Update/UnicodeData-3.2.0.txt
[5] http://www.unicode.org/Public/3.2-Update/UnicodeData-3.2.0.html
 
Martin v. Löwis

James said:
> So from my understanding of the Unicode (3.2.0) spec, the code point
> 0x325F has a numeric property with a value of 35, but the Python (2.3
> and 2.4 - I haven't put 2.5 onto my box yet) implementation of
> unicodedata disagrees, presumably for good reason.
>
> I can't see where I'm going wrong.

You might not be wrong at all. CPython has a hard-coded list for the
numeric mapping (see Objects/unicodectype.c), and that list was not
updated when the rest of the character database was updated.
Patch #1494554 corrected this and updated the numeric properties to
Unicode 4.1, for Python 2.5.

A patch that generates this function automatically, rather than
maintaining it by hand, is still pending.
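
To give a rough idea of what generating the mapping could look like, here is a minimal sketch in Python 3. It is not the pending patch and not CPython's actual generator; parse_numeric and build_numeric_table are hypothetical names, and the sample records are the two entries quoted earlier in this thread.

```python
from fractions import Fraction


def parse_numeric(field):
    """Convert a numeric-value field such as '9', '35' or '1/2' to a float."""
    return float(Fraction(field))


def build_numeric_table(lines):
    """Map code point -> numeric value for every record that defines one.

    Hypothetical helper: it reads field 8 (the numeric value) of each
    semicolon-separated UnicodeData.txt record and skips empty ones.
    """
    table = {}
    for line in lines:
        fields = line.strip().split(";")
        if len(fields) > 8 and fields[8]:
            table[int(fields[0], 16)] = parse_numeric(fields[8])
    return table


SAMPLE = [
    "2468;CIRCLED DIGIT NINE;No;0;EN;<circle> 0039;;9;9;N;;;;;",
    "325F;CIRCLED NUMBER THIRTY FIVE;No;0;ON;<circle> 0033 0035;;;35;N;;;;;",
]
table = build_numeric_table(SAMPLE)
print(table[0x325F])  # 35.0
```

Deriving the table from the data file means an entry such as U+325F cannot be missed when the database is updated, which is the weakness of a list maintained by hand.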

HTH,
Martin
 
