unicodedata implementation - categories

James Abley

Hi,

I'm trying to understand how CPython implements unicodedata, with a view to
providing an implementation for Jython. This is a background, low-priority
task for me; I last posted to this list about it in February!

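Python 2.5.1 (r251:54863, May 2 2007, 16:56:35)
[GCC 4.1.2 (Ubuntu 4.1.2-0ubuntu4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import unicodedata
>>> unicodedata.name(unichr(0x10FFFF))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: no such name
>>> unicodedata.category(unichr(0x10FFFF))
'Cn'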

0x10FFFF is not an assigned character in Unicode 4.1 (it is a noncharacter),
and 4.1 is the version of the Unicode standard that Python 2.5 supports.

So I have a couple of questions:

1) Why doesn't the category method raise an Exception, like the name method
does?
2) Given that the category method doesn't currently raise an Exception,
please could someone explain how the category is calculated? I have tried to
figure it out based on the CPython code, but I have thus far failed, and I
would also prefer to have it explicitly defined, rather than mandating that
a Jython (or .NET, etc.) implementation use the same (possibly non-optimal
for Java) data structures and algorithms.

My background is Mathematics rather than pure Computer Science, so doubtless
I still have some gaps in my education to be filled when it comes to data
structures and algorithms and I would welcome the opportunity to fill some
of those in. References to Knuth or some on-line reading would be much
appreciated, to help me understand the CPython part.

Cheers,

James
 
chris.monsanto

James Abley wrote:
> >>> import unicodedata
> >>> unicodedata.name(unichr(0x10FFFF))
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> ValueError: no such name
> >>> unicodedata.category(unichr(0x10FFFF))
> 'Cn'

Cn is the "Other, Not Assigned" category in Unicode. No assigned character
has this property; it is the value given to code points that have no
character assigned. I'm not sure why it doesn't raise an Exception, but if
category() returns Cn, then you know the code point is not an assigned
character.
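
A minimal sketch of that check, assuming a current Python 3 interpreter (where chr replaces the unichr used in the session above). The helper name describe is made up for illustration:

```python
import unicodedata

def describe(code_point):
    """Return (category, name) for a code point; name is None if undefined."""
    ch = chr(code_point)
    # category() returns a value for every code point, including unassigned ones.
    category = unicodedata.category(ch)
    try:
        name = unicodedata.name(ch)
    except ValueError:
        # name() raises ValueError ("no such name") when no name is defined.
        name = None
    return category, name

print(describe(0x41))      # an assigned character
print(describe(0x10FFFF))  # category is 'Cn', and there is no name
```

Calling unicodedata.name(ch, None) with a default would avoid the try/except; the explicit form is used here only to mirror the traceback in the original session.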
 
Martin v. Löwis

> 1) Why doesn't the category method raise an Exception, like the name method
> does?

As Chris explains, the result category means "Other, Not Assigned".
Python returns this category because it's the truth: for those
characters, the value of the "category" property really *is* Cn;
it means that they are not assigned.

If you are wondering how unicodedata.c comes up with the result:
the unassigned characters get a record index of 0, and that has a
category value of 0, which is "Cn".
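
The idea of "record index 0 means unassigned" can be sketched with a toy two-level table. This is only an illustration under stated assumptions: the block size, the tiny 16-code-point universe, and the names SHIFT, RECORDS, build_tables and category are all invented here, and the real records hold more fields than a category value.

```python
SHIFT = 2                       # toy block size: 4 code points per block
BLOCK = 1 << SHIFT

CATEGORY_NAMES = ["Cn", "Lu", "Nd"]   # category value 0 is "Cn"
RECORDS = [0, 1, 2]                   # record index -> category value;
                                      # record 0 is the shared "unassigned" record

def build_tables(assigned, limit):
    """Compress a sparse {code_point: record_index} map into two index tables."""
    seen = {}                   # block contents -> block number
    index1, index2 = [], []
    for start in range(0, limit, BLOCK):
        # Code points missing from the map fall back to record 0.
        block = tuple(assigned.get(cp, 0) for cp in range(start, start + BLOCK))
        if block not in seen:   # identical blocks are stored only once
            seen[block] = len(seen)
            index2.extend(block)
        index1.append(seen[block])
    return index1, index2

def category(cp, index1, index2):
    """Look up the category name of a code point via the two tables."""
    if cp >= len(index1) * BLOCK:
        record = 0              # outside the tables: treat as unassigned
    else:
        block = index1[cp >> SHIFT]
        record = index2[(block << SHIFT) + (cp & (BLOCK - 1))]
    return CATEGORY_NAMES[RECORDS[record]]

index1, index2 = build_tables({4: 1, 5: 1, 13: 2}, limit=16)
print(category(4, index1, index2), category(6, index1, index2))
```

Because every all-unassigned block is identical, such blocks collapse into one shared entry, which is why this kind of layout stays small even though most code points have no character.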

> 2) Given that the category method doesn't currently raise an Exception,
> please could someone explain how the category is calculated? I have tried to
> figure it out based on the CPython code, but I have thus far failed, and I
> would also prefer to have it explicitly defined, rather than mandating that
> a Jython (.NET, etc) implementation uses the same (possibly non-optimal for
> Java) data structures and algorithms.

You definitely should *not* follow the Python implementation. Instead,
the Unicode database is defined by the Unicode consortium, so the
Unicode standard is the ultimate specification.
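
As a rough sketch of working from the standard's data rather than from CPython's tables, the category can be read from the third field of each UnicodeData.txt line, with "Cn" as the default for code points the file does not list. The embedded SAMPLE is only a handful of lines in that file's format, and make_category_lookup is a made-up helper name:

```python
import bisect

# A few lines in UnicodeData.txt format (semicolon-separated fields:
# code point, name, general category, ...). Not the full file.
SAMPLE = """\
0030;DIGIT ZERO;Nd;0;EN;;0;0;0;N;;;;;
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FBB;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;
"""

def make_category_lookup(text):
    """Build a category(code_point) function from UnicodeData.txt-style text."""
    ranges = []                 # (first, last, category) tuples
    range_start = None
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split(";")
        code_point, name, cat = int(fields[0], 16), fields[1], fields[2]
        if name.endswith(", First>"):
            range_start = code_point      # a range is given as a First/Last pair
        elif name.endswith(", Last>"):
            ranges.append((range_start, code_point, cat))
            range_start = None
        else:
            ranges.append((code_point, code_point, cat))
    ranges.sort()
    starts = [first for first, _, _ in ranges]

    def category(code_point):
        # Find the last range starting at or before code_point.
        i = bisect.bisect_right(starts, code_point) - 1
        if i >= 0 and code_point <= ranges[i][1]:
            return ranges[i][2]
        return "Cn"             # not listed: Other, Not Assigned
    return category

category = make_category_lookup(SAMPLE)
print(category(0x41), category(0x5000), category(0x10FFFF))
```

A sorted range list with binary search is just one possible representation; it is chosen here because it needs no knowledge of how any existing implementation lays out its tables.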

To implement it in Java, I recommend using java.lang.Character.getType.
If that returns java.lang.Character.UNASSIGNED, return "Cn".

Regards
Martin
 