Incorrect title case?

MRAB

Python 2.6.1

I've just found that the following 4 Unicode characters/codepoints don't
behave as I'd expect: Dž (U+01C5), Lj (U+01C8), Nj (U+01CB), Dz (U+01F2).

For example, u"\u01C5".istitle() returns True and
unicodedata.category(u"\u01C5") returns "Lt", but u"\u01C5".title()
returns u'\u01C4', which is the uppercase equivalent. Are these mistakes
in the Unicode database?
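
To reproduce the observation, a small check along these lines can be run. It is only a sketch: the helper name describe is made up for illustration, and what title() returns depends on the interpreter version, so no expected output is shown.

```python
import unicodedata

# The four titlecase digraphs mentioned in the post.
DIGRAPHS = [u"\u01C5", u"\u01C8", u"\u01CB", u"\u01F2"]


def describe(ch):
    """Return (code point, category, istitle(), title() result) for one character.

    `describe` is a hypothetical helper, not part of any library.
    """
    titled = " ".join("U+%04X" % ord(c) for c in ch.title())
    return ("U+%04X" % ord(ch), unicodedata.category(ch), ch.istitle(), titled)


for ch in DIGRAPHS:
    # If the last element differs from the first, title() did not map the
    # character to itself on this interpreter.
    print(describe(ch))
```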
 
John Machin

Doesn't look like it. AFAICT it's a mistake in Objects/unicodectype.c,
function _PyUnicode_ToTitlecase.

See http://svn.python.org/view/python/trunk/Objects/unicodectype.c?rev=66362&view=markup

The code that says:
    if (ctype->title)
        delta = ctype->title;
    else
        delta = ctype->upper;
should IMHO merely be:
    delta = ctype->title;

A value of zero for ctype->title should be interpreted simply as the
offset to add to the ordinal, as it is in the sibling
_PyUnicode_To(Upper|Lower)case functions. See also
Tools/unicode/makeunicodedata.py which treats upper, lower and title
identically when preparing the tables used by those 3 functions.

AFAICT making that change will fix the problem for those four
characters and not ruin any others.
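
The difference between the two versions can be seen in a toy model. This is a sketch only, not CPython's actual code: TypeRecord, RECORDS and the two function names are hypothetical, and the deltas are plain integers worked out from the case mappings discussed in this thread (U+01C5 has uppercase U+01C4, lowercase U+01C6 and is its own titlecase).

```python
from collections import namedtuple

# Hypothetical stand-in for the per-character record; the field names
# mirror the ctype->upper and ctype->title members quoted above.
TypeRecord = namedtuple("TypeRecord", "upper lower title")

# Offsets to add to the code point (assumed values for illustration).
RECORDS = {
    0x01C5: TypeRecord(upper=-1, lower=+1, title=0),    # titlecase is itself
    0x0061: TypeRecord(upper=-32, lower=0, title=-32),  # 'a' -> 'A'
}


def to_title_current(code):
    """Model of the existing logic: a zero title delta falls back to upper."""
    rec = RECORDS[code]
    delta = rec.title if rec.title else rec.upper
    return code + delta


def to_title_proposed(code):
    """Model of the suggested change: always add the title delta."""
    return code + RECORDS[code].title


for code in sorted(RECORDS):
    print("U+%04X: current -> U+%04X, proposed -> U+%04X"
          % (code, to_title_current(code), to_title_proposed(code)))
```

In this model only the character whose title delta is zero changes behaviour; 'a' gives the same result either way.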

The error that you noticed occurs as far back as I've looked (2.1) and
also occurs in 3.0.

Cheers,
John
 
Terry Reedy

Please post a report to the tracker at bugs.python.org.
 
Martin v. Löwis

John said:
A value of zero for ctype->title should be interpreted simply as the
offset to add to the ordinal, as it is in the sibling
_PyUnicode_To(Upper|Lower)case functions.

Interestingly enough, according to the spec of UnicodeData.txt,
these should *not* be siblings. Refer to

http://www.unicode.org/Public/UNIDATA/UCD.html

For lower and upper case, it says

Note: The simple uppercase is omitted in the data file if the uppercase
is the same as the code point itself.

whereas for titlecase, it says

Note: The simple titlecase may be omitted in the data file if the
titlecase is the same as the uppercase.

So unicodectype is right to fall back to uppercase if no titlecase
mapping is given.

However, this looks like a bug in UCD.html: they probably should have
the same note for titlecase as they have for lower and uppercase
(at least, that's how UnicodeData seems to be generated).

Regards,
Martin
 
John Machin

Martin said:
Interestingly enough, according to the spec of UnicodeData.txt,
these should *not* be siblings. Refer to

http://www.unicode.org/Public/UNIDATA/UCD.html

For lower and upper case, it says

Note: The simple uppercase is omitted in the data file if the uppercase
is the same as the code point itself.

whereas for titlecase, it says

Note: The simple titlecase may be omitted in the data file if the
titlecase is the same as the uppercase.

However: (1) there seem to be no examples in the current data file
where the titlecase is empty and the uppercase is not empty;
(2) the titlecase is *NOT* empty for the four characters in question
-- they have [in effect] ch.title() -> ch, as MRAB expected.
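
Both points can be checked mechanically against a copy of UnicodeData.txt. The sketch below assumes the usual semicolon-separated layout with the simple uppercase, lowercase and titlecase mappings in fields 12, 13 and 14 (counting from 0). The function names are hypothetical. The first two sample lines reflect my understanding of the real entries; the third is made up purely to exercise the check.

```python
SAMPLE_LINES = [
    "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041",
    "01C5;LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON;Lt;0;L;"
    "<compat> 0044 017E;;;;N;LATIN LETTER CAPITAL D SMALL Z HACEK;;"
    "01C4;01C6;01C5",
    # Made-up record (NOT in the real file): uppercase given, titlecase empty.
    "F0000;HYPOTHETICAL LETTER;Ll;0;L;;;;;N;;;F0001;;",
]


def case_fields(line):
    """Return (code, upper, title) field strings, or None for a short line."""
    fields = line.strip().split(";")
    if len(fields) < 15:
        return None
    return fields[0], fields[12], fields[14]


def title_empty_upper_given(lines):
    """Code points whose titlecase field is empty but uppercase is not."""
    hits = []
    for line in lines:
        parsed = case_fields(line)
        if parsed and parsed[1] and not parsed[2]:
            hits.append(parsed[0])
    return hits


def title_is_self(lines):
    """Code points whose titlecase field names the code point itself."""
    hits = []
    for line in lines:
        parsed = case_fields(line)
        if parsed and parsed[2] and int(parsed[2], 16) == int(parsed[0], 16):
            hits.append(parsed[0])
    return hits


print(title_empty_upper_given(SAMPLE_LINES))
print(title_is_self(SAMPLE_LINES))
```

To test a real data file, pass an open file object instead of SAMPLE_LINES; both functions only iterate over lines.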

See my response in the bug tracker for further info/comment.

Martin said:
So unicodectype is right to fall back to uppercase if no titlecase
mapping is given.

Correct -- but this is currently hypothetical; moreover the "fallback"
is being done in the wrong place; it should be done in
Tools/unicode/makeunicodedata.py when it reads the UnicodeData.txt file.
The current implementation codes the ch.title() -> ch mapping as
delta = 0, which is the same coding as used for "no titlecase specified
in file", leaving the runtime unicodectype with a dilemma which it
resolves wrongly -- it is *NOT* correct to pick uppercase when the
titlecase is actually specified in the UnicodeData.txt file.
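
A rough sketch of the ambiguity, and of resolving it while the data file is read, might look like this. It is modelled on the description above rather than on the real makeunicodedata.py; both function names and the second test character are hypothetical.

```python
def title_delta_ambiguous(code, upper_field, title_field):
    """Encode only the title field: 'empty' and 'self' both become 0."""
    title = int(title_field, 16) if title_field else code
    return title - code


def title_delta_resolved(code, upper_field, title_field):
    """Apply the fallback at table-build time, then encode the delta."""
    if title_field:
        title = int(title_field, 16)   # explicit titlecase always wins
    elif upper_field:
        title = int(upper_field, 16)   # omitted titlecase: use uppercase
    else:
        title = code                   # no mapping at all
    return title - code


# U+01C5: the titlecase field names the character itself.
explicit_self = (0x01C5, "01C4", "01C5")
# Hypothetical character: uppercase given, titlecase field left empty.
omitted_title = (0xF0000, "F0001", "")

for code, upper, title in (explicit_self, omitted_title):
    print("U+%04X: ambiguous delta %d, resolved delta %d" % (
        code,
        title_delta_ambiguous(code, upper, title),
        title_delta_resolved(code, upper, title),
    ))
```

With the resolved table the runtime can add the title delta unconditionally, since zero then only ever means "maps to itself".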

Note that although it's not mentioned in the modification history for
UnicodeData.txt, the titlecase entry for the 4 characters changed from
"empty" to "self" in Unicode 4.0.0.

HTH,
John
 
