Incorrect title case?

MRAB · Jan 16, 2009

Python 2.6.1

I've just found that the following 4 Unicode characters/codepoints don't
behave as I'd expect: ǅ (U+01C5), ǈ (U+01C8), ǋ (U+01CB), ǲ (U+01F2).

For example, u"\u01C5".istitle() returns True and
unicodedata.category(u"\u01C5") returns "Lt", but u"\u01C5".title()
returns u'\u01C4', which is the uppercase equivalent. Are these mistakes
in the Unicode database?
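
To reproduce the observation, here is a minimal sketch. It assumes Python 3 syntax (the post used 2.6.1 with u"" literals), and `DIGRAPHS` and `describe` are names made up for this example. On the build discussed here the last column showed the uppercase character; other Python versions may print something different.

```python
import unicodedata

# The four titlecase digraphs named in the post above.
DIGRAPHS = ["\u01C5", "\u01C8", "\u01CB", "\u01F2"]


def describe(ch):
    """Return (code point, general category, istitle(), title() result)."""
    titled = " ".join("U+%04X" % ord(c) for c in ch.title())
    return ("U+%04X" % ord(ch), unicodedata.category(ch), ch.istitle(), titled)


for ch in DIGRAPHS:
    # If title() leaves the character unchanged, the first and last
    # columns are identical.
    print(describe(ch))
```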

John Machin · Jan 17, 2009

MRAB said:
Python 2.6.1

I've just found that the following 4 Unicode characters/codepoints don't
behave as I'd expect: ǅ (U+01C5), ǈ (U+01C8), ǋ (U+01CB), ǲ (U+01F2).

For example, u"\u01C5".istitle() returns True and
unicodedata.category(u"\u01C5") returns "Lt", but u"\u01C5".title()
returns u'\u01C4', which is the uppercase equivalent. Are these mistakes
in the Unicode database?

Doesn't look like it. AFAICT it's a mistake in Objects/unicodectype.c,
function _PyUnicode_ToTitlecase.

See http://svn.python.org/view/python/trunk/Objects/unicodectype.c?rev=66362&view=markup

The code that says:
    if (ctype->title)
        delta = ctype->title;
    else
        delta = ctype->upper;
should IMHO merely be:
    delta = ctype->title;

A value of zero for ctype->title should be interpreted simply as the
offset to add to the ordinal, as it is in the sibling
_PyUnicode_To(Upper|Lower)case functions. See also
Tools/unicode/makeunicodedata.py, which treats upper, lower and title
identically when preparing the tables used by those 3 functions.

AFAICT making that change will fix the problem for those four
characters and not ruin any others.
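
The difference between the two branches can be modelled in a few lines of Python. This is only a sketch of the logic described above, not the C source: `TypeRecord`, `RECORDS`, `to_title_current` and `to_title_proposed` are hypothetical names, and the table holds just the one character needed for the illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for the C type record: each field is an offset
# (delta) that gets added to the code point.
TypeRecord = namedtuple("TypeRecord", "upper lower title")

# Deltas for U+01C5: uppercase is U+01C4 (-1), lowercase is U+01C6 (+1),
# titlecase is the character itself (0).
RECORDS = {0x01C5: TypeRecord(upper=-1, lower=+1, title=0)}


def to_title_current(code):
    """Model of the existing logic: a zero title delta falls back to upper."""
    rec = RECORDS[code]
    delta = rec.title if rec.title else rec.upper
    return code + delta


def to_title_proposed(code):
    """Model of the suggested change: always use the title delta."""
    return code + RECORDS[code].title


print("current:  U+%04X" % to_title_current(0x01C5))   # U+01C4
print("proposed: U+%04X" % to_title_proposed(0x01C5))  # U+01C5
```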

The error that you noticed occurs as far back as I've looked (2.1) and
also occurs in 3.0.

Cheers,
John

Terry Reedy · Jan 17, 2009

John said:
Doesn't look like it. AFAICT it's a mistake in Objects/unicodectype.c,
function _PyUnicode_ToTitlecase.

See http://svn.python.org/view/python/trunk/Objects/unicodectype.c?rev=66362&view=markup

The code that says:
    if (ctype->title)
        delta = ctype->title;
    else
        delta = ctype->upper;
should IMHO merely be:
    delta = ctype->title;

A value of zero for ctype->title should be interpreted simply as the
offset to add to the ordinal, as it is in the sibling
_PyUnicode_To(Upper|Lower)case functions. See also
Tools/unicode/makeunicodedata.py, which treats upper, lower and title
identically when preparing the tables used by those 3 functions.

AFAICT making that change will fix the problem for those four
characters and not ruin any others.

The error that you noticed occurs as far back as I've looked (2.1) and
also occurs in 3.0.

Please post a report to the tracker at bugs.python.org.

MRAB · Jan 17, 2009

Terry said:
Please post a report to the tracker at bugs.python.org.

Already done: http://bugs.python.org/issue4971

Martin v. Löwis · Jan 17, 2009

John said:
A value of zero for ctype->title should be interpreted simply as the
offset to add to the ordinal, as it is in the sibling
_PyUnicode_To(Upper|Lower)case functions.

Interestingly enough, according to the spec of UnicodeData.txt,
these should *not* be siblings. Refer to

http://www.unicode.org/Public/UNIDATA/UCD.html

For lower and upper case, it says

Note: The simple uppercase is omitted in the data file if the uppercase
is the same as the code point itself.

whereas for titlecase, it says

Note: The simple titlecase may be omitted in the data file if the
titlecase is the same as the uppercase.

So unicodectype is right to fall back to uppercase if no titlecase
mapping is given.

However, this looks like a bug in UCD.html: they probably should have
the same note for titlecase as they have for lower and uppercase
(at least, that's how UnicodeData seems to be generated).

Regards,
Martin

John Machin · Jan 18, 2009

Martin said:
Interestingly enough, according to the spec of UnicodeData.txt,
these should *not* be siblings. Refer to

http://www.unicode.org/Public/UNIDATA/UCD.html

For lower and upper case, it says

Note: The simple uppercase is omitted in the data file if the uppercase
is the same as the code point itself.

whereas for titlecase, it says

Note: The simple titlecase may be omitted in the data file if the
titlecase is the same as the uppercase.

However: (1) there seem to be no examples in the current data file
where the titlecase is empty and the uppercase is not empty;
(2) the titlecase is *NOT* empty for the four characters in question
-- they have [in effect] ch.title() -> ch as MRAB expected.

See my response in the bug tracker for further info/comment.

Martin said:
So unicodectype is right to fall back to uppercase if no titlecase
mapping is given.

Correct -- but this is currently hypothetical; moreover the "fallback"
is being done in the wrong place; it should be done in
Tools/unicode/makeunicodedata.py when it reads the UnicodeData.txt file.
The current implementation codes the ch.title() -> ch mapping as
delta = 0, which is the same coding as used for "no titlecase specified
in file", leaving the runtime unicodectype with a dilemma which it
resolves wrongly -- it is *NOT* correct to pick uppercase when the
titlecase is actually specified in the UnicodeData.txt file.

Note that although it's not mentioned in the modification history for
UnicodeData.txt, the titlecase entry for the 4 characters changed from
"empty" to "self" in Unicode 4.0.0.
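
A rough sketch of what doing the fallback at table-build time could look like follows. It is not the code in makeunicodedata.py: `case_deltas` is a made-up helper, and the two sample lines only follow the UnicodeData.txt layout (fields 12, 13 and 14 being simple uppercase, lowercase and titlecase). Some fields are blanked for brevity, and the second line has its titlecase field deliberately emptied to exercise the fallback, which, as noted above, does not seem to occur in the real file.

```python
def case_deltas(line):
    """Return (code, upper_delta, lower_delta, title_delta) for one line.

    An empty uppercase or lowercase field means "same as the code point";
    an empty titlecase field falls back to the uppercase mapping here, at
    build time, so a title delta of 0 always means "maps to itself".
    """
    fields = line.split(";")
    code = int(fields[0], 16)
    upper = int(fields[12], 16) if fields[12] else code
    lower = int(fields[13], 16) if fields[13] else code
    title = int(fields[14], 16) if fields[14] else upper
    return code, upper - code, lower - code, title - code


# Illustrative line with an explicit titlecase mapping to itself.
SAMPLE_EXPLICIT = (
    "01C5;LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON;"
    "Lt;0;L;<compat> 0044 017E;;;;N;;;01C4;01C6;01C5"
)
# Hypothetical line: uppercase given, titlecase field left empty.
SAMPLE_OMITTED = "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;"

print(case_deltas(SAMPLE_EXPLICIT))  # (453, -1, 1, 0)
print(case_deltas(SAMPLE_OMITTED))   # (97, -32, 0, -32)
```

With the deltas prepared this way, the runtime lookup no longer needs to guess what a zero title delta means.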

HTH,
John
