Can upper() or lower() ever change the length of a string?

Steven D'Aprano · May 24, 2010

Do unicode.lower() or unicode.upper() ever change the length of the
string?

The Unicode standard allows for case conversions that change length, e.g.
sharp-S in German should convert to SS:

http://unicode.org/faq/casemap_charprop.html#6

but I see that Python doesn't do that:
'PAßSTRAßE'

The more I think about this, the more I think that upper/lower/title case
conversions should change length (at least sometimes) and if Python
doesn't do so, that's a bug. Any thoughts?
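
One way to check this empirically is to scan the code points and record those whose uppercase form is not a single character. This is only a sketch and assumes Python 3, where every string is Unicode; the helper name length_changing_upper is made up for illustration. The result depends on the interpreter: Python 3.3 and later apply the full Unicode case mappings, so there sharp-S does uppercase to SS, unlike the behaviour reported above.

```python
import sys

def length_changing_upper(limit=sys.maxunicode + 1):
    """Return the characters below `limit` whose upper() is not one character long."""
    return [chr(cp) for cp in range(limit) if len(chr(cp).upper()) != 1]

# Restrict the scan to the Basic Multilingual Plane to keep it quick.
changed = length_changing_upper(0x10000)
print(len(changed), "characters change length under upper()")
print("sharp s included:", "\u00df" in changed)
```

An empty list would mean the interpreter only does one-to-one replacement.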

Mark Dickinson · May 24, 2010

Do unicode.lower() or unicode.upper() ever change the length of the
string?

From looking at the source, in particular the fixupper and fixlower
functions in Objects/unicodeobject.c [1], it looks like not: they do a
simple character-by-character replacement.

[1] http://svn.python.org/view/python/trunk/Objects/unicodeobject.c?view=markup

Mark Dickinson · May 24, 2010

Digging a bit deeper, it looks like these methods use the
Simple_{Upper,Lower,Title}case_Mapping properties described at
http://www.unicode.org/Public/5.1.0/ucd/UCD.html (fields 12, 13 and 14
of the Unicode data). You can see this in
Tools/unicode/makeunicodedata.py, the Python script that generates the
database of Unicode properties. It contains code like:

    if record[12]:
        upper = int(record[12], 16)
    else:
        upper = char
    if record[13]:
        lower = int(record[13], 16)
    else:
        lower = char
    if record[14]:
        title = int(record[14], 16)

... and so on.

I agree that it might be desirable for these operations to produce the
multicharacter equivalents. That idea looks like a tough sell,
though: apart from backwards compatibility concerns (which could
probably be worked around somehow), it looks as though it would
require significant effort to implement.
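
To make the field numbers concrete, here is a rough sketch of how one semicolon-separated record in the UnicodeData.txt format could be turned into simple case mappings, following the logic quoted above. The function name and the two sample records are illustrative (the sharp-S record has its comment field left empty), and the fallback of the title mapping to the upper mapping is my reading of the UCD documentation, not a copy of the generator script.

```python
def simple_case_mappings(line):
    """Parse one UnicodeData.txt-style record into (char, upper, lower, title) code points."""
    record = line.split(";")
    char = int(record[0], 16)
    # Fields 12 and 13 hold the simple upper/lower mappings; empty means "maps to itself".
    upper = int(record[12], 16) if record[12] else char
    lower = int(record[13], 16) if record[13] else char
    # Assumption: an empty title field falls back to the upper mapping.
    title = int(record[14], 16) if record[14] else upper
    return char, upper, lower, title

samples = [
    "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041",
    "00DF;LATIN SMALL LETTER SHARP S;Ll;0;L;;;;;N;;;;;",
]
for line in samples:
    char, upper, lower, title = simple_case_mappings(line)
    print("U+%04X -> upper U+%04X, lower U+%04X, title U+%04X" % (char, upper, lower, title))
```

Each field holds at most one code point, so a character like sharp-S, which has no single-character uppercase form, simply maps to itself.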

MRAB · May 24, 2010

If we were to make such a change, I think we should also cater for
locale-specific case changes (passing the locale to 'upper', 'lower' and
'title').

For example, normally "i".upper() returns "I", but in Turkish
"i".upper() should return "İ" (the uppercase version of lowercase dotted
i is uppercase dotted I).
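
As a minimal sketch of what a Turkish-aware uppercase could look like, assuming Python 3 strings: upper_tr is a hypothetical helper, not an existing method or a proposed signature, and it handles only the two letters that differ from the default mapping.

```python
DOTTED_CAPITAL_I = "\u0130"   # LATIN CAPITAL LETTER I WITH DOT ABOVE
DOTLESS_SMALL_I = "\u0131"    # LATIN SMALL LETTER DOTLESS I

def upper_tr(text):
    """Uppercase `text` using the Turkish rules for the two i letters (hypothetical helper)."""
    # Handle the two Turkish-specific letters first, then defer to str.upper().
    text = text.replace("i", DOTTED_CAPITAL_I).replace(DOTLESS_SMALL_I, "I")
    return text.upper()

print("i".upper())            # locale-independent mapping
print(upper_tr("istanbul"))   # Turkish mapping keeps the dot on the capital
```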

Terry Reedy · May 24, 2010

Given that the current (simple) functions implement standard-defined
mappings, I think any change should be to *add* new
'complex-case-change' functions.
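
A separate function along those lines might look like the sketch below. The name upper_full and the two-entry table are purely illustrative; a real version would presumably be generated from the full set of multi-character mappings in the Unicode SpecialCasing.txt file.

```python
# Tiny excerpt of multi-character uppercase mappings (as listed in SpecialCasing.txt).
SPECIAL_UPPER = {
    "\u00df": "SS",   # LATIN SMALL LETTER SHARP S
    "\ufb01": "FI",   # LATIN SMALL LIGATURE FI
}

def upper_full(text):
    """Uppercase `text`, letting one character expand into several (hypothetical helper)."""
    return "".join(SPECIAL_UPPER.get(ch, ch.upper()) for ch in text)

word = "Pa\u00dfstra\u00dfe"
print(len(word), len(upper_full(word)))
```

Keeping this apart from upper() leaves the existing one-to-one behaviour untouched for code that relies on the length staying the same.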

Terry Jan Reedy
