Can upper() or lower() ever change the length of a string?

S

Steven D'Aprano

Do unicode.lower() or unicode.upper() ever change the length of the
string?

The Unicode standard allows for case conversions that change length, e.g.
sharp-S in German should convert to SS:

http://unicode.org/faq/casemap_charprop.html#6

but I see that Python doesn't do that:
'PAßSTRAßE'


The more I think about this, the more I think that upper/lower/title case
conversions should change length (at least sometimes) and if Python
doesn't do so, that's a bug. Any thoughts?
 
M

Mark Dickinson

Do unicode.lower() or unicode.upper() ever change the length of the
string?

The Unicode standard allows for case conversions that change length, e.g.
sharp-S in German should convert to SS:

http://unicode.org/faq/casemap_charprop.html#6

but I see that Python doesn't do that:


'PAßSTRAßE'

The more I think about this, the more I think that upper/lower/title case
conversions should change length (at least sometimes) and if Python
doesn't do so, that's a bug. Any thoughts?

Digging a bit deeper, it looks like these methods are using the
Simple_{Upper,Lower,Title}case_Mapping functions described at
http://www.unicode.org/Public/5.1.0/ucd/UCD.html fields 12, 13 and 14
of the unicode data; you can see this in the source in Tools/unicode/
makeunicodedata.py, which is the Python code that generates the
database of unicode properties. It contains code like:

if record[12]:
upper = int(record[12], 16)
else:
upper = char
if record[13]:
lower = int(record[13], 16)
else:
lower = char
if record[14]:
title = int(record[14], 16)

.... and so on.

I agree that it might be desirable for these operations to product the
multicharacter equivalents. That idea looks like a tough sell,
though: apart from backwards compatibility concerns (which could
probably be worked around somehow), it looks as though it would
require significant effort to implement.
 
M

MRAB

Mark said:
Do unicode.lower() or unicode.upper() ever change the length of the
string?

The Unicode standard allows for case conversions that change length, e.g.
sharp-S in German should convert to SS:

http://unicode.org/faq/casemap_charprop.html#6

but I see that Python doesn't do that:

'PAßSTRAßE'

The more I think about this, the more I think that upper/lower/title case
conversions should change length (at least sometimes) and if Python
doesn't do so, that's a bug. Any thoughts?

Digging a bit deeper, it looks like these methods are using the
Simple_{Upper,Lower,Title}case_Mapping functions described at
http://www.unicode.org/Public/5.1.0/ucd/UCD.html fields 12, 13 and 14
of the unicode data; you can see this in the source in Tools/unicode/
makeunicodedata.py, which is the Python code that generates the
database of unicode properties. It contains code like:

if record[12]:
upper = int(record[12], 16)
else:
upper = char
if record[13]:
lower = int(record[13], 16)
else:
lower = char
if record[14]:
title = int(record[14], 16)

... and so on.

I agree that it might be desirable for these operations to product the
multicharacter equivalents. That idea looks like a tough sell,
though: apart from backwards compatibility concerns (which could
probably be worked around somehow), it looks as though it would
require significant effort to implement.
If we were to make such a change, I think we should also cater for
locale-specific case changes (passing the locale to 'upper', 'lower' and
'title').

For example, normally "i".upper() returns "I", but in Turkish
"i".upper() should return "Ä°" (the uppercase version of lowercase dotted
i is uppercase dotted I).
 
T

Terry Reedy

Mark Dickinson wrote:
Digging a bit deeper, it looks like these methods are using the
Simple_{Upper,Lower,Title}case_Mapping functions described at
http://www.unicode.org/Public/5.1.0/ucd/UCD.html fields 12, 13 and 14
of the unicode data; you can see this in the source in Tools/unicode/
makeunicodedata.py, which is the Python code that generates the
database of unicode properties. It contains code like:

if record[12]:
upper = int(record[12], 16)
else:
upper = char
if record[13]:
lower = int(record[13], 16)
else:
lower = char
if record[14]:
title = int(record[14], 16)

... and so on.

I agree that it might be desirable for these operations to product the
multicharacter equivalents. That idea looks like a tough sell,
though: apart from backwards compatibility concerns (which could
probably be worked around somehow), it looks as though it would
require significant effort to implement.
If we were to make such a change, I think we should also cater for
locale-specific case changes (passing the locale to 'upper', 'lower' and
'title').

For example, normally "i".upper() returns "I", but in Turkish
"i".upper() should return "Ä°" (the uppercase version of lowercase dotted
i is uppercase dotted I).

Given that the current (siimple) functions implement standard-defined
functions, I think any change should be to *add* new
'complex-case-change' functions.

Terry Jan Reedy
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,995
Messages
2,570,226
Members
46,815
Latest member
treekmostly22

Latest Threads

Top