convert Unicode to lower/uppercase?

  • Thread starter Hallvard B Furuseth

jallan

A: No. The UnicodeData.txt file includes all of the 1:1 case mappings,
but doesn't include 1:many mappings such as the one needed for
uppercasing ß. Since many parsers now expect this file to have at most
single characters in the case mapping fields, an additional file
(SpecialCasing.txt) was added to provide the 1:many mappings. For more
information, see UTR #21- Case Mappings [MD]

Python specifications make an implied claim of full support for
Unicode and an implied claim that the function upper() uppercases a
string properly.

This is a contradiction: SpecialCasing contains 1:n mappings, whereas
.upper() can only return a single result. So how do you think
SpecialCasing should be considered in the implementation of .upper()?

I am not aware that it is philosophically a *necessary* feature of
.upper() that a single character not be replaced by a string of two or
more characters.

One should fix the contradiction by either changing the behavior of
.upper() so that it will properly case all strings or documenting
clearly that .upper() does not handle particular kinds of casing. Of
course users often don't read the documentation. :-(
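
To illustrate that nothing in a string type forces a 1:1 result, here is a minimal sketch. It assumes a recent Python 3 interpreter, where str.upper() applies the full case mappings; that is an assumption about current behavior, not about the interpreter under discussion in this thread.

```python
# Assumption: a recent Python 3, where str.upper() applies full (1:many) mappings.
samples = ["stra\u00dfe", "\ufb01sh"]  # "strasse" with sharp s; "fish" with the U+FB01 ligature

for s in samples:
    u = s.upper()
    # The uppercased string may be longer than the input string.
    print(repr(s), len(s), "->", repr(u), len(u))

upper_eszett = "\u00df".upper()    # LATIN SMALL LETTER SHARP S
upper_ligature = "\ufb01".upper()  # LATIN SMALL LIGATURE FI
```

The point is only that the return type can stay a plain string even when one character expands to two.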
Things are more difficult than they appear to be.

Yes.

Again and again one thinks one has a solution for a problem and then
exceptions turn up.

Again and again one finds things that one's code doesn't handle, often
because one failed to analyze the problem fully in the initial stages and
adopted algorithms that prove insufficient to handle the data found in
reality.

Jim Allan

Neil Hodgson

jallan:
(e-mail address removed) (Martin v. Löwis) wrote

I am not aware that it is philosophically a *necessary* feature of
.upper() that a single character not be replaced by a string of two or
more characters.

That is not the issue. The issue is that .upper would have to return a
list or map of results (for an illustrative but incorrect example
"ca~non".upper() -> {'portugal':'CANON','spain':'CA~NON'}), which would be
difficult for the caller to make use of without performing some additional
work, finding the correct result for its locale. It is simpler for the
caller to provide a locale argument in the .upper call or in its context.

Neil
 

Neil Hodgson

Me:
for an illustrative but incorrect example
"ca~non".upper() -> {'portugal':'CANON','spain':'CA~NON'}),

For a real example from the Microsoft web site, uppercasing "indigo"
(u'\u0069\u006e\u0064\u0069\u0067\u006f') gives "INDIGO"
(u'\u0049\u004e\u0044\u0049\u0047\u004f') for English-US and similar but
with dots above the 'I's for Turkish:
(u'\u0130\u004e\u0044\u0130\u0047\u004f').

Neil
 

jallan

Neil Hodgson said:
Me:


For a real example from the Microsoft web site, uppercasing "indigo"
(u'\u0069\u006e\u0064\u0069\u0067\u006f') gives "INDIGO"
(u'\u0049\u004e\u0044\u0049\u0047\u004f') for English-US and similar but
with dots above the 'I's for Turkish:
(u'\u0130\u004e\u0044\u0130\u0047\u004f').

The file http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt
purportedly contains *all* casings for all scripts for all languages
where the casings are not one-to-one or are otherwise not
straightforward.

The *only* locale oddities there are for Lithuanian and the two
languages Turkish and Azeri and concern only dot/no-dot variants of
the letters _i_, _I_, _j_, _J_ and no others.

There are *no* other locale-based oddities. The mess is thankfully
*very* limited in scope.

In my opinion, if the full Unicode casing specification is to be
followed, the most useful solution would be a parameter allowing the
user to choose among (1) normal Latin casing, (2) Turkish/Azeri or (3)
Lithuanian as the casing model for treatment of these letters.

The default for the parameter would either be based on current locale
or be normal Latin casing. I think the latter far better as it is
dangerous to have functions in a language differ from machine to
machine according to the current locale.
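
A rough sketch of what such a parameter could look like follows. The function name upper_with_model, the parameter name model, and its values are hypothetical, invented here for illustration; none of this is an existing Python API. Only the Latin and Turkish/Azeri models are covered, and the Lithuanian rules are left out of the sketch.

```python
# Hypothetical helper: the name, the "model" parameter, and its values are
# illustrative assumptions, not an existing Python API.
def upper_with_model(text, model="latin"):
    """Uppercase text under an explicitly chosen casing model."""
    if model == "latin":
        # Default, locale-independent casing.
        return text.upper()
    if model == "turkic":
        # Turkish/Azeri: dotted i uppercases to U+0130 (capital I with dot above).
        # Dotless U+0131 already uppercases to plain I under the default rules.
        return text.replace("i", "\u0130").upper()
    raise ValueError("unsupported casing model: %r" % (model,))


print(ascii(upper_with_model("indigo")))
print(ascii(upper_with_model("indigo", "turkic")))
```

Because the default is "latin", the result does not change from machine to machine unless the caller asks for another model explicitly.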

Also, in case someone brings it up, it was formerly standard to
generally omit diacritics on capital letters in Portuguese and in
French (in France but not in Quebec!)

This is no longer the norm for either language. See
http://www.academie-francaise.fr/langue/questions.html#accentuation
and http://www.press.uchicago.edu/Misc/Chicago/cmosfaq/cmosfaq.SpecialCharacters.html.

I have seen academic style sheets with a silly rule that diacritics
should be placed on capital letters as on lowercase letters except for
the word "A". See http://www.alphaacademic.co.uk/fcs.htm and
http://www.sagepub.com/journalManuscript.aspx?pid=9669&sc=1:

<< We use accents on capital letters, but capital A does not take a
grave accent. >>

It would not hurt to make a casing table customizable for such unusual
styles. But that is beyond Unicode's specifications.

A programmer who wishes odd customization beyond the norms of a
language and Unicode specifications can do it through transformations
outside of normal casing.
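
As a sketch of such an outside transformation, the style-sheet rule quoted above could be layered on top of ordinary uppercasing. The function name is hypothetical, and the sketch assumes precomposed (NFC) input, so a decomposed A followed by a combining grave accent is not handled.

```python
# Hypothetical house-style rule taken from the quoted style sheet:
# accents are kept on capitals, except that capital A takes no grave accent.
# Assumption: input is precomposed (NFC), so A-grave appears as U+00C0.
def upper_house_style(text):
    upper = text.upper()                 # normal casing first
    return upper.replace("\u00c0", "A")  # then the style-specific exception


print(ascii(upper_house_style("\u00e0 la carte")))
```

The casing step itself stays standard; the oddity lives in a separate, clearly named step.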

Jim Allan
 
