Trouble sorting lists (unicode/locale related?)

E

Erlend Fuglum

Hi everyone,

I'm having some trouble sorting lists. I suspect this might have
something to do with locale settings and/or character
encoding/unicode.

Consider the following example, text containing norwegian special
characters æ, ø and å.
liste = ["ola", "erlend", "trygve", "Ærlige anders", "Lars", "Øksemorderen", "Åsne", "Akrobatiske Anna", "leidulf"]
liste.sort()
liste
['Akrobatiske Anna', 'Lars', 'erlend', 'leidulf', 'ola', 'trygve',
'\xc5sne', '\xc6rlige anders', '\xd8ksemorderen']

There are a couple of issues for me here:
* The sorting method apparently places strings starting with uppercase
characters before strings staring with lowercase. I would like to
treat them them equally when sorting. OK, this could probably be fixed
by hacking with .toupper() or something, but isn't it possible to
achieve this in a more elegant way?

* The norwegian special characters are sorted in a wrong way.
According to our alphabet the correct order is (...) x, y, z, æ, ø å.
Python does it this way: (...) x, y, z, å, æ, ø ?

I would really appreciate any help and suggestions - I have been
fiddling with this mess for quite some time now :)
 
P

Peter Otten

Erlend said:
Hi everyone,

I'm having some trouble sorting lists. I suspect this might have
something to do with locale settings and/or character
encoding/unicode.

Consider the following example, text containing norwegian special
characters æ, ø and å.
liste = ["ola", "erlend", "trygve", "Ærlige anders", "Lars",
"Øksemorderen", "Åsne", "Akrobatiske Anna", "leidulf"] liste.sort()
liste
['Akrobatiske Anna', 'Lars', 'erlend', 'leidulf', 'ola', 'trygve',
'\xc5sne', '\xc6rlige anders', '\xd8ksemorderen']

There are a couple of issues for me here:
* The sorting method apparently places strings starting with uppercase
characters before strings staring with lowercase. I would like to
treat them them equally when sorting. OK, this could probably be fixed
by hacking with .toupper() or something, but isn't it possible to
achieve this in a more elegant way?

* The norwegian special characters are sorted in a wrong way.
According to our alphabet the correct order is (...) x, y, z, æ, ø å.
Python does it this way: (...) x, y, z, å, æ, ø ?

I would really appreciate any help and suggestions - I have been
fiddling with this mess for quite some time now :)

Try setting the appropriate locale first:

import locale
locale.setlocale(locale.LC_ALL, ("no", None))

Then for a case-insensitive sort:

wordlist.sort(locale.strcoll)

should do (disclaimer: all untested).

Peter
 
E

Erlend Fuglum

Try setting the appropriate locale first:

import locale
locale.setlocale(locale.LC_ALL, ("no", None))

Then for a case-insensitive sort:

wordlist.sort(locale.strcoll)

should do (disclaimer: all untested).

Grrrrrreat, that did it! Thank you very much!

Erlend
 
J

Jeff Epler

Try setting the appropriate locale first:

import locale
locale.setlocale(locale.LC_ALL, ("no", None))

Then for a case-insensitive sort:

wordlist.sort(locale.strcoll)

should do (disclaimer: all untested).

If this did work, then you can use the DSU pattern like so:

decorated_wordlist = [(locale.strxfrm(w), w) for w in wordlist]
decorated_wordlist.sort()
wordlist[:] = [i[1] for i in decorated_wordlist]

from "man strxfrm"
DESCRIPTION
The strxfrm() function transforms the src string into a form such that
the result of strcmp() on two strings that have been transformed with
strxfrm() is the same as the result of strcoll() on the two strings
before their transformation. The first n characters of the transformed
string are placed in dest. The transformation is based on the pro-
gram’s current locale for category LC_COLLATE. (See setlocale(3)).

Jeff
 
?

=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=

John said:
locale.setlocale(locale.LC_ALL, ("no", None))
[...]

If this did work, then you can use the DSU pattern like so:

[...strxfrm...]

The advantage of which is that you don't have to mess with the locale.

No, it doesn't. strxfrm requires that the locale is adjusted to the
desired LC_COLLATE facet. Otherwise, it will collate according to the
C locale (in which case strxfrm is the identity).

Regards,
Martin
 
J

John J. Lee

Martin v. Löwis said:
John said:
locale.setlocale(locale.LC_ALL, ("no", None)) [...]

If this did work, then you can use the DSU pattern like so:
[...strxfrm...]
The advantage of which is that you don't have to mess with the
locale.

No, it doesn't.

It does set the locale, you mean? So I guess there's no advantage
there at all?

[...]
strxfrm requires that the locale is adjusted to the
desired LC_COLLATE facet.
[...]


John
 
M

Martin v. =?iso-8859-15?q?L=F6wis?=

If this did work, then you can use the DSU pattern like so:
[...strxfrm...]
The advantage of which is that you don't have to mess with the
locale.

No, it doesn't.

It does set the locale, you mean?

Calling locale.strxfrm does not cause locale.setlocale to be called,
i.e. it does not set the locale. Likewise, calling locale.strcoll
does not cause setlocale to be called.

Instead, the application needs to call setlocale *explicitly* for
proper operation of either function.
So I guess there's no advantage there at all?

Using strxfrm is an advantage if you have to collate the same string
multiple times, e.g. when sorting a list of strings. It is an
advantage because sorting will complete faster.

Regards,
Martin
 
P

Peter Otten

Jeff said:
If this did work, then you can use the DSU pattern like so:

decorated_wordlist = [(locale.strxfrm(w), w) for w in wordlist]
decorated_wordlist.sort()
wordlist[:] = [i[1] for i in decorated_wordlist]

Everytime that someone posts a naive list.sort(compare), the DSU pattern is
proposed to improve execution speed.

So maybe it's about time to change the sort() method to support a second
argument

list.sort(compare=None, mapping=None)

that, if provided, would perform the DSU magic. Or was that already proposed
and rejected?

Peter
 
A

Alex Martelli

Peter said:
Jeff said:
If this did work, then you can use the DSU pattern like so:

decorated_wordlist = [(locale.strxfrm(w), w) for w in wordlist]
decorated_wordlist.sort()
wordlist[:] = [i[1] for i in decorated_wordlist]

Everytime that someone posts a naive list.sort(compare), the DSU pattern
is proposed to improve execution speed.

So maybe it's about time to change the sort() method to support a second
argument

list.sort(compare=None, mapping=None)

that, if provided, would perform the DSU magic. Or was that already
proposed and rejected?

I have not seen this proposed before, and I'm not very clear on what
the "compare" and "mapping" arguments are supposed to be in order to
let you specify any DSU. Basically it seems you would need two
callables, "decorate" to be called with each item in the list (to
return for each item the decorated tuple) and "undecorate" to be
called with each decorated tuple after the sort (to return the item
for the result). How do you turn that into "compare" and "mapping"?


Alex
 
D

Duncan Booth

I have not seen this proposed before, and I'm not very clear on what
the "compare" and "mapping" arguments are supposed to be in order to
let you specify any DSU. Basically it seems you would need two
callables, "decorate" to be called with each item in the list (to
return for each item the decorated tuple) and "undecorate" to be
called with each decorated tuple after the sort (to return the item
for the result). How do you turn that into "compare" and "mapping"?
You don't need two callables because the sort function would be doing
the decorating, so it knows also how to undecorate. I think the
suggestion is that the mapping argument returns something that can be
compared.

For example, here is a DSU function that does a not-in-place sort and
takes a suitable mapping argument. Changing it to in-place sort is,
of course, trivial.
newList = [ (aMapping(item), index, item) for (index, item)
in enumerate(aList) ]
newList.sort()
return [ item[2] for item in newList ]
print DSU(['Alex Martelli', 'Duncan Booth', 'Peter Otten', 'Jeff Epler'],
str.lower)
['Alex Martelli', 'Duncan Booth', 'Jeff Epler', 'Peter Otten'] names = name.split()
names = names[-1:] + names[:-1]
return [name.lower() for name in names]
print DSU(['Alex Martelli', 'Duncan Booth', 'Peter Otten', 'Jeff Epler'],
lastFirst)
['Duncan Booth', 'Jeff Epler', 'Alex Martelli', 'Peter Otten']
 
P

Peter Otten

Alex said:
I have not seen this proposed before, and I'm not very clear on what
the "compare" and "mapping" arguments are supposed to be in order to
let you specify any DSU. Basically it seems you would need two
callables, "decorate" to be called with each item in the list (to
return for each item the decorated tuple) and "undecorate" to be
called with each decorated tuple after the sort (to return the item
for the result). How do you turn that into "compare" and "mapping"?

My first idea was to add a .dsu(mapping) where the tuples in the decoration
phase would be generated as (mapping(item), item).
But I would rather enhance the already-existing sort() to provide for both
the traditional and the dsu sorting. Then you have to detect if the
function passed to sort(fun) takes one or two arguments, which is not
reliable when default values come into play. So I resorted to named
arguments. Didn't make it any clearer?

So here is a pure python prototype to shed some light on the matter:

class dsu(list):
def sort(self, compare=None, mapping=None):
if mapping:
decorated = [(mapping(i), i) for i in self]
decorated.sort()
self[:] = [i[1] for i in decorated]
else:
list.sort(self, compare)

sample = dsu("ab Aa AC d e f".split())

# "traditional" sort
sample.sort(lambda s, t: -cmp(s, t))
print sample

# decorate-sort-undecorate
sample.sort(mapping=lambda s: s.lower())
print sample


Peter
 
A

Alex Martelli

Duncan Booth wrote:
...
I have not seen this proposed before, and I'm not very clear on what
the "compare" and "mapping" arguments are supposed to be in order to
let you specify any DSU. Basically it seems you would need two
callables, "decorate" to be called with each item in the list (to
return for each item the decorated tuple) and "undecorate" to be
called with each decorated tuple after the sort (to return the item
for the result). How do you turn that into "compare" and "mapping"?
You don't need two callables because the sort function would be doing
the decorating, so it knows also how to undecorate. I think the
suggestion is that the mapping argument returns something that can be
compared.

For example, here is a DSU function that does a not-in-place sort and
takes a suitable mapping argument. Changing it to in-place sort is,
of course, trivial.
newList = [ (aMapping(item), index, item) for (index, item)

Ah, I see -- the alleged "mapping" is in fact meant to be a
*CALLABLE*, NOT a mapping in the Python sense. Yes, you could
get away with just one callable in this way, of course. It
also seems to me that shoe-horning that second argument (in
practice mutually exclusive with the first one) to the sort
method, when there seem to be no real advantages to keeping
it a method, is not a great idea. Rather, a new built-in
function appears to me to be best here -- something like:

def sort(iterable, decorate=None):
if decorate is None:
aux = [ (item, index, item)
for index, item in enumerate(iterable) ]
else:
aux = [ (decorate(item), index, item)
for index, item in enumerate(iterable) ]
aux.sort()
return [ item for __, __, item in aux ]

so as to have no arbitrary constraints on the iterable being
a list, or the sort being necessarily in-place, when there
is no performance advantage in so doing -- and also serve the
frequent need of a non-in-place sort returning the sorted
list without any actual need for decoration. E.g., I often
use the equivalent of:

for key in sort(somesmalldict):
print key, somesmalldict[key]

and it would be handy to have that little 'sort' function
built-in for all such uses.

The only downside I can see to this proposal is that Python
isn't really all that good at passing the callable decorate
argument -- one would end up with a lot of little ad hoc
functions, or lambdas, and neither is a great solution. It's
the kind of situation where Smalltalk/Ruby shine by letting
an arbitrary block of code (albeit one only) be passed into
any method (in Ruby, this DSU-sort would probably become a
method of the Enumerable "module" [mix-in] -- a rather elegant
solution overall, "architecturally" nicer-looking than Python's,
although "pragmatically" we're probably abot even).


Alex
 
S

Shu-Hsien Sheu

Hi,

I have a question about the comparison of efficiency of string slicing
and using string.startswith.
For example, which one of the following would be more efficient, or ,
moreover, more pythonic?

if aa[:3] == 'abc':

vs

if aa.startswith('abc'):


Thanks!

-shuhsien
 
D

Duncan Booth

My first idea was to add a .dsu(mapping) where the tuples in the
decoration phase would be generated as (mapping(item), item).

Note that if anyone proposes this seriously, it should generate a 3-tuple
(mapping(item), index, item) rather than the 2-tuple you suggest.

This is because the mapping function could reasonably be used to impose an
ordering on objects that have no natural order, so you need to be sure that
the comparison never falls through the the original object even where the
mapping compares equal. It also has the side effect of making the sort
stable (although if stability is a goal you might want another option to
reverse the sort which would use '-index' as the second element and call
..reverse() on the result).

FWIW, I think something like this belongs in the standard library rather
than as a method on lists or a new builtin.
 
A

Alex Martelli

Duncan Booth wrote:
...
FWIW, I think something like this belongs in the standard library rather
than as a method on lists or a new builtin.

Definitely not a method on lists, we agree on that. Problem is, there
is no module in the standard library to collect "functions that are often
useful but not QUITE often enough to warrant being built-ins" (and if
there was, there are quite a few current built-ins I'd like to relegate
into such an auxiliary/utilities module), so that finding a good place,
other than built-ins, for such utility functions, is often a bother.


Alex
 
B

Bob Gailer

At said:
Hi,

I have a question about the comparison of efficiency of string slicing and
using string.startswith.
For example, which one of the following would be more efficient, or ,
moreover, more pythonic?

if aa[:3] == 'abc':

vs

if aa.startswith('abc'):

Here's one way to address this kind of question:
.... st = time.time()
.... for i in range(500000):
.... if aa.startswith('abc'):pass
.... print time.time() - st
........ st = time.time()
.... for i in range(500000):
.... if aa[:3] == 'abc':pass
.... print time.time() - st
....1.01100003719

Bob Gailer
(e-mail address removed)
303 442 2625
 
D

Duncan Booth

Duncan Booth wrote:
...

Definitely not a method on lists, we agree on that. Problem is, there
is no module in the standard library to collect "functions that are often
useful but not QUITE often enough to warrant being built-ins" (and if
there was, there are quite a few current built-ins I'd like to relegate
into such an auxiliary/utilities module), so that finding a good place,
other than built-ins, for such utility functions, is often a bother.
So someone needs to write a PEP proposing such a module.
 
P

Peter Hansen

Shu-Hsien Sheu said:
I have a question about the comparison of efficiency of string slicing
and using string.startswith.
For example, which one of the following would be more efficient, or ,
moreover, more pythonic?

if aa[:3] == 'abc':

vs

if aa.startswith('abc'):

The latter is clearly more readable.
 
D

David Eppstein

For example, which one of the following would be more efficient, or ,
moreover, more pythonic?

if aa[:3] == 'abc':

vs

if aa.startswith('abc'):

The latter is clearly more readable.[/QUOTE]

More Pythonic, too, I think. "Readability counts," and "There should be
one-- and preferably only one --obvious way to do it." In this case,
startswith must be the one obvious way, or else why would it exist in
the standard library at all?
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,995
Messages
2,570,228
Members
46,818
Latest member
SapanaCarpetStudio

Latest Threads

Top