No, when you program you know how you will be using the data structure,
so you can choose the implementation that's right for the application.
I don't understand why you have prefixed that sentence with "No". It
seems like you are agreeing with bearophile's point.
That's what the container libraries for other languages do. At the
moment, you just have one implementation, and have to hope it's right
for the job.
To the extent that this is true, it is a sign of the poverty of the
standard Python library. But in fact it isn't true. Python has quite a
decent collection of collection types. Just off the top of my head:
tuple, namedtuple, list, set, frozenset, dict, defaultdict, queue, deque
set, list and dict are highly optimized for general use, and are very
suitable for building new data structures. And you are wrong to say that
we have to "hope it's right for the job", we can build perfectly
acceptable pure-Python data structures from these building blocks, or add
our own C extensions. You're never forced to use a standard library data
structure, it's a trade-off of your time and effort against runtime
efficiency. And because the standard library types are generally so
excellent, that trade-off is often (but not always) a no-brainer.
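For instance (a quick sketch of my own, not anything in the standard
library, and the names are purely illustrative): here is a perfectly
serviceable multi-valued mapping built out of nothing but dict and list.

class MultiDict(object):
    def __init__(self):
        # a plain dict does all the heavy lifting
        self._data = {}
    def add(self, key, value):
        # append the value to the list stored under key, creating it if needed
        self._data.setdefault(key, []).append(value)
    def get(self, key):
        return self._data.get(key, [])

md = MultiDict()
md.add('a', 1)
md.add('a', 2)
print md.get('a')    # prints [1, 2]

A dozen lines of pure Python, and nobody had to wait for a new
"implementation" argument on dict to get it.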
Adding an *optional* parameter that says, in effect, "I
want this list optimised for writes and reads at both ends" or "I want
this list optimised for fully random reads but writes only at the end"
doesn't *lose* you any information about what you're programming with.
It might not lose information, but it does obfuscate it.
Right now, we can do this:
if isinstance(obj, list):
    print "optimized for random reads and writes at the end"
elif isinstance(obj, deque):
    print "optimized for writes at both the start and end"
If you care about such things, it's easy to find out, because there is a
one-to-one mapping from type to behaviour.
Under Lie's suggestion, we would have a one-to-many mapping from type to
behaviour, forcing us to do this:
if isinstance(obj, list):
    if obj.implementation == 'deque':
        print "optimized for writes at both the start and end"
    else:
        print "optimized for random reads and writes at the end"
That's assuming the implementation is available at runtime at all; if
it's not, then you have lost information.
Of course it's essential that the data structure has identical
/functional/ behaviour whatever optimisation you use.
"Essential"? That's an interesting point. Let's look at an actual example.
Consider one of Lie's initial examples (paraphrased):
D1 = dict({1: 'a', 2: 'b'}) # standard hash-table
D2 = dict({1: 'a', 2: 'b'}, implementation="binary tree")
You say it's "essential" that the binary tree implementation has
identical functional behaviour to standard hash-table implementation of
dict. Okay, let's see where that leads us. The functional behaviour of
hash-tables is that they are O(1) for reads, while binary trees are
O(log(N)). The only way for them to have identical functional behaviour
is to *artificially* slow down dicts to the lowest common speed. That is
a Bad Thing, and I don't understand why you think it is essential.
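(If you want to see that difference in numbers rather than in big-O
notation, here is a rough-and-ready timing sketch of mine. The standard
library has no binary tree, so bisect on a sorted list stands in for the
O(log N) lookup; repeat it at a few different sizes of N and the dict
times should stay essentially flat while the bisect times slowly grow.)

import timeit

setup = """
import bisect
keys = list(range(10**6))
d = dict.fromkeys(keys)
"""

# average-case O(1): hash lookup in a dict
print timeit.timeit("999999 in d", setup=setup, number=100000)
# O(log N): binary search in a sorted list, standing in for a tree lookup
print timeit.timeit("bisect.bisect_left(keys, 999999)", setup=setup, number=100000)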
But perhaps you didn't actually mean identical behaviour, but merely an
identical *interface*. That's more reasonable: we no longer have to
artificially change the behaviour of the type -- but we do have to
artificially cripple the API of the type!
The advantage of binary trees as a mapping is that they are ordered.
Binary trees have a rich set of traversal methods: you can walk the tree
in pre-order, in-order, post-order or breadth-first order. But hash-tables don't
have *any* of those, so now we have to forbid all binary-tree traversal
methods. In which case, why bother paying the cost of binary trees if you
don't get the extra functionality?
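To make that concrete, here is a bare-bones sketch of my own (there is no
such type in the standard library; Node, insert and inorder are made-up
names) showing the kind of ordered traversal a binary search tree gives
you for free and a hash table simply cannot:

class Node(object):
    def __init__(self, key, value):
        self.key, self.value = key, value
        self.left = self.right = None

def insert(root, key, value):
    # plain unbalanced binary-search-tree insertion
    if root is None:
        return Node(key, value)
    if key < root.key:
        root.left = insert(root.left, key, value)
    else:
        root.right = insert(root.right, key, value)
    return root

def inorder(root):
    # yield (key, value) pairs in sorted key order
    if root is not None:
        for pair in inorder(root.left):
            yield pair
        yield (root.key, root.value)
        for pair in inorder(root.right):
            yield pair

root = None
for k, v in [(2, 'b'), (1, 'a'), (3, 'c')]:
    root = insert(root, k, v)
print list(inorder(root))    # prints [(1, 'a'), (2, 'b'), (3, 'c')]

None of that makes sense for a hash table, which is exactly the point:
forcing both behind one interface means throwing the traversal methods away.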
But maybe you'll argue that Lie's examples are bad examples. Perhaps what
you have in mind is something like the difference between "dicts
implemented as hash tables with chaining" and "dicts implemented as hash
tables with open addressing". (Aside: what sort of open addressing?
Linear? Quadratic? Something else?) At least now we're talking about
different implementations with the same API and more-or-less the same
functional behaviour.
I'd suggest that 99% of the time, client code simply shouldn't care about
the differences. The existing implementations are Good Enough. Have you
*ever* found yourself saying "I can't use the built-in dict, because it
uses chaining and for my application I need linear open addressing"? (Or
vice versa.)
I doubt you've ever said that. I doubt you know anyone who has said it.
I doubt you know *of* anyone who has said it. And even if you have, then
you are free to go and write your own dict type as a C extension type and
call it dict_with_linear_openaddressing and write this:
D = dict_with_linear_openaddressing(data)
instead of:
D = dict(data, implementation="linear open addressing")
Yet again Lie's proposal gains you nothing. In fact, as an application
writer, you're better off writing your own type rather than waiting for
it to be added to the built-in implementation of dict.
The one exception to this is when you're developing a *new type* that
doesn't already exist in the standard library, and you haven't decided
which implementation(s) you should use, so you write multiple
implementations for testing. Here's the wrong way to do it:
class Foo(object):
    def __init__(self, data, implementation='standard'):
        self.implementation = implementation
        if implementation == 'standard':
            # save data one way
            self.data = foo(data)
        elif implementation == 'magic':
            # save data another way
            self.data = bar(data)
        elif implementation == 'green':
            # save data a third way
            self.data = baz(data)
        else:
            raise ValueError('unknown implementation')
    def __len__(self):
        if self.implementation == 'standard':
            return len(self.data)
        elif self.implementation == 'magic':
            return len(self.data[0]) + len(self.data[1:])
        elif self.implementation == 'green':
            return self._len
Why would you write it that way? Worst Design Ever -- it's confusing and
error-prone, and it fails to obey the Don't Repeat Yourself principle,
which increases the risk of errors. And testing and documentation are
significantly more complicated.
Here's a better way:
class Foo(object):
    class _StandardFoo(object):
        def __init__(self, data):
            self.data = foo(data)
        def __len__(self):
            return len(self.data)
    class _MagicFoo(object):
        def __init__(self, data):
            self.data = bar(data)
        def __len__(self):
            return len(self.data[0]) + len(self.data[1:])
    class _GreenFoo(object):
        def __init__(self, data):
            self.data = baz(data)
        def __len__(self):
            return self._len
    _mapping = {'standard': _StandardFoo,
                'magic': _MagicFoo, 'green': _GreenFoo}
    def __new__(cls, data, implementation='standard'):
        try:
            return cls._mapping[implementation](data)
        except KeyError:
            raise ValueError('unknown implementation')
Here's an even better way:
class StandardFoo(object):
    def __init__(self, data):
        self.data = foo(data)
    def __len__(self):
        return len(self.data)

class MagicFoo(object):
    def __init__(self, data):
        self.data = bar(data)
    def __len__(self):
        return len(self.data[0]) + len(self.data[1:])

class GreenFoo(object):
    def __init__(self, data):
        self.data = baz(data)
    def __len__(self):
        return self._len
and let the client code call whichever implementation they want.
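(For what it's worth, the difference at the call site is tiny. Assuming
the placeholder foo/bar/baz helpers above are filled in, the factory
version is spelled

f = Foo(data, implementation='magic')

and the separate-classes version is spelled

f = MagicFoo(data)

Either way the client ends up holding the 'magic' implementation; the
only question is whether that choice is written as a string argument or
as a class name.)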
[snip]
They're not different data structures from the client point of view.
But of course they are. They have different functional behaviour, and
different APIs. Unless we artificially cripple the implementations and
make them all identical (which defeats the purpose of having different
implementations!) they are different data structures, and those
differences will be visible to the client code.
This scheme is a case of over-generalisation. It's also impractical: who
is going to spend the time and effort to do it? It's hard enough to get
somebody to write a binary tree implementation for the standard library,
without expecting them to *also* write a wrapper for it to make it look
like a dict. It's working against the grain of Python, which is well
suited for the one-type = one-API model instead of the one-type = many-
APIs model which Lie is suggesting. And for what benefit? The only
benefit suggested so far is that we can write:
dict({1: 'a', 2: 'b'}, implementation="binary tree")
instead of
binarytree({1: 'a', 2: 'b'})
If you call that a benefit. I don't.