Tuples and immutability


Steven D'Aprano

That's true but irrelevant to my point, which was to counter the
assertion that mutable types can always be assumed to be able to perform
operations in-place.

"Always"? Not so fast.

This is Python. We have freedom to abuse nearly everything, and if you
want to shoot yourself in the foot, you can. With the exception of a
handful of things which cannot be overridden (e.g. None, numeric
literals, syntax) you cannot strictly assume anything about anything.
Python does not enforce that iterators raise StopIteration when empty, or
that indexing beyond the boundaries of a sequence raises IndexError, or
that __setitem__ of a mapping sets the key and value, or that __len__
returns a length.

Augmented assignment is no different. The docs describe the intention of
the designers and the behaviour of the classes that they control, so with
standard built-in classes like int, str, list, tuple etc. you can safely
assume that mutable types will perform the operation in place and
immutable types won't, but with arbitrary types from some arbitrarily
eccentric or twisted programmer, who knows what it will do?
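
A contrived sketch of that point: nothing stops a class, even a list
subclass, from giving += behaviour that contradicts the usual
mutable/immutable expectations (the class name here is made up for
illustration):

    class Perverse(list):
        def __iadd__(self, other):
            # An "eccentric" mutable type: += returns a brand-new object
            # instead of extending in place.
            return Perverse(other)

    p = Perverse([1, 2, 3])
    alias = p
    p += [4, 5, 6]
    print(p)        # [4, 5, 6]  -- not [1, 2, 3, 4, 5, 6]
    print(alias)    # [1, 2, 3]  -- the original was never mutated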
 

Ian Kelly

"Always"? Not so fast.

This is Python. We have freedom to abuse nearly everything, and if you
want to shoot yourself in the foot, you can. With the exception of a
handful of things which cannot be overridden (e.g. None, numeric
literals, syntax) you cannot strictly assume anything about anything.
Python does not enforce that iterators raise StopIteration when empty, or
that indexing beyond the boundaries of a sequence raises IndexError, or
that __setitem__ of a mapping sets the key and value, or that __len__
returns a length.

Thank you; you've stated my point more succinctly than I did.
Augmented assignment is no different. The docs describe the intention of
the designers and the behaviour of the classes that they control, so with
standard built-in classes like int, str, list, tuple etc. you can safely
assume that mutable types will perform the operation in place and
immutable types won't, but with arbitrary types from some arbitrarily
eccentric or twisted programmer, who knows what it will do?

This got me curious about how consistent the standard library is about
this exactly, so I did some grepping. In the standard library there
are 5 mutable types that support concatenation that I was able to
find: list, deque, array, bytearray, and Counter. There are none that
support addition, which I find interesting in that the language
provides hooks for in-place addition but never uses them itself.

All of the classes above appear to follow the rule that if you can
concatenate an operand, you can in-place concatenate the same operand.
The converse however does not hold: list.__iadd__ and
Counter.__iadd__ are both more permissive in what types they will
accept than their __add__ counterparts, and especially interesting to
me is that deque implements __iadd__ but does not implement __add__ at
all. This last in particular seems to support the assertion that +=
should be viewed more as a shorthand for an in-place operation, less
as an equivalent for x = x + y.

>>> l = [1,2,3]
>>> l + (4,5,6)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: can only concatenate list (not "tuple") to list
>>> l += (4,5,6)
>>> l
[1, 2, 3, 4, 5, 6]
>>> import collections
>>> c = collections.Counter('mississippi')
>>> c + {'a': 4, 'l': 1, 'b': 1, 'm': 1}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for +: 'Counter' and 'dict'
>>> c += {'a': 4, 'l': 1, 'b': 1, 'm': 1}
>>> c
Counter({'s': 4, 'a': 4, 'i': 4, 'p': 2, 'm': 2, 'b': 1, 'l': 1})
>>> d = collections.deque([1,2,3])
>>> d += [4,5,6]
>>> d
deque([1, 2, 3, 4, 5, 6])
>>> d + [7,8,9]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for +: 'collections.deque' and 'list'
>>> d.__add__
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'collections.deque' object has no attribute '__add__'
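
The dispatch rule behind this is worth spelling out: x += y tries
type(x).__iadd__ first and only falls back to x = x + y when __iadd__ is
missing (or returns NotImplemented), which is how deque can support +=
without defining __add__ at all. A minimal made-up class shows the same
shape:

    class IAddOnly:
        def __init__(self, items):
            self.items = list(items)
        def __iadd__(self, other):
            self.items.extend(other)   # a genuine in-place update
            return self

    x = IAddOnly([1, 2])
    x += [3, 4]                        # uses __iadd__
    print(x.items)                     # [1, 2, 3, 4]
    try:
        x + [5, 6]                     # no __add__ or __radd__ anywhere
    except TypeError as exc:
        print(exc)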
 

Terry Reedy

"Always"? Not so fast.

This is Python. We have freedom to abuse nearly everything, and if you
want to shoot yourself in the foot, you can. With the exception of a
handful of things which cannot be overridden (e.g. None, numeric
literals, syntax) you cannot strictly assume anything about anything.
Python does not enforce that iterators raise StopIteration when empty, or
that indexing beyond the boundaries of a sequence raises IndexError, or
that __setitem__ of a mapping sets the key and value, or that __len__
returns a length.

Thank you; you've stated my point more succinctly than I did.
Augmented assignment is no different. The docs describe the intention of
the designers and the behaviour of the classes that they control, so with
standard built-in classes like int, str, list, tuple etc. you can safely
assume that mutable types will perform the operation in place and
immutable types won't, but with arbitrary types from some arbitrarily
eccentric or twisted programmer, who knows what it will do?

This got me curious about how consistent the standard library is about
this exactly, so I did some grepping. In the standard library there
are 5 mutable types that support concatenation that I was able to
find: list, deque, array, bytearray, and Counter. There are none that
support addition, which I find interesting in that the language
provides hooks for in-place addition but never uses them itself.

All of the classes above appear to follow the rule that if you can
concatenate an operand, you can in-place concatenate the same operand.
The converse however does not hold: list.__iadd__ and
Counter.__iadd__ are both more permissive in what types they will
accept than their __add__ counterparts, and especially interesting to
me is that deque implements __iadd__ but does not implement __add__ at
all. This last in particular seems to support the assertion that +=
should be viewed more as a shorthand for an in-place operation, less
as an equivalent for x = x + y.
l = [1,2,3]
l + (4,5,6)
Traceback (most recent call last):
[1, 2, 3, 4, 5, 6]

Like it or not, one should actually think of 'somelist += iterable' as
equivalent to 'somelist.extend(iterable)'. Without looking at the C
code, I suspect that the latter is the internal implementation.
collections.deque also has .extend. collections.Counter has .update,
and that is what += seems to be doing.

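A quick check of that equivalence, for what it's worth: list.__iadd__
behaves like extend() and therefore accepts any iterable, while list +
requires another list:

    a = [1, 2, 3]
    a += "xy"          # same as a.extend("xy")
    print(a)           # [1, 2, 3, 'x', 'y']

    b = [1, 2, 3]
    b.extend("xy")
    print(a == b)      # True

    try:
        [1, 2, 3] + "xy"
    except TypeError as exc:
        print(exc)     # can only concatenate list (not "str") to list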
 

Ian Kelly

The main thing is there are use cases where order is essential. That's
why I have had to implement the AVL tree in Python myself. No biggie,
but a C implementation would probably be much faster. Also, a standard
version would likely be reviewed and tested better and have all Pythonic
accessors in place.

That last is actually quite easy to achieve, especially using the
built-in ABCs. Just subclass from collections.MutableMapping and
implement __len__, __iter__, __getitem__, __setitem__ and __delitem__;
and default implementations of the rest will be mixed in for you. Or
you can override those with optimized implementations.
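
To make that concrete, here is a minimal sketch (the SortedDict name and
its bisect-based storage are just an illustration, not the AVL tree under
discussion); the five abstract methods are implemented and the ABC mixes
in update(), get(), pop(), setdefault(), keys(), items() and the rest:

    import bisect
    from collections import MutableMapping  # collections.abc.MutableMapping on newer Pythons

    class SortedDict(MutableMapping):
        def __init__(self):
            self._keys = []          # kept sorted
            self._values = {}        # key -> value

        def __len__(self):
            return len(self._keys)

        def __iter__(self):
            return iter(self._keys)  # iterate in sorted key order

        def __getitem__(self, key):
            return self._values[key]

        def __setitem__(self, key, value):
            if key not in self._values:
                bisect.insort(self._keys, key)
            self._values[key] = value

        def __delitem__(self, key):
            del self._values[key]
            self._keys.remove(key)

    d = SortedDict()
    d.update(banana=3, apple=1, cherry=2)   # update() comes from the mixin
    print(list(d))                          # ['apple', 'banana', 'cherry']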
 

Steven D'Aprano

The main thing is there are use cases where order is essential. That's
why I have had to implement the AVL tree in Python myself.

from collections import OrderedDict

gives you a fast, ordered mapping. In this case, the order is that of
insertion order. If you want sort order instead, for "small" amounts of
data, say below a million or ten million items, you're likely to have
acceptable if not superior results by just sorting the items when and as
needed.

Otherwise, I expect that following the basic technique of OrderedDict,
and building your data structure from a dict and an associated list,
keeping the two in sync, will be faster than a pure Python implementation
of an AVL tree. But of course only testing it will make that clear.

http://code.activestate.com/recipes/576693-ordered-dictionary-for-py24/

Modifying the above recipe to keep items in something other than
insertion order is left as an exercise.
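
A rough sketch of both options, with arbitrary data sizes:

    import random
    from collections import OrderedDict

    d = {random.random(): None for _ in range(100000)}

    # Sorted traversal on demand; at this size Timsort is fast enough that
    # maintaining a tree structure rarely pays for itself.
    for key in sorted(d):
        pass  # process keys in ascending order

    # Insertion order instead: that is what OrderedDict gives you.
    od = OrderedDict((k, str(k)) for k in [3, 1, 2])
    print(list(od))   # [3, 1, 2]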
 

Dan Stromberg

Steven D'Aprano said:
from collections import OrderedDict

gives you a fast, ordered mapping. In this case, the order is that of
insertion order. If you want sort order instead, for "small" amounts of
data, say below a million or ten million items, you're likely to have
acceptable if not superior results by just sorting the items when and as
needed.

This is one of my pet things.

Sorting the same (slightly tweaked) data inside of a tight loop is
rarely a good idea - despite the fact that the "sort" itself tends to
be O(n) with Python's rather awesome builtin sorting algorithm. This
is because sorting inside a loop tends to yield O(n^2) algorithms, or
even O((n^2)*logn).

But if you're sorting once at the end of whatever other processing,
sorting is great. It's (still) O(nlogn), but it's got a terrific
constant.
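
A toy illustration of the difference (the workload is invented, but the
shape is the common one):

    import random

    data = [random.random() for _ in range(10000)]

    # Re-sorting inside the loop: each .sort() call is cheap because the
    # list is already nearly sorted (roughly O(n) for Timsort), but doing
    # it n times gives O(n^2) work overall.
    acc = []
    for x in data:
        acc.append(x)
        acc.sort()

    # Sorting once at the end: the same final result for O(n log n).
    once = sorted(data)
    print(acc == once)   # True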
 

Steven D'Aprano

Dan Stromberg said:
Sorting the same (slightly tweaked) data inside of a tight loop is
rarely a good idea - despite the fact that the "sort" itself tends to be
O(n) with Python's rather awesome builtin sorting algorithm. This is
because sorting inside a loop tends to yield O(n^2) algorithms, or even
O((n^2)*logn).

I agree with you except for the word "rarely". It depends on how much
data you have, and in my experience most people think that N = 100 is a
lot :)

Remember that Big Oh doesn't say anything about the constant of
proportionality. If you have one function that behaves like O(N) but with
a constant factor of 1000, and another function that behaves like
O(N**2) with a constant factor of 0.001, the second, theoretically
slower, function will actually be faster for N < one million.
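
Worked through with those numbers: 1000*N equals 0.001*N**2 at
N = 1000/0.001 = 1,000,000, so the theoretically worse function wins
everywhere below that:

    def linear(n):
        return 1000 * n          # O(N), big constant

    def quadratic(n):
        return 0.001 * n * n     # O(N**2), tiny constant

    for n in (10**3, 10**5, 10**7):
        print(n, linear(n), quadratic(n), quadratic(n) < linear(n))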

So *in practice*, a solution that involves a theoretically worse Big Oh
function but a significantly lower constant factor may turn out to be
better for all the values you care about. Given the difference in speed
between Python's built-ins, and the code you can write yourself in pure
Python, it is normally better to push as much work as you can onto the
built-ins.

And of course, we should not make assumptions as to what is faster and
what is slower. After 15+ years, I'm still surprised about what ends up
being faster in Python.

Dan Stromberg said:
But if you're sorting once at the end of whatever other processing,
sorting is great. It's (still) O(nlogn), but it's got a terrific
constant.

Exactly!
 

Joshua Landau

I've found this link useful http://kmike.ru/python-data-structures/

I also don't want all sorts of data structures added to the Python library.
I believe that there are advantages to leaving specialist data structures on
pypi or other sites, plus it means Python in a Nutshell can still fit in
your pocket and not a 40 ton articulated lorry, unlike the Java equivalent.

The thing we really need is for the blist containers to become stdlib
(but not to replace the current list implementation). The rejected PEP
(http://legacy.python.org/dev/peps/pep-3128/) misses a few important
points, largely in how the "log(n)" has a really large base:
random.choice went from 1.2µs to 1.6µs from n=1 to n=10⁸, vs 1.2µs for
a standard list.

Further, it's worth considering a few advantages:

* copy is O(1), allowing code to avoid mutation by just copying its
input, which is good practice.

* FIFO is effectively O(1), as the time just about doubles from n=1 to
n=10⁸ so will never actually branch that much. There is still a speed
benefit of collections.deque, but it's much, much less significant.
This is very useful when considering usage as a multi-purpose data
structure, and removes demand for explicit linked lists (which have
foolishly been reimplemented loads of times).

* It reduces demand for trees:
  * There are efficient implementations of sortedlist, sortedset and
    sorteddict.
  * Slicing, slice assignment and slice deletion are really fast.
  * Addition of lists is sublinear. Instead of
    "list(itertools.chain(...))", one can add in a loop and end up
    *faster*.

I think blist isn't very popular not because it isn't really good, but
because it isn't a specialised structure. It is, however, almost there
for almost every circumstance. This can help keep the standard library
clean, especially of tree data structures.

Here's what we kill:

* Linked lists and doubly-linked lists, which are scarily popular for
whatever reason. Sometimes people claim that collections.deque isn't
powerful enough for whatever they want, and blist will almost
definitely sate those cases.

* Balanced trees, with blist.sortedlist. This is actually needed right now.

* Poor performance in the cases where a lot of list merging and pruning happens.

* Most uses of bisect.

* Some instances where two data structures are used in parallel in
order to keep performance fast on disparate operations (like `x in y`
and `y`).

Now, I understand there are downsides to blist. Particularly, I've
looked through the "benchmarks" and they seem untruthful. Further,
we'd need a maintainer. Finally, nobody jumps at blists because
they're rarely the obvious solution. Rather, they attempt to be a
different general solution. Hopefully, though, a stdlib inclusion
could make them a lot more obvious, and support in some current
libraries could make them feel more at home.

I don't know whether this is a good idea, but I do feel that it is
more promising and general than having a graph in the standard
library.
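
For the "most uses of bisect" point, a small comparison sketch (blist is
a third-party package and has to be installed separately):

    import bisect
    import random
    from blist import sortedlist   # pip install blist

    data = [random.random() for _ in range(10000)]

    # Plain list + bisect: each insort is O(n) because of the element shift.
    plain = []
    for x in data:
        bisect.insort(plain, x)

    # blist.sortedlist: inserts avoid the O(n) shifting.
    sl = sortedlist()
    for x in data:
        sl.add(x)

    print(list(sl) == plain)    # True
    print(sl[5000], 0.5 in sl)  # positional indexing and membership also work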
 

Mark Lawrence

Joshua Landau said:
The thing we really need is for the blist containers to become stdlib
(but not to replace the current list implementation). The rejected PEP
(http://legacy.python.org/dev/peps/pep-3128/) misses a few important
points [...]


I'd raise this on python-dev or python ideas as a result of reading PEP
3128. Specifically the second paragraph states Raymond Hettinger's sage
advice:-

"Depending on its success as a third-party module, it still has a chance
for inclusion in the collections module. The essential criteria for that
is whether it is a superior choice for some real-world use cases. I've
scanned my own code and found no instances where BList would have been
preferable to a regular list. However, that scan has a selection bias
because it doesn't reflect what I would have written had BList been
available. So, after a few months, I intend to poll comp.lang.python for
BList success stories. If they exist, then I have no problem with
inclusion in the collections module. After all, its learning curve is
near zero -- the only cost is the clutter factor stemming from
indecision about the most appropriate data structure for a given task."
 

Daniel Stutzbach

Joshua Landau said:
Now, I understand there are downsides to blist. Particularly, I've
looked through the "benchmarks" and they seem untruthful.


I worked hard to make those benchmarks as fair as possible. I recognize
that evaluating your own work always runs the risk of introducing hidden
biases, and I welcome input on how they could be improved.
 

Marko Rauhamaa

Joshua Landau said:
The thing we really need is for the blist containers to become stdlib
(but not to replace the current list implementation).

Very interesting. Downloaded blist but didn't compile it yet. It *could*
be the missing link.

I would *love* to see some comparative performance results between
blist.sorteddict and an AVL tree. A B+ tree could be nicer to the CPU
cache, but then again, the cache line is only a few pointers wide and
there appears to be a lot of shoving stuff left and right -- based on a
10-second "analysis" of the code:

while (src >= stop)
    *dst-- = *src--;

Personally, I find it a bit odd to place lists at the center of the
proposal and have trees come out as a side effect. I would rather see it
the other way round: sorteddict -> sortedset et al.

Joshua Landau said:
* It reduces demand for trees:

I would hope it would satisfy the demand for trees.

Joshua Landau said:
nobody jumps at blists because they're rarely the obvious solution.

I haven't jumped at them because they were nowhere to be found on <URL:
http://docs.python.org/3/genindex-B.html>.


Marko
 

Dan Stromberg

Marko Rauhamaa said:
Very interesting. Downloaded blist but didn't compile it yet. It *could*
be the missing link.

I would *love* to see some comparative performance results between
blist.sorteddict and an AVL tree.

I added blist.sorteddict and removed (temporarily) Pypy and Jython.
The results are at
http://stromberg.dnsalias.org/~strombrg/python-tree-and-heap-comparison/2014-03/

In short, blist.sorteddict didn't do that well, despite being in C.
In the random workloads, blist.sorteddict was dead last. In the
sequential workloads blist.sorteddict fell somewhere in the middle.

I excluded Pypy and Jython because blist.sorteddict probably doesn't
run on them.

HTH
 

Daniel Stutzbach

Dan Stromberg said:
In short, blist.sorteddict didn't do that well, despite being in C.

blist.blist is written in C, but blist.sorteddict is written in Python on
top of blist.blist. It won't perform as well as a class written entirely
in C that has similar asymptotic properties. If the objects in the tree
are non-trivial (i.e., not a built-in type), the cost of calling the __lt__
method may be more important than the choice of blist vs. AVL tree, but I
haven't tested it.
 

Marko Rauhamaa

Unfortunately I'm having a hard time understanding the results.

The 50/50 get/set ratio is most interesting to me.

I'm seeing (under cpython-3.3):


Size: 1048576, duration: 75.3, dictionary type: dict
[...]
Size: 262144, duration: 66.1, dictionary type: AVL_tree
[...]
Size: 65536, duration: 77.3, dictionary type: blist.sorteddict


What does it mean?


Marko
 

Dan Stromberg

Marko Rauhamaa said:

Unfortunately I'm having a hard time understanding the results.

The 50/50 get/set ratio is most interesting to me.

I'm seeing (under cpython-3.3):


Size: 1048576, duration: 75.3, dictionary type: dict
[...]
Size: 262144, duration: 66.1, dictionary type: AVL_tree
[...]
Size: 65536, duration: 77.3, dictionary type: blist.sorteddict


What does it mean?

dict was able to do 1048576 operations on a dictionary before taking
more than 120 seconds to complete - it took 75.3 seconds to do 1048576
operations.

AVL_tree was able to do 262144 operations on a dictionary before
taking more than 120 seconds to complete - it took 66.1 seconds to do
262144 operations.

blist.sorteddict was able to do 65536 operations on a dictionary
before taking more than 120 seconds to complete - it took 77.3
seconds to do 65536 operations.

For the 50/50 workload, the "operations" were half adding key/value
pairs and half lookups of values by keys we know are in the
dictionary.

I used to run all the dictionaries for as long as it took to do 4
million operations, but for (e.g.) unbalanced binary trees, that would
take far too long in the ordered tests, so I changed the code to try a
given tree type until the time for an operation became prohibitive.

If you look at the graphs (I have to admit they've become a little
cluttered), you can see the slower trees "escaping" rapidly (exceeding
the 120 second threshold), while the better performing trees grow more
slowly and are allowed to continue proving themselves longer.
Inspecting these graphs may help in developing an intuition for how
the tests were conducted.
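
A condensed sketch of that procedure (the real harness linked below does
considerably more; the names here are made up):

    import time

    def benchmark(make_dict, workload, threshold=120.0, start_size=1024):
        size, results = start_size, []
        while True:
            d = make_dict()
            t0 = time.time()
            workload(d, size)              # e.g. the 50/50 store/lookup mix
            duration = time.time() - t0
            results.append((size, duration))
            if duration > threshold:
                break                      # this type has "escaped"; stop here
            size *= 2                      # otherwise let it prove itself longer
        return results

    def fifty_fifty(d, size):
        for i in range(size // 2):
            d[i] = i                       # store
            _ = d[i]                       # lookup of a key known to be present

    print(benchmark(dict, fifty_fifty, threshold=1.0))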

The code implementing this method of testing is in
http://stromberg.dnsalias.org/svn/python-tree-and-heap-comparison/trunk/tester

HTH
 

Marko Rauhamaa

Dan Stromberg said:
dict was able to do 1048576 operations on a dictionary before taking
more than 120 seconds to complete - it took 75.3 seconds to do 1048576
operations.

AVL_tree was able to do 262144 operations on a dictionary before
taking more than 120 seconds to complete - it took 66.1 seconds to do
262144 operations.

blist.sorteddict was able to do 65536 operations on a dictionary
before taking more than 120 seconds to complete - it took 77.3 seconds
to do 65536 operations.

For a proper comparison, I'd like a fixed, identical dataset and set of
operations run against each data structure.

How about this test program:

generate random datasets of 100, 10000, 1000000 and 100000000 elements
generate random testset of 1000000 elements
for each data structure:
    for each dataset:
        initialize data structure with dataset
        head, tail = testset[:100], testset[100:]
        t0 = current timestamp
        for each element in head:
            add element to data structure
        for each element in tail:
            add element to data structure
            append element to head
            remove head.pop(0) from data structure
        for each element in head:
            remove element from data structure
        t1 = current timestamp
        report data structure type, dataset size, t1 - t0


Marko
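
A runnable rendering of that outline, scaled down so it finishes
quickly; the structures to compare (dict, an AVL tree, blist.sorteddict,
...) are whatever the caller passes in, as long as they support the
mapping operations used here:

    import random
    import time

    def run_protocol(structures, dataset_sizes=(100, 10000), testset_size=100000):
        testset = [random.random() for _ in range(testset_size)]
        for make in structures:
            for size in dataset_sizes:
                dataset = [random.random() for _ in range(size)]
                ds = make()
                for key in dataset:              # initialize with the dataset
                    ds[key] = None
                head, tail = testset[:100], testset[100:]
                t0 = time.time()
                for key in head:
                    ds[key] = None
                for key in tail:                 # keep a sliding window of 100 keys
                    ds[key] = None
                    head.append(key)
                    del ds[head.pop(0)]
                for key in head:
                    del ds[key]
                t1 = time.time()
                print(make.__name__, size, t1 - t0)

    run_protocol([dict])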
 

Dan Stromberg

Marko Rauhamaa said:
For a proper comparison, I'd like a fixed, identical dataset and set of
operations run against each data structure.

How about this test program:

I used to do essentially this, but it was time-prohibitive and
produced harder-to-read graphs - harder to read because the enormous
values of the bad trees were dwarfing the values of the good trees.

Imagine doing 100000000 operation tests for the unbalanced binary
tree. For a series of random keys, it would do quite well (probably
second only to dict), but for a series of sequential keys it would
take longer than anyone would reasonably want to wait because it's
basically a storage-inefficient linked list.

Rather than throw out unbalanced binary tree altogether, it makes more
sense to run it until it gets "too slow".

The workload+interpreter pairs are all tested the same way, it's just
that the ones that are doing badly are thrown out before they're able
to get a lot worse. Studying the graphs will likely help develop an
intuition for what's happening.
 

Marko Rauhamaa

Dan Stromberg said:
I used to do essentially this, but it was time-prohibitive and
produced harder-to-read graphs - harder to read because the enormous
values of the bad trees were dwarfing the values of the good trees.

Imagine doing 100000000 operation tests for the unbalanced binary
tree. For a series of random keys, it would do quite well (probably
second only to dict), but for a series of sequential keys it would
take longer than anyone would reasonably want to wait because it's
basically a storage-inefficient linked list.

Rather than throw out unbalanced binary tree altogether, it makes more
sense to run it until it gets "too slow".

I disagree strongly. You should throw out unbalanced binary trees and
linked lists and the like and concentrate on solid contenders.

But it's your test. You do as you like.

Anyway, even a well-thought-out test is subject to all kinds of
criticisms due to the CPU architecture, the distribution of the key
values, the quality of the data structure implementation etc etc.


Marko
 

Chris Kaynor

Size: 1048576, duration: 75.3, dictionary type: dict
[...]
Size: 262144, duration: 66.1, dictionary type: AVL_tree
[...]
Size: 65536, duration: 77.3, dictionary type: blist.sorteddict

Taking a quick look at this, I think it might be made much clearer if the
number/second were added to the output. While it can be computed off the
displayed data, it would make it much easier to compare in the table view
by including it. Something like:

Size: 1048576, duration: 75.3, dictionary type: 13925/second: dict
Size: 262144, duration: 66.1, dictionary type: 3965/second: AVL_tree
Size: 65536, duration: 77.3, dictionary type: 847/second:
blist.sorteddict

Chris
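
The extra column is a one-line change wherever the report line gets
printed, along the lines of (field names invented):

    size, duration, dict_type = 1048576, 75.3, "dict"
    print("Size: %d, duration: %.1f, rate: %d/second, dictionary type: %s"
          % (size, duration, size / duration, dict_type))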
 

Steven D'Aprano

Marko Rauhamaa said:

I disagree strongly. You should throw out unbalanced binary trees and
linked lists and the like and concentrate on solid contenders.

If you are in a position to randomize the data before storing it in the
tree, an unbalanced binary tree is a solid contender. The overhead will
likely be much less than any balanced tree, and the probability of
degenerate behaviour negligible for any amount of data big enough to
really matter.
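
A sketch of that trade-off: a deliberately plain, unbalanced binary
search tree with no rebalancing code at all, fed pre-shuffled keys:

    import random

    class Node(object):
        __slots__ = ("key", "value", "left", "right")
        def __init__(self, key, value):
            self.key, self.value = key, value
            self.left = self.right = None

    class UnbalancedBST(object):
        def __init__(self):
            self.root = None

        def insert(self, key, value):
            if self.root is None:
                self.root = Node(key, value)
                return
            node = self.root
            while True:
                if key < node.key:
                    if node.left is None:
                        node.left = Node(key, value)
                        return
                    node = node.left
                elif key > node.key:
                    if node.right is None:
                        node.right = Node(key, value)
                        return
                    node = node.right
                else:
                    node.value = value
                    return

        def search(self, key):
            node = self.root
            while node is not None:
                if key < node.key:
                    node = node.left
                elif key > node.key:
                    node = node.right
                else:
                    return node.value
            raise KeyError(key)

    keys = list(range(100000))   # sorted order would degenerate into a linked list
    random.shuffle(keys)         # shuffling first keeps the expected depth O(log n)
    tree = UnbalancedBST()
    for k in keys:
        tree.insert(k, str(k))
    print(tree.search(12345))    # '12345'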
 
