PEP 372 -- Adding an ordered directory to collections

A

Armin Ronacher

Martin v. Löwis said:
Right. So byindex(n) would be O(n) then, right? If so, what's the
purpose of having that method in the first place?
What's the purpose of having list.insert?
The PEP doesn't give a rationale, but just proposes that the method
be there. My guess is that it includes it for performance reasons.
However, I think the PEP (author) is misguided in assuming that
making byindex() a method of odict, you get better performance than
directly doing .items()[n] - which, as you say, you won't.
Without byindex the only way to cherry pick an item is either doing
something like

i = od.iteritems()
for idx in xrange(offset):
value = idx.next()
return value

Or

return od.items()[offset]

One creates tons of unnecessary method calls, the other creates a full
blown list object just to throw it away later. Both less than optimal
solutions that can be implemented in a more efficient way on the C
layer where one only has to iterate over the linked list offset times
and return the item. And iteration for that linked list is most likely
something like "for (n = 0; n != offset; ++n) iter = iter->next".

Regards,
Armin
 
M

Martin v. Löwis

What's the purpose of having list.insert?

It's a convenience function: you don't have to write a loop to move all
items to a later index. Any reformulation of it is easy to get wrong,
and difficult to read.
One creates tons of unnecessary method calls, the other creates a full
blown list object just to throw it away later. Both less than optimal
solutions that can be implemented in a more efficient way on the C
layer where one only has to iterate over the linked list offset times
and return the item. And iteration for that linked list is most likely
something like "for (n = 0; n != offset; ++n) iter = iter->next".

Ok, so it is about performance, and intended to provide a speedup by
a constant factor (over the trivial reformulation without it).

What are the use cases for this function? I.e. in what specific
applications did you use it, or would you want to use it?

In these applications, could you instead also have used a (hypothetical)
function

def nth(iter, n):
while n:
iter.next()
n-=1
return iter.next()

instead?

Regards,
Martin
 
D

dbpokorny

Wow, I was completely wrong about sorted dicts and odicts.

mean. I think for this data structure it's important to keep all the
normal dict operations at the same speed. If you use a C

Why keep the normal dict operations at the same speed? There is a
substantial cost this entails.

It seems that some odict solutions take up 4n words per entry in the
limit (ignoring the fixed percentage growth space for lists and
dicts). This is n words for _keys and 3n words for the underlying
dict. What about storing the key:value pairs as a pair of lists (maybe
interleaved)? Key lookup then becomes an O(n) operation, but the
storage requirements are reduced to 2n from 4n.

Here is a hypothetical (somewhat contrived) example: odicts are used
to store the attributes of XML elements in a medium-sized XML document
(say 250K). If the required munging does some surgery on elements 3-4
layers deep in a few places, then the number of key lookups may be
relatively small. (Assuming that very few odicts will have more than
30 entries, each individual lookup will be fairly quick as well). The
document has to be loaded into memory anyway, so the extra cost of key
lookups is more than compensated for by the substantial reduction in
the number of cache line misses.

The main concern is that doubling the size of the data structure in
order to turn key lookup into an O(1) operation may give worse
performance in the common case (this is a personal impression from
reading the use case list in the rationale), and a data structure like
this seems to be way off by itself in the time-space complexity
landscape.

In this case the odict no longer inherits from the dict.

Cheers,
David
 
B

bearophileHUGS

dbpoko...:
Why keep the normal dict operations at the same speed? There is a
substantial cost this entails.

I presume now we can create a list of possible odict usages, because I
think that despite everyone using it for different purposes, we may
find some main groups of its usage. I use odicts is situations where
an dict is nearly the right data structure, so keeping all operations
close to the time complexity of dicts has a purpose.

but the storage requirements are reduced to 2n from 4n.

In Python 2.5 a dict(int:None) needs about 36.2 bytes/element. I am
suggesting to add 2 pointers, to create a linked list, so it probably
becomes (on 32 bit systems) about 44.2 bytes/pair.

Note that computer science is full of strange data structures, so
maybe a skip list can be used here, to increase some operation
timings, and reduce other ones... :)

Bye,
bearophile
 
M

Matimus

Abstract
========

This PEP proposes an ordered dictionary as a new data structure for
the ``collections`` module, called "odict" in this PEP for short.  The
proposed API incorporates the experiences gained from working with
similar implementations that exist in various real-world applications
and other programming languages.

Rationale
=========

In current Python versions, the widely used built-in dict type does
not specify an order for the key/value pairs stored.  This makes it
hard to use dictionaries as data storage for some specific use cases.

Some dynamic programming languages like PHP and Ruby 1.9 guarantee a
certain order on iteration.  In those languages, and existing Python
ordered-dict implementations, the ordering of items is defined by the
time of insertion of the key.  New keys are appended at the end, keys
that are overwritten and not moved.

The following example shows the behavior for simple assignments:
d = odict()
d['parrot'] = 'dead'
d['penguin'] = 'exploded'
d.items()

[('parrot', 'dead'), ('penguin', 'exploded')]

That the ordering is preserved makes an odict useful for a couple of
situations:

- XML/HTML processing libraries currently drop the ordering of
  attributes, use a list instead of a dict which makes filtering
  cumbersome, or implement their own ordered dictionary.  This affects
  ElementTree, html5lib, Genshi and many more libraries.

- There are many ordererd dict implementations in various libraries
  and applications, most of them subtly incompatible with each other.
  Furthermore, subclassing dict is a non-trivial task and many
  implementations don't override all the methods properly which can
  lead to unexpected results.

  Additionally, many ordered dicts are implemented in an inefficient
  way, making many operations more complex then they have to be.

- PEP 3115 allows metaclasses to change the mapping object used for
  the class body.  An ordered dict could be used to create ordered
  member declarations similar to C structs.  This could be useful, for
  example, for future ``ctypes`` releases as well as ORMs that define
  database tables as classes, like the one the Django framework ships.
  Django currently uses an ugly hack to restore the ordering of
  members in database models.

- Code ported from other programming languages such as PHP often
  depends on a ordered dict.  Having an implementation of an
  ordering-preserving dictionary in the standard library could ease
  the transition and improve the compatibility of different libraries.

Ordered Dict API
================

The ordered dict API would be mostly compatible with dict and existing
ordered dicts.  (Note: this PEP refers to the Python 2.x dictionary
API; the transfer to the 3.x API is trivial.)

The constructor and ``update()`` both accept iterables of tuples as
well as mappings like a dict does.  The ordering however is preserved
for the first case:
d = odict([('a', 'b'), ('c', 'd')])
d.update({'foo': 'bar'})
d

collections.odict([('a', 'b'), ('c', 'd'), ('foo', 'bar')])

If ordered dicts are updated from regular dicts, the ordering of new
keys is of course undefined again unless ``sort()`` is called.

All iteration methods as well as ``keys()``, ``values()`` and
``items()`` return the values ordered by the the time the key-value
pair was inserted:
d['spam'] = 'eggs'
d.keys()

['a', 'c', 'foo', 'spam']>>> d.values()

['b', 'd', 'bar', 'eggs']>>> d.items()

[('a', 'b'), ('c', 'd'), ('foo', 'bar'), ('spam', 'eggs')]

New methods not available on dict:

    ``odict.byindex(index)``

        Index-based lookup is supported by ``byindex()`` which returns
        the key/value pair for an index, that is, the "position" of a
        key in the ordered dict.  0 is the first key/value pair, -1
        the last.

        >>> d.byindex(2)
        ('foo', 'bar')

    ``odict.sort(cmp=None, key=None, reverse=False)``

        Sorts the odict in place by cmp or key.  This works exactly
        like ``list.sort()``, but the comparison functions are passed
        a key/value tuple, not only the value.

        >>> d = odict([(42, 1), (1, 4), (23, 7)])
        >>> d.sort()
        >>> d
        collections.odict([(1, 4), (23, 7), (42, 1)])

    ``odict.reverse()``

        Reverses the odict in place.

    ``odict.__reverse__()``

        Supports reverse iteration by key.

Questions and Answers
=====================

What happens if an existing key is reassigned?

    The key is not moved but assigned a new value in place.  This is
    consistent with existing implementations and allows subclasses to
    change the behavior easily::

        class movingcollections.odict):
            def __setitem__(self, key, value):
                self.pop(key, None)
                odict.__setitem__(self, key, value)

What happens if keys appear multiple times in the list passed to the
constructor?

    The same as for regular dicts: The latter item overrides the
    former.  This has the side-effect that the position of the first
    key is used because the key is actually overwritten:

    >>> odict([('a', 1), ('b', 2), ('a', 3)])
    collections.odict([('a', 3), ('b', 2)])

    This behavior is consistent with existing implementations in
    Python, the PHP array and the hashmap in Ruby 1.9.

Why is there no ``odict.insert()``?

    There are few situations where you really want to insert a key at
    an specified index.  To avoid API complication, the proposed
    solution for this situation is creating a list of items,
    manipulating that and converting it back into an odict:

    >>> d = odict([('a', 42), ('b', 23), ('c', 19)])
    >>> l = d.items()
    >>> l.insert(1, ('x', 0))
    >>> odict(l)
    collections.odict([('a', 42), ('x', 0), ('b', 23), ('c', 19)])

Example Implementation
======================

A poorly performing example implementation of the odict written in
Python is available:

    `odict.py <http://dev.pocoo.org/hg/sandbox/raw-file/tip/
odict.py>`_

The version for ``collections`` should be implemented in C and use a
linked list internally.

Other implementations of ordered dicts in various Python projects or
standalone libraries, that inspired the API proposed here, are:

- `odict in Babel`_
- `OrderedDict in Django`_
- `The odict module`_
- `ordereddict`_ (a C implementation of the odict module)
- `StableDict`_
- `Armin Rigo's OrderedDict`_

.. _odict in Babel:http://babel.edgewall.org/browser/trunk/babel/util.py?rev=374#L178
.. _OrderedDict in Django:
   http://code.djangoproject.com/browser/django/trunk/django/utils/datas...
.. _The odict module:http://www.voidspace.org.uk/python/odict.html
.. _ordereddict:http://www.xs4all.nl/~anthon/Python/ordereddict/
.. _StableDict:http://pypi.python.org/pypi/StableDict/0.2
.. _Armin Rigo's OrderedDict:http://codespeak.net/svn/user/arigo/hack/pyfuse/OrderedDict.py

Future Directions
=================

With the availability of an ordered dict in the standard library,
other libraries may take advantage of that.  For example, ElementTree
could return odicts in the future that retain the attribute ordering
of the source file.

Copyright
=========

This document has been placed in the public domain.

In Python 3.0 dict.items(), dict.keys() and dict.values() return set-
like views of the dictionary. http://www.python.org/dev/peps/pep-3106/.
You should consider doing the same thing for odict. Of course, the
views would be list-like, not set-like.

At the very least, I would like to see a discussion about it.

Matt
 
D

dbpokorny

In Python 2.5 a dict(int:None) needs about 36.2 bytes/element. I am
suggesting to add 2 pointers, to create a linked list, so it probably
becomes (on 32 bit systems) about 44.2 bytes/pair.

PyDictEntry is
typedef struct {
Py_ssize_t me_hash;
PyObject *me_key;
PyObject *me_value;
} PyDictEntry;

Which should be 12 bytes on a 32-bit machine. I thought the space for
growth factor for dicts was about 12% but it is really 100%. In any
case, a pair of lists will take up less space than a dict and a list.
Or the storage could be an array of PyDictEntrys (to cache the hash
values of the keys), an approach that is in some sense halfway between
the others.

There is one advantage of this last approach - I think the amount of
hacking on dictobject.c that would have to take place is minimal. In
fact it almost seems like you could get the desired result by setting
mp->ma_lookup to a new function (and keep most of the rest of the
methods as they are). This seems too easy though, so there might be a
catch.

David
 
A

Armin Ronacher

Small Status update of the changes incorporated in the PEP:

- The PEP Title was fixed. Of course it's a dictionary not a
directory :)

- A questions and answers section was added that covers some
of the questions raised here and on the original thread on
python-devel.

- Comparison behavor was documented

- a note on Python 3 support was added (about keys-views)


Regards,
Armin
 
B

bearophileHUGS

dbpoko...:
Which should be 12 bytes on a 32-bit machine. I thought the space for
growth factor for dicts was about 12% but it is really 100%.

(Please ignore the trailing ".2" in my number in my last post, such
precision is silly).
My memory value comes from experiments, I have created a little
program like this:

from memory import memory

def main(N):
m1 = memory()
print m1

d = {}
for i in xrange(N):
d = None

m2 = memory()
print m2
print float((m2 - m1) * 1024) / N
main(20000000)

Where memory is a small module of mine that calls a little known
program that tells how much memory is used by the current Python
process. The results for that run n=20000000 are (first two numbers
are kilobytes, the third number is byte/pair):

1876
633932
32.3612672

It means to store 20_000_000 pairs it requires about 647_000_000
bytes, Python 2.5.2, on Win.

Bye,
bearophile
 
B

bearophileHUGS

Duncan Booth:
What do you get if you change the output to exclude the integers from
the memory calculation so you are only looking at the dictionary
elements themselves? e.g.

The results:

318512 (kbytes)
712124 (kbytes)
20.1529344 (bytes)

Bye,
bearophile
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,968
Messages
2,570,152
Members
46,698
Latest member
LydiaHalle

Latest Threads

Top