list.pop(0) vs. collections.deque

Steve Howell

You haven't explained why deque doesn't apply to your use case.  Until a
convincing explanation emerges, the sentiment you're creating seems to
be "what's wrong with that guy and why doesn't he just use deque?".  So,
why aren't you using deque?  If deque somehow isn't adequate for your
use case, maybe it can be improved.

These are the reasons I am not using deque:

1) I want to use native lists, so that downstream methods can use
them as lists.
2) Lists are faster for accessing elements.
3) I want to be able to insert elements into the middle of the list.
4) I have no need for rotating elements.

I also have reasons for not using the other workarounds, such as
reversing the list.
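
To make point 2 concrete: indexing a list is O(1), while indexing a deque away from its ends is O(n). A quick timeit sketch (numbers will vary by machine):

import timeit

setup = '''
from collections import deque
lst = list(range(100000))
dq = deque(lst)
'''

# list indexing is O(1); deque indexing away from the ends is O(n)
print(timeit.timeit('lst[50000]', setup=setup, number=100000))
print(timeit.timeit('dq[50000]', setup=setup, number=100000))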
I'm talking about memory consumption, which is part of Python's concept
of performance.  You're proposing adding a word or two to every list,
with insufficient justification presented so far.  Any such
justification would have to include a clear and detailed explanation of
why using deque is insufficient, so that would be a good place to start.

Adding a word or two to a list is an O(1) addition to a data structure
that takes O(N) memory to begin with.

That extra pointer should really be considered not just in the context
of the list itself taking O(N) memory, but also in light of the fact
that all the elements in the list are consuming memory themselves
(until they get popped off). So adding the pointer has negligible cost.

Another way of looking at it is that you would need to have 250 or so
lists in memory at the same time before the extra pointer was even
costing you kilobytes of memory. My consumer laptop has 3027908k of
memory.
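
To put rough numbers on that claim (a sketch; sizes are build- and
platform-dependent, and the 8-byte pointer is an assumption about a
64-bit build):

import sys

# Fixed per-list overhead today, and what 250 extra pointers would cost.
print(sys.getsizeof([]))   # e.g. 56 bytes for an empty list on 64-bit CPython
print(250 * 8)             # 2000 bytes: roughly 2K for 250 extra pointers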
 
Steve Howell

So let me get this straight...

You've complained that Python's list.pop(0) is lame because it moves
memory around. And your solution to that is to have the memory manager
move the memory around instead?

Perhaps I'm missing something, but I don't see the advantage here. At
best, you consolidate all those moves you wanted to avoid and do them all
at once instead of a few at a time. At worst, you get a situation where
the application periodically, and unpredictably, grinds to a halt while
the memory manager tries to defrag all those lists.

You are misunderstanding what I meant, because I did not explain it
very well. When you release memory from the front of the list, if the
memory before it was also free, the memory manager could consolidate
the two chunks as one free chunk.

There is no rational scenario where the memory manager grinds to a
halt trying to "defrag all those lists."

Of course, once the list gets fully garbage collected, the entire
chunk of memory is freed up.

Do you honestly believe that "everybody" who prefers another language
over Python does so because they dislike the performance of list.pop(0)?

No, I don't believe any statement that makes gross generalizations, so
I also don't believe "most everyone is satisfied with the existing
solution."
Popping from the end of the list isn't expensive. Reversing lists is
relatively cheap. In-place modifications are very cheap.

I am talking in relative terms here. I am saying that checking a
single flag in C code isn't gonna significantly slow down any
operation that calls list_resize(). Delete operations would already
be doing a memmove operation, and insert operations already have to
decide whether to optimistically allocate memory and create the new
list element.

Regarding the extra use of memory, I addressed this in my prior
posting.
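
For reference, the reversing workaround mentioned above looks like
this in practice; a minimal sketch, relying on pop() from the end
being O(1):

# Reverse once (O(n)), then consume from the end with O(1) pops,
# avoiding the memmove that pop(0) triggers on every call.
lines = ['line %d' % i for i in range(1000)]
lines.reverse()
while lines:
    line = lines.pop()
    # ... process line ...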

Here is code for list_resize:

static int
list_resize(PyListObject *self, Py_ssize_t newsize)
{
    PyObject **items;
    size_t new_allocated;
    Py_ssize_t allocated = self->allocated;

    /* Bypass realloc() when a previous overallocation is large enough
       to accommodate the newsize.  If the newsize falls lower than half
       the allocated size, then proceed with the realloc() to shrink the
       list.
    */
    if (allocated >= newsize && newsize >= (allocated >> 1)) {
        assert(self->ob_item != NULL || newsize == 0);
        Py_SIZE(self) = newsize;
        return 0;
    }

    /* This over-allocates proportional to the list size, making room
     * for additional growth.  The over-allocation is mild, but is
     * enough to give linear-time amortized behavior over a long
     * sequence of appends() in the presence of a poorly-performing
     * system realloc().
     * The growth pattern is:  0, 4, 8, 16, 25, 35, 46, 58, 72, 88, ...
     */
    new_allocated = (newsize >> 3) + (newsize < 9 ? 3 : 6);

    /* check for integer overflow */
    if (new_allocated > PY_SIZE_MAX - newsize) {
        PyErr_NoMemory();
        return -1;
    } else {
        new_allocated += newsize;
    }

    if (newsize == 0)
        new_allocated = 0;
    items = self->ob_item;
    if (new_allocated <= ((~(size_t)0) / sizeof(PyObject *)))
        PyMem_RESIZE(items, PyObject *, new_allocated);
    else
        items = NULL;
    if (items == NULL) {
        PyErr_NoMemory();
        return -1;
    }
    self->ob_item = items;
    Py_SIZE(self) = newsize;
    self->allocated = new_allocated;
    return 0;
}
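
The growth pattern in the comment can be checked by replaying the
formula in Python; a small sketch:

# Replays list_resize's over-allocation rule to verify the
# "0, 4, 8, 16, 25, 35, 46, 58, 72, 88, ..." pattern in the comment.
def grow(newsize):
    return (newsize >> 3) + (3 if newsize < 9 else 6) + newsize

allocated = 0
steps = []
for _ in range(10):
    allocated = grow(allocated + 1)  # append one element past capacity
    steps.append(allocated)
print(steps)  # [4, 8, 16, 25, 35, 46, 58, 72, 88, 106]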


Here is the code for list_ass_slice:

static int
list_ass_slice(PyListObject *a, Py_ssize_t ilow, Py_ssize_t ihigh,
               PyObject *v)
{
    /* Because [X]DECREF can recursively invoke list operations on
       this list, we must postpone all [X]DECREF activity until
       after the list is back in its canonical shape.  Therefore
       we must allocate an additional array, 'recycle', into which
       we temporarily copy the items that are deleted from the
       list. :-( */
    PyObject *recycle_on_stack[8];
    PyObject **recycle = recycle_on_stack; /* will allocate more if needed */
    PyObject **item;
    PyObject **vitem = NULL;
    PyObject *v_as_SF = NULL; /* PySequence_Fast(v) */
    Py_ssize_t n; /* # of elements in replacement list */
    Py_ssize_t norig; /* # of elements in list getting replaced */
    Py_ssize_t d; /* Change in size */
    Py_ssize_t k;
    size_t s;
    int result = -1; /* guilty until proved innocent */
#define b ((PyListObject *)v)
    if (v == NULL)
        n = 0;
    else {
        if (a == b) {
            /* Special case "a[i:j] = a" -- copy b first */
            v = list_slice(b, 0, Py_SIZE(b));
            if (v == NULL)
                return result;
            result = list_ass_slice(a, ilow, ihigh, v);
            Py_DECREF(v);
            return result;
        }
        v_as_SF = PySequence_Fast(v, "can only assign an iterable");
        if (v_as_SF == NULL)
            goto Error;
        n = PySequence_Fast_GET_SIZE(v_as_SF);
        vitem = PySequence_Fast_ITEMS(v_as_SF);
    }
    if (ilow < 0)
        ilow = 0;
    else if (ilow > Py_SIZE(a))
        ilow = Py_SIZE(a);

    if (ihigh < ilow)
        ihigh = ilow;
    else if (ihigh > Py_SIZE(a))
        ihigh = Py_SIZE(a);

    norig = ihigh - ilow;
    assert(norig >= 0);
    d = n - norig;
    if (Py_SIZE(a) + d == 0) {
        Py_XDECREF(v_as_SF);
        return list_clear(a);
    }
    item = a->ob_item;
    /* recycle the items that we are about to remove */
    s = norig * sizeof(PyObject *);
    if (s > sizeof(recycle_on_stack)) {
        recycle = (PyObject **)PyMem_MALLOC(s);
        if (recycle == NULL) {
            PyErr_NoMemory();
            goto Error;
        }
    }
    memcpy(recycle, &item[ilow], s);

    if (d < 0) { /* Delete -d items */
        memmove(&item[ihigh+d], &item[ihigh],
                (Py_SIZE(a) - ihigh)*sizeof(PyObject *));
        list_resize(a, Py_SIZE(a) + d);
        item = a->ob_item;
    }
    else if (d > 0) { /* Insert d items */
        k = Py_SIZE(a);
        if (list_resize(a, k+d) < 0)
            goto Error;
        item = a->ob_item;
        memmove(&item[ihigh+d], &item[ihigh],
                (k - ihigh)*sizeof(PyObject *));
    }
    for (k = 0; k < n; k++, ilow++) {
        PyObject *w = vitem[k];
        Py_XINCREF(w);
        item[ilow] = w;
    }
    for (k = norig - 1; k >= 0; --k)
        Py_XDECREF(recycle[k]);
    result = 0;
Error:
    if (recycle != recycle_on_stack)
        PyMem_FREE(recycle);
    Py_XDECREF(v_as_SF);
    return result;
#undef b
}
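
At the Python level, list_ass_slice is what runs for slice assignment
and deletion; a small sketch exercising its delete, insert, and
"a[i:j] = a" branches:

a = list(range(10))

del a[0:3]         # d < 0: the "Delete -d items" memmove branch
print(a)           # [3, 4, 5, 6, 7, 8, 9]

a[1:1] = [97, 98]  # d > 0: the "Insert d items" branch
print(a)           # [3, 97, 98, 4, 5, 6, 7, 8, 9]

a[0:2] = a         # the special case "a[i:j] = a" copies a first
print(len(a))      # 16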
 
Steve Howell

These are the reasons I am not using deque:

  1) I want to use native lists, so that downstream methods can use
them as lists.
  2) Lists are faster for accessing elements.
  3) I want to be able to insert elements into the middle of the list.
  4) I have no need for rotating elements.

I also have reasons for not using the other workarounds, such as
reversing the list.



Adding a word or two to a list is an O(1) addition to a data structure
that takes O(N) memory to begin with.

That extra pointer should really be considered not just in the context
of the list itself taking O(N) memory, but also in light of the fact
that all the elements in the list are consuming memory themselves
(until they get popped off).  So adding the pointer has negligible cost.

Another way of looking at it is that you would need to have 250 or so
lists in memory at the same time before the extra pointer was even
costing you kilobytes of memory.  My consumer laptop has 3027908k of
memory.

I should also point out that my telephone has gigabytes of memory.
It's a fairly expensive device, but I regularly carry multiple
gigabytes of memory around in my front pants pocket.

There are some valid reasons to reject a proposal to make deleting
elements off the top of the list be O(1).

Memory consumption is not one of them.

Even the most naive patch to make pop(0) and del lst[0] advance the
pointer would eventually reclaim memory once the list is garbage
collected. Also, by allowing users to pop elements off the list
without a memmove, you encourage users to discard elements earlier in
the process, which means you can amortize the garbage collection for
the list elements themselves (i.e. less spiky), and do it earlier.
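
The "advance the pointer" idea can be modeled in pure Python with a
head index standing in for the C pointer; a hedged sketch
(FrontPoppableList is a hypothetical name, for illustration only):

class FrontPoppableList:
    def __init__(self, items):
        self._items = list(items)
        self._head = 0                   # plays the role of the C pointer

    def pop0(self):
        item = self._items[self._head]
        self._items[self._head] = None   # release the reference early
        self._head += 1
        if self._head > len(self._items) // 2:
            # amortized cleanup, like a realloc reclaiming the orphans
            del self._items[:self._head]
            self._head = 0
        return item

    def __len__(self):
        return len(self._items) - self._head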
 
Steve Howell

I have to wonder if my messages are all ending up in your spam folder
for some reason. :)

PEP 3128 (which solves your problem, but not using the implementation
you suggest): http://www.python.org/dev/peps/pep-3128/

Implementation as an extension module: http://pypi.python.org/pypi/blist/

Related discussion: http://mail.python.org/pipermail/py...rg/pipermail/python-3000/2007-May/007491.html

Detailed performance comparison: http://stutzbachenterprises.com/performance-blist

I maintain a private fork of Python 3 with the blist replacing the
regular list, as a way of rigorously testing the blist implementation.

Although I originally proposed a PEP, I am content to have the blist
exist as a third-party module.


Hi Daniel, I agree with what Raymond Hettinger says toward the top of
the PEP. Blist, while extremely useful, does seem to have to trade
off performance of common operations, notably "get item", in order to
get better performance for other operations (notably insert/delete).
My algorithm does exactly N pops and roughly N list accesses, so I
would be going from N*N + N to N + N log N if switched to blist. That
would be at least a theoretical gain over the current performance, but
if pop() were O(1), I could get the whole thing down to N time.
 
Paul Rubin

Steve Howell said:
These are the reasons I am not using deque:

Thanks for these. Now we are getting somewhere.
1) I want to use native lists, so that downstream methods can use
them as lists.

It sounds like that could be fixed by making the deque API a proper
superset of the list API.
2) Lists are faster for accessing elements.

It sounds like that could be fixed by optimizing deque somewhat. Also,
have you profiled your application to show that accessing list elements
is actually using a significant fraction of its runtime and that it
would be slowed down noticeably by deque? If not, it's a red herring.
3) I want to be able to insert elements into the middle of the list.

I just checked, and was surprised to find that deque doesn't support
this. I'd say go ahead and file a feature request to add it to deque.
4) I have no need for rotating elements.

That's unpersuasive since you're advocating adding a feature to list
that many others have no need for.

Adding a word or two to a list is an O(1) addition to a data structure
that takes O(N) memory to begin with.

Yes, as mentioned, additive constants matter.
Another way of looking at it is that you would need to have 250 or so
lists in memory at the same time before the extra pointer was even
costing you kilobytes of memory.

I've often run applications with millions of lists, maybe tens of
millions. Of course it would be 100's of millions if the machines were
big enough.
My consumer laptop has 3027908k of memory.

I thought the idea of buying bigger machines was to solve bigger
problems, not to solve the same problems more wastefully.
 
Arnaud Delobelle

Steve Howell said:
My algorithm does exactly N pops and roughly N list accesses, so I
would be going from N*N + N to N + N log N if I switched to blist.

Can you post your algorithm? It would be interesting to have a concrete
use case to base this discussion on.
 
Steve Howell

[...]
My algorithm does exactly N pops and roughly N list accesses, so I
would be going from N*N + N to N + N log N if I switched to blist.

Can you post your algorithm?  It would be interesting to have a concrete
use case to base this discussion on.

It is essentially this, in list_ass_slice:

if (d < 0) { /* Delete -d items */
    if (ilow == 0) {
        /* Record the popped slots and advance the item pointer
           instead of shifting everything down with memmove.
           (ob_item is a PyObject **, so pointer arithmetic
           already scales by sizeof(PyObject *).) */
        a->popped -= d;
        a->ob_item -= d;
        list_resize(a, Py_SIZE(a) + d);
    }
    else {
        memmove(&item[ihigh+d], &item[ihigh],
                (Py_SIZE(a) - ihigh)*sizeof(PyObject *));
        list_resize(a, Py_SIZE(a) + d);
    }
    item = a->ob_item;
}

I am still working through the memory management issues, but when I
have a complete working patch, I will give more detail.
 
Steve Howell

Thanks for these.  Now we are getting somewhere.


It sounds like that could be fixed by making the deque API a proper
superset of the list API.

That is probably a good idea.
It sounds like that could be fixed by optimizing deque somewhat.  Also,
have you profiled your application to show that accessing list elements
is actually using a significant fraction of its runtime and that it
would be slowed down noticeably by deque?  If not, it's a red herring.

I haven't profiled deque vs. list, but I think you are correct about
pop() possibly being a red herring.

It appears that the main bottleneck might still be the processing I do
on each line of text, which in my case is regexes.

For really large lists, I suppose memmove() would eventually start to
become a bottleneck, but it's brutally fast when it just moves a
couple kilobytes of data around.
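
That intuition is easy to spot-check; a quick sketch timing pop(0) on
a small list vs. a large one (the append keeps the length constant so
every pop pays the memmove):

import timeit

for n in (200, 100000):
    t = timeit.timeit('lst.pop(0); lst.append(0)',
                      setup='lst = [0] * %d' % n, number=10000)
    print(n, t)  # per-pop cost grows with n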
I just checked, and was surprised to find that deque doesn't support
this.  I'd say go ahead and file a feature request to add it to deque.

It might be a good thing to add just for consistency's sake. If
somebody first implements an algorithm with lists, then discovers it
has overhead relating to inserting/appending at the front of the
list, then the more deque behaves like a list, the more easily they
could switch their code over to deque. Not knowing much about deque's
internals, I assume its performance for insert() would be O(N), just
like list's, although maybe a tiny bit slower.
That's unpersuasive since you're advocating adding a feature to list
that many others have no need for.  

To be precise, I wasn't really advocating for a new feature but an
internal optimization of a feature that already exists.
Yes, as mentioned, additive constants matter.


I've often run applications with millions of lists, maybe tens of
millions.  Of course it would be 100's of millions if the machines were
big enough.

I bet even in your application, the amount of memory consumed by the
PyListObjects themselves is greatly dwarfed by other objects, notably
the list elements themselves, not to mention any dictionaries that
your app uses.
I thought the idea of buying bigger machines was to solve bigger
problems, not to solve the same problems more wastefully.

Well, I am not trying to solve problems wastefully here. CPU cycles
are also scarce, so it seems wasteful to do an O(N) memmove that could
be avoided by storing an extra pointer per list. I also think that
encouraging the use of pop(0) would actually make many programs more
memory efficient, in the sense that you can garbage collect list
elements earlier.

Thanks for your patience in responding to me, despite the needlessly
abrasive tone of my earlier postings. I am coming around to this
thinking:

1) Summarize all this discussion and my lessons learned in some kind
of document. It does not have to be a PEP per se, but I could provide
a useful service to the community by listing pros/cons/etc.

2) I would still advocate for removing the warning against list.pop(0)
from the tutorial. I agree with Steven D'Aprano that docs really
should avoid describing implementation details in many instances
(although I do not know what he thinks about this particular case). I
also think that the performance penalty for pop(0) is negligible for
most medium-sized programs. For large-sized programs where you really
want to swap in deque, I think most authors are beyond reading the
tutorial and are looking elsewhere for insight on Python data
structures.

3) I am gonna try to implement the patch anyway for my own
edification.

4) I do think that there are ways that deque could be improved, but
it is not high on my priority list. I will try to mention it in the
PEP, though.
 
Steve Howell

[...]
My algorithm does exactly N pops and roughly N list accesses, so I
would be going from N*N + N to N + N log N if I switched to blist.

Can you post your algorithm?  It would be interesting to have a concrete
use case to base this discussion on.

These are the profile results for an admittedly very large file
(430,000 lines), which shows that pop() consumes more time than any
other low level method. So pop() is not a total red herring. But I
have to be honest and admit that I grossly overestimated the penalty
for smaller files. Typical files are a couple hundred lines, and for
that use case, pop()'s expense gets totally drowned out by regex
handling. In other words, it's a lot cheaper to move a couple hundred
pointers per list element pop than it is to apply a series of regexes
to them, which shouldn't be surprising.

ncalls           tottime  percall  cumtime  percall  filename:lineno(function)
230001/1         149.508    0.001  222.432  222.432  /home/showell/workspace/shpaml_website/shpaml.py:192(recurse)
429999            17.667    0.000   17.667    0.000  {method 'pop' of 'list' objects}
530000             8.428    0.000   14.125    0.000  /home/showell/workspace/shpaml_website/shpaml.py:143(get_indented_block)
3700008            7.877    0.000    7.877    0.000  {built-in method match}
5410125/5410121    5.697    0.000    5.697    0.000  {len}
300000             3.938    0.000   22.286    0.000  /home/showell/workspace/shpaml_website/shpaml.py:96(convert_line)
959999             3.847    0.000    6.759    0.000  /home/showell/workspace/shpaml_website/shpaml.py:29(INDENT)
959999             3.717    0.000   12.547    0.000  /home/showell/workspace/shpaml_website/shpaml.py:138(find_indentation)
370000             3.495    0.000   20.204    0.000  /home/showell/workspace/shpaml_website/shpaml.py:109(apply_jquery)
370000             3.322    0.000    6.528    0.000  {built-in method sub}
1469999            2.575    0.000    2.575    0.000  {built-in method groups}

As an aside, I am a little surprised by how often I call len() and
that it also takes a large chunk of time, but that's my problem to
fix.
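
For anyone who wants to reproduce a table like the one above, it is
standard cProfile output; a minimal sketch (the actual command used
isn't in the post, so this is just the usual way to get such output):

import cProfile
import re  # stand-in for the regex-heavy per-line work

def work():
    pat = re.compile(r'\d+')
    lines = ['line %d' % i for i in range(10000)]
    while lines:
        pat.match(lines.pop(0))

cProfile.run('work()', sort='tottime')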
 
geremy condra

Hm, it would be nice if the Python docs offered complexity (time)
guarantees in general...

Cheers,

- Alf

This would be a very welcome improvement IMHO- especially
in collections.

Geremy Condra
 
Ethan Furman

No hint? Looking at the below snippet of docs -- "not efficient" and
"slow" sound like pretty good hints to me.
Bringing this thread full circle, does it make sense to strike this
passage from the tutorial?:

'''
It is also possible to use a list as a queue, where the first element
added is the first element retrieved (“first-in, first-out”); however,
lists are not efficient for this purpose. While appends and pops from
the end of list are fast, doing inserts or pops from the beginning of
a list is slow (because all of the other elements have to be shifted
by one).
'''

I think points #3 and #6 possibly apply. Regarding points #2 and #4,
the tutorial is at least not overly technical or specific; it just
explains the requirement to shift other elements one by one in simple
layman's terms.

I think the paragraph is fine. Instead of waiting for the (hundreds
of?) posts wondering why making a FIFO queue from a list is so slow, and
what's wrong with Python, etc, etc, it points out up front that yes you
can, and here's why you don't want to. This does not strike me as too
much knowledge.

~Ethan~
 
Alf P. Steinbach

* Ethan Furman:
No hint? Looking at the below snippet of docs -- "not efficient" and
"slow" sound like pretty good hints to me.


I think the paragraph is fine. Instead of waiting for the (hundreds
of?) posts wondering why making a FIFO queue from a list is so slow, and
what's wrong with Python, etc, etc, it points out up front that yes you
can, and here's why you don't want to. This does not strike me as too
much knowledge.

Is the tutorial regarded as part of the language specification?

I understand that the standard library docs are part of it (e.g. 'object' is
only described there), and that at least some PEPs are.


Cheers,

- Alf
 
Paul Rubin

Steve Howell said:
I haven't profiled deque vs. list, but I think you are correct about
pop() possibly being a red herring....
For really large lists, I suppose memmove() would eventually start to
become a bottleneck, but it's brutally fast when it just moves a
couple kilobytes of data around.

One way to think of Python is as a scripting wrapper around a bunch of C
functions, rather than as a full-fledged programming language. Viewed
that way, list operations like pop(0) are essentially constant time
unless the list is quite large. By that I mean you can implement
classic structures like doubly-linked lists using Python tuples, but
even though inserting into the middle of them is theoretically O(1), the
memmove's of the native list operations will be much faster in practice.
Programs dealing with large lists (more than a few thousand elements)
are obviously different and if your program is using such large lists,
you have to plan a little differently when writing the code.
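
A sketch of that point about classic structures: a cons-cell linked
list built from two-element lists (tuples would be immutable) gives
O(1) insertion after a known node, yet native list.insert's memmove
usually wins in practice at moderate sizes:

def cons(value, rest=None):
    return [value, rest]

head = cons(1, cons(2, cons(4)))

node = head[1]              # the cell holding 2
node[1] = cons(3, node[1])  # O(1) insertion of 3 after it

cell = head
while cell is not None:     # walks 1 2 3 4
    print(cell[0])
    cell = cell[1]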
I bet even in your application, the amount of memory consumed by the
PyListObjects themselves is greatly dwarfed by other objects, notably
the list elements themselves

Such lists would often have just one element or even be empty. For example,
you might have a dictionary mapping names to addresses. Most people
have just one address, but some might have no address, and a few might
have more than one address, so you would have a list of addresses for
each name. Of course the dictionary slots in that example would also
use space.
Well, I am not trying to solve problems wastefully here. CPU cycles
are also scarce, so it seems wasteful to do an O(N) memmove that could
be avoided by storing an extra pointer per list.

Realistically the CPython interpreter is so slow that the memmove is
unnoticeable, and Python (at least CPython) just isn't all that
conducive to writing fast code. It makes up for this in programmer
productivity for the many sorts of problems in which moderate speed
is acceptable.
Thanks for your patience in responding to me, despite the needlessly
abrasive tone of my earlier postings.

I wondered whether you might have come over from the Lisp newsgroups,
which are pretty brutal. We try to be friendlier here (not that we're
always successful). Anyway, welcome.
1) Summarize all this discussion and my lessons learned in some kind
of document. It does not have to be a PEP per se, but I could provide
a useful service to the community by listing pros/cons/etc.

I suppose that can't hurt, but there are probably other areas (multicore
parallelism is a perennial one) of much higher community interest.

http://wiki.python.org/moin/ is probably a good place to put such
a document.
2) I would still advocate for removing the warning against list.pop(0)
from the tutorial. I agree with Steven D'Aprano that docs really
should avoid describing implementation details in many instances

On general principles I agree with Alex Stepanov that the running time
of a function should be part of its interface (nobody wants to use a
stack where popping an element takes quadratic time) and therefore should
be stated in the docs. Python just has a weird incongruence between the
interpreter layer and the C layer, combined with a library well-evolved
for everyday problem sizes, so the traditional asymptotic approach to
algorithm selection often doesn't give the best practical choice.

I don't feel like looking up what the tutorial says about pop(0), but if
it just warns against it without qualification, it should probably
be updated.
 
Steve Howell

Again, your responsibility is to provide a patch and a spectrum of
benchmarking tests to prove it.  Then you would still have to deal with
the objection that extensions use the list internals -- that might be an
okay sell given the effort otherwise required to port extensions to
Python 3, but that's not the way to bet.

Ok, I just submitted a patch to python-dev that illustrates a 100x
speedup on an admittedly artificial program. It still has a long way
to go, but it demonstrates proof of concept. I'm done for the day,
but tomorrow I will try to polish it up and improve it, even if it's
doomed for rejection. Apologies to all I have offended in this
thread. I frankly found some of the pushback to be a bit hasty and
disrespectful, but I certainly overreacted to some of the criticism.
And now I'm in the awkward position of asking the people I offended to
help me with the patch. If anybody can offer me a hand in
understanding some of CPython's internals, particularly with regard to
memory management, it would be greatly appreciated.

(Sorry I don't have a link to the python-dev posting; it is not
showing up in the archives yet for some reason.)
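
The python-dev posting isn't quoted here, so this is only a guess at
the kind of artificial program that shows the effect: draining a
large list from the front, which is quadratic without the patch.

import time

N = 100000
lst = list(range(N))

start = time.time()
while lst:
    lst.pop(0)  # O(n) memmove per pop on an unpatched list
print('drained %d items in %.2fs' % (N, time.time() - start))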
 
Steve Howell

One way to think of Python is as a scripting wrapper around a bunch of C
functions, rather than as a full-fledged programming language.  Viewed
that way, list operations like pop(0) are essentially constant time
unless the list is quite large.  By that I mean you can implement
classic structures like doubly-linked lists using Python tuples, but
even though inserting into the middle of them is theoretically O(1), the
memmove's of the native list operations will be much faster in practice.
Programs dealing with large lists (more than a few thousand elements)
are obviously different and if your program is using such large lists,
you have to plan a little differently when writing the code.

Thanks. That is a good way of looking at it.
Realistically the CPython interpreter is so slow that the memmove is
unnoticeable, and Python (at least CPython) just isn't all that
conducive to writing fast code.  It makes up for this in programmer
productivity for the many sorts of problems in which moderate speed
is acceptable.

Definitely, and moderate speed is enough in a surprisingly large
number of applications.

I wondered whether you might have come over from the Lisp newsgroups,
which are pretty brutal.  We try to be friendlier here (not that we're
always successful).  Anyway, welcome.

:)


I suppose that can't hurt, but there are probably other areas (multicore
parallelism is a perennial one) of much higher community interest.

http://wiki.python.org/moin/ is probably a good place to put such
a document.

Ok, that's where I'll start.
On general principles I agree with Alex Stepanov that the running time
of a function should be part of its interface (nobody wants to use a
stack where popping an element takes quadratic time) and therefore should
be stated in the docs.  Python just has a weird incongruence between the
interpreter layer and the C layer, combined with a library well-evolved
for everyday problem sizes, so the traditional asymptotic approach to
algorithm selection often doesn't give the best practical choice.

I don't feel like looking up what the tutorial says about pop(0), but if
it just warns against it without qualification, it should probably
be updated.

Here it is:

http://docs.python.org/tutorial/datastructures.html#using-lists-as-queues

My opinion is that the warning should be either removed or qualified,
but it is probably fine as written.

'''
It is also possible to use a list as a queue, where the first element
added is the first element retrieved (“first-in, first-out”); however,
lists are not efficient for this purpose. While appends and pops from
the end of list are fast, doing inserts or pops from the beginning of
a list is slow (because all of the other elements have to be shifted
by one).
'''

The qualifications would be that deque lacks some features that list
has, and that the shift-by-one operation is actually a call to
memmove() and may not apply to all implementations.
 
Steve Howell

Ok, I just submitted a patch to python-dev that illustrates a 100x
speedup on an admittedly artificial program.  It still has a long way
to go, but it demonstrates proof of concept.  I'm done for the day,
but tomorrow I will try to polish it up and improve it, even if it's
doomed for rejection.  Apologies to all I have offended in this
thread.  I frankly found some of the pushback to be a bit hasty and
disrespectful, but I certainly overreacted to some of the criticism.
And now I'm in the awkward position of asking the people I offended to
help me with the patch.  If anybody can offer me a hand in
understanding some of CPython's internals, particularly with regard to
memory management, it would be greatly appreciated.

(Sorry I don't have a link to the python-dev posting; it is not
showing up in the archives yet for some reason.)

Here is the latest version of the patch, which passes all the tests on
my debug build.

Not exactly trivial, but not super complicated either.

Index: Include/listobject.h
===================================================================
--- Include/listobject.h (revision 77751)
+++ Include/listobject.h (working copy)
@@ -36,6 +36,7 @@
 	 * the list is not yet visible outside the function that builds it.
 	 */
 	Py_ssize_t allocated;
+	Py_ssize_t orphans;
 } PyListObject;
 
 PyAPI_DATA(PyTypeObject) PyList_Type;
Index: Objects/listobject.c
===================================================================
--- Objects/listobject.c (revision 77751)
+++ Objects/listobject.c (working copy)
@@ -27,13 +27,25 @@
 	PyObject **items;
 	size_t new_allocated;
 	Py_ssize_t allocated = self->allocated;
+	Py_ssize_t needed;
 
+	if (self->orphans >= newsize) {
+		items = self->ob_item - self->orphans;
+		memmove(items, &items[self->orphans],
+			(newsize)*sizeof(PyObject *));
+		self->ob_item = items;
+		self->orphans = 0;
+	}
+
+	needed = newsize + self->orphans;
+	items = self->ob_item - self->orphans;
+
 	/* Bypass realloc() when a previous overallocation is large enough
 	   to accommodate the newsize.  If the newsize falls lower than half
 	   the allocated size, then proceed with the realloc() to shrink the list.
 	*/
-	if (allocated >= newsize && newsize >= (allocated >> 1)) {
-		assert(self->ob_item != NULL || newsize == 0);
+	if (allocated >= needed && needed >= (allocated >> 1)) {
+		assert(items != NULL || newsize == 0);
 		Py_SIZE(self) = newsize;
 		return 0;
 	}
@@ -45,28 +57,32 @@
 	 * system realloc().
 	 * The growth pattern is:  0, 4, 8, 16, 25, 35, 46, 58, 72, 88, ...
 	 */
-	new_allocated = (newsize >> 3) + (newsize < 9 ? 3 : 6);
+	new_allocated = (needed >> 3) + (needed < 9 ? 3 : 6);
 
 	/* check for integer overflow */
-	if (new_allocated > PY_SIZE_MAX - newsize) {
+	if (new_allocated > PY_SIZE_MAX - needed) {
 		PyErr_NoMemory();
 		return -1;
 	} else {
-		new_allocated += newsize;
+		new_allocated += needed;
 	}
 
-	if (newsize == 0)
+	if (needed == 0)
 		new_allocated = 0;
-	items = self->ob_item;
-	if (new_allocated <= ((~(size_t)0) / sizeof(PyObject *)))
+	/*
+	fprintf(stderr, "ob_item: %p", self->ob_item);
+	fprintf(stderr, "items: %p", items);
+	*/
+	if (new_allocated <= ((~(size_t)0) / sizeof(PyObject *))) {
 		PyMem_RESIZE(items, PyObject *, new_allocated);
+	}
 	else
 		items = NULL;
 	if (items == NULL) {
 		PyErr_NoMemory();
 		return -1;
 	}
-	self->ob_item = items;
+	self->ob_item = items + self->orphans;
 	Py_SIZE(self) = newsize;
 	self->allocated = new_allocated;
 	return 0;
@@ -158,6 +174,7 @@
 	}
 	Py_SIZE(op) = size;
 	op->allocated = size;
+	op->orphans = 0;
 	_PyObject_GC_TRACK(op);
 	return (PyObject *) op;
 }
@@ -304,9 +321,10 @@
 		   There's a simple test case where somehow this reduces
 		   thrashing when a *very* large list is created and
 		   immediately deleted. */
+		op->ob_item -= op->orphans;
 		i = Py_SIZE(op);
 		while (--i >= 0) {
-			Py_XDECREF(op->ob_item[i]);
+			Py_XDECREF(op->ob_item[i+op->orphans]);
 		}
 		PyMem_FREE(op->ob_item);
 	}
@@ -548,12 +566,13 @@
 	if (item != NULL) {
 		/* Because XDECREF can recursively invoke operations on
 		   this list, we make it empty first. */
+		item -= a->orphans;
 		i = Py_SIZE(a);
 		Py_SIZE(a) = 0;
 		a->ob_item = NULL;
 		a->allocated = 0;
 		while (--i >= 0) {
-			Py_XDECREF(item[i]);
+			Py_XDECREF(item[i + a->orphans]);
 		}
 		PyMem_FREE(item);
 	}
@@ -638,18 +657,32 @@
 	memcpy(recycle, &item[ilow], s);
 
 	if (d < 0) { /* Delete -d items */
-		memmove(&item[ihigh+d], &item[ihigh],
-			(Py_SIZE(a) - ihigh)*sizeof(PyObject *));
+		if (ilow == 0) {
+			a->orphans += (-1 * d);
+			a->ob_item += (-1 * d);
+		}
+		else {
+			memmove(&item[ihigh+d], &item[ihigh],
+				(Py_SIZE(a) - ihigh)*sizeof(PyObject *));
+		}
 		list_resize(a, Py_SIZE(a) + d);
 		item = a->ob_item;
 	}
 	else if (d > 0) { /* Insert d items */
-		k = Py_SIZE(a);
-		if (list_resize(a, k+d) < 0)
-			goto Error;
-		item = a->ob_item;
-		memmove(&item[ihigh+d], &item[ihigh],
-			(k - ihigh)*sizeof(PyObject *));
+		if (ilow == 0 && d <= a->orphans) {
+			a->orphans += (-1 * d);
+			a->ob_item += (-1 * d);
+			item = a->ob_item;
+			list_resize(a, Py_SIZE(a) + d);
+		}
+		else {
+			k = Py_SIZE(a);
+			if (list_resize(a, k+d) < 0)
+				goto Error;
+			item = a->ob_item;
+			memmove(&item[ihigh+d], &item[ihigh],
+				(k - ihigh)*sizeof(PyObject *));
+		}
 	}
 	for (k = 0; k < n; k++, ilow++) {
 		PyObject *w = vitem[k];
@@ -838,7 +871,7 @@
 			}
 			break;
 		}
-		if (Py_SIZE(self) < self->allocated) {
+		if (Py_SIZE(self) + self->orphans < self->allocated) {
 			/* steals ref */
 			PyList_SET_ITEM(self, Py_SIZE(self), item);
 			++Py_SIZE(self);
@@ -852,7 +885,7 @@
 	}
 
 	/* Cut back result list if initial guess was too large. */
-	if (Py_SIZE(self) < self->allocated)
+	if (Py_SIZE(self) + self->orphans < self->allocated)
 		list_resize(self, Py_SIZE(self));  /* shrinking can't fail */
 
 	Py_DECREF(it);
@@ -2290,7 +2323,7 @@
 {
 	Py_ssize_t res;
 
-	res = sizeof(PyListObject) + self->allocated * sizeof(void*);
+	res = sizeof(PyListObject) + (self->allocated + self->orphans) * sizeof(void*);
 	return PyLong_FromSsize_t(res);
 }
 
Index: Lib/test/test_list.py
===================================================================
--- Lib/test/test_list.py (revision 77751)
+++ Lib/test/test_list.py (working copy)
@@ -51,6 +51,15 @@
         self.assertEqual(len([0]), 1)
         self.assertEqual(len([0, 1, 2]), 3)
 
+    def test_pop_and_prepend(self):
+        # This guards against faulty optimizations on list that
+        # attempt to make pops and prepends at the beginning of the
+        # list work faster.
+        lst = [5] * 100
+        del lst[0]
+        lst.insert(0, 4)
+        self.assertEqual(lst, [4] + [5] * 99)
+
     def test_overflow(self):
         lst = [4, 5, 6, 7]
         n = int((sys.maxsize*2+2) // len(lst))
Index: Lib/test/test_iter.py
===================================================================
--- Lib/test/test_iter.py (revision 77751)
+++ Lib/test/test_iter.py (working copy)
@@ -875,7 +875,18 @@
         except TypeError:
             pass
 
+    def test_extends(self):
+        # This test would break on an incomplete patch to listobject.c
+        def gen():
+            for i in range(500):
+                yield i
+        lst = [0] * 500
+        for i in range(240):
+            lst.pop(0)
+        lst.extend(gen())
 
+
+
 def test_main():
     run_unittest(TestCase)
Index: Lib/test/test_sys.py
===================================================================
--- Lib/test/test_sys.py (revision 77751)
+++ Lib/test/test_sys.py (working copy)
@@ -515,7 +515,7 @@
         # bool objects are not gc tracked
         self.assertEqual(sys.getsizeof(True), size(vh) + self.longdigit)
         # but lists are
-        self.assertEqual(sys.getsizeof([]), size(vh + 'PP') + gc_header_size)
+        self.assertEqual(sys.getsizeof([]), size(vh + 'PPi') + gc_header_size)
 
     def test_default(self):
         h = self.header
@@ -638,7 +638,7 @@
         # list
         samples = [[], [1,2,3], ['1', '2', '3']]
         for sample in samples:
-            check(sample, size(vh + 'PP') + len(sample)*self.P)
+            check(sample, size(vh + 'PPi') + len(sample)*self.P)
         # sortwrapper (list)
         # XXX
         # cmpwrapper (list)
 
Steve Holden

Steve said:
Ok, I just submitted a patch to python-dev that illustrates a 100x
speedup on an admittedly artificial program. It still has a long way
to go, but it demonstrates proof of concept. I'm done for the day,
but tomorrow I will try to polish it up and improve it, even if it's
doomed for rejection. Apologies to all I have offended in this
thread. I frankly found some of the pushback to be a bit hasty and
disrespectful, but I certainly overreacted to some of the criticism.
And now I'm in the awkward position of asking the people I offended to
help me with the patch. If anybody can offer me a hand in
understanding some of CPython's internals, particularly with regard to
memory management, it would be greatly appreciated.

(Sorry I don't have a link to the python-dev posting; it is not
showing up in the archives yet for some reason.)
Fortunately for you this is a very welcoming group, and particularly
responsive to individuals who have seen the error of their ways ;-)

regards
Steve
 