Efficient way to break up a list into two pieces


marwie

Hello,

I recently read about augmented assignments and that (with l1, l2
being lists)

   l1.extend(l2)

is more efficient than

   l1 = l1 + l2

because unnecessary copy operations can be avoided. Now my question is
if there's a similar thing for breaking a list into two parts. Let's
say I want to remove from l1 everything from and including position 10
and store it in l2. Then I can write

   l2 = l1[10:]
   del l1[10:]

But since I'm assigning a slice the elements will be copied.
Basically, I'm looking for something like l1.pop(10,len(l1)) which
returns and removes a whole chunk of data. Is there such a thing (and
if not, why not?)

Cheers,
Martin.
 

Jonathan Gardner

Hello,

I recently read about augmented assignments and that (with l1, l2
being lists)

   l1.extend(l2)

is more efficient than

   l1 = l1 + l2

because unnecessary copy operations can be avoided. Now my question is
if there's a similar thing for breaking a list into two parts. Let's
say I want to remove from l1 everything from and including position 10
and store it in l2. Then I can write

   l2 = l1[10:]
   del l1[10:]

But since I'm assigning a slice the elements will be copied.
Basically, I'm looking for something like l1.pop(10,len(l1)) which
returns and removes a whole chunk of data. Is there such a thing (and
if not, why not?)

The idiom is:

   l1, l2 = l1[:10], l1[10:]

Don't know if it's optimized or not. If it's not, it could probably
be. This is a really common idiom.
 

Steven D'Aprano

Hello,

I recently read about augmented assignments and that (with l1, l2 being
lists)

   l1.extend(l2)

is more efficient than

   l1 = l1 + l2

because unnecessary copy operations can be avoided. Now my question is
if there's a similar thing for breaking a list into two parts. Let's say
I want to remove from l1 everything from and including position 10 and
store it in l2. Then I can write

   l2 = l1[10:]
   del l1[10:]

But since I'm assigning a slice the elements will be copied.

Yes, list slicing can't make any sort of assumptions about what you're
going to do next, so when you take a slice, it has to copy the data.

Basically,
I'm looking for something like l1.pop(10,len(l1)) which returns and
removes a whole chunk of data. Is there such a thing (and if not, why
not?)

Such a list pop would have to copy the data too. How else could it work?

What exactly is the problem you are trying to solve?

If you are unhappy that copy-and-remove a slice from a list takes two
lines instead of one ("won't somebody think of the shortage of
newlines!!!" *wink*), then just write a helper function and call that:

def copy_and_remove(alist, sl):
    tmp = alist[sl]
    del alist[sl]
    return tmp

l2 = copy_and_remove(l1, slice(10, None))


If you are unhappy that copying slices from lists copies data, well
that's just the way things are. Don't make the mistake of trying to
optimize your application before you actually know what bits need
optimizing.

Python lists are arrays of pointers to objects, so copying a slice is
fast: it doesn't have to copy the objects, just pointers. Deleting from
the end of the list is also quick, because you don't have to move memory,
just clear some pointers and change the length field.

Splitting such an array without copying data is, essentially, impossible.
Python lists aren't linked lists.

But as I said, most of the time copying slices doesn't matter: the time
taken is unlikely to be a serious factor in practice. If it is (and
you've profiled your code, right? you're not just guessing what you think
is fast and slow?) then there may be ways to optimize your code to avoid
needing to copy slices, but we'd need to know what you're doing with the
data.
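Steven's advice to profile first can be sketched with the stdlib timeit module; the list size and repeat count below are arbitrary assumptions, not numbers from the thread:

```python
import timeit

# Time one split of a fresh 100,000-element list. With number=1, the
# setup rebuilds l1 before every timed run, so each run splits a full list.
setup = "l1 = list(range(100_000))"
stmt = "l2 = l1[10:]; del l1[10:]"

best = min(timeit.repeat(stmt, setup=setup, number=1, repeat=5))
print(f"best of 5 single splits: {best * 1e6:.1f} microseconds")
```

Only after numbers like these show the split dominating the runtime is it worth reaching for a different data structure.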
 

Steven D'Aprano

   l2 = l1[10:]
   del l1[10:]

But since I'm assigning a slice the elements will be copied. Basically,
I'm looking for something like l1.pop(10,len(l1)) which returns and
removes a whole chunk of data. Is there such a thing (and if not, why
not?)
The idiom is:

   l1, l2 = l1[:10], l1[10:]

No, that has different behaviour to what the Original Poster wants, AND
it does two copies instead of one:

(1) copy l1[:10] and l1[10:]
(2) assign the names l1 and l2 to them
(3) if, and only if, there are no other references to the original l1, it
gets deleted (garbage collected).


What the OP is doing is quite different:

(1) copy l1[:10]
(2) assign the name l2 to it
(3) resize l1 in place to the first 10 items.


What the OP wants is:

(1) assign the name l2 to l1[:10] without copying
(2) resize l1 in place to the first 10 items without affecting l2.
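A small sketch (not part of the original post) of why the distinction matters: the OP's two-step idiom shrinks l1 in place, so every other reference to that list object sees the truncation:

```python
l1 = list(range(20))
alias = l1                # a second name for the same list object

l2 = l1[10:]              # copy the tail
del l1[10:]               # shrink l1 in place

assert alias is l1                    # still the very same object
assert alias == list(range(10))       # the alias sees the truncation
assert l2 == list(range(10, 20))      # the tail lives on in l2
```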
 

marwie

Python lists are arrays of pointers to objects, so copying a slice is
fast: it doesn't have to copy the objects, just pointers. Deleting from
the end of the list is also quick, because you don't have to move memory,
just clear some pointers and change the length field.

Splitting such an array without copying data is, essentially, impossible.
Python lists aren't linked lists.

Well, to split a C array I would simply set l2 to point to l1[10] and
then change the length of l1 (which I store somewhere else). No copying
of elements needed. I would have assumed that Python can do something
like this with its internal arrays of pointers, too.

Anyway, this was more a question about coding style. I use l1.extend(l2)
or l1 += l2 rather than l1 = l1 + l2 because it's as readable and
possibly faster. I was simply wondering if something similar exists for
splitting lists.
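For flat binary data (though not for lists of arbitrary objects), the stdlib can in fact do the pointer trick described above: slicing a memoryview shares the underlying buffer instead of copying it. A sketch with an array of machine-sized ints:

```python
from array import array

a = array('i', range(6))
m = memoryview(a)
tail = m[3:]              # a view onto a's buffer; no elements are copied

a[3] = 99                 # mutate the underlying array...
assert tail[0] == 99      # ...and the view sees the change immediately
```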
 

marwie

What the OP is doing is quite different:

(1) copy l1[:10]
(2) assign the name l2 to it
(3) resize l1 in place to the first 10 items.

What the OP wants is:

(1) assign the name l2 to l1[:10] without copying
(2) resize l1 in place to the first 10 items without affecting l2.

Assuming you meant l1[10:] instead of l1[:10], that's what I wanted,
yes.
 

Steven D'Aprano

Python lists are arrays of pointers to objects, so copying a slice is
fast: it doesn't have to copy the objects, just pointers. Deleting from
the end of the list is also quick, because you don't have to move
memory, just clear some pointers and change the length field.

Splitting such an array without copying data is, essentially,
impossible. Python lists aren't linked lists.

Well, to split a C array I would simply set l2 to point to l1[10] and
then change the length of l1 (which I store somewhere else). No copying
of elements needed. I would have assumed that python can do something
like this with its internal arrays of pointers, too.

Python lists aren't C arrays either.

Python lists are *objects*. Everything in Python is an object.
Consequently, Python lists have a header, which includes the len. You
don't store the length somewhere else, you store it in the object: this
makes for less complexity. You can't just point l2 at an arbitrary index
in an array and expect it to work, it needs a header.

Additionally, Python lists are over-allocated so that appends are fast. A
list of (say) 1000 items might be over-allocated to (say) 1024 items, so
that you can do 24 appends before the array is full and the array needs
to be resized. This means that, on average, Python list appending is O(1)
instead of O(N). You can't just change the length blindly, you need to
worry about the over-allocation.
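The over-allocation is easy to observe with sys.getsizeof; the exact growth pattern is a CPython implementation detail and varies between versions, so only the general shape below should be relied on:

```python
import sys

lst = []
sizes = []
for i in range(20):
    lst.append(i)
    sizes.append(sys.getsizeof(lst))

# The reported size stays flat for several appends at a time and then
# jumps: the flat stretches are appends landing in over-allocated slots.
print(sizes)
assert len(set(sizes)) < len(sizes)
```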

I'm sympathetic to your concern: I've often felt offended that doing
something like this:

x = SomeReallyBigListOrString
for item in x[1:]:
    process(item)

has to copy the entire list or string (less the first item). But
honestly, I've never found a situation where it actually mattered.

Anyway, this was more a question about coding style. I use l1.extend(l2)
or l1 += l2 rather than l1 = l1 + l2 because it's as readable and
possibly faster.

I really, really hate augmented assignment for everything other than ints
and floats because you can't predict what it will do. Take this for
example:

>>> mylist = [1, 2, 3, 4]
>>> same_again = mylist
>>> mylist.append(5)
>>> same_again
[1, 2, 3, 4, 5]

mylist and same_again are the same object, so append and extend behave
predictably and simply. Plain + is simple too:

>>> mylist = mylist + [-1, -2, -3]
>>> mylist
[1, 2, 3, 4, 5, -1, -2, -3]
>>> same_again
[1, 2, 3, 4, 5]

Because you have re-bound the name mylist to a new list, same_again
doesn't get modified. Nice.

But what about mylist += something else? Is it an in-place modification,
like extend and append, or a rebinding, like +? Who can remember? The
answer will depend on whether mylist is a builtin list, or a subclass.

And then there's this mystery:

>>> mylist = [1, 2, 3]
>>> mytuple = (mylist, None)
>>> mytuple[0] += [4]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'tuple' object does not support item assignment
>>> mylist
[1, 2, 3, 4]

So in fact, += is *always* a rebinding, but *sometimes* it is an in-place
modification as well. Yuck.
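The two faces of += can be pinned down with identity checks (a sketch, not from the original post): on a plain list it extends in place and then rebinds the name to the same object, whereas + always builds a new list:

```python
mylist = [1, 2, 3]
alias = mylist

mylist += [4]                  # list.__iadd__: extend in place, then rebind
assert alias is mylist         # same object, so the alias changed too
assert alias == [1, 2, 3, 4]

mylist = mylist + [5]          # + builds a brand-new list
assert alias is not mylist     # the names have diverged
assert alias == [1, 2, 3, 4]   # the alias is left behind
```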
 

MRAB

Steven D'Aprano wrote:
[snip]
I'm sympathetic to your concern: I've often felt offended that doing
something like this:

x = SomeReallyBigListOrString
for item in x[1:]:
    process(item)

has to copy the entire list or string (less the first item). But
honestly, I've never found a situation where it actually mattered.
[snip]
Could lists gain an .items() method with start and end positions? I was
thinking that it would be a 'view', like with dicts in Python 3. Of
course, that would only be worthwhile with very long lists:

x = SomeReallyBigListOrString
for item in x.items(1):
    process(item)
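Something close to the proposed view already exists for plain iteration: itertools.islice walks a span of the list lazily instead of materialising a copy. A sketch (process here is just the stand-in name from the example above):

```python
from itertools import islice

x = list(range(1_000_000))

def process(item):
    pass                          # stand-in for real per-item work

count = 0
for item in islice(x, 1, None):   # iterates x[1:] without copying it
    process(item)
    count += 1

assert count == len(x) - 1
```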
 

Jonathan Gardner

What the OP wants is:

(1) assign the name l2 to l1[:10] without copying
(2) resize l1 in place to the first 10 items without affecting l2.

For ten items, though, is it really faster to muck around with array
lengths than just copying the data over? Array copies are extremely
fast on modern processors.

Programmer time and processor time being what it is, I still think my
answer is the correct one. No, it's not what he says he wants, but it
is what he needs.
 

Carl Banks

Hello,

I recently read about augmented assignments and that (with l1, l2
being lists)

    l1.extend(l2)

is more efficient than

    l1 = l1 + l2

because unnecessary copy operations can be avoided. Now my question is
if there's a similar thing for breaking a list into two parts. Let's
say I want to remove from l1 everything from and including position 10
and store it in l2. Then I can write

    l2 = l1[10:]
    del l1[10:]

But since I'm assigning a slice the elements will be copied.

That's about the best you can do with Python lists.

Basically, I'm looking for something like l1.pop(10,len(l1)) which
returns and removes a whole chunk of data. Is there such a thing (and
if not, why not?)

Numpy arrays can share underlying data like that when you take
slices. For instance, this probably works the way you want:

   a = numpy.array([1, 2, 3, 4, 5, 6])
   b = a[:3]
   c = a[3:]

None of the actual data was copied here.
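One way to confirm the sharing (assuming numpy is installed) is to write through one slice and read the change back through the base array:

```python
import numpy

a = numpy.array([1, 2, 3, 4, 5, 6])
b = a[:3]
c = a[3:]

c[0] = 99                            # write through the view...
assert a[3] == 99                    # ...and the base array sees it
assert b.base is a and c.base is a   # both slices are views onto a
```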



Carl Banks
 

Steven D'Aprano

What the OP wants is:

(1) assign the name l2 to l1[:10] without copying (2) resize l1 in
place to the first 10 items without affecting l2.
For ten items, though, is it really faster to muck around with array
lengths than just copying the data over? Array copies are extremely fast
on modern processors.

My mistake, he actually wants l1[10:] copied. So you might be copying a
huge amount of data, not just ten items.

Or if you prefer, replace 10 by 100000000.

But in either case, I agree that *in practice* it's rarely an actual
problem.

Programmer time and processor time being what it is, I still think my
answer is the correct one. No, it's not what he says he wants, but it is
what he needs.

You missed the point that your suggestion gives radically different
behaviour to what the OP indicated he is using. He mutates the list in
place, you rebind the list's name. That has *very* different semantics.

>>> a = b = [1, 2, 3, 4]
>>> del a[2:]  # mutate in place
>>> b
[1, 2]

>>> a = b = [1, 2, 3, 4]
>>> a = a[2:]  # rebind the name
>>> b
[1, 2, 3, 4]

The OP may not care about the difference, but if he does require the
first behaviour, then your solution doesn't help.
 

marwie

Additionally, Python lists are over-allocated so that appends are fast. A
list of (say) 1000 items might be over-allocated to (say) 1024 items, so
that you can do 24 appends before the array is full and the array needs
to be resized. This means that, on average, Python list appending is O(1)
instead of O(N). You can't just change the length blindly, you need to
worry about the over-allocation.

Ok, I see your point. However, other data representation might still
be able to optimize such a multi-element pop. I'm thinking of deques,
for example.
I'm sympathetic to your concern: I've often felt offended that doing
something like this:

x = SomeReallyBigListOrString
for item in x[1:]:
    process(item)

has to copy the entire list or string (less the first item). But
honestly, I've never found a situation where it actually mattered.

Good grief, it copies that, too? I assumed that the copying is at
least delayed until the assignment and that x[1:] is represented by
some wrapper that just shuffles the indices around (much like
the .items method that MRAB suggested).

Maybe copying chunks of data really isn't that much of an issue as it
used to be (and maybe I'm getting old). The application I have in mind
has to do with symbolic computation, and expressions would be
represented by python lists. It's too early to do any profiling (in
fact, it's at the "deciding if python is the right language to
implement it" stage), but it will have to handle expressions with many
terms (i.e. long lists), and in the end taking these lists apart and
putting them back together in different order is the only thing the
code will do. That is to explain my interest in performance issues
related to Python lists.

Anyway, thanks for your help.
Martin.
 

marwie

Numpy arrays can share underlying data like that when you take
slices.  For instance, this probably works the way you want:

   a = numpy.array([1, 2, 3, 4, 5, 6])
   b = a[:3]
   c = a[3:]

None of the actual data was copied here.

Hmm, that might be worth looking into. Thanks.
 

MRAB

MRAB said:
Steven D'Aprano wrote:
[snip]
I'm sympathetic to your concern: I've often felt offended that doing
something like this:

x = SomeReallyBigListOrString
for item in x[1:]:
    process(item)

has to copy the entire list or string (less the first item). But
honestly, I've never found a situation where it actually mattered.
[snip]
Could lists gain an .items() method with start and end positions? I was
thinking that it would be a 'view', like with dicts in Python 3. Of
course, that would only be worthwhile with very long lists:

x = SomeReallyBigListOrString
for item in x.items(1):
    process(item)

Actually, my example is a bit wrong! My suggestion should have said that
the arguments of list.items would resemble those of range:

x = SomeReallyBigListOrString
for item in x.items(1, None): # [1 : None], or just [1 : ]
    process(item)

Having a third 'stride' argument would also be possible.
 

Steve Howell

Python lists are arrays of pointers to objects, so copying a slice is
fast: it doesn't have to copy the objects, just pointers. Deleting from
the end of the list is also quick, because you don't have to move memory,
just clear some pointers and change the length field.
Splitting such an array without copying data is, essentially, impossible.
Python lists aren't linked lists.

Well, to split a C array I would simply set l2 to point to l1[10] and
then change the length of l1 (which I store somewhere else). No copying
of elements needed. I would have assumed that Python can do something
like this with its internal arrays of pointers, too.

When you remove 10 elements off of a list of size 1000, none of the
objects themselves are moved, but the pointers to the objects are all
moved, so k*990 bytes get moved backward, where k is the size of a
pointer (4 or 8 typically on modern computers).

There is no mechanism to advance a pointer forward when you delete or
pop from the top.

I proposed the following patch to implement an advance-the-pointer
scheme, but it is unlikely to be accepted in the near future:

http://bugs.python.org/file16034/no_mem_penalty.diff

http://bugs.python.org/issue7784

You can find alternative data structures that might serve your needs
better, such as collections.deque.
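A deque makes single-element removals at either end O(1), though it still offers no O(1) bulk split; popping a chunk off the front looks like this (the chunk size is an arbitrary assumption):

```python
from collections import deque

d = deque(range(1000))

# Each popleft() is O(1): unlike del lst[:10] on a list, nothing shifts
# the remaining 990 pointers backward.
chunk = [d.popleft() for _ in range(10)]

assert chunk == list(range(10))
assert len(d) == 990
```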

If you are interested in how Python lists work internally, you mostly
want to look at listobject.c:

http://svn.python.org/view/python/trunk/Objects/listobject.c?view=markup

In particular, look at list_resize(), list_slice(), and
list_ass_slice().
 

Steven D'Aprano

x = SomeReallyBigListOrString
for item in x[1:]:
    process(item)

has to copy the entire list or string (less the first item). But
honestly, I've never found a situation where it actually mattered.

Good grief, it copies that, too? I assumed that the copying is at least
delayed until the assignment and that x[1:] is represented by some
wrapper that just shuffles the indices around (much like the .items
method that MRAB suggested).

Such complexity doesn't happen for free. It costs developer time, more
complexity means more bugs, and more runtime overhead, especially for
small lists. 99.9% of programs operate on small amounts of data, and
"small" gets bigger every year.

(I was reading one of my old Uni text books the other day, and one of the
projects was an application to index all the words in a text file. The
author decided to use disk storage rather than in-memory data structures
because he was hoping to read files up to 60,000 words, and considered it
unlikely that any PCs would have enough memory to store 60,000 words!)

Unless you actually profile the code, you're as likely to be *slowing
Python down* by such extra sophistication as you are to speed it up.
Copying blocks of pointers is really fast on modern CPUs.

As a general rule, Python aims for simplicity of implementation rather
than clever, but fragile, tricks.

Maybe copying chunks of data really isn't that much of an issue as it
used to be (and maybe I'm getting old). The application I have in mind
has to do with symbolic computation, and expressions would be
represented by python lists. It's too early to do any profiling (in
fact, it's at the "deciding if python is the right language to implement
it" stage), but it will have to handle expressions with many terms (i.e.
long lists), and in the end taking these lists apart and putting them
back together in different order is the only thing the code will do.


Are you likely to have expressions with a hundred or so MILLION terms?

If so, then you should start thinking about clever tricks to minimise
copying. Otherwise, you're engaged in premature optimization, which is
the root of all computer science evil :)
 

Jonathan Gardner

For ten items, though, is it really faster to muck around with array
lengths than just copying the data over? Array copies are extremely fast
on modern processors.

My mistake, he actually wants l1[10:] copied. So you might be copying a
huge amount of data, not just ten items.

Or if you prefer, replace 10 by 100000000.

If you need to scale that direction, it's time to install a database.
You missed the point that your suggestion gives radically different
behaviour to what the OP indicated he is using. He mutates the list in
place, you rebind the list's name. That has *very* different semantics.


The OP may not care about the difference, but if he does require the
first behaviour, then your solution doesn't help.

Correct.
 
