Negative array indicies and slice()

A

andrewr3mail

On Mon, Oct 29, 2012 at 1:24 AM,

You never wrapped `a` in a RangedSlicer or otherwise made use of RangedSlicer!

You wanted:

a = RangedSlicer([1,2,3,4,5])


[2, 3, 4, 5]
Very odd... I would have expected [1,2,3,4]



"[2, 3, 4, 5]" is the return value from `a.__getitem__( slice(1,5) )`

(or, equivalently, from `[1,2,3,4,5][1:5]`). It is not the result of

"print item"; that line of code is never executed since you never used

the RangedSlicer class at all.



Regards,

Chris

My apology --- I deleted that post; yet it didn't delete... I saw my mistake seconds after posting.

***** gmail.

Note: I subscribed to the python-list, and am able to recieve e-mails, but I don't see how to write a post for this particular thread nor subscribe to this particular thread...

A brief suggestion, or link to a howto would be *much* appreciated.
 
M

Mark Lawrence

Note: I subscribed to the python-list, and am able to recieve e-mails, but I don't see how to write a post for this particular thread nor subscribe to this particular thread...

A brief suggestion, or link to a howto would be *much* appreciated.

Get yourself a decent email client. I read all the Python lists that
I'm interested in using Thunderbird on Windows via gmane.
 
S

Steven D'Aprano

Note: I subscribed to the python-list, and am able to recieve e-mails,
but I don't see how to write a post for this particular thread nor
subscribe to this particular thread...

The beauty of email is that you don't have to subscribe to a thread. Once
you subscribe to the mailing list, email is delivered into your inbox. To
reply to it, just reply to it. To ignore it, throw it in the trash.

Gmail should have a button or three that say "Reply to email" or similar.
You want the button that says "Reply to All" or "Reply to List". Make
sure that the reply includes "(e-mail address removed)" as a recipient.

Delete bits of the quoted email (the lines that start with > characters)
that are no longer relevant to the conversation. Type your reply. Double
check that the reply is going to python-list. Then hit Send.

(P.S. when you signed up for (e-mail address removed), if you selected the
option to receive a single daily digest instead of individual emails,
you're going to have a bad time. Do yourself a favour -- and the rest of
us -- and change back to individual emails.)
 
S

Steven D'Aprano

Actually, I said in the OP:

"I also don't understand why slice() is not equivalent to an iterator,
but can replace an integer in __getitem__() whereas xrange() can't."

Slices and iterators have different purposes and therefore have not been
made interchangeable. Yes, there are certain similarities between a slice
and xrange, but there are also significant differences.

Thank you for the code snippet; I don't think it likely that existing
programs depend on nor use a negative index and a positive index
expecting to take a small chunk in the center...

On the contrary. That is the most straightforward and useful idea of
slicing, to grab a contiguous slice of items.

Why would you want to grab a slice from the end of the list, and a slice
from the start of the list, and swap them around? Apart from simulating
card shuffles and cuts, who does that?

hence, I would return
the whole array; Or if someone said [-len(listX) : len(listX)+1 ] I
would return the whole array twice.

Well, that's one possible interpretation, but it is not the one that
Python uses. When you create your own language, you can choose whatever
interpretation seems most useful to you too.


That's the maximum that is possible.
If someone could show me a normal/reasonable script which *would* expect
the other behavior, I'd like to know; compatibility is important.

I'm not entirely sure I understand what you are asking here.

My intended inferences about the iterator vs. slice question was perhaps
not obvious to you; Notice: an iterator is not *allowed* in
__getitem__().

Actually, you can write __getitem__ for your own classes to accept
anything you like.

py> class Test:
.... def __getitem__(self, index):
.... return index
....
py> t = Test()
py> t["Hello world"]
'Hello world'
py> t[{'x': None}]
{'x': None}

The slice class when passed to __getitem__() was created to merely pass
two numbers and a stride to __getitem__; As far as I know slice()
itself does *nothing* in the actual processing of the elements. So,
it's *redundant* functionality, and far worse, it's restrictive.

You say that as if it were a bad thing.

The philosophy of Python is to have exactly one way to do something when
possible; so, why create a stand alone class that does nothing an
existing class could already do, and do it better ?

What existing class is that? It certainly isn't xrange.

Because xrange represents a concrete sequence of numbers, all three of
start, end and stride must be concrete, known, integers:

py> xrange(4, None, 2)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: an integer is required

Whereas slices can trivially include blanks that get filled in only when
actually used:

py> "hello world"[aslice]
'owrd'
py> "NOBODY expects the Spanish Inquisition!"[aslice]
'D xet h pns nusto!'


So, no, xrange is no substitute for slices. Not even close.

A simple list of three values would be just as efficient as slice()!

On the contrary, a simple list of three values not only could not do
everything a slice does, but it's over twice the size!

py> sys.getsizeof([1, 2, 3])
44
py> sys.getsizeof(slice(1, 2, 3))
20

xrange is more flexible, and can be just as efficient.

Less flexible, less efficient.


[snip]
In 'C', where Python is written,

That's a popular misapprehension. Python is written in Java, or Lisp, or
Haskell, or CLR (dot Net), or RPython, or Ocaml, or Parrot. Each of those
languages have, or had, at least one Python implementation. Oh, there's
also a version written in C, or so I have heard.
 
C

Chris Angelico

That's a popular misapprehension. Python is written in Java, or Lisp, or
Haskell, or CLR (dot Net), or RPython, or Ocaml, or Parrot. Each of those
languages have, or had, at least one Python implementation. Oh, there's
also a version written in C, or so I have heard.

And that's not including the human-brain implementation, perhaps the
most important of all. Although the current port of Python to my brain
isn't quite a complete implementation, lacking a few bits that I
should probably get to at some point, but even so, it's as useful to
me as firing up IDLE.

I wonder if what the OP is looking for is not slicing, but something
more akin to map. Start with a large object and an iterator that
produces keys, and create an iterator/list of their corresponding
values. Something like:

a=[1,2,3,4,5,6,7,8,9,10]
b=[a for i in xrange(-4,3)]

It's not strictly a slice operation, but it's a similar sort of thing,
and it can do the wraparound quite happily.

ChrisA
 
C

Chris Angelico

I am curious as to how quickly it constructs the result compared to a slice
operation.

Eg:
a[1:5]
vs.
[ a for i in xrange[1:5] ]


For the most part, don't concern yourself with performance. Go with
functionality and readability. In the trivial case shown here, the
slice is WAY clearer, so it should definitely be the one used; in
other cases, the slice might simply be insufficient, so you go with
whatever achieves your goal. Performance will usually be "good
enough", even if there's a marginally faster way.

ChrisA
 
R

Roy Smith

Andrew Robinson said:
Show me an example where someone would write a slice with a negative and
a positive index (both in the same slice);
and have that slice grab a contiguous slice in the *middle* of the list
with orientation of lower index to greater index.

It's possible in bioinformatics. Many organisms have circular
chromosomes. It's a single DNA molecule spliced into a loop. There's
an "origin", but it's more a convenience thing for people to assign some
particular base-pair to be location 0. From the organism's point of
view, the origin isn't anything special (but there *is* a fixed
orientation).

It's entirely plausible for somebody to want to extract the sub-sequence
from 100 bp (base-pairs) before the origin to 100 bp after the origin.
If you were storing the sequence in Python string (or list), the most
convenient way to express this would be seq[-100:100]. Likewise, if you
wanted the *other* fragment, you would write seq[100:-100].

There is a minor confounding factor here in that biologists number
sequences starting with 1, not 0. At least that was the way when I was
doing this stuff mumble years ago. I don't know what the current
convention is.
 
C

Chris Angelico

Looking at some of the online programming notes -- a slice apparently
doesn't use an integer storage variable that is capable of arbitrary
expansion. =-O -- and hence, won't work for very large sized lists. That
actually explains some crashes I have noted in the past when working with 20
million element lists that I wanted a slice of. I had *plenty* of ram on
that system.

Can you provide links to these notes? I'm looking at
cpython/Include/sliceobject.h that has this comment:

/*

A slice object containing start, stop, and step data members (the
names are from range). After much talk with Guido, it was decided to
let these be any arbitrary python type. Py_None stands for omitted values.
*/

Also, the code for slice objects in CPython works with Py_ssize_t (a
signed quantity of the same length as size_t), which will allow at
least 2**31 for an index. I would guess that your crashes were nothing
to do with 20 million elements and slices.

ChrisA
 
A

Andrew Robinson

Show me an example where someone would write a slice with a negative and
a positive index (both in the same slice);
and have that slice grab a contiguous slice in the *middle* of the list
with orientation of lower index to greater index.
It's possible in bioinformatics. ...
eq[100:-100].
I decided to go to bed... I was starting to write very badly worded
responses. :)

Thanks, Roy, what you have just shown is another example that agrees
with what I am trying to do.
FYI: I was asking for a reason why Python's present implementation is
desirable...

I wonder, for example:

Given an arbitrary list:
a=[1,2,3,4,5,6,7,8,9,10,11,12]

Why would someone *want* to do:
a[-7,10]
Instead of saying
a[5:10] or a[-7:-2] ?

eg:
What algorithm would naturally *desire* the default behavior of slicing
when using *mixed* negative and positive indexes?
In the case of a bacterial circular DNA/RNA ring, asking for codons[
-10: 10 ] would logically desire codon[-10:] + codon[:10] not an empty
list, right?

I think your example is a very reasonable thing the scientific community
would want to do with Python.
:)
 
A

Andrew Robinson

If that's the case, then running Python at all is probably a mistake.
You know the interpreter alone has an overhead of nearly 6 MB?
There's already a version of the python interpreter which fits in under
100K:
http://code.google.com/p/python-on-a-chip/
It's not the 3.x series, though; and I don't want to redo this once 2.7
really does become obsolete.
You can just overload that one method in a subclass of list. Being
able to monkey-patch __getitem__ for the list class itself would not
be advisable, as it would affect all list slicing anywhere in your
program and possibly lead to some unexpected behaviors.
That's what I am curious about.
What unexpected behaviors would a "monkey patch" typically cause?
If no one really uses negative and positive indexes in the same slice
operation, because there is no reason to do so...
It will only break the occasional esoteric application.
20 million is nothing. On a 32-bit system, sys.maxsize == 2 ** 31 -
1. If the error you were seeing was MemoryError, then more likely you
were running into dynamic allocation issues due to fragmentation of
virtual memory.

No, there was no error at all. Pthon just crashed & exited; not even an
exception that I can recall. It was if it exited normally!

The list was generated in a single pass by many .append() 's, and then
copied once -- the original was left in place; and then I attempted to
slice it.

I am able to routinely to 5 million length lists, copy, slice, cut,
append, and delete from them without this ever happening.
If fragmentation were the issue, I'd think the shorter lists would cause
the problem after many manipulations...

It may not be a bug in python itself, though, of course. There are
libraries it uses which might have a bug.
 
I

Ian Kelly

I will be porting Python 3.xx to a super low power embedded processor (MSP430), both space and speed are at a premium.
Running Python on top of Java would be a *SERIOUS* mistake. .NET won't even run on this system. etc.

If that's the case, then running Python at all is probably a mistake.
You know the interpreter alone has an overhead of nearly 6 MB?
Yes, I realize that.
But, why can't I just overload the existing __getitem__ for lists and notbother writing an entire class?

You can just overload that one method in a subclass of list. Being
able to monkey-patch __getitem__ for the list class itself would not
be advisable, as it would affect all list slicing anywhere in your
program and possibly lead to some unexpected behaviors.
Hmmm..
Let's try your example exactly as shown...

"hello world"[aslice]

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'aslice' is not defined

WOW. Cool.
Where did the blanks *actually* get filled in? Or HOW WILL they in your next post?

It appears that Steven omitted the definition of aslice by mistake.
It looks like it should have been:

aslice = slice(4, None, 2)
Looking at some of the online programming notes -- a slice apparently doesn't use an integer storage variable that is capable of arbitrary expansion. =-O -- and hence, won't work for very large sized lists. That actually explains some crashes I have noted in the past when working with 20 million element lists that I wanted a slice of. I had *plenty* of ram on that system.

20 million is nothing. On a 32-bit system, sys.maxsize == 2 ** 31 -
1. If the error you were seeing was MemoryError, then more likely you
were running into dynamic allocation issues due to fragmentation of
virtual memory.
 
I

Ian Kelly

My intended inferences about the iterator vs. slice question was perhaps not obvious to you; Notice: an iterator is not *allowed* in __getitem__().

Yes, I misconstrued your question. I thought you wanted to change the
behavior of slicing to wrap around the end when start > stop instead
of returning an empty sequence. What you actually want is a new
sequence built from indexes supplied by an iterable. Chris has
already given you a list comprehension solution to solve that. You
could also use map for this:

new_seq = list(map(old_seq.__getitem__, iterable))

Since you seem to be concerned about performance, I'm not sure in this
case whether the map or the list comprehension will be faster. I'll
leave you to test that on your intended hardware.
In 'C', where Python is written, circularly linked lists -- and arrays are both very efficient ways of accessing data. Arrays can, in fact, have negative indexes -- perhaps contrary to what you thought. One merely definesa variable to act as the base pointer to the array and initialize it to the *end* of the array. Nor is the size of the data elements an issue, since in Python all classes are accessed by pointers which are of uniform size. Iroutinely do this in C.

I'm aware of what is possible in C with pointer arithmetic. This is
Python, though, and Python by design has neither pointers nor pointer
arithmetic. In any case, initializing the pointer to the end of the
array would still not do what you want, since the positive indices
would then extend past the end of the array.
 
A

Andrew Robinson

Can you provide links to these notes? I'm looking at
cpython/Include/sliceobject.h that has this comment:

/*

A slice object containing start, stop, and step data members (the
names are from range). After much talk with Guido, it was decided to
let these be any arbitrary python type. Py_None stands for omitted values.
*/

Also, the code for slice objects in CPython works with Py_ssize_t (a
signed quantity of the same length as size_t), which will allow at
least 2**31 for an index. I would guess that your crashes were nothing
to do with 20 million elements and slices.

ChrisA
Let's look at the source code rather than the web notes -- the source
must be the true answer anyhow.

I downloaded the source code for python 3.3.0, as the tbz;
In the directory "Python-3.3.0/Python", look at Python-ast.c, line 2089
& ff.

Clearly a slice is malloced for a slice_ty type.
It has four elements: kind, lower, upper, and step.

So, tracing it back to the struct definition...

"Include/Python-ast.h" has "typedef struct _slice *slice_ty;"

And, here's the answer!:

enum _slice_kind {Slice_kind=1, ExtSlice_kind=2, Index_kind=3};
struct _slice {
enum _slice_kind kind;
union {
struct {
expr_ty lower;
expr_ty upper;
expr_ty step;
} Slice;

struct {
asdl_seq *dims;
} ExtSlice;

struct {
expr_ty value;
} Index;

} v;
};


So, slice() does indeed have arbitrary python types included in it;
contrary to what I read elsewhere.
expr_ty is a pointer to an arbitrary expression, so the actual structure
is 4 pointers, at 32 bits each = 16 bytes.
The size of the structure itself, given in an earlier post, is 20 bytes
-- which means one more pointer is involved, perhaps the one pointing to
the slice structure itself.

Hmm...!

An empty tuple gives sys.getsizeof( () ) = 24.

But, I would expect a tuple to be merely a list of object pointers;
hence I would expect 4 bytes for len(), and then a head pointer 4 bytes,
and then a pointer for each object.
3 objects gives 12 bytes, + 8 = 16 bytes.

Then we need one more pointer so Python knows where the struct is...
So a Tuple of 3 objects ought to fit nicely into 20 bytes; the same size
as slice() --

but it's 24, even when empty...
And 36 when initialized...
What are the extra 16 bytes for?

All I see is:
typedef struct { object** whatever } PyTupleObject;
 
A

Andrew Robinson

You say that as if writing "an entire class" was a big complicated
effort. It isn't. It is trivially simple, a single line:

class MyList(list):
...
No, I don't think it big and complicated. I do think it has timing
implications which are undesirable because of how *much* slices are used.
In an embedded target -- I have to optimize; and I will have to reject
certain parts of Python to make it fit and run fast enough to be useful.
What part of "unexpected" is unclear?
Ahh -- The I don't know approach! It's only unexpected if one is a bad
programmer...!
Let me see if I can illustrate a flavour of the sort of things that can
happen if monkey-patching built-ins were allowed.

You create a list and print it:

# simulated output
py> x = [5, 2, 4, 1]
py> print(x)
[1, 2, 4, 5]
<snip>

Finally you search deep into the libraries used in your code, and *five
days later* discover that your code uses library A which uses library B
which uses library C which uses library D which installs a harmless
monkey-patch to print, but only if library E is installed, and you just
happen to have E installed even though your code never uses it, AND that
monkey-patch clashes with a harmless monkey-patch to list.__getitem__
installed by library F. And even though each monkey-patch alone is
harmless, the combination breaks your code's output.

Right, which means that people developing the libraries made
contradictory assumptions.
Python allows, but does not encourage, monkey-patching of code written in
pure Python, because it sometimes can be useful. It flat out prohibits
monkey-patching of builtins, because it is just too dangerous.

Ruby allows monkey-patching of everything. And the result was predictable:

http://devblog.avdi.org/2008/02/23/why-monkeypatching-is-destroying-ruby/

I read that post carefully; and the author purposely notes that he is
exaggerating.
BUT Your point is still well taken.

What you are talking about is namespace preservation; and I am thinking
about it. I can preserve it -- but only if I disallow true Python
primitives in my own interpreter; I can't provide two sets in the memory
footprint I am using.

From my perspective, the version of Python that I compile will not be
supported by the normal python help; The predecessor which first forged
this path, Pymite, has the same problems -- however, the benefits
ought-weigh the disadvantages; and the experiment yielded useful
information on what is redundant in Python (eg: range is not supported)
and when that redundancy is important for some reason.

If someone had a clear explanation of the disadvantages of allowing an
iterator, or a tuple -- in place of a slice() -- I would have no qualms
dropping the subject. However, I am not finding that yet. I am finding
very small optimization issues...

The size of an object is at least 8 bytes. Hence, three numbers is
going to be at least 24 bytes; and that's 24 bytes in *excess* of the
size of slice() or tuple () which are merely containers. So -- There
*ARE* savings in memory when using slice(), but it isn't really 2x
memory -- its more like 20% -- once the actual objects are considered.

The actual *need* for a slice() object still hasn't been demonsrated. I
am thinking that the implementation of __getitem__() is very poor
probably because of legacy issues.

A tuple can also hold None, so ( 1, None, 2 ) is still a valid Tuple.
Alternately: An iterator, like xrange(), could be made which takes None
as a parameter, or a special value like 'inf'.
Since these two values would never be passed to xrange by already
developed code, allowing them would not break working code.

I am only aware of one possible reason that slice() was once thought to
be necessary; and that is because accessing the element of a tuple would
recursively call __getitem__ on the tuple. But, even that is easily
dismissed once the fixed integer indexes are considered.

Your thoughts? Do you have any show stopper insights?
 
S

Steven D'Aprano

I am curious as to how quickly it constructs the result compared to a
slice operation.

Eg:
a[1:5]
vs.
[ a for i in xrange[1:5] ]


For the most part, don't concern yourself with performance. Go with
functionality and readability. In the trivial case shown here, the slice
is WAY clearer, so it should definitely be the one used; in other cases,
the slice might simply be insufficient, so you go with whatever achieves
your goal. Performance will usually be "good enough", even if there's a
marginally faster way.



Slicing is about an order of magnitude faster:


[steve@ando ~]$ python2.7 -m timeit -s "x = range(100, 1000, 2)" "x
[20:40]"
1000000 loops, best of 3: 0.342 usec per loop
[steve@ando ~]$ python2.7 -m timeit -s "x = range(100, 1000, 2)" "[x
for i in xrange(20, 40)]"
100000 loops, best of 3: 3.43 usec per loop
 
S

Steven D'Aprano

Because xrange represents a concrete sequence of numbers, all three of
start, end and stride must be concrete, known, integers:

py> xrange(4, None, 2)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: an integer is required

Whereas slices can trivially include blanks that get filled in only when
actually used:

py> "hello world"[aslice]
'owrd'
py> "NOBODY expects the Spanish Inquisition!"[aslice]
'D xet h pns nusto!'

/me facepalms/


Argggh, I forgot to copy-and-paste the critical line defining aslice:

aslice = slice(4, None, 2)


Sorry about that.
 
A

Andrew Robinson

Every Python object requires two pieces of data, both of which are
pointer-sized (one is a pointer, one is an int the size of a pointer).
These are: a pointer to the object's type, and the object's reference
count. A tuple actually does not need a head pointer: the head pointer
is merely an offset from the tuple's pointer. It merely has a ref
count, type, an item count, and pointers to its contents. A slice has
the same type pointer and reference count, then three pointers to the
start, stop, and step objects. This means a slice object should be the
same size as a two-item tuple: the tuple needs a count, while that is
fixed at 3 for a slice (though some items may be unset). NOTE: The
above is taken from reading the source code for Python 2.6. For some
odd reason, I am getting that an empty tuple consists of 6
pointer-sized objects (48 bytes on x64), rather than the expected 3
pointer-sized (24 bytes on x64). Slices are showing up as the expected
5 pointer-sized (40 bytes on x64), and tuples grow at the expected 1
pointer (8 bytes on x64) per item. I imagine I am missing something,
but cannot figure out what that would be.
It's fairly straight forward in 3.2.0. I debugged the code with GDB and
watched.
Perhaps it is the same in 2.6 ?

In addition to those items you mention, of which the reference count is
not even *inside* the struct -- there is additional debugging
information not mentioned. Built in objects contain a "line number", a
"column number", and a "context" pointer. These each require a full
word of storage.

Also, built in types appear to have a "kind" field which indicates the
object "type" but is not a pointer. That suggests two "object" type
indicators, a generic pointer (probably pointing to "builtin"? somewhere
outside the struct) and a specific one (an enum) inside the "C" struct.

Inside the tuple struct, I count 4 undocumented words of information.
Over all, there is a length, the list of pointers, a "kind", "line",
"col" and "context"; making 6 pieces in total.

Although your comment says the head pointer is not required; I found in
3.3.0 that it is a true head pointer; The Tuple() function on line 2069
of Python-ast.c, (3.3 version) -- is passed in a pointer called *elts.
That pointer is copied into the Tuple struct.

How ironic, slices don't have debugging info, that's the main reason
they are smaller.
When I do slice(3,0,2), suprisingly "Slice()" is NOT called.
But when I do a[1:2:3] it *IS* called.
 
C

Chris Angelico

No, there was no error at all. Pthon just crashed & exited; not even an
exception that I can recall. It was if it exited normally!

Can you create a reproducible test case? There's usually a cause to
these sorts of things.

ChrisA
 
I

Ian Kelly

FYI: I was asking for a reason why Python's present implementation is
desirable...

I wonder, for example:

Given an arbitrary list:
a=[1,2,3,4,5,6,7,8,9,10,11,12]

Why would someone *want* to do:
a[-7,10]
Instead of saying
a[5:10] or a[-7:-2] ?

A quick search of local code turns up examples like this:

if name.startswith('{') and name.endswith('}'):
name = name[1:-1]

If slices worked like ranges, then the result of that would be empty,
which is obviously not desirable.

I don't know of a reason why one might need to use a negative start
with a positive stop, though.
 
I

Ian Kelly

The list was generated in a single pass by many .append() 's, and then
copied once -- the original was left in place; and then I attempted to slice
it.

Note that if the list was generated by .appends, then it was copied
more than once. Python reserves a specific amount of space for the
list. When it grows past that, the list must be reallocated and
copied. It grows the list exponentially in order to keep the
amortized time complexity of append at O(1), but the point is that a
list of 20 million items is going to be resized and copied several
times before it is complete.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
474,145
Messages
2,570,824
Members
47,370
Latest member
desertedtyro29

Latest Threads

Top