General Numerical Python question

2

2mc

Generally speaking, if one had a list (from regular Python) and an
array (from Numerical Python) that contained the same number of
elements, would a While loop or a For loop process them at the same
speed? Or, would the array process faster?

I'm new to Python, so my question may expose my ignorance. I
appreciate anyone's effort to help me understand.

Thanks. It is much appreciated.

Matt
 
A

Alex Martelli

2mc said:
Generally speaking, if one had a list (from regular Python) and an
array (from Numerical Python) that contained the same number of
elements, would a While loop or a For loop process them at the same
speed? Or, would the array process faster?

I'm new to Python, so my question may expose my ignorance. I
appreciate anyone's effort to help me understand.

I don't know, I've never measured. Let's find out together.

The best way to answer these performance questions, which may
easily vary a little depending on your platform and exact versions
involved, is to _measure_. Python 2.3's standard library comes with
timeit.py, a little script that's made just for that. I've copied it to my
~/bin/ directory and done a chmod +x (it starts with a shebang line
so that's sufficient), or in Windows you might set up a .bat or .cmd
file to call Python on it. Anyway, it's easy to use: you specify zero
or more -s 'blahblah' arguments to set things up, then the specific
statement you want to time. Watch...:

[alex@lancelot pop]$ timeit.py -s'import Numeric' -s'x=Numeric.arange(555)'
'for i in x: id(i)'
1000 loops, best of 3: 296 usec per loop
[alex@lancelot pop]$ timeit.py -s'import Numeric' -s'x=range(555)' 'for i in
x: id(i)'
1000 loops, best of 3: 212 usec per loop
[alex@lancelot pop]$ timeit.py -s'import Numeric' -s'x=Numeric.arange(555)'
'for i in x: id(i)'
1000 loops, best of 3: 296 usec per loop
[alex@lancelot pop]$ timeit.py -s'import Numeric' -s'x=range(555)' 'for i in
x: id(i)'
1000 loops, best of 3: 207 usec per loop
[alex@lancelot pop]$

So, on this specific case, looping over a list of ints is a bit faster than
looping over an otherwise equivalent Numeric.array -- about 210
microseconds versus about 300.

Similarly:

[alex@lancelot pop]$ timeit.py -s'import Numeric' -s'x=range(555)' 'for i in
range(len(x)): x=id(x)'
1000 loops, best of 3: 353 usec per loop
[alex@lancelot pop]$ timeit.py -s'import Numeric' -s'x=range(555)' 'for i in
range(len(x)): x=id(x)'
1000 loops, best of 3: 356 usec per loop
[alex@lancelot pop]$ timeit.py -s'import Numeric' -s'x=Numeric.arange(555)'
'for i in range(len(x)): x=id(x)'
1000 loops, best of 3: 581 usec per loop
[alex@lancelot pop]$ timeit.py -s'import Numeric' -s'x=Numeric.arange(555)'
'for i in range(len(x)): x=id(x)'
1000 loops, best of 3: 585 usec per loop

Here we're accessing AND also modifying each element by index, and the
list outperforms the array about 350 microseconds to 580.

So, measure operations of your interest, on platforms of your interest,
for roughly the kinds of list/array sizes you'll be using, and you'll KNOW
what performance issues you may be facing, rather than guessing.

In most cases you'll conclude that the difference is not important enough --
a factor of 1.5 or more may seem large, but here we're just doing a trivial
operation on each item -- if we were doing more the looping overhead
would matter less. AND some operations are available as ufuncs in
Numeric, cutting down loop overhead dramatically. And in the end a
100 to 200 microseconds' difference may just not matter much, depending
on your application. But anyway, you do get the ability to measure just
what you need to.


Alex
 
M

Michael Ressler

I don't know, I've never measured. Let's find out together.

The real question is - why do you want to run a loop over an array?
The whole point of Numeric is that you want to eliminate loops
entirely. Keeping things in the array domain is infinitely faster than
running explicit loops (this is the whole point of Numeric). You may
need to come up with some clever expressions to do it, but most loops
can be gotten rid of with clever uses of put(), take(), and the like.

Loops are evil.

Mike
 
M

Michael Ressler

I don't know, I've never measured. Let's find out together.

The real question is - why do you want to run a loop over an array?
The whole point of Numeric is that you want to eliminate loops
entirely. Keeping things in the array domain is infinitely faster than
running explicit loops. You may need to come up with some clever
expressions to do it, but most loops can be gotten rid of with clever
uses of put(), take(), and the like.

Loops are evil.

Mike
 
2

2mc

Michael Ressler said:
The real question is - why do you want to run a loop over an array?
The whole point of Numeric is that you want to eliminate loops
entirely. Keeping things in the array domain is infinitely faster than
running explicit loops. You may need to come up with some clever
expressions to do it, but most loops can be gotten rid of with clever
uses of put(), take(), and the like.

Loops are evil.

Mike

For me, the key thought in your post is " you may need to come up with
some clever expressions to do it, but most loops can be gotten rid of
with clever uses of put(), take(), and the like."

This is what I'm looking for. I'm so used to explicitly declaring
loops that it is hard for me to "speak" in Numerical Python.

Suppose I have 2 very large arrays of serial data. To make it simple,
let's assume each has 10s of thousands of rows with one column/field
of data. Further assume at some point later in the program I am going
to compare the data in the two arrays - the comparison being on chunks
of 25 rows throughout the array. But, before I do that, I have to
"normalize" the data in both arrays in order to make the comparisons
valid.

Assume the way I make the comparisons is to find the size of the range
between the highest and the lowest value in each 25 row 'chunk' and
normalize each data point as: (datapoint - lowestvalue) /
(highestvalue - lowestvalue) * 100.

Then assume I want to find the slope of the linear regression through
each 25 row 'chunk.' It is this slope that I will ultimately be
comparing later in the program.

This is the kind of programming I was hoping I could use Numerical
Python for. It is the syntax of such a program that I'm grappling
with. If someone could help me with the above scenario, then I could
write the program using the real comparisons I want (which are
considerably more complicated than above).

Thank you for your kind response. If you have any comments on the
above I would appreciate hearing them.

Matt
 
T

Tim Hochberg

2mc said:
For me, the key thought in your post is " you may need to come up with
some clever expressions to do it, but most loops can be gotten rid of
with clever uses of put(), take(), and the like."

This is what I'm looking for. I'm so used to explicitly declaring
loops that it is hard for me to "speak" in Numerical Python.

Suppose I have 2 very large arrays of serial data. To make it simple,
let's assume each has 10s of thousands of rows with one column/field
of data. Further assume at some point later in the program I am going
to compare the data in the two arrays - the comparison being on chunks
of 25 rows throughout the array. But, before I do that, I have to
"normalize" the data in both arrays in order to make the comparisons
valid.

Assume the way I make the comparisons is to find the size of the range
between the highest and the lowest value in each 25 row 'chunk' and
normalize each data point as: (datapoint - lowestvalue) /
(highestvalue - lowestvalue) * 100.

Something like:

import Numeric as np # Personal preference
chunked = np.reshape(data, (-1, 25)) # Chunked is a n/25 x 25 array
chunk_max = np.maximum.reduce(chunked, 1) # Reduce along axis 1
chunk_min = np.minimum.reduce(chunked, 1)
# NewAxis changes the relevant arrays from shape [n/25] to shape
# [n/25,1]. 1 will get broadcast.
normalized = ((chunked - chunk_min[:,np.NewAxis]) /
(chunk_max - chunk_min)[:,np.NewAxis]* 100)
Then assume I want to find the slope of the linear regression through
each 25 row 'chunk.' It is this slope that I will ultimately be
comparing later in the program.

For this you might want to use the LinearAlgebra module (it comes with
Numeric). I'm not as familiar with the interface for this though, so you
'll have to check the docs or hope someone else can help you.


Hope that gets you started.

-tim
 
2

2mc

Tim Hochberg said:
Suppose I have 2 very large arrays of serial data. To make it simple,
let's assume each has 10s of thousands of rows with one column/field
of data. Further assume at some point later in the program I am going
to compare the data in the two arrays - the comparison being on chunks
of 25 rows throughout the array. But, before I do that, I have to
"normalize" the data in both arrays in order to make the comparisons
valid.

Assume the way I make the comparisons is to find the size of the range
between the highest and the lowest value in each 25 row 'chunk' and
normalize each data point as: (datapoint - lowestvalue) /
(highestvalue - lowestvalue) * 100.

Something like:

import Numeric as np # Personal preference
chunked = np.reshape(data, (-1, 25)) # Chunked is a n/25 x 25 array
chunk_max = np.maximum.reduce(chunked, 1) # Reduce along axis 1
chunk_min = np.minimum.reduce(chunked, 1)
# NewAxis changes the relevant arrays from shape [n/25] to shape
# [n/25,1]. 1 will get broadcast.
normalized = ((chunked - chunk_min[:,np.NewAxis]) /
(chunk_max - chunk_min)[:,np.NewAxis]* 100)
Then assume I want to find the slope of the linear regression through
each 25 row 'chunk.' It is this slope that I will ultimately be
comparing later in the program.

For this you might want to use the LinearAlgebra module (it comes with
Numeric). I'm not as familiar with the interface for this though, so you
'll have to check the docs or hope someone else can help you.


Hope that gets you started.

-tim

Thanks a million. I appreciate your kind and thoughtful response. I
have found members of this board to be very prompt with help and very
courteous. I hope that when I'm a little more savvy with the language
that I may return the favor by posting help for someone else.

May I ask you for a little more help. The example you gave was very
good and it was something I hadn't thought ot. However, I need the 25
row "window" to move through the entire array one row at a time. In
other words each 25 row 'chunk' of data will contain 24 rows of the
previous 'chunk'. Unless I misunderstood your code, each 'chunk' has
a unique set of rows - there is no overlapping.

Do you have any ideas how I could do this without loops?

Matt
 
M

Michael Ressler

May I ask you for a little more help. The example you gave was very
good and it was something I hadn't thought ot. However, I need the 25
row "window" to move through the entire array one row at a time. In
other words each 25 row 'chunk' of data will contain 24 rows of the
previous 'chunk'. Unless I misunderstood your code, each 'chunk' has
a unique set of rows - there is no overlapping.

Do you have any ideas how I could do this without loops?

Okay, maybe you can't get rid of all loops as I implied in my previous
"loops are evil" post, but the trick is to minimize them.

This is "pseudo-code", so don't try to run it, but see if the ideas
are useful. One way to approach "running" things, is extensive use of
subarrays. Suppose you want to do a running average on an n-element
run of a large array a with m elements (let's ignore the endpoints for
now). This isn't the best way to do a running average, but it might
help with things more complex than an average.

m=len(a)
avg=zeros(len(a))
for i in range(n) : # window size to smooth over
avg=avg+a[i:m-n+i]
avg=avg/n

I think I might have screwed up the syntax of the subarray statement
(I'm still much better with the commercial language IDL than I am with
Numeric, and I get them confused, but the thought process is the
same). The idea is to pile up subarrays which have been shifted by
one. Instead of a zillion loops through the array, you only have to
deal with n (in your case 25) cycles.

This is what I meant by "clever expressions" in my first response.
Hope this stimulates more ideas.

Mike
 
T

Tim Hochberg

Michael said:
May I ask you for a little more help. The example you gave was very
good and it was something I hadn't thought ot. However, I need the 25
row "window" to move through the entire array one row at a time. In
other words each 25 row 'chunk' of data will contain 24 rows of the
previous 'chunk'. Unless I misunderstood your code, each 'chunk' has
a unique set of rows - there is no overlapping.

Do you have any ideas how I could do this without loops?


Okay, maybe you can't get rid of all loops as I implied in my previous
"loops are evil" post, but the trick is to minimize them.

This is "pseudo-code", so don't try to run it, but see if the ideas
are useful. One way to approach "running" things, is extensive use of
subarrays. Suppose you want to do a running average on an n-element
run of a large array a with m elements (let's ignore the endpoints for
now). This isn't the best way to do a running average, but it might
help with things more complex than an average.

m=len(a)
avg=zeros(len(a))
for i in range(n) : # window size to smooth over
avg=avg+a[i:m-n+i]
avg=avg/n

I agree, you probably can't get rid of all the loops in this case, and
if you could, the resulting code would probably be horrible. I have a
couple of minor quibles with the above code though. I think I'd write it as:

lenavg = len(a) - n + 1
avg =np.zeros(lenavg, np.Float)
for i in range(n) : # window size to smooth over
avg += a[i:lenavg+i] # Using += reuses the same array every time
# Instead of creating a new one each time
# Through the loop.
avg /= n # Same here.

The important point being the use of += and /=. And, in order to make
that work, you need to set the type of avg appropriately, not let it
default to int.

-tim
 
M

Michael Ressler

How I defend myself :)
I agree, you probably can't get rid of all the loops in this case, and
if you could, the resulting code would probably be horrible. I have a
couple of minor quibles with the above code though. I think I'd write it as:

lenavg = len(a) - n + 1
avg =np.zeros(lenavg, np.Float)
for i in range(n) : # window size to smooth over
avg += a[i:lenavg+i] # Using += reuses the same array every time
# Instead of creating a new one each time
# Through the loop.
avg /= n # Same here.

The important point being the use of += and /=. And, in order to make
that work, you need to set the type of avg appropriately, not let it
default to int.

You are right, of course, but I was trying to be clear, not fast. The
point is that you could replace that sum with the sum of the squares
and crossproducts, and thus get a running standard deviation, for
example. The trick of Numeric is thinking about problems in new ways
which are far different from what the original author is used to with
loops.

Another example of thinking things differently is suppose you have a
vector where the values are randomly positive or negative. Suppose for
reasons known only to you, you want to replace the negative values
with the sqrt of their absolute values. With Numeric, no loops are
involved.

from Numeric import *
a=array([1.,2.,-3.,4.,-5.,6.,-7.,-8.,9.]) # make up an array
idx=nonzero(a<0) # indexes of the negative values
sqrs=sqrt(abs(take(a,idx))) # get the sqrts of neg elements
put(a,idx,sqrs) # put them back into a
print a # works!

You can make the whole thing a one-liner if you want to get carried
away with it. It's too bad "nonzero" isn't called "whereis" or
something like that - it would make the idx= line more obvious.

Mike
 
J

John J. Lee

Michael Ressler said:
You are right, of course, but I was trying to be clear, not fast. The
point is that you could replace that sum with the sum of the squares
and crossproducts, and thus get a running standard deviation, for
example. The trick of Numeric is thinking about problems in new ways
which are far different from what the original author is used to with
loops.
[...]

Perhaps part of the trick is to know when to leave Numeric behind and
use Pyrex.


John
 
W

Western Larch

[...]
from Numeric import *
a=array([1.,2.,-3.,4.,-5.,6.,-7.,-8.,9.]) # make up an array
idx=nonzero(a<0) # indexes of the negative values
sqrs=sqrt(abs(take(a,idx))) # get the sqrts of neg elements
put(a,idx,sqrs) # put them back into a
print a # works!

You can make the whole thing a one-liner if you want to get carried
away with it. It's too bad "nonzero" isn't called "whereis" or
something like that - it would make the idx= line more obvious.

In the Octave programming language (Matlab-like), the operation
is called "find" -- find(a<0) returns the indices of the negative
elements; a(find(a<0)) returns the negative elements themselves.

With all due respect, Octave rocks pretty hard, and although I
love Python I've only fiddled around with Numeric Python a little
bit -- I haven't found any reason to prefer it to Octave.

FWIW
L.O.
 
S

Scott Ransom

[...]
from Numeric import *
a=array([1.,2.,-3.,4.,-5.,6.,-7.,-8.,9.]) # make up an array
idx=nonzero(a<0) # indexes of the negative values
sqrs=sqrt(abs(take(a,idx))) # get the sqrts of neg elements
put(a,idx,sqrs) # put them back into a
print a # works!

You can make the whole thing a one-liner if you want to get carried
away with it. It's too bad "nonzero" isn't called "whereis" or
something like that - it would make the idx= line more obvious.

It's actually quite easy (and much more readable IMHO) as a one liner:
from Numeric import *
from umath import fabs, sqrt
a = array([1.,2.,-3.,4.,-5.,6.,-7.,-8.,9.])
a = where(a<0, sqrt(fabs(a)), a)

That's it!
In the Octave programming language (Matlab-like), the operation
is called "find" -- find(a<0) returns the indices of the negative
elements; a(find(a<0)) returns the negative elements themselves.

A similar operation is possible in Numeric:

compress(a<0, a) returns the elements of a that have a<0.
With all due respect, Octave rocks pretty hard, and although I
love Python I've only fiddled around with Numeric Python a little
bit -- I haven't found any reason to prefer it to Octave.

Numeric is wonderful as well. And one of the biggest benefits is that
you get all of the power of Python (and its _huge_ standard and
contributed libraries) as a bonus.

Scott
 
F

Fernando Perez

John said:
Perhaps part of the trick is to know when to leave Numeric behind and
use Pyrex.

Do you have a successful example of pyrex manipulating data which is in a
Numeric array? Last time I tried (a while back), the performance was
catastrophic (meaning, no better than python itself for explicit loops).
These days I either use weave.inline() or hand-written extensions. In both
cases I use Blitz++ for the arrays, so the C++ code retains much of the flavor
of the originaly python.

But I'd really like to know if pyrex has caught up to handling Numeric arrays
efficiently (including complex ones).

Thanks in advance,

f
 
2

2mc

Michael Ressler said:
Another example of thinking things differently is suppose you have a
vector where the values are randomly positive or negative. Suppose for
reasons known only to you, you want to replace the negative values
with the sqrt of their absolute values. With Numeric, no loops are
involved.

from Numeric import *
a=array([1.,2.,-3.,4.,-5.,6.,-7.,-8.,9.]) # make up an array
idx=nonzero(a<0) # indexes of the negative values
sqrs=sqrt(abs(take(a,idx))) # get the sqrts of neg elements
put(a,idx,sqrs) # put them back into a
print a # works!

You can make the whole thing a one-liner if you want to get carried
away with it. It's too bad "nonzero" isn't called "whereis" or
something like that - it would make the idx= line more obvious.

Mike

I think I'm finally getting a handle on this. So, my thanks to
everyone who has so graciously helped me out with their suggestions.

How would you handle the above if "a" were a 2d array since "nonzero"
only works on 1d arrays? Could you have used the "nonzero" function
on a "vertical" slice of the array (from the perspective of an array
of rows and columns - a vertical slice being the data in the column)?

I mostly deal with 2d arrays, but I have a few 3d arrays. So, I'm
curious how you would handle your example above with a multdimensional
array.

Thanks. And, thanks again to all.

Matt
 
M

Michael Ressler

How would you handle the above if "a" were a 2d array since "nonzero"
only works on 1d arrays? Could you have used the "nonzero" function
on a "vertical" slice of the array (from the perspective of an array
of rows and columns - a vertical slice being the data in the column)?

I don't have a lot of experience with this yet, but every array has a
attribute called flat (e.g. a.flat) which is a 1-D representation of
the array. So if a is a 2-D (or 3-D) array, you could do something
like:

idx=nonzero(a.flat)
put(a.flat, values, idx)
print a

where a now has the appropriate values placed in their proper 2-D
positions.

As a side note, the numarray package (intended to be a Numeric
replacement) will provide better syntax for dealing with put and take,
maybe even handle 2-D issues like this transparently, but it's not
quite ready for prime time yet.

Mike
 
M

Mark Jackson

Michael Ressler said:
Another example of thinking things differently is suppose you have a
vector where the values are randomly positive or negative. Suppose for
reasons known only to you, you want to replace the negative values
with the sqrt of their absolute values. With Numeric, no loops are
involved.

from Numeric import *
a=array([1.,2.,-3.,4.,-5.,6.,-7.,-8.,9.]) # make up an array
idx=nonzero(a<0) # indexes of the negative values
sqrs=sqrt(abs(take(a,idx))) # get the sqrts of neg elements
put(a,idx,sqrs) # put them back into a
print a # works!

You can make the whole thing a one-liner if you want to get carried
away with it. It's too bad "nonzero" isn't called "whereis" or
something like that - it would make the idx= line more obvious.

Mike

I think I'm finally getting a handle on this. So, my thanks to
everyone who has so graciously helped me out with their suggestions.

How would you handle the above if "a" were a 2d array since "nonzero"
only works on 1d arrays? Could you have used the "nonzero" function
on a "vertical" slice of the array (from the perspective of an array
of rows and columns - a vertical slice being the data in the column)?

I'm very new at this myself (currently porting some Fortran code to
Numeric) but I believe that Numeric.putmask is your friend here:
a=Numeric.array([i*(-1)**i for i in range(20)],Numeric.Float)
b=a.resize((4,5))
b
array([[ 0., -1., 2., -3., 4.],
[ -5., 6., -7., 8., -9.],
[ 10., -11., 12., -13., 14.],
[-15., 16., -17., 18., -19.]])array([[0, 1, 0, 1, 0],
[1, 0, 1, 0, 1],
[0, 1, 0, 1, 0],
[1, 0, 1, 0, 1]])array([[ 0. , 1. , 2. , 1.73205081, 4. ],
[ 2.23606798, 6. , 2.64575131, 8. , 3. ],
[ 10. , 3.31662479, 12. , 3.60555128, 14. ],
[ 3.87298335, 16. , 4.12310563, 18. , 4.35889894]])
 
S

Scott Ransom

Michael Ressler said:
Another example of thinking things differently is suppose you have a
vector where the values are randomly positive or negative. Suppose for
reasons known only to you, you want to replace the negative values
with the sqrt of their absolute values. With Numeric, no loops are
involved.

from Numeric import *
a=array([1.,2.,-3.,4.,-5.,6.,-7.,-8.,9.]) # make up an array
idx=nonzero(a<0) # indexes of the negative values
sqrs=sqrt(abs(take(a,idx))) # get the sqrts of neg elements
put(a,idx,sqrs) # put them back into a
print a # works!

You can make the whole thing a one-liner if you want to get carried
away with it. It's too bad "nonzero" isn't called "whereis" or
something like that - it would make the idx= line more obvious.

Mike

I think I'm finally getting a handle on this. So, my thanks to
everyone who has so graciously helped me out with their suggestions.

How would you handle the above if "a" were a 2d array since "nonzero"
only works on 1d arrays? Could you have used the "nonzero" function
on a "vertical" slice of the array (from the perspective of an array
of rows and columns - a vertical slice being the data in the column)?

I'm very new at this myself (currently porting some Fortran code to
Numeric) but I believe that Numeric.putmask is your friend here:
a=Numeric.array([i*(-1)**i for i in range(20)],Numeric.Float)
b=a.resize((4,5))
b
array([[ 0., -1., 2., -3., 4.],
[ -5., 6., -7., 8., -9.],
[ 10., -11., 12., -13., 14.],
[-15., 16., -17., 18., -19.]])array([[0, 1, 0, 1, 0],
[1, 0, 1, 0, 1],
[0, 1, 0, 1, 0],
[1, 0, 1, 0, 1]])array([[ 0. , 1. , 2. , 1.73205081, 4. ],
[ 2.23606798, 6. , 2.64575131, 8. , 3. ],
[ 10. , 3.31662479, 12. , 3.60555128, 14. ],
[ 3.87298335, 16. , 4.12310563, 18. , 4.35889894]])

Once again, this can be done in a single (easy-to-read) line using:

b = where(b<0, sqrt(fabs(b)), b)

where does all the masking and putmasking for you.

Scott
 
2

2mc

To all who have helped,

I am finding out all kinds of ways to do things. It's exciting.
Thanks to all who have replied.

I'm still having trouble with one thing. Let me set a scenario and
see if anyone has any ideas.

Assume a multidimensional array (2d). This would be like a
spreadsheet of rows and columns. Further, assume many 'rows' and 3
columns. Suppose I want a running list of the highest value for 20
'rows'. So, starting at 'row' 19, the answer would be the highest
value from 'row' 0 to 'row' 19. Then, at 'row' 20, the answer would
be the highest value from 'row' 1 to 'row' 20. And, so on. Further,
suppose I want this value for each 'column'. The result would be a 3
'column' array with 19 less rows than the source array containing the
running list of highest values in the last 20.

How would this be done without loops?

Thanks a million.

Matt
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
474,169
Messages
2,570,915
Members
47,456
Latest member
JavierWalp

Latest Threads

Top