Numpy Performance

timlash

Still fairly new to Python. I wrote a program that used a class
called RectangularArray as described here:

class RectangularArray:
    def __init__(self, rows, cols, value=0):
        self.arr = [None] * rows      # one slot per row; rows are allocated lazily
        self.row = [value] * cols     # shared template row of default values
    def __getitem__(self, (i, j)):    # Python 2 tuple-parameter syntax
        return (self.arr[i] or self.row)[j]
    def __setitem__(self, (i, j), value):
        if self.arr[i] is None: self.arr[i] = self.row[:]  # copy template on first write
        self.arr[i][j] = value

This class was found in a 14-year-old post:
http://www.python.org/search/hypermail/python-recent/0106.html
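
For example, here's how I use it (a quick sketch with made-up
dimensions; Python 2 syntax, like the class itself):

grid = RectangularArray(1000, 50)   # 1000 rows x 50 cols, default value 0
grid[3, 7] = 42                     # first write to row 3 copies the template row
print grid[3, 7]                    # -> 42
print grid[999, 0]                  # -> 0; row 999 was never allocated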

This worked great and let me process a few hundred thousand data
points with relative ease. Soon, though, I wanted to sort arbitrary
portions of my arrays and to transpose others. Rather than reinvent
the wheel with custom methods inside the otherwise serviceable
RectangularArray class, I turned to Numpy. Once I refactored, however,
I was surprised to find that my program's execution time doubled! I
expected a purpose-built array module to be more efficient, not less.
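
The kind of operations I mean, in numpy terms (an illustrative sketch,
not my real code):

import numpy as np

a = np.zeros((1000, 50))               # stand-in for my real data
a[10:20] = np.sort(a[10:20], axis=1)   # sort only rows 10-19 along their columns
b = a.T                                # transpose is just a view, no copy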

I'm not doing any linear algebra with my data. I'm working with
rectangular datasets, evaluating individual rows, grouping, sorting
and summarizing various subsets of rows.

Is a Numpy implementation overkill for my data handling uses? Should
I evaluate prior array modules such as Numeric or Numarray? Are there
any other modules suited to handling tabular data? Would I be best
off expanding the RectangularArray class for the few data
transformation methods I need?

Any guidance or suggestions would be greatly appreciated!

Cheers,

Tim
 

Peter Otten

timlash said:

class RectangularArray:
    def __init__(self, rows, cols, value=0):
        self.arr = [None] * rows
        self.row = [value] * cols
    def __getitem__(self, (i, j)):
        return (self.arr[i] or self.row)[j]
    def __setitem__(self, (i, j), value):
        if self.arr[i] is None: self.arr[i] = self.row[:]
        self.arr[i][j] = value

This worked great and let me process a few hundred thousand data
points with relative ease. [...] Once I refactored, however, I was
surprised to find that my program's execution time doubled! I expected
a purpose-built array module to be more efficient, not less.


Do you have many rows with zeros? That might be the reason your
self-made approach shows better performance: it never allocates a row
until something is written to it.

Googling for "numpy sparse" finds:

http://www.scipy.org/SciPy_Tutorial

Maybe one of the sparse matrix implementations in scipy works for you.
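
For example, a dictionary-of-keys matrix stores only the entries you
actually assign (an untested sketch; adjust the shape to your data):

from scipy.sparse import dok_matrix

m = dok_matrix((100000, 50))   # rows x cols, nothing stored yet
m[12345, 3] = 1.5              # only explicitly set entries take memory
print m[0, 0]                  # unset entries read back as 0.0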

Peter
 

Robert Kern

timlash said:

This worked great and let me process a few hundred thousand data
points with relative ease. [...] Once I refactored, however, I was
surprised to find that my program's execution time doubled! I expected
a purpose-built array module to be more efficient, not less.


It depends on how much you refactored your code. numpy tries to
optimize bulk operations. If you are doing a lot of __getitem__s and
__setitem__s on individual elements, as you would with
RectangularArray, numpy is going to do a lot of extra work creating
and deleting scalar objects.
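
Schematically, the difference looks like this (an illustrative sketch,
not a benchmark):

import numpy as np

a = np.arange(1000000.0).reshape(1000, 1000)

# Element-wise: every a[i, j] creates and destroys a Python scalar object
total = 0.0
for i in xrange(a.shape[0]):
    for j in xrange(a.shape[1]):
        total += a[i, j]

# Bulk: one call, and the loop runs in C
total = a.sum()
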
timlash said:

I'm not doing any linear algebra with my data. I'm working with
rectangular datasets, evaluating individual rows, grouping, sorting
and summarizing various subsets of rows.

Is a Numpy implementation overkill for my data handling uses? Should
I evaluate prior array modules such as Numeric or Numarray?

No.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 

timlash

Thanks for your replies.

@Peter - My arrays are not sparse at all, but I'll take a quick look
at scipy. I should also have mentioned that my numpy arrays are of
object type, as each data point (row) has one or more text labels for
categorization.

@Robert - Thanks for the comments about how numpy is optimized for
bulk operations. Most of the processing I'm doing is with individual
elements.

Essentially, I'm testing tens of thousands of scenarios on a
relatively small number of test cases. Each scenario requires all
elements of each test case to be scored, then summarized, then sorted
and grouped with some top scores captured for reporting.

It seems I can either move to indexed categorization, so that my
arrays are of integer type and each scenario can be handled in bulk
numpy fashion, or expand RectangularArray with the custom data
handling methods I need. Roughly what I have in mind for the first
option is sketched below.
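
A rough sketch (names and data made up):

import numpy as np

labels = ['red', 'blue', 'red', 'green']              # per-row text labels
cats, codes = np.unique(labels, return_inverse=True)  # integer code per row

scores = np.random.rand(len(labels))   # stand-in for one scenario's scores
order = np.argsort(scores)[::-1]       # rank rows, best score first
for i in order[:2]:                    # capture top scores for reporting
    print cats[codes[i]], scores[i]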

Any other recommended approaches to working with tabular data in
Python?

Cheers,

Tim
 

Robert Kern

timlash said:

Essentially, I'm testing tens of thousands of scenarios on a
relatively small number of test cases. Each scenario requires all
elements of each test case to be scored, then summarized, then sorted
and grouped with some top scores captured for reporting. [...]

If you post a small, self-contained example of what you are doing to
numpy-discussion, the denizens there will probably be able to help you
formulate the right way to do it in numpy, if such a way exists.

http://www.scipy.org/Mailing_Lists

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 
