Which sparse matrix package?

Martin Manns

Hi:

I am writing a spreadsheet application in Python

http://pyspread.sf.net

which currently uses numpy.array for:

+ storing object references
(each array element corresponds to one grid cell)
+ slicing (read and write)
+ mapping from/to smaller numpy.array
+ searching and replacing
+ growing and shrinking
+ transpose operation.

While fast and stable, this is really memory inefficient for large grids
(i.e. larger than 1E7 rows and columns), so I am looking into dicts
and sparse matrices.
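
For reference, the current storage is roughly a numpy object array used like
this (sizes made up, just to show the idea):

import numpy

# One Python object reference per grid cell: rows x columns x tables.
grid = numpy.empty((1000, 100, 3), dtype=object)

grid[0, 0, 0] = "2323"              # storing an object reference
block = grid[:10, :5, 0]            # slicing (read); writing works the same way
flipped = grid[..., 0].transpose()  # transpose operation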

The dict that I tried out is of the type:

{(1,2,3): "2323", (1,2,545): "2324234", ... }

It is too slow for my application when it grows. One slicing operation
with list comprehensions takes about 1/2 s on my computer for 1E6
elements.
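
A simplified sketch of one such slicing operation (with the slice origin fixed
at zero for brevity):

def slice_cells(grid, row_stop, col_stop, tab_stop):
    # one pass over every stored cell -- this is what takes ~0.5 s for 1E6 cells
    return dict((key, value) for key, value in grid.items()
                if key[0] < row_stop and key[1] < col_stop and key[2] < tab_stop)

grid = {(1, 2, 3): "2323", (1, 2, 545): "2324234"}
subset = slice_cells(grid, 10000, 100, 10)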

Therefore, I looked into sparse matrices and found scipy.sparse and
pysparse. I tried out the list-of-lists matrix objects of both. (I wrote a
wrapper that turns them into Python object arrays.)

scipy.sparse.lil_matrix allowed __getitem__ slicing only along one of the
dimensions and used a lot of memory when the number of columns was
increased above 1E7.

pysparse.spmatrix.ll_mat was faster, used less memory, and allowed slicing
along both dimensions. However, its methods are not well documented, and
I am not able to compile it on Debian testing due to some g77
dependencies. Even though the deb package works well, I am concerned
about depending on a problematic package.


Now my questions:

Is there a better suited / better maintained module for such sparse matrices
(or multi-dimensional arrays)?

Should I use another type of matrix in scipy.sparse? If so, which one?

Does a different data structure suit my above-stated needs better?

Best Regards

Martin
 

James Mills

Hi,

Martin said:
I am writing a spreadsheet application in Python

What's wrong with pyspread ?

[ ... snip ... ]

Martin said:
The dict that I tried out is of the type:

{(1,2,3): "2323", (1,2,545): "2324234", ... }

It is too slow for my application when it grows. One slicing operation
with list comprehensions takes about 1/2 s on my computer for 1E6
elements.

Let me get this straight.
It's taking 0.5s to slice your matrix
of 1E7 (10000000.0 rows/columns)

Are you mad ? This is TEN Millions and you
required it faster than 0.5s ?

Am I missing something here ?

cheers
James
 

Robert Kern

James said:
What's wrong with pyspread ?

pyspread *is* the spreadsheet application he is writing.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 

MRAB

James said:
Hi,

Martin said:
I am writing a spreadsheet application in Python

What's wrong with pyspread ?

[ ... snip ... ]

Martin said:
The dict that I tried out is of the type:

{(1,2,3): "2323", (1,2,545): "2324234", ... }

It is too slow for my application when it grows. One slicing operation
with list comprehensions takes about 1/2 s on my computer for 1E6
elements.

Let me get this straight.
It's taking 0.5s to slice your matrix
of 1E7 (10000000.0 rows/columns)

Are you mad ? This is TEN Millions and you
required it faster than 0.5s ?

No, 1E6, or ONLY 1 million. :)

Am I missing something here ?

Would a matrix of matrices work better, i.e. a supermatrix where each
element is either a submatrix of cells or None if the submatrix is
empty? How much memory it would save would depend on how sparse the
matrix is.
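
A rough, untested sketch of what I mean (the block size of 100 is arbitrary):

BLOCK = 100

class BlockGrid(object):
    def __init__(self):
        self.blocks = {}                    # (block_row, block_col) -> 2D list, or absent

    def __setitem__(self, key, value):
        row, col = key
        block_key = (row // BLOCK, col // BLOCK)
        block = self.blocks.get(block_key)
        if block is None:                   # create the submatrix lazily
            block = [[None] * BLOCK for _ in range(BLOCK)]
            self.blocks[block_key] = block
        block[row % BLOCK][col % BLOCK] = value

    def __getitem__(self, key):
        row, col = key
        block = self.blocks.get((row // BLOCK, col // BLOCK))
        if block is None:                   # submatrix never written -> empty cell
            return None
        return block[row % BLOCK][col % BLOCK]

grid = BlockGrid()
grid[123456, 7] = "2323"                    # allocates just one 100x100 block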
 

eric

What's wrong with pyspread ?
[ ... snip ... ]
The dict that I tried out is of the type:
{(1,2,3): "2323", (1,2,545): "2324234", ... }
It is too slow for my application when it grows. One slicing operation
with list comprehensions takes about 1/2 s on my computer for 1E6
elements.
Let me get this straight.
It's taking 0.5s to slice your matrix
of 1E7 (10000000.0 rows/columns)
Are you mad ? This is TEN Millions and you
required it faster than 0.5s ?

No, 1E6, or ONLY 1 million. :)

what do Internet Explorer 6 & 7 have to do with that ? ;-)
Would a matrix of matrices work better, ie a supermatrix where each
element is either a submatrix of cells or None if the submatrix is
empty? How much memory it would save would depend on how sparse the
matrix is.

Sparse matrix efficiency is all about guessing the probability for a
cell to be empty. My guess is that the probability for a cell (i,j) to
be empty grows with norm(i,j).

Then you should store your data in one list and the indexes in another
"list", and use a B-tree-like algorithm, or keep your data ordered by
growing norm(i,j).

something like:

data = []    # cell values
kernel = []  # matching (i, j) couples, kept in the same order as data

def add(i, j, value):
    data.append(value)       # or insert at the "right place" to keep it sorted
    kernel.append((i, j))    # same comment

def get(i, j):
    k = kernel_of(i, j)      # index of (i, j) in kernel, e.g. found by dichotomy
    return data[k]

def set(i, j, value):
    if is_probably_empty(i, j):
        add(i, j, value)     # the cell was never written: just add it
    else:
        k = kernel_of(i, j)  # the cell may exist: update it, or add on a false positive
        if k is not None:
            data[k] = value
        else:
            add(i, j, value)

where the kernel_of method can take advantage of the sorted order
(i*j or max(i,j) should be just as good a norm) to do a fast dichotomy.

The set method is about finding out whether the cell is empty or not, and
then either performing an index lookup or an add.
To know efficiently whether a cell is empty or not (the is_probably_empty
method), try a Bloom filter, using
i*j % m as the hash function (for instance).
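
For instance, a minimal single-hash sketch of that Bloom filter (the names and
the size m below are made up):

m = 1000003                       # size of the bit array (a prime helps a little)
bits = bytearray(m)               # one byte per "bit", for simplicity

def _bloom_hash(i, j):
    return (i * j) % m            # the i*j % m hash from above; real filters use several hashes

def mark_filled(i, j):
    bits[_bloom_hash(i, j)] = 1   # call this from add(): cell (i, j) has been written

def is_probably_empty(i, j):
    # 0 -> definitely never written; 1 -> probably written (may be a false positive)
    return bits[_bloom_hash(i, j)] == 0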

Here the arrays have the size of your actual data (maybe thousands of cells
for a huge spreadsheet), which is MUCH better than the size of
maxN*maxM.

If my memory is correct, there is room in numpy for plugging in your own
sparse matrix implementation, so you could try it with an implementation of
your own.
 

Robert Kern

Martin said:
Should I use another type of matrix in scipy.sparse? If yes which?

If you have a benchmark, you might just want to try all of them. Should be just
a matter of a small script. Block Sparse Row (bsr_matrix) might be the most
appropriate in terms of data structure, but it appears that a bunch of stuff
important for your use case is unimplemented.
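
Something along these lines, perhaps (the format list is off the top of my
head; adapt the sizes and the slice to what your wrapper really does):

import random
import time
import scipy.sparse

n, nnz = 10000, 100000
rows = [random.randrange(n) for _ in range(nnz)]
cols = [random.randrange(n) for _ in range(nnz)]
vals = [random.random() for _ in range(nnz)]
coo = scipy.sparse.coo_matrix((vals, (rows, cols)), shape=(n, n))

for fmt in ("lil", "dok", "csr", "csc", "bsr"):
    m = coo.asformat(fmt)
    start = time.time()
    try:
        for _ in range(100):
            m[100:200, :]               # row slicing, which pyspread does a lot
        print(fmt, (time.time() - start) / 100, "s per row slice")
    except Exception as exc:            # some formats just don't implement this
        print(fmt, "slicing failed:", exc)
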
Does a different data structure suit my above-stated needs better?

Something similar to a quadtree might be more suitable. That should let you do
queries along both dimensions easily. You probably don't want exactly a
quadtree, though; it's good for scattered points, but probably not for largish
blocks of points on a precise grid, which you will probably see frequently in a
spreadsheet application.
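
Just to show the shape of the thing, here is a toy point-quadtree sketch (node
capacity picked arbitrarily; for a spreadsheet you would probably store blocks
of cells rather than individual points):

class CellQuadTree(object):
    """Toy point quadtree over the half-open region [x0, x1) x [y0, y1)."""

    def __init__(self, x0, y0, x1, y1, capacity=16):
        self.bounds = (x0, y0, x1, y1)
        self.capacity = capacity
        self.points = []        # [((x, y), value), ...]
        self.children = None    # four sub-quadrants once the node splits

    def insert(self, x, y, value):
        x0, y0, x1, y1 = self.bounds
        if not (x0 <= x < x1 and y0 <= y < y1):
            return False
        if self.children is None:
            # keep the point here unless the node is full and can still split
            if len(self.points) < self.capacity or (x1 - x0 <= 1 and y1 - y0 <= 1):
                self.points.append(((x, y), value))
                return True
            self._split()
        for child in self.children:
            if child.insert(x, y, value):
                return True
        return False

    def _split(self):
        x0, y0, x1, y1 = self.bounds
        mx, my = (x0 + x1) // 2, (y0 + y1) // 2
        self.children = [CellQuadTree(x0, y0, mx, my, self.capacity),
                         CellQuadTree(mx, y0, x1, my, self.capacity),
                         CellQuadTree(x0, my, mx, y1, self.capacity),
                         CellQuadTree(mx, my, x1, y1, self.capacity)]
        for (x, y), value in self.points:   # push the stored points down a level
            for child in self.children:
                if child.insert(x, y, value):
                    break
        self.points = []

    def query(self, qx0, qy0, qx1, qy1, out=None):
        """Return all ((x, y), value) with qx0 <= x < qx1 and qy0 <= y < qy1."""
        if out is None:
            out = []
        x0, y0, x1, y1 = self.bounds
        if qx1 <= x0 or qx0 >= x1 or qy1 <= y0 or qy0 >= y1:
            return out                      # the query box misses this node entirely
        for (x, y), value in self.points:
            if qx0 <= x < qx1 and qy0 <= y < qy1:
                out.append(((x, y), value))
        if self.children is not None:
            for child in self.children:
                child.query(qx0, qy0, qx1, qy1, out)
        return out

qt = CellQuadTree(0, 0, 10**7, 10**7)
qt.insert(3, 545, "2324234")
print(qt.query(0, 0, 10000, 1000))          # a rectangular slice in both dimensions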

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 

Guest

Let me get this straight.
It's taking 0.5s to slice your matrix
of 1E7 (10000000.0 rows/columns)

My benchmark is as follows:

1) Each of the numbers in the 3-tuple is in the range [0, 1E7).
2) There are 1 000 000 elements in the dict (randomly distributed).
3) The content string is a random number in the range [0, 1E10) that is
cast into a string.
4) Measure the time that retrieving all elements in a 10000x100x10 cube
requires.
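
In code, the benchmark looks roughly like this (the cube origin is arbitrary):

import random
import time

grid = {}
while len(grid) < 1000000:
    key = (random.randrange(10**7),
           random.randrange(10**7),
           random.randrange(10**7))
    grid[key] = str(random.randrange(10**10))

start = time.time()
cube = [value for (row, col, tab), value in grid.items()
        if row < 10000 and col < 100 and tab < 10]
print(time.time() - start, "s for one slice,", len(cube), "hits")
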
Are you mad ? This is TEN Millions and you
required it faster than 0.5s ?

Think about how often a spreadsheet re-calculates cells.
Furthermore, people tend to copy&paste cells.

I had a small example (a Wagner-Whitin algorithm demo)
with hundreds of slicing operations on one screen. I believe that
no one wants to wait 1/2 s times 500 == 250 s each time the
spreadsheet is updated.


Martin
 

Guest

If you have a benchmark, you might just want to try all of them.
Should be just a matter of a small script. Block Sparse Row
(bsr_matrix) might be the most appropriate in terms of data
structure, but it appears that a bunch of stuff important for your
use case is unimplemented.

This was my first impulse until I saw that the syntax and the actually
implemented methods vary considerably inside scipy.sparse (different
authors?).

Before I code around all of that, I would like to see which
functionality is really provided by which matrix class.

Do you know where to find an overview (some docs)? Currently, I stumble
upon unimplemented things in my unit tests and figure out what is going
on after reading the sparse code :-(


Martin
 

James Mills

Let me get this straight.
It's taking 0.5s to slice your matrix
of 1E7 (10000000.0 rows/columns)

My benchmark is as follows:

1) Each of the numbers in the 3-tuple is in the range [0, 1E7).
2) There are 1 000 000 elements in the dict (randomly distributed).
3) The content string is a random number in the range [0, 1E10) that is
cast into a string.
4) Measure the time that retrieving all elements in a 10000x100x10 cube
requires.

Does a spreadsheet really get this complex ?

Think about how often a spreadsheet re-calculates cells.
Furthermore, people tend to copy&paste cells.

I had some small example (Wagner-Whitin algorithm demo)
with hundreds of slicing operations on one screen. I believe that
no-one wants to wait for 1/2 s times 500 == 250 s each time the
spreadsheet is updated.

Hmmm
 
