Efficient python 2-d arrays?

J

Jake Biesinger

Hi all,

Using numpy, I can create large 2-dimensional arrays quite easily.
Unfortunately, my target audience may not have numpy so I'd prefer not to use it.

Similarly, a list-of-tuples using standard python syntax.
mylist = [(0,0) for i in xrange(100000000)

but this method uses way too much memory (>4GB for 100 million items, compared to 1.5GB for numpy method).

Since I want to keep the two elements together during a sort, I *can't* use array.array.
mylist = [array.array('i',xrange(100000000)), array.array('i',xrange(100000000))]

If I knew the size in advance, I could use ctypes arrays.
from ctypes import *
class myStruct(Structure):
_fields_ = [('x',c_int),('y',c_int)]
mylist_type = myStruct * 100000000
mylist = mylist_type()

but I don't know that size (and it can vary between 1 million-200 million), so preallocating doesn't seem to be an option.

Is there a python standard library way of creating *efficient* 2-dimensional lists/arrays, still allowing me to sort and append?

Thanks!
 
D

Dan Stromberg

Hi all,

Using numpy, I can create large 2-dimensional arrays quite easily.
Unfortunately, my target audience may not have numpy so I'd prefer not to use it.

Similarly, a list-of-tuples using standard python syntax.
mylist = [(0,0) for i in xrange(100000000)

but this method uses way too much memory (>4GB for 100 million items, compared to 1.5GB for numpy method).

Since I want to keep the two elements together during a sort, I *can't* use array.array.
mylist = [array.array('i',xrange(100000000)), array.array('i',xrange(100000000))]

If I knew the size in advance, I could use ctypes arrays.
from ctypes import *
class myStruct(Structure):
    _fields_ = [('x',c_int),('y',c_int)]
mylist_type = myStruct * 100000000
mylist = mylist_type()

but I don't know that size (and it can vary between 1 million-200 million), so preallocating doesn't seem to be an option.

Is there a python standard library way of creating *efficient* 2-dimensional lists/arrays, still allowing me to sort and append?

Thanks!

I recently had need of a two dimensional array also, but I only needed
small ones, so I just used a dictionary indexed by tuples.

If you need to sort a row as an aggregate type, it seems that you
probably either want to:
1) Use a list of lists, where the outer list is the easier one to sort
by - this is analogous to the array of pointers in C - sorting by the
outer would want to compare "elements" by looking at the 0th inner,
then 1st inner, etc. So if you can organize things this way, things
might be pretty easy.
2) Use a list of instances, where the instances (rows) might just include a list
3) Use array.array with a custom sort so that you can move multiple
things when a more common sort would move "one" (possibly an
aggregate).

If you want some sorting code to start from for #3, I have a variety
of sorts written in Python (pure python or cython, your option, each
automatically derived from the same m4 file for a given sort) at
http://stromberg.dnsalias.org/svn/sorts/compare/trunk/ - however, even
the cython timsort provided there doesn't perform quite as well as the
standard library's timsort.

HTH
 
O

OAN

Hi,

what about pytables? It's built for big data collections and it doesn't
clog up the memory.

Am 17.01.2011 23:54, schrieb Dan Stromberg:
Hi all,

Using numpy, I can create large 2-dimensional arrays quite easily.
import numpy
mylist = numpy.zeros((100000000,2), dtype=numpy.int32)
Unfortunately, my target audience may not have numpy so I'd prefer not to use it.

Similarly, a list-of-tuples using standard python syntax.
mylist = [(0,0) for i in xrange(100000000)
but this method uses way too much memory (>4GB for 100 million items, compared to 1.5GB for numpy method).

Since I want to keep the two elements together during a sort, I *can't* use array.array.
mylist = [array.array('i',xrange(100000000)), array.array('i',xrange(100000000))]
If I knew the size in advance, I could use ctypes arrays.
from ctypes import *
class myStruct(Structure):
_fields_ = [('x',c_int),('y',c_int)]
mylist_type = myStruct * 100000000
mylist = mylist_type()
but I don't know that size (and it can vary between 1 million-200 million), so preallocating doesn't seem to be an option.

Is there a python standard library way of creating *efficient* 2-dimensional lists/arrays, still allowing me to sort and append?

Thanks!
I recently had need of a two dimensional array also, but I only needed
small ones, so I just used a dictionary indexed by tuples.

If you need to sort a row as an aggregate type, it seems that you
probably either want to:
1) Use a list of lists, where the outer list is the easier one to sort
by - this is analogous to the array of pointers in C - sorting by the
outer would want to compare "elements" by looking at the 0th inner,
then 1st inner, etc. So if you can organize things this way, things
might be pretty easy.
2) Use a list of instances, where the instances (rows) might just include a list
3) Use array.array with a custom sort so that you can move multiple
things when a more common sort would move "one" (possibly an
aggregate).

If you want some sorting code to start from for #3, I have a variety
of sorts written in Python (pure python or cython, your option, each
automatically derived from the same m4 file for a given sort) at
http://stromberg.dnsalias.org/svn/sorts/compare/trunk/ - however, even
the cython timsort provided there doesn't perform quite as well as the
standard library's timsort.

HTH
 
M

Martin v. Loewis

Using numpy, I can create large 2-dimensional arrays quite easily.

IIUC (please confirm), you don't need a generic two-dimensional
array, but rather an Nx2 array, where N may be large (but the other
dimension will always have a magnitude of 2).
Since I want to keep the two elements together during a sort

I assume (please confirm) that you want to sort by one of the numbers,
and keep the other one together with the sort key.

I suggest that you implement your own sorting algorithm, or reuse
heapsort (actually, any of the sorting algorithms would do).

You then use array.array, and take the elements at indices 2k and 2k+1
together. Sorting would then only look at even indices, but any swapping
you do during the sorting would always swap two numbers with two other
numbers.

Regards,
Martin
 
C

Carl Banks

Is there a python standard library way of creating *efficient* 2-dimensional lists/arrays, still allowing me to sort and append?

Without using third party libraries, no not really. numpy has it
covered so there's not really a lot of demand for it. If your users
are loading 1.5 GB arrays into memory, it's probably not unreasonable
to expect them to have numpy installed.

Other than that you're probably stuck emulating 2D with the array
module.


Carl Banks
 
J

Jonathan Hartley

Hi all,

Using numpy, I can create large 2-dimensional arrays quite easily.

Unfortunately, my target audience may not have numpy so I'd prefer not to use it.

Similarly, a list-of-tuples using standard python syntax.
mylist = [(0,0) for i in xrange(100000000)

but this method uses way too much memory (>4GB for 100 million items, compared to 1.5GB for numpy method).

Since I want to keep the two elements together during a sort, I *can't* use array.array.
mylist = [array.array('i',xrange(100000000)), array.array('i',xrange(100000000))]

If I knew the size in advance, I could use ctypes arrays.
from ctypes import *
class myStruct(Structure):
    _fields_ = [('x',c_int),('y',c_int)]
mylist_type = myStruct * 100000000
mylist = mylist_type()

but I don't know that size (and it can vary between 1 million-200 million), so preallocating doesn't seem to be an option.

Is there a python standard library way of creating *efficient* 2-dimensional lists/arrays, still allowing me to sort and append?

Thanks!



Since you can't depend on your users installing the dependencies, is
it vital that your users run from source? You could bundle up your
application along with numpy and other dependencies using py2Exe or
similar. This also means you wouldn't have to require users to have
the right (or any) version of Python installed.
 
R

Robert Kern

I thought PyTables depends on NumPy?

It does.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,982
Messages
2,570,190
Members
46,736
Latest member
zacharyharris

Latest Threads

Top