numpy.memmap advice?

Lionel

Hello all,

On a previous thread (http://groups.google.com/group/comp.lang.python/
browse_thread/thread/64da35b811e8f69d/67fa3185798ddd12?
hl=en&lnk=gst&q=keene#67fa3185798ddd12) I was asking about reading in
binary data. Briefly, my data consists of complex numbers, 32-bit
floats for real and imaginary parts. The data is stored as 4 bytes
Real1, 4 bytes Imaginary1, 4 bytes Real2, 4 bytes Imaginary2, etc. in
row-major format. I needed to read the data in as two separate numpy
arrays, one for real values and one for imaginary values.

There were several very helpful performance tips offered, and one in
particular I've started looking into. The author suggested a
"numpy.memmap" object may be beneficial. It was suggested I use it as
follows:


from numpy import dtype, memmap, recarray

descriptor = dtype([("r", "<f4"), ("i", "<f4")])
data = memmap(filename, dtype=descriptor, mode='r').view(recarray)
print "First 100 real values:", data.r[:100]


I have two questions:
1) What is "recarray"?
2) The documentation for numpy.memmap claims that it is meant to be
used in situations where it is beneficial to load only segments of a
file into memory, not the whole thing. This is definitely something
I'd like to be able to do, as my files are frequently >1 GB. I don't
really see in the documentation how portions are loaded, however. The
examples seem to create small arrays and then assign the entire array
(i.e. file) to the memmap object. Let's assume I have a binary data
file of complex numbers in the format described above, and let's
assume that the size of the complex data array (that is, the entire
file) is 100x100 (rows x columns). Could someone please post a few
lines showing how to load the top-left 50 x 50 quadrant and the
lower-right 50 x 50 quadrant into memmap objects? Thank you very much
in advance!

-L
 
Robert Kern

I don't have time to answer your questions now, so you should ask on the numpy
mailing list where others can jump in.

http://www.scipy.org/Mailing_Lists

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 
Carl Banks

I have two questions:
1) What is "recarray"?

Let's look:

[GCC 4.3.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy
>>> help(numpy.recarray)
Help on class recarray in module numpy.core.records:

class recarray(numpy.ndarray)
 |  recarray(shape, dtype=None, buf=None, **kwds)
 |
 |  Subclass of ndarray that allows field access using attribute lookup.
 |
 |  Parameters
 |  ----------
 |  shape : tuple
 |      shape of record array
 |  dtype : data-type or None
 |      The desired data-type.  If this is None, then the data-type is determined
 |      by the *formats*, *names*, *titles*, *aligned*, and *byteorder* keywords.
 |  buf : [buffer] or None
 |      If this is None, then a new array is created of the given shape and data-type.
 |      If this is an object exposing the buffer interface, then the array will
 |      use the memory from an existing buffer.  In this case, the *offset* and
 |      *strides* keywords can also be used.
...


So there you have it. It's a subclass of ndarray that allows field
access using attribute lookup. (IOW, you're creating a view of the
memmap'ed data of type recarray, which is the type numpy uses to
access structures by name. You need to create the view because
regular numpy arrays, which numpy.memmap creates, can't access fields
by attribute.)

help() is a nice thing to use, and numpy is one of the better
libraries when it comes to docstrings, so learn to use it.
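To make that concrete, here is a small sketch (modern Python 3 / NumPy syntax, with a made-up four-element array standing in for the memmapped data):

```python
import numpy as np

descriptor = np.dtype([("r", "<f4"), ("i", "<f4")])

# A plain structured array: fields are reachable only by key lookup.
plain = np.zeros(4, dtype=descriptor)
first_reals = plain["r"]        # works: dict-style field access
# plain.r would raise AttributeError: plain ndarrays lack field attributes

# Viewing the same memory as a recarray enables attribute access.
rec = plain.view(np.recarray)
first_reals_too = rec.r         # same data, now reachable as an attribute
```

Note that the view does not copy anything; `plain` and `rec` share the same buffer.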

2) The documentation for numpy.memmap claims that it is meant to be
used in situations where it is beneficial to load only segments of a
file into memory, not the whole thing. This is definitely something
I'd like to be able to do, as my files are frequently >1 GB. I don't
really see in the documentation how portions are loaded, however. The
examples seem to create small arrays and then assign the entire array
(i.e. file) to the memmap object. Let's assume I have a binary data
file of complex numbers in the format described above, and let's
assume that the size of the complex data array (that is, the entire
file) is 100x100 (rows x columns). Could someone please post a few
lines showing how to load the top-left 50 x 50 quadrant and the
lower-right 50 x 50 quadrant into memmap objects? Thank you very much
in advance!


You would memmap the whole region in question (in this case the whole
file), then take a slice. Actually, you could get away with memmapping
just the last 50 rows (the bottom half). The offset into the file
would be 50*100*8 bytes (50 rows x 100 columns x 8 bytes per complex
element), so:

data = memmap(filename, dtype=descriptor, mode='r',
              offset=50*100*8).view(recarray)
reshaped_data = data.reshape((50, 100))
interesting_data = reshaped_data[:, 50:100]


A word of caution: Every instance of numpy.memmap creates its own mmap
of the whole file (even if it only creates an array from part of the
file). The implications of this are A) you can't use numpy.memmap's
offset parameter to get around file size limitations, and B) you
shouldn't create many numpy.memmaps of the same file. To work around
B, you should create a single memmap, and dole out views and slices.
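A sketch of that workaround, using a small throwaway file in place of the real data (Python 3; the path and sizes here are invented):

```python
import os
import tempfile
import numpy as np

descriptor = np.dtype([("r", "<f4"), ("i", "<f4")])

# Fabricate a small 100x100 stand-in for the real data file.
path = os.path.join(tempfile.mkdtemp(), "sample.dat")
np.zeros(100 * 100, dtype=descriptor).tofile(path)

# One memmap of the whole file...
data = np.memmap(path, dtype=descriptor, mode="r").view(np.recarray)
grid = data.reshape(100, 100)

# ...and views doled out from it: no extra mmaps, no copies.
top_left = grid[:50, :50]
bottom_right = grid[50:, 50:]
```

Both quadrants are plain slices of the single mapping, so only one mmap of the file ever exists.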


Carl Banks
 
Lionel


Thanks Carl, I like your solution. Am I correct in my understanding
that memory is allocated at the slicing step in your example i.e. when
"reshaped_data" is sliced using "interesting_data = reshaped_data[:,
50:100]"? In other words, given a huge (say 1Gb) file, a memmap object
is constructed that memmaps the entire file. Some relatively small
amount of memory is allocated for the memmap operation, but the bulk
memory allocation occurs when I generate my final numpy sub-array by
slicing, and this accounts for the memory efficiency of using memmap?
 
Carl Banks


No, what accounts for the memory efficiency is that there is no bulk
allocation at all. The ndarray you have points to the memory that's
in the mmap. There is no copying of data or separate array allocation.

Also, it's not any more memory efficient to use the offset parameter
with numpy.memmap than it is to memmap the whole file and take a
slice.
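One way to convince yourself of this is to check that a reshaped, sliced array still shares memory with the memmap (a sketch; Python 3, with an invented demo file):

```python
import os
import tempfile
import numpy as np

# A small stand-in file of 200 float32 values.
path = os.path.join(tempfile.mkdtemp(), "demo.dat")
np.arange(200, dtype="<f4").tofile(path)

mm = np.memmap(path, dtype="<f4", mode="r")
sliced = mm.reshape(20, 10)[:, 5:]   # reshape + slice: still views

# The slice is backed by the same mapped memory; nothing was copied.
print(np.shares_memory(mm, sliced))   # True
```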


Carl Banks
 
sturlamolden

1) What is "recarray"?

An ndarray of what C programmers know as a "struct", in which each
field is accessible by its name.

That is,

struct rgba {
    unsigned char r;
    unsigned char g;
    unsigned char b;
    unsigned char a;
};

struct rgba arr[480][640];

is similar to:

import numpy as np
rgba = np.dtype({'names': list('rgba'), 'formats': [np.uint8]*4})
arr = np.zeros((480, 640), dtype=rgba)

Now you can access the r, g, b and a fields directly using arr['r'],
arr['g'], arr['b'], and arr['a'].
Internally the data will be represented just as compactly as with the
C code above. If you want to view the data as a 480 x 640 array of
32-bit integers instead, it is as simple as arr.view(dtype=np.uint32).
Formatted binary data can of course be read from files using
np.fromfile with the specified dtype, and written to files by passing
a recarray as the buffer to file.write. You can thus see NumPy's
record arrays as a more powerful alternative to Python's struct module.
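A quick sketch of the idea (Python 3; a tiny made-up 4x4 "image" instead of 480x640):

```python
import numpy as np

rgba = np.dtype({"names": list("rgba"), "formats": [np.uint8] * 4})
arr = np.zeros((4, 4), dtype=rgba)    # 4x4 array of structs, all zero

arr["r"] = 255                        # set every red channel by name
packed = arr.view(np.uint32)          # same bytes, viewed as 32-bit ints
```

The view keeps the shape because each struct is exactly four bytes, the size of one uint32.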

I don't really see in the documentation how portions are loaded, however.

Prior to Python 2.6, the mmap object (which numpy.memmap uses
internally) does not take an offset parameter. When NumPy is ported
to newer versions of Python, this will be fixed; you should then be
able to memory map with an ndarray from a certain offset. To make
this work now, you must e.g. backport mmap from Python 2.6 and use
that with NumPy. Not difficult, but nobody has bothered to do it (as
far as I know).



Sturla Molden
 
Carl Banks


Prior to Python 2.6, the mmap object (which numpy.memmap uses
internally) does not take an offset parameter. When NumPy is ported
to newer versions of Python, this will be fixed; you should then be
able to memory map with an ndarray from a certain offset. To make
this work now, you must e.g. backport mmap from Python 2.6 and use
that with NumPy. Not difficult, but nobody has bothered to do it (as
far as I know).

You can use an offset with numpy.memmap today; it'll mmap the whole
file, but start the array data at the given offset.

The offset parameter of mmap itself would be useful for mapping small
portions of gigabyte-sized files, and maybe numpy.memmap can take
advantage of that if the user passes an offset parameter. One thing
you can't do with mmap's offset, but can do with numpy.memmap's, is
set it to an arbitrary value: mmap's offset has to be a multiple of
the system's allocation granularity (a large number that depends on
the OS).
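With a current NumPy, that usage looks like this (a sketch; file and sizes are invented). The offset is in bytes and can be arbitrary, and the optional shape parameter sizes the resulting array:

```python
import os
import tempfile
import numpy as np

# Fabricate a 100x100 float32 "file": rows of 100 values each.
path = os.path.join(tempfile.mkdtemp(), "rows.dat")
np.arange(100 * 100, dtype="<f4").tofile(path)

# Map from row 50 onward: skip 50 rows x 100 cols x 4 bytes.
bottom = np.memmap(path, dtype="<f4", mode="r",
                   offset=50 * 100 * 4, shape=(50, 100))
```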


Carl Banks
 
Lionel


Does this mean that every time I iterate through an ndarray that is
sourced from a memmap, the data is read from the disk? Is the sliced
array at no time wholly resident in memory? What are the performance
implications of this?
 
Carl Banks


Ok, sorry for the confusion. What I should have said is that there is
no bulk allocation *by numpy* at all. The call to mmap does allocate
a chunk of RAM to reflect file contents, but the numpy arrays don't
allocate any memory of their own: they use the same memory as was
allocated by the mmap call.


Carl Banks
 
sturlamolden

The offset parameter of mmap itself would be useful to map small
portions of gigabyte-sized files, and maybe numpy.memmap can take
advantage of that if the user passes an offset parameter.  

NumPy's memmap is just a wrapper for Python 2.5's mmap. The offset
parameter does not affect the amount that is actually memory mapped.

S.M.
 
Carl Banks

NumPy's memmap is just a wrapper for Python 2.5's mmap. The offset
parameter does not affect the amount that is actually memory mapped.

Yes, that's what I said, but in the future numpy.memmap could be
updated to take advantage of mmap's new offset parameter.


Carl Banks
 
Lionel


Thanks for the explanations Carl. I'm sorry, but it's me who's the
confused one here, not anyone else :)

I hate to waste everyone's time again, but something is just not
"clicking" in that black hole I call a brain. So... "numpy.memmap"
allocates a chunk off the heap to coincide with the file contents. If
I memmap the entire 1 GB file, a corresponding amount (approx. 1 GB)
is allocated? That seems to contradict what is stated in the numpy
documentation:

"class numpy.memmap
Create a memory-map to an array stored in a file on disk.

Memory-mapped files are used for accessing small segments of large
files on disk, without reading the entire file into memory."

In my previous example that we were working with (the 100x100 data
file), you used an offset to memmap the "lower half" of the array.
Does this mean that in the process of memmapping that lower half, RAM
was set aside for 50x100 64-bit complex numbers? If so, and I decide
to memmap an entire file, there is no memory benefit in doing so.

At this point do you (or anyone else) recommend I just write a little
function for my class that takes the coordinates I intend to load,
i.e. "roll my own" loader? It seems like the best way to keep memory
to a minimum; I'm just worried about performance. On the other hand,
the most I'd be loading would be around 1k x 1k worth of data.
 
Carl Banks


Yes, it allocates room for the whole file in your process's LOGICAL
address space. However, it doesn't actually reserve any PHYSICAL
memory, or read in any data from the disk, until you've actually
accessed the data. And then it only reads small chunks in, not the
whole file.

So when you mmap your 1 GB file, the OS sets aside a 1 GB chunk of
address space to use for your memory map. That's all it does: it
doesn't read anything from disk, and it doesn't reserve any physical
RAM. Later, when you access a byte in the mmap via a pointer, the OS
notes that it hasn't yet loaded the data at that address, so it grabs
a small chunk of physical RAM and reads in the small amount of data
from the disk containing the byte you are accessing. This all happens
automatically and transparently to you.
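You can see the "logical, not physical" behavior directly: a file far larger than you would want to read eagerly maps instantly, and touching one element faults in only its neighborhood (a sketch; the path and size are invented):

```python
import os
import tempfile
import numpy as np

# Create a 100 MB file without writing 100 MB: seek to the end, write 1 byte.
path = os.path.join(tempfile.mkdtemp(), "big.dat")
with open(path, "wb") as f:
    f.seek(100 * 1024 * 1024 - 1)
    f.write(b"\0")

mm = np.memmap(path, dtype=np.uint8, mode="r")   # returns immediately
value = mm[12345]   # only now does the OS fault in the page holding this byte
```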

In my previous example that we were working with (the 100x100 data
file), you used an offset to memmap the "lower half" of the array.
Does this mean that in the process of memmapping that lower half, RAM
was set aside for 50x100 64-bit complex numbers? If so, and I decide
to memmap an entire file, there is no memory benefit in doing so.

The mmap call sets aside room for all 100x100 64-bit complex numbers
in logical address space, regardless of whether you use the offset
parameter or not. However, it might only read part of the file in
from disk, and will only reserve physical RAM for the parts it reads
in.

At this point do you (or anyone else) recommend I just write a little
function for my class that takes the coordinates I intend to load,
i.e. "roll my own" loader? It seems like the best way to keep memory
to a minimum; I'm just worried about performance. On the other hand,
the most I'd be loading would be around 1k x 1k worth of data.

No, if your file is not too large to mmap, just do it the way you've
been doing it. The documentation you've been reading is pretty much
correct, even if you approach it naively. It is both memory and I/O
efficient. You're overthinking things here; don't try to outsmart the
operating system. It'll take care of the performance issues
satisfactorily.

The only thing you have to worry about is if the file is too large to
fit into your process's logical address space, which on a typical 32-
bit system is 2-3 GB (depending on configuration) minus the space
occupied by Python and other heap objects, which is probably only a
few MB.
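Putting the whole thread together for the original problem: build interleaved (real, imag) float32 data, memmap it once, and pull the real and imaginary parts out as views (a sketch; the file and its contents are invented, and the byte layout assumes a little-endian platform):

```python
import os
import tempfile
import numpy as np

# Fabricate a 100x100 file of interleaved float32 (real, imag) pairs.
path = os.path.join(tempfile.mkdtemp(), "complex.dat")
np.arange(100 * 100, dtype=np.float32).astype(np.complex64).tofile(path)

descriptor = np.dtype([("r", "<f4"), ("i", "<f4")])
grid = np.memmap(path, dtype=descriptor,
                 mode="r").view(np.recarray).reshape(100, 100)

real = grid.r   # views into the mapped file: no copies are made
imag = grid.i
```

The two field arrays are strided views over the same mapping, so pages are read from disk only as you actually touch them.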



Carl Banks
 
Lionel

I see. That was very well explained Carl, thank you.
 
