Multidimensional array performance

Gregory.A.Book

I'm working with displaying and manipulating very large image sets. The
program handles anything from 2D images to 4D RGB volumes in a
time-series. I've been using dynamically allocated arrays to accomplish
this, but having recently added the ability to load and display 4D
data, I've run into some significant performance issues.

For a dataset of size 256x176x176x1, it takes about 30 seconds to
allocate the array. Loading the data into the array from 176 files only
takes a few seconds. Accessing the array is fine; I'm able to loop
through the entire volume in a couple of seconds. When it's time to
deallocate the memory, it takes a couple of minutes. This is on a P4
2.4GHz with 1GB RAM. I'm using the code below to allocate and
deallocate the memory.

I've read about alternate methods for multidimensional arrays,
including templates. I've also read about memory pools, but haven't
seen anything about how to allocate a multidimensional array inside of
the pool. I also don't think a pool is ideal, because I'll only be
(de)allocating memory when the user loads or unloads a dataset.
I've also thought about just declaring a single array in memory and
accessing elements by multiplication
(i.e. array[(plane*rows + row)*cols + col]).

What method would be best to quickly allocate and deallocate large
chunks of memory?

Thanks,
Greg


-------------- DECLARATION -----------------
unsigned short int ****imgDataINT16U; /* unsigned 16-bit */


--------------- ALLOCATION -------------------
int i, j, k;
/* setup the image buffer */
imgData->imgDataINT16U = new unsigned short int***[YSize];
for (i=0; i < YSize; i++)
    imgData->imgDataINT16U[i] = new unsigned short int**[XSize];
for (i=0; i < YSize; i++) {
    for (j=0; j < XSize; j++)
        imgData->imgDataINT16U[i][j] = new unsigned short int*[ZSize];
}
for (i=0; i < YSize; i++) {
    for (j=0; j < XSize; j++)
        for (k=0; k < ZSize; k++)
            imgData->imgDataINT16U[i][j][k] = new unsigned short int[TSize];
}


--------------- DEALLOCATION -------------
for (i=0; i < imgInfo->GetVolumeYSize(); i++) {
    for (j=0; j < imgInfo->GetVolumeXSize(); j++)
        for (k=0; k < imgInfo->GetVolumeZSize(); k++)
            delete[] imgDataINT16U[i][j][k];
    progDlg.Update(i, "Unloading Data");
}
for (i=0; i < imgInfo->GetVolumeYSize(); i++) {
    for (j=0; j < imgInfo->GetVolumeXSize(); j++)
        delete[] imgDataINT16U[i][j];
}
for (i=0; i < imgInfo->GetVolumeYSize(); i++)
    delete[] imgDataINT16U[i];
delete[] imgDataINT16U;

------------ GENERAL ACCESS ------------
voxelValue = imgData->imgDataINT16U[col][row][slice][0];
 
Ivan Vecerina

: I'm working with displaying and manipulating very large image sets. The
: program handles anything from 2D images to 4D RGB volumes in a
: time-series. I've been using dynamically allocated arrays to accomplish
: this, but having recently added the ability to load and display 4D
: data, I've run into some significant performance issues.
:
: For a dataset of size 256x176x176x1, it takes about 30 seconds to
: allocate the array. Loading the data into the array from 176 files only
: takes a few seconds.
....
: -------------- DECLARATION -----------------
: unsigned short int ****imgDataINT16U; /* unsigned 16-bit */
: --------------- ALLOCATION -------------------
: int i, j, k;
: /* setup the image buffer */
: imgData->imgDataINT16U = new unsigned short int***[YSize];
: for (i=0; i < YSize; i++)
:     imgData->imgDataINT16U[i] = new unsigned short int**[XSize];
: for (i=0; i < YSize; i++) {
:     for (j=0; j < XSize; j++)
:         imgData->imgDataINT16U[i][j] = new unsigned short int*[ZSize];
: }
: for (i=0; i < YSize; i++) {
:     for (j=0; j < XSize; j++)
:         for (k=0; k < ZSize; k++)
:             imgData->imgDataINT16U[i][j][k] = new unsigned short int[TSize];
: }
Think of it: if TSize=1, then every single unsigned short 'voxel'
will be allocated in its own memory block! This represents a
space overhead of a pointer plus several bytes of heap bookkeeping
(depending on your library implementation, maybe 16!) for each
2 bytes of data. For your 256x176x176x1 set, that is roughly
7.9 million separate allocations. This of course also means
expensive initial allocations (and the slow deallocation you are
seeing). But during data manipulation as well, performance will be
degraded by the poor locality of the many scattered memory blocks.


: What method would be best to quickly allocate and deallocate large
: chunks of memory?

A first step would be to modify your code to use something like:
class Buf4D_Int16U {
public:
    Buf4D_Int16U(int xSize, int ySize, int zSize, int tSize)
    : p_( new unsigned short[xSize*ySize*zSize*tSize] )
    , xSize_(xSize)
    , ySize_(ySize)
    , zSize_(zSize)
    , tSize_(tSize)
    { }

    ~Buf4D_Int16U() { delete[] p_; }

    // Element access (non-const and const)
    unsigned short& operator()(int x, int y, int z, int t)
    { return p_[((x*ySize_+y)*zSize_+z)*tSize_+t]; }
    unsigned short operator()(int x, int y, int z, int t) const
    { return p_[((x*ySize_+y)*zSize_+z)*tSize_+t]; }

private:
    unsigned short* p_;
    int xSize_, ySize_, zSize_, tSize_;
    Buf4D_Int16U(Buf4D_Int16U const&);    // disabled copy-ctor
    void operator=(Buf4D_Int16U const&);  // disabled copy-assign
};
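
For example (just a sketch, with XSize etc. as in your code):

Buf4D_Int16U imgData(XSize, YSize, ZSize, TSize); // one single allocation
imgData(col, row, slice, 0) = voxelValue;         // element write
// the destructor releases the whole buffer with one delete[]
// when imgData goes out of scope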

: ------------ GENERAL ACCESS ------------
: voxelValue = imgData->imgDataINT16U[col][row][slice][0];

Becomes:
voxelValue = imgData(col,row,slice,0);

Pass a reference to the imgData object instead of the
unsigned short**** pointers when passing the buffer to functions.

Note that you probably want to use an std::vector<unsigned short>
instead of the naked pointer p_ (but this may have a performance
cost, especially in non-optimized builds...).
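
A minimal sketch of that variant (same indexing; allocation and
cleanup become automatic):

#include <cstddef>
#include <vector>

class Buf4D_Int16U {
public:
    Buf4D_Int16U(int xSize, int ySize, int zSize, int tSize)
    : data_( std::size_t(xSize)*ySize*zSize*tSize )  // zero-initialized
    , ySize_(ySize), zSize_(zSize), tSize_(tSize)
    { }
    // no destructor needed: the vector releases its storage itself
    unsigned short& operator()(int x, int y, int z, int t)
    { return data_[((std::size_t(x)*ySize_+y)*zSize_+z)*tSize_+t]; }
private:
    std::vector<unsigned short> data_;
    int ySize_, zSize_, tSize_;
};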


Many further improvements are possible, but this first step
should be a no-brainer...


Cheers,
Ivan
 
Gregory.A.Book

That looks like an excellent solution. I never thought about putting
the indexing math into a function.
How does the performance of using [n][m][...] notation compare to a
function call when accessing elements?

> Many further improvements are possible, but this first step
> should be a no-brainer...

What other ways are there of improving large multidimensional arrays?

-Greg



 
Ivan Vecerina

: That looks like an excellent solution. I never thought about putting
: the indexing math into a function.
: How does the performance of using [n][m][...] notation compare to a
: function call when accessing elements?
You should check using a profiler on your specific platform.
Some reasoning, though:
- In Debug mode/unoptimized builds, you will suffer the overhead
of a function call. This may or may not matter. It may actually
be useful to check, within the accessor function, at the expense
of some additional runtime overhead, that all indices are within
range (e.g. ASSERT(x<xSize) is enough if x and xSize are unsigned).
In release builds, the call should be inlined and the ASSERTs
compiled out, so neither should matter.
- On a modern HW architecture, the contiguous (and cache-friendly)
accesses to a single [xSize*ySize*zSize*tSize] block, plus the
multiplications to compute an index, should be less expensive than
the multiple pointer dereferences of your current approach. But
take this as a wild guess.
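
For illustration, the checked accessor could look like this
(a sketch replacing the non-const operator() of the class above;
needs <cassert>, and assert() compiles to nothing when NDEBUG is
defined, as in typical release builds):

unsigned short& operator()(unsigned x, unsigned y, unsigned z, unsigned t)
{
    // with unsigned indices, one comparison per axis also rules
    // out negative values
    assert(x < unsigned(xSize_) && y < unsigned(ySize_));
    assert(z < unsigned(zSize_) && t < unsigned(tSize_));
    return p_[((x*ySize_ + y)*zSize_ + z)*tSize_ + t];
}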

: > Many further improvements are possible, but this first step
: > should be a no-brainer...
:
: What other ways are there of improving large multidimensional arrays?

For very large data sets, it can be useful to tile the data in
memory (e.g. store small sub-cubes of the data in one contiguous
memory block each). This can, for example, improve the performance
of some multi-planar reconstructions (better memory locality, can
be more cache-friendly, especially if some of the data does not
fit in main memory). But this will complicate the implementation
of most algorithms. Since your program managed to run even with
all the allocation space overhead you had, running out of memory
should not be an issue for you.
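
To make the idea concrete, a purely hypothetical index computation
for 16x16x16 sub-cubes could look like this (assumes the volume
dimensions are multiples of 16, with xTiles=XSize/16 and
yTiles=YSize/16; not part of the class above):

inline size_t tiledIndex(size_t x, size_t y, size_t z,
                         size_t xTiles, size_t yTiles)
{
    const size_t T = 16;                      /* tile edge length */
    /* which sub-cube the voxel falls in */
    size_t tile   = ((z/T)*yTiles + (y/T))*xTiles + (x/T);
    /* offset inside that sub-cube's contiguous block */
    size_t inTile = ((z%T)*T + (y%T))*T + (x%T);
    return tile*(T*T*T) + inTile;
}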

For many algorithms, you will save time by using relative offsets to
reach the neighbors of a voxel rather than recomputing each index.
E.g.: tStep=1, zStep=tSize, yStep=zStep*zSize, xStep=yStep*ySize;
when you want to access the neighbors of a voxel at pv,
you can directly address pv[+zStep], pv[-zStep], etc.
Even better, by adding a 'margin' around your data set,
you can linearly scan and transform the data from
begin() to end() without any full-index computation.
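
For instance, a 6-neighbour average can then be written with fixed
offsets only (a sketch; assumes pv does not point at the volume
boundary, and xStep/yStep/zStep computed as above):

inline unsigned neighborAverage6(const unsigned short* pv,
                                 ptrdiff_t xStep, ptrdiff_t yStep,
                                 ptrdiff_t zStep)
{
    /* all six face neighbors reached without recomputing an index */
    unsigned sum = pv[-xStep] + pv[+xStep]
                 + pv[-yStep] + pv[+yStep]
                 + pv[-zStep] + pv[+zStep];
    return sum / 6;
}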

In a different vein, many algorithms today can be
reimplemented to leverage the graphics processor
of your display card:
http://www.google.com/search?q=GPU+processing

But remember the saying about premature optimization...


hth
 
