Can I avoid the use of arrays in this situation or do I have to use them?

T

terminator

mike3 said:
I just finished testing another attempt at this. It seems the
problem is not the choice of vector vs array, but instead how
the memory is handled.
A temporary array made like this:
Digit *digits(new Digit[4]);
and then removed using delete is just as slow as a vector.
Whereas an array like this:
Digit digits[4];
is far faster.
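
For illustration, the two patterns being compared might look like this (a minimal sketch; `Digit` is assumed here to be a plain integer type, since the thread's real `Digit` is not shown -- and note that an array obtained with `new Digit[4]` must be released with `delete[]`, not `delete`):

```cpp
#include <cstdint>

typedef std::uint32_t Digit; // assumed digit type; the thread's real Digit is not shown

// Heap-allocated temporary: one allocation + one deallocation per call.
Digit sum4_heap(const Digit* src) {
    Digit* digits = new Digit[4]();      // the slow pattern from the post
    Digit s = 0;
    for (int i = 0; i < 4; ++i) { digits[i] = src[i]; s += digits[i]; }
    delete[] digits;                     // must be delete[], not delete
    return s;
}

// Automatic (stack) temporary: no allocator traffic at all.
Digit sum4_stack(const Digit* src) {
    Digit digits[4] = { 0 };
    Digit s = 0;
    for (int i = 0; i < 4; ++i) { digits[i] = src[i]; s += digits[i]; }
    return s;
}
```

The stack version is faster simply because it never touches the general-purpose allocator; the array's storage is part of the function's frame.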

Welcome to the dark side of dynamic memory management.




The problem, though, is that those variable-length arrays are
necessary -- there's no easy way around them with the Floating Point
(FP) routines as I need to store bitshifted versions of the
operands without munging said operands. It might be possible
to allocate arrays of the second type I discussed above, but
then all the digit-manipulators have to take arrays not
vectors and that just doesn't seem like a good idea. Now,
if my program were single-threaded, I would have been able to
make do with a global buffer that would only get allocated once,
and always use that. But the program is going to be multi-
threaded, so having a single global buffer cannot really work.
And it seems ugly to have to pass a buffer operand to every
single arithmetic routine, and it is downright impossible when
using overloaded operators.
What do you do?

There are commercial and homegrown memory managers that can be
made much faster than the generic one your RTL supplies.
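
As a rough illustration of what such a memory manager does (a minimal sketch, not a production allocator -- it ignores alignment details, thread safety, and variable block sizes, and `BlockPool` is a name invented here): freed blocks are parked on a freelist and handed back on the next request, so only the first allocation of each block ever reaches the general-purpose heap.

```cpp
#include <cstddef>
#include <vector>

// Minimal fixed-size block pool (illustrative only). Freed blocks go on a
// freelist and are reused without touching the general-purpose heap again.
class BlockPool {
public:
    explicit BlockPool(std::size_t blockSize) : blockSize_(blockSize) {}
    ~BlockPool() {
        // Release every block still sitting on the freelist.
        for (std::size_t i = 0; i < free_.size(); ++i)
            ::operator delete(free_[i]);
    }
    void* allocate() {
        if (free_.empty())
            return ::operator new(blockSize_);  // only path that hits the heap
        void* p = free_.back();
        free_.pop_back();
        return p;
    }
    void deallocate(void* p) { free_.push_back(p); }  // O(1), no heap call
private:
    std::size_t blockSize_;
    std::vector<void*> free_;
};
```

The LIFO reuse also tends to hand back cache-warm memory, which is part of why such pools beat a generic allocator for hot loops.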

Great idea.
Also consider: if you know the largest possible size of your
array, use that to define an automatic array, even if you only
use part of it in the current call to your function.

consider:
a=b*c;

'a' would need to be dynamically allocated in the end.
Dynamic allocation is generally inevitable in programs using that
'bignum', so why dirty the stack to prevent what has to happen anyway?

regards,
FM.
 
M

mike3

mike3 said:
<snip>
I just finished testing another attempt at this. It seems the
problem is not the choice of vector vs array, but instead how
the memory is handled.
A temporary array made like this:
Digit *digits(new Digit[4]);
and then removed using delete is just as slow as a vector.
Whereas an array like this:
Digit digits[4];
is far faster.
Welcome to the dark side of dynamic memory management.
There are commercial and homegrown memory managers that can be
made much faster than the generic one your RTL supplies.

Great idea.

But it seems even the simplest method of dynamically
allocating the memory as an array is so slow.

Unless one puts a precision limit in (which it has) and uses
static arrays for the temporary buffers -- but you can't mix
vectors and arrays easily. The routines have to take one or
the other, and that would mean using arrays directly inside
objects. And I've heard that arrays are "evil".
consider:
a=b*c;

'a' would need to be dynamically allocated in the end.
Dynamic allocation is generally inevitable in programs using that
'bignum', so why dirty the stack to prevent what has to happen anyway?

Not with floating point where you have to round at some
point. The 2x-width product of b and c need only be stored
in a temporary buffer before rounding to the precision of
"a", then it can be discarded.
 
M

mike3

I just finished testing another attempt at this. It seems the
problem is not the choice of vector vs array, but instead how
the memory is handled.
A temporary array made like this:
Digit *digits(new Digit[4]);
and then removed using delete is just as slow as a vector.
Whereas an array like this:
Digit digits[4];
is far faster.
The problem, though, is that those variable-length arrays are
necessary -- there's no easy way around them with the Floating Point
(FP) routines as I need to store bitshifted versions of the
operands without munging said operands. It might be possible
to allocate arrays of the second type I discussed above, but
then all the digit-manipulators have to take arrays not
vectors and that just doesn't seem like a good idea.

That is what I am talking about.
Now,
if my program were single-threaded, I would have been able to
make do with a global buffer that would only get allocated once,
and always use that. But the program is going to be multi-
threaded, so having a single global buffer cannot really work.
And it seems ugly to have to pass a buffer operand to every
single arithmetic routine, and it is downright impossible when
using overloaded operators.
What do you do?

Even in that case (global buffer) you would need to store the result in
a dynamically allocated array, so allocating memory is inevitable, and
I prefer to do it at the beginning of the function.
If you do in-place calculation, no dynamic allocation is necessary if
the capacity of the destination is already enough; but if you need to store
the initial value of the destination in another 'bignum' object, then
one dynamic allocation is needed. You are destined to pay that
overhead, so a vector buffer followed by a swap is the clean, readable
way of doing it, but for the sake of performance I would declare an
in-place version too (just the way we can write 'x=y+z' and 'x+=y').
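
The swap idiom FM describes could look like this (an illustrative sketch; `BigNum` and `add` are stand-ins invented here, not the thread's actual classes):

```cpp
#include <vector>
#include <cstdint>
#include <cstddef>
#include <algorithm>

typedef std::uint32_t Digit;

struct BigNum { std::vector<Digit> digits; }; // least-significant digit first

// Build the result in a local vector, then swap it into the destination:
// the local's allocation is the one the result had to pay for anyway, and
// swap itself is O(1) -- no second allocation, no element-by-element copy.
void add(BigNum& dst, const BigNum& y, const BigNum& z)
{
    std::vector<Digit> buf(std::max(y.digits.size(), z.digits.size()) + 1, 0);
    std::uint64_t carry = 0;
    for (std::size_t i = 0; i + 1 < buf.size(); ++i) {
        std::uint64_t t = carry;
        if (i < y.digits.size()) t += y.digits[i];
        if (i < z.digits.size()) t += z.digits[i];
        buf[i] = static_cast<Digit>(t);   // low 32 bits
        carry = t >> 32;
    }
    buf.back() = static_cast<Digit>(carry);
    dst.digits.swap(buf);  // O(1) exchange; buf's destructor frees dst's old storage
}
```

An in-place `operator+=`-style version would write into `dst.digits` directly and skip `buf` entirely whenever the destination's capacity already suffices.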

cheers,
FM.

Why does the global array need to be dynamic, constantly
changing its precision? The result is going to get rounded
at the end anyway, and the precision of the results is not
varied through the calculations constantly -- it does not
keep growing so there's no reason to keep dynamically
allocating more memory.

If you have set up to use X amount of precision, your
global array needs only hold 2X worth. There may be another
array that holds X+1 or X+2's worth for computing
transcendental functions, hence the multiplication
buffer may need to hold 2X+2 or 2X+4's worth. But that is
not a problem. One need only resize it if the precision
being used is too small. Then when it's resized, it stays
that big. (It cannot grow indefinitely though as the
bignum package contains a max precision limit coded in.)
And the memory consumption is not too great -- these
numbers can take up at most 1 Kbyte or so, so that's only
a few Kbytes of storage which is nothing on today's PCs.
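
The grow-only buffer described above might be sketched like this (illustrative; `Scratch` and its methods are names invented here, and the hard maximum plays the role of the package's coded-in precision limit):

```cpp
#include <vector>
#include <stdexcept>
#include <cstddef>

typedef unsigned Digit;

// Grow-only scratch buffer: resized only when a request exceeds its current
// size, stays at that size afterwards, and refuses to grow past a hard
// maximum (the bignum package's precision limit).
class Scratch {
public:
    explicit Scratch(std::size_t maxDigits) : max_(maxDigits) {}
    Digit* get(std::size_t need) {
        if (need > max_)
            throw std::length_error("precision limit exceeded");
        if (need > buf_.size())
            buf_.resize(need);             // grows rarely, never shrinks
        return buf_.empty() ? 0 : &buf_[0];
    }
    std::size_t size() const { return buf_.size(); }
private:
    std::vector<Digit> buf_;
    std::size_t max_;
};
```

After the first few calls the buffer reaches its working size and `get` becomes a bounds check plus a pointer return, with no allocator traffic at all.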
 
W

werasm

I just reduced the routine to this:

FG3DError FG3DDigitVector_Mul(std::vector<Digit> &vec0,
                              const std::vector<Digit> &vec1,
                              const std::vector<Digit> &vec2,
                              int length1, int length2)
{
    std::vector<Digit> tmpbuf(length1 + length2);
    return FG3DError(FG3D_SUCCESS);
}

I wrote a little test using the VecStack I mentioned below. One
has to consider that the test ran in a single-threaded
environment, and things may change if you throw a
critical section in there (but not too much if not
contended). I used Dev C++ (don't know which STL),
turned debugging off and enabled optimization for
speed. It turned out the vector code executed faster
than the array code by 10+ seconds for 30000000
items. The test is below:

#include <iostream>
#include <cassert>
#include <cstdlib>  // EXIT_SUCCESS
#include <vector>
#include <memory>
//...#include "yourMutex.h"

template <class T>
struct VecStack
{
    typedef std::auto_ptr<std::vector<T> > ClientItem;

    VecStack( unsigned defaultSz )
    : defaultSz_( defaultSz )
    { vecList_.reserve( 50 ); }

    ~VecStack()
    {
        // Release any buffers still parked on the stack.
        for( unsigned i = 0; i < vecList_.size(); ++i )
            delete vecList_[i];
    }

    ClientItem pop()
    {
        //...Mutex::guard lock( mutex_ );
        std::auto_ptr<std::vector<T> > v( 0 );

        if( vecList_.empty() )
        {
            v.reset( new std::vector<T>( defaultSz_ ) );
        }
        else
        {
            v.reset( vecList_.back() );
            vecList_.pop_back();
        }
        return v;
    }

    void push( ClientItem v )
    {
        //...Mutex::guard lock( mutex_ );
        //assert( v.get() );
        vecList_.push_back( v.release() );
    }

private:
    std::vector<std::vector<T>*> vecList_;
    unsigned defaultSz_;
    //...Mutex mutex_;
};

enum
{
    eMAX_EXPECTED_SZ = 512,
    eLOOP_SZ = 30000000,
    eINNER_SZ = 500
};

void testVect1( unsigned len )
{
    typedef VecStack<int> StackType;
    static StackType stack( eMAX_EXPECTED_SZ );

    StackType::ClientItem item = stack.pop();
    item->resize( len );
    stack.push( item );
}

void testVect2( unsigned len )
{
    int Items[eMAX_EXPECTED_SZ] = { 0 }; // zero-initialized automatic array (the comparison case)
}

void doTest( void(*op)(unsigned) )
{
    for( int i = 0; i < eLOOP_SZ/eINNER_SZ; ++i )
    {
        for( int j = 0; j < eINNER_SZ; ++j )
        {
            (*op)( j );
        }
    }
}

int main(int argc, char *argv[])
{
    std::cout << "Press key to start test!" << std::endl;
    std::cin.get();
    std::cout << "Test1 running...\n";
    doTest( testVect1 );
    std::cout << "Test2 running... \n";
    doTest( testVect2 );
    std::cout << "End... Press key to exit." << std::endl;
    std::cin.get();
    return EXIT_SUCCESS;
}

Kind regards,

Werner
 
T

terminator

<snip>
I just finished testing another attempt at this. It seems the
problem is not the choice of vector vs array, but instead how
the memory is handled.
A temporary array made like this:
Digit *digits(new Digit[4]);
and then removed using delete is just as slow as a vector.
Whereas an array like this:
Digit digits[4];
is far faster.
The problem, though, is that those variable-length arrays are
necessary -- there's no easy way around them with the Floating Point
(FP) routines as I need to store bitshifted versions of the
operands without munging said operands. It might be possible
to allocate arrays of the second type I discussed above, but
then all the digit-manipulators have to take arrays not
vectors and that just doesn't seem like a good idea.
That is what I am talking about.
Even in that case (global buffer) you would need to store the result in
a dynamically allocated array, so allocating memory is inevitable, and
I prefer to do it at the beginning of the function.
If you do in-place calculation, no dynamic allocation is necessary if
the capacity of the destination is already enough; but if you need to store
the initial value of the destination in another 'bignum' object, then
one dynamic allocation is needed. You are destined to pay that
overhead, so a vector buffer followed by a swap is the clean, readable
way of doing it, but for the sake of performance I would declare an
in-place version too (just the way we can write 'x=y+z' and 'x+=y').
cheers,
FM.

Why does the global array need to be dynamic, constantly
changing its precision? The result is going to get rounded
at the end anyway, and the precision of the results is not
varied through the calculations constantly -- it does not
keep growing so there's no reason to keep dynamically
allocating more memory.

I did not mean the global buffer to be placed on the heap; I mean that when
you store the value in the result 'bignum' a dynamic allocation
happens, and it is better to have a non-static vector buffer (whose
implementation performs the dynamic allocation) and swap it into the
result (that prevents a second dynamic allocation).
If you have set up to use X amount of precision, your
global array needs only hold 2X worth.

Actually a multiplication holds **less than** or equal to
length1+length2 digits.
If you want to store just the necessary digits, then a normal array buffer
can help decrease the size of the resulting vector, but there are better
programming techniques to avoid a normal array while achieving similar
performance.
There may be another
array that holds X+1 or X+2's worth for computing
transcendental functions, hence the multiplication
buffer may need to hold 2X+2 or 2X+4's worth. But that is
not a problem. One need only resize it if the precision
being used is too small. Then when it's resized, it stays
that big. (It cannot grow indefinitely though as the
bignum package contains a max precision limit coded in.)
And the memory consumption is not too great -- these
numbers can take up at most 1 Kbyte or so, so that's only
a few Kbytes of storage which is nothing on today's PCs.

Suppose a number of 'length1' digits is multiplied by another of 'length2';
then you just need to check whether 'length1+length2' is greater than
'maximum_size'.
'length1+length2' is at most '2*max(length1,length2)', and neither of
the operands needs to be less than 'maximum_size/2', but as mentioned
formerly, the sum of the lengths must not be greater than 'maximum_size'.

regards,
FM.
 
M

mike3

On Nov 27, 11:05 pm, mike3 <[email protected]> wrote:
I did not mean the global buffer to be placed on the heap; I mean that when
you store the value in the result 'bignum' a dynamic allocation
happens, and it is better to have a non-static vector buffer (whose
implementation performs the dynamic allocation) and swap it into the
result (that prevents a second dynamic allocation).

Why does a dynamic allocation happen to the bignum?
In the floating point routines, the multiplication
stores its wide result in a temporary buffer that's been
preallocated to the full size. Then the _upper part_ of
this buffer (its most-significant digits) is copied to
the destination floating point number, so that it fits
within the allotted precision. Floating point numbers
are intended to be static, their precision does not
change except with specifically-made resize functions,
for maximum performance.
Actually a multiplication holds **less than** or equal to
length1+length2 digits.

Yes, that's right.
If you want to store just the necessary digits, then a normal array buffer
can help decrease the size of the resulting vector, but there are better
programming techniques to avoid a normal array while achieving similar
performance.

But then you need to count digits to figure out how
big the product is going to be.

The big problem here is handling the allocation/
deallocation of the memory.
Suppose a number of 'length1' digits is multiplied by another of 'length2';
then you just need to check whether 'length1+length2' is greater than
'maximum_size'.
'length1+length2' is at most '2*max(length1,length2)', and neither of
the operands needs to be less than 'maximum_size/2', but as mentioned
formerly, the sum of the lengths must not be greater than 'maximum_size'.

But where is this temporary buffer stored? Also,
in my multithreaded application, how can you give
each thread its own buffer without having to
pass special parameters to every bignum routine
(impossible with overloaded operators!), _and_
not having a "rationing" method where the threads
take turns using a single buffer, which utterly
destroys the parallelism that would be gained
on machines with multiprocessing ability
(multiple CPUs/multiple cores)?
 
T

terminator

I mean that when
you store the value in the result 'bignum' a dynamic allocation
happens, and it is better to have a non-static vector buffer (whose
implementation performs the dynamic allocation) and swap it into the
result (that prevents a second dynamic allocation).

Actually, I missed something: if the result 'bignum' has formerly been
used, then its size/capacity may be enough to hold the result of the
calculation, in which case a buffer on the stack may increase speed.

regards,
FM.
 
M

mike3

Actually, I missed something: if the result 'bignum' has formerly been
used, then its size/capacity may be enough to hold the result of the
calculation, in which case a buffer on the stack may increase speed.

regards,
FM.

That's right, and with floating point we don't store the full-width
product; it's only stored as an intermediate temporary prior to
rounding. However, a stack array would require the digit-manipulation
routines to take arrays instead of vectors, which would force
everything to be arrays -- unless of course one makes two routines,
one for each type of argument, but that seems like poor
coding practice.
 
M

mike3

That's right, and with floating point we don't store the full-width
product; it's only stored as an intermediate temporary prior to
rounding. However, a stack array would require the digit-manipulation
routines to take arrays instead of vectors, which would force
everything to be arrays -- unless of course one makes two routines,
one for each type of argument, but that seems like poor
coding practice.

And therefore, because of that, what should I do?
 
T

terminator

No, you just cannot use swap (which would not be needed in this
regard anymore).
 
M

mike3

No, you just cannot use swap (which would not be needed in this
regard anymore).

So then how do I avoid reallocation on every call of the routines?
If I keep a stack of preallocated vector buffers, one for each thread,
how do I make the bignum routines know which to use without passing
parameters all the time, which to me is poor coding, and also
downright impossible with overloaded operators?
 
W

werasm

So then how do I avoid reallocation on every call of the routines?
If I keep a stack of preallocated vector buffers, one for each thread,
how do I make the bignum routines know which to use without passing
parameters all the time, which to me is poor coding, and also
downright impossible with overloaded operators?

I don't know what I'm missing here. To me the solution I
proposed should work fine. You don't need a vector per
thread. You only need a vector for the number of threads
passing through the routine simultaneously. Furthermore
you would not have that many reallocations. Guess at
the largest buffer required and reserve on instantiation.
Chances are that you would not have more than two or
three threads passing through the function simultaneously
anyway. I tested it (stack of vectors) for the single-
threaded scenario and it beat the array by some margin
(obviously optimizations were at maximum). The solution
should at least improve things dramatically in comparison
to your original solution.

Regards,

Werner
 
M

mike3

I don't know what I'm missing here. To me the solution I
proposed should work fine. You don't need a vector per
thread. You only need a vector for the number of threads
passing through the routine simultaneously. Furthermore
you would not have that many reallocations. Guess at
the largest buffer required and reserve on instantiation.
Chances are that you would not have more than two or
three threads passing through the function simultaneously
anyway. I tested it (stack of vectors) for the single-
threaded scenario and it beat the array by some margin
(obviously optimizations were at maximum). The solution
should at least improve things dramatically in comparison
to your original solution.

Regards,

Werner

But still, I need more than one buffer. How do I indicate
to the threads, though, which buffer they should use without
including funny "use this buffer" parameters on every
bignum routine, which to me equals poor coding practice and
also is 100% impossible with overloaded operators?
 
M

mike3

I just finished testing another attempt at this. It seems the
problem is not the choice of vector vs array, but instead how
the memory is handled.
A temporary array made like this:
Digit *digits(new Digit[4]);
and then removed using delete is just as slow as a vector.
Whereas an array like this:
Digit digits[4];
is far faster.
The problem, though, is that those variable-length arrays are
necessary -- there's no easy way around them with the Floating Point
(FP) routines as I need to store bitshifted versions of the
operands without munging said operands. It might be possible
to allocate arrays of the second type I discussed above, but
then all the digit-manipulators have to take arrays not
vectors and that just doesn't seem like a good idea.

That is what I am talking about.
Now,
if my program were single-threaded, I would have been able to
make do with a global buffer that would only get allocated once,
and always use that. But the program is going to be multi-
threaded, so having a single global buffer cannot really work.
And it seems ugly to have to pass a buffer operand to every
single arithmetic routine, and it is downright impossible when
using overloaded operators.
What do you do?

Even in that case (global buffer) you would need to store the result in
a dynamically allocated array, so allocating memory is inevitable, and
I prefer to do it at the beginning of the function.
If you do in-place calculation, no dynamic allocation is necessary if
the capacity of the destination is already enough; but if you need to store
the initial value of the destination in another 'bignum' object, then
one dynamic allocation is needed. You are destined to pay that
overhead, so a vector buffer followed by a swap is the clean, readable
way of doing it, but for the sake of performance I would declare an
in-place version too (just the way we can write 'x=y+z' and 'x+=y').

cheers,
FM.

And of course the reallocations shouldn't get done in every single
iteration of the fractals!
 
M

mike3

On Dec 1, 12:00 pm, werasm <[email protected]> wrote:
<snip>

I think I figured out how to get the threads assigned
the right buffers, but I need a very fast "mutex" control.
Would something simple, like including a boolean "in use"
flag in the buffer stack when pushing/popping that is waited
for with a simple while() loop suffice?
 
M

mike3

A vector should be compatible with arrays, as the memory
of a vector is guaranteed to be contiguous. Therefore
having an array should IMO not force you to use arrays
all over the show. If you really don't have a choice,
I would use a scheme whereby one pre-creates vectors as
you require them, whereafter you just pop them off
a stack. This would/should reduce the required memory
allocations to a minimum. Something like this (probably
naive implementation, you would have to test and experiment
a little):
<snip>

The problem is parameter type conflicts and repeated code.
See, the basic routines that manipulate the digits take
special objects of type "RawDigit", not simple arrays of
digits. This avoids having nasty bounds-checking all over
the place (note the bounds-checking in the routine in my
original post), as RawDigit contains its own "length" field.
Therefore you'd have to have separate routines for array
manipulation, and having two sets of routines like that is
a poor design and bad for maintainability purposes (think:
bug-inducing). Thus for the purpose of good design, I am
forced to use the bunch of preallocated buffers. Furthermore,
what about a thread that needs a buffer deep down in the
stack? You have to pop off _all_ the vectors down to that
point, retrieve it, then push them _all_ back on. Also,
what would be the fastest possible "mutex", anyway?
 
W

werasm

But still, I need more than one buffer.

Don't follow?
How do I indicate
to the threads though which buffer they should use without
including funny "use this buffer" parameters on every
bignum routine, which to me equals poor coding practice and
also is 100% impossible with overloaded operators?

I don't see why it is necessary to indicate to threads
which buffer they should use. It is irrelevant which
buffer should be used. You simply pop off the next buffer.
If none exists (at the time of the pop), it means all
are used at that instant, in which case one gets created.

After using the buffer, you just push it (or the pointer
to it) back. Threads never share buffers as once a buffer
is used it is not available to other threads. As buffers
are only used temporarily (like the array on the stack)
you don't require to keep state and threads don't care
about which buffer is used (See my first reply to you
for example):

http://groups.google.com/group/comp...6a844a/bef7efe28a247163?#doc_9d759b72822e5491

Your original question: Can I avoid the use of arrays...?

My answer - yes, use a stack of pointers to vectors
and allocate the vectors as required (see example).
In testing it for one thread of execution it proved
to my surprise to beat zero-initialized arrays. For
the multithreaded (MT) case it depends on how good your
MT primitives are. The whole which thread uses
which buffer question is irrelevant (emphasis).

I used auto_ptr for the generic stack as this
emphasizes that the buffer does not belong to
the stack when popped, until returned. Therefore
should exceptions happen, you would have no
leaks (transferral of ownership to the caller --
the purpose, to me, of using auto_ptr).

Regards,

Werner
 
M

mike3

Don't follow?


I don't see why it is necessary to indicate to threads
which buffer they should use. It is irrelevant which
buffer should be used. You simply pop off the next buffer.
If none exists (at the time of the pop), it means all
are used at that instant, in which case one gets created.

After using the buffer, you just push it (or the pointer
to it) back. Threads never share buffers as once a buffer
is used it is not available to other threads. As buffers
are only used temporarily (like the array on the stack)
you don't require to keep state and threads don't care
about which buffer is used (See my first reply to you
for example):

http://groups.google.com/group/comp.lang.c++/tree/browse_frm/thread/3...

Your original question: Can I avoid the use of arrays...?

My answer - yes, use a stack of pointers to vectors
and allocate the vectors as required (see example).
In testing it for one thread of execution it proved
to my surprise to beat zero-initialized arrays. For
the multithreaded (MT) case it depends on how good your
MT primitives are. The whole which thread uses
which buffer question is irrelevant (emphasis).

I used auto_ptr for the generic stack as this
emphasizes that the buffer does not belong to
the stack when popped, until returned. Therefore
should exceptions happen, you would have no
leaks (transferral of ownership to the caller --
the purpose, to me, of using auto_ptr).

Regards,

Werner

Ah, now I see how it works. So if you have, say,
8 threads running, you create and push on 8 buffers.
One will never need more so long as the amount of
threads does not exceed that limit. If one needs more
threads, one need only allocate the extra buffers once,
not on all 500 million bignum operations
needed to render an image.

Also, what about the speed of the mutex, which
gets called so constantly? Will that significantly
impede performance? Although in the single-thread
case, the mutex calls can be suppressed. But what
about in the multi-thread one?
 
W

werasm

Ah, now I see how it works. So if you have, say,
8 threads running, you create and push on 8 buffers.
One will never need more so long as the amount of
threads does not exceed that limit. If one needs more
threads, one need only allocate the extra buffers once,
not on all 500 million bignum operations
needed to render an image.

Actually, for this implementation you don't need to
push any buffers at all initially. You just need
to pop, and then push when you're done. If the
underlying buffer is empty, a new vector will be created
automatically (the first time). Then buffers will
only be added when more than one thread passes
through the function simultaneously. See the example
in main.


You only need as many buffers as the number of threads
passing through the function in question simultaneously.
Also, what about the speed of the mutex, which
gets called so constantly? Will that significantly
impede performance?

This depends on your OS. Most OSes have mechanisms
that only cause a significant reduction in
performance when contention arises (switching to
kernel mode). In Windows you could typically wrap
the critical section (I think boost or a library
like Loki will have done this for you already
with some scoped locking mechanism). Under Windows
you should rather use CRITICAL_SECTION (and not
a Win32 mutex), as the Win32 mutex is for interprocess
synchronization, which has more overhead.
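
The scoped-locking idea mentioned above (what boost and Loki provide) can be sketched with C++11's `std::lock_guard`, which post-dates this thread; the names below are illustrative. On Windows, `std::mutex` is typically implemented with an efficient in-process primitive, comparable to CRITICAL_SECTION rather than a kernel interprocess mutex:

```cpp
#include <mutex>

std::mutex stackMutex;   // would protect the shared buffer stack
int sharedCount = 0;     // illustrative shared state

// The guard acquires the lock in its constructor and releases it in its
// destructor, so the lock is released even on early return or exception.
void touchShared() {
    std::lock_guard<std::mutex> guard(stackMutex);  // locks here
    ++sharedCount;
}                                                   // unlocks here
```

This is exactly the pattern the VecStack's commented-out `Mutex::guard lock( mutex_ );` lines stand for: wrap the pop/push bodies in a guard and the stack becomes safe to share.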

Regards,

Werner
 
W

werasm

Actually, for this implementation you don't need to
push any buffers at all initially. You just need
to pop, and then push when you're done. If the
underlying buffer is empty, a new vector will be created
automatically (the first time). Then buffers will
only be added when more than one thread passes
through the function simultaneously. See the example
in main.

You only need as many buffers as the number of threads
passing through the function in question simultaneously.

And this is determined automatically, because if pop
is called when the underlying structure is empty, a
new vector gets created...
 
