Can I avoid the use of arrays in this situation or do I have to usethem?

M

mike3

Hi.

I've been reworking a bignum package I made a bit. I've got this thing
here that multiplies two strings of digits using a simple "grade
school" algorithm, and it seems I can't help but use an array in there
for a temporary buffer, as reallocating an std::vector<> over and over
again takes lots of time, and storing a global buffer would be a
problem since my program will have multithreading. (Although I suppose
one could allocate a certain amount of global buffers equal to the
maximum amount of threads one could run and then pass an index
parameter to each routine but that seems kind of silly. And how do you
keep track of all these buffers especially with overloaded operators
where you can't pass a parameter?! Eww.) Can I avoid the use of the
array and still get maximum performance? Or is this the case when
arrays are appropriate? The point of the temporary buffer is to allow
for this to be used as an "in-place" multiplication procedure as well
as out-of-place. There are other places in the bignum package where
similar array-vs-vector performance considerations arise involving
temporary buffers (there's a floating point addition/subtraction
routine that requires shifting the bits in one of the operands and to
avoid messing up the original it creates a temporary buffer... Also,
that buffer needs to be able to be passed to routines similar to this
one which accept vectors... Should I just give up and use arrays
everywhere for this?! (sucks a lot)).

Also, are the typecasts in there necessary or could they be removed
and the full-width product of the two digits still be obtained?

This is the code:

---
/* Multiply two digit vectors.
* Parameters:
* vec0: Reference to destination vector.
* vec1: Reference to vector to multiply.
* vec2: Reference to vector to multiply by.
* length1: Length of vec1
* length2: Length of vec2
*
* Returns: Error code
*
* Operation: vec0 = vec1 * vec2.
*
* Note: vec0 must be able to hold length1 + length2 worth of
* elements.
*/
FG3DError FG3DDigitVector_Mul(std::vector<Digit> &vec0,
const std::vector<Digit> &vec1,
const std::vector<Digit> &vec2,
int length1, int length2)
{
/* Set up temporary digit buffer */
Digit tmpbuf[2*MAX_PRECISION]; // use array instead of vector
to
// maximize performance. Vector
// takes forever to constantly
de/
// reallocate.

/* Safety */
if(length1 > MAX_PRECISION)
return(FG3DError(FG3D_INVALID_PRECISION, length1));
if(length2 > MAX_PRECISION)
return(FG3DError(FG3D_INVALID_PRECISION, length2));

/* Clear buffer */
std::fill(tmpbuf, tmpbuf + length1 + length2, 0);

/* Set up iterators */
std::vector<Digit>::iterator v0i(vec0.begin());
std::vector<Digit>::const_iterator v1i(vec1.begin());
std::vector<Digit>::const_iterator v2i(vec2.begin());
std::vector<Digit>::const_iterator v2init(v2i);

/* Do the multiplication */
for(int i(0);i<length1;i++,++v1i)
{
TwoDigit carry;
carry = 0;
v2i = v2init;
for(int j(0);j<length2;j++,++v2i)
{
TwoDigit
tmp((static_cast<TwoDigit>(*v1i)*static_cast<TwoDigit>(*v2i))
+static_cast<TwoDigit>(tmpbuf[i+j])+carry);
carry = tmp>>DIGIT_SIZE;
tmp &= MAX_DIGIT;
tmpbuf[i+j] = tmp;
}
tmpbuf[i+length2] = carry;
}

/* Copy result back */
vec0.reserve(length1 + length2);
std::copy(tmpbuf, tmpbuf + length1 + length2, vec0.begin());
}---
 
A

Alf P. Steinbach

* mike3:
I've been reworking a bignum package I made a bit. I've got this thing
here that multiplies two strings of digits using a simple "grade
school" algorithm, and it seems I can't help but use an array in there
for a temporary buffer, as reallocating an std::vector<> over and over
again takes lots of time, and storing a global buffer would be a
problem since my program will have multithreading. (Although I suppose
one could allocate a certain amount of global buffers equal to the
maximum amount of threads one could run and then pass an index
parameter to each routine but that seems kind of silly. And how do you
keep track of all these buffers especially with overloaded operators
where you can't pass a parameter?! Eww.) Can I avoid the use of the
array and still get maximum performance? Or is this the case when
arrays are appropriate? The point of the temporary buffer is to allow
for this to be used as an "in-place" multiplication procedure as well
as out-of-place. There are other places in the bignum package where
similar array-vs-vector performance considerations arise involving
temporary buffers (there's a floating point addition/subtraction
routine that requires shifting the bits in one of the operands and to
avoid messing up the original it creates a temporary buffer... Also,
that buffer needs to be able to be passed to routines similar to this
one which accept vectors... Should I just give up and use arrays
everywhere for this?! (sucks a lot)).

Also, are the typecasts in there necessary or could they be removed
and the full-width product of the two digits still be obtained?

This is the code:

---
/* Multiply two digit vectors.
* Parameters:
* vec0: Reference to destination vector.
* vec1: Reference to vector to multiply.
* vec2: Reference to vector to multiply by.
* length1: Length of vec1
* length2: Length of vec2
*
* Returns: Error code
*
* Operation: vec0 = vec1 * vec2.
*
* Note: vec0 must be able to hold length1 + length2 worth of
* elements.
*/
FG3DError FG3DDigitVector_Mul(std::vector<Digit> &vec0,
const std::vector<Digit> &vec1,
const std::vector<Digit> &vec2,
int length1, int length2)

Why do you have the length arguments?

A std::vector has a length, accessible via size() and resize().

{
/* Set up temporary digit buffer */
Digit tmpbuf[2*MAX_PRECISION]; // use array instead of vector
to
// maximize performance. Vector
// takes forever to constantly
de/
// reallocate.

Have you measured this?

Anyway, reserve ALL UPPERCASE for macros.

If MAX_PRECISION is a macro, make it a typed constant.

/* Safety */
if(length1 > MAX_PRECISION)
return(FG3DError(FG3D_INVALID_PRECISION, length1));
if(length2 > MAX_PRECISION)
return(FG3DError(FG3D_INVALID_PRECISION, length2));

Ensure class invariants in operations so that you don't have to pepper
the code with checks all over the place.

In other words,

class Bignum
{
private:
std::vector<Digit> myDigits;
public:
// Whatever, all operations assume a max length
// and ensure that max length.
};

/* Clear buffer */
std::fill(tmpbuf, tmpbuf + length1 + length2, 0);

Just zero-initialize the array if you're using an array,

Digit tmpbuf[whatever] = {0};

/* Set up iterators */
std::vector<Digit>::iterator v0i(vec0.begin());
std::vector<Digit>::const_iterator v1i(vec1.begin());
std::vector<Digit>::const_iterator v2i(vec2.begin());
std::vector<Digit>::const_iterator v2init(v2i);

/* Do the multiplication */
for(int i(0);i<length1;i++,++v1i)
{
TwoDigit carry;
carry = 0;
v2i = v2init;
for(int j(0);j<length2;j++,++v2i)
{
TwoDigit
tmp((static_cast<TwoDigit>(*v1i)*static_cast<TwoDigit>(*v2i))
+static_cast<TwoDigit>(tmpbuf[i+j])+carry);

This looks rather suspicious.

However, you don't provide definitions of Digit and TwoDigit, so
difficult to say.

carry = tmp>>DIGIT_SIZE;
tmp &= MAX_DIGIT;
tmpbuf[i+j] = tmp;
}
tmpbuf[i+length2] = carry;
}

/* Copy result back */
vec0.reserve(length1 + length2);
std::copy(tmpbuf, tmpbuf + length1 + length2, vec0.begin());

Don't modify out-arguments directly.

Create the result in a local variable, then at the end swap().

For exception safety.


Cheers, & hth.,

- Alf
 
W

werasm

Hi.

I've been reworking a bignum package I made a bit. I've got this thing
here that multiplies two strings of digits using a simple "grade
school" algorithm, and it seems I can't help but use an array in there
for a temporary buffer, as reallocating an std::vector<> over and over
again takes lots of time, and storing a global buffer would be a
problem since my program will have multithreading. (Although I suppose
one could allocate a certain amount of global buffers equal to the
maximum amount of threads one could run and then pass an index
parameter to each routine but that seems kind of silly. And how do you
keep track of all these buffers especially with overloaded operators
where you can't pass a parameter?! Eww.) Can I avoid the use of the
array and still get maximum performance? Or is this the case when
arrays are appropriate? The point of the temporary buffer is to allow
for this to be used as an "in-place" multiplication procedure as well
as out-of-place. There are other places in the bignum package where
similar array-vs-vector performance considerations arise involving
temporary buffers (there's a floating point addition/subtraction
routine that requires shifting the bits in one of the operands and to
avoid messing up the original it creates a temporary buffer... Also,
that buffer needs to be able to be passed to routines similar to this
one which accept vectors... Should I just give up and use arrays
everywhere for this?! (sucks a lot)).

No, if you want to be standard compliant (at least), you have to
use vectors, as C++ does not support arrays with variable length.

If I compile the following under Comeau...

struct Digits{};

void foo( int sz )
{
Digits digits[sz];
}

int main()
{
static const int sz( 10 );
foo( sz );
}


.... I get the failure:

"ComeauTest.c", line 5: error: expression must have a constant value
Digits digits[sz];

For the multithreaded case you might consider a scheme where you
have a linked list of pointers to vectors. Initially the list
is empty, then you allocated one. Next time round you just pop
from the list. Whenever the list is empty, you allocate. Therefore
the maximum allocated equals the maximum amount of threads
contending at one instance (Obviously pop and push from the list
is guarded). The may even be allocators doing this for you already.

Regards,

Werner
 
M

Marco Manfredini

mike3 said:
Hi.

I've been reworking a bignum package I made a bit. I've got this thing
here that multiplies two strings of digits using a simple "grade
school" algorithm, and it seems I can't help but use an array in there
for a temporary buffer, as reallocating an std::vector<> over and over

Well, if you really want in-place multiplication you can try this:
Arrange your loops in such a way, that the output digits (r) are
computed from 0 straight up. That is, for each n in your *result*,
starting at 0 sum up all the a_i*b_j where i+j=n. You will need a finite
number (usually 2, unless you want to multiply verrrry long numbers) of
temporaries to hold the overflow that will leap to the next few digits.
I'm leaving it to you how juggle these. Now look at this example
multiplication r=a*b where len(a)==3 and len(b)==4

r_0 = a_0*b_0
r_1 = a_0*b_1 + a_1*b_0 + overflow
r_2 = a_0*b_2 + a_1*b_1 + a_2*b_0 + overflow
r_3 = a_0*b_3 + a_1*b_2 + a_2*b_1 + overflow
r_4 = a_1*b_3 + a_2*b_2 + overflow
r_5 = a_2*b_3 + overflow
r_6 = overflow

You might have noticed that after we are finished with r_3, a_0 is never
read again, and so the computation of r_4 obsoletes a_1 etc..Also,
pushing the carries through the result happens only once. Now I rename r:

a_3 = a_0*b_0
a_4 = a_0*b_1 + a_1*b_0 + overflow
a_5 = a_0*b_2 + a_1*b_1 + a_2*b_0 + overflow
a_6 = a_0*b_3 + a_1*b_2 + a_2*b_1 + overflow // a_0 obsoleted!
a_0 = a_1*b_3 + a_2*b_2 + overflow // a_1 obsoleted!
a_1 = a_2*b_3 + overflow // a_2 obsoleted!
a_2 = overflow

Your result is there, rotate a left by 3 to put it in place.

This should keep you busy.
 
M

mike3

Why do you have the length arguments?

A std::vector has a length, accessible via size() and resize().

I thought resize() is to do just that, resize the vector,
not get it's length. Also, I have then length arguments
so I can use lengths different from those of the vector
if one needs to do that for some reason.
{
/* Set up temporary digit buffer */
Digit tmpbuf[2*MAX_PRECISION]; // use array instead of vector
to
// maximize performance. Vector
// takes forever to constantly
de/
// reallocate.

Have you measured this?

Sure did. Took ~34 secs, for 100,000,000 calls, with all
compiler optimizations turned on.

I just reduced the routine to this:

FG3DError FG3DDigitVector_Mul(std::vector<Digit> &vec0,
const std::vector<Digit> &vec1,
const std::vector<Digit> &vec2,
int length1, int length2)
{
std::vector<Digit> tmpbuf(length1 + length2);
return(FG3DError(FG3D_SUCCESS));
}

and just ran it over and over 100,000,000 times to
see how long that allocate/free takes and I was
stunned to find it wastes 34 seconds, more time than
the entire multiplication loop!
Anyway, reserve ALL UPPERCASE for macros.

If MAX_PRECISION is a macro, make it a typed constant.

It's a macro, #defined as some number equal to the maximum
amount of "Digit"s that we want to allow (right now, it's
set at 128).
Ensure class invariants in operations so that you don't have to pepper
the code with checks all over the place.

In other words,

class Bignum
{
private:
std::vector<Digit> myDigits;
public:
// Whatever, all operations assume a max length
// and ensure that max length.
};

The problem is I need to be able to work on pieces of the
digit string as well as the entire string with some
operations (addition, subtraction). And if the operations
ensure the max length does not that require those if()
checks in them to make sure the length parameter passed
does not exceed the limit?
/* Clear buffer */
std::fill(tmpbuf, tmpbuf + length1 + length2, 0);

Just zero-initialize the array if you're using an array,

Digit tmpbuf[whatever] = {0};

Alright.
/* Set up iterators */
std::vector<Digit>::iterator v0i(vec0.begin());
std::vector<Digit>::const_iterator v1i(vec1.begin());
std::vector<Digit>::const_iterator v2i(vec2.begin());
std::vector<Digit>::const_iterator v2init(v2i);
/* Do the multiplication */
for(int i(0);i<length1;i++,++v1i)
{
TwoDigit carry;
carry = 0;
v2i = v2init;
for(int j(0);j<length2;j++,++v2i)
{
TwoDigit
tmp((static_cast<TwoDigit>(*v1i)*static_cast<TwoDigit>(*v2i))
+static_cast<TwoDigit>(tmpbuf[i+j])+carry);

This looks rather suspicious.

However, you don't provide definitions of Digit and TwoDigit, so
difficult to say.

TwoDigit can be any unsigned type that has twice the bitlength
of Digit. You could use unsigned int (32 bit) for TwoDigit,
unsigned short (16 bit) for Digit, for example.
carry = tmp>>DIGIT_SIZE;
tmp &= MAX_DIGIT;
tmpbuf[i+j] = tmp;
}
tmpbuf[i+length2] = carry;
}
/* Copy result back */
vec0.reserve(length1 + length2);
std::copy(tmpbuf, tmpbuf + length1 + length2, vec0.begin());

Don't modify out-arguments directly.

Create the result in a local variable, then at the end swap().

For exception safety.

The result is already being built in the temp buffer, that's the
point. And does this mean one cannot resize the output vector if
it cannot hold the full size product? Or will swap() do that
automatically? Is swap() slower than the above?
 
A

Alf P. Steinbach

* mike3:
I thought resize() is to do just that, resize the vector,
not get it's length.

It sets the length. Plus allocates if necessary, and initializes
elements that become accessible.

Also, I have then length arguments
so I can use lengths different from those of the vector
if one needs to do that for some reason.

Yeah, but /do/ you need that?

Without needing it the passed lengths are just one more thing that can
go wrong.

{
/* Set up temporary digit buffer */
Digit tmpbuf[2*MAX_PRECISION]; // use array instead of vector
to
// maximize performance. Vector
// takes forever to constantly
de/
// reallocate.
Have you measured this?

Sure did. Took ~34 secs, for 100,000,000 calls, with all
compiler optimizations turned on.

I just reduced the routine to this:

FG3DError FG3DDigitVector_Mul(std::vector<Digit> &vec0,
const std::vector<Digit> &vec1,
const std::vector<Digit> &vec2,
int length1, int length2)
{
std::vector<Digit> tmpbuf(length1 + length2);
return(FG3DError(FG3D_SUCCESS));
}

and just ran it over and over 100,000,000 times to
see how long that allocate/free takes and I was
stunned to find it wastes 34 seconds, more time than
the entire multiplication loop!

Well, it comes down to usage pattern.

If the out-parameter is generally an empty vector or of less capacity
than required for the result, then there will be an allocation anyway in
most cases. And then the most efficient will generally be to declare
the temporary as a vector, and swap at the end.

On the other hand, if the out-parameter is generally of sufficient
capacity, then from a local point of view the most efficient will
generally be to reuse that capacity, i.e. to use a raw array for the
temporary or to build the result right into the out-parameter.

However, reuse of capacity runs the risk of "leaking" memory in the
sense that the capacity is never lowered, only increasing.

So measuring for actual usage, typical usage pattern, is critical.


[snip]
The problem is I need to be able to work on pieces of the
digit string as well as the entire string with some
operations (addition, subtraction).

Sounds like slices.

The standard library isn't very well equipped when it comes to slices:
you may have to build that yourself.

There is std::valarray, which I think supports slices, but that's a
not-quite-finished piece of the library. If it works, great. If not...

And if the operations
ensure the max length does not that require those if()
checks in them to make sure the length parameter passed
does not exceed the limit?

No, the point is to not pass separate length parameters, that whatever
object you're passing knows its length.

But again, it sounds as if you're dealing with slices.

Then, if so, then Bignum (or whatever) needs to support slices,
efficiently. In the operation that creates a logical slice there will
be a length check. But that will then be the only one, centralized.


[snip]
/* Do the multiplication */
for(int i(0);i<length1;i++,++v1i)
{
TwoDigit carry;
carry = 0;
v2i = v2init;
for(int j(0);j<length2;j++,++v2i)
{
TwoDigit
tmp((static_cast<TwoDigit>(*v1i)*static_cast<TwoDigit>(*v2i))
+static_cast<TwoDigit>(tmpbuf[i+j])+carry);
This looks rather suspicious.

However, you don't provide definitions of Digit and TwoDigit, so
difficult to say.

TwoDigit can be any unsigned type that has twice the bitlength
of Digit. You could use unsigned int (32 bit) for TwoDigit,
unsigned short (16 bit) for Digit, for example.

Well, the ordinary arithmetic promotions were designed to handle this
kind of stuff.

It's enough that one operand of an operation is of type TwoDigit in
order to force the other operand to be promoted.

This is one case where I'd forsake the newfangled casts for clarity,

TwoDigit tmp = TwoDigit(*v1i)*(*v2i) + tmpbuf[i+j] + carry;

"TwoDigit(*v1i)" is a syntax that was introduced in C++ to make it look
like a constructor call, but is defined for a primitive type as
equivalent to a C style cast "((TwoDigit)*v1i)". I.e. in some other
context it could invoke reinterpret_cast or const_cast or even cast to
inaccessible base. I don't know why it wasn't defined as static_cast.
Probably an oversight or misunderstanding: they bungled it.

carry = tmp>>DIGIT_SIZE;
tmp &= MAX_DIGIT;
tmpbuf[i+j] = tmp;
}
tmpbuf[i+length2] = carry;
}
/* Copy result back */
vec0.reserve(length1 + length2);
std::copy(tmpbuf, tmpbuf + length1 + length2, vec0.begin());
Don't modify out-arguments directly.

Create the result in a local variable, then at the end swap().

For exception safety.

The result is already being built in the temp buffer, that's the
point. And does this mean one cannot resize the output vector if
it cannot hold the full size product? Or will swap() do that
automatically? Is swap() slower than the above?

You can always resize. Whether swap is more or less efficient depends
on the context, in effect on the usage pattern for the calls of this
function. However, given your measurements I think what I wrote above
is too categorical: you'll need to try out both approaches (array temp +
resize + copy, or vector temp + swap) in the actual typical usage,
measuring.

Happily there's not much difference regarding the rest of the code.


Cheers, & hth,.

- Alf
 
M

mike3

* mike3:



It sets the length. Plus allocates if necessary, and initializes
elements that become accessible.

Oh, so then size() gets the lenth, resize() resets the length.
Yeah, but /do/ you need that?

Well, not for the multiplication, so maybe I can get rid of it there.
But for the addition and subtraction, I need variable legnths _and_
variable starting offsets so I can manipulate pieces of the digit
strings (to allow for the bit/word shift used when adding floating
point).
Without needing it the passed lengths are just one more thing that can
go wrong.

Is it possible though the std::vector may expand/shrink on me and I
have to update the "length" field in my bignums accordingly? Or be
prepared to handle operands of differing lengths going into the
routine (which is bad for the addition/subtraction as every operation
counts in terms of speed and coding up things to handle varying
lengths when in the floating point arithmetic all passed lengths of
digit slices _should_ be the same is a waste of performance AND a
silly way to code something! Why put in what you don't use?)?!
{
/* Set up temporary digit buffer */
Digit tmpbuf[2*MAX_PRECISION]; // use array instead of vector
to
// maximize performance. Vector
// takes forever to constantly
de/
// reallocate.
Have you measured this?
Sure did. Took ~34 secs, for 100,000,000 calls, with all
compiler optimizations turned on.
I just reduced the routine to this:
FG3DError FG3DDigitVector_Mul(std::vector<Digit> &vec0,
const std::vector<Digit> &vec1,
const std::vector<Digit> &vec2,
int length1, int length2)
{
std::vector<Digit> tmpbuf(length1 + length2);
return(FG3DError(FG3D_SUCCESS));
}
and just ran it over and over 100,000,000 times to
see how long that allocate/free takes and I was
stunned to find it wastes 34 seconds, more time than
the entire multiplication loop!

Well, it comes down to usage pattern.

If the out-parameter is generally an empty vector or of less capacity
than required for the result, then there will be an allocation anyway in
most cases. And then the most efficient will generally be to declare
the temporary as a vector, and swap at the end.

But the simple act of declaring the vector (as I showed above) takes
mammoth amounts of time -- unreasonable amounts, in fact.

On the other hand, if the out-parameter is generally of sufficient
capacity, then from a local point of view the most efficient will
generally be to reuse that capacity, i.e. to use a raw array for the
temporary or to build the result right into the out-parameter.

This is the case I have. These low-level digit routines
are used as building blocks to build a floating point
bignum package. The out-parameter _is_ of sufficient
capacity, as it's a temporary buffer _inside the floating
point routine_. So I suppose I could scrap the
buffer in the mul routine above entirely. But that of
course raises the question of how to handle the buffer
in the floating point routine. How could I do that?
However, reuse of capacity runs the risk of "leaking" memory in the
sense that the capacity is never lowered, only increasing.

In the floating point library, the precision/"capacity"
of all operands should be fixed, not increasing _or_
decreasing as if this happens too frequently (as in
more than almost never) it's going to cut into the
performance of the fractal calculation.
So measuring for actual usage, typical usage pattern, is critical.

But what this shows is the mere act of creating the temporary buffer
consumes ludicrous amounts of time.
This is an application in which every last bit of speed
I can get counts. There are calculations that need
to be done involving these operations -- lots of 'em --
for many different data points (and they're all complex
numbers -- don't even get me started on evaluating
transcendental functions). This has ramifications for
the floating point library which needs temporary buffers
to avoid munging operands (in the case of addition/
subtraction with it's bit shift) or to give the
correct result (in the case of multiplication which
requires the upper bits of the 2x-width product).
[snip]


The problem is I need to be able to work on pieces of the
digit string as well as the entire string with some
operations (addition, subtraction).

Sounds like slices.

The standard library isn't very well equipped when it comes to slices:
you may have to build that yourself.

How? Any ideas on how to design it for maximum performance?
There is std::valarray, which I think supports slices, but that's a
not-quite-finished piece of the library. If it works, great. If not...



No, the point is to not pass separate length parameters, that whatever
object you're passing knows its length.

But for maximum speed, I'd rather demand all operands
to, say, the addition/subtraction routines have the same
length, as that's all that's going to occur in my floating
point arithmetic. Having all that redundancy not only
wastes performance but just doesn't feel good in my
opinion.
But again, it sounds as if you're dealing with slices.

Then, if so, then Bignum (or whatever) needs to support slices,
efficiently. In the operation that creates a logical slice there will
be a length check. But that will then be the only one, centralized.

Does such an operation cost less than the addition/
subtraction of the digits & the bit shifting combined,
even when repeated billions of times as it is in the
fractal generation?

There are 4 possible cases that can occur:

Floating Point Digit Add/Sub Scenarios:

R = A +/- B

MSD --- LSD
RRRRRRRRRRR
AAAAAAAAAAA
BBBBB (we need 2 slices of A, 1 of B, 2 of R)

RRRRRRRRRRR
AAAAAAAAAAA
BBBB (we need 3 slices of A, 1 of B, 3 of R)

RRRRRRRRRRR
AAAAAAA
BBBBBB (we need 2 slices of A, 2 of B, 3 of R)

RRRRRRRRRRR
AAAAA
BBBB (we need 1 slice of A, 1 of B, 3 of R)

Any one of these cases may occur, although case 1
will be the most frequent due to the numbers involved
having equal precision. That requires 5 slice requests.
With around 5 slice-requiring floating point operations
per fractal iteration, we must do this 25 times per
iteration, and one fractal image may average 12,000
iterations and hence require over 92 billion such
requests.

Well, the ordinary arithmetic promotions were designed to handle this
kind of stuff.

It's enough that one operand of an operation is of type TwoDigit in
order to force the other operand to be promoted.

But doesn't that require a cast? Also, carry is of type
TwoDigit, so should not it promote everything else?
This is one case where I'd forsake the newfangled casts for clarity,

TwoDigit tmp = TwoDigit(*v1i)*(*v2i) + tmpbuf[i+j] + carry;

"TwoDigit(*v1i)" is a syntax that was introduced in C++ to make it look
like a constructor call, but is defined for a primitive type as
equivalent to a C style cast "((TwoDigit)*v1i)". I.e. in some other
context it could invoke reinterpret_cast or const_cast or even cast to
inaccessible base. I don't know why it wasn't defined as static_cast.
Probably an oversight or misunderstanding: they bungled it.

Thanks, I'll do that instead. Although, what about carry?
That's a TwoDigit, would not it automatically do the
promotion and hence one would not even need the thing
around (*v1i)? Or does one operand for each binary operator need to be
TwoDigit (in which case putting it at the beginning ensures all the
successive operations recieve at least one TwoDigit)?
carry = tmp>>DIGIT_SIZE;
tmp &= MAX_DIGIT;
tmpbuf[i+j] = tmp;
}
tmpbuf[i+length2] = carry;
}
/* Copy result back */
vec0.reserve(length1 + length2);
std::copy(tmpbuf, tmpbuf + length1 + length2, vec0.begin());
Don't modify out-arguments directly.
Create the result in a local variable, then at the end swap().
For exception safety.
The result is already being built in the temp buffer, that's the
point. And does this mean one cannot resize the output vector if
it cannot hold the full size product? Or will swap() do that
automatically? Is swap() slower than the above?

You can always resize. Whether swap is more or less efficient depends
on the context, in effect on the usage pattern for the calls of this
function. However, given your measurements I think what I wrote above
is too categorical: you'll need to try out both approaches (array temp +
resize + copy, or vector temp + swap) in the actual typical usage,
measuring.

Well in my usage, it might be possible to entirely avoid the in-place
multiplication in the tight inner fractal loops at least for the
Mandelbrot set (that takes 1 complex multiplication, equivalent to 3
real multiplications), however for floating point multiplication one
still needs a temporary buffer to hold the 2x-width product of the
mantissas which is is then truncated to 1x-width and rounded. So that
simply moves the problem outside the multiplication routine I gave and
into that of it's caller.
 
T

terminator

Hi.

I've been reworking a bignum package I made a bit. I've got this thing
here that multiplies two strings of digits using a simple "grade
school" algorithm, and it seems I can't help but use an array in there
for a temporary buffer, as reallocating an std::vector<> over and over
again takes lots of time, and storing a global buffer would be a
problem since my program will have multithreading. (Although I suppose
one could allocate a certain amount of global buffers equal to the
maximum amount of threads one could run and then pass an index
parameter to each routine but that seems kind of silly. And how do you
keep track of all these buffers especially with overloaded operators
where you can't pass a parameter?! Eww.) Can I avoid the use of the
array and still get maximum performance? Or is this the case when
arrays are appropriate? The point of the temporary buffer is to allow
for this to be used as an "in-place" multiplication procedure as well
as out-of-place. There are other places in the bignum package where
similar array-vs-vector performance considerations arise involving
temporary buffers (there's a floating point addition/subtraction
routine that requires shifting the bits in one of the operands and to
avoid messing up the original it creates a temporary buffer... Also,
that buffer needs to be able to be passed to routines similar to this
one which accept vectors... Should I just give up and use arrays
everywhere for this?! (sucks a lot)).

Also, are the typecasts in there necessary or could they be removed
and the full-width product of the two digits still be obtained?

This is the code:

---
/* Multiply two digit vectors.
* Parameters:
* vec0: Reference to destination vector.
* vec1: Reference to vector to multiply.
* vec2: Reference to vector to multiply by.
* length1: Length of vec1
* length2: Length of vec2
*
* Returns: Error code
*
* Operation: vec0 = vec1 * vec2.
*
* Note: vec0 must be able to hold length1 + length2 worth of
* elements.
*/
FG3DError FG3DDigitVector_Mul(std::vector<Digit> &vec0,
const std::vector<Digit> &vec1,
const std::vector<Digit> &vec2,
int length1, int length2)
{
/* Set up temporary digit buffer */
Digit tmpbuf[2*MAX_PRECISION]; // use array instead of vector
to
// maximize performance. Vector
// takes forever to constantly
de/
// reallocate.

/* Safety */
if(length1 > MAX_PRECISION)
return(FG3DError(FG3D_INVALID_PRECISION, length1));
if(length2 > MAX_PRECISION)
return(FG3DError(FG3D_INVALID_PRECISION, length2));
/*Very bad. I would do this instead:*/

const size_t len=vec1.size()+vec2.size();
vector<Digit> tmpbuf(len);//to be swapped.

/*if you really need to check you can do this instead:*/

vector<Digit> tmpbuf;
const size_t MaxSize = tmpbuf.get_allocator().max_size();
if(len > std::min( MaxSize , MAX_PRECISION ))
throw FG3DError(FG3D_INVALID_PRECISION, len);
tmpbuf.reserve(len);

Do not do this:
/* Clear buffer */
std::fill(tmpbuf, tmpbuf + length1 + length2, 0);
No need to do that .vector has already done it.

Do not do this:
/* Copy result back */
vec0.reserve(length1 + length2);
std::copy(tmpbuf, tmpbuf + length1 + length2, vec0.begin());

/*you can simply swap instead of duplicating via copy:*/

vec0.swap(tmpbuf);//tmpbuf needs to be a vector<Digit>.

operations other than multiplication may reduce the size so you may
need to do this proir to above:

typedef vector<Digit>::reverse_iterator Rit;
const Rit rstart=tmpbuf.rbegin();
const Rit rfine=tmpbuf.rend();
const Rit
it=find_if(rstart,rfine,std::bind1st(std::not_equal_to(),0));
const size_t newlen = rfine-it;
if(newlen!=len)
tmpbuf.resize(newlen);

putting an intrinsic array on stack is really bad in that it increases
the risk of stack overflow which is usually an irricoverable exception
that leads to crul termination of the program.but vector puts the data
on the heap.

regards,
FM.
 
T

terminator

Why do you have the length arguments?
A std::vector has a length, accessible via size() and resize().

I thought resize() is to do just that, resize the vector,
not get it's length. Also, I have then length arguments
so I can use lengths different from those of the vector
if one needs to do that for some reason.
{
/* Set up temporary digit buffer */
Digit tmpbuf[2*MAX_PRECISION]; // use array instead of vector
to
// maximize performance. Vector
// takes forever to constantly
de/
// reallocate.
Have you measured this?

Sure did. Took ~34 secs, for 100,000,000 calls, with all
compiler optimizations turned on.

I just reduced the routine to this:

FG3DError FG3DDigitVector_Mul(std::vector<Digit> &vec0,
const std::vector<Digit> &vec1,
const std::vector<Digit> &vec2,
int length1, int length2)
{
std::vector<Digit> tmpbuf(length1 + length2);
return(FG3DError(FG3D_SUCCESS));

}

and just ran it over and over 100,000,000 times to
see how long that allocate/free takes and I was
stunned to find it wastes 34 seconds, more time than
the entire multiplication loop!
Anyway, reserve ALL UPPERCASE for macros.
If MAX_PRECISION is a macro, make it a typed constant.

It's a macro, #defined as some number equal to the maximum
amount of "Digit"s that we want to allow (right now, it's
set at 128).






Ensure class invariants in operations so that you don't have to pepper
the code with checks all over the place.
In other words,
class Bignum
{
private:
std::vector<Digit> myDigits;
public:
// Whatever, all operations assume a max length
// and ensure that max length.
};

The problem is I need to be able to work on pieces of the
digit string as well as the entire string with some
operations (addition, subtraction). And if the operations
ensure the max length does not that require those if()
checks in them to make sure the length parameter passed
does not exceed the limit?
Just zero-initialize the array if you're using an array,
Digit tmpbuf[whatever] = {0};
Alright.






/* Set up iterators */
std::vector<Digit>::iterator v0i(vec0.begin());
std::vector<Digit>::const_iterator v1i(vec1.begin());
std::vector<Digit>::const_iterator v2i(vec2.begin());
std::vector<Digit>::const_iterator v2init(v2i);
/* Do the multiplication */
for(int i(0);i<length1;i++,++v1i)
{
TwoDigit carry;
carry = 0;
v2i = v2init;
for(int j(0);j<length2;j++,++v2i)
{
TwoDigit
tmp((static_cast<TwoDigit>(*v1i)*static_cast<TwoDigit>(*v2i))
+static_cast<TwoDigit>(tmpbuf[i+j])+carry);
This looks rather suspicious.
However, you don't provide definitions of Digit and TwoDigit, so
difficult to say.

TwoDigit can be any unsigned type that has twice the bitlength
of Digit. You could use unsigned int (32 bit) for TwoDigit,
unsigned short (16 bit) for Digit, for example.




carry = tmp>>DIGIT_SIZE;
tmp &= MAX_DIGIT;
tmpbuf[i+j] = tmp;
}
tmpbuf[i+length2] = carry;
}
/* Copy result back */
vec0.reserve(length1 + length2);
std::copy(tmpbuf, tmpbuf + length1 + length2, vec0.begin());
Don't modify out-arguments directly.
Create the result in a local variable, then at the end swap().
For exception safety.

The result is already being built in the temp buffer, that's the
point. And does this mean one cannot resize the output vector if
it cannot hold the full size product? Or will swap() do that
automatically? Is swap() slower than the above?
turninng debug switch off I found the following two sequences almost
equivalent in time:

1.vector buf ; swap ;
2. array buf ; fill ; result.reserve ; copy ;

#include <iostream>
#include <conio.h>
#include <vector>
#include <string>
#include <Ctime>
using namespace std;

enum {sz=1000,loop=100*1000};

void vec1(){
vector <int> v(sz),dest;
v.swap(dest);
}

void vec0(){
vector <int> v,dest;
v.reserve(sz);
fill(v.begin(),v.begin()+v.capacity(),0);
v.swap(dest);
}

void vec2(){
vector <int> v(sz),dest;
swap(dest,v);
}

void arr(){
int arr[sz];
vector <int> dest;
fill(arr,&(arr[sz]),0);
dest.reserve(sz);
copy(arr,&arr[sz],dest.begin());
}

template <typename pred>
void test(pred f,string txt){
cout<<("testing:"+txt)<<endl;
const clock_t c=clock();
for(int i=0;i<loop;i++)
f();
cout<<clock()-c<<" ms"<<endl;
}

int main()
{
vector<A> v1(sz);
cout << "\nvector * i=" << A::get() << endl ;
A::get()=0;
vector<A> v2;
v2.reserve(sz);
cout << "reserve * i=" << A::get() << endl ;
test(vec0,"reserve");
test(vec1,"vector");
test(vec2,"std::swap");
test(arr,"array");
getch();
return 0;
}

output:

testing:reserve
1172 ms
testing:vector
1141 ms
testing:std::swap
1125 ms
testing:array
1141 ms


regards,
FM.
 
T

terminator

It sets the length. Plus allocates if necessary, and initializes
elements that become accessible.

Oh, so then size() gets the lenth, resize() resets the length.
Yeah, but /do/ you need that?

Well, not for the multiplication, so maybe I can get rid of it there.
But for the addition and subtraction, I need variable legnths _and_
variable starting offsets so I can manipulate pieces of the digit
strings (to allow for the bit/word shift used when adding floating
point).
Without needing it the passed lengths are just one more thing that can
go wrong.

Is it possible though the std::vector may expand/shrink on me and I
have to update the "length" field in my bignums accordingly? Or be
prepared to handle operands of differing lengths going into the
routine (which is bad for the addition/subtraction as every operation
counts in terms of speed and coding up things to handle varying
lengths when in the floating point arithmetic all passed lengths of
digit slices _should_ be the same is a waste of performance AND a
silly way to code something! Why put in what you don't use?)?!


{
/* Set up temporary digit buffer */
Digit tmpbuf[2*MAX_PRECISION]; // use array instead of vector
to
// maximize performance. Vector
// takes forever to constantly
de/
// reallocate.
Have you measured this?
Sure did. Took ~34 secs, for 100,000,000 calls, with all
compiler optimizations turned on.
I just reduced the routine to this:
FG3DError FG3DDigitVector_Mul(std::vector<Digit> &vec0,
const std::vector<Digit> &vec1,
const std::vector<Digit> &vec2,
int length1, int length2)
{
std::vector<Digit> tmpbuf(length1 + length2);
return(FG3DError(FG3D_SUCCESS));
}
and just ran it over and over 100,000,000 times to
see how long that allocate/free takes and I was
stunned to find it wastes 34 seconds, more time than
the entire multiplication loop!
Well, it comes down to usage pattern.
If the out-parameter is generally an empty vector or of less capacity
than required for the result, then there will be an allocation anyway in
most cases. And then the most efficient will generally be to declare
the temporary as a vector, and swap at the end.

But the simple act of declaring the vector (as I showed above) takes
mammoth amounts of time -- unreasonable amounts, in fact.
On the other hand, if the out-parameter is generally of sufficient
capacity, then from a local point of view the most efficient will
generally be to reuse that capacity, i.e. to use a raw array for the
temporary or to build the result right into the out-parameter.

This is the case I have. These low-level digit routines
are used as building blocks to build a floating point
bignum package. The out-parameter _is_ of sufficient
capacity, as it's a temporary buffer _inside the floating
point routine_. So I suppose I could scrap the
buffer in the mul routine above entirely. But that of
course raises the question of how to handle the buffer
in the floating point routine. How could I do that?
However, reuse of capacity runs the risk of "leaking" memory in the
sense that the capacity is never lowered, only increasing.

In the floating point library, the precision/"capacity"
of all operands should be fixed, not increasing _or_
decreasing as if this happens too frequently (as in
more than almost never) it's going to cut into the
performance of the fractal calculation.
So measuring for actual usage, typical usage pattern, is critical.

But what this shows is the mere act of creating the temporary buffer
consumes ludicrous amounts of time.
This is an application in which every last bit of speed
I can get counts. There are calculations that need
to be done involving these operations -- lots of 'em --
for many different data points (and they're all complex
numbers -- don't even get me started on evaluating
transcendental functions). This has ramifications for
the floating point library which needs temporary buffers
to avoid munging operands (in the case of addition/
subtraction with it's bit shift) or to give the
correct result (in the case of multiplication which
requires the upper bits of the 2x-width product).
[snip]
Ensure class invariants in operations so that you don't have to pepper
the code with checks all over the place.
In other words,
class Bignum
{
private:
std::vector<Digit> myDigits;
public:
// Whatever, all operations assume a max length
// and ensure that max length.
};
The problem is I need to be able to work on pieces of the
digit string as well as the entire string with some
operations (addition, subtraction).
Sounds like slices.
The standard library isn't very well equipped when it comes to slices:
you may have to build that yourself.

How? Any ideas on how to design it for maximum performance?
There is std::valarray, which I think supports slices, but that's a
not-quite-finished piece of the library. If it works, great. If not...
No, the point is to not pass separate length parameters, that whatever
object you're passing knows its length.

But for maximum speed, I'd rather demand all operands
to, say, the addition/subtraction routines have the same
length, as that's all that's going to occur in my floating
point arithmetic. Having all that redundancy not only
wastes performance but just doesn't feel good in my
opinion.
But again, it sounds as if you're dealing with slices.
Then, if so, then Bignum (or whatever) needs to support slices,
efficiently. In the operation that creates a logical slice there will
be a length check. But that will then be the only one, centralized.

Does such an operation cost less than the addition/
subtraction of the digits & the bit shifting combined,
even when repeated billions of times as it is in the
fractal generation?

There are 4 possible cases that can occur:

Floating Point Digit Add/Sub Scenarios:

R = A +/- B

MSD --- LSD
RRRRRRRRRRR
AAAAAAAAAAA
BBBBB (we need 2 slices of A, 1 of B, 2 of R)

RRRRRRRRRRR
AAAAAAAAAAA
BBBB (we need 3 slices of A, 1 of B, 3 of R)

RRRRRRRRRRR
AAAAAAA
BBBBBB (we need 2 slices of A, 2 of B, 3 of R)

RRRRRRRRRRR
AAAAA
BBBB (we need 1 slice of A, 1 of B, 3 of R)

Any one of these cases may occur, although case 1
will be the most frequent due to the numbers involved
having equal precision. That requires 5 slice requests.
With around 5 slice-requiring floating point operations
per fractal iteration, we must do this 25 times per
iteration, and one fractal image may average 12,000
iterations and hence require over 92 billion such
requests.

Well, the ordinary arithmetic promotions were designed to handle this
kind of stuff.
It's enough that one operand of an operation is of type TwoDigit in
order to force the other operand to be promoted.

But doesn't that require a cast? Also, carry is of type
TwoDigit, so should not it promote everything else?
This is one case where I'd forsake the newfangled casts for clarity,
TwoDigit tmp = TwoDigit(*v1i)*(*v2i) + tmpbuf[i+j] + carry;
"TwoDigit(*v1i)" is a syntax that was introduced in C++ to make it look
like a constructor call, but is defined for a primitive type as
equivalent to a C style cast "((TwoDigit)*v1i)". I.e. in some other
context it could invoke reinterpret_cast or const_cast or even cast to
inaccessible base. I don't know why it wasn't defined as static_cast.
Probably an oversight or misunderstanding: they bungled it.

Thanks, I'll do that instead. Although, what about carry?
That's a TwoDigit, would not it automatically do the
promotion and hence one would not even need the thing
around (*v1i)? Or does one operand for each binary operator need to be
TwoDigit (in which case putting it at the beginning ensures all the
successive operations recieve at least one TwoDigit)?


carry = tmp>>DIGIT_SIZE;
tmp &= MAX_DIGIT;
tmpbuf[i+j] = tmp;
}
tmpbuf[i+length2] = carry;
}
/* Copy result back */
vec0.reserve(length1 + length2);
std::copy(tmpbuf, tmpbuf + length1 + length2, vec0.begin());
Don't modify out-arguments directly.
Create the result in a local variable, then at the end swap().
For exception safety.
The result is already being built in the temp buffer, that's the
point. And does this mean one cannot resize the output vector if
it cannot hold the full size product? Or will swap() do that
automatically? Is swap() slower than the above?
You can always resize. Whether swap is more or less efficient depends
on the context, in effect on the usage pattern for the calls of this
function. However, given your measurements I think what I wrote above
is too categorical: you'll need to try out both approaches (array temp +
resize + copy, or vector temp + swap) in the actual typical usage,
measuring.

Well in my usage, it might be possible to entirely avoid the in-place
multiplication in the tight inner fractal loops at least for the
Mandelbrot set (that takes 1 complex multiplication, equivalent to 3
real multiplications), however for floating point multiplication one
still needs a temporary buffer to hold the 2x-width product of the
mantissas which is is then truncated to 1x-width and rounded. So that
simply moves the problem outside the multiplication routine I gave and
into that of it's caller.> Happily there's not much difference regarding the rest of the code.

testing the following urged me that the you need to pay a time
overhead because you can not put a large number of large objects on
stack and if you use dynamic allocation(as vector does) then
allocation and deallocation costs you some time. In place calculation
is the only case that does not demand dynamic allocation and only for
this one case using a normal array placed on the stack can
help.Putting arrays on stack is not normally a good idea .

void arr3(){
int *arr=new int[sz];
fill(arr,&(arr[sz]),0);
delete arr;
};

void vec4(){
vector <int> v;
v.reserve(sz);
};

void vec5(){
vector <int> v;
v.reserve(sz);
v.insert(v.begin(),sz,0);
};

void vec6(){
vector <int> v;
v.insert(v.begin(),sz,0);
};

void vec7(){
vector <int> v;
v.resize(sz);
};

ps:
reserve dose not add elements to vector,it just increase current
capacity, so use insert or resize to add elements to the vector.

regards,
FM.
 
M

mike3

testing the following urged me that the you need to pay a time
overhead because you can not put a large number of large objects on
stack and if you use dynamic allocation(as vector does) then
allocation and deallocation costs you some time. In place calculation
is the only case that does not demand dynamic allocation and only for
this one case using a normal array placed on the stack can
help.Putting arrays on stack is not normally a good idea .

So then in this case, if I want to maximize performance, I'll
need to use an array instead of vector?
 
M

mike3

turninng debug switch off I found the following two sequences almost
equivalent in time:

1.vector buf ; swap ;
2. array buf ; fill ; result.reserve ; copy ;

#include <iostream>
#include <conio.h>
#include <vector>
#include <string>
#include <Ctime>
using namespace std;

enum {sz=1000,loop=100*1000};

void vec1(){
vector <int> v(sz),dest;
v.swap(dest);

}

void vec0(){
vector <int> v,dest;
v.reserve(sz);
fill(v.begin(),v.begin()+v.capacity(),0);
v.swap(dest);

}

void vec2(){
vector <int> v(sz),dest;
swap(dest,v);

}

void arr(){
int arr[sz];
vector <int> dest;
fill(arr,&(arr[sz]),0);
dest.reserve(sz);
copy(arr,&arr[sz],dest.begin());

}

template <typename pred>
void test(pred f,string txt){
cout<<("testing:"+txt)<<endl;
const clock_t c=clock();
for(int i=0;i<loop;i++)
f();
cout<<clock()-c<<" ms"<<endl;

}

int main()
{
vector<A> v1(sz);
cout << "\nvector * i=" << A::get() << endl ;
A::get()=0;
vector<A> v2;
v2.reserve(sz);
cout << "reserve * i=" << A::get() << endl ;
test(vec0,"reserve");
test(vec1,"vector");
test(vec2,"std::swap");
test(arr,"array");
getch();
return 0;

}

output:

testing:reserve
1172 ms
testing:vector
1141 ms
testing:std::swap
1125 ms
testing:array
1141 ms

regards,
FM.

I noticed however in even the "array" routine
a vector is created. Have you tried one where
there are _no_ vectors created in the routines?
 
T

terminator

<snip>




turninng debug switch off I found the following two sequences almost
equivalent in time:
1.vector buf ; swap ;
2. array buf ; fill ; result.reserve ; copy ;
#include <iostream>
#include <conio.h>
#include <vector>
#include <string>
#include <Ctime>
using namespace std;
enum {sz=1000,loop=100*1000};
void vec1(){
vector <int> v(sz),dest;
v.swap(dest);

void vec0(){
vector <int> v,dest;
v.reserve(sz);
fill(v.begin(),v.begin()+v.capacity(),0);
v.swap(dest);

void vec2(){
vector <int> v(sz),dest;
swap(dest,v);

void arr(){
int arr[sz];
vector <int> dest;
fill(arr,&(arr[sz]),0);
dest.reserve(sz);
copy(arr,&arr[sz],dest.begin());

template <typename pred>
void test(pred f,string txt){
cout<<("testing:"+txt)<<endl;
const clock_t c=clock();
for(int i=0;i<loop;i++)
f();
cout<<clock()-c<<" ms"<<endl;

int main()
{
vector<A> v1(sz);
cout << "\nvector * i=" << A::get() << endl ;
A::get()=0;
vector<A> v2;
v2.reserve(sz);
cout << "reserve * i=" << A::get() << endl ;
test(vec0,"reserve");
test(vec1,"vector");
test(vec2,"std::swap");
test(arr,"array");
getch();
return 0;

testing:reserve
1172 ms
testing:vector
1141 ms
testing:std::swap
1125 ms
testing:array
1141 ms
regards,
FM.

I noticed however in even the "array" routine
a vector is created. Have you tried one where
there are _no_ vectors created in the routines?- Hide quoted text -

later I added:
void arr3(){
int *arr=new int[sz];
fill(arr,&(arr[sz]),0);
//delete arr; //wrong
/*that was wrong.this I meant:*/
delete[] arr;
};

void vec4(){
vector <int> v;
v.reserve(sz);
};

void vec5(){
vector <int> v;
v.reserve(sz);
v.insert(v.begin(),sz,0);
};

void vec6(){
vector <int> v;
v.insert(v.begin(),sz,0);
};

void vec7(){
vector <int> v;
v.resize(sz);
};

ps:
reserve dose not add elements to vector,it just increase current
capacity, so use insert or resize to add elements to the vector.

If you test the above functions you can see that using a vector is
equivalent to dynamically allocating and deallocating an array.Thus if
you do not perform an in-place calculation ,sooner or later you would
need to store the result in a somewhat dynamically allocated array
object.No matter when the dynamic allocation happens it always
increase runtime dramatically in an OS-dependent(unforcastable but
estimatable) manor and it is inevitable.So except for the case of in-
place calculation I would just start with a vector and finish with a
swap that takes place in no time.

regards,
FM.
 
T

terminator

Well, if you really want in-place multiplication you can try this:
Arrange your loops in such a way, that the output digits (r) are
computed from 0 straight up. That is, for each n in your *result*,
starting at 0 sum up all the a_i*b_j where i+j=n. You will need a finite
number (usually 2, unless you want to multiply verrrry long numbers) of
temporaries to hold the overflow that will leap to the next few digits.
I'm leaving it to you how juggle these. Now look at this example
multiplication r=a*b where len(a)==3 and len(b)==4

r_0 = a_0*b_0
r_1 = a_0*b_1 + a_1*b_0 + overflow
r_2 = a_0*b_2 + a_1*b_1 + a_2*b_0 + overflow
r_3 = a_0*b_3 + a_1*b_2 + a_2*b_1 + overflow
r_4 = a_1*b_3 + a_2*b_2 + overflow
r_5 = a_2*b_3 + overflow
r_6 = overflow

You might have noticed that after we are finished with r_3, a_0 is never
read again, and so the computation of r_4 obsoletes a_1 etc..Also,
pushing the carries through the result happens only once. Now I rename r:

a_3 = a_0*b_0
a_4 = a_0*b_1 + a_1*b_0 + overflow
a_5 = a_0*b_2 + a_1*b_1 + a_2*b_0 + overflow
a_6 = a_0*b_3 + a_1*b_2 + a_2*b_1 + overflow // a_0 obsoleted!
a_0 = a_1*b_3 + a_2*b_2 + overflow // a_1 obsoleted!
a_1 = a_2*b_3 + overflow // a_2 obsoleted!
a_2 = overflow

Your result is there, rotate a left by 3 to put it in place.

This should keep you busy.

for the 'n'th digit 'n' numbers need to be summed so one should make
sure that n is less than upper bound of digit type or extra carry
digits would be needed.

regards,
FM.
 
T

terminator

Sounds like slices.

The standard library isn't very well equipped when it comes to slices:
you may have to build that yourself.

There is std::valarray, which I think supports slices, but that's a
not-quite-finished piece of the library. If it works, great. If not...
on my platform index/iterating through a valarray is twice slower than
vector though for some mistery construction is somewhat faster.

regards,
FM.
 
M

mike3

<snip>
turninng debug switch off I found the following two sequences almost
equivalent in time:
1.vector buf ; swap ;
2. array buf ; fill ; result.reserve ; copy ;
#include <iostream>
#include <conio.h>
#include <vector>
#include <string>
#include <Ctime>
using namespace std;
enum {sz=1000,loop=100*1000};
void vec1(){
vector <int> v(sz),dest;
v.swap(dest);
}
void vec0(){
vector <int> v,dest;
v.reserve(sz);
fill(v.begin(),v.begin()+v.capacity(),0);
v.swap(dest);
}
void vec2(){
vector <int> v(sz),dest;
swap(dest,v);
}
void arr(){
int arr[sz];
vector <int> dest;
fill(arr,&(arr[sz]),0);
dest.reserve(sz);
copy(arr,&arr[sz],dest.begin());
}
template <typename pred>
void test(pred f,string txt){
cout<<("testing:"+txt)<<endl;
const clock_t c=clock();
for(int i=0;i<loop;i++)
f();
cout<<clock()-c<<" ms"<<endl;
}
int main()
{
vector<A> v1(sz);
cout << "\nvector * i=" << A::get() << endl ;
A::get()=0;
vector<A> v2;
v2.reserve(sz);
cout << "reserve * i=" << A::get() << endl ;
test(vec0,"reserve");
test(vec1,"vector");
test(vec2,"std::swap");
test(arr,"array");
getch();
return 0;
}
output:
testing:reserve
1172 ms
testing:vector
1141 ms
testing:std::swap
1125 ms
testing:array
1141 ms
regards,
FM.
I noticed however in even the "array" routine
a vector is created. Have you tried one where
there are _no_ vectors created in the routines?- Hide quoted text -

later I added:
void arr3(){
int *arr=new int[sz];
fill(arr,&(arr[sz]),0);
//delete arr; //wrong

/*that was wrong.this I meant:*/
delete[] arr;


void vec4(){
vector <int> v;
v.reserve(sz);
};
void vec5(){
vector <int> v;
v.reserve(sz);
v.insert(v.begin(),sz,0);
};
void vec6(){
vector <int> v;
v.insert(v.begin(),sz,0);
};
void vec7(){
vector <int> v;
v.resize(sz);
};
ps:
reserve dose not add elements to vector,it just increase current
capacity, so use insert or resize to add elements to the vector.

If you test the above functions you can see that using a vector is
equivalent to dynamically allocating and deallocating an array.Thus if
you do not perform an in-place calculation ,sooner or later you would
need to store the result in a somewhat dynamically allocated array
object.No matter when the dynamic allocation happens it always
increase runtime dramatically in an OS-dependent(unforcastable but
estimatable) manor and it is inevitable.So except for the case of in-
place calculation I would just start with a vector and finish with a
swap that takes place in no time.

regards,
FM.

Then why does the routine using an array seem to run
much faster than the one using a vector?
 
M

mike3

<snip>

I just finished testing another attempt at this. It seems the
problem is not the choice of vector vs array, but instead how
the memory is handled.

A temporary array made like this:

Digit *digits(new Digit[4]);

and then removed using delete is just as slow as a vector.

Whereas an array like this:

Digit digits[4];

is far faster.

The problem is though those variable-length arrays are
necessary -- no easy ways around 'em with the Floating Point
(FP) routines as I need to store bitshifted versions of the
operands without munging said operands. It might be possible
to allocate arrays of the second type I discussed above, but
then all the digit-manipulators have to take arrays not
vectors and that just doesn't seem like a good idea. Now,
if my program were single-threaded, I would have been able to
make do with a global buffer that would only get allocated once,
and always use that. But the program is going to be multi-
threaded, so having a single global buffer cannot really work.
And it seems ugly to have to pass a buffer operand to every
single arithmetic routine and is downright impossible when
using overloaded operators.

What do you do?
 
V

Victor Bazarov

mike3 said:
<snip>

I just finished testing another attempt at this. It seems the
problem is not the choice of vector vs array, but instead how
the memory is handled.

A temporary array made like this:

Digit *digits(new Digit[4]);

and then removed using delete is just as slow as a vector.

Whereas an array like this:

Digit digits[4];

is far faster.

Welcome to the dark side of the dynamic memory management.
The problem is though those variable-length arrays are
necessary -- no easy ways around 'em with the Floating Point
(FP) routines as I need to store bitshifted versions of the
operands without munging said operands. It might be possible
to allocate arrays of the second type I discussed above, but
then all the digit-manipulators have to take arrays not
vectors and that just doesn't seem like a good idea. Now,
if my program were single-threaded, I would have been able to
make do with a global buffer that would only get allocated once,
and always use that. But the program is going to be multi-
threaded, so having a single global buffer cannot really work.
And it seems ugly to have to pass a buffer operand to every
single arithmetic routine and is downright impossible when
using overloaded operators.

What do you do?

There are commercial and homegrown memory managers that can be
made much faster than the generic one your RTL supplies. Also
consider that if you know the possible largest size of your
array, use that to define an automatic array, even if you only
use part of it in the current call to your function.

V
 
W

werasm

The problem is though those variable-length arrays are
necessary -- no easy ways around 'em with the Floating Point
(FP) routines as I need to store bitshifted versions of the
operands without munging said operands. It might be possible
to allocate arrays of the second type I discussed above, but
then all the digit-manipulators have to take arrays not
vectors and that just doesn't seem like a good idea.

A vector should be compatible with arrays as the memory
of vectors are guaranteed to be contiguous. Therefore
having an array should IMO not force you to use arrays
all over the show. If you really don't have a choice,
I would use a scheme whereby one pre creates vectors as
you require them, whereafter you just pop them of
a stack. This would/should reduce the required memory
allocations to a minimum. Something like this (probably
naive implementation, you would have to test and experiment
a little):

#include <cassert>
#include <vector>
#include <memory>
//...#include "yourMutex.h"

template <class T>
struct VecStack
{
VecStack(){ vecList_.reserve( 50 ); }

std::auto_ptr<std::vector<T> > pop()
{
//...Mutex::guard lock( mutex_ );
std::auto_ptr<std::vector<T> > v( 0 );

if( vecList_.empty() )
{
v.reset( new std::vector<T> );
}
else
{
v.reset( vecList_.back() );
vecList_.pop_back();
}
return v;
}

void push( std::auto_ptr< std::vector<T> > v )
{
//...Mutex::guard lock( mutex_ );
assert( v.get() );
vecList_.push_back( v.release() );
}

private:
std::vector<std::vector<T>*> vecList_;
//...Mutex mutex_;
};


int main()
{
VecStack<int> vs;
std::auto_ptr<std::vector<int> > ivec( vs.pop() );
vs.push( ivec );
}


Replace Mutex with your favourite mutex... I used
auto_ptr incase something goes wrong in the using
function (a throw preventing push). OTOH if
I am not forced to use arrays all over the show
and speed does matter, I would stick with it. I
think the above solution might be worth a try
because popping the vectors IMO should be almost
as fast as creating the array. OTOH you might loose
some speed as result of indirection as well.

Kind regards,

Werner
 
T

terminator

<snip>

I just finished testing another attempt at this. It seems the
problem is not the choice of vector vs array, but instead how
the memory is handled.

A temporary array made like this:

Digit *digits(new Digit[4]);

and then removed using delete is just as slow as a vector.

Whereas an array like this:

Digit digits[4];

is far faster.

The problem is though those variable-length arrays are
necessary -- no easy ways around 'em with the Floating Point
(FP) routines as I need to store bitshifted versions of the
operands without munging said operands. It might be possible
to allocate arrays of the second type I discussed above, but
then all the digit-manipulators have to take arrays not
vectors and that just doesn't seem like a good idea.

That is what I am talking about.
Now,
if my program were single-threaded, I would have been able to
make do with a global buffer that would only get allocated once,
and always use that. But the program is going to be multi-
threaded, so having a single global buffer cannot really work.
And it seems ugly to have to pass a buffer operand to every
single arithmetic routine and is downright impossible when
using overloaded operators.

What do you do?

even in that case(global buffer) you would need to store the result in
a dynamically allocated array,so alloacating memory is inevitable and
I prefere to do it at the begining of the function.
If you do in-place calculation no dynamic allocation is necessary if
the capacity of destination is already enough,but if you need to store
the initial value of the destination in another 'bignum' object then
one dynamic alloction is needed.You are destined to pay that
overhead ,so a vector buffer followed by a swap is the clean readable
way of doing it ,but for the sake of performance I would declare an in-
place version too (just the way we can write 'x=y+z' and 'x+=y').

cheers,
FM.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,967
Messages
2,570,148
Members
46,694
Latest member
LetaCadwal

Latest Threads

Top