Do I really have to use an array?

M

mike3

On Jan 26, 12:15 am, Jerry Coffin <[email protected]> wrote:

But how do I assign those stacks to each thread without passing a
parameter
to the bignum routines (which looks bad and is outright impossible
when
using overloaded operators!)?

Hey, if I can do this, then why not just assign vector buffers to each
thread,
directly, and forget about stacks altogether? That was the whole
problem
the stack was originally intended to solve in the first place. If
there
is an alternative method to tackle this problem then I'll just toss
the whole
stack idea altogether and go for it!

PS. What did your profiling say the rest of the time was spent on?

So what's the answer here?
 
J

John Scheldroup

mike3 said:
[ ... ]
That's what the point of the vector stack was: to avoid having to keep
creating vectors. It would set up a pile of them that could then be
popped off as needed. Once there's enough for all the threads, no
more are created. They're just pushed/popped on/off the stack.
The vector stack probably isn't doing much good. In fact, it's probably
harmful.
If you notice how it's coded, the stack contains a list of _pointers_
to preallocated vectors.
Would there be a Template parameter defined in its list to see
what types are preallocated. The compiler can preinstall the
type first before the stack allocates. I could be wrong but the other
way *what type of parameter* do we have before FGWDStack()
initializes. Templates.. those curious creatures...

I've noticed you got rid of the vector of _pointers_ and replaced it
with a vector of objects. But then you run into all those problems
with copying objects -- why not keep the vector of _pointers_?

If objects are being copied they are not getting reused,
likely such the result of to many details to early on
Like what, exactly? What sort of details would that be?

Its a template not an algorithm
Does all that stuff like hiding pointers impact the
performance any? Remember, I need every last ounce of
performance I can get out of this.

I don't recall your saying such.. why use templates
anyhow ? If not what kind are the reusability issues
that lead you to employ them ?

John
 
J

Jerry Coffin

[ ... ]
If you notice how it's coded, the stack contains a list of _pointers_
to preallocated vectors.

Okay -- I'm afraid I haven't had a chance to look at the code much yet.
I've gotten an idea of the basic structure, but not much about the
individual bits and pieces.

[ ... ]
I push _pointers_ on the stack, not the vector objects themselves. The
vectors are made using a "new", then the resulting pointer is filed
away on the stack. When I need it, I pop off the pointer, use it, then
stick it back on.

Should I just use raw pointers instead of auto_ptr?

auto_ptr is pretty lightweight -- it's not showing up as something
that's taking much time in the profile...

[ ... ]
Hey, if I can do this, then why not just assign vector buffers to each
thread, directly, and forget about stacks altogether? That was the
whole problem the stack was originally intended to solve in the first
place. If there is an alternative method to tackle this problem then
I'll just toss the whole stack idea altogether and go for it!

Sounds reasonable to me. Generally, when you start up a thread, you do
it by specifying a function that will be called in the new thread. If
you want something done on a per-thread basis, that's the usual place to
start. Unfortunately, I don't know enough about how you're doing the
threading to say much more than that.
PS. What did your profiling say the rest of the time was spent on?

I thought of posting it before. The formatting gets hosed because the
lines are long, but here's the significant parts:

Time % Time % Count Function
---------------------------------------------------------
1318.725 18.4 6778.675 94.6 3000000 RawDigit::Mul(class
RawDigit const &,class RawDigit const &) (rawdigit.obj)
1148.997 16.0 5270.801 73.6 1000000 RawDigit::MulInPlace
(class RawDigit const &) (rawdigit.obj)
790.983 11.0 1199.126 16.7 2000000 RawDigit::Zeroize(void)
(rawdigit.obj)
771.520 10.8 1333.048 18.6 1000000 FG3DStack<class
RawDigit>::push(class std::auto_ptr<class RawDigit>) (rawdigit.obj)
605.482 8.4 977.610 13.6 1000000 FG3DStack<class
RawDigit>::pop(void) (rawdigit.obj)
412.900 5.8 613.727 8.6 1000000 RawDigit::Copy(class
RawDigit const &) (rawdigit.obj)
408.143 5.7 408.143 5.7 2000000 RawDigit::Zeroize
(int,int) (rawdigit.obj)
380.550 5.3 7165.489 100.0 1 _main (rawdigit.obj)
379.897 5.3 379.897 5.3 2000000 CriticalSection::Leave
(void) (cs.obj)
363.641 5.1 363.641 5.1 2000000 CriticalSection::Enter
(void) (cs.obj)
200.827 2.8 200.827 2.8 1000000 RawDigit::Copy(class
RawDigit const &,int,int,int,int) (rawdigit.obj)
190.118 2.7 190.119 2.7 1000000 std::vector<class
RawDigit *,class std::allocator<class RawDigit *> >::insert(class
RawDigit * *,unsigned int,class RawDigit * const &) (rawdigit.obj)
187.439 2.6 187.441 2.6 1000000 RawDigit::Resize(int)
(rawdigit.obj)
5.920 0.1 5.920 0.1 430 std::basic_filebuf
<char,struct std::char_traits<char> >::eek:verflow(int) (iostream.obj)

Just keep in mind that this is done with an old implementation; a newer
one might easily change things, especially wrt the standard library.
IOW, this is probably better than nothing, but don't take it as the
absolute and final word.
 
M

mike3

mike3 said:
On Jan 26, 12:39 pm, "John Scheldroup" <[email protected]>
wrote:
[ ... ]
That's what the point of the vector stack was: to avoid having to keep
creating vectors. It would set up a pile of them that could then be
popped off as needed. Once there's enough for all the threads, no
more are created. They're just pushed/popped on/off the stack.
The vector stack probably isn't doing much good. In fact, it's probably
harmful.
If you notice how it's coded, the stack contains a list of _pointers_
to preallocated vectors.
Would there be a Template parameter defined in its list to see
what types are preallocated. The compiler can preinstall the
type first before the stack allocates. I could be wrong but the other
way *what type of parameter* do we have before FGWDStack()
initializes. Templates.. those curious creatures...
<snip>
I've noticed you got rid of the vector of _pointers_ and replaced it
with a vector of objects. But then you run into all those problems
with copying objects -- why not keep the vector of _pointers_?
If objects are being copied they are not getting reused,
likely such the result of to many details to early on
Like what, exactly? What sort of details would that be?

Its a template not an algorithm
Does all that stuff like hiding pointers impact the
performance any? Remember, I need every last ounce of
performance I can get out of this.

I don't recall your saying such..  why use templates
anyhow ? If not what kind are the reusability issues
that lead you to employ them ?

I'm using templates because I might need stacks of other
types of buffers beside vectors of bignum digits.
 
M

mike3

[ ... ]
If you notice how it's coded, the stack contains a list of _pointers_
to preallocated vectors.

Okay -- I'm afraid I haven't had a chance to look at the code much yet.
I've gotten an idea of the basic structure, but not much about the
individual bits and pieces.

Well, for this exercise you really need only look at the basic
structure of RawDigit, it's member routines Mul() and MulInPlace() and
the stack routines in fg3dstack.h.

RawDigit is a low-level type that handles arithmetic on raw strings of
"digits" (or "limbs" as they're also called), which is used to build
more complicated types of bignum. It more-or-less consists of a vector
holding the digits (which represent the number in some power-of-2
base), stored in "little-endian" order: digit 0 is least-significant,
digit 1 is the next higher in significance, etc.

All arithmetic is done using simple "grade school" methods. For the
bignum sizes I need, these methods are likely the fastest.
[ ... ]
I push _pointers_ on the stack, not the vector objects themselves. The
vectors are made using a "new", then the resulting pointer is filed
away on the stack. When I need it, I pop off the pointer, use it, then
stick it back on.
Should I just use raw pointers instead of auto_ptr?

auto_ptr is pretty lightweight -- it's not showing up as something
that's taking much time in the profile...

OK, so that's not it...
[ ... ]
Hey, if I can do this, then why not just assign vector buffers to each
thread, directly, and forget about stacks altogether? That was the
whole problem the stack was originally intended to solve in the first
place. If there is an alternative method to tackle this problem then
I'll just toss the whole stack idea altogether and go for it!

Sounds reasonable to me.  Generally, when you start up a thread, you do
it by specifying a function that will be called in the new thread. If
you want something done on a per-thread basis, that's the usual place to
start. Unfortunately, I don't know enough about how you're doing the
threading to say much more than that.

The threads are done by specifying a function as usual. But if I pass
a parameter to that thread function, how do I pass it to the entire
bignum library without passing parameters to each and every arithmetic
function (which looks silly, is hard to maintain and read (= bad
programming practice. Better not to get in to bad habits in the first
place then to get into them and have to rid oneself of them later.)
and is 100% impossible when using overloaded operators)? That's what
I'm after -- how to pass it to the bignum routines without including
it as an extra parameter in the calls.

How then do I tell the bignum routines what buffer to use????
That's the big question, and that just might do away with this clumsy,
kludgy stack thing.
I thought of posting it before. The formatting gets hosed because the
lines are long, but here's the significant parts:
<snip>

Hmm. Can't seem to make much of it, though.
 
J

Jerry Coffin

[ ... ]
The threads are done by specifying a function as usual. But if I pass
a parameter to that thread function, how do I pass it to the entire
bignum library without passing parameters to each and every arithmetic
function (which looks silly, is hard to maintain and read (= bad
programming practice. Better not to get in to bad habits in the first
place then to get into them and have to rid oneself of them later.)
and is 100% impossible when using overloaded operators)? That's what
I'm after -- how to pass it to the bignum routines without including
it as an extra parameter in the calls.

You've got a couple of options. The most obvious would be to store the
vector in a thread-local variable. I believe the boost threading library
allows you to do this portably, and forms the basis for what's supposed
to be added in C++ 0x.

If you don't care a lot about portability, with MS' compilers you'd do
something like:

__declspec(thread) vector<long> temp;

Then each thread gets its own instance of temp automatically.

[ ... ]
<snip>

Hmm. Can't seem to make much of it, though.

It is a bit dense and can be hard to follow, but let's consider one line
of the output:

Func Func+Child Hit
Time % Time % Count Function
---------------------------------------------------------
380.550 5.3 7165.489 100.0 1 _main (rawdigit.obj)

This says that main executed for 380.550 milliseconds, which was 5.3
percent of the total time taken. Along with the other functions it
called, it took 7165.489 milliseconds, which was 100% of the total time
(as you'd expect for main). The name in the object file is _main, and
the object file it came from is rawdigit.obj.
 
J

Jerry Coffin

[ ... ]
Func Func+Child Hit
Time % Time % Count Function
---------------------------------------------------------
380.550 5.3 7165.489 100.0 1 _main (rawdigit.obj)

This says that main executed for 380.550 milliseconds, which was 5.3
percent of the total time taken. Along with the other functions it
called, it took 7165.489 milliseconds, which was 100% of the total time
(as you'd expect for main). The name in the object file is _main, and
the object file it came from is rawdigit.obj.

Oops, sorry. I skipped over the hit count -- that's the number of times
the profiler saw this function called.
 
M

mike3

[ ... ]
The threads are done by specifying a function as usual. But if I pass
a parameter to that thread function, how do I pass it to the entire
bignum library without passing parameters to each and every arithmetic
function (which looks silly, is hard to maintain and read (= bad
programming practice. Better not to get in to bad habits in the first
place then to get into them and have to rid oneself of them later.)
and is 100% impossible when using overloaded operators)? That's what
I'm after -- how to pass it to the bignum routines without including
it as an extra parameter in the calls.

You've got a couple of options. The most obvious would be to store the
vector in a thread-local variable. I believe the boost threading library
allows you to do this portably, and forms the basis for what's supposed
to be added in C++ 0x.

If you don't care a lot about portability, with MS' compilers you'd do
something like:

__declspec(thread) vector<long> temp;

Then each thread gets its own instance of temp automatically.

Hmm. This makes a lot more sense and would be
quite a bit better than the stack -- it wouldn't
involve that mutex in there either.
 
J

Jerry Coffin

[ ... ]
Hmm. This makes a lot more sense and would be
quite a bit better than the stack -- it wouldn't
involve that mutex in there either.

I've finally had a chance to look at the code a bit more. I was looking
at MulInPlace:

void RawDigit::MulInPlace(const RawDigit &a)
{
/* Obtain a temporary buffer to hold the result. */
std::auto_ptr<RawDigit> tmpBuf(rdStack.Pop());
RawDigit *pTmpBuf=tmpBuf.get();

/* Make it big enough if it isn't already. */
pTmpBuf->Resize(length + a.length);

/* Do multiplication */
pTmpBuf->Mul(*this, a);

/* Copy result back */
Copy(*pTmpBuf);

/* Done! Save the buffer for future reuse. */
rdStack.Push(tmpBuf);
}

Since you don't need pTmpBuf's value after you copy it into the current
object, you can probably save a fair amount of time by swapping the two
vectors instead. Copying has linear complexity but swapping is constant
complexity.
 
M

mike3

[ ... ]
Hmm. This makes a lot more sense and would be
quite a bit better than the stack -- it wouldn't
involve that mutex in there either.

I've finally had a chance to look at the code a bit more. I was looking
at MulInPlace:

void RawDigit::MulInPlace(const RawDigit &a)
{
/* Obtain a temporary buffer to hold the result. */
std::auto_ptr<RawDigit> tmpBuf(rdStack.Pop());
RawDigit *pTmpBuf=tmpBuf.get();

/* Make it big enough if it isn't already. */
pTmpBuf->Resize(length + a.length);

/* Do multiplication */
pTmpBuf->Mul(*this, a);

/* Copy result back */
Copy(*pTmpBuf);

/* Done! Save the buffer for future reuse. */
rdStack.Push(tmpBuf);

}

Since you don't need pTmpBuf's value after you copy it into the current
object, you can probably save a fair amount of time by swapping the two
vectors instead. Copying has linear complexity but swapping is constant
complexity.

But a copy shouldn't take the same length as the whole multiplication,
as it does for the 128-bit numbers I tested.

Furthermore, doesn't a swap then though go and result in a potentially
shorter vector being brought on to the stack, which would then
have to be resized later to accomodate the full product on the next
multiplication, eating time?

Hmm. I noticed that I had made Mul() so it could produce a truncated
product, so we could just make the buffer on stack as large as the
result vector, and then no resize is needed... I'll give that a try
and see what happens.
 
M

mike3

[ ... ]
Hmm. This makes a lot more sense and would be
quite a bit better than the stack -- it wouldn't
involve that mutex in there either.
I've finally had a chance to look at the code a bit more. I was looking
at MulInPlace:
void RawDigit::MulInPlace(const RawDigit &a)
{
    /* Obtain a temporary buffer to hold the result. */
    std::auto_ptr<RawDigit> tmpBuf(rdStack.Pop());
    RawDigit *pTmpBuf=tmpBuf.get();
    /* Make it big enough if it isn't already. */
    pTmpBuf->Resize(length + a.length);
    /* Do multiplication */
    pTmpBuf->Mul(*this, a);
    /* Copy result back */
    Copy(*pTmpBuf);
    /* Done! Save the buffer for future reuse. */
    rdStack.Push(tmpBuf);

Since you don't need pTmpBuf's value after you copy it into the current
object, you can probably save a fair amount of time by swapping the two
vectors instead. Copying has linear complexity but swapping is constant
complexity.

But a copy shouldn't take the same length as the whole multiplication,
as it does for the 128-bit numbers I tested.

Furthermore, doesn't a swap then though go and result in a potentially
shorter vector being brought on to the stack, which would then
have to be resized later to accomodate the full product on the next
multiplication, eating time?

Hmm. I noticed that I had made Mul() so it could produce a truncated
product, so we could just make the buffer on stack as large as the
result vector, and then no resize is needed... I'll give that a try
and see what happens.

Anyway, I just went to the thread-local storage method.
It seems to be the winner in terms of performance, and
a lot cleaner than that goofy stack thing, which was
such an awful kludge.
 
M

mike3

Measure the std::vector's performance when optimized. Your code uses
integer everywhere instead of std::size_t (or size_type) to track
length. So what you probably have is repeated conversions from integer
to unsigned (or unsigned long presumeably) and back.

I was using int instead of std::size_t becuase loops like
this, which are used in several routines, do not work:

---
std::size_t myLength(myVector.size());

for(std::size_t i(myLength-1);i>=0;i--)
{
... do something ...
}
---

"i" decreases down beyond zero. But with an int, ie.

---
int myLength(myVector.size());

for(int i(myLength-1);i>=0;i--)
{
... do something ...
}
---

it works fine.

Does that make any sense? It seems you've just got to have
a conversion somewhere.

Or is there a way to still use std::size_t without these
conversions???
 
S

Sam

mike3 said:
I was using int instead of std::size_t becuase loops like
this, which are used in several routines, do not work:

size_t is unsigned, so your >= 0 comparison always evaluates to true, since
decrementing a size_t of 0 wraps around to a huge positive value.
Or is there a way to still use std::size_t without these
conversions???

Yes. In most cases, you just need to use a construct of the form:

size_t i=myLength;

while (i > 0)
{
--i;

// your code here
}


-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (GNU/Linux)

iD8DBQBHno+5x9p3GYHlUOIRAjJFAJ9od6MZ8+wBd7GwGzWPuhrcg10ULQCeMKUj
IsS2gZ2S0vF3upZUqEKjuWE=
=dpzn
-----END PGP SIGNATURE-----
 
M

mike3

 application_pgp-signature_part
1KDownload



size_t is unsigned, so your >= 0 comparison always evaluates to true, since
decrementing a size_t of 0 wraps around to a huge positive value.

And int is signed, so...
Yes. In most cases, you just need to use a construct of the form:

size_t i=myLength;

while (i > 0)
{
    --i;

    // your code here



}

Alright, I'll use that then. Thanks for the answer.
 
D

Daniel T.

mike3 said:
I was using int instead of std::size_t becuase loops like
this, which are used in several routines, do not work:

---
std::size_t myLength(myVector.size());

for(std::size_t i(myLength-1);i>=0;i--)
{
... do something ...
}
---

"i" decreases down beyond zero. But with an int, ie.

---
int myLength(myVector.size());

for(int i(myLength-1);i>=0;i--)
{
... do something ...
}
---

it works fine.

Does that make any sense? It seems you've just got to have
a conversion somewhere.

Or is there a way to still use std::size_t without these
conversions???

Here is a different answer. Use iterators.

for ( vector<X>::reverse_iterator it = myVector.rbegin();
it != myVector.rend();
++it )
{
// do something
}

Because the above uses a reverse_iterator, you are stepping through the
vector backwards. You might also want to consider a functor:

struct LoopGuts : unary_function< X, void > {
void operator()( const X& myX ) const {
// do something
}
};

then the loop would be:

for_each( myVector.rbegin(), myVector.rend(), LoopGuts() );

or...

void doSomething( const X& x ) {
// do something
}

then...

for_each( myVector.rbegin(), myVector.rend(), &doSomething );
 
M

mike3

Here is a different answer. Use iterators.

for ( vector<X>::reverse_iterator it = myVector.rbegin();
   it != myVector.rend();
   ++it )
{
   // do something

}

Because the above uses a reverse_iterator, you are stepping through the
vector backwards. You might also want to consider a functor:

struct LoopGuts : unary_function< X, void > {
   void operator()( const X& myX ) const {
      // do something
   }

};

then the loop would be:

for_each( myVector.rbegin(), myVector.rend(), LoopGuts() );

or...

void doSomething( const X& x ) {
   // do something

}

then...

for_each( myVector.rbegin(), myVector.rend(), &doSomething );

Is this just as fast as a simple for() loop over the vector?
I admit, it would shorten the code, but the indirection...

See, I need speed, and so I can't have too much
calling overhead in those tight loops or too much indirection.
The functions I use that do the operations in the inner loops
are inlined, removing the call overhead.

In that case, I'll just settle for the iterator.
 
J

James Kanze

I was using int instead of std::size_t becuase loops like
this, which are used in several routines, do not work:
for(std::size_t i(myLength-1);i>=0;i--)
{
... do something ...}

Hmmm. I'd use a while loop here:

size_t i = myLength ;
while ( i != 0 ) {
-- i ;
// do something...
}

But no matter.
"i" decreases down beyond zero. But with an int, ie.

Not if i has type size_t, it doesn't:). The loop as you just
wrote is an infinite loop. (Some compilers will warn about
this: something along the lines of "condition always true".)
for(int i(myLength-1);i>=0;i--)
{
... do something ...}

it works fine.

Unless the vector contains more than INT_MAX elements, of
course.
Does that make any sense? It seems you've just got to have
a conversion somewhere.
Or is there a way to still use std::size_t without these
conversions???

Or does it matter? Depending on the hardware, I think that
there will be two possible cases:

-- If size_t and int have the same size (usually the case on 32
bit machines), the conversion is purely a compile time
thing. No machine code is involved---you're just telling
the compiler to interpret the bytes differently. (Note that
at the machine code level, "words" are basically
untyped---type is determined by the instructions applied to
them, rather than by the data type.)

-- If size_t and int have different sizes, then the conversion
may cost an instruction or two. Or not, it depends. In all
of the cases I know of, when the sizes are different, int is
smaller than size_t, so once again, it may be just a
question of choosing instructions which only consider a part
of the size_t word. (That's the case on my 64 bit Sparc,
for example.) In other cases, however, (e.g. 16 bit 8086,
in model huge, but probably some modern architectures as
well), manipulating an int may be significantly faster, and
the conversion involves simply loading part of the size_t,
rather than the entire size_t, and is faster than
initializing a size_t variable would be.

In general, unless there are very strong reasons for not doing
so, you should use int in C++. The only motivation for possibly
using a size_t in your case would be to support vectors with
more than INT_MAX elements.
 
J

James Kanze

Is this just as fast as a simple for() loop over the vector?

It might be, it might not be.
I admit, it would shorten the code, but the indirection...

I'm not sure what you mean about the indirection. On most
modern machines (and most not so modern machines as well---it
was definitely the case on the Interdata 8/32, ca. 1977),
something like:

for ( double* p = array ; p != array + N ; ++ p ) {
use *p
}

is faster than:

for ( int i = 0 ; i < N ; ++ i ) {
use array[ i ]
}

Or would be, except that the compiler translates the two into
exactly the same machine code (which corresponds to the first).

With regards to the for_each, a lot depends on how much the
compiler can inline. Measures I did some years back (four or
five, at least) suggest that with the compiler (g++) I had then,
using a reverse iterator (rather than a normal iterator and
decrementing) did have measurable runtime cost. Measures others
have made also suggest that using a pointer to a function,
rather than a functional object, can be expensive---with the
pointer to the function, the compiler wasn't able to inline the
function, and did generate an indirect function call, which
wrecked havoc with the pipeline on the machine in question.

Note, however, that those are specific measurements on specific
machines, using specific compilers, and probably aren't
representative of what you might find in your environment.
See, I need speed, and so I can't have too much
calling overhead in those tight loops or too much indirection.
The functions I use that do the operations in the inner loops
are inlined, removing the call overhead.

Use a functional object, and most compilers will be able to
inline it in the expansion of for_each. (The reason for this is
that each functional object is a different type, resulting in a
different specialization of for_each, i.e. a different function.
All of the pointers to functions have the same type, and are
handled by the same specialization of for_each---the optimizer
must do constant propagation into the inlined function to avoid
the indirect call. And since historically, people didn't use
pointers to functions if the function was a constant, older
compilers generally don't propagate such constants.)
 
R

Ralph D. Ungermann

mike3 said:
I was using int instead of std::size_t becuase loops like
this, which are used in several routines, do not work:

std::size_t myLength(myVector.size());

for(std::size_t i(myLength-1);i>=0;i--)
{
... do something ...
}
Or is there a way to still use std::size_t without these
conversions???

for ( std::size_t i = myLength; i--; )
{
//...
}

-- ralph
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,968
Messages
2,570,152
Members
46,697
Latest member
AugustNabo

Latest Threads

Top