ideal interface for Random Number Generators?

T

Thomas J. Gritzan

Am 10.06.2010 07:37, schrieb orz:
It is true that that makes efficient full equality checks a bit
easier. But...
1. Equality check is not a common operation on MT.

Agreed. And it wouldn't be done in a performance critical loop, would it?
2. Even without the state double, full equality checks on MT instances
are not inefficient.

Well... you would have to bring both instances to the same state, so
that both either store x(i)...x(i+n-1) or x(i-n)...x(i-1). If you don't
want to change the original objects (they are const in the op==), you
have to make a copy of both for the check.
3. You could do mostly the same thing for equality checks with just a
single extra integer instead of 624 extras.

How? Comparing the original seeds and number of calls?
4. The naive approach to MT equality checks mostly works okay anyway.
Not perfectly, but I think the only ways to trip it up require either
running an instance of MT through an entire period (functionally
impossible to do without a skip-ahead type method, which is not
provided), or comparing two instances of MT originally seeded with
different seeds but then moved to the exact same position (on average
requires going through half of a full cycle, though it is possible to
do so with slightly less by way of bad luck, or a lot less if you pick
your seeds with great care).

A simple save/restore might invalidate the naive approach, depending on
how you do it. I think an equality check should be robust, if it is
provided. So it must normalize the internal state before comparing.
5. The Boost random docs list MT as taking up the normal amount of
memory, not the doubled amount of memory.

Interestingly, the TR1 implementation of Dinkumware (Visual Studio 2010)
also uses the double amount of memory.
 
O

orz

Agreed. And it wouldn't be done in a performance critical loop, would it?
I suppose it's best to not bet that programmers will never need an
efficient operator==, but it certainly doesn't seem like it should be
a priority when optimizing.
Well... you would have to bring both instances to the same state, so
that both either store x(i)...x(i+n-1) or x(i-n)...x(i-1). If you don't
want to change the original objects (they are const in the op==), you
have to make a copy of both for the check.
In a worst-case phase it's no worse than the cost of generating 623
untempered outputs. If the MT states were equal or ridiculously
similar, otherwise the cost is approximately zero. In an average
phase it's about half the cost it would be in a worst-case phase.
How? Comparing the original seeds and number of calls?
No, because the original seed could be large depending upon the
seeding method. But the technique Boosts implementation used doesn't
strictly require access to two copies of the state, it just needs
efficient access to either the next or previous 624 (preferably
untempered) outputs. You can store a second index, indicating the
highest position in the array which has been updated to the next set
of states. It would have to never be higher than the index to return,
and it would normally be zero, but when a comparison was performed it
would be updated to one less than the next index to be returned. That
would let operator== access the entire next 624 outputs, with the only
extra cost being a tiny bit of extra logic to handle that variable
once per 624 outputs, plus the extra 4 bytes of state. That would
have performance approximately identical to the current implementation
while significantly shrinking the state size.
A simple save/restore might invalidate the naive approach, depending on
how you do it. I think an equality check should be robust, if it is
provided. So it must normalize the internal state before comparing.
It is *possible* for a simple save to invalidate the comparison, if
someone goes out of their way to write the save routine that way, but
under normal circumstances that would not happen. On a normal MT
implementation, serialization naturally preserves the phase - for MT
implementations using naive operator== the only reason to not preserve
the phase is if you really really wanted to optimize the serialized
state size - that would reduce the saved size from 625*4 bytes down to
624*4 bytes.

Interestingly, the TR1 implementation of Dinkumware (Visual Studio 2010)
also uses the double amount of memory.
My impression was that TR1 and Boost were cousins, though really I
know nothing about TR1. Boost is the first implementation of MT I've
seen to keep two copies, though I've only read 4 or 5 implementations
and some of those were derivative from others of those.
 
O

orz

Shrug. Underneath it's all bits, too.


Maybe, but why do you want to do this? Have you shown that text has
inherent inefficiencies that produce significant bottlenecks for your
application? For the design goals of TR1 and C++0x, text is sufficient
and simple.
Text does have major bottlenecks for both state size and runtime
speed. On the rare occasions when you wouldn't otherwise be linked
with string handling code it also forces that. I've seen code for a
few applications that needed to do fast serialization of RNG states
(on commercial game engine, one amateur game engine, and one research
project), though those are admittedly quite rare. It's also my
understanding that some scientific simulation projects like to
instantiate large numbers of RNGs and do periodic state saves on all
of them as well, potentially running in to issues with the total
serialized state size or serialization time, though I've never
actually seen code for such a project and am not really sure if that
could be an issue or not.
 
K

Keith H Duggar

So, again, have you made measurements that show that any application
gets a significant performance hit from the use of text? Note that the
key word here is "application", not "operation that an application
sometimes does".

Are you asking for an example of RNG serialization specifically?
Or text encoding in general? If text encoding in general then the
answer from the high-frequency trading community (of which I am
part) is a absolute and resounding yes.

And that applies to serialization of market (and other data) for
both networked communication and data files for backtest and
historical data analysis.

KHD
 
O

orz

Maybe, but why do you want to do this? Have you shown that text has
Non-answer. Again: have you shown that text has inherent inefficiencies
that produce significant bottlenecks for your application? Hint: you
need to establish that text is significantly slower than your
unspecified binary approach, then analyze how often your program has to
do this and compare that with how often it does other things. One good
tool for this is a profiler.


Shrug. On the rare occasions when you wouldn't otherwise be linked with
the code for handling your presently unspecified binary protocol it
also forces that.

I do not know why you are acting as if text serialization and binary
serialization+deserialization are of similar complexity. Text
representation involves converting to/from base 10, packing only 3 and
a fraction bits in to each 8 bits, a variable length format, and
either scientific notation or otherwise depending upon the magnitude
of the number.
The natural binary representation for integers would simply be the
normal integer binary representation in whatever endianness you
considered to have the most moral weight, and the normal FP
representation would be that representation defined in IEEE 754, once
again in whatever endianness you consider to have the most moral
weight. Can you *really* argue that variable-length low-density multi-
format representations are of comparable complexity to integer
representations that require, at most, swapping byte orders, or FP
representations that involve, in a worst-case non-IEEE compliant
setting, a couple of bitwise operations and either knowledge of the
local FP format or calls to frexp and ldexp.
The folks at Fermilab do massive simulations on non-homogeneous
distributed systems. They're the ones who wanted serialization in the
TR1 RNGs. And they're responsible for the text specification for that.

So, again, have you made measurements that show that any application
gets a significant performance hit from the use of text? Note that the
key word here is "application", not "operation that an application
sometimes does".

I'm currently travelling for a few weeks, and getting access to the
appropriate codebases is non-trivial for me atm, let alone
benchmarking them. I can certainly understand the idea that efficient
(timewise or spacewise) serialization of RNG states is unimportant for
99.999+% of applications (heck, probably less than 1% need any form of
RNG serialization at all). As I mentioned, I am familiar with a few
projects were RNG serialization occurred in relatively performance-
critical contexts, where a difference of a few hundred nanoseconds per
serialization or deserialzation could have a noticable effect on
performance under some circumstances. These include game engines that
serialize RNGs large numbers of times per second and research projects
that occasionally do loops containing almost nothing except RNG
serialization and deserialization. I believe there are also some rare
codebases that like to repeatedly transmit serialized RNG states over
low bandwidth network connections, where space efficiency could be
important, though I have never seen a codebase where that would have
been a significant issue.
 
T

Thomas J. Gritzan

Am 11.06.2010 09:36, schrieb Pete Becker:
On 2010-06-10 20:54:27 -1000, orz said:
[text serialization of PRNGs to slow?]
The folks at Fermilab do massive simulations on non-homogeneous
distributed systems. They're the ones who wanted serialization in the
TR1 RNGs. And they're responsible for the text specification for that.

So, again, have you made measurements that show that any application
gets a significant performance hit from the use of text? Note that the
key word here is "application", not "operation that an application
sometimes does".

The 'optimize when to slow' approach would be valid if there would be
any means to get your hands on the internal state. But in the TR1 RNGs,
you can only retrieve the state in textual form. In the case you really
need the speed or want to conform to an already specified textual (i.e.
XML or comma separated list) or binary format, you can't use the TR1
classes.

On the other hand, the seed sequence class in C++0x, which can be used
to seed any RNG in the library, can be "loaded" using a pair of input
iterators, and these values can be written back to an output iterator
using the param function.

So a seed sequence can be saved/restored to virtually any format using
iterators, but the RNGs in the same library can only be saved/restored
in textual format.
 
K

Keith H Duggar

Sigh. That's what this thread is about. You claim that some unspecified
binary protocol would be better than the text-based protocol that TR1
and C++0x provide for random number enerators, but you've only given
hand-waving arguments to support it.

I don't know what's with the resident curmudgeon's lately but it
seems they've lost their ability to read English and think clearly.

First Alf pulling up all this idiotic "xyz must be stupid" because
he was ignorant of the basic facts of digital division algorithms,
second James (very uncharacteristically) making equally misinformed
statements about division and then refusing to come clean when hard
facts were presented proving him wrong, and now Pete putting words
into my mouth and failing to notice the "xdr" in my earlier post.

1) I never claimed that some "unspecified" protocol would be "better"
I said "efficient" which is not the same as "better" as we all know.

2) I specified a specific example of a binary protocol, XDR, what is
industry defined and widely implemented. And there are many others
I'm not going to bother mentioning because what would be the point?
I don't disagree the the committee's decision in the first place, and
second I'm sure all it would get is some childish "shrug" or perhaps
a one word response like "No." if we were lucky.

I have already as much said that the choice to use text encoding
in the C++ is "fine" for me. It is just that your "binary doesn't
work" comment was a poor choice of words and if taking literally
amounts to non-sense.
Shrug. Again.

Sigh. Shrug. Face-palm. Whatever. Like I said, I wonder why the
"elders" are being such childish curmudgeon's lately. Maybe they
are tired and burnt-out with working on C++? I'm sure it has been
a long, hard, and tiring journey. Is their vitality sapped from
all their noble hard work?

KHD
 
O

orz

There are a number of applications where serialization speed and/or
size is a major bottleneck, and if you really want me to find
profiling numbers on such an application I probably could, though
there are better uses for my time. Applications where serialization
speed is a bottleneck and RNGs make up a large fraction of what gets
serialized are MUCH more rare (but I could probably find performance
numbers for such an application too), and probably not worth worrying
about for the standard library. Still, I'm getting the impression
that the standard library (and you, Pete Becker) considers text based
serialization the only serious serialization, which I find rather
disturbing. Binary serialization in general is not new, unproven,
hypothetical, or anything else like that. It applies to RNGs in
pretty much exactly the same way it applies to other simple data
structures. That is, it's significantly more efficient than text
serialization in terms of run-time speed, output size, and code
complexity, though for simple data-structures it's usually much harder
for a human to interpret the output. That has all been well known and
thoroughly tested for the last 25 years if not longer. The helper
functions needed for binary serialization on unknown platforms are
already present in standards (frexp, ldexp, htonl, etc), though it's
more efficient if the serializing code knows what platform it's on at
compile time.
 
O

orz

Not at all. The standard library uses text because it's sufficient for
the library's design goals. Anything more is overdesign, and
inefficient use of programmer time.


Nobody said it was.


Text has been portable in standard C for forty years. Binary has not
and is not.


Funny thing: when you use text for serialization, the target platform
is known at compile time. The details are handled in the standard
library; you don't have to write any platform-specific code.

For binary serialization the standard would have to specify details of
the protocol, and, perhaps, helper functions that are not standard C or
C++ (e.g. htonl), and implement the translation layer. That's far more
complex than specifying and writing a text interface that sits on top
of the text conversion functions that have been around for forty years.

Sigh. Are you deliberately misinterpreting everything I say? Lets
try to be clear and comprehensive hear:

The possible format are:
1. native binary
portable?: no
code on known ideal platform: trivial complexity, very high efficiency
code on known non-ideal platform: n/a
code on unknown or worst-case platform: n/a

2. match reference platforms binary
portable?: yes
code on known ideal platform: trivial complexity, very high efficiency
code on known non-ideal platform: high efficiency & simple code (~ 40
lines w/ templates, ~ 150 lines w/o)
code on unknown platform: medium efficiency & simple code (~ 40 lines
w/ templates, ~ 150 lines w/o)
extra requirements: "known platform" must include knowledge of byte
ordering, "unknown platform" requires efficient implementations of
frexp, ldexp

3. match (simple) imaginary platforms binary
portable?: yes
code on known ideal platform: n/a
code on known non-ideal platform: n/a
code on unknown or worst-case platform: medium efficiency & simple
code (~40 lines w/ templates, ~150 lines w/o)
extra requirements: must define your format, must have efficient
implementations of frexp, ldexp

4. text
portable?: yes
code on known ideal platform: n/a
code on known non-ideal platform: n/a
code on unknown or worst-case platform: low efficiency & trivial
complexity
extra requirements: must link with equivalents of sscanf, sprintf

I have advocated #2 and #3 as superior to #4. None of this is related
to RNGs, so this whole bizarre argument is all off-topic. My
impression is that all programmers remotely familiar with low level
issues know most of this (or could figure most of it out by thinking
about it for a few minutes), and most semi-competent programmers are
at least remotely familiar with low level issues, and that you are
more than semi-competent, therefore it's extremely likely that you
already know this. But somehow you are acting as if you can't
understand this or don't want to admit this or you are trolling. If
you really don't understand one of those cases and someone else in
this thread thinks that you are not trolling then I can write up a
simple implementation of any single case of your choice out of those
(don't have a compiler here but I could probably download & install
one without trouble).

The line counts are guestimates, and I'd say actuals should be within
a factor of 2 either way. Line counts assume reasonably high code
density. Line counts do not include the #ifdefs or whatever necessary
to figure out what the platform is or isn't. Particularly wierd
platforms count as "unknown" even if they are otherwise known, and
truly bizarre platforms (ie 7 bit bytes, pointer arithmetic not
working or otherwise strange, that kind of thing) don't count at all.
 
K

Keith H Duggar

Not at all. The standard library uses text because it's sufficient for
the library's design goals. Anything more is overdesign, and
inefficient use of programmer time.


Nobody said it was.

Pete Becker wrote all of the following in this thread:

"Binary representations just don't work for [serialization on non-
homogeneous systems]"

"your hypothetical binary protocol"

"your unspecified binary approach"

"You claim that some unspecified binary protocol would be
better than the text-based protocol"

"you've only given hand-waving arguments to support it"

"Shrug ... Sigh ... Shrug ... Shrug ... Shrug. Again ..."

KHD
 
O

orz

Funny thing: the title of this thread says it's about Random Number Generators.
Yes. I'm not sure if you're suggesting the thread get back, or
disagreeing with my suggestion to that effect but either way, enough
about serialization. I halfway expect that the Boost rng
serialization might be able to do alternate serialization schemes as-
is anyway by just passing it a different object type.

Anyone have more comments on random number generation interfaces? Or
heck, anything else more related to RNGs than generic serialization
issues?
 
T

Thomas J. Gritzan

Am 11.06.2010 08:45, schrieb orz:
I suppose it's best to not bet that programmers will never need an
efficient operator==, but it certainly doesn't seem like it should be
a priority when optimizing.

Exaclty. I hope they don't do it often ;-)
In a worst-case phase it's no worse than the cost of generating 623
untempered outputs. If the MT states were equal or ridiculously
similar, otherwise the cost is approximately zero. In an average
phase it's about half the cost it would be in a worst-case phase.
It is *possible* for a simple save to invalidate the comparison, if
someone goes out of their way to write the save routine that way, but
under normal circumstances that would not happen. On a normal MT
implementation, serialization naturally preserves the phase - for MT
implementations using naive operator== the only reason to not preserve
the phase is if you really really wanted to optimize the serialized
state size - that would reduce the saved size from 625*4 bytes down to
624*4 bytes.

I also would simply store the current state + index, but in TR1/C++0x it
is specified to store the last N outputs (before tempering). At least it
is portable this way. You can take the serialized text from one
implementation to another.
My impression was that TR1 and Boost were cousins, though really I
know nothing about TR1. Boost is the first implementation of MT I've
seen to keep two copies, though I've only read 4 or 5 implementations
and some of those were derivative from others of those.

TR1 is based on Boost, but TR1 is only specified as interface and some
implementation constraints, exactly as all of the C++ standard.

In the last hours, I implemented the Mersenne Twister of the C++0x
working draft using only the single amount of storage:
http://draig.de/phy/random/
The templated engine can be used to implement some other non-MT
generators, like TT800. I use this one at work, so I provided a typedef
for it as well.
 
J

James Kanze

On Jun 11, 4:41 pm, Pete Becker <[email protected]> wrote:
First Alf pulling up all this idiotic "xyz must be stupid" because
he was ignorant of the basic facts of digital division algorithms,
second James (very uncharacteristically) making equally misinformed
statements about division and then refusing to come clean when hard
facts were presented proving him wrong,

Just to be clear: the only *hard* fact that has been presented
is that I did run my benchmark on a Sparc, about a year ago, and
on that machine, with the version of Sun CC I was using,
division took exactly the same time as addition, subtraction and
multiplication. I don't currently have access to a Sparc, so I
can't rerun it, and I don't know exactly what the configuration
was I ran it on, but on at least one Sparc, under Solaris,
division isn't any slower than any of the other four operations.

Out of curiosity, I did rerun the benchmark on the PC's (under
Windows) I currently have access to, and there, division is
slower.

(Another interesting point was that on earlier Sparcs, floating
point division was faster than fixed point.)

Anyway, my statements in this regard were based on concrete
measurements. The code which I used to make the measurements is
available online (whenever my provider decides to make it so).
So you can easily test the situation on your particular
hardware. Expect to find something different than what other
people find, however, because hardware varies enormously.
 
J

James Kanze

On 2010-06-11 21:01:42 -1000, orz said:

[...]
Funny thing: when you use text for serialization, the target
platform is known at compile time. The details are handled in
the standard library; you don't have to write any
platform-specific code.

Yes and no. I have platform independent code (no #ifdef's)
which reads and writes XDR. Once you've defined the output
format, be it text or otherwise, any "platform specific" code is
only optimization.
For binary serialization the standard would have to specify
details of the protocol, and, perhaps, helper functions that
are not standard C or C++ (e.g. htonl),

Interestingly enough, I've done a significant amount of binary
serialization without every using htonl and company. For
positive integers, there's nothing significantly different
between binary serialization and text, except that the base is
always a power of two (which means a single shift, on machines
where multiplication or division is slow---and in many cases,
the compiler will "optimize" the shift to use byte accesses to
memory), and the encoding is direct: the code for the digit is
the value of the digit.

For floating point, it's a bit more difficult, but the necessary
functions are present in the C library (and thus in C++).
and implement the translation layer. That's far more complex
than specifying and writing a text interface that sits on top
of the text conversion functions that have been around for
forty years.

Yes and no. There is some advantage to knowing that the integer
you are reading is exactly 4 bytes (and each byte is 8 bits).
The real advantage in text serialization is during debugging.
 
J

James Kanze

I do not know why you are acting as if text serialization and
binary serialization+deserialization are of similar
complexity. Text representation involves converting to/from
base 10, packing only 3 and a fraction bits in to each 8 bits,
a variable length format, and either scientific notation or
otherwise depending upon the magnitude of the number.

Binary may or may not have many of the same problems. The big
advantage is 1) a fixed length format, so you don't need
separators (although this isn't true for all binary, and of
course, in many cases, a binary variable length format will
result in smaller files than either text or fixed length
binary), and 2) the fact that all of the multiplication/division
is by 256, which on most modern machines, a good compiler will
handle by using byte addressing, with no multiplication or
division whatsoever.

Another frequent advantage is that the format is much more
rigorously specified---any given value has exactly one format,
rather than "10.", "1e1", etc.
The natural binary representation for integers would simply be
the normal integer binary representation in whatever
endianness you considered to have the most moral weight,

Which leaves open the question of representation of negative
values.
and the normal FP representation would be that representation
defined in IEEE 754, once again in whatever endianness you
consider to have the most moral weight.

Which is fine if (and only if) you can limit your portability to
machines which use IEEE 754. Which excludes pretty much all
mainframes.
Can you *really* argue that variable-length low-density multi-
format representations are of comparable complexity to integer
representations that require, at most, swapping byte orders,
or FP representations that involve, in a worst-case non-IEEE
compliant setting, a couple of bitwise operations and either
knowledge of the local FP format or calls to frexp and ldexp.

For output, there's really not a significant difference in
complexity. For input, the problem is that text generally
allows many different representations for the same format. For
both, however, text requires separators (unless you're using
fixed with formats, but in that case, the C++ conversions aren't
that easy to use), which binary doesn't.
 
K

Keith H Duggar

Just to be clear: the only *hard* fact that has been presented
is that I did run my benchmark on a Sparc, about a year ago, and

So you missed the engineering specification from Sun themselves?

https://www.opensparc.net/pubs/t2/docs//OpenSPARCT2_Core_Micro_Arch.pdf

That I posed here

http://groups.google.com/group/comp.lang.c++/msg/5467503d1fb3e411

There is very little harder than the precise architecture specs
based on the physical engineering. They know exactly the clocks
for every operation in isolation. So any measurements deviating
far from those specs is undoubtedly a broken measurement process.

I'm hard pressed to understand how you can argue against such
engineering specs and also the basic facts of digital division
methods that is computer engineering 101.
on that machine, with the version of Sun CC I was using,
division took exactly the same time as addition, subtraction and
multiplication. I don't currently have access to a Sparc, so I
can't rerun it, and I don't know exactly what the configuration
was I ran it on, but on at least one Sparc, under Solaris,
division isn't any slower than any of the other four operations.

Out of curiosity, I did rerun the benchmark on the PC's (under
Windows) I currently have access to, and there, division is
slower.

(Another interesting point was that on earlier Sparcs, floating
point division was faster than fixed point.)

Anyway, my statements in this regard were based on concrete
measurements. The code which I used to make the measurements is
available online (whenever my provider decides to make it so).

Yes, I took a look and mentioned there were big problems for using
that code to measure single machine op performance. One can adapt
the circle drawing measurement code I've posted a few times to
measure single ops much more accurately.
So you can easily test the situation on your particular
hardware. Expect to find something different than what other
people find, however, because hardware varies enormously.

Yeah hardware varies but the fundamental facts of digital division
methods are what they are and are pretty much set in stone at this
point until somebody comes up with radically improved technology.

KHD
 
J

James Kanze

So you missed the engineering specification from Sun themselves?

Which concerns one particular implementation of the Sparc. (Not
that I was able to find any exact timing figures in that
document.)
That I posed here

There is very little harder than the precise architecture
specs based on the physical engineering. They know exactly the
clocks for every operation in isolation. So any measurements
deviating far from those specs is undoubtedly a broken
measurement process.
I'm hard pressed to understand how you can argue against such
engineering specs and also the basic facts of digital division
methods that is computer engineering 101.

I did the measurements. I'm a practical engineer, and I've
enough experience in this to know more or less what I'm
measuring. (With regards to computer engineering, anyone who's
done some work at the hardware level knows that if you can spare
enough space on the chip to do it, division, can be just as fast
as addition. And I've seen processors where dividing was faster
than a single shift---the NSC 32000, for example.)

[...]
Yeah hardware varies but the fundamental facts of digital division
methods are what they are and are pretty much set in stone at this
point until somebody comes up with radically improved technology.

At the hardware level, there are two solutions for just about
everything, combinatorial and sequential. Division requires
a lot of gates to be done combinatorially, so it is usually done
sequentially, or with a combination of the two. But in the end,
it's really just a question of how much real estate is available
on the chip, and what you want to use it for.
 
K

Keith H Duggar

Which concerns one particular implementation of the Sparc.

You just said "modern Sparc" so I picked one of the most modern.
Name whichever "modern Sparc" you want to consider and the results
are going to be same.
(Not that I was able to find any exact timing figures in that
document.)

I provided references to the tables with the timings (in terms of
exact clocks) in my original post. So all you need to do is click,
scan, and read to find them.
I did the measurements. I'm a practical engineer, and I've
enough experience in this to know more or less what I'm
measuring.

I suggest you revisit your bechmarking code. It is inadequate
for measuring single instruction timings. I gave some reasons
in the post I linked and you can recognize the others if you
dig just a bit.
(With regards to computer engineering, anyone who's
done some work at the hardware level knows that if you can spare
enough space on the chip to do it, division, can be just as fast
as addition. And I've seen processors where dividing was faster
than a single shift---the NSC 32000, for example.)

[...]
Yeah hardware varies but the fundamental facts of digital division
methods are what they are and are pretty much set in stone at this
point until somebody comes up with radically improved technology.

At the hardware level, there are two solutions for just about
everything, combinatorial and sequential. Division requires
a lot of gates to be done combinatorially, so it is usually done
sequentially, or with a combination of the two. But in the end,
it's really just a question of how much real estate is available
on the chip, and what you want to use it for.

You are conveniently leaving out circuit propagation delay from
your discussion above. Once gigaherts/nanoseconds time scales were
reached this became a major factor. Check some of the recent papers
regarding division algorithms and you will see that at those time
scales performance is not (necessarily) improved by larger circuits
and that contrary to your claim one /cannot/ do division as fast
as addition regardless of available real estate (at these speeds).

KHD
 
K

Keith H Duggar

Once upon a time I could look at the specs for a CPU, and say that this
op would take 5 microseconds and that ten. And it would. But these
days with pipelining and caches it's pretty well impossible to calculate
the actual times.

The instruction "times" in that spec are expressed in clock cycles
and they are exact best case numbers including the throughput cycle
counts. Pipeline etc runtime dynamics only increase the number of
cycles required above that best case throughput.

But, yes, in principle a piece of code attempting to measure single
machine instructions may be so broken at that task as to report any
result. In fact, that is exactly what I think is happing ie that the
referenced library is measuring overhead rather than target code.
If James says he got the same result I believe him. But... It's
entirely possible that some other factor was the limit on the speed of
his benchmark - main memory bus bandwidth or something.

I believe so too and I think I know why: the benchmark library as it
stands is unsuitable for timing functions as tiny as single machine
instructions.
Without sight of the code we can't say. I'm not familiar with Sparc
anyway, so I can't help.

The code was available on his website. I was able to easily locate
the repository some weeks back but cannot locate it at the moment.
James, can you please post a current link? (There is a lot of nice
stuff in his library actually. There were also some very well done
profiling experiments on various hash and map implementations.).

KHD
 
J

James Kanze

On Jun 22, 3:58 pm, Andy Champ <[email protected]> wrote:

[...]
I believe so too and I think I know why: the benchmark library
as it stands is unsuitable for timing functions as tiny as
single machine instructions.

Practically speaking, I don't think that "timing" a single
machine instruction has any meaning in today's architectures.
I've not had the time to study the documents you cited in
detail, to see the exact details, but I did see enough to make
it clear that in the case of division (and a lot of other
things), there is a lot of pipelining going on, at several
different levels. This means that any measurements one can make
only measure the times in a specific context---the timing can be
different with different instructions surrounding the
instruction. I think I understand what you mean concerning the
unsuitablity of my benchmark harness for timing single
arithmetic instructions---it "surrounds" the instruction with
instructions which don't use any of the arithmetic units, and so
can procede without waiting, even if the arithmetic unit hasn't
finished its operation. Which would account for all operations
having the same time. (But why did they have different times on
older Sparcs, then?) On the other hand, I'm not sure what
a suitable benchmark would be, for real life problems. In the
code I write, division generally does "take" a single clock, in
the sense that the processor is able to continue, and execute
the next instruction immediately, without stalling.
The code was available on his website. I was able to easily
locate the repository some weeks back but cannot locate it at
the moment. James, can you please post a current link? (There
is a lot of nice stuff in his library actually. There were
also some very well done profiling experiments on various hash
and map implementations.).

The site seems to come and go, according to the whims of my
provider. (I just accessed it without problems, but a day or
two ago, it wasn't working.) The provider (Neuf Télécom) was
purchased, or merged, with another operator, and now uses
a different base address for personal home pages. They did not,
however, migrate any existing pages; this is something I have to
do. Having changed jobs (and the country I'm working in) not
that long ago, I've not had time to do so yet. (Hopefully this
summer, now that I've finally finished shutting down my
activities as an independent consultant in France.)
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
474,146
Messages
2,570,832
Members
47,374
Latest member
anuragag27

Latest Threads

Top