J
James Dow Allen
Is it OK to discuss C programming in this ng?
I just stumbled on an optimization which surprised
me, though I'm sure it's "old hat" to many of you.
An interest in random-number generation was kindled
by some recent Usenet threads, so I looked into
the Mersenne twister. There are downloadable
twisters on the 'Net but for fun I typed in some
C code of my own.
It was no surprise that my code ran slower, at first,
than the downloaded code which had obviously been
carefully optimized. Comparing the codes I noticed
one difference which turned out to have a huge effect.
Here is the relevant fragment: If you turn on
FASTWAY, the Mersenne twister runs *more than
twice as fast* !! I'm not saying the fragment
here runs twice as fast: The *entire twister*,
which contains another 20 ops or so in addition
to this fragment, runs more than twice as fast!
Experimental conditions:
gcc -O3 (4.1.2)
Pentium(R) T2390
/* Fragment demonstrating a huge speed-up: */
auto unsigned long y, *p;
static unsigned long xorer[2] = {
0, 0x9908b0df,
};
/* ... */
#ifdef FASTWAY
p[0] = (p[397] ^ y >> 1) ^ xorer[y & 1];
#else
p[0] = y & 1
? (p[397] ^ y >> 1) ^ 0x9908b0df
: (p[397] ^ y >> 1);
#endif
/* End Fragment */
By the way,
p[0] = (p[397] ^ y >> 1)
^ (y & 1 ? 0x9908b0df : 0);
is *almost* as fast as FASTWAY. gcc -O3 produces
to deal with this " y & 1 ?" without branching.
I avoided the "FASTWAY", at first, since "^ 0"
seemed like a time-waster.
I'm afraid I have a decades-old prejudice against
memory accesses, but it appears that may be totally
misplaced! Why is FASTWAY so much faster? I
suppose because it avoids any conditional branches,
is that correct? (The "y & 1 ?" condition will
get a 50:50 split I think which might be worst-case
for any branch-prediction.)
Several weeks ago I suggested we post "surprising
optimizations." I hope those eager to tell us how
naive James must be about *this* optimization will
post some surprises of their own!
James Dow Allen
I just stumbled on an optimization which surprised
me, though I'm sure it's "old hat" to many of you.
An interest in random-number generation was kindled
by some recent Usenet threads, so I looked into
the Mersenne twister. There are downloadable
twisters on the 'Net but for fun I typed in some
C code of my own.
It was no surprise that my code ran slower, at first,
than the downloaded code which had obviously been
carefully optimized. Comparing the codes I noticed
one difference which turned out to have a huge effect.
Here is the relevant fragment: If you turn on
FASTWAY, the Mersenne twister runs *more than
twice as fast* !! I'm not saying the fragment
here runs twice as fast: The *entire twister*,
which contains another 20 ops or so in addition
to this fragment, runs more than twice as fast!
Experimental conditions:
gcc -O3 (4.1.2)
Pentium(R) T2390
/* Fragment demonstrating a huge speed-up: */
auto unsigned long y, *p;
static unsigned long xorer[2] = {
0, 0x9908b0df,
};
/* ... */
#ifdef FASTWAY
p[0] = (p[397] ^ y >> 1) ^ xorer[y & 1];
#else
p[0] = y & 1
? (p[397] ^ y >> 1) ^ 0x9908b0df
: (p[397] ^ y >> 1);
#endif
/* End Fragment */
By the way,
p[0] = (p[397] ^ y >> 1)
^ (y & 1 ? 0x9908b0df : 0);
is *almost* as fast as FASTWAY. gcc -O3 produces
which *I think* is a surprisingly clever wayandl $1, %edx
...
negl %edx
...
andl $-1727483681, %edx
xorl %edx, %eax
to deal with this " y & 1 ?" without branching.
I avoided the "FASTWAY", at first, since "^ 0"
seemed like a time-waster.
I'm afraid I have a decades-old prejudice against
memory accesses, but it appears that may be totally
misplaced! Why is FASTWAY so much faster? I
suppose because it avoids any conditional branches,
is that correct? (The "y & 1 ?" condition will
get a 50:50 split I think which might be worst-case
for any branch-prediction.)
Several weeks ago I suggested we post "surprising
optimizations." I hope those eager to tell us how
naive James must be about *this* optimization will
post some surprises of their own!
James Dow Allen