Optimization idea: put (y&1) in [] instead of if()

James Dow Allen · Nov 4, 2009

Is it OK to discuss C programming in this ng?

I just stumbled on an optimization which surprised
me, though I'm sure it's "old hat" to many of you.

An interest in random-number generation was kindled
by some recent Usenet threads, so I looked into
the Mersenne twister. There are downloadable
twisters on the 'Net but for fun I typed in some
C code of my own.

It was no surprise that my code ran slower, at first,
than the downloaded code which had obviously been
carefully optimized. Comparing the codes I noticed
one difference which turned out to have a huge effect.

Here is the relevant fragment: If you turn on
FASTWAY, the Mersenne twister runs *more than
twice as fast* !! I'm not saying the fragment
here runs twice as fast: The *entire twister*,
which contains another 20 ops or so in addition
to this fragment, runs more than twice as fast!

Experimental conditions:
gcc -O3 (4.1.2)
Pentium(R) T2390

/* Fragment demonstrating a huge speed-up: */
auto unsigned long y, *p;
static unsigned long xorer[2] = {
0, 0x9908b0df,
};
/* ... */
#ifdef FASTWAY
p[0] = (p[397] ^ y >> 1) ^ xorer[y & 1];
#else
p[0] = y & 1
? (p[397] ^ y >> 1) ^ 0x9908b0df
: (p[397] ^ y >> 1);
#endif
/* End Fragment */

By the way,
p[0] = (p[397] ^ y >> 1)
^ (y & 1 ? 0x9908b0df : 0);
is *almost* as fast as FASTWAY. gcc -O3 produces

andl $1, %edx
...
negl %edx
...
andl $-1727483681, %edx
xorl %edx, %eax

which *I think* is a surprisingly clever way
to deal with this " y & 1 ?" without branching.

I avoided the "FASTWAY", at first, since "^ 0"
seemed like a time-waster.
I'm afraid I have a decades-old prejudice against
memory accesses, but it appears that may be totally
misplaced! Why is FASTWAY so much faster? I
suppose because it avoids any conditional branches,
is that correct? (The "y & 1 ?" condition will
get a 50:50 split I think which might be worst-case
for any branch-prediction.)

Several weeks ago I suggested we post "surprising
optimizations." I hope those eager to tell us how
naive James must be about *this* optimization will
post some surprises of their own!

James Dow Allen

Noob · Nov 4, 2009

James said:
Is it OK to discuss C programming in this ng?

The thing is, your question is really not about C. ;-)

You're asking why the machine code generated for

p[0] = (p[397] ^ y >> 1) ^ xorer[y & 1];

runs faster on a specific micro-architecture than the machine code
generated for

p[0] = y & 1 ? (p[397] ^ y >> 1) ^ 0x9908b0df
: (p[397] ^ y >> 1);

If you're specifically interested in x86 and amd64 performance, then you
want comp.lang.asm.x86

If you want to know why branches (actually, branch mispredicts) are so
expensive, irrespective of architecture, then you want comp.arch

Regards.

bartc · Nov 4, 2009

James Dow Allen said:
Is it OK to discuss C programming in this ng?

Well, it's mainly endless bickering between experts, but if you must...

Here is the relevant fragment: If you turn on
FASTWAY, the Mersenne twister runs *more than
twice as fast* !!

/* Fragment demonstrating a huge speed-up: */
auto unsigned long y, *p;
static unsigned long xorer[2] = {
0, 0x9908b0df,
};
/* ... */
#ifdef FASTWAY
p[0] = (p[397] ^ y >> 1) ^ xorer[y & 1];
#else
p[0] = y & 1
? (p[397] ^ y >> 1) ^ 0x9908b0df
: (p[397] ^ y >> 1);
#endif
/* End Fragment */

By the way,
p[0] = (p[397] ^ y >> 1)
^ (y & 1 ? 0x9908b0df : 0);
is *almost* as fast as FASTWAY. gcc -O3 produces

What about:

p[0] = (p[397] ^ y >> 1) ^ (0x9908b0df * (y&1));

I suppose it comes down to whether a multiply is slower if faster than a
branch. An expression that converts 0,1 to ~0,0 would help (but I couldn't
think of one), then an xor instead of multiply can be used.

James Dow Allen · Nov 4, 2009

My post was sincere. I thought C programmers
might want to discuss C programming! (Perhaps
I should have mentioned that I *am* already aware
that some non-Pentium architectures have
C compilers.

If you're specifically interested in x86 and amd64 performance, then
you want comp.lang.asm.x86

If you want to know why branches (actually, branch mispredicts) are so
expensive, irrespective of architecture, then you want comp.arch

Well, although I think yours is just a parody post, in fact
I'm *not* interested in x86 performance and I *don't* want to learn
why branch mispredicts are expensive.

I *do* want to stimulate discussion about C programming by
C programmers. Can you recommend a ng for that?

I thought that was a great parody by Noob of
some of the topicality complaints in this ng!
(But Noob, please include a smiley-face in future;
you'd be amazed at some of the outrageous complaints
which are actually presented seriously here!)

Of course, *discussions of non-topicality* are
*always* topical!! Just as a patent lawyer can fix
an improper claim with six silly words (see below),
I could have avoided any objection by changing
each of my sentences into a *question* by prefixing
eight words:
"Would it not be non-topical to comment that ..."

(A patent lawyer can convert a claim which improperly
"patents an algorithm" into a proper claim by adding
6 words. Change:
"I claim the algorithm which ..."
to
"I claim a digital computing apparatus which
performs the algorithm which ..."
Apologies to patent lawyers for divulging this trick:
they've been charging $500 per word for these six words.)

James Dow Allen

Eric Sosman · Nov 4, 2009

bartc said:
James Dow Allen said:

Is it OK to discuss C programming in this ng?

Click to expand...

Well, it's mainly endless bickering between experts, but if you must...

Here is the relevant fragment: If you turn on
FASTWAY, the Mersenne twister runs *more than
twice as fast* !!

Click to expand...

/* Fragment demonstrating a huge speed-up: */
auto unsigned long y, *p;
static unsigned long xorer[2] = {
0, 0x9908b0df,
};
/* ... */
#ifdef FASTWAY
p[0] = (p[397] ^ y >> 1) ^ xorer[y & 1];
#else
p[0] = y & 1
? (p[397] ^ y >> 1) ^ 0x9908b0df
: (p[397] ^ y >> 1);
#endif
/* End Fragment */

By the way,
p[0] = (p[397] ^ y >> 1)
^ (y & 1 ? 0x9908b0df : 0);
is *almost* as fast as FASTWAY. gcc -O3 produces

Click to expand...

What about:

p[0] = (p[397] ^ y >> 1) ^ (0x9908b0df * (y&1));

I suppose it comes down to whether a multiply is slower if faster than a
branch. An expression that converts 0,1 to ~0,0 would help (but I
couldn't think of one),

(y & 1) - 1

then an xor instead of multiply can be used.

I suspect you mean "and."

Seebs · Nov 4, 2009

Is it OK to discuss C programming in this ng?

No, this is the group for discussing things you wish to sell in
World of Warcraft. (You might think the in-game "trade" channel
was for that, but actually "trade" is a code word for "uninformed
political debate and gay-bashing".)

/* Fragment demonstrating a huge speed-up: */
auto unsigned long y, *p;
static unsigned long xorer[2] = {
0, 0x9908b0df,
};
/* ... */
#ifdef FASTWAY
p[0] = (p[397] ^ y >> 1) ^ xorer[y & 1];
#else
p[0] = y & 1
? (p[397] ^ y >> 1) ^ 0x9908b0df
: (p[397] ^ y >> 1);
#endif
/* End Fragment */

This is not especially surprising. On modern CPUs, branches are usually
much more expensive than virtually anything else. Your optimization
removes a branch, replacing it with a lookup. The chances are that even
optimizing the subexpression out might help.

By the way,
p[0] = (p[397] ^ y >> 1)
^ (y & 1 ? 0x9908b0df : 0);
Yup.

I'm afraid I have a decades-old prejudice against
memory accesses, but it appears that may be totally
misplaced! Why is FASTWAY so much faster? I
suppose because it avoids any conditional branches,
is that correct? (The "y & 1 ?" condition will
get a 50:50 split I think which might be worst-case
for any branch-prediction.)
Yes.

Several weeks ago I suggested we post "surprising
optimizations." I hope those eager to tell us how
naive James must be about *this* optimization will
post some surprises of their own!

It's not hugely surprising, but it's a very nice piece of work -- simple,
trivially shown correct, and a huge speed improvement for a lot of common
cases.

-s

bartc · Nov 4, 2009

Eric Sosman said:
bartc said:

#ifdef FASTWAY
p[0] = (p[397] ^ y >> 1) ^ xorer[y & 1];
#else
p[0] = y & 1
? (p[397] ^ y >> 1) ^ 0x9908b0df
: (p[397] ^ y >> 1);
#endif

Click to expand...

p[0] = (p[397] ^ y >> 1) ^ (0x9908b0df * (y&1));

I suppose it comes down to whether a multiply is slower [or] faster than
a branch. An expression that converts 0,1 to ~0,0 would help (but I
couldn't think of one),

Click to expand...

(y & 1) - 1

then an xor instead of multiply can be used.

Click to expand...

I suspect you mean "and."

Yes, of course...

Stephen Sprunk · Nov 4, 2009

James said:
Experimental conditions:
gcc -O3 (4.1.2)
Pentium(R) T2390

/* Fragment demonstrating a huge speed-up: */
auto unsigned long y, *p;
static unsigned long xorer[2] = {
0, 0x9908b0df,
};
/* ... */
#ifdef FASTWAY
p[0] = (p[397] ^ y >> 1) ^ xorer[y & 1];
#else
p[0] = y & 1
? (p[397] ^ y >> 1) ^ 0x9908b0df
: (p[397] ^ y >> 1);
#endif
/* End Fragment */
...
I avoided the "FASTWAY", at first, since "^ 0"
seemed like a time-waster.
I'm afraid I have a decades-old prejudice against
memory accesses, but it appears that may be totally
misplaced! Why is FASTWAY so much faster? I
suppose because it avoids any conditional branches,
is that correct? (The "y & 1 ?" condition will
get a 50:50 split I think which might be worst-case
for any branch-prediction.)

That's it, in a nutshell. Modern CPUs are pretty good at hiding the
latency of predictable/cacheable loads and of predictable branches.
Modern compilers are also good at hoisting loads, whereas it's much more
difficult to hoist a branch.

Your "slow" version had an unpredictable branch; your "fast" version
replaced that with a cacheable, hoistable load. You are using a
reasonably modern compiler and CPU, so a performance improvement is not
surprising.

S

James Dow Allen · Nov 5, 2009

That's it, in a nutshell. Modern CPUs are pretty good at hiding the
latency of predictable/cacheable loads and of predictable branches.
Modern compilers are also good at hoisting loads, whereas it's much more
difficult to hoist a branch.

Thank you. Still, it seemed remarkable that *more than half* the
time spent by the entire twister (much longer piece of code than
the posted fragment) was wasted by this "unhoistable" branch,
especially since the FASTWAY would have been slower, I think,
on many Jurassic-era processors.

I shared my surprise altruistically(!) thinking others might want
to be reminded to look for this (unusual?) optimization idea.

Any other not-so-obvious optimization ideas?

Stephen Sprunk "God does not play dice." --Albert Einstein
CCIE #3723 "God is an inveterate gambler, and He throws the
K5SSS dice at every possible opportunity." --Stephen Hawking

Retrocausality is a fascinating possibility. When the method
of the Old One is finally revealed, I wonder if Einstein will
turn out to be right, after all.

James

Flash Gordon · Nov 5, 2009

Noob said:
James said:

Is it OK to discuss C programming in this ng?

Click to expand...

The thing is, your question is really not about C. ;-)

You're asking why the machine code generated for

p[0] = (p[397] ^ y >> 1) ^ xorer[y & 1];

runs faster on a specific micro-architecture than the machine code
generated for

p[0] = y & 1 ? (p[397] ^ y >> 1) ^ 0x9908b0df
: (p[397] ^ y >> 1);

If you're specifically interested in x86 and amd64 performance, then you
want comp.lang.asm.x86

If you want to know why branches (actually, branch mispredicts) are so
expensive, irrespective of architecture, then you want comp.arch

Also if you want to know why it is architecture dependent, because some
architectures don't have pipelines, some have delayed branches, so it
can do the test, decide whether or not to branch, and then do something
else that is always required before the branch takes effect.

Don't assume all architectures work like the ones you know.

Dik T. Winter · Nov 6, 2009

About:

#ifdef FASTWAY
p[0] = (p[397] ^ y >> 1) ^ xorer[y & 1];
#else
p[0] = y & 1
? (p[397] ^ y >> 1) ^ 0x9908b0df
: (p[397] ^ y >> 1);
#endif

> Thank you. Still, it seemed remarkable that *more than half* the
> time spent by the entire twister (much longer piece of code than
> the posted fragment) was wasted by this "unhoistable" branch,
> especially since the FASTWAY would have been slower, I think,
> on many Jurassic-era processors.

I do not think so. On many early CDC's the first version would be
faster, even if you compiled the code fragment naively to machine
code. The forward branches would kill performance (they would void
the instruction stack leaving a delay comparable to the time for
about 50 instructions, while a load was about 12 instructions worth).

Or do you not think a mainframe from the early 60's is Jurassic-era?

Naive thinking can lead to bad programming and also extrapolating from
past experience to new machines can lead to slow programs. Consider
the relative cost of floating-point multiplication vs floating-point
addition and subtraction. I have seen routines where algorithms using
many multiplications where converted to algorithms using far fewer
multiplications and many more additions, for a total increase in the
number of operations. While this was faster on the first CDC I did
use (addition was 3 cycles, multiplication 57), it failed on later
generations (addition still 3 cycles but multiplication now 4 cycles).

James Dow Allen · Nov 7, 2009

About:

#ifdef FASTWAY
p[0] = (p[397] ^ y >> 1) ^ xorer[y & 1];
#else
p[0] = y & 1
? (p[397] ^ y >> 1) ^ 0x9908b0df
: (p[397] ^ y >> 1);
#endif

Click to expand...

...
> Thank you. Still, it seemed remarkable that *more than half* the
> time spent by the entire twister (much longer piece of code than
> the posted fragment) was wasted by this "unhoistable" branch,
> especially since the FASTWAY would have been slower, I think,
> on many Jurassic-era processors.

I did write *many* Jurassic-era processors; I did not write *all*
Jurassic-era processors.

I do not think so. On many early CDC's the first version would be
faster,..

By "many" I think you mean the 6600. They kept that in the Rad Lab
in my day, and we undergrads had to use the 6400 which had no
pipeline. Among IBM mainframes in the mid-1970's, only the
million-dollar-plus models had any real pipeline.

Or do you not think a mainframe from the early 60's is Jurassic-era?

What I think is that I wrote *many* and you
misinterpreted that to mean *all*.

Naive thinking can lead to bad programming and also extrapolating from
past experience to new machines can lead to slow programs.

Everyone seems eager to impugn my optimization talents! For the
first class using the afore-mentioned CDC6400, the Professor gave me
an A+ as my programs outperformed his for speed. I *did* write
the World's Fastest JPEG Compressor. Etc.

I once sat at a desk next to a circuit designer who was, supposedly,
one of only two people in the world to be able to debug the 6600
register-reservation scoreboard (the other was Seymour Cray!)
I did *not* post my message because I wanted someone to explain
pipelining to me! I *did* post, altruistically, because I
thought that the huge performance advantage shown in the fragment
might be of interest to others who seek speed. Perhaps the
post should have been addressed only to "naive thinkers."

James Dow Allen

function call optimization question	6	Sep 12, 2007
review of the "container library", part 1/?	18	Mar 1, 2011
Access individual bytes of a 4 byte long (optimization)	17	Aug 10, 2007
How do ensure atomic update of a shared global in a multi-threadedapplication?	7	Mar 16, 2011
Aliassing and reinterpret_cast and optimization	4	Apr 15, 2008
algorithm, optimization, or other problem?	2	Feb 21, 2006
jsCreator2 - CachedRowSetXImpl - returns 4294967297 instead of 1	2	Feb 5, 2006
Performance measurement and optimization levels	3	Jul 21, 2004

Optimization idea: put (y&1) in [] instead of if()

James Dow Allen

Noob

bartc

James Dow Allen

Eric Sosman

Seebs

bartc

Stephen Sprunk

James Dow Allen

Flash Gordon

Dik T. Winter

James Dow Allen

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads