From: (e-mail address removed) (Paul Hsieh)
Date: Sat, Apr 29 2006 3:57 pm
> I didn't read your whole page but had a look at the table in the
> section "Strictly for beginners". Can you explain why would
> "x = y << 3" be faster than "x = y * 8" ? Or why would
> "if( ((a-b)|(c-d)|(e-f))==0 )" be faster than
> "if( a==b && c==d && e==f )" ?
It depends on your compiler. However, the assumption is that you
aren't in control of the quality of your compiler, and hence, it may
not perform the relevant transformation for you. Certainly for some
compilers it may make no difference.
The idea for x = y << 3 is to let the compiler use the CPU's faster
integer shift operations (a clock or two) instead of its slower
multiplier (usually upwards of 5 clocks). The x86 includes a *8
operation in its address units, but it still usually costs an extra
cycle of latency between the address and integer units (so it's
effectively two clocks). This is probably the easiest case where a good modern
compiler will make the transformation for you, and you probably don't
need to worry about it.
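As a minimal sketch of the two forms (the function names here are made up for illustration, and I'm assuming unsigned 32-bit operands so that both versions wrap the same way on overflow):

```c
#include <stdint.h>

/* Hypothetical helper: multiply by 8 using the multiplication operator. */
uint32_t times8_mul(uint32_t y)
{
    return y * 8u;
}

/* Hypothetical helper: multiply by 8 using a left shift by 3.
   For unsigned operands this gives the same result as y * 8u,
   including when the result wraps modulo 2^32. */
uint32_t times8_shift(uint32_t y)
{
    return y << 3;
}
```

A good compiler will likely emit the same instruction for both, so check the generated assembly before bothering to rewrite by hand.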
The second case is a bit more subtle. First you should see that the
two tests are equivalent: the OR of the differences is zero only when
every difference is zero. Originally it was thought that
short-circuiting was better because it performed fewer actual ALU
operations on average.
However, with the advent of super-scalar and out of order execution
CPUs, it turns out that ALU speed has gone way up relative to branch
control (which at *best* has stayed the same, but sometimes gets much
worse due to mispredictions). So this calculation: (a-b)|(c-d)|(e-f)
leverages everything that modern CPUs have been geared towards
calculating with maximum performance, leaving only one conditional
operation: if ( (...) == 0). By contrast, if (a==b && c==d && e==f) is
the equivalent of if (a==b) if (c==d) if (e==f), which is 3 conditional
operations.
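Here is a rough sketch of the two forms side by side. The function names are made up, and the use of unsigned types is my assumption: unsigned subtraction wraps, so each difference is zero exactly when its operands are equal, whereas signed subtraction could overflow.

```c
#include <stdint.h>

/* Short-circuit form: up to three conditional tests, each of which
   the compiler may turn into a branch. */
int all_equal_branchy(uint32_t a, uint32_t b,
                      uint32_t c, uint32_t d,
                      uint32_t e, uint32_t f)
{
    return a == b && c == d && e == f;
}

/* Combined form: three subtractions and two ORs, then one test.
   The OR of the differences is zero only if every difference is zero. */
int all_equal_or(uint32_t a, uint32_t b,
                 uint32_t c, uint32_t d,
                 uint32_t e, uint32_t f)
{
    return ((a - b) | (c - d) | (e - f)) == 0;
}
```

Since none of the comparisons has side effects, some compilers may already generate branch-free code for the first form; whether the rewrite helps is something to measure on your own compiler and CPU.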
The short-circuit method *may* be faster if a != b with a high degree
of probability. The problem is that the early exit is then a control
transfer taken after very little work. That is to say, it decreases
the amount of "meat" code relative to the branch code. Many modern
processors are not well designed for that kind of behaviour -- they
are more geared towards the assumption that there is some minimum
number of ALU operations per branch. So even in this case, the CPU may
not benefit from the best case of short-circuiting.