On 4/26/2011 10:40 PM, Travers Naran wrote:
...
...
The obvious possibility is a random mix of left and right shifts, with
the two directions equally probable. In that case, the hardware branch
prediction would mis-predict, on average, half the time.
There are worse distributions. Consider a program that strictly
alternates between the two possible branch outcomes, running against a
two-bit saturating branch prediction counter like those found in
several generations of Intel chips: the sequence can start by acting
opposite to the prediction, flipping the counter between its two
weakly-predicted states (b01 and b10) on every branch. The whole
alternating sequence would then be mis-predicted *every time*.
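To put a number on that, here's a rough sketch (my own toy class; a
real measurement should use a harness like JMH rather than
System.nanoTime): fill the direction array with random bits and the
branch in the timed loop is mis-predicted roughly half the time; switch
to the commented-out regular pattern and it becomes cheap. One caveat:
the JIT is free to compile that if/else into a conditional move, in
which case the difference largely disappears.

import java.util.Random;

public class ShiftBranchDemo {
    public static void main(String[] args) {
        final int n = 100_000_000;
        boolean[] direction = new boolean[n];
        Random rnd = new Random(42);
        for (int i = 0; i < n; i++) {
            direction[i] = rnd.nextBoolean();  // random mix: hard to predict
            // direction[i] = (i & 64) != 0;   // long runs: easy to predict
        }

        long x = 1;
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            if (direction[i]) {                // the branch under test
                x <<= 1;
            } else {
                x >>= 1;
            }
        }
        long elapsed = System.nanoTime() - start;
        // Printing x keeps the loop from being optimized away entirely.
        System.out.println("x = " + x + ", " + (elapsed / 1_000_000) + " ms");
    }
}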
One of the things that could've made Itanic so cool was the ability to
pre-emptively evaluate both sides of a branch, many instructions in
advance, and throw away all the alternatives but the one that "actually
happened". Unfortunately, programming around this was really hard even
for experienced compiler writers, and the chips were notoriously
power-hungry and expensive, so it never really took off.
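You can get some of that flavor in plain Java by computing what both
outcomes need and selecting the result with arithmetic instead of a
branch. To be clear, this is just a branchless analogue, not Itanium
predication, and the overflow caveat below makes it a sketch rather
than a drop-in replacement for Math.max:

public class BranchlessSelect {
    // Branchy version: the JIT emits either a conditional branch or a
    // conditional move here, at its own discretion.
    static int branchyMax(int a, int b) {
        return (a > b) ? a : b;
    }

    // Branchless version: the result is selected by masking, with no
    // branch at all. Caveat: (a - b) can overflow for extreme inputs.
    static int branchlessMax(int a, int b) {
        int diff = a - b;
        int mask = diff >> 31;   // 0 when a >= b, all ones when a < b
        return a - (diff & mask);
    }

    public static void main(String[] args) {
        System.out.println(branchyMax(3, 7) + " " + branchlessMax(3, 7));
    }
}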
When the hardware branch prediction gets it wrong, the result is a
complete pipeline flush. With deep pipelines and multiple instructions
in each pipeline stage, a pipeline flush costs the equivalent of dozens
of simple arithmetic operations.
My question is whether the measurement was from the actual program, or
from a benchmark that may differ from the real program in ways that
affect branch prediction.
My question is "who cares?"
There are lots of ways to squeeze more performance out of an
otherwise-optimal algorithm: make better use of vector operations,
migrate some or all of the work to the GPU (a massively-parallel
floating point unit), or find ways to avoid running it at all.
Measuring the cost of an if statement is pretty far down the list of
tuning choices.
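To illustrate the "avoid running it at all" option, here's a
hypothetical sketch (the class, the names, and the placeholder workload
are mine, not anything from the actual program): if the same inputs
keep coming back, caching the answers removes the whole computation,
branches and all.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class ScoreCache {
    private final Map<Long, Double> cache = new ConcurrentHashMap<>();

    // Returns the cached result when this position has been scored
    // before; only computes it the first time.
    double score(long position) {
        return cache.computeIfAbsent(position, ScoreCache::expensiveScore);
    }

    // Stand-in for whatever the real hot routine computes.
    private static double expensiveScore(long position) {
        double total = 0;
        for (int i = 1; i <= 1_000; i++) {
            total += Math.sin(position * (double) i);
        }
        return total;
    }
}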
(Having seen examples of Sanny's code in other newsgroups, I am
extremely doubtful that he's reached the "optimal algorithm" stage.
However, I haven't seen the code in question, so it's possible.)
Why bother? Your sample code only has one conditional branch,
"jump-less-than to L1". Unconditional jumps to a fixed target, such as
"jump L2" are always correctly predicted, and take about the same time
as a simple arithmetic operation.
Unless they force a cache stall. For a routine as trivial as the one
Travers posted, it's unlikely that the final 'return' would be on a
separate cache line from the start of the routine, but in a longer
routine, replacing the jump-to-return with a return might actually keep
the end of the routine out of cache.
Of course, if the code that called this routine has fallen out of
cache, the return instruction will force a cache stall, instead.
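For anyone who didn't see the earlier post, here's a stand-in for the
kind of routine being discussed (my own toy method, not the actual
sample code): the obvious compilation of the if/else has exactly one
conditional branch plus one unconditional jump over the else-arm to the
shared return, though the JIT may well fold the whole thing into a
conditional move instead.

public class BranchShapeDemo {
    // Hypothetical stand-in, not the code from the earlier post.
    static int larger(int a, int b) {
        int result;
        if (a < b) {      // the one conditional branch
            result = b;   // then-arm; jumps unconditionally to the return
        } else {
            result = a;   // else-arm; falls through to the return
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(larger(2, 5));  // try: javap -c BranchShapeDemo
    }
}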
The method is short enough that the
JVM will in-line it anywhere it is frequently called, so we don't have to
worry about the return.
Agree. Even though it's not actually magic and doesn't always produce
totally optimal (either in time or in space) code, the JIT compiler
produces surprisingly good code most of the time. Second-guessing it
down at the level of individual machine instructions is the sort of
thing that impresses bystanders but rarely accomplishes anything useful.
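That said, you don't have to take the JIT on faith: HotSpot will report
its inlining decisions if you ask. A small sketch (these are
HotSpot-specific diagnostic flags, and the output format varies by JVM
version):

public class InlineDemo {
    // Small enough that HotSpot will normally inline it at a hot call
    // site, which is the point being made above.
    static int smaller(int a, int b) {
        return (a < b) ? a : b;
    }

    public static void main(String[] args) {
        long sum = 0;
        for (int i = 0; i < 50_000_000; i++) {
            sum += smaller(i, 12345);   // hot call site
        }
        System.out.println(sum);        // keeps the loop from being elided
    }
}

Run it with:

    java -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining InlineDemo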
-o