Is this anything more than received wisdom? Does anyone have a
concrete example (read: generated machine code) showing *how*
rearranging evaluation order can make an operation more efficient?
As an undergrad, I once had a summer job where I was responsible for
writing some hand-optimized assembler to replace two C subroutines that
had been found to be a bottleneck in a scientific calculation. I no
longer remember the concrete details you're asking for, but I do
remember the general strategies I used, and they all depended upon
rearranging the evaluation order.
On the machine I was writing for, there was one component that could
handle either a division or a sqrt() operation in parallel with
multiplications and additions. The main loop required one division,
followed by a sqrt(), followed by another division. Division and sqrt()
both took much longer than either a multiplication or an addition, so
one of the first things I did was rearrange the evaluation so that the
first division started as early as possible, was immediately followed
by the sqrt(), and was then followed by the second division. I also
rearranged things so that as many multiplications and additions as
possible were done between the start of the first division and the
completion of the sqrt().
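Here's a made-up C fragment (not the original code; the names and the
arithmetic are invented) showing the kind of rearrangement I mean at
the source level: since the division doesn't depend on the
multiply/add work, starting it first lets the cheap operations run
while the divide/sqrt unit is busy.

    #include <math.h>

    /* Hypothetical fragment.  The slow division is issued first, so
       the cheap multiplications and additions below execute while
       the divide/sqrt unit is busy with it. */
    double hot_path(double p, double q,
                    double x1, double y1, double x2, double y2)
    {
        double a = p / q;            /* slow: start it early         */
        double b = x1*y1 + x2*y2;    /* cheap: overlaps the division */
        return sqrt(a) * b;          /* slow: sqrt issues as soon as */
                                     /* the division completes       */
    }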
Another key aspect of this machine was a component that could carry out
the register-based equivalent of fma() in three clock cycles, and could
be working on three different fma() operations at the same time, if they
were each started on a different clock cycle. Because the subroutine was
carrying out matrix multiplications, I could easily arrange to keep that
component busy almost full time, but doing so required rearranging the
order of evaluation extensively.
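Roughly speaking (this is an illustrative sketch, not the code I
actually wrote, and I'm assuming 3x3 matrices here), keeping such a
unit busy means feeding it independent fma() chains, so that a new
fma() can start every cycle while the earlier ones are still in
flight:

    #include <math.h>

    /* Three independent accumulators give three independent fma()
       dependency chains.  With a three-cycle fma() latency, one fma()
       from each chain can be in flight at any time, so the unit never
       has to wait for a previous result. */
    void mat3_mul_row(const double a[3], const double b[3][3],
                      double out[3])
    {
        double c0 = 0.0, c1 = 0.0, c2 = 0.0;
        for (int k = 0; k < 3; ++k) {
            c0 = fma(a[k], b[k][0], c0);  /* independent of c1, c2 */
            c1 = fma(a[k], b[k][1], c1);  /* independent of c0, c2 */
            c2 = fma(a[k], b[k][2], c2);  /* independent of c0, c1 */
        }
        out[0] = c0;  out[1] = c1;  out[2] = c2;
    }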
Finally, one of the key things I could do which the compiler was not
able to do, was to treat three different registers as the elements of a
3-component array. This helped to drastically reduce the number of slow
load and store operations, but it meant that a naive implementation of
the subroutine would have required three times as many registers as were
actually available. I dealt with this by rearranging the order of
evaluation so that each register could be used to represent the
value of several different variables during the course of the
subroutine, by making sure that the periods of time during which the
register represented each variable didn't overlap. IIRC, if there had
been even a single additional register, I could have completely avoided
using RAM for any of the intermediate steps; as it was, I think there
was only one load and one store per pass through the main loop.
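In C terms the first half of that trick looks something like the
sketch below (again invented for illustration): naming the three
components as scalars lets them live in registers, so the vector is
loaded and stored once rather than bouncing through memory on every
use. The second half - reusing the same registers for other variables
with non-overlapping lifetimes - has no direct C analogue, since
that's the register allocator's job.

    #include <math.h>

    /* Illustrative only.  x, y and z stand in for the three registers
       holding the 3-component vector; the only memory traffic is the
       three loads at the top and the three stores at the bottom. */
    void normalize3(double v[3])
    {
        double x = v[0], y = v[1], z = v[2];
        double inv = 1.0 / sqrt(x*x + y*y + z*z);  /* div + sqrt */
        v[0] = x * inv;
        v[1] = y * inv;
        v[2] = z * inv;
    }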
It was my first and last project writing in that assembly language. I
documented as accurately as I could precisely how I had designed it, but
I'm sure it was almost completely unmaintainable. My supervisor expected
me to spend the entire summer on it. I completed it in a couple of
weeks, and sped up the subroutine by a factor of 3.5, which disappointed
me - I'd been expecting a factor of 5. But the relevant subroutine was
going to be keeping hundreds of processors busy with a nearly 100% duty
cycle for several years, so that speed-up easily paid for the cost of my
salary. And most of it would have been impossible if I'd not been free
to rearrange the order of evaluation.
All of my rearrangements conformed to the as-if rule, however, so I'm
not sure they're directly relevant to this discussion. The real issue is
whether declaring certain kinds of behavior undefined makes it easier to
optimize code by mechanisms that would violate the as-if rule if the
behavior were defined. I believe this is the case, but I don't have any
anecdotes that demonstrate that fact.
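For what it's worth, the stock example people usually give (not one
I've measured myself) is signed-integer overflow:

    /* Because signed overflow is undefined, a compiler may fold this
       entire function to "return 1".  If overflow were defined to
       wrap around, the correct result for x == INT_MAX would be 0,
       and the as-if rule would forbid that folding. */
    int plus_one_is_bigger(int x)
    {
        return x + 1 > x;
    }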