I've heard it said so many times that dynamic code optimizers are so
much better than static ones. Well, I have heard of a number of
isolated cases where that may make sense. However, I never said they
were. If you read again carefully, I was talking about native binary
code, and that generating it is more productive with an online
compiler than by hand in assembler.
It's really about the feasibility of using a static, offline compiler
for a specific task. If you have a very complex state machine which
gives you thousands or even millions of permutations of the generated
code, it means that at runtime you will be branching heavily to select
the correct binary to execute.
You can flatten the code with an online code generator. Even if the
code isn't "optimal", it is still better than heavily nested
conditional basic blocks.
There is something happening now which kind of makes this all moot,
and it's a little hard to beat. Since the MIPS R10000, there has been
a steady improvement in the ability of the CPU to do much of the
dynamic optimization. If you look at the Transmeta CPUs, you'll see
the same thing happening in software. However, with the latest
generation of speculatively executing, register-renaming,
branch-predicting CPUs, it's hard to optimize better than the CPU
itself.
That's why we want to avoid branching and flatten the conditional
basic blocks: mis-predicted branches are expensive on a setup like the
one you describe above.
So yes, dynamic optimization may be better, and better done by the CPU
itself.
By itself, though, that solves the problem only partially. It means
that low-level bit-twiddling is less relevant than it was a couple of
architectures ago, but still, it is not a substitute for writing good
code.
For example, you could write graphics code using a putpixel(x, y,
color) function. Or, you could write a scanline renderer which
processes fragments in order, so the address computation logic is
simpler:
*scan++ = color;
The putpixel() version of the same would look like this:
buffer[y * stride + x] = color;
What's wrong with the above?
1. There is an unnecessary multiplication
2. If this is a non-inlined function, there is the call overhead
3. The memory write request is dependent on executing the address
computation logic
4. NOTE: the "extra" addition there might be free, depending on the
addressing modes a specific architecture supports.
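To make the difference concrete, here is a minimal sketch of the two
inner loops. The names fill_span_putpixel / fill_span_scanline and the
32-bit pixel format are just illustration, not anything from the text
above:

#include <cstdint>

// putpixel-style: the address is recomputed for every fragment, and each
// write depends on that multiply-and-add chain (points 1 and 3 above).
void fill_span_putpixel(uint32_t* buffer, int stride, int y,
                        int x0, int x1, uint32_t color)
{
    for (int x = x0; x < x1; ++x)
        buffer[y * stride + x] = color;
}

// scanline-style: the address computation is hoisted out of the loop,
// and the loop just walks a pointer across the span.
void fill_span_scanline(uint32_t* buffer, int stride, int y,
                        int x0, int x1, uint32_t color)
{
    uint32_t* scan = buffer + y * stride + x0;
    for (int x = x0; x < x1; ++x)
        *scan++ = color;
}

(A good compiler may hoist the multiply in the first version too when
everything is inlined; the real damage comes when putpixel() is an
opaque, non-inlined call, as in point 2.)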
The "CPU runtime optimization" can re-order and issue the translated
microcode instructions in more optimal sequence. Still, we added so
much cruft and overhead there for no real tangible gain. We just
burned trace/code cache even if we ran at the same performance without
consideration to the bigger picture. Yay.
The biggest optimization is still between the keyboard and the chair:
the human with a living, thinking mind.
-- extra section --
Example of what the OpenGL state machine's depth compare function
would look like in C++:
bool zpass;
uint32 depth = zbuffer[x];   // depth already stored at this pixel
switch ( depthFunc )
{
    case DEPTHFUNC_LESS:     zpass = z <  depth; break;
    case DEPTHFUNC_MORE:     zpass = z >  depth; break;
    case DEPTHFUNC_LEQUAL:   zpass = z <= depth; break;
    case DEPTHFUNC_NOTEQUAL: zpass = z != depth; break;
    // .... all depth function cases ....
}
if ( zpass )
{
    process_fragment(...);
}
z += dxdz;                   // step the interpolated depth to the next fragment
Since there are 8 different depth compare functions, we have to
generate code for all of them. The above is a thoroughly braindead way
to do it. Okay, so the smart way to go about it would be to have a
different inner loop for each depth compare function, right? Right..
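A minimal sketch of that approach, assuming C++ function templates:
one tiny compare functor per depth function, so the compiler emits a
dedicated inner loop per instantiation. Less, Greater, depth_span and
the depth-write line here are made-up illustration, not anything from
OpenGL:

#include <cstdint>

struct Less    { bool operator()(uint32_t z, uint32_t d) const { return z <  d; } };
struct Greater { bool operator()(uint32_t z, uint32_t d) const { return z >  d; } };
// ... one functor per depth compare function ...

// One specialized inner loop is generated per compare functor,
// with the comparison inlined and no per-fragment switch.
template <typename DepthCompare>
void depth_span(uint32_t* zbuffer, int count, uint32_t z, uint32_t dxdz,
                DepthCompare cmp)
{
    for (int x = 0; x < count; ++x)
    {
        if (cmp(z, zbuffer[x]))
        {
            zbuffer[x] = z;          // depth write, assuming it is enabled
            // process_fragment(...);
        }
        z += dxdz;
    }
}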
OK, we did that. Now what? We have stenciling, alpha testing, alpha
blending, yada yada yada. We cannot realistically write out ALL the
combinations of these ahead of time. So we make each of them its own
function, call them in sequence, and end up with tens of function
calls PER FRAGMENT! That is even MORE idiotic than what we have above!
So what the heck are we going to do about this? At this point it looks
fairly attractive to GENERATE (!) the code dynamically. It doesn't
matter if we have 10,000,000 different inner loops to write.
The SIMPLEST mechanism to do this is to stitch the code. You write
nice, neat pieces and stitch them together like Lego blocks. You need
to specify carefully how you pass data from one stage to the next. Not
too difficult.
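Here is a minimal, hedged sketch of the stitching idea, assuming
x86-64 and a POSIX system that still allows writable-and-executable
pages (hardened systems may refuse this). A real rasterizer would
stitch fragments for the depth test, blending and so on instead of
these toy arithmetic snippets:

#include <sys/mman.h>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <vector>

int main()
{
    // Precompiled fragments (x86-64, System V ABI: first int argument in EDI,
    // result in EAX). Each fragment is a self-contained "lego block".
    const uint8_t load_arg[]  = { 0x89, 0xF8 };       // mov eax, edi
    const uint8_t double_it[] = { 0x01, 0xC0 };       // add eax, eax
    const uint8_t add_one[]   = { 0x83, 0xC0, 0x01 }; // add eax, 1
    const uint8_t ret_op[]    = { 0xC3 };             // ret

    // Stitch the selected fragments together into one flat code sequence.
    std::vector<uint8_t> code;
    auto emit = [&](const uint8_t* p, std::size_t n) { code.insert(code.end(), p, p + n); };
    emit(load_arg,  sizeof load_arg);
    emit(double_it, sizeof double_it);
    emit(add_one,   sizeof add_one);
    emit(ret_op,    sizeof ret_op);

    // Copy into an executable buffer and call it like a normal function.
    void* buf = mmap(nullptr, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return 1;
    std::memcpy(buf, code.data(), code.size());
    auto fn = reinterpret_cast<int (*)(int)>(buf);

    std::printf("%d\n", fn(20));   // prints 41: (20 * 2) + 1
    munmap(buf, 4096);
    return 0;
}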
The next step in the evolution is to generate an intermediate
representation of the operations we want to do. Then we take this
representation and transform it in meaningful ways. We could do
constant folding, strength reduction, constant propagation. We could
do instruction selection and register allocation; we could do these in
different ways, or combine some of the steps, and so on. Effectively,
we would be writing an optimizing code generator.
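A hedged, minimal sketch of that stage, assuming a toy expression-tree
IR with a single constant-folding pass (nothing here is from any
particular compiler):

#include <memory>

// Toy IR node: a constant, a runtime input slot, or a binary operation.
struct Expr
{
    enum Kind { Const, Input, Add, Mul } kind;
    int value = 0;                       // for Const
    int slot  = 0;                       // for Input
    std::shared_ptr<Expr> lhs, rhs;      // for Add / Mul
};

using ExprPtr = std::shared_ptr<Expr>;

ExprPtr constant(int v) { auto e = std::make_shared<Expr>(); e->kind = Expr::Const; e->value = v; return e; }
ExprPtr input(int slot) { auto e = std::make_shared<Expr>(); e->kind = Expr::Input; e->slot = slot; return e; }
ExprPtr binop(Expr::Kind k, ExprPtr a, ExprPtr b)
{
    auto e = std::make_shared<Expr>(); e->kind = k; e->lhs = a; e->rhs = b; return e;
}

// One transform pass: constant folding. Strength reduction, constant
// propagation, instruction selection etc. would walk the tree the same way.
ExprPtr fold(ExprPtr e)
{
    if (e->kind == Expr::Add || e->kind == Expr::Mul)
    {
        e->lhs = fold(e->lhs);
        e->rhs = fold(e->rhs);
        if (e->lhs->kind == Expr::Const && e->rhs->kind == Expr::Const)
        {
            int v = (e->kind == Expr::Add) ? e->lhs->value + e->rhs->value
                                           : e->lhs->value * e->rhs->value;
            return constant(v);          // collapse the subtree to a constant
        }
    }
    return e;
}

For example, fold(binop(Expr::Mul, constant(3), constant(4))) collapses
to a single Const node with value 12, while anything depending on an
Input slot is left alone.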
Combine static and dynamic code generation, shake (don't stir!) and
enjoy the cocktail!
