I'm developing an interpreter for a stack-based bytecode language, which
uses dynamic types. It's a simple interpreter with no JIT, but manages to
run only about 4-5x slower than un-optimised C code. This is for simple
numeric benchmarks; with anything more elaborate, the difference seems to
narrow (but I haven't yet got anything significant that can be directly
compared).
However, to get this I needed to use optimised C in the interpreter plus
some non-portable features (ASM for example), otherwise it might be nearer
10x slower in standard (but still optimised) C.
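For a rough idea of the shape of the inner loop, here is a minimal plain-C
sketch of a switch-dispatched, dynamically-typed stack VM (the opcode names
and Value layout are just illustrative, not my actual VM):

#include <stdint.h>

typedef struct { int tag; union { int32_t i; double d; } u; } Value;

enum { OP_PUSH_I, OP_ADD_I, OP_MUL_I, OP_HALT };

Value run(const uint8_t *pc, const int32_t *consts)
{
    Value stack[256], *sp = stack;
    for (;;) {
        switch (*pc++) {
        case OP_PUSH_I:              /* push an integer constant */
            sp->tag = 0;
            sp->u.i = consts[*pc++];
            sp++;
            break;
        case OP_ADD_I:               /* a real dynamic VM would check tags here */
            sp--;
            sp[-1].u.i += sp[0].u.i;
            break;
        case OP_MUL_I:
            sp--;
            sp[-1].u.i *= sp[0].u.i;
            break;
        case OP_HALT:
            return sp[-1];
        }
    }
}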
my faster plain interpreters are typically still around 10x slower than
native, but this is with a plain C interpreter (usually, I do an
interpreter for the portable backend, with a JIT on certain targets to
make things a little faster).
most of my recent JITs tend to produce a mix of threaded code (function
calls) and short segments of direct machine code (mostly using
prefabricated ASM code globs), hence, "naive".
while fancier JITs are possible, they are a lot more complicated and a
lot more painful to maintain and debug (especially if multiple targets
are involved).
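roughly, the "threaded code as function calls" half amounts to something
like this (a minimal sketch; the structs and names are made up for
illustration):

#include <stdint.h>

typedef struct VM VM;
typedef void (*OpFn)(VM *vm);

struct VM {
    int32_t stack[256];
    int     sp;
};

static void op_add(VM *vm) { vm->sp--; vm->stack[vm->sp - 1] += vm->stack[vm->sp]; }
static void op_mul(VM *vm) { vm->sp--; vm->stack[vm->sp - 1] *= vm->stack[vm->sp]; }

/* the "JIT" builds an array of handler pointers (with short machine-code
   globs spliced in where it helps); running it is just a series of calls */
void run_threaded(VM *vm, OpFn *ops, int n)
{
    for (int i = 0; i < n; i++)
        ops[i](vm);                  /* one indirect call per op */
}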
a lot of my benchmarks had been things like running sorting algorithms
(sorting largish arrays of numbers, ...), as well as some number of
micro-benchmarks (calling functions or methods in a loop, ...).
for things like array-sorting and similar, it is about 2x-3x slower than
C, though there are other factors which could influence things (32-bit
raw pointers with no bounds-checks vs 64-bit array references with
bounds-checked accesses, ...), in addition to the extra operations for
variable load/store and stack-management.
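as a rough sketch of the bounds-check overhead being referred to (the
array struct and the trap function here are hypothetical):

#include <stdint.h>

typedef struct { int32_t *data; int32_t len; } ArrayI32;

extern void vm_throw_bounds(void);   /* hypothetical VM error path */

/* a bounds-checked load costs an extra compare+branch on every access */
static inline int32_t arr_get(const ArrayI32 *a, int32_t i)
{
    if ((uint32_t)i >= (uint32_t)a->len)
        vm_throw_bounds();
    return a->data[i];
}

/* versus the raw-pointer access the C version gets away with */
static inline int32_t raw_get(const int32_t *p, int32_t i)
{
    return p[i];
}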
cross-language function calls are also a bit expensive due to various
factors (mostly involving argument marshaling and so on).
but, anyways, consider you have something like: "y=m*x+b;"
in a stack-based IL, this might look like:
LLOAD_I m
LLOAD_I x
MUL_I
LLOAD_I b
ADD_I
LSTORE_I y
whereas a register IR could do:
MUL_I t0, m, x
ADD_I y, t0, b
and, comparatively, a naive JIT could produce fewer instructions for the
register case, ...
for example, naive stack-JIT output:
mov eax, [ebp-12] ;LLOAD_I
mov [esp+4], eax
mov eax, [ebp-16] ;LLOAD_I
mov [esp+0], eax
mov eax, [esp+4] ;MUL_I
mov ecx, [esp+0]
imul eax, ecx
mov [esp+4], eax
mov eax, [ebp-8] ;LLOAD_I
mov [esp+0], eax
mov eax, [esp+4] ;ADD_I
mov ecx, [esp+0]
add eax, ecx
mov [esp+4], eax
mov eax, [esp+4] ;LSTORE_I
mov [ebp-20], eax
vs, naive register-JIT output:
mov eax, [ebp-12] ;MUL_I
mov ecx, [ebp-16]
imul eax, ecx
mov [ebp-24], eax
mov eax, [ebp-24] ;ADD_I
mov ecx, [ebp-8]
add eax, ecx
mov [ebp-20], eax
these differences largely disappear if the JIT is smart enough to use a
register allocator and peephole optimizer, but assumed here is a JIT
that is effectively too stupid to use these.
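for illustration, a peephole pass of the sort a smarter JIT would use
could collapse the "store to a stack slot, then immediately reload the
same slot" pairs visible in the naive stack-JIT output above (a minimal
sketch over a made-up IR):

#include <stddef.h>

typedef enum { I_STORE_SLOT, I_LOAD_SLOT, I_OTHER } Kind;
typedef struct { Kind kind; int slot; } Ins;

/* drop a reload when it immediately follows a store to the same slot:
   the value is still in the register, so only the store is kept */
size_t peephole(Ins *code, size_t n)
{
    size_t out = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 1 < n &&
            code[i].kind == I_STORE_SLOT &&
            code[i + 1].kind == I_LOAD_SLOT &&
            code[i].slot == code[i + 1].slot) {
            code[out++] = code[i];   /* keep the store (the slot may be live) */
            i++;                     /* skip the redundant reload */
            continue;
        }
        code[out++] = code[i];
    }
    return out;                      /* new instruction count */
}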
though, granted, a person can be "clever" and squeeze a little more
speed out of the stack-machine, say, via adding magic like:
LL2_MUL_I m, x ;load and multiply m and x, pushing result
LL_ADD_I b ;add b to top-of-stack
LSTORE_I y ;store into y
which can make things a little faster, mostly at the cost of adding
large numbers of special cases.
this works, but makes things a lot more hairy.
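as a sketch of what such fused handlers might look like inside a plain-C
interpreter loop (the opcode layout and names are made up for
illustration):

#include <stdint.h>

enum { OP_LL2_MUL_I, OP_LL_ADD_I, OP_LSTORE_I, OP_HALT };

void run_fused(const uint8_t *pc, int32_t *locals)
{
    int32_t stack[64], *sp = stack;
    for (;;) {
        switch (*pc++) {
        case OP_LL2_MUL_I:           /* load two locals, multiply, push */
            *sp++ = locals[pc[0]] * locals[pc[1]];
            pc += 2;
            break;
        case OP_LL_ADD_I:            /* add a local to top-of-stack */
            sp[-1] += locals[*pc++];
            break;
        case OP_LSTORE_I:            /* pop into a local */
            locals[*pc++] = *--sp;
            break;
        case OP_HALT:
            return;
        }
    }
}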
Of course, for real applications, full of string processing,
file-handling and library calls, the difference between executing this
bytecode and even the optimised C equivalent is generally not significant
(it takes 150 msec or so to compile the largest source file I have; C
could do it in effectively 0 msec, but I wouldn't really notice the
difference!)
yeah.
it depends on what one is doing.
performance isn't really a huge issue at present, as it is mostly used
for things like game-logic tasks (enemy AIs, ...), some UI stuff, and
things like 2D and 3D animation tasks, ...
mostly this consists of function calls, messing with objects and arrays,
some amount of vector math, ...
the present form of the language more-or-less resembles ActionScript3
(mostly similar syntax and semantics, ...).
its current backend is a (mostly) statically-typed stack-machine (it
started out dynamically-typed, but static types were retrofitted onto it
afterwards).
moving the main script language/VM over to a register-based, fully
statically-typed backend has been considered a few times, but not enough
effort has gone into this front to actually get anywhere with it.
a place where speed did start looking to be an issue, though, was the
possibility of using script-code for video processing, where pixel
manipulation has to be pretty fast so as not to drive performance into
the ground.
so, there was some effort to try to design/implement a backend that
could hopefully be C-competitive for this use-case (built around a
vector-oriented register IR).
however, this fizzled out somewhat, and most of this sort of code
remains as plain C (generally compiled with optimizations, micro-optimized,
and using SSE intrinsics, ...), mostly as my stuff still tends to fall a
bit behind C here, and even then C has a hard time keeping up (such as
when running chroma-key filters, compositing multiple video streams, and
re-encoding to another output stream, where doing this CPU-side tends to
quickly go sub-real-time, ...).
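the sort of SSE-intrinsic per-pixel code in question, as a minimal sketch
(the function and the simple 50/50 blend are just illustrative, not the
actual filter code):

#include <emmintrin.h>   /* SSE2 */
#include <stddef.h>
#include <stdint.h>

/* average two RGBA8 buffers, 16 bytes (4 pixels) at a time; a real filter
   would handle the tail bytes and do something smarter than a 50/50 mix */
void blend_avg_rgba8(uint8_t *dst, const uint8_t *a, const uint8_t *b,
                     size_t n)
{
    for (size_t i = 0; i + 16 <= n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_avg_epu8(va, vb));
    }
}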
for real-time use, it mostly ends up boiling down to shoving everything
over to the GPU and doing much of the rest of the "real work" via
fragment shaders (IOW: moderately fast codecs and a lot of PBOs and
similar, as per-pixel filtering / image scaling / blending / ... is just
so much faster on the GPU).
it started looking a bit like there was little hope of making the
script-code performance competitive within any sane level of effort.
this basically means a lot of performance-sensitive stuff (image/video
processing, A/V codecs, 3D rendering, ...) mostly still needs to remain
as C (with most real-time per-pixel stuff and effects being done on the
GPU).
never mind being able to do something like:
var fragShader:string="""
uniform sampler2D texFoo;
uniform sampler2D texBar;
...
""";
but this isn't quite the same thing...
or such...