Arrays vs Buffers

D

daniel.w.gelder

As an old-tymey C and C++ programmer I find Java very nice for
implementing algorithms in, with its garbage collection and array
bounds checking. However, as said old-tymey programmer, I end up
wanting to challenge time and space by building a super optimized
version of said algorithm.

I find Java not very good with this, as the array bounds checking seems
to be a tremendous bottleneck when you're trying to write close to the
metal. The requirement that arrays and objects initialize to clear
memory is no slowdown since I can just use the factory model and cache
the memory for such things.

I looked into Buffers, and they seem to be designed for the purpose of
making reads and writes more deterministic. Do any JVMs actually use
the properties of buffers to let the outputted JIT code run directly?
Or would that be asking too much.

HOW do I get close to the metal in Java? Or at least pretend to?

Thanks.
Dan
 
S

Stefan Ram

R

Robert Klemme

As an old-tymey C and C++ programmer I find Java very nice for
implementing algorithms in, with its garbage collection and array
bounds checking. However, as said old-tymey programmer, I end up
wanting to challenge time and space by building a super optimized
version of said algorithm.

Are you trying to optimize to see how fast Java can go or do you have
actual performance problems?
I find Java not very good with this, as the array bounds checking seems
to be a tremendous bottleneck when you're trying to write close to the
metal.

As Stefan pointed out already, JVM's nowadays do some optimizations in
this area. Since you write "seems" I assume at the moment you did not
yet have evidence that backs this.
The requirement that arrays and objects initialize to clear
memory is no slowdown since I can just use the factory model and cache
the memory for such things.

Actually you cannot cache memory in Java but you can cache objects.
However, you might be surprised that this *may* actually have adversary
effects depending on your algorithm. Java's memory handling and GC is
quite advanced and optimized for dealing with a large percentage of
short lived objects. Unfortunately I don't have my Java bookmarks
around but with a bit of research you'll find those articles at Sun's
and IBM's web site.
I looked into Buffers, and they seem to be designed for the purpose of
making reads and writes more deterministic. Do any JVMs actually use
the properties of buffers to let the outputted JIT code run directly?
Or would that be asking too much.

I'm sorry, I'm not sure I understand you completely here. Can you
rephrase? To try an answer: Buffer is part of NIO which is intended for
optimizing IO speed. I can't see a direct relation to the JVM running
JIT compiled code here. This code doesn't even need to reside on disk
to be run. Also, we're talking about to layers here, Java and the JVM
implementation. Although of course certain features of the JVM are used
to implement Java features I'm sure there's rarely a direct connection,
and especially won't the JVM use Java classes for it's work - rather the
other way round.
HOW do I get close to the metal in Java? Or at least pretend to?

JNI is always an option if you really need the speed.

Kind regards

robert
 
C

Chris Uppal

As an old-tymey C and C++ programmer I find Java very nice for
implementing algorithms in, with its garbage collection and array
bounds checking. However, as said old-tymey programmer, I end up
wanting to challenge time and space by building a super optimized
version of said algorithm.

And presumably the single biggest thing you consider is cache (and perhaps
pipeline) behaviour ?

I find Java not very good with this, as the array bounds checking seems
to be a tremendous bottleneck when you're trying to write close to the
metal.

Leaving aside the possibility, mentioned by other, that the bounds checking is
often optimised out, have you considered how much overhead it would add even if
all the checks were left in ? Taking into account the fact that the bounds
will be in L1 cache at worst, that the index will be in register, and that you
are running on a machine with pipelined execution and with speculative
execution (I'm assuming a modern P-class chip); the bounds check may actually
be free.

The requirement that arrays and objects initialize to clear
memory is no slowdown since I can just use the factory model and cache
the memory for such things.

It may be worth caching large arrays if your code doesn't require the
initialise-to-zero semantics. I wouldn't bother with pooling other objects for
two reasons -- one is that it interferes with GC and other optimisations, so it
may cause your code to slow down. The other is that I suspect that eliminating
redundant field initialisation code is one of the optimisations the JIT
performs.

-- chris
 
D

daniel.w.gelder

The requirement that arrays and objects initialize to clear
It may be worth caching large arrays if your code doesn't require the
initialise-to-zero semantics. I wouldn't bother with pooling other objects for
two reasons -- one is that it interferes with GC and other optimisations, so it
may cause your code to slow down. The other is that I suspect that eliminating
redundant field initialisation code is one of the optimisations the JIT
performs.

I'm doing graphics programming, not "business logic". Right now I'm
deep in an algorithm to compare and operate on multiple lists of
multiple polygons with multiple holes going to arbitrary depths. Should
I use Linked Lists? Or arrays? Or both? Why? Where? Actually, the
simplest way is to put the data in a two-dimensional array that has its
own linked-list logic. Stuff like this was what C was designed for.

Given this two-dimensional array, how do I avoid the overhead of having
the JITter check its bounds every time? Well, I'm obviously going to
have to create a one-dimensional array and use the old
[row*width+column] access trick, since if I let Java do the [][] thing,
it's going to be checking for a null pointer in the row field every
time, don't want that.

Next problem: the jitter may be able to analyze a function like this:

void function(int[] array, int n)
{
for (int i = 0; i < n; i++)
{
something(array);
}
}

....and see that the only one real check needed, specifically whether
"int[] array" is null and of length >= n, can be moved outside the
loop. Goodie. But what about this?

void function(int[] array, int n)
{
for (int i = 0; i < n; i++)
{
something(array, i);
}
}

The check can be moved outside, but does the something() call have to
do the same check if *it* accesses array?

Is there an obscure "contract" anywhere designed to enumerate the
conditions where the array needs to be rechecked?

As for JNI, J-N-O. I'd rather just run j2c and never even look back.

Dan
 
C

Chris Uppal

I'm doing graphics programming, not "business logic".

I don't think anyone replying in this thread had jumped to any conclusions at
all about what kind of programming you were doing.

Right now I'm
deep in an algorithm to compare and operate on multiple lists of
multiple polygons with multiple holes going to arbitrary depths. Should
I use Linked Lists? Or arrays? Or both? Why? Where? Actually, the
simplest way is to put the data in a two-dimensional array that has its
own linked-list logic. Stuff like this was what C was designed for.

Fair enough. And probably a good idea if the dataset is large or the access is
performance critical.

Given this two-dimensional array, how do I avoid the overhead of having
the JITter check its bounds every time?

I doubt whether that overhead is significant in relation of the overhead of an
extra level of indirection on (potentially) every array access.

Well, I'm obviously going to
have to create a one-dimensional array and use the old
[row*width+column] access trick, since if I let Java do the [][] thing,
it's going to be checking for a null pointer in the row field every
time, don't want that.

I doubt whether that overhead is significant in relation of the overhead of an
extra level of indirection on (potentially) every array access.

I.e. (and on the continuing assumption that this is performance-critical code)
I think you are doing the right thing, but for the wrong reasons.

Next problem: the jitter may be able to analyze a function [...]
But what about this?

void function(int[] array, int n)
{
for (int i = 0; i < n; i++)
{
something(array, i);
}
}
The check can be moved outside, but does the something() call have to
do the same check if *it* accesses array?


No idea. Probably. Unless the call to something() gets inlined.

I repeat: have you any evidence to suggest that these checks matter /in
practise/ ? Maybe it does, I don't know. I would expect (admittedly with no
more evidence than you have yet adduced ;-) that the effect would be trivial at
worst.

Is there an obscure "contract" anywhere designed to enumerate the
conditions where the array needs to be rechecked?

Nope. Or rather, yes -- every array access has to be bounds checked (but, of
course, implementations are allowed to cheat if they can get away with it
undetectably).

-- chris
 
D

daniel.w.gelder

I repeat: have you any evidence to suggest that these checks matter /in
practise/ ? Maybe it does, I don't know. I would expect (admittedly with no
more evidence than you have yet adduced ;-) that the effect would be trivial at
worst.

I went ahead and tried that...unfortunately I can't get gcvt() to work
right in stdlib...maybe someone knows a better way to get the time out.

Thx

********* OK, here's the Java file: ***********

public class TestClass {
int nextPointer(int[] array, int point)
{
return array[point];
}
public TestClass()
{
int[] array = new int[3000];
for (int i = 0; i < 3000; i++)
{
array = (int)(Math.random() * 3000d) % 3000;
}

for (int times = 0; times < 10; times++) // interp vs. comp
{
long start = System.currentTimeMillis();
int pointer = 0;
for (int i = 0; i < 10000000; i++)
{
pointer = nextPointer(array, pointer);
}
System.out.println(System.currentTimeMillis() - start);
}
}
}

********** And here's a C++ file: **********

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int nextPointer(int* array, int point);
int nextPointer(int* array, int point)
{
return array[point];
}

int main(void)
{
printf("Hello World, this is CodeWarrior!\n");

int* array = new int[3000];
for (int i = 0; i < 3000; i++)
{
array = (int)(rand()) % 3000;
}

for (int times = 0; times < 10; times++) // interp vs. comp
{
clock_t start = clock();

int pointer = 0;
for (int i = 0; i < 10000000; i++)
{
pointer = nextPointer(array, pointer);
}

printf(gcvt(difftime(clock(), start) * (double)1000, 5, new
char[100]));
printf("\n");
}

return 0;
}
 
C

Chris Uppal

I went ahead and tried that...unfortunately I can't get gcvt() to work
right in stdlib...maybe someone knows a better way to get the time out.

I changed your code around a bit, increased the nuimber of loops in the inner
loop, and re-arranged the code to allow the JIT to work without dependng on
on-stack replacement. I also added a case where I explicitly coded a
bound-check in addition to whatever the compiler provided for me (I won't post
code here for reasons which will become apparent).

Comparing the following compilers:
gcc is the MinGW version of gcc 3.4.2 running at -O3 (no
other optimisation-related options).
msvc is the C++ compiler from MS Visual Studio running
with all time-releted optimisations turned on, bounds checking
turned off, and optimising for a P4 processor.
java is the "server" JVM from JDK 1.5.0_6

I got times (averaged over the last 10 of 20 runs -- which allows the JIT to
get its act together):
gcc: 278.4 138.2
msvc: 210.3 0.0
java: 332.6 209.2

The first column is the time for the loop with explicit bounds checks, the
second column is the time without explict bounds checks.

Some things to note.

One is that I was surprised that only MSVC managed to optimise out the useless
loop -- normally java -server is inconveniently good at optimising away
benchmark code. However, it is clear that we need a better benchmark.

Another is that the (hypothetical) two consecutive sets of bounds checks in the
java version don't get conflated into one check, nor made to vanish by the
processors internal caching and pipelining.

There's another important point -- in this loop, we are not really doing
anything /with/ the data, and the next array access is dependent on the value
retreived in the previous step. This will not allow the processors pipeline to
work correctly, and is also extremely unrepresentative.

So I added a couple of lines of code to sum each value as it was pulled out of
the array, and then return the overall (meaningless) sum to the test harness
where it was used in a conditional. That not only keeps the optimiser honest,
but also provides something for the processors arithmetic units to do while it
is waiting for the next element of the array to be pulled out of (probably) L1
cache. Fortunately, the test array is small enough that we don't need to worry
about the random access into the array providing a seriously false load to the
L1 and L2 caches, or the benchmark would be completely useless.

I will append the C++ and Java code to the end of this message.

With the changes, the times I saw were:
gcc: 272.4 207.2
msvc: 342.6 205.2
java: 380.6 211.3

(Again, that's the average of the last 10 runs -- I'll append the raw data to
this message too)

Notice:

There is very little difference between the runtimes for the three compilers
when there are no explicit bounds checks.

Java is still not managing to fold the explicit and implicit bounds checks
together, and neither is the processor managing to do so. I find this very
surprising. Maybe the JIT is emitting code to tell the CPU to expect the
explicit tests to fail, and thus buggering up the speculative execution that it
would otherwise be able to do. The MS compiler seems as if it might be doing
something similar, but GNU isn't. But that is pure (and not very willl
informed, to be honest) speculation, so who knows...

Java without explicit bounds checks is about as fast in this test as in the
previous one, thus supporting my feeling that the previous version wasn't
making representative use of the CPUs pipelining and out-of-order execution.

So, broadly speaking, I think the effect of Java's bounds checking can be
characterised as somewhere between trivial and zero. Of course, any such
assertion must also note that this is only one micro-benchmark. I have
attempted to make it at least minimally realistic, but for real purposes you
would need to measure real code working on real data, with real access
patterns.

-- chris

============== TestClass.java =============--
public class TestClass
{
int
nextPointer(int[] array, int point)
{
return array[point];
}

int
timeitWithoutCheck(int[] array)
{
int pointer = 0;
int sum = 0;
for (int i = 0; i < 100000000; i++)
{
pointer = nextPointer(array, pointer);
sum += pointer;
}
return sum; // meaningless
}

int
timeitWithCheck(int[] array, int size)
{
int pointer = 0;
int sum = 0;
for (int i = 0; i < 100000000; i++)
{
if (pointer < 0) return 0;
if (pointer >= size) return 0;
pointer = nextPointer(array, pointer);
sum += pointer;
}
return sum; // meaningless
}

public static void
main(String[] args)
{
TestClass test = new TestClass();

int[] array = new int[3000];
for (int i = 0; i < 3000; i++)
array = (int)(Math.random() * 3000d) % 3000;

for (int times = 0; times < 20; times++) // interp vs. comp
{
long start;

start = System.currentTimeMillis();
int s1 = test.timeitWithCheck(array, 3000);
long with = System.currentTimeMillis() - start;

start = System.currentTimeMillis();
int s2 = test.timeitWithoutCheck(array);
long without = System.currentTimeMillis() - start;

System.out.println(with + "\t" + without);
if (s1 != s2)
System.out.println("more fun fooling the optimiser");
}
}
}
============ TestClass.cpp ==================
#include <stdio.h>
#include <stdlib.h>
#include <sys/timeb.h>

int
nextPointer(int *array, int point)
{
return array[point];
}

int
timeitWithoutCheck(int *array)
{
int pointer = 0;
int sum = 0;
for (int i = 0; i < 100000000; i++)
{
pointer = nextPointer(array, pointer);
sum += pointer;
}
return sum; // meaningless
}

int
timeitWithCheck(int *array, int size)
{
int pointer = 0;
int sum = 0;
for (int i = 0; i < 100000000; i++)
{
if (pointer < 0) return 0;
if (pointer >= size) return 0;
pointer = nextPointer(array, pointer);
sum += pointer;
}
return sum; // meaningless
}

long
currentTimeMillis()
{
struct timeb times;

ftime(&times);
return times.time * 1000 + times.millitm;
}

int
main(int argc, char **argv)
{
int* array = new int[3000];
for (int i = 0; i < 3000; i++)
array = (int)(rand()) % 3000;

for (int times = 0; times < 20; times++) // interp vs. comp
{
long start;

start = currentTimeMillis();
int s1 = timeitWithCheck(array, 3000);
long with = currentTimeMillis() - start;

start = currentTimeMillis();
int s2 = timeitWithoutCheck(array);
long without = currentTimeMillis() - start;

printf("%ld\t%ld\n", with, without);
if (s1 != s2)
printf("more fun fooling the optimiser\n");
}
}
================ Raw data =====================
GCC -O3:
450 211
270 261
310 200
281 200
280 201
280 200
271 210
270 211
280 200
281 200
271 210
270 210
270 210
271 210
270 201
280 200
281 200
270 211
270 210
271 210

MS VS2003 + all time opts
521 210
340 201
340 210
341 200
341 200
341 200
440 201
350 201
340 210
341 210
341 200
350 201
350 200
341 210
341 200
341 210
340 211
340 200
341 210
341 210

JAVA -server:
471 220
371 210
371 210
381 200
370 211
370 211
380 200
381 200
381 200
381 210
421 250
381 200
380 211
370 210
381 200
381 200
381 210
371 210
370 211
370 211
======================================
 
D

daniel.w.gelder

I'm pretty impressed...

But why is every Java chat room so slow? Bad programming? Ah well. :_)

Thanks so much for helping me out.

Dan
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Similar Threads


Members online

No members online now.

Forum statistics

Threads
473,997
Messages
2,570,239
Members
46,828
Latest member
LauraCastr

Latest Threads

Top