performance question

Roger Lindsjö · Dec 11, 2007

Making the loops smaller it is easier to show that "something" happens.
I ran it with 10000000 iterations instead, and then the output showed:
Bench 15
Took 0.229 seconds
Bench 16
Took 0.228 seconds
Bench 17
Took 0.228 seconds
Bench 18
Took 0.105 seconds
Bench 19
Took 0.107 seconds
Bench 20
Took 0.106 seconds
Bench 21
Took 0.107 seconds

So, after 18 iterations the time for each loop is halved. I'm pretty
sure this is not a cache warmup (it it is the coldest cache I have ever
seen ;-) )

Didn't manage to get compiler debugging to output anything, will se if I
can get some more info.

//Roger Lindsjö

Jon Harrop · Dec 11, 2007

Lew said:
Interesting to me, too, as I have found that one of the singular strengths
of Java is its insistence on OO programming.

You will quickly change your mind if you try to write that symbolic
simplifier in an OO style. OOP is hopelessly inadequate for a wide variety
of practical problems...

Jon Harrop · Dec 11, 2007

Roger said:
Making the loops smaller it is easier to show that "something" happens.
I ran it with 10000000 iterations instead, and then the output showed:
Bench 15
Took 0.229 seconds
Bench 16
Took 0.228 seconds
Bench 17
Took 0.228 seconds
Bench 18
Took 0.105 seconds
Bench 19
Took 0.107 seconds
Bench 20
Took 0.106 seconds
Bench 21
Took 0.107 seconds

So, after 18 iterations the time for each loop is halved. I'm pretty
sure this is not a cache warmup (it it is the coldest cache I have ever
seen ;-) )

Yes, this is much more compelling evidence of something happening during
running (assuming it is repeatable). However, is the >2x performance
improvement that you're seeing consistent with Steve's findings that
improved by <20%?

Steve Wampler · Dec 11, 2007

Jon said:
Here is an allocation-bound benchmark: allocate an array of "n" pairs of
floats.

In OCaml:

Array.init (int_of_string Sys.argv.(1))
(fun i -> (float i, float i +. 1.))

In Java:

public class Bench {
static final long CLOCKS_PER_SEC = 1000;

public final class Pair {
private Object fst;
private Object snd;
public Pair(Object fst, Object snd) {this.fst=fst; this.snd=snd;}
public Object getFirst() { return fst; }
public Object getSecond() { return snd; }
}

public void run(int n) {
Pair[] a = new Pair[n];
for (int i=0; i<n; ++i)
a = new Pair((double)i, (double)i + 1);
}

public static void main(String[] args) {
long start = System.currentTimeMillis();
int n = Integer.parseInt(args[0]);
(new Bench()).run(n);
long end = System.currentTimeMillis();
System.out.printf("%dms\n",
(int) ((end - start) * 1000 / CLOCKS_PER_SEC));
}
}

For n = 2x10^6. OCaml:

$ time ./bench_ocaml 2000000

real 0m1.674s
user 0m1.500s
sys 0m0.172s
$ time java Bench 2000000
3750ms

real 0m3.930s
user 0m4.688s
sys 0m0.592s

Not only does Java require several times as much code to get this trivial
task done, it is >2x slower at doing it.

I've been wondering about just what this benchmark is showing. It's already been
shown that the Java version is hit hard by the initial size of the Java heap and
the performance improves markedly when that factor is removed - so this benchmark
may be showing that Ocaml's initial heap is either larger or (see below) is allocating
fewer objects. On my machine, running the above code with a larger initial heap
yields:
----------------------------------------------------------
time java -server -Xms256M Bench 2000000
1330ms
java -server -Xms256M Bench 2000000 2.25s user 0.29s system 173% cpu 1.460 total
---------------------------------------------------------
which, given my slower clock speed, is probably nearly as fast or faster than the
Ocaml solution would be on my machine [yes, that's conjecture - and may be wrong.
I just don't feel like installing Ocaml to find out!] To try and avoid the impact
of initial heap size I'm going to give Java enough heap from here on.

However, this was supposed to show that allocation is slow in Java. So another
interesting question is whether the two versions are allocating the same number
of heap objects of the same size. That's hard to tell in Ocaml, since it is
a very expressive language, but float is a primitive type in Ocaml and it would
make sense not to allocate a separate heap object to hold the float. Jon can
no doubt tell whether a two element list of floats contains two floats or pointers
to two heap objects each containing a float.

The Java version, as written, pretty much maxes out the number of heap objects
you could produce for such a simple test. I think I understand why Jon wrote it
that way - not to deceive, but to try and match the expressiveness of Ocaml,
with its polymorphic typing. That's not what was being discussed here, however.

If Ocaml is not allocating separate objects for each float then it might be more
reasonable, when measuring allocation performance to do the same in Java. This
version does that:

----------------------------------------------------------------
public class Bench3 {

static final long CLOCKS_PER_SEC = 1000;

public final class Pair{
public double fst;
public double snd;
public Pair(double fst, double snd) {this.fst=fst; this.snd=snd;}
}

public void run(int n) {
Pair[] a = new Pair[n];
for (int i=0; i<n; ++i) {
a = new Pair((double)i, (double)i+1);
}
}

public static void main(String[] args) {
int n = Integer.parseInt(args[0]);
long start = System.currentTimeMillis();
(new Bench3()).run(n);
long end = System.currentTimeMillis();
System.out.printf("%dms\n",
(int) ((end - start) * 1000 / CLOCKS_PER_SEC));
}

}
-------------------------------------------------
(I stripped out the getters because they weren't being used and so don't contribute
to the discussion.)

When run on my machine, this produces:

-------------------------------------------------
->time java -server -Xms256M Bench3 2000000
463ms
java -server -Xms256M Bench3 2000000 0.51s user 0.15s system 107% cpu 0.617 total
-------------------------------------------------

Which, when taking clock speeds into account [granted, there are many other
factors at play] is most likely significantly faster than the Ocaml version.
[Again, this may not be relevant - it depends on just how Ocaml handles primitive
types.] Perhaps, when arguing whether or not Java is slow (as opposed to expressive!)
it might be better to stick with similar languages like C++ for comparison.

Mark Thornton · Dec 11, 2007

Jon said:
Exactly. I see no evidence that HotSpot is doing anything different to a
normal static optimizer, i.e. it does not appear to incrementally improve
the performance of the Java code as it runs.

It does inlining of virtual methods which static optimisers usually
don't do. However none of the micro benchmarks seen here will exhibit
that effect. So quite a bit of it optimisation effort is directed at OO
programs. In these cases it can show significant benefits relative to
C++, which do not appear in essentially functional examples.

Mark Thornton

Mark Thornton · Dec 11, 2007

Jon said:
You will quickly change your mind if you try to write that symbolic
simplifier in an OO style. OOP is hopelessly inadequate for a wide variety
of practical problems...

That is what is known as "Mandy Rice Davies applies". I.e. given the
nature of your business, you would say that wouldn't you.

There are another substantial category of applications where OO is very
appropriate.

Mark Thornton

Roger Lindsjö · Dec 11, 2007

Jon said:
Yes, this is much more compelling evidence of something happening during
running (assuming it is repeatable). However, is the >2x performance
improvement that you're seeing consistent with Steve's findings that
improved by <20%?

If I'm understanding you correct, you are asking if I get similar
results as above but using the same number as Steven did? Yes, this is
what I got:
java T1 10000000 10 > out
Bench 0
Took 2.928 seconds
Bench 1
Took 2.358 seconds
Bench 2
Took 1.124 seconds
Bench 3
Took 1.125 seconds
Bench 4
Took 1.155 seconds
Bench 5
Took 1.166 seconds
Bench 6
Took 1.13 seconds

After that it seems to stay below 1.2 seconds. So the jump happens
between 10 and 20 million, which is consistent with my previous run. It
still annoys me that -XX

rintCompilation does not actually print
anything.
The Java I'm using is:
java version "1.6.0_02"
Java(TM) SE Runtime Environment (build 1.6.0_02-b05)
Java HotSpot(TM) Server VM (build 1.6.0_02-b05, mixed mode)
Maybe I'll try the latest. I'm quite sure I have gotten output from
previous JVMs of when compilation/recompilation occurred and how long
time it took. Too bad I can't remember how.

//Roger Lindsjö

Jon Harrop · Dec 11, 2007

Mark said:
That is what is known as "Mandy Rice Davies applies". I.e. given the
nature of your business, you would say that wouldn't you.

If it weren't true I'd be in the Java business.

There are another substantial category of applications where OO is very
appropriate.

Yes. That's exactly why both OCaml and F# provide OOP as well.

Jon Harrop · Dec 11, 2007

Steve said:
I've been wondering about just what this benchmark is showing. It's
already been shown that the Java version is hit hard by the initial size
of the Java heap and the performance improves markedly when that factor is
removed - so this benchmark may be showing that Ocaml's initial heap is
either larger or (see below) is allocating
fewer objects.
On my machine, running the above code with a larger
initial heap yields:
----------------------------------------------------------
time java -server -Xms256M Bench 2000000
1330ms
java -server -Xms256M Bench 2000000 2.25s user 0.29s system 173% cpu
1.460 total ---------------------------------------------------------
which, given my slower clock speed, is probably nearly as fast or faster
than the Ocaml solution would be on my machine [yes, that's conjecture -
and may be wrong.

Increasing the heap size in Java reduces the time taken from 3.826s to
2.430s here. However, doing the same to the OCaml also reduces the time
taken, from 1.654s to only 0.279s.

You do this by adding:

Gc.set { (Gc.get()) with Gc.minor_heap_size = 1 lsl 25 };;

However, this was supposed to show that allocation is slow in Java. So
another interesting question is whether the two versions are allocating
the same number
of heap objects of the same size. That's hard to tell in Ocaml, since it
is a very expressive language, but float is a primitive type in Ocaml and
it would
make sense not to allocate a separate heap object to hold the float. Jon
can no doubt tell whether a two element list of floats contains two floats
or pointers to two heap objects each containing a float.

I deliberately chose the worst-case data structure in OCaml to ensure that
the comparison is fair. OCaml is allocating an array of pointers to pairs
of pointers to floating point numbers, i.e. each float is being
individually boxed in the OCaml.

However, even though OCaml is allocating a lot of unnecessary pointers it
can do so incredibly quickly compared to almost any other language because
this is one of the most performance-critical parts of an OCaml
implementation and it has been very highly optimized as a consequence.

The Java version, as written, pretty much maxes out the number of heap
objects
you could produce for such a simple test. I think I understand why Jon
wrote it that way - not to deceive, but to try and match the
expressiveness of Ocaml,
with its polymorphic typing. That's not what was being discussed here,
however.

There is no polymorphism in the code here.

If Ocaml is not allocating separate objects for each float then it might
be more
reasonable, when measuring allocation performance to do the same in Java.
This version does that:

----------------------------------------------------------------
public class Bench3 {

static final long CLOCKS_PER_SEC = 1000;

public final class Pair{
public double fst;
public double snd;
public Pair(double fst, double snd) {this.fst=fst; this.snd=snd;}
}

public void run(int n) {
Pair[] a = new Pair[n];
for (int i=0; i<n; ++i) {
a = new Pair((double)i, (double)i+1);
}
}

public static void main(String[] args) {
int n = Integer.parseInt(args[0]);
long start = System.currentTimeMillis();
(new Bench3()).run(n);
long end = System.currentTimeMillis();
System.out.printf("%dms\n",
(int) ((end - start) * 1000 / CLOCKS_PER_SEC));
}

}
-------------------------------------------------
(I stripped out the getters because they weren't being used and so don't
contribute to the discussion.)

When run on my machine, this produces:

-------------------------------------------------
->time java -server -Xms256M Bench3 2000000
463ms
java -server -Xms256M Bench3 2000000 0.51s user 0.15s system 107% cpu
0.617 total -------------------------------------------------

Which, when taking clock speeds into account [granted, there are many
other factors at play] is most likely significantly faster than the Ocaml
version.

An equivalent OCaml program is also much faster, of course, taking only
0.17s here.

[Again, this may not be relevant - it depends on just how Ocaml handles
[primitive
types.] Perhaps, when arguing whether or not Java is slow (as opposed to
expressive!) it might be better to stick with similar languages like C++
for comparison.

Click to expand...

I had thought that but C++ actually does quite well here, although nothing
like as well as OCaml. The following C++ takes only 0.854s to perform the
first (fully boxed) benchmark including deallocation:

#include <vector>
#include <iostream>

typedef std:air<double *, double *> P;
typedef std::vector<P *> A;

int main(int argc, char *argv[]) {
const int n = atoi(argv[1]);
A a(n);

for (int i=0; i<n; ++i)
a.at(i) = new P(new double(i), new double(i+1));

for (A::iterator it = a.begin(); it != a.end(); ++it) {
delete (*it)->first;
delete (*it)->second;
delete *it;
}
}

The symbolic simplifier will be a better example of this but writing it
comparably in C++ is prohibitively difficult because it relies upon having
a GC...

George Neuner · Dec 11, 2007

Interesting to me, too, as I have found that one of the singular strengths of
Java is its insistence on OO programming.

I've noticed that people often prefer to take shortcuts when writing code -
breaking the OO style being one such. The problem is the concomitant
reduction in maintainability and stability of the code. As a good part of my
career has been to pick up after absent programmers, I am quite sensitive to
the true lifetime cost of such shortcuts.

As am I. It's not about shortcuts or "breaking style" - it's about
realizing that the most straight-forward solution to a problem may not
be so straight-forward to express in OO. A number of the common
design patterns really are just work arounds for practical
deficiencies in the model and/or the existing languages that implement
it.

Many times I have found myself working on a program in which certain
parts are a great match to OO, but other parts would be most clearly
expressed using a functional or even a procedural style.

There are no _single_ paradigm languages - all languages blend
elements of more than one model, but after 20+ years, half a dozen
general purpose languages and more DSLs than I can remember, I find it
restrictive to work in languages which actively discourage using
models other than their dominant one. Doesn't mean I won't work in
them ... just that I prefer not to.

What seems reasonable to an author, especially a highly-skilled one as,
presumably, Mr. Neuner is, will often cause trouble for those who follow, as
teams hire a range of skills and experience.

There is a technical management solution to the problem of highly
skilled programmers - make them teach the less skilled ones.

Levity aside, I agree that it is not a terribly good thing to let any
significant part of your code base become the property of just one or
a handful of programmers. However that can be a technical issue as
much as a management one - in a small group there may be only one
person with the domain expertise to maintain the code. When such
experts inevitably move on it can be very traumatic for the group.

George

Concurrent code performance	10	Oct 9, 2011
DatagramSocket receive performance	16	Mar 14, 2007
MappedByteBuffer	23	Jul 23, 2010
PHP RSS Feed Aggregator changing to todays date everytime feed is aggregated	1	Jan 11, 2022
Finding empty columns. Is there a faster way?	2	Apr 21, 2011
greatly differing processing time between java and Linux while calculating hashes?	1	Sep 9, 2012
ctype performance benchmark	2	Jul 17, 2009
Ruby Performance	86	Aug 12, 2005

performance question

Roger Lindsjö

Jon Harrop

Jon Harrop

Steve Wampler

Mark Thornton

Mark Thornton

Roger Lindsjö

Jon Harrop

Jon Harrop

George Neuner

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads