Bulk Array Element Allocation, is it faster?

J

Jan Burse

Lew said:
This is equally true of primitive arrays. Why do you claim otherwise?

No, I don't claim otherwise. new int[] and new double[]
would also need treatment when the values 0 and 0.0
are not appropriate.

But you never run into a null-pointer exception with
primitive arrays, so they are a little bit more ready
made than non-primitive arrays.

Bye
 
J

Jan Burse

Jan said:
Lew said:
This is equally true of primitive arrays. Why do you claim otherwise?

No, I don't claim otherwise. new int[] and new double[]
would also need treatment when the values 0 and 0.0
are not appropriate.

But you never run into a null-pointer exception with
primitive arrays, so they are a little bit more ready
made than non-primitive arrays.

Bye

Just compare with C array allocation of a bunch of
fixed size records. In C you get the records. In Java
you only get the array but no records in it.
 
L

Lew

What I do in the code has nothing to do with my question. My

It has a great deal to do with your question.
question circles only around the following code fragment:

Bla[] bla = new Bla[n];

Here the system does allocate, as you plaintively request, "n*X bytes of space", where "X" is the size of a pointer.
for (int i=0; i<n; i++) {
bla = new Bla();
}


How would you imagine the JIT would optimize this without taking semantics into account? Each 'Bla' is required to be independently GC-able, you know.. The semantics of 'Bla[]' is not a block of instances; it's a block of references.

How would you bulk allocate a block of references each to a different reference? About the only way is to add a fixed offset to the value pushed intoeach successive array element, but that's already what you are doing with 'new' anyway, so you aren't "optimizing" by doing the same thing.

If the 'Bla' constructor is heavyweight, the construction of your n 'Bla' instances will far outweigh any slight overhead in calculating that offset increase for each array element.

I just don't see how you expect to bulk-load n different pointers into thatarray. Could you explain that in more detail, please?
And what the JIT does. And not what I am doing in a comparison
with lazy. That is why I started my post with:

"I just wonder wether [sic] modern JIT [sic] do optimize
Code [sic] of the following form:"

Please focus on that.

Aye, aye, focusing on that, Captain!

The answer is "nothing", because the semantics of the operation are such that the current mechanism is already quite close to optimal. This is what people have been explaining to you, that you dismissed as irrelevant.
 
J

Joshua Cranmer

I doubt that. The JLS never says exactly how something works, just
what the net effect must be.

The JLS clearly states the full observational effects of any action,
particularly in the realm of object initialization and the memory model.
It is part of what makes it so difficult to follow.

The JLS is not difficult to follow. ISO C++11 is difficult to follow.
Well, the section on generic type inference could use a few more
examples to be easier to follow, but even that is relatively
straightforward.
A JLS for programmers rather than implementors might
describe a simple reference implementation, and say, "As a programmer
you can presume it works like this, but actual implementations will be
more sophisticated and may not give exactly the same behaviour as this
simple explanation. For those fine points, see the implementors JLS."

As I stated earlier, the JLS states full observational effects. If the
underlying behavior is not the same (e.g., using escape analysis to
change heap allocations to stack allocations), it would require advanced
introspective techniques to observe this which are well beyond the
semantics of Java itself.
 
J

Jan Burse

Lew said:
How would you imagine the JIT would optimize this
without taking semantics into account? Each 'Bla'
is required to be independently GC-able, you know.

In my posts I mentioned two possible optimizations.
Optimization 1: Moveing the locking outside of the
loop, Optimization 2: Allocating at once n*X where
X is the space for the Bla record, not the space
for the Bla point.

Since after allocation, we have the assignment
bla = new Bla(), the object cannot escape as
long as bla does not escape, so there should be
not much problem with GC, but I am not 100% sure
what the problems with GC should be.

But anyway, let speak the figures, and lets stop
musing too much. I did some testing and it shows
that the 64bit can do the "speed up" whereby the
32bit cannot do it:

OS JDK Arch Bulk Lazy Delta %
Win 1.6 64bit 8'771 9'805 11.8%
Win 1.6 32bit 14'587 14'744 1.1%
Win 1.5 32bit 17'139 17'405 1.6%
Mac 1.6 64bit 11'003 12'363 12.4%
Unix 1.6 32bit 26'517 26'858 1.3%

Still this leaves open the question whether the
64-bit JIT is clever or the 64-bit JVM is clever
or the 64-bit CPU is clever.

Definitely it seems that the 32-bit is not clever,
there we all see a small overhead for the lazy,
which I interpret as the additional checks, eventually
done repeatedly, for the lazy.
The answer is "nothing", because the semantics
of the operation are such that the current
mechanism is already quite close to optimal.
This is what people have been explaining to you,
that you dismissed as irrelevant.

Where do you see "nothing" in the above figures?

Bye
 
E

Eric Sosman

Eric said:
Again, I point out that the bulk and lazy variants do not do
the same thing. Consider, for example

class Bla {
private static int master_seqno = 0;
public final int seqno = ++master_seqno;
}

Observe that the value of bla[42].seqno differs between the two
variants; it would therefore be an error to "optimize" either by
transforming it into the other.

Not really. You could only see a difference in the order
that Bla() gets a number.

You'll also see a difference in *which* number bla[42] gets.
In the bulk initialization, bla[42].seqno is 43, always. With
lazy initialization, it can be anything at all between 1 and n
inclusive. QED, and your "not really" makes no sense to me.

There are lots of other effects Bla() construction might have
that would turn out differently in bulk or lazy initialization.
If the constructor writes a message to System.out, all the messages
will appear together in bulk initialization, but will be scattered
all over the place with lazy initialization. If the constructor
throws an exception -- even a seldom-caught RuntimeException --
the exceptions will appear in a different order w.r.t. other actions
in the program. The two code sequences are irreconcilably different.
So if your application does not depend on the order
the seqno's are created there is no functional problem.

From the application's point of view perhaps not. But Java and
the JVM can't possibly know that; they follow the orders you give.
You say "Assign this seqno (not that one) to the new Bla, and throw
(or don't throw) this exception, write (or don't write) this output,
and do (or don't do) all these things in their proper order w.r.t.
all my other commands." Concrete example: In a sense you do not
care what value Math.random() returns, but Java is contractually
bound to deliver a specific value under specific circumstances, and
cannot violate its contract even if you are in a forgiving mood.
Main question I have here is not about relative
correctness of bulk versus lazy. But why bulk is
counterintuitively much faster?

Nobody knows, because you've shown no code. My first guess
(already hinted at) is that your timing is faulty; it is notoriously
hard to get good timings even from statically-bound languages these
days, and even more difficult with dynamically-bound partially-
interpreted-partially-compiled garbage-collected multi-threaded
languages like Java. I'm sticking with that guess until you show
some evidence to refute it.
Since the bulk is much faster I am assuming [...]
Yeah.

Actually this loop is something that shocked me
when I started working with Java many years ago.
What there is no way to make a "new" of an array
with all its elements?

That's another topic, and one that's been hashed over several
times in other threads. The principal reason, I think, is that
it would so seldom be useful to create an array of N objects *all*
created by the no-args constructor: Who needs N copies of "the
same" thing? And what if the class has no no-args constructor?

Okay, sure, it could happen occasionally. But often enough
to justify a special feature in the language, with special rules
for "What if I want to pass arguments to the constructors?" or
"What if the constructor is inaccessible and I'm supposed to use
a factory method?" and "What if I *don't* want objects created
yet?" and so on, and so on?

Challenge: Look through your own code, right now, and count
the number of times you have actually filled an array with references
to N identically-constructed-with-no-args objects. In other words,
count the number of times your own actual code would have been able
to take advantage of such a feature if Java provided it.
This is not seen if one works with int[] or
double[] arrays. But as soon as you work with some
other array, either a higher dimensional of int[]
or double[] or an array of some class X. You are
responsible for populating the elements.

How often do you create an int[] and leave all its elements
alone, still holding their original zeroes? (On purpose, I mean.)

In any case, Java *does* populate the array for you: With
null references. Do you have a prejudice against `null'?
I am
not interessted in a general theory between going
from bulk to lazy and back and forth. Forget
about the lazy and explain the bulk!

Exhibit some actual code, so people can stop guessing about
what you've done.
 
J

Jan Burse

Jan said:
OS JDK Arch Bulk Lazy Delta %
Win 1.6 64bit 8'771 9'805 11.8%
Win 1.6 32bit 14'587 14'744 1.1%
Win 1.5 32bit 17'139 17'405 1.6%
Mac 1.6 64bit 11'003 12'363 12.4%
Unix 1.6 32bit 26'517 26'858 1.3%

Update:

Win 1.7 64bit 8'159 8'975 10.0%

Same machine like 1.6 64bit testing was done,
just another JDK, the newer 1.7. Advantage of
bulk over lazy slightly less pronounced.

But application itself 10% faster again (sic!).

Bye
 
J

Jan Burse

Eric said:
Challenge: Look through your own code, right now, and count
the number of times you have actually filled an array with references
to N identically-constructed-with-no-args objects.
Exhibit some actual code, so people can stop guessing about
what you've done.

I am not asking people to guess what I have done. I
am asking about whether JIT can optimize loops with
new in it. The new in it needs not to be a argument
less constructor. It could be also a constructor
with arguments.

All the optimizations that came to my mind easily
work also with argument based constructors. For
example moving the lock outside of the loop should
also work with argument based constructors in most
cases:

for (int i=0; i<n; i++) {
lock heap
bla = new Bla(i);
unlock heap
}

Is the same as:

lock heap
for (int i=0; i<n; i++) {
bla = new Bla(i);
}
unlock heap

If <init> of bla does not much. And the other optimization
with the memory allocate is also not impacted much by
argument based constructors. And it can be combined with
the heap locking, and we can move the lock before
the loop, and will lock less time:

for (int i=0; i<n; i++) {
lock heap
p = allocate size X
bla = init p;
unlock heap
}

Is the same as:

lock heap
p = allocate size n*X
unlock heap
for (int i=0; i<n; i++) {
bla = init p;
p += x;
}

I hope you see now what I am heading for. But I must
disappoint you, if a compiler would be able to optimize
constructors with arguments, maybe not in all situations
but to a great degree, I will not have use for it in
my application, since my current problem is argument
less.

But I would be of course also happy to hear something
about the more general problem of constructors with
arguments and what the JIT would do for loops. Since
it would then be very easy for me to specialize the
insight to my argument less problem.

I do not exclude that JITs can also deal with
constructors that have arguments in loops.

Bye
 
L

Lew

In my posts I mentioned two possible optimizations.
Optimization 1: Moveing the locking outside of the
loop, Optimization 2: Allocating at once n*X where

That's not an optimization because things might need to happen in between loop cycles, so you can't "lock the heap" that long and call it an "optimization".

Semantically an array of n references is an array of n separate references.You can't count on the memory allocations to be contiguous, shouldn't have to count on that, and you certainly can't count on each reference becoming eligible for GC at the same time. GC in turn moves things around anyway,so any "benefit" of contiguous allocation will be gone quite quickly, if it ever existed in the first place.

And the point you ignore is that the "benefit" of contiguous allocation is questionable, let alone of your rather bizarre suggestion to move the "locking of the heap" outside the loop. Each 'new' just bumps up the heap pointer, so it's not much faster to bump it once than n times, in the grand scheme of things. Not when your tiny advantage is quickly eaten up by circumstances immediately afterward anyway.

X is the space for the Bla record, not the space
for the Bla point.

But that's the thing that's semantically different!

You can't "optimize" allocation of n references to also create a block of ninstances. The optimizer cannot know that that is the intended use of thearray. You can't "optimize" the allocation of n instances consecutively either, because you have to know a whole lot about what those n constructorswill try to do, including possibly allocate heap themselves or except out prematurely. For the few cases where block allocation *might* provide negligible speedup, it's not worth the analysis effort to determine that this one time is one of those rare exceptions when allocation might be sped up enough for anyone to notice.

As people keep pointing out in this conversation.
Since after allocation, we have the assignment
bla = new Bla(), the object cannot escape as
long as bla does not escape, so there should be
not much problem with GC, but I am not 100% sure
what the problems with GC should be.


What do you even mean by this? What do you mean by the object "escaping"?

I don't know about "problems" with GC, but you cannot be sure that the individual instances pointed to by the 'bla' pointers will be descoped at the same time. Ergo you cannot know ahead of time when they will be eligiblefor GC. Instances that survive GC will be moved to contiguous memory blocks, but not their original ones. Whatever negligible benefit you might have gotten, but more likely did not, from contiguous allocation of the 'Bla' instances will be wiped out.
But anyway, let speak the figures, and lets stop
musing too much. I did some testing and it shows
that the 64bit can do the "speed up" whereby the
32bit cannot do it:

OS JDK Arch Bulk Lazy Delta %
Win 1.6 64bit 8'771 9'805 11.8%
Win 1.6 32bit 14'587 14'744 1.1%
Win 1.5 32bit 17'139 17'405 1.6%
Mac 1.6 64bit 11'003 12'363 12.4%
Unix 1.6 32bit 26'517 26'858 1.3%

Still this leaves open the question whether the
64-bit JIT is clever or the 64-bit JVM is clever
or the 64-bit CPU is clever.

Definitely it seems that the 32-bit is not clever,
there we all see a small overhead for the lazy,
which I interpret as the additional checks, eventually
done repeatedly, for the lazy.


Where do you see "nothing" in the above figures?

Everywhere. You have some questionable method of timing something that youdo not describe, let alone with any precision, using doubtful methodology and suspect reasoning without reference to experimental error or confidenceanalysis. Your numbers literally tell me nothing.
 
J

Jan Burse

Lew said:
You can't "optimize" allocation of n references
to also create a block of n instances. The optimizer
cannot know that that is the intended use of the array.

The following optimization works independent of
the use of the array. You can even nullify an
array element or set it to a newly created object.
So going from an unoptimized code:

for (int i=0; i<n; i++) {
lock heap
bla = new Bla(i);
unlock heap
}

To the following:


lock heap
p = allocate size n*X
unlock heap
for (int i=0; i<n; i++) {
bla = init p;
p += x;
}

Shouldn't be a problem for a decent compilers.
Various compilers do much more with loops. Why
do you doubt that this optimization is possible.
For the few cases where block allocation
*might* provide negligible speedup.

Oh, you don't really doubt that it is possible.
You rather doubt that it is worth. Well I
can tell you, in my application, the allocation
of small arrays with preallocated empty constructor
less initialized objects is done over and over.

I have recently run visual vm. It is really
what is mostly happening in the app during
the process of solving a problem.

And it seems that since it is so dominant any
changes in the way this is done are directly
seen in the timings. Here are the timings
again:

OS JDK Arch Bulk Lazy Delta %
Win 1.7 64bit 8'159 8'975 10.0%
Win 1.6 64bit 8'771 9'805 11.8%
Win 1.6 32bit 14'587 14'744 1.1%
Win 1.5 32bit 17'139 17'405 1.6%
Mac 1.6 64bit 11'003 12'363 12.4%
Unix 1.6 32bit 26'517 26'858 1.3%

On 64bit the bulk is 10% faster for the
present application that makes heavy use
of allocating these objects and arrays.
As people keep pointing out in this conversation.

I know, it is logical that if it were that
the application would not make heavy use
of allocating these objects and arrays, then
nothing would be seen. But if nothing would
be seen at all, I wouldn't post here and
asking what is going on. But something is
seen, the 10%. It is not "nothing".

So what is going on in the 64bit?

Bye




Bye
 
J

Jan Burse

Jan said:
So what is going on in the 64bit?

Problem resolved! Thanks all for your attention and
good advise. Especially persisting in look at the
application, there was a bottleneck somewhere else.
Dunno why it wasn't seen for 32-bit, but anyway,
everything is fine now. Will revert to the lazy
version since it uses less memory...

Best Regards
 
A

Andreas Leitgeb

Jan Burse said:
Problem resolved! Thanks all for your attention and
good advise. Especially persisting in look at the
application, there was a bottleneck somewhere else.

If it can be demonstrated within the context of:
bulk instanciation in a loop
versus
on-demand instanciation ("lazy")
then I'd be curious about where/what that extra bottleneck
was.
 
J

Jan Burse

Andreas said:
If it can be demonstrated within the context of:
bulk instanciation in a loop
versus
on-demand instanciation ("lazy")
then I'd be curious about where/what that extra bottleneck
was.

Lets assume the following simplified lazy code:

bla = new Bla[n];
..
if (bla==null) /* the null test */
bla=new Bla(); /* the new call */
..

I thought that in the lazy version, bla was
maximally n times tested for null. It was
based on the figures I obtained, namely
counting the new calls in bulk and in lazy:

Bulk: 84'393'262
Lazy: 81'662'843
http://www.facebook.com/photo.php?fbid=199117193490169

But the number of new calls is not the
same as the number of null tests, since
a new call is only performed when the null
test succeeds.

So it turned out that there were much more
null tests than new calls. In certain sub
test cases the number can arbirarily go up, so
instead of n tests one might have m*n tests
where m depends on the sub test case, and
is not related to n.

Since my null test is in practice a little
bit more complex in my application and since
m is much above 1 in many of my sub test
cases, the overhead became recognizable.

Meanwhile I could reduce the complexity and
subsequently the frequency of the null test
in my application and the overhead for lazy
has dimished.

There is a little guard which is called like
m times. And then the null test is maximally
n times done. Since the guard is cheap and
since it guards a couple of null tests, the
overhead is now small.

Bye
 
J

Jan Burse

Jan said:
There is a little guard which is called like
m times. And then the null test is maximally
n times done. Since the guard is cheap and
since it guards a couple of null tests, the
overhead is now small.

This could be transformed to other problems.
Like for example an uninitialized matrice.
Just put the guard on the rows of a matrice,
and intialize rows in a bulk instead of the
elements of the matrice. Clearly this gives
a speed-up.

Why is still null test on elements needed?
Well the guard is reset from time to time.
At least this happens in my application.
Because otherwise one could object, the
guard is enough, why should there be
additional null tests on elements?

Bye
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,996
Messages
2,570,237
Members
46,825
Latest member
VernonQuy6

Latest Threads

Top