Effective Multi Core Thread Programming

jonasforssell

Hello Experts.

This program does not give the expected performance boost on my P4 with
HyperThreading under Linux:

/****************************************************************************/
public class ThreadTester implements Runnable {
    int id;

    public ThreadTester(int tid) {
        this.id = tid;
    }

    public void run() {
        System.out.println("Starting thread: " + id);

        long stime = System.currentTimeMillis();

        for (int i = 0; i < 20000; i++)
            for (int j = 0; j < 20000; j++) {
                double p = i * j;
                p = Math.sqrt(p);
            }

        stime = System.currentTimeMillis() - stime;

        System.out.println("Solution for thread " + id + " took " +
                stime + " ms");
    }

    public static void main(String[] args) {

        if (args.length != 1) throw new
            IllegalArgumentException("\n\nSyntax is 'java ThreadTester x' where x " +
                "is number of threads \n");

        int cpu = Integer.parseInt(args[0]);
        Thread[] t = new Thread[cpu];

        // Create one thread per requested worker.
        for (int i = 0; i < cpu; i++)
            t[i] = new Thread(new ThreadTester(i));

        for (int i = 0; i < cpu; i++)
            t[i].start();

        // Wait for every worker to finish.
        try {
            for (int i = 0; i < cpu; i++) {
                t[i].join();
            }
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
}
/**************************************************************************/

And this is my output

pc58410@gustav Impact $ java ThreadTester 1
Starting thread: 0
Solution for thread 0 took 5381 ms

pc58410@gustav Impact $ java ThreadTester 1
Starting thread: 0
Solution for thread 0 took 5379 ms

pc58410@gustav Impact $ java ThreadTester 2
Starting thread: 1
Starting thread: 0
Solution for thread 1 took 11247 ms
Solution for thread 0 took 11321 ms

pc58410@gustav Impact $ java ThreadTester 2
Starting thread: 1
Starting thread: 0
Solution for thread 1 took 11241 ms
Solution for thread 0 took 11325 ms


With an effective core distribution, this should give values similar to
the first runs (< 6000 ms).

What have I done wrong? I thought the JVM would make a good
distribution automatically?

Many thanks
/Jonas Forssell, Gothenburg, Sweden
 
jonasforssell

Additional input:

My machine has SMP support enabled in the Linux kernel. The system sees
two CPUs.

I'm running JVM 1.4.2 Blackdown which is based on SUN source.

/Jonas
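
A quick way to see what the JVM itself believes about the machine is `Runtime.availableProcessors()`, which has been available since Java 1.4. The sketch below is only an illustration (the class name `ProcessorCount` is made up here). Note that the number it reports counts logical processors, so a single hyperthreaded P4 would be expected to show up as 2 even though it has one physical core.

```java
// Sketch only: the class name is hypothetical, not from the thread.
final class ProcessorCount {

    // Number of logical processors the JVM reports. On a hyperthreaded CPU
    // this counts hardware threads, not physical cores.
    static int logicalProcessors() {
        return Runtime.getRuntime().availableProcessors();
    }

    public static void main(String[] args) {
        System.out.println("Logical processors visible to the JVM: "
                + logicalProcessors());
    }
}
```

A result of 2 therefore does not by itself say whether two threads can run at full speed side by side.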
 
blmblm

jonasforssell wrote:
> Additional input:
>
> My machine has SMP support enabled in the Linux kernel. The system sees
> two CPUs.
>
> I'm running JVM 1.4.2 Blackdown which is based on SUN source.

Is this a dual-core machine, or a machine with a single hyperthreaded
processor? If the latter, be advised that speedups produced by
hyperthreading apparently range from zero to about 30%, with pure
number-crunching (such as what you're doing) apt to *not* take
advantage of the hyperthreading magic.

I admit that I *am* surprised that you got what seems like a
significant slowdown, rather than just a lack of improvement.
 
hiwa

If it is *practically* a single processor machine, the result is only
natural.
 
jonasforssell

Could anyone run this on their multi-core or multi-CPU machine and show me
some evidence that it works properly in that environment?

Thanks
/Jonas


 
Chris Smith

jonasforssell wrote:
> Could anyone run this on their multi-core or multi-CPU machine and show me
> some evidence that it works properly in that environment?


Starting thread: 0
Solution for thread 0 took 14734 ms

---------------------

Starting thread: 0
Starting thread: 1
Solution for thread 1 took 14641 ms
Solution for thread 0 took 14922 ms

Is that what you wanted? Looks good to me.
 
Chris Uppal

jonasforssell wrote:
> for (int i = 0; i < 20000; i++)
>     for (int j = 0; j < 20000; j++) {
>         double p = i * j;
>         p = Math.sqrt(p);
>     }

How many independent floating-point units does a hyperthreaded P4 have? I'm
not certain, but I /think/ that HT processors share all their actual execution
units, and that there is only one FP-capable unit. If so (and it's a fairly
big if) then this loop will saturate the FP unit even if only one thread is
running. Having more threads will necessarily increase overheads without
providing any benefit.

If that's correct, then the only exploitable parallelism here is that one
thread can be doing the integer arithmetic for the loops while the other is
doing the FP calculations. But since, with pipelining and whatnot, the
processor could be doing that "in parallel" anyway (even with just one thread),
again, it seems that the two threads will be competing for a resource that
either one of them could saturate.


Another point, and probably a lot more important, is that this test code is not
testing anything useful. The JITer will still be attempting to optimise the
loops while you are taking your single measurement (per thread). So, (a) the
performance reported will not be at all representative, and (b) the JITer
itself will be competing for processor time (and cache space, etc) with the
benchmark threads. Please note that the effects are large, and cannot simply
be ignored -- if you fail to take account of the JITer's behaviour then the
results are quite likely not even to be indicative.

-- chris
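
One possible way to reduce the JIT effects described above is to run the workload a few times untimed before taking the measurement. The following is a minimal sketch under stated assumptions, not a definitive benchmark harness: the names `WarmedUpTimer`, `work` and `timeAfterWarmup` are invented for illustration, the warm-up count is arbitrary, and `System.nanoTime()` needs Java 5 or later (the 1.4.2 JVM mentioned earlier would have to use `currentTimeMillis()` instead). The results are also summed and returned, since a loop whose result is never used may be optimised away entirely.

```java
// Sketch only: class and method names are hypothetical, not from the thread.
final class WarmedUpTimer {

    // Same arithmetic as the posted loop, but the square roots are summed and
    // returned so the work has an observable result. The cast to double
    // happens before the multiplication, which avoids int overflow for large n.
    static double work(int n) {
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                sum += Math.sqrt((double) i * j);
            }
        }
        return sum;
    }

    // Runs the workload 'warmupRuns' times untimed, then returns the elapsed
    // nanoseconds of one further run.
    static long timeAfterWarmup(int n, int warmupRuns) {
        double sink = 0.0;
        for (int r = 0; r < warmupRuns; r++) {
            sink += work(n);
        }
        long start = System.nanoTime();
        sink += work(n);
        long elapsed = System.nanoTime() - start;
        if (sink < 0) { // never true for these inputs; keeps 'sink' in use
            System.out.println(sink);
        }
        return elapsed;
    }

    public static void main(String[] args) {
        long ns = timeAfterWarmup(2000, 5);
        System.out.println("Timed run took " + ns / 1000000 + " ms");
    }
}
```

Even with a warm-up, a single timed run is still a rough indication at best. Repeating the timed run several times and comparing the values would give a better picture.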
 
jonasforssell

You are most probably correct. Hyperthreading does not duplicate the
floating-point units, and as the previous post shows, there are examples
where this will truly execute in parallel. An HT P4 is not one of them.

Thanks for all the feedback!
/Jonas

Eric Sosman

blmblm wrote:
> Is this a dual-core machine, or a machine with a single hyperthreaded
> processor? If the latter, be advised that speedups produced by
> hyperthreading apparently range from zero to about 30%, with pure
> number-crunching (such as what you're doing) apt to *not* take
> advantage of the hyperthreading magic.
>
> I admit that I *am* surprised that you got what seems like a
> significant slowdown, rather than just a lack of improvement.

Looks like roughly a 5% slowdown -- not great, but not
terrible. Keep in mind that the two-thread version does
twice as much work, and still has only one FPU to use for
all those square roots.
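
The arithmetic behind that estimate can be sketched as follows. This is only an illustration with invented names (`SlowdownEstimate`, `relativeSlowdown`). It compares the time for two concurrent threads with the time the same two workloads would take back to back on one thread, using the timings from the original post.

```java
// Sketch only: class and method names are illustrative, not from the thread.
final class SlowdownEstimate {

    // Fractional slowdown of running 'threads' copies of the workload at once,
    // compared with running the same copies back to back on a single thread.
    static double relativeSlowdown(long oneThreadMs, long concurrentMs, int threads) {
        double backToBackMs = (double) threads * oneThreadMs;
        return concurrentMs / backToBackMs - 1.0;
    }

    public static void main(String[] args) {
        // Timings from the original post: 5381 ms for one thread alone,
        // 11321 ms for the slower of the two concurrent threads.
        double s = relativeSlowdown(5381, 11321, 2);
        System.out.println("Relative slowdown: "
                + Math.round(s * 1000) / 10.0 + " %");
    }
}
```

With those figures the result is close to 5%, because two threads at 5381 ms each would need 10762 ms if run one after the other.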
 
jonasforssell

Chris does not state his configuration, but surely there must be two
FPUs here?

/Jonas

Thomas Hawtin

jonasforssell wrote:
> You are most probably correct. Hyperthreading does not duplicate the
> floating-point units, and as the previous post shows, there are examples
> where this will truly execute in parallel. An HT P4 is not one of them.

Funnily enough, the eight-core, thirty-two-thread Sun Niagara will
presumably show the same results too. It has only one floating-point
unit (shared across the cache crossbar). Sun don't recommend it for
applications with more than 1% floating point.

It seems that multiple hardware threads are most useful for doing work
during memory latency. Intel's talk of sharing functional units
within the same cycle is good marketing, but apparently not that
significant in practice.

Tom Hawtin
 
Chris Smith

jonasforssell wrote:
> Chris does not state his configuration, but surely there must be two
> FPUs here?

I don't know my configuration. It is a dual-core system; I know that.
Dell Inspiron E1405. You can probably look up info as well as I can.
 
