Cannot optimize 64-bit Linux code

legrape

I am porting a piece of C code to 64-bit on Linux. I am using 64-bit
integers. It is floating-point-intensive code, and when I compile it
(gcc) on a 64-bit machine, I don't see any runtime improvement when
optimizing with -O3. If I construct a small program I can get a significant
(>4x) speed improvement using -O3 versus -g. If I compile on a 32-bit
machine, it runs 5x faster on the 64-bit machine than the 64-bit-compiled
code does.

It seems like something is inhibiting the optimization. Someone on
comp.lang.fortran suggested it might be an alignment problem. I am
trying to go through and eliminate all 32-bit integers right now (this
is a pretty large hunk of code), but I thought I would survey this
group in case it is something naive I am missing.

Any opinion is welcome. I really need this to run up to speed, and I
need the big address space. Thanks in advance.

Dick
 
Walter Roberson

> I am porting a piece of C code to 64-bit on Linux. I am using 64-bit
> integers. It is floating-point-intensive code, and when I compile it
> (gcc) on a 64-bit machine, I don't see any runtime improvement when
> optimizing with -O3.
>
> It seems like something is inhibiting the optimization. Someone on
> comp.lang.fortran suggested it might be an alignment problem.

Possibly. It could also be a cache issue: you might have
cache-line conflicts, or the larger size of your integers might
mean your key data no longer fits into cache.
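
As a rough illustration of that second point (a sketch for illustration, not
from the original post; the array size is made up), switching bulk data from
32-bit to 64-bit integers doubles its footprint, which can push a working set
out of cache:

/* footprint.c -- compare the memory footprint of the same array of
   32-bit vs. 64-bit integers. */
#include <stdio.h>
#include <stdint.h>

#define N 1000000   /* hypothetical working-set size */

int main(void)
{
    printf("32-bit elements: %zu bytes\n", N * sizeof(int32_t));  /* ~4 MB */
    printf("64-bit elements: %zu bytes\n", N * sizeof(int64_t));  /* ~8 MB */
    return 0;
}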
 
santosh

> I am porting a piece of C code to 64-bit on Linux. I am using 64-bit
> integers. It is floating-point-intensive code, and when I compile it
> (gcc) on a 64-bit machine, I don't see any runtime improvement when
> optimizing with -O3. If I construct a small program I can get a significant
> (>4x) speed improvement using -O3 versus -g. If I compile on a 32-bit
> machine, it runs 5x faster on the 64-bit machine than the 64-bit-compiled
> code does.
>
> It seems like something is inhibiting the optimization. Someone on
> comp.lang.fortran suggested it might be an alignment problem. I am
> trying to go through and eliminate all 32-bit integers right now (this
> is a pretty large hunk of code), but I thought I would survey this
> group in case it is something naive I am missing.
>
> Any opinion is welcome. I really need this to run up to speed, and I
> need the big address space. Thanks in advance.

This group may not be the best option; maybe you should try a Linux or
GCC group?

If the same code and compilation commands produce such a runtime
difference, then perhaps the 64-bit version of the compiler and its
runtime libraries, as well as the system runtime libraries, are
not yet exploiting all the optimisations possible. Did you try giving
gcc permission to use intrinsics and SSE? Alignment could well be a
problem, though gcc *should* have chosen the best alignment for the
target unless you specified otherwise. Are there any aspects of your
code (like choice of data types, compiler-specific pragmas, struct
padding) that are perhaps selected for 32-bit systems and thus less
than optimal under 64-bit?

Did you try the Intel compiler? If it produces better code, then
that is a piece of evidence indicating, perhaps, that gcc isn't
emitting good code.
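
For what it's worth, "giving gcc permission" usually just means extra
command-line flags; invocations like the following (shown here as an
illustration, not taken from the thread) would be worth timing against the
plain -O3 build:

gcc -O3 prog.c -lm -o prog_base
gcc -O3 -march=native -mfpmath=sse prog.c -lm -o prog_native   # let gcc use the host CPU's instruction set and SSE math
gcc -O3 -march=native -ffast-math prog.c -lm -o prog_fast      # relaxes IEEE rules; only if the numerics can tolerate it

Note that when targeting x86-64, -mfpmath=sse is already the default, so
-march=native is usually the flag that matters there.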
 
cr88192

> I am porting a piece of C code to 64-bit on Linux. I am using 64-bit
> integers. It is floating-point-intensive code, and when I compile it
> (gcc) on a 64-bit machine, I don't see any runtime improvement when
> optimizing with -O3. If I construct a small program I can get a significant
> (>4x) speed improvement using -O3 versus -g. If I compile on a 32-bit
> machine, it runs 5x faster on the 64-bit machine than the 64-bit-compiled
> code does.
>
> It seems like something is inhibiting the optimization. Someone on
> comp.lang.fortran suggested it might be an alignment problem. I am
> trying to go through and eliminate all 32-bit integers right now (this
> is a pretty large hunk of code), but I thought I would survey this
> group in case it is something naive I am missing.
>
> Any opinion is welcome. I really need this to run up to speed, and I
> need the big address space. Thanks in advance.


OT:

This is actually an issue related to the mismatch between current processor
performance behavior and the calling convention used on Linux x86-64.

The thinking was roughly: base everything on a variant of the "register"
calling convention, and use SSE for all the floating-point math rather than
crufty old x87.

The problem is that current processors don't quite agree, and in practice
this sort of thing can actually go *slower*...

It seems, actually, that x87, lots of memory loads/stores, and complex
addressing forms can be used to better effect, performance-wise, than SSE,
register-heavy approaches, and "simple" addressing forms (in seeming
opposition to current "optimization wisdom").

I can't give much explanation as to why this is exactly, but it has been my
observation (from periodic performance testing during an ongoing
compiler-writing project...).

My guess is that these older mechanisms are heavily optimized in the
hardware, given that much existing x86 code uses them heavily (this may
change in the future, though, as 64-bit code becomes more prevalent...).

My guess is also that the calling convention was designed according to some
misguided sense of "optimization wisdom" rather than good solid benchmarks.

Better performance could probably have been achieved, at present, just by
pretending the x86-64 was an x86 with more registers and guaranteed-present
SSE.

Not only that, but the convention is designed in such a way as to be awkward
as well, and it leaves open the question of how to effectively pull off
varargs...

Or, at least, this is what happens on my processor (an Athlon 64 X2 4400+).
I don't know if it is similar on Intel chips.
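
A quick way to see the difference being described (a sketch for
illustration, not from the thread; the file name conv.c is made up) is to
compile a tiny function with -O3 -S as both a 32-bit and a 64-bit build and
compare the assembly:

/* conv.c -- build with "gcc -O3 -S conv.c" (64-bit) and
   "gcc -m32 -O3 -S conv.c" (32-bit), then compare the two conv.s files. */
double axpy(double a, double x, double y)
{
    /* 64-bit SysV ABI: a, x and y arrive in xmm0-xmm2 and gcc emits SSE
       scalar ops (mulsd/addsd).  Default 32-bit x86: the arguments arrive
       on the stack and the x87 unit (fld/fmul/fadd/fstp) is typically used. */
    return a * x + y;
}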
 
Bartc

> I am porting a piece of C code to 64-bit on Linux. I am using 64-bit
> integers. It is floating-point-intensive code, and when I compile it
> (gcc) on a 64-bit machine, I don't see any runtime improvement when
> optimizing with -O3. If I construct a small program I can get a significant
> (>4x) speed improvement using -O3 versus -g. If I compile on a 32-bit
> machine, it runs 5x faster on the 64-bit machine than the 64-bit-compiled
> code does.
>
> It seems like something is inhibiting the optimization. Someone on
> comp.lang.fortran suggested it might be an alignment problem. I am
> trying to go through and eliminate all 32-bit integers right now (this
> is a pretty large hunk of code), but I thought I would survey this
> group in case it is something naive I am missing.
>
> Any opinion is welcome. I really need this to run up to speed, and I
> need the big address space. Thanks in advance.

Hesitant to attempt an answer as I know nothing about 64-bit or gcc, but...

Does the program compiled in 32-bit mode run faster when compiled with
optimisation than without (on a 32- or 64-bit machine)? In other words, what
scale of improvement are you expecting? (This is about the main program.)

Is the improvement really likely to be 5x or more? If not, that sounds like
something is wrong with the 64-bit-compiled version, forget the optimisation,
if the 32-bit version can run that much faster.

Do you have the capability to look at a sample of code and see what
exactly the 64-bit compiler is generating? I doubt it's going to be as silly
as using (and emulating) 128-bit floats, but it does sound like there's
something seriously wrong. It seems unlikely that using int64 instead of
int32 would slow things down 5 times or more.

An alignment fault would be a compiler error; but you can print out a few
data addresses and see whether they are on 8/16-byte boundaries or whatever
is recommended.

Is the small program doing anything similar to the big one? It may be
benefiting from smaller instruction/data cache requirements.

You might find that ints/pointers suddenly turn from 32 bits to 64 bits when
compiled on 64-bit (and therefore use twice the memory bandwidth if you
have a lot of them), and that might hit some of the performance. You might
like to check the size of pointers, if you don't need 64-bit addressing; a
quick check is sketched below.
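
A small diagnostic along those lines (a sketch for illustration, not the
original poster's code; big_array just stands in for whatever the real key
data is):

/* sizes.c -- print basic type sizes and check the alignment of a sample
   data object on the 32-bit and 64-bit builds. */
#include <stdio.h>
#include <stdint.h>

static double big_array[1024];   /* hypothetical stand-in for the real data */

int main(void)
{
    printf("sizeof(int)=%zu  sizeof(long)=%zu  sizeof(void *)=%zu  sizeof(long double)=%zu\n",
           sizeof(int), sizeof(long), sizeof(void *), sizeof(long double));
    printf("big_array at %p, address mod 16 = %u\n",
           (void *)big_array, (unsigned)((uintptr_t)big_array % 16));
    return 0;
}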
 
cr88192

Bartc said:
> Hesitant to attempt an answer as I know nothing about 64-bit or gcc, but...
>
> Does the program compiled in 32-bit mode run faster when compiled with
> optimisation than without (on a 32- or 64-bit machine)? In other words,
> what scale of improvement are you expecting? (This is about the main
> program.)
>
> Is the improvement really likely to be 5x or more? If not, that sounds
> like something is wrong with the 64-bit-compiled version, forget the
> optimisation, if the 32-bit version can run that much faster.

Yes, that is a bit harsh...

> Do you have the capability to look at a sample of code and see what
> exactly the 64-bit compiler is generating? I doubt it's going to be as
> silly as using (and emulating) 128-bit floats, but it does sound like
> there's something seriously wrong. It seems unlikely that using int64
> instead of int32 would slow things down 5 times or more.

int32 vs. int64: int32 should actually be faster on x86-64 (after all, 32-bit
ints have less complex instruction encodings, i.e. no REX prefix, etc., and
the core of x86-64 is, after all, still x86...).

As for emulating 128-bit floats, it is conceivably possible. I am aware, in
any case, that on x86-64 gcc uses a 128-bit long double, but whether this is
an 80-bit float stuffed into a 128-bit slot (doing the magic of shuffling
between SSE registers and the FPU), or whether it uses emulated 128-bit
floats, I don't know (I have not investigated gcc's output in this case).

Note that SSE does not support 80-bit floats, and the conventions used on
x86-64 generally don't use the FPU (it may be used for some calculations,
but not much else), so if you are using long double, it is very possible
something funky is going on.

If this is the case, maybe try switching over to double and see if anything
is different.
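
A quick way to check what long double actually is on a given build (again,
an illustrative sketch, not code from the thread):

/* check_ld.c -- report how the compiler represents long double.  On x86-64
   Linux gcc this typically prints size 16 with a 64-bit mantissa, i.e. the
   80-bit x87 extended format padded to 16 bytes rather than a true 128-bit
   float. */
#include <stdio.h>
#include <float.h>

int main(void)
{
    printf("sizeof(long double) = %zu\n", sizeof(long double));
    printf("LDBL_MANT_DIG       = %d\n", LDBL_MANT_DIG);  /* 64 -> x87 extended, 113 -> IEEE quad */
    printf("LDBL_MAX_EXP        = %d\n", LDBL_MAX_EXP);
    return 0;
}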

> An alignment fault would be a compiler error; but you can print out a few
> data addresses and see whether they are on 8/16-byte boundaries or
> whatever is recommended.

Yes. Unless one is using "__attribute__((packed))" everywhere, it should not
be a problem...

> Is the small program doing anything similar to the big one? It may be
> benefiting from smaller instruction/data cache requirements.
>
> You might find that ints/pointers suddenly turn from 32 bits to 64 bits
> when compiled on 64-bit (and therefore use twice the memory bandwidth if
> you have a lot of them), and that might hit some of the performance. You
> might like to check the size of pointers, if you don't need 64-bit
> addressing.


Yes, I will agree here...

 
Dick Dowell

Thanks for all the hints and thoughts.

My small program is:

#include <stdio.h>
#include <math.h>
#include <time.h>
#include <unistd.h>   /* _POSIX_THREAD_CPUTIME, _POSIX_CPUTIME */

int main(void)
{
    struct timespec ts;
    double x, y;
    int i;
    long long n;

    n = 15000000;
    n *= 10000;
    fprintf(stderr, "LONG %lld\n", n);

    printf(" _POSIX_THREAD_CPUTIME _POSIX_CPUTIME %ld %ld\n",
           (long)_POSIX_THREAD_CPUTIME, (long)_POSIX_CPUTIME);

    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
    n = ts.tv_nsec;
    fprintf(stderr, "Before %lld sec %lld nsec\n",
            (long long)ts.tv_sec, (long long)ts.tv_nsec);

    y = 3.3;
    for (i = 0; i < 111100000; i++) {
        x = sqrt(y);
        y += 1.0;
    }

    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
    fprintf(stderr, "After %lld sec %lld nsec\n",
            (long long)ts.tv_sec, (long long)ts.tv_nsec);
    /* note: the elapsed figure is only meaningful within the same second */
    fprintf(stderr, "After %lld nsec elapsed\n", (long long)ts.tv_nsec - n);
    return 0;
}

It shows considerable improvement with -O3.

I think the problem is something less esoteric than the cache,
word size, etc. One thing I didn't say: I have multi-threading loaded,
though no new threads are created by these runs. I have tried a newer
Red Hat; I have not tried the Intel compilers.

Dick
 
Dick Dowell

I think I misspoke about my timer program. That one was used to attempt
to measure thread time. You can remove the references to the timers
and run it; it only shows about a 2x improvement from optimization.

The large difference I have actually seen is a 32-bit compile on another
machine, run on the 64-bit machine (12 sec), versus 64-bit code compiled
on the 64-bit machine (70 sec).

Sorry for the confusion.

Dick
 
Walter Roberson

> Thanks for all the hints and thoughts.
> My small program is:
> [snip]
>     y = 3.3;
>     for (i = 0; i < 111100000; i++) {
>         x = sqrt(y);
>         y += 1.0;
>     }
> [snip]
> It shows considerable improvement with -O3.

You do not do anything with x after you compute it. Any good
optimizer would optimize away the x = sqrt(y) statement. Once that
is done, the optimizer could even eliminate the loop completely
and replace it by y += 111100000. Compilers that did one or
both of these optimizations would produce much faster code than
compilers that did not. Your problem might have nothing to do
with 64-bit integers and everything to do with which optimizations
the compiler performs.
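
One way to keep the toy benchmark honest (an illustrative rewrite, not
Walter's code) is to make the result observable, so the optimizer cannot
delete the work as dead code:

/* bench.c -- accumulate and print the results so that -O3 cannot discard
   the sqrt() calls or remove the loop entirely. */
#include <stdio.h>
#include <math.h>

int main(void)
{
    double y = 3.3, sum = 0.0;
    int i;

    for (i = 0; i < 111100000; i++) {
        sum += sqrt(y);   /* the result feeds into sum, so the call must stay */
        y += 1.0;
    }
    printf("checksum %f\n", sum);   /* using the result keeps the loop alive */
    return 0;
}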
 
Dick Dowell

Thanks for all the suggestions. I've discovered that the ineffectiveness
of the optimization is data dependent. I managed to profile the code, and
78% of the runtime is spent in something called

__mul [1] (from the gprof output; the [1] just means #1 CPU user)

Here are a couple more lines from the gprof report:

granularity: each sample hit covers 4 byte(s) for 0.01% of 109.71 seconds

index  % time    self  children    called      name
                                                <spontaneous>
[1]      78.0   85.55      0.00                 __mul [1]
 
Keith Thompson

Dick Dowell said:
> Thanks for all the suggestions. I've discovered that the ineffectiveness
> of the optimization is data dependent. I managed to profile the code, and
> 78% of the runtime is spent in something called
>
> __mul [1] (from the gprof output; the [1] just means #1 CPU user)
>
> Here are a couple more lines from the gprof report:
>
> granularity: each sample hit covers 4 byte(s) for 0.01% of 109.71 seconds
>
> index  % time    self  children    called      name
>                                                 <spontaneous>
> [1]      78.0   85.55      0.00                 __mul [1]
> -----------------------------------------------

It's likely (or at least plausible) that __mul is a multiplication
routine invoked by the compiler for what looks like an ordinary
multiplication in your code. Perhaps there's some form of
multiplication that your CPU doesn't directly support.

In that case, you *might* get a significant performance improvement by
re-working your algorithm to avoid multiplications. For example, a
multiplication in a loop can often be replaced by an addition, as in the
sketch below (though that's the kind of optimization a good compiler
should be able to perform).
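
An illustrative sketch of that kind of strength reduction (not Keith's code;
the step value and array length are made up):

/* strength.c -- replace the per-iteration multiplication i * step with a
   running sum; both loops produce the same values. */
#include <stdio.h>

#define N 10

int main(void)
{
    const double step = 0.25;
    double xs_mul[N], xs_add[N];
    double acc = 0.0;
    int i;

    for (i = 0; i < N; i++)
        xs_mul[i] = i * step;      /* one multiply per iteration */

    for (i = 0; i < N; i++) {
        xs_add[i] = acc;           /* same values, additions only */
        acc += step;
    }

    printf("%f %f\n", xs_mul[N - 1], xs_add[N - 1]);   /* both print 2.250000 */
    return 0;
}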

Before you consider this, find out exactly what __mul is used for, and
*measure* the performance improvement you can get by avoiding it
(assuming you can).

It's even possible that the hardware supports the multiplication
directly, but the compiler doesn't know it; the solution might be as
simple as adding a command-line option to tell the compiler it can use
a CPU instruction rather than a subroutine call.

I'm assuming here that you're already using a high optimization level
in your compiler. Worrying about the performance of unoptimized code
would be a waste of time unless you seriously mistrust the optimizer.
 
