Performance of hand-optimised assembly

Stephen Sprunk · Jan 7, 2012

Look at this:

void doADD(double *L, double *R, double *result,size_t n)
{
for(size_t i=0; i<n;i++) result=L+R;
}

Assembly will always do better than standard C in the above specific
code. You can then use the ADDPD and add 2 integers at the same time
(under specific loop conditions).

Really? I don't know how to get GCC to assume the arrays above are
properly aligned, though I'm sure there's a way. However, if I change
the code slightly so it knows that, I get this code for the loop:

..L6:
addl $1, %edx
movapd R(%eax), %xmm0
addpd L(%eax), %xmm0
movapd %xmm0, result(%eax)
addl $16, %eax
cmpl %edx, %ecx
ja .L6
cmpl %esi, %ebx
je .L9
..L4:
leal L(,%ebx,8), %edx
xorl %ecx, %ecx
movl %ebx, %eax
.p2align 4,,7
..L8:
addl $1, %ecx
movsd (%edx), %xmm0
addsd R(,%eax,8), %xmm0
movsd %xmm0, result(,%eax,8)
leal (%ecx,%ebx), %eax
addl $8, %edx
cmpl %eax, %esi
ja .L8
..L9:

So, GCC on my system _will_ auto-vectorize that loop with the
appropriate flags (-msse2 -fpmath=sse -ftree-vectorize).

I noticed the above isn't unrolled, though, so I added -funroll-loops,
and it happily generated code that used xmm0-xmm7 and would presumably
be much faster, though it's way, way too verbose to post here.

Is your asm version really significantly better than the above?

S

jacob navia · Jan 7, 2012

Le 07/01/12 01:49, Joe keane a écrit :

orig [C] 0.535
navia 0.45
new 0.38

Hey THANKS Joe!

That is a thing I haven't done yet, and should have done before.
It is an important optimization I missed completely.

ASM now comes to a big factor from the C version.

jacob navia · Jan 7, 2012

Le 07/01/12 08:54, jacob navia a écrit :

Le 07/01/12 01:49, Joe keane a écrit :

orig [C] 0.535
navia 0.45
new 0.38

Click to expand...

Hey THANKS Joe!

That is a thing I haven't done yet, and should have done before.
It is an important optimization I missed completely.

What I didn't say is WHY did I missed that.

I am fully aware that scheduling instructions increases performance
but I do that optimization not very often...

The reason is that I have a really bad time understanding my own
programs after I do that. In my own brain circuitry I like to have
logical statements that are ordered logically like:

do operation 1
do operation 2
... etc

Scheduling breaks that, and when you reread the program later
there is a huge MESS where instructions from operations 1 and 2
(and maybe 456) are meshed together to use all available slots
what is great for the machine but terrible for my brain since
to figure out where the bug is, or how could I change that, I
have to mentally separate and detangle the mess.

Somehow I hate doing that, surely because of the way my wires
run in my brain.

And this is one of the weakness of manually programming in
assembler. I wanted to acknowledge that.

Obviously your rescheduling is not THAT bad, since you just use
another register and the program REMAINS quite readable, but
I didn't even do that because my revulsion at scheduling in
general. I think it is a bug in my brain, and it is nice to reflect
as to why I haven't seen it and avoided scheduling completely

Surely, a color editor would help, for instance if I could code
related instructions in a given color, then the next group in another,
that would help me to read the code later when I could follow a color
to group related instructions together...

Again, IDE's are important

jacob navia · Jan 7, 2012

Le 07/01/12 06:39, Stephen Sprunk a Ã©crit :

Look at this:

void doADD(double *L, double *R, double *result,size_t n)
{
for(size_t i=0; i<n;i++) result=L+R;
}

Assembly will always do better than standard C in the above specific
code. You can then use the ADDPD and add 2 integers at the same time
(under specific loop conditions).

Click to expand...

Really? I don't know how to get GCC to assume the arrays above are
properly aligned, though I'm sure there's a way.

Maybe, but that is exactly my point. Why couldn't we be able to
DECLARE that those arrays are properly aligned and are suitable
for SIMD processing?

Why should we expect that the compiler implements a HUGE and
brittle machinery to DEDUCE that instead?

Wouldn't it be simpler if we would just be able to declare those
in the same way that we say:
int i;
instead of expecting the commpiler to DEDUCE that because of the
usage of that variable it must be an int?

See that other thread I started about containers and SIMD
programming.

88888 Dihedral · Jan 7, 2012

88888 Dihedralæ–¼ 2012å¹´1æœˆ6æ—¥æ˜ŸæœŸäº”UTC+8ä¸Šåˆ5æ™‚39åˆ†31ç§’å¯«é“ï¼š

jacob naviaæ–¼ 2012å¹´1æœˆ5æ—¥æ˜ŸæœŸå››UTC+8ä¸Šåˆ3æ™‚22åˆ†25ç§’å¯«é“ï¼š

Le 04/01/12 18:59, Stephen Sprunk a Ã©crit :

Note that those instructions are available since at least 10 YEARS
and still mainstream compilers like MSVC or GCC do not perform any of
the magic you invoke...

Sure, maybe ONE DAY IN THE FAR FUTURE they will do it, but until then I
am using those instructions in my programs now...

Yes, you need ALLL THE RESOURCES of Intel Corp to build such a compiler,
no wonder nobody has done that besides Intel.

What?

Look Stephen, you are surely knowledgable in C but in assembly...
Take for instance ANDPS (AND NOT PACKED Single precision)

The corresponding AVX instructions is...
VANDNPS, with the same syntax same semantics. You just
add a "V" before the instruction mnemonic.

All you have to do is use the ymm registers with 256 bits
instead of the xmm registers with 128 bits. Your loop
counters must be adjusted (using only half as many loops)
and that was it, maybe an hour of work for a big program.

Yes, but it will be at least 10 years before gcc or MSVC propose that.
gcc doesn't even propose automatically SSE2 or SSE3 TODAY, 10 years
after it was announced by Intel...

That was in 2000, when it first appeared. Pleeeeeze, we are 2012 now,
that is no longer relevant at all.

But let's forget about that. Look at this.

problem:

You have a vector of 8 long long integers and you want to shift that
vector by a given amount left between 1 and 63 bits.

Interface:

void ShiftVector(long long *vector, int AmountToShift);
; Pointer to vector in %rdi
; Amount to shift in %rsi
; gcc calling conventions for x86-64
shiftvector:
movq %rsi,%rcx

Click to expand...

AmountToShift -> rcx

movq (%rdi),%rsi

Click to expand...

vector[0] -> rsi

movq 8(%rdi),%rax vector[1] -> rax
shldq %cl,%rax,%rsi

Click to expand...

Do the shift to the left.

the (rax, rsi) bit shifted result in rsi and rax not changed

movq %rsi,(%rdi)

Click to expand...

Save the rsi part. Leave the rax part.

movq 16(%rdi),%rsi

Click to expand...

vector[2] -> rsi

shldq %cl,%rsi,%rax

Click to expand...

Do the shift.

the (rsi, rax) bit shifted result in rax and rsi not changed
if there's an instruction to do this directly

Stephen Sprunk · Jan 7, 2012

Le 07/01/12 06:39, Stephen Sprunk a Ã©crit :

Look at this:

void doADD(double *L, double *R, double *result,size_t n)
{
for(size_t i=0; i<n;i++) result=L+R;
}

Assembly will always do better than standard C in the above specific
code. You can then use the ADDPD and add 2 integers at the same time
(under specific loop conditions).

Click to expand...

Really? I don't know how to get GCC to assume the arrays above are
properly aligned, though I'm sure there's a way.

Click to expand...

Maybe, but that is exactly my point. Why couldn't we be able to
DECLARE that those arrays are properly aligned and are suitable
for SIMD processing?

In the code above, there are three pointers, which may or may not be
suitably aligned for SIMD instructions. If they were declared as arrays
(which is the slight modification I made), the compiler will know
they're aligned.

Why should we expect that the compiler implements a HUGE and
brittle machinery to DEDUCE that instead?

Click to expand...

It's not huge or brittle; it's just one AND operation and conditional
branch per pointer. GCC doesn't bother to do it, for reasons I don't
understand, but it is not an unreasonable optimization given the
performance benefits of aligned access.

Wouldn't it be simpler if we would just be able to declare those
in the same way that we say:
int i;
instead of expecting the commpiler to DEDUCE that because of the
usage of that variable it must be an int?

Click to expand...

Huh? What I did was take L, R and result out of the parameter list and
declare them like this:

double L[256], R[256], result[256];

That is sufficient for the compiler to know they're properly aligned,
unlike a (double *).

S

Joe keane · Jan 7, 2012

do operation 1
do operation 2
... etc

My brain is warped from reading too much compiler output, so stuff like
this makes sense:

A part 1
A part 2
B part 1
A part 4
X part 8
D part 2

Ben Bacarisse · Jan 7, 2012

Stephen Sprunk said:
Le 07/01/12 06:39, Stephen Sprunk a Ã©crit :

On 25-Dec-11 14:35, jacob navia wrote:

Click to expand...

void doADD(double *L, double *R, double *result,size_t n)
{
for(size_t i=0; i<n;i++) result=L+R;
}

Click to expand...

Click to expand...

In the code above, there are three pointers, which may or may not be
suitably aligned for SIMD instructions. If they were declared as arrays
(which is the slight modification I made), the compiler will know
they're aligned.

Click to expand...

My gcc (4.6.1) generates addpd instructions with only -O3, even when L,
R and result are pointers. I wonder if the is a gcc version difference
or something to do with my system (64-bit Linux).

<snip>

Stephen Sprunk · Jan 7, 2012

Stephen Sprunk said:
Stephen Sprunk said:

Le 07/01/12 06:39, Stephen Sprunk a Ã©crit :
On 25-Dec-11 14:35, jacob navia wrote:

Click to expand...

void doADD(double *L, double *R, double *result,size_t n)
{
for(size_t i=0; i<n;i++) result=L+R;
}

Click to expand...

Click to expand...

In the code above, there are three pointers, which may or may not be
suitably aligned for SIMD instructions. If they were declared as arrays
(which is the slight modification I made), the compiler will know
they're aligned.

Click to expand...

My gcc (4.6.1) generates addpd instructions with only -O3, even when L,
R and result are pointers. I wonder if the is a gcc version difference
or something to do with my system (64-bit Linux).

Probably; my gcc is 4.2.4, which is a fairly old branch. I'll see if I
can talk my sysadmin into upgrading.

S

88888 Dihedral · Jan 10, 2012

88888 Dihedralæ–¼ 2012å¹´1æœˆ7æ—¥æ˜ŸæœŸå…UTC+8ä¸‹åˆ6æ™‚15åˆ†06ç§’å¯«é“ï¼š

88888 Dihedralæ–¼ 2012å¹´1æœˆ6æ—¥æ˜ŸæœŸäº”UTC+8ä¸Šåˆ5æ™‚39åˆ†31ç§’å¯«é“ï¼š

jacob naviaæ–¼ 2012å¹´1æœˆ5æ—¥æ˜ŸæœŸå››UTC+8ä¸Šåˆ3æ™‚22åˆ†25ç§’å¯«é“ï¼š

Le 04/01/12 18:59, Stephen Sprunk a Ã©crit :
On 26-Dec-11 14:36, jacob navia wrote:
Le 26/12/11 21:24, Stephen Sprunk a Ã©crit :
On 25-Dec-11 14:35, jacob navia wrote:
Assembly will always do better than standard C in the above specific
code. You can then use the ADDPD and add 2 integers at the same time
(under specific loop conditions).

... unless you have a vectorizing compiler, which will do the exact same
thing for you without the need for inline assembly.

Sure. That was a simple example, and no, gcc doesn't do it, at least
in the version I have

Like many GCC sub-projects, the work isn't complete, and probably isn't
on by default:
http://gcc.gnu.org/projects/tree-ssa/vectorization.html

Note that those instructions are available since at least 10 YEARS
and still mainstream compilers like MSVC or GCC do not perform any of
the magic you invoke...

Sure, maybe ONE DAY IN THE FAR FUTURE they will do it, but until thenI
am using those instructions in my programs now...

GCC isn't the only compiler out there, either; I've heard ICC is very
good at this, for instance.

Yes, you need ALLL THE RESOURCES of Intel Corp to build such a compiler,
no wonder nobody has done that besides Intel.

Also, when AVX support comes out, that vectorizing compiler will start
using AVX instructions while your assembly is stuck using SSE.

Sure, you can recompile your program bu I can't change a few loop
`conditions since I am "stuck" with my old program, poor me.

AVX will be different instructions, different registers, etc.

What?

Look Stephen, you are surely knowledgable in C but in assembly...
Take for instance ANDPS (AND NOT PACKED Single precision)

The corresponding AVX instructions is...
VANDNPS, with the same syntax same semantics. You just
add a "V" before the instruction mnemonic.

All you have to do is use the ymm registers with 256 bits
instead of the xmm registers with 128 bits. Your loop
counters must be adjusted (using only half as many loops)
and that was it, maybe an hour of work for a big program.

It won't
be a complete rewrite from scratch, but it'll be a lot of work, whereas
someone with a vectorizing compiler can just change one option and
recompile.

Yes, but it will be at least 10 years before gcc or MSVC propose that..
gcc doesn't even propose automatically SSE2 or SSE3 TODAY, 10 years
after it was announced by Intel...

Furthermore, there is no guaranteed win for using SEE; early SSE chips
simply decoded that into two FADDs, so there was no performance benefit
to all that work you invested in writing SSE assembly.

That was in 2000, when it first appeared. Pleeeeeze, we are 2012 now,
that is no longer relevant at all.

But let's forget about that. Look at this.

problem:

You have a vector of 8 long long integers and you want to shift that
vector by a given amount left between 1 and 63 bits.

Interface:

void ShiftVector(long long *vector, int AmountToShift);
; Pointer to vector in %rdi
; Amount to shift in %rsi
; gcc calling conventions for x86-64
shiftvector:
movq %rsi,%rcx

Click to expand...

AmountToShift -> rcx

movq (%rdi),%rsi

Click to expand...

vector[0] -> rsi

movq 8(%rdi),%rax vector[1] -> rax
shldq %cl,%rax,%rsi

Click to expand...

Do the shift to the left.

Click to expand...

the (rax, rsi) bit shifted result in rsi and rax not changed

movq %rsi,(%rdi)

Click to expand...

Save the rsi part. Leave the rax part.

movq 16(%rdi),%rsi

Click to expand...

vector[2] -> rsi

shldq %cl,%rsi,%rax

Click to expand...

Do the shift.

Click to expand...

the (rsi, rax) bit shifted result in rax and rsi not changed
if there's an instruction to do this directly

Save the rax part. Leave the rsi part.

Click to expand...

RBI, RCI, and REI not used, thus it's possible to speed up more
by dual or quadro-cores in X86.

Wolfgang.Draxinger · Jan 10, 2012

I'm starting a new thread, but it is prompted by this remark from
88888 Dihedral <[email protected]> about a short binary
search function I posted:

| Do you consider to convert 5 to 10 lines of C to assembly?
| I did that 20 years ago on X86 cpu.
| It was a piece of cake to gain the 20-40% speed required if paid
| well.

I've felt (it's only a feeling) that hand coding for speed (rather
than to access features not available in C like extra wide multiplies
and so on) was a thing of the past. But maybe I'm wrong. Is it
still possible to get a significant speed-up over C by hand coding?

There are a few corner cases, where hand written assembly makes
compiled code see only the rear lights. But those are few and
far between. The problem is, that modern CPUs have so much side state,
that it's next to impossible to keep track of all the little details,
that may seem insignificant, but have a huge effect. Just the order of
independent instruction blocks can have a large impact.

However where hand written assembler is still useful is implementing
the inner loops of complex algorithms processing (very) large
datasets. Such algorithms usually involve only shuffling numbers around
in a strict layout, so it's easy to reason about this kind of task and
find patterns, that a compiler won't. And it allows to exploit
peculiarities of the used instruction set, a compiler never could do.
Like the one described by Dark Shikari here:

http://stackoverflow.com/a/98251/524368

How much depends on the processor? How good are the modern
optimising compilers that would have to be beaten?

Depends on the task at hand. If it's just about the normal execution
path of a program without many loops the compiler will most likely win,
because will put the code in a lot of configurations through some
simulation and count "clock cycles". Also there's s lot of heuristics
in it. The other case are aforementioned processing loops. You as the
algorithm, writer know about the dependencies in data access, know
where pointers can possibly point and where not (the compiler doesn't),
which allows to gain significant performance boosts in such situations.

Wolfgang

Tim Rentsch · Jan 26, 2012

Ben Bacarisse said:
James Harris said:

...

size_t binsearch(int a[], size_t length, int key)

Why use size_t instead of unsigned int or unsigned long int,
especially as you add it to sum in the test harness which is long int?
Why not use unsigned for sum in the test harness?

int is better yet because the output will then be same across machines.
Using *any* unsigned type means the "not found" -1 return becomes a
value that can vary from implementation to implementation.

Click to expand...

You mean -1 may have a strange value on a 1s-complement machine? On a
2s-complement machine wouldn't it always be all bits set even if
assigned to an unsigned variable?

Click to expand...

No, the machine representation of -1 has no bearing on the result of the
C code[1]. What I mean is just that a return of -1 has a value of
SIZE_MAX and that varies from implementation to implementation. It's a
good value to return in a real binary search because a valid index can't
ever be (quite) that large -- the largest C-supported object of SIZE_MAX
bytes can have indexes no higher than SIZE_MAX - 1.

However, it's not a good return for a portable test program like this
because the sum of the returns will vary from machine to machine.
Alternatively, changing the simple sum to something like this:

size_t idx = binsearch(data, length, key)
sum += idx == (size_t)-1 ? -1 : idx;

could have made the output more portable. [snip rest]

Presumably you meant something different for that last
line of code, since in most cases the effect would be
the same as

sum += idx;

Tim Rentsch · Jan 26, 2012

Ben Bacarisse said:
[snip] There was a thread here a while back about
whether an object can be bigger than SIZE_MAX. It probably can, so
the compiler would not be able to make the assumption [that an
index into an array can never exceed some threshold].

Since it is the implementation (ie, the compiler) that is making
the decision (as to whether it ever creates such objects), it is
perfectly free to assume that it made whatever choice it made!

Tim Rentsch · Jan 26, 2012

jacob navia said:
Le 29/12/11 02:12, Phil Carmody a @C3{A9}crit :

The original program uses "int" to index, and "int" under windows
linux and mac is at most 32 bits. This is not a problem of assembly
but of the original C program

Yes. In the parafraph "Translation limits" the standard says that
an implemetation is not required to support objects with
more than 65535 bytes.

The depth of understanding reflected in this statement
is truly awe inspiring.

Tim Rentsch · Jan 26, 2012

BartC said:
The address calculation for int-arrays usually involves multiplying by
2, 4 or 8.

Conceptually, but the actual semantics isn't defined in
terms of multiplication; it only says that the resulting
pointer must point to the appropriate array element.
(Incidental note: in fact the Standard even uses the
word "conceptually" in a footnote in the section on
array indexing.)

With an index having the top bit set, that will overflow
the size_t range. Whatever size_t happens to be.

The Standard doesn't contain any statement about the type in
which array indexing is carried out; the defining expressions
are mathematical, not given in terms of C operators. As long as
the array object referred to is big enough, array indexing has to
work even if the index is less than PTRDIFF_MIN or greater than
PTRDIFF_MAX or SIZE_MAX.

Ben Bacarisse · Jan 26, 2012

Tim Rentsch said:
Ben Bacarisse <[email protected]> writes:

Presumably you meant something different for that last
line of code, since in most cases the effect would be
the same as

sum += idx;

Yes, I intended something like

if (idx == (size_t)-1)
sum -= 1;
else sum += idx;

but turned it into an entirely not equivalent conditional expression.

Ben Bacarisse · Jan 26, 2012

Tim Rentsch said:
Ben Bacarisse said:

[snip] There was a thread here a while back about
whether an object can be bigger than SIZE_MAX. It probably can, so
the compiler would not be able to make the assumption [that an
index into an array can never exceed some threshold].

Click to expand...

Since it is the implementation (ie, the compiler) that is making
the decision (as to whether it ever creates such objects), it is
perfectly free to assume that it made whatever choice it made!

I used the word "compiler" deliberately. Where there is a unified
implementation, then I agree that one part (the optimiser) can assume
facts about other parts (for example the C library), but that's not been
my experience. Most compilers have to work without knowing what
implementation they will end up part of.

Tim Rentsch · Feb 1, 2012

Ben Bacarisse said:
Tim Rentsch said:

Ben Bacarisse said:

[snip] There was a thread here a while back about
whether an object can be bigger than SIZE_MAX. It probably can, so
the compiler would not be able to make the assumption [that an
index into an array can never exceed some threshold].

Click to expand...

Since it is the implementation (ie, the compiler) that is making
the decision (as to whether it ever creates such objects), it is
perfectly free to assume that it made whatever choice it made!

Click to expand...

I used the word "compiler" deliberately.

HA! Well aren't you the crafty old dodger?

Where there is a unified
implementation, then I agree that one part (the optimiser) can assume
facts about other parts (for example the C library), but that's not been
my experience. Most compilers have to work without knowing what
implementation they will end up part of.

Right. I assumed you were talking about a constraint imposed by
the Standard, but actually you were talking about a constraint
imposed by a desire to be included in more than one implementation,
which is rather a different kettle of fish.

Phil Carmody · Feb 12, 2012

Tim Rentsch said:
The depth of understanding reflected in this statement
is truly awe inspiring.

If we can see depth being reflected, then we must be below
that statement.

Phil

Tim Rentsch · Mar 8, 2012

Phil Carmody said:
If we can see depth being reflected, then we must be below
that statement.

Stop it, you're making my head hurt.

SENTINEL CONTROL LOOP WHEN DEALING WITH TWO ARRAYS	1	Oct 26, 2023
Quick sort algorithm	1	Feb 22, 2023
C program: memory leak/ segmentation fault/ memory limit exceeded	0	Nov 12, 2022
How to inline assembly in a C program?	4	Mar 3, 2013
Trouble with prediction code, for the life of me I can't figure out why it isnt running properly. Help would be appreciated.	0	Jul 8, 2023
Performance of int/long in Python 3	187	Mar 25, 2013
BITCOIN PROGRAMMING - CODE INCLUDED - needs slight modification in linux terminal - NSA please do not block	0	Nov 2, 2024
Comparision of C Sharp and C performance	360	Dec 27, 2009

Performance of hand-optimised assembly

Stephen Sprunk

jacob navia

jacob navia

jacob navia

88888 Dihedral

Stephen Sprunk

Joe keane

Ben Bacarisse

Stephen Sprunk

88888 Dihedral

Wolfgang.Draxinger

Tim Rentsch

Tim Rentsch

Tim Rentsch

Tim Rentsch

Ben Bacarisse

Ben Bacarisse

Tim Rentsch

Phil Carmody

Tim Rentsch

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads