Performance of hand-optimised assembly


jacob navia

On 04/01/12 04:24, Joe keane wrote:
It is 'available'; it may just not be the best use of time. A C coder
can type i-n-l-i-n-e-space; a compiler can say 'this function is a part
of that function', 'apply the usual algorithms' and be done in ms; where
for a human coder it may be 'inline, no prob'; so then 'suppose that we
switch esi and edi in the called code, we can use -- trick, so it gains
one cycle'; 'suppose you do loop control by this trick'; so then a few
days later 'suppose that we switch eax and ecx in the called code, we
can use -- trick, so it gains one cycle', and let me go to sleep.

Yes, that is a good idea. Go to sleep.

I am finishing the assembly language core of lcc-win for extended
precision floats (448 bits) for the Macintosh version. I used for
that assembly (x86) and it shines.

Rotating a series of values up or down, for instance, can be done in
assembly in a few instructions, but can't be done at all in C.

Using the carry in multi-precision operations is essential for
extended floating point. Impractical in C but easy on assembly.
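For contrast, this is roughly what portable C has to do instead: rederive the carry from unsigned wraparound on every limb (a sketch with names of my own, not lcc-win's actual code):

```c
#include <assert.h>
#include <stdint.h>

/* Add two little-endian multi-precision numbers of n 64-bit limbs
   into r, returning the final carry.  With no access to the CPU
   carry flag, the carry must be recomputed from comparisons on
   each limb. */
uint64_t mp_add(uint64_t *r, const uint64_t *a, const uint64_t *b, int n)
{
    uint64_t carry = 0;
    for (int i = 0; i < n; i++) {
        uint64_t s = a[i] + carry;
        carry = (s < carry);      /* wrapped while adding the carry? */
        r[i] = s + b[i];
        carry += (r[i] < s);      /* wrapped while adding the limb? */
    }
    return carry;
}
```

A compiler may or may not turn this back into an ADD/ADC chain; in hand-written assembly the ADC chain is guaranteed.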

Etc. But I see that you use "--- trick" in your prose; probably you
have never programmed in assembly and (like most people here)
are just showing the world your prejudices.

Go to sleep, it's a good idea.
 

Stephen Sprunk

On 26/12/11 21:24, Stephen Sprunk wrote:

Sure. That was a simple example, and no, gcc doesn't do it, at least
in the version I have

Like many GCC sub-projects, the work isn't complete, and probably isn't
on by default:
http://gcc.gnu.org/projects/tree-ssa/vectorization.html

GCC isn't the only compiler out there, either; I've heard ICC is very
good at this, for instance.
Sure, you can recompile your program but I can't change a few loop
conditions since I am "stuck" with my old program, poor me.

AVX will be different instructions, different registers, etc. It won't
be a complete rewrite from scratch, but it'll be a lot of work, whereas
someone with a vectorizing compiler can just change one option and
recompile.
So what? Those machines aren't interesting anymore.

.... but what was the benefit to making chips that could actually do
full-width operations before anyone wrote code to take advantage of it,
and who would bother writing that code before those chips existed? That
is an excellent justification for better compilers, even if they can't
beat hand-written assembly every time: it is cheap to modify a compiler
to take advantage of new instructions, which then justify the silicon
that implements them.

And, as I noted, the same thing will likely happen with AVX.

S
 

jacob navia

On 04/01/12 18:59, Stephen Sprunk wrote:
Like many GCC sub-projects, the work isn't complete, and probably isn't
on by default:
http://gcc.gnu.org/projects/tree-ssa/vectorization.html


Note that those instructions have been available for at least 10 YEARS
and still mainstream compilers like MSVC or GCC do not perform any of
the magic you invoke...

Sure, maybe ONE DAY IN THE FAR FUTURE they will do it, but until then I
am using those instructions in my programs now...

GCC isn't the only compiler out there, either; I've heard ICC is very
good at this, for instance.

Yes, you need ALL THE RESOURCES of Intel Corp to build such a compiler,
no wonder nobody has done that besides Intel.
AVX will be different instructions, different registers, etc.

What?

Look Stephen, you are surely knowledgable in C but in assembly...
Take for instance ANDNPS (AND NOT Packed Single precision).

The corresponding AVX instruction is VANDNPS, with the same syntax
and the same semantics. You just add a "V" before the instruction
mnemonic.

All you have to do is use the 256-bit ymm registers instead of the
128-bit xmm registers. Your loop counters must be adjusted (only half
as many iterations) and that is it; maybe an hour of work for a big
program.

It won't
be a complete rewrite from scratch, but it'll be a lot of work, whereas
someone with a vectorizing compiler can just change one option and
recompile.

Yes, but it will be at least 10 years before gcc or MSVC propose that.
gcc doesn't even enable SSE2 or SSE3 automatically TODAY, 10 years
after they were announced by Intel...


That was in 2000, when it first appeared. Pleeeeeze, we are in 2012 now;
that is no longer relevant at all.


But let's forget about that. Look at this.

problem:

You have a vector of 8 long long integers and you want to shift that
vector by a given amount left between 1 and 63 bits.

Interface:

void ShiftVector(long long *vector, int AmountToShift);
; Pointer to vector in %rdi
; Amount to shift in %rsi
; gcc calling conventions for x86-64
shiftvector:
movq %rsi,%rcx

movq (%rdi),%rsi
movq 8(%rdi),%rax
shldq %cl,%rax,%rsi
movq %rsi,(%rdi)

movq 16(%rdi),%rsi
shldq %cl,%rsi,%rax
movq %rax,8(%rdi)

movq 24(%rdi),%rax
shldq %cl,%rax,%rsi
movq %rsi,16(%rdi)

movq 32(%rdi),%rsi
shldq %cl,%rsi,%rax
movq %rax,24(%rdi)

movq 40(%rdi),%rax
shldq %cl,%rax,%rsi
movq %rsi,32(%rdi)

movq 48(%rdi),%rsi
shldq %cl,%rsi,%rax
movq %rax,40(%rdi)

movq 56(%rdi),%rax
shldq %cl,%rax,%rsi
movq %rsi,48(%rdi)

shlq %cl,%rsi
movq %rsi,56(%rdi)

ret

26 instructions, 97 bytes (!!!)

Write that in C and see how big your procedure is, and how slow :)
 

Dann Corbit

{snippity-snip}
If you examine (for instance) the MPIR project:
http://mpir.org/
it immediately becomes clear that there is a time and a place for
assembly programming even with modern compilers.

Of course, by far the best way to perform optimization is to improve the
algorithm. That having been said, and supposing that the algorithm
cannot be improved and supposing further that a profile has shown some
clump of code to be the hot spot, then assembly language improvements
are a peachy-keen idea. They can easily give a linear speedup of up to
a factor of 4.

IMO-YMMV

Downsides to assembly improvements:
Improvements for chip X may not have the same benefit on chip Y.
Assembly ages badly. I have written thousands of lines with things
like:

MOV AH, 10
INT 21H

I doubt if any machine being used to read this message could even run
that code, since it is 16 bit and uses an old bios call.

I run 64 bit programs almost exclusively on my machines now (notable
exceptions being applications that are not available in 64 bits), so 32
bit assembly is slowing my machine down, and not speeding it up. Every
assembly variant is more code to maintain. If I support 30 different
CPUs, then I may have 30 times the code volume for some particular hunk
of an algorithm, and all 30 pathways will have to be tested on every
pass of the regression testing.

So it's not all roses and sunshine.

But if you absolutely, positively have to get there sooner, sometimes
assembly language is the only answer.
 

Ben Bacarisse

jacob navia said:
But let's forget about that. Look at this.

problem:

You have a vector of 8 long long integers and you want to shift that
vector by a given amount left between 1 and 63 bits.

Interface:

void ShiftVector(long long *vector, int AmountToShift);

You are keen to promote C99, so should that not be

void ShiftVector(long long vector[static 8], int AmountToShift);
; Pointer to vector in %rdi
; Amount to shift in %rsi
; gcc calling conventions for x86-64
shiftvector:
movq %rsi,%rcx

movq (%rdi),%rsi
movq 8(%rdi),%rax
shldq %cl,%rax,%rsi
movq %rsi,(%rdi)

movq 16(%rdi),%rsi
shldq %cl,%rsi,%rax
movq %rax,8(%rdi)

movq 24(%rdi),%rax
shldq %cl,%rax,%rsi
movq %rsi,16(%rdi)

movq 32(%rdi),%rsi
shldq %cl,%rsi,%rax
movq %rax,24(%rdi)

movq 40(%rdi),%rax
shldq %cl,%rax,%rsi
movq %rsi,32(%rdi)

movq 48(%rdi),%rsi
shldq %cl,%rsi,%rax
movq %rax,40(%rdi)

movq 56(%rdi),%rax
shldq %cl,%rax,%rsi
movq %rsi,48(%rdi)

shlq %cl,%rsi
movq %rsi,56(%rdi)

ret

This does not work for me and certainly looks wrong. I think the last
two lines before ret should be:

shlq %cl,%rax
movq %rax,56(%rdi)

instead. If I make that change, the function seems to work when tested.
26 instructions, 97 bytes (!!!)

Write that in C and see how big your procedure is, and how slow :)

When given the "obvious" C, gcc -O3 produces a function with 63
instructions taking up 188 bytes.

Timing is quite hard because the function is so very quick, but I get
about 0.4s for 100 million calls of yours and 0.65s for the C version.

I don't think that's bad, given all the advantages of writing in C.
Yes, it's 1.63 times slower, but how many programs will be dominated by
such calls?
 

BartC

jacob navia said:
On 04/01/12 04:24, Joe keane wrote:

Yes, that is a good idea. Go to sleep.

I am finishing the assembly language core of lcc-win for extended
precision floats (448 bits) for the Macintosh version. I used for
that assembly (x86) and it shines.

Rotating a series of values up or down, for instance, can be done in
assembly in a few instructions, but can't be done at all in C.

Using the carry in multi-precision operations is essential for
extended floating point. Impractical in C but easy on assembly.

I'm sure it could be coded somehow in C. In that case, how much faster do
you think the assembly version might be compared to C (even using the most
advanced optimising compiler available)?
 

James Harris

....
Of course, by far the best way to perform optimization is to improve the
algorithm.
True.

....

Downsides to assembly improvements:
Improvements for chip X may not have the same benefit on chip Y.
Assembly ages badly.  I have written thousands of lines with things
like:

MOV AH, 10
INT 21H

I doubt if any machine being used to read this message could even run
that code, since it is 16 bit and uses an old bios call.

Not quite. It is 16-bit but that's an old DOS call. And your x86
machine could still run it before it enters protected mode. It can
probably continue to run it in protected mode as a v86 task before you
switch to 64-bit mode. Not too sure about either 16-bit or 32-bit
support in x86 64-bit modes but the point is the x86 CPU will still
have support for earlier code.
I run 64 bit programs almost exclusively on my machines now (notable
exceptions being applications that are not available in 64 bits), so 32
bit assembly is slowing my machine down, and not speeding it up.

If you are storing lots of 64-bit values in memory where 32-bit values
would do, your code could end up being slower rather than faster. The
size of a mode's default operation doesn't change the hardware's
memory bandwidth or the cache space.
 Every
assembly variant is more code to maintain.  If I support 30 different
CPUs, then I may have 30 times the code volume for some particular hunk
of an algorithm, and all 30 pathways will have to be tested on every
pass of the regression testing.

There is some truth in that but the emphasis is wrong. In reality
there are sensible assembly approaches that work well across many
CPUs. To illustrate, Intel's and AMD's optimisation manuals are
written in fairly generic terms and the advice doesn't change much
between CPU generations.

Regarding the instructions Intel and AMD have added to the
architecture, they can make a big difference. Don't some compilers
provide intrinsics which basically give explicit access to the new
instructions? So even HLL code, not just assembly, may need to be
adjusted for different CPUs if you really want best performance.
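They do; GCC, MSVC and ICC all expose the new instructions as intrinsics. A small sketch using the SSE2 intrinsics from <emmintrin.h> (SSE2 is baseline on x86-64, so no special flags are needed there; the function name is mine):

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>

/* Add four 32-bit integers at once via the PADDD instruction's
   intrinsic.  The programmer picks the instruction; the compiler
   still does register allocation and scheduling. */
void add4(int32_t dst[4], const int32_t a[4], const int32_t b[4])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)dst, _mm_add_epi32(va, vb));
}
```

This carries most of the portability downsides of assembly: it compiles only where SSE2 exists, and a wider AVX version needs a separate code path.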

James
 

jacob navia

On 04/01/12 23:47, Ben Bacarisse wrote:
When given the "obvious" C, gcc -O3 produces a function with 63
instructions taking up 188 bytes.

Show please...
 

jacob navia

On 04/01/12 23:47, Ben Bacarisse wrote:
jacob navia said:
But let's forget about that. Look at this.

problem:

You have a vector of 8 long long integers and you want to shift that
vector by a given amount left between 1 and 63 bits.

Interface:

void ShiftVector(long long *vector, int AmountToShift);

You are keen to promote C99, so should that not be

void ShiftVector(long long vector[static 8], int AmountToShift);


YESSIR!

:)

But I am so used to being beaten up over my C99 posts that I wrote it
in C89. Now I get beaten because I didn't use C99....

This does not work for me and certainly looks wrong. I think the last
two lines before ret should be:

shlq %cl,%rax
movq %rax,56(%rdi)

instead. If I make that change, the function seems to work when tested.

Yes, the actual function is more complicated since there is a header
part in the vector (8 bytes) that you should skip. Since that
would complicate the function for no reason, I modified the indices by
hand to post it and added a bug doing that.

Thanks for your correction Ben. And, by the way, your assembly is not
as rusty as you make us believe...

When given the "obvious" C, gcc -O3 produces a function with 63
instructions taking up 188 bytes.

Timing is quite hard because the function is so very quick, but I get
about 0.4s for 100 million calls of yours and 0.65s for the C version.

I don't think that's bad, given all the advantages of writing in C.
Yes, it's 1.63 times slower, but how many programs will be dominated by
such calls?


Well, a floating point package is doing that ALL the time since you must
match the mantissas before doing ANY operation. Matching the mantissas
means shifting them to match the decimal point.


It is one of the central pieces of the package, and any performance
improvement, even an extremely small one, has BIG consequences.
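As a toy illustration of that alignment step (one 64-bit limb only; a real package shifts a multi-limb mantissa, which is where ShiftVector-style code runs, and the names here are mine):

```c
#include <assert.h>
#include <stdint.h>

/* Toy mantissa alignment: before adding two values represented as
   (mantissa, exponent), shift the mantissa of the smaller-exponent
   operand right until the exponents match. */
void align_mantissa(uint64_t *m, int *e, int e_target)
{
    int d = e_target - *e;
    if (d > 0) {
        *m = (d < 64) ? (*m >> d) : 0;   /* a shift by >= 64 would be UB */
        *e = e_target;
    }
}
```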

But I reserve my final opinion until I see what gcc generates, and
whether my gcc generates that too.


Thanks for your post Ben.

jacob
 

Ben Bacarisse

jacob navia said:
On 04/01/12 23:47, Ben Bacarisse wrote:

Show please...

void ShiftVector(unsigned long long vector[static 8], int AmountToShift)
{
int rest = 64 - AmountToShift;
vector[0] = (vector[0] << AmountToShift) | (vector[1] >> rest);
vector[1] = (vector[1] << AmountToShift) | (vector[2] >> rest);
vector[2] = (vector[2] << AmountToShift) | (vector[3] >> rest);
vector[3] = (vector[3] << AmountToShift) | (vector[4] >> rest);
vector[4] = (vector[4] << AmountToShift) | (vector[5] >> rest);
vector[5] = (vector[5] << AmountToShift) | (vector[6] >> rest);
vector[6] = (vector[6] << AmountToShift) | (vector[7] >> rest);
vector[7] = (vector[7] << AmountToShift);
}

Note the change to unsigned long long. It's what your code was doing
but your interface had signed integers whose use would give rise
to undefined behaviour.
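For anyone reproducing the comparison, here is a loop form of the same function with a couple of sanity checks (equivalent to the unrolled version above; note that the 1..63 restriction in the problem statement matters, since AmountToShift == 0 would make rest == 64, and shifting a 64-bit value by 64 is also undefined):

```c
#include <assert.h>
#include <stdint.h>

/* Loop form of ShiftVector: shift an 8-limb vector (most significant
   limb first) left by 1..63 bits, carrying bits across limbs. */
void ShiftVectorLoop(uint64_t vector[8], int AmountToShift)
{
    int rest = 64 - AmountToShift;   /* valid only for shifts 1..63 */
    for (int i = 0; i < 7; i++)
        vector[i] = (vector[i] << AmountToShift) | (vector[i + 1] >> rest);
    vector[7] <<= AmountToShift;
}
```

gcc -O3 will likely unroll this fully, so it should time much like the unrolled source.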
 

88888 Dihedral

Ben Bacarisse wrote on Thursday, 5 January 2012 at 08:22:52 UTC+8:
jacob navia said:
On 04/01/12 23:47, Ben Bacarisse wrote:

Show please...

void ShiftVector(unsigned long long vector[static 8], int AmountToShift)
{
int rest = 64 - AmountToShift;
vector[0] = (vector[0] << AmountToShift) | (vector[1] >> rest);
vector[1] = (vector[1] << AmountToShift) | (vector[2] >> rest);
vector[2] = (vector[2] << AmountToShift) | (vector[3] >> rest);
vector[3] = (vector[3] << AmountToShift) | (vector[4] >> rest);
vector[4] = (vector[4] << AmountToShift) | (vector[5] >> rest);
vector[5] = (vector[5] << AmountToShift) | (vector[6] >> rest);
vector[6] = (vector[6] << AmountToShift) | (vector[7] >> rest);
vector[7] = (vector[7] << AmountToShift);
}

Note the change to unsigned long long. It's what your code was doing
but your interface had signed integers whose use would give rise
to undefined behaviour.

So you are working on a little-endian system.

The instructions the C compiler generates for accessing that vector
might be quite different from hand-optimized assembly written by
professionals.
 

BartC

Ben Bacarisse said:
jacob navia said:
On 04/01/12 23:47, Ben Bacarisse wrote:

Show please...

void ShiftVector(unsigned long long vector[static 8], int AmountToShift)
{
int rest = 64 - AmountToShift;
vector[0] = (vector[0] << AmountToShift) | (vector[1] >> rest);
vector[1] = (vector[1] << AmountToShift) | (vector[2] >> rest);
vector[2] = (vector[2] << AmountToShift) | (vector[3] >> rest);
vector[3] = (vector[3] << AmountToShift) | (vector[4] >> rest);
vector[4] = (vector[4] << AmountToShift) | (vector[5] >> rest);
vector[5] = (vector[5] << AmountToShift) | (vector[6] >> rest);
vector[6] = (vector[6] << AmountToShift) | (vector[7] >> rest);
vector[7] = (vector[7] << AmountToShift);
}

I've tested this in 32-bit mode.

For 100 million iterations, lcc-win32 took at least 5.5 seconds, and gcc up
to -O2 took at least 4.3 seconds.

Assembly took 2.5 seconds (shifting less than 32 bits), or 3.3 seconds
(>32). (32 bits exactly took 1.4 seconds.)

However, gcc -O3 took 1.4 to 1.6 seconds (and 0.7 seconds for exactly a
32-bit shift).

I'm still looking at how it can do that, since my assembly code is pretty
short! It's obviously using inlining, but function call overheads are only
0.4 seconds.

The strange thing is that gcc's inline version of ShiftVector is only 80
instructions, but the ShiftVector code itself is about 250. There are some
parameter overheads, but not 170 instructions' worth.
 

Noob

James said:
Can you give me any pointers on how to link gcc and nasm or another
separate assembler? I've found a few web references with useful tips
but IIRC one of them was your first attempt at this a few years ago.
Maybe you found some more info since then. I know I can do it but
maybe there are good ways and bad ways.

comp.lang.asm.x86
alt.lang.asm
 

BartC

void ShiftVector(unsigned long long vector[static 8], int AmountToShift)
{
int rest = 64 - AmountToShift;
vector[0] = (vector[0] << AmountToShift) | (vector[1] >> rest);
....

I've tested this in 32-bit mode.
However, gcc -O3 took 1.4 to 1.6 seconds (and 0.7 seconds for exactly a
32-bit shift).

gcc -O3 was obviously taking advantage of some aspect of the repetitive
nature of my simple benchmark.

Varying the amount of shift in each iteration soon put paid to that!

Timings for 100 million iterations of a varying number of shifts (1 to 63)
are now:

gcc -O3 4.7 seconds
lcc-win32 -O 6.6 seconds
PellesC -Ot 6.8 seconds
DMC -o 12.4 seconds
My ASM 3.2 seconds

(And the Asm could do with some further work; this is just the first draft,
but I'm not going to bother. Having to logically swap alternate 32-bit words
in order to match 64-bit behaviour has already done my head in...)

So as it stands, the advantage of Asm over gcc -O3 is pretty much what
you've already found.
 

Phil Carmody

James Harris said:
On Dec 29, 1:12 am, Phil Carmody <[email protected]>
wrote:

...
I don't understand. Are you thinking of 64-bit architectures?

I was. I realise now this was not the context you were working in,
and I got myself confused.

In recompense, may I offer the following logic to address the
original question in a new light:


high = (common bits in high and low) + (bits only in high)
low = (common bits in high and low) + (bits only in low)

=> high + low
= (common bits of high and low) * 2 + (bits only in one or the other)
= (high & low) * 2 + (high ^ low)

=> middle = (high + low) / 2 = (high & low) + (high ^ low) / 2

With an optional +1 for rounding up if desired.

This is not original, it's quite well known. I think it's in the
Hackers Handbook, which I was reminded of just a day or two ago.
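In C the identity reads (a sketch; easy to check on a few values):

```c
#include <assert.h>
#include <stdint.h>

/* Overflow-free average per the identity derived above:
   (a + b) / 2 == (a & b) + ((a ^ b) >> 1).
   The common bits count fully; the differing bits count half,
   rounding down. */
uint32_t average(uint32_t a, uint32_t b)
{
    return (a & b) + ((a ^ b) >> 1);
}
```

Unlike (a + b) / 2, this cannot overflow for any pair of operands.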

Phil
 

Joe keane

void ShiftVector(unsigned long long vector[static 8], int AmountToShift)
{
int rest = 64 - AmountToShift;
vector[0] = (vector[0] << AmountToShift) | (vector[1] >> rest);
vector[1] = (vector[1] << AmountToShift) | (vector[2] >> rest);
vector[2] = (vector[2] << AmountToShift) | (vector[3] >> rest);
vector[3] = (vector[3] << AmountToShift) | (vector[4] >> rest);
vector[4] = (vector[4] << AmountToShift) | (vector[5] >> rest);
vector[5] = (vector[5] << AmountToShift) | (vector[6] >> rest);
vector[6] = (vector[6] << AmountToShift) | (vector[7] >> rest);
vector[7] = (vector[7] << AmountToShift);
}

I spent a bunch of time 'optimizing' this code [in C]; everything I did
made it worse!

Here's gcc:

.file "shi.c"
.text
.p2align 4,,15
.globl ShiftVector
.type ShiftVector, @function
ShiftVector:
.LFB0:
.cfi_startproc
movq 8(%rdi), %r9
movl $64, %eax
movq (%rdi), %r8
subl %esi, %eax
movl %eax, %ecx
movq %r9, %rdx
shrq %cl, %rdx
movl %esi, %ecx
salq %cl, %r8
movl %eax, %ecx
orq %r8, %rdx
movq 16(%rdi), %r8
movq %rdx, (%rdi)
movq %r8, %rdx
shrq %cl, %rdx
movl %esi, %ecx
salq %cl, %r9
movl %eax, %ecx
orq %r9, %rdx
movq 24(%rdi), %r9
movq %rdx, 8(%rdi)
movq %r9, %rdx
shrq %cl, %rdx
movl %esi, %ecx
salq %cl, %r8
movl %eax, %ecx
orq %r8, %rdx
movq 32(%rdi), %r8
movq %rdx, 16(%rdi)
movq %r8, %rdx
shrq %cl, %rdx
movl %esi, %ecx
salq %cl, %r9
movl %eax, %ecx
orq %r9, %rdx
movq 40(%rdi), %r9
movq %rdx, 24(%rdi)
movq %r9, %rdx
shrq %cl, %rdx
movl %esi, %ecx
salq %cl, %r8
movl %eax, %ecx
orq %r8, %rdx
movq 48(%rdi), %r8
movq %rdx, 32(%rdi)
movq %r8, %rdx
shrq %cl, %rdx
movl %esi, %ecx
salq %cl, %r9
movl %eax, %ecx
orq %r9, %rdx
movq %rdx, 40(%rdi)
movq 56(%rdi), %rdx
movq %rdx, %r9
shrq %cl, %r9
movl %esi, %ecx
movq %r9, %rax
salq %cl, %r8
salq %cl, %rdx
orq %r8, %rax
movq %rdx, 56(%rdi)
movq %rax, 48(%rdi)
ret
.cfi_endproc
.LFE0:
.size ShiftVector, .-ShiftVector
.ident "GCC: (Ubuntu/Linaro 4.6.1-9ubuntu3) 4.6.1"
.section .note.GNU-stack,"",@progbits
 

88888 Dihedral

jacob navia wrote on Thursday, 5 January 2012 at 03:22:25 UTC+8:
On 04/01/12 18:59, Stephen Sprunk wrote:


Note that those instructions have been available for at least 10 YEARS
and still mainstream compilers like MSVC or GCC do not perform any of
the magic you invoke...

Sure, maybe ONE DAY IN THE FAR FUTURE they will do it, but until then I
am using those instructions in my programs now...



Yes, you need ALL THE RESOURCES of Intel Corp to build such a compiler,
no wonder nobody has done that besides Intel.


What?

Look Stephen, you are surely knowledgable in C but in assembly...
Take for instance ANDNPS (AND NOT Packed Single precision).

The corresponding AVX instruction is VANDNPS, with the same syntax
and the same semantics. You just add a "V" before the instruction
mnemonic.

All you have to do is use the 256-bit ymm registers instead of the
128-bit xmm registers. Your loop counters must be adjusted (only half
as many iterations) and that is it; maybe an hour of work for a big
program.



Yes, but it will be at least 10 years before gcc or MSVC propose that.
gcc doesn't even enable SSE2 or SSE3 automatically TODAY, 10 years
after they were announced by Intel...



That was in 2000, when it first appeared. Pleeeeeze, we are in 2012 now;
that is no longer relevant at all.


But let's forget about that. Look at this.

problem:

You have a vector of 8 long long integers and you want to shift that
vector by a given amount left between 1 and 63 bits.

Interface:

void ShiftVector(long long *vector, int AmountToShift);
; Pointer to vector in %rdi
; Amount to shift in %rsi
; gcc calling conventions for x86-64
shiftvector:
movq %rsi,%rcx        ; AmountToShift -> rcx

movq (%rdi),%rsi      ; vector[0] -> rsi
movq 8(%rdi),%rax     ; vector[1] -> rax
shldq %cl,%rax,%rsi   ; do the shift to the left
movq %rsi,(%rdi)      ; save the rsi part, leave the rax part

movq 16(%rdi),%rsi    ; vector[2] -> rsi
shldq %cl,%rsi,%rax   ; do the shift
movq %rax,8(%rdi)     ; save the rax part, leave the rsi part
 

Stephen Sprunk

Not quite. It is 16-bit but that's an old DOS call. And your x86
machine could still run it before it enters protected mode. It can
probably continue to run it in protected mode as a v86 task before you
switch to 64-bit mode. Not too sure about either 16-bit or 32-bit
support in x86 64-bit modes but the point is the x86 CPU will still
have support for earlier code.

The CPU is capable of executing the above instructions in any mode,
including Long Mode, but the resulting behavior may vary--more because
of changes in the OS than in the CPU.
If you are storing lots of 64-bit values in memory where 32-bit values
would do your code could end up being slower rather than faster. The
size of a mode's default operation doesn't change the hardware's
memory bandwidth or the cache space.

Long Mode actually doesn't change the operand size of instructions; it
remains at 8-bit for some instructions and either 16- or 32-bit
(depending on how the page is marked) for others. You have to twiddle a
bit in the REX prefix to get 64-bit operands.

Also, most apps use the "small" code model, i.e. 32-bit pointers for
both code and data; a handful use the "medium" model, i.e. 32-bit
pointers for code but 64-bit for data. (GCC doesn't even bother to
implement the "large" model, i.e. 64-bit pointers for both.)

The real win is having twice as many GPRs and a register calling
convention; there should be fewer stack-related memory stalls and more
opportunities to find IPC, which for most workloads should easily offset
the larger pointers. The dominant memory problem for modern CPUs is
usually latency anyway, not bandwidth, so the real issue is cache line
utilization: larger pointers mean lower hit rates and more stalls.
Regarding the instructions Intel and AMD have added to the
architecture, they can make a big difference. Don't some compilers
provide instructions which basically give explicit access to the new
instructions? So even HLL code, not just assembly, may need to be
adjusted for different CPUs if you really want best performance.

Yes, though the result is pretty much assembly with C syntax, so it has
most of the same disadvantages.

FWIW, GCC will feed inline assembly (not just the pseudo-assembly above)
into its instruction scheduler, along with the surrounding assembly that
it generated itself, so that's one advantage to doing it that way vs.
linking in pure assembly.

S
 

Joe keane

You have a vector of 8 long long integers and you want to shift that
vector by a given amount left between 1 and 63 bits.

I rescheduled your code...

new code:
.text
.global ShiftVector
ShiftVector:
movq 0(%rdi),%rax
movq 8(%rdi),%rdx
movq %rsi,%rcx
pushq %rbx
shldq %cl,%rdx,%rax
movq 16(%rdi),%rbx
shldq %cl,%rbx,%rdx
movq %rax,0(%rdi)
movq 24(%rdi),%rax
shldq %cl,%rax,%rbx
movq %rdx,8(%rdi)
movq 32(%rdi),%rdx
shldq %cl,%rdx,%rax
movq %rbx,16(%rdi)
movq 40(%rdi),%rbx
shldq %cl,%rbx,%rdx
movq %rax,24(%rdi)
movq 48(%rdi),%rax
shldq %cl,%rax,%rbx
movq %rdx,32(%rdi)
movq 56(%rdi),%rdx
shldq %cl,%rdx,%rax
movq %rbx,40(%rdi)
shlq %cl,%rdx
popq %rbx
movq %rax,48(%rdi)
movq %rdx,56(%rdi)
ret

orig [C] 0.535
navia 0.45
new 0.38
 
