AVX in Visual Studio

Thomas

Hello

I've just started to convert some C++ code which is optimised for SSE4 to
AVX.

After a few false starts with Visual Studio, I've finally started to
generate AVX code. Unfortunately the result is that my application is
running slower rather than faster - it does at least produce the correct
results.

I suspect that the main reason for the decrease in speed is that the
compiler is mixing SSE and AVX code, which I believe is a real performance
killer.

For example, the following line:

const embree::avxf eps(1.0e-20f);

generates the following assembly code:

0000000000884F5B vmovss xmm0,dword ptr [__real@1e3ce508 (98E970h)]
0000000000884F63 vmovss dword ptr [rbp],xmm0
0000000000884F68 lea rax,[rbp]
0000000000884F6C vbroadcastss ymm0,dword ptr [rax]
0000000000884F71 vmovaps ymmword ptr [rbp+20h],ymm0
0000000000884F76 vmovaps ymm0,ymmword ptr [rbp+20h]
0000000000884F7B vmovaps ymmword ptr [eps],ymm0

As you can see, I'm using the Intel Embree intrinsics library.

Any idea how to avoid this - do I need to hand-code the lower-level
intrinsics?

Thanks for any help.

Thomas
 
Melzzzzz

Hello

I've just started to convert some c++ code which is optimised for
sse4 to avx.

After a few false starts with Visual Studio, I've finally started to
generate avx code. Unfortunately the result is that my application is
running slower rather than faster - it does at least produce the
correct results.

I suspect that the main reason for the decrease in speed is that the
compiler is mixing up sse and avx code which I believe is a real
performance killer?

For example the following line:

const embree::avxf eps(1.0e-20f);

Generates the following assembly code

0000000000884F5B vmovss xmm0,dword ptr [__real@1e3ce508 (98E970h)]
0000000000884F63 vmovss dword ptr [rbp],xmm0
0000000000884F68 lea rax,[rbp]
0000000000884F6C vbroadcastss ymm0,dword ptr [rax]
0000000000884F71 vmovaps ymmword ptr [rbp+20h],ymm0
0000000000884F76 vmovaps ymm0,ymmword ptr [rbp+20h]
0000000000884F7B vmovaps ymmword ptr [eps],ymm0

As you can see, I'm using the intel embree intrinsic library.

This is not mixing SSE with AVX. All instructions are prefixed with v.
Any idea how to avoid this - do I need to hand code the lower level
intrinsics?

I think the problem lies somewhere else.
 
Thomas

Melzzzzz said:
On Fri, 19 Apr 2013 18:55:16 +0100


This is not mixing sse with avx. All intructions are prefixed with v.


I think that problem lies somewhere else.

Thanks, that's really useful - I also had a look at an Intel doc about
mixing SSE and AVX which made the same point.

In my case I have a function for intersecting a ray with a triangle. The
function was written so that all the floats corresponding to the triangle
could be converted to embree::ssef's, so that 4 triangles at a time could be
intersected. That gave almost a 4x speed-up. The next step was to convert
the floats to embree::avxf's - but instead of something approaching an 8x
speed-up I got a slow-down. I still suspect that this is caused by mixing
SSE and AVX, since the profiler still points to the intersect routine as
being 90+% of the run-time.

So, what about the following two C++-to-assembly conversions, which were
created by the Visual Studio C++ compiler with speed optimization enabled?
They each seem to contain a mix of AVX and non-AVX instructions, and they
also seem much more verbose than I would have expected - but maybe I'm
missing something?

I can see that mixing AVX and non-AVX might be unavoidable in the second
example, where an AVX variable is being reduced to a scalar. But the first
example looks like a classic case for AVX, so why the non-AVX instructions?

Again, many thanks
Thomas


embree::avxf t = PuInv*(Qx*vx_+Qy*vy_+Qz*vz_);

0000000001315939 mov rax,qword ptr [this]
0000000001315941 vmovaps ymm0,ymmword ptr [Qz]
0000000001315949 vmulps ymm0,ymm0,ymmword ptr [rax+100h]
0000000001315951 vmovaps ymmword ptr [rbp+1000h],ymm0
0000000001315959 vmovaps ymm0,ymmword ptr [rbp+1000h]
0000000001315961 vmovaps ymmword ptr [rbp+1020h],ymm0
0000000001315969 vmovaps ymm0,ymmword ptr [rbp+1020h]
0000000001315971 vmovaps ymmword ptr [rbp+1040h],ymm0
0000000001315979 mov rax,qword ptr [this]
0000000001315981 vmovaps ymm0,ymmword ptr [Qy]
0000000001315989 vmulps ymm0,ymm0,ymmword ptr [rax+0E0h]
0000000001315991 vmovaps ymmword ptr [rbp+1060h],ymm0
0000000001315999 vmovaps ymm0,ymmword ptr [rbp+1060h]
00000000013159A1 vmovaps ymmword ptr [rbp+1080h],ymm0
00000000013159A9 vmovaps ymm0,ymmword ptr [rbp+1080h]
00000000013159B1 vmovaps ymmword ptr [rbp+10A0h],ymm0
00000000013159B9 mov rax,qword ptr [this]
00000000013159C1 vmovaps ymm0,ymmword ptr [Qx]
00000000013159C9 vmulps ymm0,ymm0,ymmword ptr [rax+0C0h]
00000000013159D1 vmovaps ymmword ptr [rbp+10C0h],ymm0
00000000013159D9 vmovaps ymm0,ymmword ptr [rbp+10C0h]
00000000013159E1 vmovaps ymmword ptr [rbp+10E0h],ymm0
00000000013159E9 vmovaps ymm0,ymmword ptr [rbp+10E0h]
00000000013159F1 vmovaps ymmword ptr [rbp+1100h],ymm0
00000000013159F9 vmovaps ymm0,ymmword ptr [rbp+1100h]
0000000001315A01 vaddps ymm0,ymm0,ymmword ptr [rbp+10A0h]
0000000001315A09 vmovaps ymmword ptr [rbp+1120h],ymm0
0000000001315A11 vmovaps ymm0,ymmword ptr [rbp+1120h]
0000000001315A19 vmovaps ymmword ptr [rbp+1140h],ymm0
0000000001315A21 vmovaps ymm0,ymmword ptr [rbp+1140h]
0000000001315A29 vmovaps ymmword ptr [rbp+1160h],ymm0
0000000001315A31 vmovaps ymm0,ymmword ptr [rbp+1160h]
0000000001315A39 vaddps ymm0,ymm0,ymmword ptr [rbp+1040h]
0000000001315A41 vmovaps ymmword ptr [rbp+1180h],ymm0
0000000001315A49 vmovaps ymm0,ymmword ptr [rbp+1180h]
0000000001315A51 vmovaps ymmword ptr [rbp+11A0h],ymm0
0000000001315A59 vmovaps ymm0,ymmword ptr [rbp+11A0h]
0000000001315A61 vmovaps ymmword ptr [rbp+11C0h],ymm0
0000000001315A69 vmovaps ymm0,ymmword ptr [PuInv]
0000000001315A71 vmulps ymm0,ymm0,ymmword ptr [rbp+11C0h]
0000000001315A79 vmovaps ymmword ptr [rbp+11E0h],ymm0
0000000001315A81 vmovaps ymm0,ymmword ptr [rbp+11E0h]
0000000001315A89 vmovaps ymmword ptr [rbp+1200h],ymm0
0000000001315A91 vmovaps ymm0,ymmword ptr [rbp+1200h]
0000000001315A99 vmovaps ymmword ptr [t],ymm0

if(embree::reduce_or(valid)==false)

00000000013161C7 vmovaps ymm0,ymmword ptr [valid]
00000000013161CF vtestps ymm0,ymmword ptr [valid]
00000000013161D8 mov eax,1
00000000013161DD mov ecx,0
00000000013161E2 cmove ecx,eax
00000000013161E5 test ecx,ecx
00000000013161E7 jne 00000000013161F5
00000000013161E9 mov dword ptr [rbp+1D80h],1
00000000013161F3 jmp 00000000013161FF
00000000013161F5 mov dword ptr [rbp+1D80h],0
00000000013161FF movzx eax,byte ptr [rbp+1D80h]
0000000001316206 test eax,eax
0000000001316208 jne 0000000001316214

return -1;

000000000131620A mov eax,0FFFFFFFFh
000000000131620F jmp 000000000131650A
 
Thomas

Andy Champ said:
I'm not familiar with AVX. But the instructions I do recognise imply to me
that vx_, vy_ and vz_ are member variables of the current object, so it is
loading this into rax for address calculation.

Are you sure you have all the optimisation turned on? It seems odd that
it is doing it three times. And the rest of the instructions look as
though they are copying data in and out of memory a lot.

Andy

Thanks

Yes, vx_, vy_ and vz_ are member variables, and yes, I have "optimize for
speed" turned on for this module in VS 2010.

The question is: does the "mov rax,qword ptr [this]" amount to a non-AVX
instruction which will significantly hit the performance of the subsequent
AVX instructions (the v-prefixed instructions)?

Thanks
Thomas
 
Melzzzzz

The question is: does the "mov rax,qword ptr [this]" amount to a non-AVX
instruction which will significantly hit the performance of the subsequent
AVX instructions (the v-prefixed instructions)?

No. Only mixing SSE instructions with ones prefixed with 'v' slows down
performance. Since you are using the compiler, I wouldn't suspect mixing
of SSE with AVX but rather suboptimal code.
 
Melzzzzz

Thanks, that's really useful - I also had a look at an intel doc
about mixing sse and avx which made the same point.

In my case I have a function for intersecting a ray with a triangle.
The function was written so that all the floats corresponding to the
triangle could be converted to embree::ssef's so that 4 triangles at
a time could be intersected. That gave almost a 4x speed-up. The next
step was to convert the floats to embree::avxf's - but instead of
something approaching an 8x speed up I got a slow-down. I still
suspect that this is caused by mixing sse and avx since the profiler
still points to the intersect routine as being 90+% of the run-time.

Could you post the SSE version?
So, what about the following two c++ to asembly conversions which
were created by the Visual Studio c++ compiler with speed
optimization enabled. They each seem to contain a mix of avx and
non-avx instructions (??) - they also seem much more verbose than I
would have expected, but maybe I'm missing something?

No. They do not contain mixing of AVX with *sse* instructions.
I can see that mixing avx and non-avx mighr be unavoidable in the
second example where an avx variable is being reduced to a scalar.
But the first example looks like a classic case for avx, so why the
non avx instructions?

There is no problem with that. Non-AVX instructions are normally mixed
with AVX, but *sse* causes the slowdown.
Again, Many Thanks
Thomas


embree::avxf t = PuInv*(Qx*vx_+Qy*vy_+Qz*vz_);

0000000001315939 mov rax,qword ptr [this]
0000000001315941 vmovaps ymm0,ymmword ptr [Qz]
0000000001315949 vmulps ymm0,ymm0,ymmword ptr [rax+100h]
0000000001315951 vmovaps ymmword ptr [rbp+1000h],ymm0
0000000001315959 vmovaps ymm0,ymmword ptr [rbp+1000h]

What is this ;)
Are you sure this is with optimisations on?

0000000001315961 vmovaps ymmword ptr [rbp+1020h],ymm0
0000000001315969 vmovaps ymm0,ymmword ptr [rbp+1020h]

again

0000000001315971 vmovaps ymmword ptr [rbp+1040h],ymm0
0000000001315979 mov rax,qword ptr [this]

and again

0000000001315981 vmovaps ymm0,ymmword ptr [Qy]
0000000001315989 vmulps ymm0,ymm0,ymmword ptr [rax+0E0h]
0000000001315991 vmovaps ymmword ptr [rbp+1060h],ymm0
0000000001315999 vmovaps ymm0,ymmword ptr [rbp+1060h]
00000000013159A1 vmovaps ymmword ptr [rbp+1080h],ymm0
00000000013159A9 vmovaps ymm0,ymmword ptr [rbp+1080h]
00000000013159B1 vmovaps ymmword ptr [rbp+10A0h],ymm0
00000000013159B9 mov rax,qword ptr [this]
00000000013159C1 vmovaps ymm0,ymmword ptr [Qx]
00000000013159C9 vmulps ymm0,ymm0,ymmword ptr [rax+0C0h]
00000000013159D1 vmovaps ymmword ptr [rbp+10C0h],ymm0
00000000013159D9 vmovaps ymm0,ymmword ptr [rbp+10C0h]
00000000013159E1 vmovaps ymmword ptr [rbp+10E0h],ymm0
00000000013159E9 vmovaps ymm0,ymmword ptr [rbp+10E0h]
00000000013159F1 vmovaps ymmword ptr [rbp+1100h],ymm0
00000000013159F9 vmovaps ymm0,ymmword ptr [rbp+1100h]
0000000001315A01 vaddps ymm0,ymm0,ymmword ptr [rbp+10A0h]
0000000001315A09 vmovaps ymmword ptr [rbp+1120h],ymm0
0000000001315A11 vmovaps ymm0,ymmword ptr [rbp+1120h]
0000000001315A19 vmovaps ymmword ptr [rbp+1140h],ymm0
0000000001315A21 vmovaps ymm0,ymmword ptr [rbp+1140h]
0000000001315A29 vmovaps ymmword ptr [rbp+1160h],ymm0
0000000001315A31 vmovaps ymm0,ymmword ptr [rbp+1160h]
0000000001315A39 vaddps ymm0,ymm0,ymmword ptr [rbp+1040h]
0000000001315A41 vmovaps ymmword ptr [rbp+1180h],ymm0
0000000001315A49 vmovaps ymm0,ymmword ptr [rbp+1180h]
0000000001315A51 vmovaps ymmword ptr [rbp+11A0h],ymm0
0000000001315A59 vmovaps ymm0,ymmword ptr [rbp+11A0h]
0000000001315A61 vmovaps ymmword ptr [rbp+11C0h],ymm0
0000000001315A69 vmovaps ymm0,ymmword ptr [PuInv]
0000000001315A71 vmulps ymm0,ymm0,ymmword ptr [rbp+11C0h]
0000000001315A79 vmovaps ymmword ptr [rbp+11E0h],ymm0
0000000001315A81 vmovaps ymm0,ymmword ptr [rbp+11E0h]
0000000001315A89 vmovaps ymmword ptr [rbp+1200h],ymm0
0000000001315A91 vmovaps ymm0,ymmword ptr [rbp+1200h]
0000000001315A99 vmovaps ymmword ptr [t],ymm0

This code is really, really unoptimized...

if(embree::reduce_or(valid)==false)

00000000013161C7 vmovaps ymm0,ymmword ptr [valid]
00000000013161CF vtestps ymm0,ymmword ptr [valid]
00000000013161D8 mov eax,1
00000000013161DD mov ecx,0
00000000013161E2 cmove ecx,eax
00000000013161E5 test ecx,ecx
00000000013161E7 jne 00000000013161F5

What is this ;)

00000000013161E9 mov dword ptr [rbp+1D80h],1
00000000013161F3 jmp 00000000013161FF
00000000013161F5 mov dword ptr [rbp+1D80h],0
00000000013161FF movzx eax,byte ptr [rbp+1D80h]
0000000001316206 test eax,eax
0000000001316208 jne 0000000001316214

return -1;

000000000131620A mov eax,0FFFFFFFFh
000000000131620F jmp 000000000131650A

I guess that this routine does something nonsensical...

Your compiler does not produce optimized code at all...
 
Melzzzzz

embree::avxf t = PuInv*(Qx*vx_+Qy*vy_+Qz*vz_);
This is how it should be written:
mov rax,qword ptr [this]
vmovaps ymm0,ymmword ptr [Qz]
vmulps ymm1,ymm0,ymmword ptr [rax+100h]
vmovaps ymm2,ymmword ptr [Qy]
vmulps ymm3,ymm2,ymmword ptr [rax+0E0h]
vmovaps ymm4,ymmword ptr [Qx]
vmulps ymm5,ymm4,ymmword ptr [rax+0C0h]
vaddps ymm6,ymm1,ymm3
vaddps ymm6,ymm6,ymm5
vmulps ymm6,ymm6,ymmword ptr [PuInv]
vmovaps ymmword ptr [t],ymm6
if(embree::reduce_or(valid)==false)
and this one:
vmovaps ymm0,ymmword ptr [valid]
vtestps ymm0,ymm0
jne address
 
Thomas

Melzzzzz said:
embree::avxf t = PuInv*(Qx*vx_+Qy*vy_+Qz*vz_);
This is how it should be written:
mov rax,qword ptr [this]
vmovaps ymm0,ymmword ptr [Qz]
vmulps ymm1,ymm0,ymmword ptr [rax+100h]
vmovaps ymm2,ymmword ptr [Qy]
vmulps ymm3,ymm2,ymmword ptr [rax+0E0h]
vmovaps ymm4,ymmword ptr [Qx]
vmulps ymm5,ymm4,ymmword ptr [rax+0C0h]
vaddps ymm6,ymm1,ymm3
vaddps ymm6,ymm6,ymm5
vmulps ymm6,ymm6,ymmword ptr [PuInv]
vmovaps ymmword ptr [t],ymm6
if(embree::reduce_or(valid)==false)
and this one:
vmovaps ymm0,ymmword ptr [valid]
vtestps ymm0,ymm0
jne address

Thanks a lot for your help.

I ran some stripped down tests and established that on my machine an avxf
multiply consistently costs about 1.4x an ssef multiply - meaning the cost
of an avxf flop is about 0.7x that of an ssef flop.

Of course, you only see the benefit of avxf multiplies if you are
multiplying more than 4 floats - if you are multiplying 4 or fewer then
you'll see a slowdown.

It turns out that the problem does indeed lie elsewhere.

Thanks
Thomas
 
