Thomas
Hello
I've just started converting some C++ code that is optimised for SSE4 to
AVX.
After a few false starts with Visual Studio, I've finally got it
generating AVX code. Unfortunately, my application now runs slower rather
than faster, although it does at least produce the correct results.
I suspect that the main reason for the slowdown is that the compiler is
mixing SSE and AVX code, which I believe is a real performance killer.
For example the following line:
const embree::avxf eps(1.0e-20f);
generates the following assembly code:
0000000000884F5B vmovss xmm0,dword ptr [__real@1e3ce508 (98E970h)]
0000000000884F63 vmovss dword ptr [rbp],xmm0
0000000000884F68 lea rax,[rbp]
0000000000884F6C vbroadcastss ymm0,dword ptr [rax]
0000000000884F71 vmovaps ymmword ptr [rbp+20h],ymm0
0000000000884F76 vmovaps ymm0,ymmword ptr [rbp+20h]
0000000000884F7B vmovaps ymmword ptr [eps],ymm0
As you can see, I'm using the Intel Embree intrinsic library.
Any idea how to avoid this? Do I need to hand-code the lower-level
intrinsics?
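To make clear what I mean by hand-coding, here is a minimal sketch of the same broadcast written directly with the AVX intrinsic _mm256_set1_ps. It assumes the avxf constructor does nothing more than splat one float across all eight lanes (I have not checked Embree's actual implementation). The helper name broadcast_to_array and the AVX_TARGET macro are hypothetical, made up for this example.

```cpp
#include <immintrin.h>

// Assumption: on GCC/Clang, enable AVX for this one function so the sketch
// builds without extra flags. With Visual Studio, AVX is enabled through the
// project's /arch:AVX setting instead, so the macro expands to nothing there.
#if defined(__GNUC__) || defined(__clang__)
#define AVX_TARGET __attribute__((target("avx")))
#else
#define AVX_TARGET
#endif

// Hypothetical helper: broadcast one float to all eight lanes of a 256-bit
// register, then copy the lanes to memory so they can be inspected.
AVX_TARGET void broadcast_to_array(float value, float out[8]) {
    const __m256 v = _mm256_set1_ps(value);  // splat value into 8 lanes
    _mm256_storeu_ps(out, v);                // unaligned store of 8 floats
}
```

Calling broadcast_to_array(1.0e-20f, buffer) with an eight-element float buffer fills every element with the same value. What the compiler actually emits for it will depend on the compiler and its optimisation settings.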
Thanks for any help.
Thomas