S
Stefano Teso
Hi,
First of all I don't know if this is the right group to post, sorry if
it is not; any suggestions on the right place for this kind of
discussion is warmly welcome.
I am writing a stochastic optimization algorithm which, as a part of
the energy calculation procedure, requires the computation of the
squared distance between a 3D vector from an array of n other 3D
vectors. This is the most computational intensive part of the
algorithm, so I wrote it using SSE2 intrinsics as defined in
<emmintr.h>, using GCC 4.4. The vector array is stored in SoA format,
with all components (x, y, z) aligned to a the 16-byte boundary, and
the same goes for the distance array.
The current code looks basically as follows:
static void
calc_distance_row (double *d, const double *xarray, const double
*yarray, const double *zarray, const int i, const int lo, const int
hi)
{
/* computes the *square* of the distance between a fixed vector i and
all vectors j in the range [lo, hi), whose coords are specified in
xarray, yarray, zarray */
/* NOTE: dist2 is a rather obvious C function that uses no intrinsics
*/
int j, lo_aligned, hi_aligned;
__m128d ix, iy, ix, jx, jy, jz, dx, dy, dz, mx, my, mz, s0, s1;
double *x, *y, *z;
lo_aligned = (lo & ~3);
hi_aligned = (hi & ~3);
for (j = lo; j < lo_aligned; j++)
darray[j] = dist2 (xarray, yarray, zarray, xarray[j],
yarray[j], zarray[j]);
x = &xarray[lo_aligned];
y = &yarray[lo_aligned];
z = &zarray[lo_aligned];
d = &darray[lo_aligned];
ix = _mm_load1_pd (&xarray);
iy = _mm_load1_pd (&yarray);
iz = _mm_load1_pd (&zarray);
for (j = lo_aligned; j < hi_aligned; j++) {
jx = _mm_load_pd (&x[0]);
jy = _mm_load_pd (&y[0]);
jz = _mm_load_pd (&z[0]);
dx = _mm_sub_pd (ix, jx);
dy = _mm_sub_pd (iy, iy);
dz = _mm_sub_pd (iz, iz);
mx = _mm_mul_pd (dx, dx);
my = _mm_mul_pd (dy, dy);
mz = _mm_mul_pd (dz, dz);
s0 = _mm_add_pd (mx, my);
s1 = _mm_add_pd (s0, mz);
_mm_store_pd (&d[0], s1);
/* the above chunk of code is replicated once with '2' in
place of '0' in the array indices */
x = &x[4];
y = &y[4];
z = &z[4];
d = &d[4];
}
for (j = hi_aligned; j < hi; j++)
d[j] = dist2 (xarray, yarray, zarray, xarray[j],
yarray[j], zarray[j]);
}
(There may be typos in the above routine, anyway the real code was
tested against a simple minded C version, and proved to work as
expected, outperforming it by about 2x.)
The first and third loops are used to handle non-unrollable cases,
while the second loop -- the only one using SSE2 intrinsics -- is
unrolled twice, for a whopping 4 square distances computed per
iteration. The average array size is 100 to 1000 vectors, so no huge
number crunching here.
Is this the best way to perform this operation? Is there any obvious
pitfall in the above code? In particular, how do I check that I am
incurring in unaligned access penalties, and if that is the case, how
do I avoid them? And even more, is there any way I could improve the
performance by better cache usage / prefetching? I already tried
unrolling the middle loop more, but it held no improvement.
I would like to squeeze as much performance as possible from this
routine. Any suggestion is appreciated.
Thanks in advance,
First of all I don't know if this is the right group to post, sorry if
it is not; any suggestions on the right place for this kind of
discussion is warmly welcome.
I am writing a stochastic optimization algorithm which, as a part of
the energy calculation procedure, requires the computation of the
squared distance between a 3D vector from an array of n other 3D
vectors. This is the most computational intensive part of the
algorithm, so I wrote it using SSE2 intrinsics as defined in
<emmintr.h>, using GCC 4.4. The vector array is stored in SoA format,
with all components (x, y, z) aligned to a the 16-byte boundary, and
the same goes for the distance array.
The current code looks basically as follows:
static void
calc_distance_row (double *d, const double *xarray, const double
*yarray, const double *zarray, const int i, const int lo, const int
hi)
{
/* computes the *square* of the distance between a fixed vector i and
all vectors j in the range [lo, hi), whose coords are specified in
xarray, yarray, zarray */
/* NOTE: dist2 is a rather obvious C function that uses no intrinsics
*/
int j, lo_aligned, hi_aligned;
__m128d ix, iy, ix, jx, jy, jz, dx, dy, dz, mx, my, mz, s0, s1;
double *x, *y, *z;
lo_aligned = (lo & ~3);
hi_aligned = (hi & ~3);
for (j = lo; j < lo_aligned; j++)
darray[j] = dist2 (xarray, yarray, zarray, xarray[j],
yarray[j], zarray[j]);
x = &xarray[lo_aligned];
y = &yarray[lo_aligned];
z = &zarray[lo_aligned];
d = &darray[lo_aligned];
ix = _mm_load1_pd (&xarray);
iy = _mm_load1_pd (&yarray);
iz = _mm_load1_pd (&zarray);
for (j = lo_aligned; j < hi_aligned; j++) {
jx = _mm_load_pd (&x[0]);
jy = _mm_load_pd (&y[0]);
jz = _mm_load_pd (&z[0]);
dx = _mm_sub_pd (ix, jx);
dy = _mm_sub_pd (iy, iy);
dz = _mm_sub_pd (iz, iz);
mx = _mm_mul_pd (dx, dx);
my = _mm_mul_pd (dy, dy);
mz = _mm_mul_pd (dz, dz);
s0 = _mm_add_pd (mx, my);
s1 = _mm_add_pd (s0, mz);
_mm_store_pd (&d[0], s1);
/* the above chunk of code is replicated once with '2' in
place of '0' in the array indices */
x = &x[4];
y = &y[4];
z = &z[4];
d = &d[4];
}
for (j = hi_aligned; j < hi; j++)
d[j] = dist2 (xarray, yarray, zarray, xarray[j],
yarray[j], zarray[j]);
}
(There may be typos in the above routine, anyway the real code was
tested against a simple minded C version, and proved to work as
expected, outperforming it by about 2x.)
The first and third loops are used to handle non-unrollable cases,
while the second loop -- the only one using SSE2 intrinsics -- is
unrolled twice, for a whopping 4 square distances computed per
iteration. The average array size is 100 to 1000 vectors, so no huge
number crunching here.
Is this the best way to perform this operation? Is there any obvious
pitfall in the above code? In particular, how do I check that I am
incurring in unaligned access penalties, and if that is the case, how
do I avoid them? And even more, is there any way I could improve the
performance by better cache usage / prefetching? I already tried
unrolling the middle loop more, but it held no improvement.
I would like to squeeze as much performance as possible from this
routine. Any suggestion is appreciated.
Thanks in advance,