~/imallett (Ian Mallett)   # Optimized SIMD Cross-Product

A cross-product is a useful operation, particularly in graphics. The computation is this:

$\vec{x} \times \vec{y} := \begin{bmatrix} x_1 y_2 - x_2 y_1\\ x_2 y_0 - x_0 y_2\\ x_0 y_1 - x_1 y_0 \end{bmatrix}$

Basically all commodity hardware supports 4D vector registers of one type or another. For typical x86-64 computers, this would be the SSE registers "xmmi". 3D vectors can be stored in a vector register with the fourth component ignored, and you can do 3D math that way very efficiently. This post looks at how to do a cross product.

The simple, direct translation of the math is (compile with -msse at least; probably more like -msse4.2, -mavx, or even -mavx2 today):

//Method 1: Simple SSE
[[nodiscard]] inline static __m128 cross_product( __m128 const& vec0, __m128 const& vec1 ) {
__m128 tmp0 = _mm_shuffle_ps(vec0,vec0,_MM_SHUFFLE(3,0,2,1));
__m128 tmp1 = _mm_shuffle_ps(vec1,vec1,_MM_SHUFFLE(3,1,0,2));
__m128 tmp2 = _mm_shuffle_ps(vec0,vec0,_MM_SHUFFLE(3,1,0,2));
__m128 tmp3 = _mm_shuffle_ps(vec1,vec1,_MM_SHUFFLE(3,0,2,1));
return _mm_sub_ps(
_mm_mul_ps(tmp0,tmp1),
_mm_mul_ps(tmp2,tmp3)
);
}

This is also an obvious use case for the FMA instructions, available starting with Intel Haswell or AMD Piledriver. This produces (compile also with -mfma):

//Method 2: Simple SSE (FMA instructions)
[[nodiscard]] inline static __m128 cross_product( __m128 const& vec0, __m128 const& vec1 ) {
__m128 tmp0 = _mm_shuffle_ps(vec0,vec0,_MM_SHUFFLE(3,0,2,1));
__m128 tmp1 = _mm_shuffle_ps(vec1,vec1,_MM_SHUFFLE(3,1,0,2));
__m128 tmp2 = _mm_shuffle_ps(vec0,vec0,_MM_SHUFFLE(3,1,0,2));
__m128 tmp3 = _mm_shuffle_ps(vec1,vec1,_MM_SHUFFLE(3,0,2,1));
__m128 tmp4 = _mm_mul_ps(tmp2,tmp3);
return _mm_fmsub_ps(
tmp0,tmp1, tmp4
);
}

This starts with four shuffle instructions, then two multiplies, then a subtract. Not bad, but perhaps we can do better?

Observe that the purpose of the shuffles is to arrange the temporary vectors for the subsequent multiplication. Clearly, e.g. tmp0 and tmp1 must be shuffled with respect to each other, but shouldn't this only require one shuffle? Indeed, but then the resulting vector will be in the wrong order! This is fine though: we can just add a single shuffle back at the end to fix it. That is:

$\vec{tmp(\vec{x},\vec{y})} := \begin{bmatrix} x_0 y_1 - x_1 y_0\\ x_1 y_2 - x_2 y_1\\ x_2 y_0 - x_0 y_2 \end{bmatrix}\\ \vec{x} \times \vec{y} = \begin{bmatrix} tmp(\vec{x},\vec{y})_1\\ tmp(\vec{x},\vec{y})_2\\ tmp(\vec{x},\vec{y})_0 \end{bmatrix}$
//Method 3: Fewer swizzles, swizzle after subtraction
[[nodiscard]] inline static __m128 cross_product( __m128 const& vec0, __m128 const& vec1 ) {
__m128 tmp0 = _mm_shuffle_ps(vec1,vec1,_MM_SHUFFLE(3,0,2,1));
__m128 tmp1 = _mm_shuffle_ps(vec0,vec0,_MM_SHUFFLE(3,0,2,1));
tmp0 = _mm_mul_ps(tmp0,vec0);
tmp1 = _mm_mul_ps(tmp1,vec1);
__m128 tmp2 = _mm_sub_ps(tmp0,tmp1);
return _mm_shuffle_ps(tmp2,tmp2,_MM_SHUFFLE(3,0,2,1));
}
//Method 4: Fewer swizzles, swizzle after subtraction (FMA instructions)
[[nodiscard]] inline static __m128 cross_product( __m128 const& vec0, __m128 const& vec1 ) {
__m128 tmp0 = _mm_shuffle_ps(vec1,vec1,_MM_SHUFFLE(3,0,2,1));
__m128 tmp1 = _mm_shuffle_ps(vec0,vec0,_MM_SHUFFLE(3,0,2,1));
tmp1 = _mm_mul_ps(tmp1,vec1);
__m128 tmp2 = _mm_fmsub_ps( tmp0,vec0, tmp1 );
return _mm_shuffle_ps(tmp2,tmp2,_MM_SHUFFLE(3,0,2,1));
}

I derived this method, but it had been discovered a few other times before. It still isn't super widely known.

My younger sibling derived an alternate version, where the shuffle comes before the subtraction:

//Method 5: Fewer swizzles, swizzle before subtraction
[[nodiscard]] inline static __m128 cross_product( __m128 const& vec0, __m128 const& vec1 ) {
__m128 tmp0 = _mm_shuffle_ps(vec0,vec0,_MM_SHUFFLE(3,0,2,1));
__m128 tmp1 = _mm_shuffle_ps(vec1,vec1,_MM_SHUFFLE(3,1,0,2));
__m128 tmp2 = _mm_mul_ps(tmp0,vec1);
__m128 tmp3 = _mm_mul_ps(tmp0,tmp1);
__m128 tmp4 = _mm_shuffle_ps(tmp2,tmp2,_MM_SHUFFLE(3,0,2,1));
return _mm_sub_ps(tmp3,tmp4);
}
//Method 6: Fewer swizzles, swizzle before subtraction (FMA instructions)
[[nodiscard]] inline static __m128 cross_product( __m128 const& vec0, __m128 const& vec1 ) {
__m128 tmp0 = _mm_shuffle_ps(vec0,vec0,_MM_SHUFFLE(3,0,2,1));
__m128 tmp1 = _mm_shuffle_ps(vec1,vec1,_MM_SHUFFLE(3,1,0,2));
__m128 tmp2 = _mm_mul_ps(tmp0,vec1);
__m128 tmp4 = _mm_shuffle_ps(tmp2,tmp2,_MM_SHUFFLE(3,0,2,1));
return _mm_fmsub_ps( tmp0,tmp1, tmp4 );
}

We now have six methods to compare. Yes, compare. All of these variants have been tested to be correct, but just because the latter four methods use fewer instructions does not necessarily mean they are faster!

To analyze the performance of these approaches, we basically have to turn to static analysis. In earlier versions of this article I used Intel Architecture Code Analyzer (IACA). That tool is obsolete, though, so in updating this article, I switched to LLVM Machine Code Analyzer (llvm-mca).

I generated results for both unpipelined code (-iterations=1, as I understand it, code running from a cold cache, no pipelining) and for "pipelined" code (the default, 100 iterations). I put that in quotations because although real code will basically always be pipelined, it seems that the analyzer simulates this by just repeating this same instruction sequence over and over again, which produces quite pathological patterns. My guess is that in practice the performance would actually be closer to the pipelined/iterated code (i.e., lower in overall magnitude), but the relative timings would be better represented by the unpipelined code. I spent quite a while selecting the architectures to test upon, but I've selected a few based upon the intersection of being interesting, mainstream, and supported by the tool.

Architecture Unpipelined Iterated
Method 123456 123456
Nehalem 17 15 14 8 6 11
Sandy Bridge 17 15 14 8 6 11
Haswell 151714161315 101210131012
Skylake 151514141313 101010111010
Goldmont 16 16 15 13 13 12
Cannonlake 151514141313 101010111010
Zen 111311131113 7 9 810 810
Zen 2 111311131113 7 9 810 810
Zen 3 111311131113 7 9 810 810

From the unpipelined results, we are able to conclude that method 5 is the best. Even if we considered the iterated results more representative, method 5 is still a good choice: it is worse only on old Intel architectures and (by one cycle) on AMD (that is, method 1 might possibly be better on AMD hardware). Both methods work everywhere there's SSE, which is everything and your toaster these days. It's quite interesting that adding FMA instructions actually hurts in most cases, and never helps. It's unclear exactly why this happens.

TL;DR: We conclude that you should use method 5 on Intel architectures, and probably in general. If you know for sure you're only on AMD, sticking with method 1 might be marginally better. The difference in any case is at most just a few cycles.

References: COMMENTS
 Ian Mallett - Contact - Donate - 2021 - 