

Optimized SIMD Cross Product

A cross product is a useful operation, particularly in graphics. The computation is this:

\[ \vec{x} \times \vec{y} := \begin{bmatrix} x_1 y_2 - x_2 y_1\\ x_2 y_0 - x_0 y_2\\ x_0 y_1 - x_1 y_0 \end{bmatrix} \]

Basically all commodity hardware supports 4D vector registers of one type or another. For typical x86-64 computers, these would be the SSE registers (`xmm0` through `xmm15`). A 3D vector can be stored in a vector register with the fourth component ignored, and you can do 3D math that way very efficiently. This post looks at how to do a cross product.

The simple, direct translation of the math is (compile with SSE enabled, which is the baseline on x86-64):

```cpp
#include <immintrin.h>

//Method 1: Simple SSE
[[nodiscard]] inline static __m128 cross_product( __m128 const& vec0, __m128 const& vec1 ) {
	__m128 tmp0 = _mm_shuffle_ps(vec0,vec0,_MM_SHUFFLE(3,0,2,1));
	__m128 tmp1 = _mm_shuffle_ps(vec1,vec1,_MM_SHUFFLE(3,1,0,2));
	__m128 tmp2 = _mm_shuffle_ps(vec0,vec0,_MM_SHUFFLE(3,1,0,2));
	__m128 tmp3 = _mm_shuffle_ps(vec1,vec1,_MM_SHUFFLE(3,0,2,1));
	return _mm_sub_ps(
		_mm_mul_ps(tmp0,tmp1),
		_mm_mul_ps(tmp2,tmp3)
	);
}
```

This is also an obvious use case for the FMA instructions, available starting with Intel Haswell and AMD Piledriver. This produces (compile also with FMA support, e.g. `-mfma` on GCC/Clang):

```cpp
//Method 2: Simple SSE (FMA instructions)
[[nodiscard]] inline static __m128 cross_product( __m128 const& vec0, __m128 const& vec1 ) {
	__m128 tmp0 = _mm_shuffle_ps(vec0,vec0,_MM_SHUFFLE(3,0,2,1));
	__m128 tmp1 = _mm_shuffle_ps(vec1,vec1,_MM_SHUFFLE(3,1,0,2));
	__m128 tmp2 = _mm_shuffle_ps(vec0,vec0,_MM_SHUFFLE(3,1,0,2));
	__m128 tmp3 = _mm_shuffle_ps(vec1,vec1,_MM_SHUFFLE(3,0,2,1));
	__m128 tmp4 = _mm_mul_ps(tmp2,tmp3);
	return _mm_fmsub_ps( tmp0,tmp1, tmp4 );
}
```

This starts with four shuffle instructions, then two multiplies, then a subtract. Not bad, but perhaps we can do better? Observe that the purpose of the shuffles is to arrange the temporary vectors for the subsequent multiplication. Clearly, some of that shuffling can be factored out: e.g. we can multiply first and shuffle afterward, getting by with only three shuffles in total.
```cpp
//Method 3: Fewer swizzles, swizzle after subtraction
[[nodiscard]] inline static __m128 cross_product( __m128 const& vec0, __m128 const& vec1 ) {
	__m128 tmp0 = _mm_shuffle_ps(vec1,vec1,_MM_SHUFFLE(3,0,2,1));
	__m128 tmp1 = _mm_shuffle_ps(vec0,vec0,_MM_SHUFFLE(3,0,2,1));
	tmp0 = _mm_mul_ps(tmp0,vec0);
	tmp1 = _mm_mul_ps(tmp1,vec1);
	__m128 tmp2 = _mm_sub_ps(tmp0,tmp1);
	return _mm_shuffle_ps(tmp2,tmp2,_MM_SHUFFLE(3,0,2,1));
}

//Method 4: Fewer swizzles, swizzle after subtraction (FMA instructions)
[[nodiscard]] inline static __m128 cross_product( __m128 const& vec0, __m128 const& vec1 ) {
	__m128 tmp0 = _mm_shuffle_ps(vec1,vec1,_MM_SHUFFLE(3,0,2,1));
	__m128 tmp1 = _mm_shuffle_ps(vec0,vec0,_MM_SHUFFLE(3,0,2,1));
	tmp1 = _mm_mul_ps(tmp1,vec1);
	__m128 tmp2 = _mm_fmsub_ps( tmp0,vec0, tmp1 );
	return _mm_shuffle_ps(tmp2,tmp2,_MM_SHUFFLE(3,0,2,1));
}
```

I derived this method myself, but it had been discovered a few times before; it still isn't widely known. My younger sibling derived an alternate version, in which the shuffle comes before the subtraction:

```cpp
//Method 5: Fewer swizzles, swizzle before subtraction
[[nodiscard]] inline static __m128 cross_product( __m128 const& vec0, __m128 const& vec1 ) {
	__m128 tmp0 = _mm_shuffle_ps(vec0,vec0,_MM_SHUFFLE(3,0,2,1));
	__m128 tmp1 = _mm_shuffle_ps(vec1,vec1,_MM_SHUFFLE(3,1,0,2));
	__m128 tmp2 = _mm_mul_ps(tmp0,vec1);
	__m128 tmp3 = _mm_mul_ps(tmp0,tmp1);
	__m128 tmp4 = _mm_shuffle_ps(tmp2,tmp2,_MM_SHUFFLE(3,0,2,1));
	return _mm_sub_ps(tmp3,tmp4);
}

//Method 6: Fewer swizzles, swizzle before subtraction (FMA instructions)
[[nodiscard]] inline static __m128 cross_product( __m128 const& vec0, __m128 const& vec1 ) {
	__m128 tmp0 = _mm_shuffle_ps(vec0,vec0,_MM_SHUFFLE(3,0,2,1));
	__m128 tmp1 = _mm_shuffle_ps(vec1,vec1,_MM_SHUFFLE(3,1,0,2));
	__m128 tmp2 = _mm_mul_ps(tmp0,vec1);
	__m128 tmp4 = _mm_shuffle_ps(tmp2,tmp2,_MM_SHUFFLE(3,0,2,1));
	return _mm_fmsub_ps( tmp0,tmp1, tmp4 );
}
```

We now have six methods to compare. Yes, compare.
All of these variants have been tested to be correct, but just because the latter four methods use fewer instructions does not necessarily mean they are faster! To analyze the performance of these approaches, we basically have to turn to static analysis. In earlier versions of this article I used the Intel Architecture Code Analyzer (IACA). That tool is obsolete, though, so in updating this article I switched to the LLVM Machine Code Analyzer (llvm-mca). I generated results both for unpipelined code and for iterated code. Without further ado, the results:
From the unpipelined results, we can conclude that method 5 is the best. Even if we considered the iterated results more representative, method 5 is still a good choice: it is worse only on old Intel architectures and, by one cycle, on AMD (that is, method 1 might possibly be better on AMD hardware). Both methods work anywhere there's SSE, which these days is everything, including your toaster.

It's quite interesting that adding FMA instructions actually hurts in most cases, and never helps. It's unclear exactly why this happens.

TL;DR: use method 5 on Intel architectures, and probably in general. If you know for sure you're running only on AMD, sticking with method 1 might be marginally better. Either way, the difference is at most a few cycles.


