

Optimized SIMD Cross Product

TL;DR: use the function `cross_improved(...)` below.

A cross product is a useful operation, particularly in graphics. The computation is this: \[ \vec{x} \times \vec{y} := \begin{bmatrix} x_1 y_2 - x_2 y_1\\ x_2 y_0 - x_0 y_2\\ x_0 y_1 - x_1 y_0 \end{bmatrix} \]

3D vectors are typically stored (at least on-chip) in a hardware 4D vector register, with the fourth component ignored. On typical (x86-64) desktops, these are the SSE registers `xmm0` through `xmm15`. The simple and common implementation (such as here) then becomes:

```cpp
inline __m128 cross_simple(__m128 const& x, __m128 const& y) {
    __m128 tmp0 = _mm_shuffle_ps(x,x,_MM_SHUFFLE(3,0,2,1));
    __m128 tmp1 = _mm_shuffle_ps(y,y,_MM_SHUFFLE(3,1,0,2));
    __m128 tmp2 = _mm_shuffle_ps(x,x,_MM_SHUFFLE(3,1,0,2));
    __m128 tmp3 = _mm_shuffle_ps(y,y,_MM_SHUFFLE(3,0,2,1));
    return _mm_sub_ps(
        _mm_mul_ps(tmp0,tmp1),
        _mm_mul_ps(tmp2,tmp3)
    );
}
```

This starts with four shuffle instructions, then two multiplies, then a subtract. Not bad, but can we do better?

To see how, observe that the purpose of the shuffles is to arrange the temporary vectors for the subsequent multiplication. Clearly, e.g., we could instead shuffle each input only once and multiply it against the other input unshuffled, cutting the four shuffles down to two. Indeed, but then the resulting vector will be in the wrong order! This is fine though: we can just add a single shuffle back at the end to fix it.
Hence: \[ T(\vec{x},\vec{y}) := \begin{bmatrix} x_0 y_1 - x_1 y_0\\ x_1 y_2 - x_2 y_1\\ x_2 y_0 - x_0 y_2 \end{bmatrix}\\ \vec{x} \times \vec{y} = \begin{bmatrix} T(\vec{x},\vec{y})_1\\ T(\vec{x},\vec{y})_2\\ T(\vec{x},\vec{y})_0 \end{bmatrix} \]

```cpp
inline __m128 cross_improved(__m128 const& x, __m128 const& y) {
    __m128 tmp0 = _mm_shuffle_ps(y,y,_MM_SHUFFLE(3,0,2,1));
    __m128 tmp1 = _mm_shuffle_ps(x,x,_MM_SHUFFLE(3,0,2,1));
    tmp0 = _mm_mul_ps(tmp0,x);
    tmp1 = _mm_mul_ps(tmp1,y);
    __m128 tmp2 = _mm_sub_ps(tmp0,tmp1);
    return _mm_shuffle_ps(tmp2,tmp2,_MM_SHUFFLE(3,0,2,1));
}
```

We can also use a fused-multiply-add instruction to combine one of the multiplications with the subtraction, if the CPU supports it (feature "FMA", present in Haswell and later, probably synonymous with "AVX" in practice (?)). Note: compile with FMA support enabled.

```cpp
inline __m128 cross_improved_fma(__m128 const& x, __m128 const& y) {
    __m128 tmp0 = _mm_shuffle_ps(y,y,_MM_SHUFFLE(3,0,2,1));
    __m128 tmp1 = _mm_shuffle_ps(x,x,_MM_SHUFFLE(3,0,2,1));
    tmp1 = _mm_mul_ps(tmp1,y);
    __m128 tmp2 = _mm_fmsub_ps( tmp0,x, tmp1 );
    return _mm_shuffle_ps(tmp2,tmp2,_MM_SHUFFLE(3,0,2,1));
}
```

How much faster does all this actually turn out to be? Well, it's insufficient to look at the C++ code. It's insufficient even to count the instructions in the disassembly. We have to look at the timing information of the instructions' micro-ops. Although I initially tried to do this by hand, it turns out that this is nearly impossible, due to the complex interaction of various pieces of the CPU's microarchitecture, which often aren't even documented anywhere. Fortunately, Intel comes to our rescue with the Intel Architecture Code Analyzer which, with some finagling, can produce fairly accurate static timing analysis. The current version (as of this writing, 3.0) inexplicably doesn't support a latency analysis. I manually parsed some results out of its throughput analysis, but I'm not confident in them.
Instead, here are the timings from the latency analysis in version 2.1, for Haswell:
What's interesting here is that adding FMA instructions actually hurts the improved version (even though the improved version without FMA is the best of any algorithm, with or without FMA). It's unclear exactly why this happens. I should probably look into the details of the traces some more at some point, but I suspect the FMA instruction's complexity is the culprit. We conclude that you should use `cross_improved`.

For completeness, the key pieces of the disassembly look like:
It seems several others have discovered similar solutions:


