Optimized SIMD Cross-Product

The cross-product is a useful operation. In graphics, they are vital and, although they are almost trivially simple to compute, they are also computed a lot—millions of times in a typical scene, if not far, far more. Thus, any small optimization is impactful. On the GPU, one generally depends on a builtin function, which is likely to be efficient. On the CPU, we generally must implement it ourselves. Here, we'll look at SIMD optimizations suitable for this.

First, to review, the computation is this:

\[ \vec{x} \times \vec{y} := \begin{bmatrix} x_1 y_2 - x_2 y_1\\ x_2 y_0 - x_0 y_2\\ x_0 y_1 - x_1 y_0 \end{bmatrix} \]

Basically all commodity hardware supports 4D vector registers of one type or another. For typical x86-64 computers, this would be the SSE registers xmmi. 3D vectors can be stored in a vector register with the fourth component ignored, and you can do 3D math that way very efficiently. Obviously ought to use an abstraction for that in real code, but we'll omit it here for clarity.

The simple, direct translation of the math is (compile with -msse at least; probably more like -msse4.2, -mavx, or even -mavx2 today):

1
2
3
4
5
6
7
8
9
10
11
12
13
//Method 1: Simple SSE

[[nodiscard]] inline static __m128 cross_product(
	__m128 const& vec0, __m128 const& vec1
) noexcept {
	__m128 tmp0 = _mm_shuffle_ps( vec0,vec0, _MM_SHUFFLE(3,0,2,1) );
	__m128 tmp1 = _mm_shuffle_ps( vec1,vec1, _MM_SHUFFLE(3,1,0,2) );
	__m128 tmp2 = _mm_shuffle_ps( vec0,vec0, _MM_SHUFFLE(3,1,0,2) );
	__m128 tmp3 = _mm_shuffle_ps( vec1,vec1, _MM_SHUFFLE(3,0,2,1) );
	return _mm_sub_ps(
		_mm_mul_ps( tmp0, tmp1 ),
		_mm_mul_ps( tmp2, tmp3 )
	);
}

This is also an obvious use case for the FMA instructions, available starting with Intel Haswell or AMD Piledriver. This produces (compile also with -mfma):

1
2
3
4
5
6
7
8
9
10
11
//Method 2: Simple SSE (FMA instructions)

[[nodiscard]] inline static __m128 cross_product(
	__m128 const& vec0, __m128 const& vec1
) noexcept {
	__m128 tmp0 = _mm_shuffle_ps( vec0,vec0, _MM_SHUFFLE(3,0,2,1) );
	__m128 tmp1 = _mm_shuffle_ps( vec1,vec1, _MM_SHUFFLE(3,1,0,2) );
	__m128 tmp2 = _mm_shuffle_ps( vec0,vec0, _MM_SHUFFLE(3,1,0,2) );
	__m128 tmp3 = _mm_shuffle_ps( vec1,vec1, _MM_SHUFFLE(3,0,2,1) );
	__m128 tmp4 = _mm_mul_ps( tmp2, tmp3 );
	return _mm_fmsub_ps( tmp0,tmp1, tmp4 );
}

This starts with four shuffle instructions, then two multiplies, then a subtract. Not bad, but perhaps we can do better?

Observe that the purpose of the shuffles is to arrange the temporary vectors for the subsequent multiplication. Clearly, e.g. tmp0 and tmp1 must be shuffled with respect to each other, but shouldn't this only require one shuffle? Indeed, but then the resulting vector will be in the wrong order! This is fine though: we can just add a single shuffle back at the end to fix it. That is:

\[ \vec{x} \times \vec{y} = \begin{bmatrix} t_1\\ t_2\\ t_0 \end{bmatrix}\text{, where } \vec{t} := \begin{bmatrix} x_0 y_1 - x_1 y_0\\ x_1 y_2 - x_2 y_1\\ x_2 y_0 - x_0 y_2 \end{bmatrix} \]

1
2
3
4
5
6
7
8
9
10
11
//Method 3: Fewer swizzles, swizzle after subtraction

[[nodiscard]] inline static __m128 cross_product(
	__m128 const& vec0, __m128 const& vec1
) noexcept {
	__m128 tmp0 = _mm_shuffle_ps( vec1,vec1, _MM_SHUFFLE(3,0,2,1) );
	__m128 tmp1 = _mm_shuffle_ps( vec0,vec0, _MM_SHUFFLE(3,0,2,1) );
	tmp0 = _mm_mul_ps( tmp0, vec0 );
	tmp1 = _mm_mul_ps( tmp1, vec1 );
	__m128 tmp2 = _mm_sub_ps( tmp0, tmp1 );
	return _mm_shuffle_ps( tmp2,tmp2, _MM_SHUFFLE(3,0,2,1) );
}

1
2
3
4
5
6
7
8
9
10
//Method 4: Fewer swizzles, swizzle after subtraction (FMA instructions)

[[nodiscard]] inline static __m128 cross_product(
	__m128 const& vec0, __m128 const& vec1
) noexcept {
	__m128 tmp0 = _mm_shuffle_ps( vec1,vec1, _MM_SHUFFLE(3,0,2,1) );
	__m128 tmp1 = _mm_shuffle_ps( vec0,vec0, _MM_SHUFFLE(3,0,2,1) );
	tmp1 = _mm_mul_ps( tmp1, vec1 );
	__m128 tmp2 = _mm_fmsub_ps( tmp0,vec0, tmp1 );
	return _mm_shuffle_ps( tmp2,tmp2, _MM_SHUFFLE(3,0,2,1) );
}

I derived this method, but it had been discovered a few other times before. It still isn't super widely known.

My younger sibling derived an alternate version, where the shuffle comes before the subtraction:

1
2
3
4
5
6
7
8
9
10
11
//Method 5: Fewer swizzles, swizzle before subtraction

[[nodiscard]] inline static __m128 cross_product(
	__m128 const& vec0, __m128 const& vec1
) noexcept {
	__m128 tmp0 = _mm_shuffle_ps( vec0,vec0, _MM_SHUFFLE(3,0,2,1) );
	__m128 tmp1 = _mm_shuffle_ps( vec1,vec1, _MM_SHUFFLE(3,1,0,2) );
	__m128 tmp2 = _mm_mul_ps( tmp0, vec1 );
	__m128 tmp3 = _mm_mul_ps( tmp0, tmp1 );
	__m128 tmp4 = _mm_shuffle_ps( tmp2,tmp2, _MM_SHUFFLE(3,0,2,1) );
	return _mm_sub_ps( tmp3, tmp4 );
}

1
2
3
4
5
6
7
8
9
10
//Method 6: Fewer swizzles, swizzle before subtraction (FMA instructions)

[[nodiscard]] inline static __m128 cross_product(
	__m128 const& vec0, __m128 const& vec1
) noexcept {
	__m128 tmp0 = _mm_shuffle_ps( vec0,vec0, _MM_SHUFFLE(3,0,2,1) );
	__m128 tmp1 = _mm_shuffle_ps( vec1,vec1, _MM_SHUFFLE(3,1,0,2) );
	__m128 tmp2 = _mm_mul_ps( tmp0, vec1 );
	__m128 tmp4 = _mm_shuffle_ps( tmp2,tmp2, _MM_SHUFFLE(3,0,2,1) );
	return _mm_fmsub_ps( tmp0,tmp1, tmp4 );
}

Profiling

We now have six methods to compare. Yes, compare. All of these variants have been tested to be correct, but just because the latter four methods use fewer instructions does not necessarily mean they are faster!

To analyze the performance of these approaches, we initially turn to static analysis. In earlier versions of this article I used Intel Architecture Code Analyzer (IACA). That tool is obsolete, though, so in updating this article, I switched to LLVM Machine Code Analyzer (llvm-mca).

I generated results for both unpipelined code (-iterations=1, as I understand it, code running from a cold cache, no pipelining) and for "pipelined" code (the default, 100 iterations). I put that in quotations because although real code will basically always be pipelined, it seems that the analyzer simulates this by just repeating this same instruction sequence over and over again, which produces quite pathological patterns. My guess is that in practice the performance would actually be closer to the pipelined/iterated code (i.e., lower in overall magnitude), but the relative timings would be better represented by the unpipelined code. I spent quite a while selecting the architectures to test upon, but I've selected a few based upon the intersection of being interesting, mainstream, and supported by the tool.

Without further ado, the results:

Architecture	Unpipelined						Iterated
Architecture	Method 1	2	3	4	5	6	1	2	3	4	5	6
Nehalem	17		15		14		8		6		11
Sandy Bridge	17		15		14		8		6		11
Haswell	15	17	14	16	13	15	10	12	10	13	10	12
Skylake	15	15	14	14	13	13	10	10	10	11	10	10
Goldmont	16		16		15		13		13		12
Cannonlake	15	15	14	14	13	13	10	10	10	11	10	10
Zen	11	13	11	13	11	13	7	9	8	10	8	10
Zen 2	11	13	11	13	11	13	7	9	8	10	8	10
Zen 3	11	13	11	13	11	13	7	9	8	10	8	10

From the unpipelined results, we are able to conclude that method 5 is the best. Even if we considered the iterated results more representative, method 5 is still a good choice: it is worse only on old Intel architectures and (by one cycle) on AMD (that is, method 1 might possibly be better on AMD hardware). Both methods work everywhere there's SSE, which is everything and your toaster these days. It's quite interesting that adding FMA instructions actually hurts in most cases, and never helps. While the multiply-subtract seems like a perfect use-case for FMA instructions, pipelining issues make it less-efficient in practice.

TL;DR: We conclude that you should use method 5 on Intel architectures, and probably in general. If you know for sure you're only on AMD, sticking with method 1 might be marginally better. The difference in any case is at most just a few cycles.

References

This article, which uses IACA and discusses methods 1 and 3.
Blog post on method 1; link points to comment suggesting method 3.