Fast Square Roots

A lot has been written about computing inverse-square-roots \(1/\!\sqrt{s}\), particularly a certain well-known bithack. However, much less has been written about the ordinary square root \(\sqrt{s}\). Here, I'll tackle issues of performance and precision.

Bit of a spoiler—the best approach is usually to just use the built-in std::sqrt(⋯) function, crucially with math optimization flags enabled. Something like:

/*
To avoid overhead, should specify:
	"-fno-math-errno -fno-trapping-math" (Clang, GCC)
	"/fp:fast" (MSVC)
*/
#include <cmath>
inline float sqrtf_std( float x ) { return std::sqrt(x); }

—which will compile (shown without the inline, which you should leave in for production[1]) to[2]:

x86:

sqrtf_std(float):
	sqrtss   xmm0, xmm0
	ret

x86 (VEX):

sqrtf_std(float):
	vsqrtss   xmm0, xmm0, xmm0
	ret

ARM:

sqrtf_std(float):
	vsqrt.f32   s0, s0
	bx          lr

ARM64:

sqrtf_std(float):
	fsqrt   s0, s0
	ret

This will give a result that is bitwise perfect (i.e. correctly rounded from the infinite precision result), and it will do it in the 12-ish cycle range end-to-end[3], and more like 3–5 cycles pipelined[4]. That's pretty dang fast. Plus, the compiler can auto-vectorize it—which should not be overlooked and is not true of some other methods discussed here. Finally, and perhaps most importantly, the code is clear and maintainable.

This is a good, solid choice, and I recommend it for most uses.

We will explore alternatives to the above systematically, however. In general, they trade accuracy for better performance—a tradeoff that is probably not worthwhile outside of certain situations. Sometimes (as with certain dated techniques), the result is just strictly worse.


Optimization Flags

The C++ standard says that if the implementation bills itself as supporting IEEE 754, then std::sqrt(⋯) must detect negative input. Specifically, in addition to returning NaN, the call may raise the FE_INVALID floating-point exception and/or set errno.

The floating-point exception is not terribly problematic, because it is generally implemented by the hardware itself as a hardware trap: there is only overhead to enter the handler when the input actually is negative. Much worse, however, is the potential setting of errno. In practice, supporting that means that every call to square root gets bracketed by error-checking code at the assembly level. For example, without disabling this behavior, the code above compiles to:

sqrtf_std(float):
	vxorps     xmm1, xmm1, xmm1
	vucomiss   xmm1, xmm0
	ja         .L10
	vsqrtss    xmm0, xmm0, xmm0
	ret
.L10:
	jmp        sqrtf

If your assembly is rusty: that code checks whether \(0\) is greater than the input and, if so, jumps to a tailcall of a platform-specific function that will presumably do the check again and set errno as needed. Even on the "happy"[5] path, you eat maybe a 50% latency hit, which really hurts if you call square-root a lot (e.g. in computer graphics). The instruction alone already does the Right Thing (i.e., return NaN), and a 50% latency hit to support something most people will never use is deeply unwelcome.
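To see the contract concretely, here is a minimal sketch (the function name is mine; the errno write only occurs when errno support is enabled):

```cpp
#include <cmath>
#include <cerrno>

//Returns std::sqrt of a negative value.  The result is NaN either way; with
//	errno support enabled, errno is additionally set (to EDOM).  The volatile
//	defeats constant folding so the library/instruction actually runs.
inline float sqrt_negative_demo() noexcept
{
	errno = 0;
	volatile float neg = -1.0f;
	return std::sqrt( neg );
}
```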

The solution is to relax the requirements of the standard, which you want to do anyway because errno is simply a bad design choice. For Clang and GCC, you need to at least disable the errno support. Floating-point exceptions are in practice disabled by default, but code is still generated assuming they might be enabled, which has pipelining implications. To disable both, the flags you want to use are therefore (minimally) "-fno-math-errno -fno-trapping-math". MSVC has less control, but "/fp:fast" seems sufficient.

You'll also want to consider other floating-point optimization flags, and of course also general optimization flags.


The Methods

Now we turn to actually looking at the methods of computing the square-root itself. There are only a few basic methods that aren't obviously terrible.


Native Hardware

The first category gives the right answer using hardware.

We can use the SSE hardware that is present basically everywhere, which boils down to the sqrtss instruction. Running std::sqrt(⋯) will get you this, but with a major performance overhead unless you pass the optimization flags specified above[6]. You can also call intrinsic functions to access the instruction directly. Although this has disadvantages, it will work without the flags (advantageous especially for MSVC).

/*
Note clarity, auto-vectorization, and bitwise-perfect answer for all inputs.
Change `inline` -> `constexpr` in C++26.

To avoid overhead, should specify:
	"-fno-math-errno -fno-trapping-math" (Clang, GCC)
	"/fp:fast" (MSVC)
*/
#include <cmath>
[[nodiscard]] inline float sqrtf_std( float val ) noexcept
{
	return std::sqrt(val);
}
//x86(-64) only, doesn't require special flags
#include <immintrin.h>
[[nodiscard]] inline float sqrtf_sse( float val ) noexcept
{
	return _mm_cvtss_f32( _mm_sqrt_ss(_mm_set_ps1(val)) );
}

There is also the x87 FPU, which does the operation in 80-bit precision. This is overkill for 32-bit, and the use of the x87 FPU is discouraged anyway. Since compilers (finally, quite sensibly) default to using the SSE unit instead, we must access the x87 through inline assembly or extra compiler flags[7].

//x87 only (x86(-64) discouraged hardware)
[[nodiscard]] inline float sqrtf_x87( float val ) noexcept
{
	float result;
	#if   defined __clang__ //Clang
		asm(
			"fsqrt\n"
			: "=t"(result)
			: "t"(val)
			: "cc"
		);
	#elif defined __GNUC__ //GCC, etc.
		asm(
			"fld dword ptr %1\n"
			"fsqrt\n"
			: "=&t"(result)
			: "m"(val)
			: "cc"
		);
	#elif defined _MSC_VER && defined _M_IX86 //32-bit MSVC
		__asm
		{
			fld dword ptr val
			fsqrt
			fstp dword ptr result
		}
	#else
		#error
	#endif
	return result;
}

Approximation

The second category gives an approximate answer using a result that we compute ourselves, possibly aided by hardware.

The hardware provides an inverse-square-root instruction, rsqrtss, which by itself is faster than sqrtss. We can therefore compute \(\sqrt{s} = s\cdot\left(1/\!\sqrt{s}\right)\):

//Note incorrect result for 0 (and subnorms)
#include <immintrin.h>
[[nodiscard]] inline float rsqrtf_sse( float val ) noexcept
{
	return _mm_cvtss_f32( _mm_rsqrt_ss(_mm_set_ps1(val)) );
}
[[nodiscard]] inline float sqrtf_ssersqrt( float val ) noexcept
{
	return val * rsqrtf_sse(val);
}

There is also an ingenious bithack way to get a square-root. This comes from Wikipedia, perhaps as a simplification of a method first derived by Kahan and Ng in the same unpublished work that led to inverse-square-root.

The explanation in both places is lacking, but the method is not notable enough to have gotten the corrections and the article it needs for a proper treatment, either in academia or on Wikipedia. Roughly speaking, the algorithm computes \(\sqrt{s} = \exp_2\!\left(\log_2(s)/2\right)\), where the \(\log_2\) (and so also the \(\exp_2\)) can be computed cheaply from the bit pattern: \(\log_2(s) \approx s_\text{int}/2^{23} - 127\), with \(s_\text{int}\) denoting the bits of \(s\) read as an integer. However, I have written an article with the complete explanation, which has been moved to here.

#include <bit>
#include <cstdint>
[[nodiscard]] constexpr float sqrtf_bithack( float val ) noexcept
{
	uint32_t tmp = std::bit_cast<uint32_t>(val);

	constexpr int TWEAK = 0;
	tmp = ((1<<29)-(1<<22)+TWEAK) + (tmp>>1);

	return std::bit_cast<float>(tmp);
}
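For reference, the magic constant falls out of the approximation directly. Since \(s_\text{int} \approx 2^{23}\left(\log_2(s)+127\right)\), the bits of \(r = \sqrt{s}\) satisfy:

\[ r_\text{int} \approx 2^{23}\left( \tfrac{1}{2}\log_2(s) + 127 \right) = \frac{s_\text{int}}{2} + 127\cdot 2^{22} \]

—and \(127\cdot 2^{22} = 2^{29} - 2^{22}\), which is exactly the ((1<<29)-(1<<22)) in the code (with TWEAK \(=0\)).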

Note that std::bit_cast<⋯>(⋯) reinterprets the bits between floating-point and integer types. The commonplace pointer-reinterpreting approach is dangerously broken in both C and C++, and union hacks are problematic in C++. Before bit-cast, the recommended way was memcpy(⋯), which works surprisingly well, but does break constexpr-ness.
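For pre-C++20 code, a sketch of the memcpy(⋯) equivalents (the helper names are mine):

```cpp
#include <cstdint>
#include <cstring>

//Pre-C++20 replacement for std::bit_cast: well-defined, and compilers
//	optimize the memcpy away entirely.  Not usable in constant expressions.
inline uint32_t bits_of( float x ) noexcept
{
	uint32_t u;
	std::memcpy( &u, &x, sizeof(u) );
	return u;
}
inline float float_of( uint32_t u ) noexcept
{
	float x;
	std::memcpy( &x, &u, sizeof(x) );
	return x;
}
```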


Newton–Raphson Improvements

Since for this second category the answer is approximate, there is also the option to improve the result using Newton–Raphson iteration (NR, N–R), which for square-root is:

\[ x_{k+1} := \frac{1}{2}\left( x_k + \frac{s}{x_k} \right) \]

—where \(x_k\) is the current guess and \(s = x^2\) is the number whose square root \(x\) we're trying to find.
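This is just Newton's method applied to \(f(x) = x^2 - s\):

\[ x_{k+1} := x_k - \frac{f(x_k)}{f'(x_k)} = x_k - \frac{x_k^2 - s}{2x_k} = \frac{1}{2}\left( x_k + \frac{s}{x_k} \right) \]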

The basic case for NR is of course to do zero iterations:

[[nodiscard]] constexpr float nr0( float /*val*/, float xk ) noexcept { return xk; }

For one iteration, we can just do the formula blithely. The coefficient is nominally 0.5, but small perturbations can result in more accuracy, so we pull it out for later tweaking (see below!):

#include <bit>
[[nodiscard]] constexpr float nr1( float val, float xk ) noexcept
{
	constexpr float coeff = std::bit_cast<float>( 1'056'964'608 ); //`0.5f`
	return coeff*( xk + val/xk );
}

Floating-point division is infamously slow, so we can do the same thing but using the SSE rcpss instruction:

#include <bit>
#include <immintrin.h>
[[nodiscard]] inline float rcp_sse( float val ) noexcept
{
	return _mm_cvtss_f32( _mm_rcp_ss( _mm_set_ps1(val) ) );
}
[[nodiscard]] inline float nr1_ssercp( float val, float xk ) noexcept
{
	constexpr float coeff = std::bit_cast<float>( 1'056'964'608 ); //`0.5f`
	return coeff*( xk + val*rcp_sse(xk) );
}

This is a good use-case for a fused multiply-add (FMA) instruction, which is supported on some AVX and all AVX2 processors. To access it, pass "-mfma"[8] (Clang, GCC, etc.) or "/arch:AVX2" (MSVC). You could also use the _mm_fmadd_ss(⋯) intrinsic, which might be useful when you have FMA but not AVX2.

//Pass "-mfma" (Clang, GCC) or "/arch:AVX2" (MSVC)
#include <cmath>
#include <bit>
[[nodiscard]] inline float rcp_sse( float val ) noexcept;
[[nodiscard]] inline float nr1_ssercp_fmaA( float val, float xk ) noexcept
{
	constexpr float coeff = std::bit_cast<float>( 1'056'964'608 ); //`0.5f`
	return coeff*std::fma( val,rcp_sse(xk), xk );
}

The final multiplication by the coefficient has to wait for the FMA to finish. To get better pipelining, then, we can try distributing the coefficient. The idea is that even though there are now two multiplies, they can be executing at the same time as the reciprocal:

//Pass "-mfma" (Clang, GCC) or "/arch:AVX2" (MSVC)
#include <cmath>
#include <bit>
[[nodiscard]] inline float rcp_sse( float val ) noexcept;
[[nodiscard]] inline float nr1_ssercp_fmaB( float val, float xk ) noexcept
{
	constexpr float coeff = std::bit_cast<float>( 1'056'964'608 ); //`0.5f`
	return std::fma( coeff*val,rcp_sse(xk), coeff*xk );
}

Finally, here's a version that does two NR iterations. The two 0.5 factors can be combined into a single factor of 0.25:

[[nodiscard]] constexpr float nr2( float val, float xk ) noexcept
{
	constexpr float coeff = std::bit_cast<float>( 1'048'576'000 ); //`0.25f`
	//xk = 0.5f*( xk + val/xk );
	//xk = 0.5f*( xk + val/xk );
	xk =       xk + val/xk;
	xk = coeff*xk + val/xk;
	return xk;
}
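To see that the folded form is equivalent, write \(y = x_k + s/x_k\) for the un-halved first update (so the true first iterate is \(y/2\)); the second iteration then becomes:

\[ \frac{1}{2}\left( \frac{y}{2} + \frac{s}{y/2} \right) = \frac{1}{4}\,y + \frac{s}{y} \]

—which is exactly what the code computes: the first line produces \(y\), and the second applies coeff \(=0.25\) and divides by \(y\) directly.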

Comparison

Here we compare all square-root functions, with each NR variant as appropriate.

We test every possible valid float in both the error evaluation and profiling. I didn't profile or test negative numbers or NaNs; these are outside the domain of the real-valued branch of the square-root, or invalid, respectively. However, do note that the implementation's behavior for these values may be important to you.


Summary Table

Here is a summary table comparing the variants:

(The third and fourth columns give the average relative error over subnormal and normal inputs, respectively; the last four columns give the result computed for each special input.)

Algorithm | Pipelined Cycles | Avg. Rel. Err. \(\sqrt{\text{(subnorm)}}\) | Avg. Rel. Err. \(\sqrt{\text{(norm)}}\) | \(\sqrt{\text{(neg)}}\) | \(\sqrt{0}\) | \(\sqrt{\infty}\) | \(\sqrt{\text{(NaN)}}\)
sqrtf_std | 5 | 0 | 0 | NaN | 0 | ∞ | NaN
sqrtf_sse | 5 | 0 | 0 | NaN | 0 | ∞ | NaN
sqrtf_x87 | 23 | 0 | 0 | NaN | 0 | ∞ | NaN
sqrtf_ssersqrt + nr0 | 2 | ∞ | 9.359e-05 | NaN | NaN | NaN | NaN
sqrtf_ssersqrt + nr1 | 7 | ∞ | 2.146e-08 | NaN | NaN | NaN | NaN
sqrtf_ssersqrt + nr1_ssercp | 5 | ∞ | 4.787e-05 | NaN | NaN | NaN | NaN
sqrtf_ssersqrt + nr1_ssercp_fmaA | 5 | ∞ | 4.787e-05 | NaN | NaN | NaN | NaN
sqrtf_ssersqrt + nr1_ssercp_fmaB | 4 | ∞ | 4.787e-05 | NaN | NaN | NaN | NaN
sqrtf_ssersqrt + nr2 | 10 | ∞ | 2.108e-08 | NaN | NaN | NaN | NaN
sqrtf_bithack + nr0 | 1 | 0.6447 | 0.01505 | 3.365e+38 | 8.012e-20 | 1.824e+19 | 2.265e+19
sqrtf_bithack + nr1 | 5 | 0.1963 | 0.0001201 | 1.674e+38 | 3.979e-20 | ∞ | NaN
sqrtf_bithack + nr1_ssercp | 4 | 0.1961 | 0.000131 | 1.674e+38 | 3.977e-20 | ∞ | NaN
sqrtf_bithack + nr1_ssercp_fmaA | 4 | 0.1961 | 0.000131 | 1.674e+38 | 3.977e-20 | ∞ | NaN
sqrtf_bithack + nr1_ssercp_fmaB | 3 | 0.1961 | 0.000131 | 1.674e+38 | 3.977e-20 | ∞ | NaN
sqrtf_bithack + nr2 | 8 | 0.05331 | 3.799e-08 | 8.366e+37 | 1.988e-20 | NaN | NaN

I am providing the raw data for the general statistics and latency: statistics.csv, cycles_100iters.csv. These include more statistics than are presented above.


Pipelined Cycles

The performance cost is presented as "Pipelined Cycles", computed using "llvm-mca" 16.0.6, a cycle-accurate hardware simulator. This allows us to get accurate results on many different architectures without physically possessing them or dealing with timing issues[9]. The results above are for the "znver4" (i.e. AMD Zen 4) architecture, but results for several others can be downloaded.

Reporting the end-to-end latency of a single operation would be deeply misleading, since in real code other instructions interleave around it, so a standard practice is to repeat the operation many times and take the average (I used 100 times). Additionally, to reduce the influence of register dependencies, I unrolled the operation 4 times and disabled auto-vectorization to prevent optimization between iterations[10].
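An invocation along these lines reproduces the setup (file names are illustrative; -mcpu selects the simulated architecture and -iterations the repeat count):

```shell
# Compile to assembly, then feed it to llvm-mca simulating Zen 4,
# repeating the measured block 100 times.
clang++ -O2 -fno-math-errno -fno-trapping-math -S sqrt_kernels.cpp -o sqrt_kernels.s
llvm-mca -mcpu=znver4 -iterations=100 sqrt_kernels.s
```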


Relative Error

The relative error is fairly straightforward. The result of std::sqrt(⋯) is bitwise-perfect, and so is taken as the reference. The relative error of a value is the absolute difference of the value from the reference, divided by the reference. All error calculations were done in double-precision.

There are a few edge cases to consider. A correct result of 0 or ∞ would produce a relative error of NaN, so it is instead defined to be 0. A finite result where ∞ is expected is an error of ∞. The relative error of a NaN result is (and is left as) NaN (though this doesn't show up in the table above; see the raw data).
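A sketch of these conventions in code (the function name is mine; the reference is assumed nonnegative, as it is for square roots):

```cpp
#include <cmath>
#include <limits>

//Relative error in double precision, with the edge-case conventions above.
inline double rel_err( double reference, double value )
{
	if ( std::isnan(reference) || std::isnan(value) )
		return std::numeric_limits<double>::quiet_NaN(); //NaN stays NaN
	if ( reference == value ) return 0.0; //covers correct 0 and correct inf
	if ( std::isinf(reference) )
		return std::numeric_limits<double>::infinity(); //finite where inf expected
	return std::abs( value - reference ) / reference;
}
```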

The approximate methods have different parameters used to tweak the result to be more accurate. It is sensible to tweak the result either to minimize the average relative error or the maximum relative error[11]. The results above are for coefficient values that minimize the average relative error[12] (i.e. the middle column below).

(The first TWEAK/coeff pair minimizes the average relative error; the second pair minimizes the maximum relative error.)

Algorithm | TWEAK (min avg) | coeff (min avg) | TWEAK (min max) | coeff (min max)
sqrtf_ssersqrt + nr1 | 0 | 1056964608 | 0 | 1056964608
sqrtf_ssersqrt + nr1_ssercp | 0 | 1056964602 | 0 | 1056964606
sqrtf_ssersqrt + nr1_ssercp_fmaA | 0 | 1056964602 | 0 | 1056964606
sqrtf_ssersqrt + nr1_ssercp_fmaB | 0 | 1056964602 | 0 | 1056964607
sqrtf_ssersqrt + nr2 | 0 | 1048576000 | 0 | 1048576000
sqrtf_bithack + nr0 | -185516 | 0 | -307410 | 0
sqrtf_bithack + nr1 | -266985 | 1056962641 | -328307 | 1056958655
sqrtf_bithack + nr1_ssercp | -273073 | 1056962594 | -336149 | 1056958363
sqrtf_bithack + nr1_ssercp_fmaA | -273065 | 1056962594 | -331659 | 1056958524
sqrtf_bithack + nr1_ssercp_fmaB | -272998 | 1056962597 | -318966 | 1056958965
sqrtf_bithack + nr2 | -278695 | 1048576000 | -295683 | 1048575999

Analysis

The built-in (sqrtf_std(⋯)) function is a good place to start. It just calls the sqrtss instruction (as does sqrtf_sse(⋯), which is identical), producing the exact answer for each value. On the Zen 4 architecture, sqrtss has a latency of 15 cycles, but it can issue a new square-root every 5 cycles[13], meaning that, if we have a lot of square-roots, on average we'll be running three at once and the runtime will be a third as long[14]. If you don't have a lot of square-roots, then it will be other instructions interleaving. Thus, the pipelined cycles is reported as 5.

The sqrtf_x87(⋯) version also gives the exact answer. However, it computes in 80-bit precision, whereas only 32-bit is required, and so it takes commensurately longer. The x87 unit on Zen 4 is not pipelined, so it doesn't get the benefit of interleaving, either—though even with pipelining (e.g. on Skylake), it is still slower, as we'd expect.

The sqrtf_ssersqrt(⋯) variants launch us into approximate algorithms. The only variant here worth considering is the NR nr0(⋯) one (i.e. no NR at all). Every variant incorrectly gives NaN for 0 and ∞ for subnormals, but at least it is faster (5 → 2 cycles on Zen and 3 → 2 on most Intel architectures). Significantly, it also would auto-vectorize just as well. Still, the fact that it can't handle 0 or subnormals makes it hard to recommend. The NR implementations are not worth it, either; they don't improve the error by much, and they cut into the performance benefit.

The star of the show here is sqrtf_bithack(⋯), which can crank out a square root every 1 cycle on average, on every architecture. With nr0(⋯), the largest relative error for normals is 1.5%, which is pretty good. For 0 and subnormals it doesn't produce a great result, but at least it does produce a small value which is unlikely to cause trouble. The other variant worth considering is nr1_ssercp_fmaB(⋯): a definite slowdown, but the error is much less, improving normals by two orders of magnitude.


Conclusion

The main takeaway is that sqrtss is really good. It gives the bitwise-perfect result, and it does it quickly, especially when pipelining is considered. You can access sqrtss through std::sqrt(⋯), which has the important benefits of being clear and vectorizing easily. However, one must be careful to pass the right compiler flags to remove errno and trapping. In cases where that is not possible (e.g. you don't want to pass "/fp:fast" on MSVC), you can use an intrinsics-based version, which is almost as good.

The only other implementation that seems worth considering overall is the bithack one. It is faster, even when you consider 4⨯ vectorization of sqrtss—although it pays for it with some error. The ideal use-case would be when you just need one square-root, you know you have a valid input, and you don't need too much accuracy—especially when you're already doing floating-point-heavy stuff alongside. NR iteration is not worth it.

Here's an implementation demonstrating these conclusions. It should work optimally for Clang, GCC, and MSVC.

/*
On Clang and GCC, you should pass "-fno-math-errno -fno-trapping-math"!  You probably want to do
this anyway.  If you don't, you'll get warning(s), but it will still compile with a fallback.

On MSVC, you may pass "/fp:fast", which will produce a slightly better result.  However, this has
potentially unwanted effects, so you don't have to, in which case a fallback is used.

Note that on GCC and MSVC, the fallback adds overhead and impedes auto-vectorization!
*/

#include <cmath>
#include <cstdint>
#include <bit>

#if defined _M_IX86 || defined _M_AMD64 || defined __i386__ || defined __amd64__
	#include <immintrin.h>
#endif

namespace MyMath
{

#ifdef __GNUC__
	#ifndef __NO_MATH_ERRNO__
		#warning "This implementation should be compiled with '-fno-math-errno'!"
	#endif
	#if !defined __NO_TRAPPING_MATH__ && !defined __clang__
		//Note: unfortunately, clang doesn't appear to define `__NO_TRAPPING_MATH__`; I made a bug
		//	report: https://github.com/llvm/llvm-project/issues/78496
		#warning "This implementation should be compiled with '-fno-trapping-math'!"
	#endif
#endif

//General-purpose square-root, bitwise-perfect, surprisingly fast
#ifdef __cpp_lib_constexpr_cmath
[[nodiscard]] constexpr float sqrt( float val ) noexcept
#else
[[nodiscard]] inline    float sqrt( float val ) noexcept
#endif
{
	#if defined _M_IX86 || defined _M_AMD64 || defined __i386__ || defined __amd64__
		#if !defined __NO_MATH_ERRNO__ && !defined _M_FP_FAST
			//Intrinsic fallback to avoid overhead of `errno`, etc.  Just as good on Clang, but on
			//	GCC and MSVC, prevents auto-vectorization and adds a slight amount of overhead due
			//	to pointless shuffles!
			#ifdef __cpp_if_consteval
			if !consteval
			{
			#endif
				return _mm_cvtss_f32( _mm_sqrt_ss(_mm_set_ps1(val)) );
			#ifdef __cpp_if_consteval
			}
			#endif
		#endif
	#endif

	return std::sqrt(val);
}

//Fast approximate square root, 1.5% rel-err on normals and okay for 0 and subnormals (but don't
//	expect good NaN behavior).  Use when speed matters most, especially when (auto-)vectorization
//	isn't relevant and the floating-point pipeline is already stressed.  Could be hand-vectorized.
[[nodiscard]] constexpr float sqrt_fast( float val ) noexcept
{
	uint32_t tmp = std::bit_cast<uint32_t>(val);

	constexpr int TWEAK = -185'516;
	tmp = ((1<<29)-(1<<22)+TWEAK) + (tmp>>1);

	return std::bit_cast<float>(tmp);
}

}

Notes