Fast Square Roots

A lot has been written about computing inverse-square-roots \(1/\!\sqrt{s}\), particularly a certain well-known bithack. However, much less has been written about the ordinary square root \(\sqrt{s}\). Here, I'll tackle issues of performance and precision.

Bit of a spoiler—the best approach is usually to just use the built-in std::sqrt(⋯) function, crucially with math optimization flags enabled. Something like:

/*
To avoid overhead, should specify:
	"-fno-math-errno -fno-trapping-math" (Clang, GCC)
	"/fp:fast" (MSVC)
*/
#include <cmath>
inline float sqrtf_std( float x ) { return std::sqrt(x); }

—which will compile (shown without the inline, which you should leave in for production[1]) to[2]:

x86:

sqrtf_std(float):
	sqrtss   xmm0, xmm0
	ret

x86 (VEX):

sqrtf_std(float):
	vsqrtss   xmm0, xmm0, xmm0
	ret

ARM:

sqrtf_std(float):
	vsqrt.f32   s0, s0
	bx          lr

ARM64:

sqrtf_std(float):
	fsqrt   s0, s0
	ret

This will give a result that is bitwise perfect (i.e. correctly rounded from the infinite precision result), and it will do it in the 12-ish cycle range end-to-end[3], and more like 3–5 cycles pipelined[4]. That's pretty dang fast. Plus, the compiler can auto-vectorize it—which should not be overlooked and is not true of some other methods discussed here. Finally, and perhaps most importantly, the code is clear and maintainable.

This is a good, solid choice, and I recommend it for most uses.

We will explore alternatives to the above systematically, however. In general, they trade accuracy for better performance—a tradeoff that is probably not worthwhile outside of certain situations. Sometimes (as with certain dated techniques), the result is just strictly worse.


Optimization Flags

The C++ standard says that if the implementation bills itself as supporting IEEE 754, then std::sqrt(⋯) must detect negative input. Specifically, in addition to returning NaN, the call may raise the FE_INVALID floating-point exception and/or set errno.

The floating-point exception is not terribly problematic, because it is generally implemented by the hardware itself as a hardware trap: there is only overhead to enter the handler when the input actually is negative. Much worse, however, is the potential setting of errno. In practice, supporting that means that every call to square root gets bracketed by error-checking code at the assembly level. For example, without disabling this behavior, the code above compiles to:

sqrtf_std(float):
	vxorps     xmm1, xmm1, xmm1
	vucomiss   xmm1, xmm0
	ja         .L10
	vsqrtss    xmm0, xmm0, xmm0
	ret
.L10:
	jmp        sqrtf

If your assembly is rusty: that code checks whether \(0\) is greater than the input and, if so, jumps to a tailcall of a platform-specific function that will presumably do the check again and set errno as needed. Even on the "happy"[5] path, you eat maybe a 50% latency hit, which really hurts if you call square-root a lot (e.g. in computer graphics). The instruction alone already does the Right Thing (i.e., return NaN), and a 50% latency hit to support something most people will never use is deeply unwelcome.
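To see the contract concretely, here is a minimal sketch (the function name is mine; the errno write only occurs when errno support is enabled):

```cpp
#include <cmath>
#include <cerrno>

//Returns std::sqrt of a negative value.  The result is NaN either way; with
//	errno support enabled, errno is additionally set (to EDOM).  The volatile
//	defeats constant folding so the library/instruction actually runs.
inline float sqrt_negative_demo() noexcept
{
	errno = 0;
	volatile float neg = -1.0f;
	return std::sqrt( neg );
}
```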

The solution is to relax the requirements of the standard, which you want to do anyway because errno is simply a bad design choice. For Clang and GCC, you need to at least disable the errno support. Floating-point exceptions are in practice disabled by default, but code is still generated assuming they might be enabled, which has pipelining implications. To disable both, the flags you want to use are therefore (minimally) "-fno-math-errno -fno-trapping-math". MSVC has less control, but "/fp:fast" seems sufficient.

You'll also want to consider other floating-point optimization flags, and of course also general optimization flags.


The Methods

Now we turn to actually looking at the methods of computing the square-root itself. There are only a few basic methods that aren't obviously terrible.


Native Hardware

The first category gives the right answer using hardware.

We can use the SSE hardware that is present basically everywhere, which boils down to the sqrtss instruction. Running std::sqrt(⋯) will get you this, but with a major performance overhead unless you pass the optimization flags specified above[6]. You can also call intrinsic functions to access the instruction directly. Although this has disadvantages, it will work without the flags (advantageous especially for MSVC).

/*
Note clarity, auto-vectorization, and bitwise-perfect answer for all inputs.
Change `inline` -> `constexpr` in C++26.

To avoid overhead, should specify:
	"-fno-math-errno -fno-trapping-math" (Clang, GCC)
	"/fp:fast" (MSVC)
*/
#include <cmath>
[[nodiscard]] inline float sqrtf_std( float val ) noexcept
{
	return std::sqrt(val);
}
//x86(-64) only, doesn't require special flags
#include <immintrin.h>
[[nodiscard]] inline float sqrtf_sse( float val ) noexcept
{
	return _mm_cvtss_f32( _mm_sqrt_ss(_mm_set_ps1(val)) );
}

There is also the x87 FPU, which does the operation in 80-bit precision. This is overkill for 32-bit, and the use of the x87 FPU is discouraged anyway. Since compilers (finally, quite sensibly) default to using the SSE unit instead, we must access the x87 through inline assembly or extra compiler flags[7].

//x87 only (x86(-64) discouraged hardware)
[[nodiscard]] inline float sqrtf_x87( float val ) noexcept
{
	float result;
	#if   defined __clang__ //Clang
		asm(
			"fsqrt\n"
			: "=t"(result)
			: "t"(val)
			: "cc"
		);
	#elif defined __GNUC__ //GCC, etc.
		asm(
			"fld dword ptr %1\n"
			"fsqrt\n"
			: "=&t"(result)
			: "m"(val)
			: "cc"
		);
	#elif defined _MSC_VER && defined _M_IX86 //32-bit MSVC
		__asm
		{
			fld dword ptr val
			fsqrt
			fstp dword ptr result
		}
	#else
		#error
	#endif
	return result;
}

Approximation

The second category gives an approximate answer using a result that we compute ourselves, possibly aided by hardware.

The hardware provides an inverse-square-root instruction, rsqrtss, which by itself is faster than sqrtss. We can therefore compute \(\sqrt{s} = s\cdot\left(1/\!\sqrt{s}\right)\):

//Note incorrect result for 0 (and subnorms)
#include <immintrin.h>
[[nodiscard]] inline float rsqrtf_sse( float val ) noexcept
{
	return _mm_cvtss_f32( _mm_rsqrt_ss(_mm_set_ps1(val)) );
}
[[nodiscard]] inline float sqrtf_ssersqrt( float val ) noexcept
{
	return val * rsqrtf_sse(val);
}

There is also an ingenious bithack way to get a square-root. This comes from Wikipedia, perhaps as a simplification of a method first derived by Kahan and Ng in the same unpublished work that led to inverse-square-root.

The explanation in both places is lacking, but the method is not notable enough to have gotten the corrections and the article it needs for a proper treatment, either in academia or on Wikipedia. Roughly speaking, the algorithm computes \(\sqrt{s} = \exp_2\!\left(\log_2(s)/2\right)\), where the \(\log_2\) (and so also the \(\exp_2\)) can be computed cheaply from the bit pattern: \(\log_2(s) \approx s_\text{int}/2^{23} - 127\), with \(s_\text{int}\) denoting the bits of \(s\) read as an integer. However, I have written an article with the complete explanation, which has been moved to here.

#include <bit>
#include <cstdint>
[[nodiscard]] constexpr float sqrtf_bithack( float val ) noexcept
{
	uint32_t tmp = std::bit_cast<uint32_t>(val);

	constexpr int TWEAK = 0;
	tmp = ((1<<29)-(1<<22)+TWEAK) + (tmp>>1);

	return std::bit_cast<float>(tmp);
}
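For reference, the magic constant falls out of the approximation directly. Since \(s_\text{int} \approx 2^{23}\left(\log_2(s)+127\right)\), the bits of \(r = \sqrt{s}\) satisfy:

\[ r_\text{int} \approx 2^{23}\left( \tfrac{1}{2}\log_2(s) + 127 \right) = \frac{s_\text{int}}{2} + 127\cdot 2^{22} \]

—and \(127\cdot 2^{22} = 2^{29} - 2^{22}\), which is exactly the ((1<<29)-(1<<22)) in the code (with TWEAK \(=0\)).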

Note that std::bit_cast<⋯>(⋯) reinterprets the bits between floating-point and integer types. The commonplace pointer-reinterpreting approach is dangerously broken in both C and C++, and union hacks are problematic in C++. Before bit-cast, the recommended way was memcpy(⋯), which works surprisingly well, but does break constexpr-ness.
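For pre-C++20 code, a sketch of the memcpy(⋯) equivalents (the helper names are mine):

```cpp
#include <cstdint>
#include <cstring>

//Pre-C++20 replacement for std::bit_cast: well-defined, and compilers
//	optimize the memcpy away entirely.  Not usable in constant expressions.
inline uint32_t bits_of( float x ) noexcept
{
	uint32_t u;
	std::memcpy( &u, &x, sizeof(u) );
	return u;
}
inline float float_of( uint32_t u ) noexcept
{
	float x;
	std::memcpy( &x, &u, sizeof(x) );
	return x;
}
```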


Newton–Raphson Improvements

Since for this second category the answer is approximate, there is also the option to improve the result using Newton–Raphson iteration (NR, N–R), which for square-root is:

\[ x_{k+1} := \frac{1}{2}\left( x_k + \frac{s}{x_k} \right) \]

—where \(x_k\) is the current guess and \(s = x^2\) is the number whose square root \(x\) we're trying to find.
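This is just Newton's method applied to \(f(x) = x^2 - s\):

\[ x_{k+1} := x_k - \frac{f(x_k)}{f'(x_k)} = x_k - \frac{x_k^2 - s}{2x_k} = \frac{1}{2}\left( x_k + \frac{s}{x_k} \right) \]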

The basic case for NR is of course to do zero iterations:

[[nodiscard]] constexpr float nr0( float /*val*/, float xk ) noexcept { return xk; }

For one iteration, we can just do the formula blithely. The coefficient is nominally 0.5, but small perturbations can result in more accuracy, so we pull it out for later tweaking (see below!):

#include <bit>
[[nodiscard]] constexpr float nr1( float val, float xk ) noexcept
{
	constexpr float coeff = std::bit_cast<float>( 1'056'964'608 ); //`0.5f`
	return coeff*( xk + val/xk );
}

Floating-point division is infamously slow, so we can do the same thing but using the SSE rcpss instruction:

#include <bit>
#include <immintrin.h>
[[nodiscard]] inline float rcp_sse( float val ) noexcept
{
	return _mm_cvtss_f32( _mm_rcp_ss( _mm_set_ps1(val) ) );
}
[[nodiscard]] inline float nr1_ssercp( float val, float xk ) noexcept
{
	constexpr float coeff = std::bit_cast<float>( 1'056'964'608 ); //`0.5f`
	return coeff*( xk + val*rcp_sse(xk) );
}

This is a good use-case for a fused multiply-add (FMA) instruction, which is supported on some AVX and all AVX2 processors. To access it, pass "-mfma"[8] (Clang, GCC, etc.) or "/arch:AVX2" (MSVC). You could also use the _mm_fmadd_ss(⋯) intrinsic, which might be useful when you have FMA but not AVX2.

//Pass "-mfma" (Clang, GCC) or "/arch:AVX2" (MSVC)
#include <cmath>
#include <bit>
[[nodiscard]] inline float rcp_sse( float val ) noexcept;
[[nodiscard]] inline float nr1_ssercp_fmaA( float val, float xk ) noexcept
{
	constexpr float coeff = std::bit_cast<float>( 1'056'964'608 ); //`0.5f`
	return coeff*std::fma( val,rcp_sse(xk), xk );
}

The final multiplication by the coefficient has to wait for the FMA to finish. To get better pipelining, then, we can try distributing the coefficient. The idea is that even though there are now two multiplies, they can be executing at the same time as the reciprocal:

//Pass "-mfma" (Clang, GCC) or "/arch:AVX2" (MSVC)
#include <cmath>
#include <bit>
[[nodiscard]] inline float rcp_sse( float val ) noexcept;
[[nodiscard]] inline float nr1_ssercp_fmaB( float val, float xk ) noexcept
{
	constexpr float coeff = std::bit_cast<float>( 1'056'964'608 ); //`0.5f`
	return std::fma( coeff*val,rcp_sse(xk), coeff*xk );
}

Finally, here's a version that does two NR iterations. The two 0.5 factors can be combined into a single factor of 0.25:

[[nodiscard]] constexpr float nr2( float val, float xk ) noexcept
{
	constexpr float coeff = std::bit_cast<float>( 1'048'576'000 ); //`0.25f`
	//xk = 0.5f*( xk + val/xk );
	//xk = 0.5f*( xk + val/xk );
	xk =       xk + val/xk;
	xk = coeff*xk + val/xk;
	return xk;
}
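To see that the folded form is equivalent, write \(y = x_k + s/x_k\) for the un-halved first update (so the true first iterate is \(y/2\)); the second iteration then becomes:

\[ \frac{1}{2}\left( \frac{y}{2} + \frac{s}{y/2} \right) = \frac{1}{4}\,y + \frac{s}{y} \]

—which is exactly what the code computes: the first line produces \(y\), and the second applies coeff \(=0.25\) and divides by \(y\) directly.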

Comparison

Here we compare all square-root functions, with each NR variant as appropriate.

We test every possible valid float in both the error evaluation and profiling. I didn't profile or test negative numbers or NaNs; these are outside the domain of the real-valued branch of the square-root, or invalid, respectively. However, do note that the implementation's behavior for these values may be important to you.


Summary Table

Here is a summary table comparing the variants:

(The third and fourth columns give the average relative error over subnormal and normal inputs, respectively; the last four columns give the result computed for each special input.)

Algorithm | Pipelined Cycles | Avg. Rel. Err. \(\sqrt{\text{(subnorm)}}\) | Avg. Rel. Err. \(\sqrt{\text{(norm)}}\) | \(\sqrt{\text{(neg)}}\) | \(\sqrt{0}\) | \(\sqrt{\infty}\) | \(\sqrt{\text{(NaN)}}\)
sqrtf_std | 5 | 0 | 0 | NaN | 0 | ∞ | NaN
sqrtf_sse | 5 | 0 | 0 | NaN | 0 | ∞ | NaN
sqrtf_x87 | 23 | 0 | 0 | NaN | 0 | ∞ | NaN
sqrtf_ssersqrt + nr0 | 2 | ∞ | 9.359e-05 | NaN | NaN | NaN | NaN
sqrtf_ssersqrt + nr1 | 7 | ∞ | 2.146e-08 | NaN | NaN | NaN | NaN
sqrtf_ssersqrt + nr1_ssercp | 5 | ∞ | 4.787e-05 | NaN | NaN | NaN | NaN
sqrtf_ssersqrt + nr1_ssercp_fmaA | 5 | ∞ | 4.787e-05 | NaN | NaN | NaN | NaN
sqrtf_ssersqrt + nr1_ssercp_fmaB | 4 | ∞ | 4.787e-05 | NaN | NaN | NaN | NaN
sqrtf_ssersqrt + nr2 | 10 | ∞ | 2.108e-08 | NaN | NaN | NaN | NaN
sqrtf_bithack + nr0 | 1 | 0.6447 | 0.01505 | 3.365e+38 | 8.012e-20 | 1.824e+19 | 2.265e+19
sqrtf_bithack + nr1 | 5 | 0.1963 | 0.0001201 | 1.674e+38 | 3.979e-20 | ∞ | NaN
sqrtf_bithack + nr1_ssercp | 4 | 0.1961 | 0.000131 | 1.674e+38 | 3.977e-20 | ∞ | NaN
sqrtf_bithack + nr1_ssercp_fmaA | 4 | 0.1961 | 0.000131 | 1.674e+38 | 3.977e-20 | ∞ | NaN
sqrtf_bithack + nr1_ssercp_fmaB | 3 | 0.1961 | 0.000131 | 1.674e+38 | 3.977e-20 | ∞ | NaN
sqrtf_bithack + nr2 | 8 | 0.05331 | 3.799e-08 | 8.366e+37 | 1.988e-20 | NaN | NaN

I am providing the raw data for the general statistics and latency: statistics.csv, cycles_100iters.csv. These include more statistics than are presented above.


Pipelined Cycles

The performance cost is presented as "Pipelined Cycles", computed using "llvm-mca" 16.0.6, a cycle-accurate hardware simulator. This allows us to get accurate results on many different architectures without physically possessing them or dealing with timing issues[9]. The results above are for the "znver4" (i.e. AMD Zen 4) architecture, but results for several others can be downloaded.

Reporting the end-to-end latency of a single operation would be deeply misleading, since in real code other instructions interleave around it, so a standard practice is to repeat the operation many times and take the average (I used 100 times). Additionally, to reduce the influence of register dependencies, I unrolled the operation 4 times and disabled auto-vectorization to prevent optimization between iterations[10].
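An invocation along these lines reproduces the setup (file names are illustrative; -mcpu selects the simulated architecture and -iterations the repeat count):

```shell
# Compile to assembly, then feed it to llvm-mca simulating Zen 4,
# repeating the measured block 100 times.
clang++ -O2 -fno-math-errno -fno-trapping-math -S sqrt_kernels.cpp -o sqrt_kernels.s
llvm-mca -mcpu=znver4 -iterations=100 sqrt_kernels.s
```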


Relative Error

The relative error is fairly straightforward. The result of std::sqrt(⋯) is bitwise-perfect, and so is taken as the reference. The relative error of a value is the absolute difference of the value from the reference, divided by the reference. All error calculations were done in double-precision.

There are a few edge cases to consider. A correct result of 0 or ∞ would produce a relative error of NaN, so it is instead defined to be 0. A finite result where ∞ is expected is an error of ∞. The relative error of a NaN result is (and is left as) NaN (though this doesn't show up in the table above; see the raw data).
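A sketch of these conventions in code (the function name is mine; the reference is assumed nonnegative, as it is for square roots):

```cpp
#include <cmath>
#include <limits>

//Relative error in double precision, with the edge-case conventions above.
inline double rel_err( double reference, double value )
{
	if ( std::isnan(reference) || std::isnan(value) )
		return std::numeric_limits<double>::quiet_NaN(); //NaN stays NaN
	if ( reference == value ) return 0.0; //covers correct 0 and correct inf
	if ( std::isinf(reference) )
		return std::numeric_limits<double>::infinity(); //finite where inf expected
	return std::abs( value - reference ) / reference;
}
```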

The approximate methods have different parameters used to tweak the result to be more accurate. It is sensible to tweak the result either to minimize the average relative error or the maximum relative error[11]. The results above are for coefficient values that minimize the average relative error[12] (i.e. the middle column below).

(The first TWEAK/coeff pair minimizes the average relative error; the second pair minimizes the maximum relative error.)

Algorithm | TWEAK (min avg) | coeff (min avg) | TWEAK (min max) | coeff (min max)
sqrtf_ssersqrt + nr1 | 0 | 1056964608 | 0 | 1056964608
sqrtf_ssersqrt + nr1_ssercp | 0 | 1056964602 | 0 | 1056964606
sqrtf_ssersqrt + nr1_ssercp_fmaA | 0 | 1056964602 | 0 | 1056964606
sqrtf_ssersqrt + nr1_ssercp_fmaB | 0 | 1056964602 | 0 | 1056964607
sqrtf_ssersqrt + nr2 | 0 | 1048576000 | 0 | 1048576000
sqrtf_bithack + nr0 | -185516 | 0 | -307410 | 0
sqrtf_bithack + nr1 | -266985 | 1056962641 | -328307 | 1056958655
sqrtf_bithack + nr1_ssercp | -273073 | 1056962594 | -336149 | 1056958363
sqrtf_bithack + nr1_ssercp_fmaA | -273065 | 1056962594 | -331659 | 1056958524
sqrtf_bithack + nr1_ssercp_fmaB | -272998 | 1056962597 | -318966 | 1056958965
sqrtf_bithack + nr2 | -278695 | 1048576000 | -295683 | 1048575999

Analysis

The built-in (sqrtf_std(⋯)) function is a good place to start. It just calls the sqrtss instruction (as does sqrtf_sse(⋯), which is identical), producing the exact answer for each value. On the Zen 4 architecture, sqrtss has a latency of 15 cycles, but it can issue a new square-root every 5 cycles[13], meaning that, if we have a lot of square-roots, on average we'll be running three at once and the runtime will be a third as long[14]. If you don't have a lot of square-roots, then it will be other instructions interleaving. Thus, the pipelined cycles is reported as 5.

The sqrtf_x87(⋯) version also gives the exact answer. However, it computes in 80-bit precision, whereas only 32-bit is required, and so it takes commensurately longer. The x87 unit on Zen 4 is not pipelined, so it doesn't get the benefit of interleaving, either—though even with pipelining (e.g. on Skylake), it is still slower, as we'd expect.

The sqrtf_ssersqrt(⋯) variants launch us into approximate algorithms. The only variant here worth considering is the NR nr0(⋯) one (i.e. no NR at all). Every variant incorrectly gives NaN for 0 and ∞ for subnormals, but at least it is faster (5 → 2 cycles on Zen and 3 → 2 on most Intel architectures). Significantly, it also would auto-vectorize just as well. Still, the fact that it can't handle 0 or subnormals makes it hard to recommend. The NR implementations are not worth it, either; they don't improve the error by much, and they cut into the performance benefit.

The star of the show here is sqrtf_bithack(⋯), which can crank out a square root every 1 cycle on average, on every architecture. With nr0(⋯), the largest relative error for normals is 1.5%, which is pretty good. For 0 and subnormals it doesn't produce a great result, but at least it does produce a small value which is unlikely to cause trouble. The other variant worth considering is nr1_ssercp_fmaB(⋯): a definite slowdown, but the error is much less, improving normals by two orders of magnitude.


Conclusion

The main takeaway is that sqrtss is really good. It gives the bitwise-perfect result, and it does it quickly, especially when pipelining is considered. You can access sqrtss through std::sqrt(⋯), which has the important benefits of being clear and vectorizing easily. However, one must be careful to pass the right compiler flags to remove errno and trapping. In cases where that is not possible (e.g. you don't want to pass "/fp:fast" on MSVC), you can use an intrinsics-based version, which is almost as good.

The only other implementation that seems worth considering overall is the bithack one. It is faster, even when you consider 4⨯ vectorization of sqrtss—although it pays for it with some error. The ideal use-case would be when you just need one square-root, you know you have a valid input, and you don't need too much accuracy—especially when you're already doing floating-point-heavy stuff alongside. NR iteration is not worth it.

Here's an implementation demonstrating these conclusions. It should work optimally for Clang, GCC, and MSVC.

/*
On Clang and GCC, you should pass "-fno-math-errno -fno-trapping-math"!  You probably want to do
this anyway.  If you don't, you'll get warning(s), but it will still compile with a fallback.

On MSVC, you may pass "/fp:fast", which will produce a slightly better result.  However, this has
potentially unwanted effects, so you don't have to, in which case a fallback is used.

Note that on GCC and MSVC, the fallback adds overhead and impedes auto-vectorization!
*/

#include <cmath>
#include <cstdint>
#include <bit>

#if defined _M_IX86 || defined _M_AMD64 || defined __i386__ || defined __amd64__
	#include <immintrin.h>
#endif

namespace MyMath
{

#ifdef __GNUC__
	#ifndef __NO_MATH_ERRNO__
		#warning "This implementation should be compiled with '-fno-math-errno'!"
	#endif
	#if !defined __NO_TRAPPING_MATH__ && !defined __clang__
		//Note: unfortunately, clang doesn't appear to define `__NO_TRAPPING_MATH__`; I made a bug
		//	report: https://github.com/llvm/llvm-project/issues/78496
		#warning "This implementation should be compiled with '-fno-trapping-math'!"
	#endif
#endif

//General-purpose square-root, bitwise-perfect, surprisingly fast
#ifdef __cpp_lib_constexpr_cmath
[[nodiscard]] constexpr float sqrt( float val ) noexcept
#else
[[nodiscard]] inline    float sqrt( float val ) noexcept
#endif
{
	#if defined _M_IX86 || defined _M_AMD64 || defined __i386__ || defined __amd64__
		#if !defined __NO_MATH_ERRNO__ && !defined _M_FP_FAST
			//Intrinsic fallback to avoid overhead of `errno`, etc.  Just as good on Clang, but on
			//	GCC and MSVC, prevents auto-vectorization and adds a slight amount of overhead due
			//	to pointless shuffles!
			#ifdef __cpp_if_consteval
			if !consteval
			{
			#endif
				return _mm_cvtss_f32( _mm_sqrt_ss(_mm_set_ps1(val)) );
			#ifdef __cpp_if_consteval
			}
			#endif
		#endif
	#endif

	return std::sqrt(val);
}

//Fast approximate square root, 1.5% rel-err on normals and okay for 0 and subnormals (but don't
//	expect good NaN behavior).  Use when speed matters most, especially when (auto-)vectorization
//	isn't relevant and the floating-point pipeline is already stressed.  Could be hand-vectorized.
[[nodiscard]] constexpr float sqrt_fast( float val ) noexcept
{
	uint32_t tmp = std::bit_cast<uint32_t>(val);

	constexpr int TWEAK = -185'516;
	tmp = ((1<<29)-(1<<22)+TWEAK) + (tmp>>1);

	return std::bit_cast<float>(tmp);
}

}

Notes