~/imallett (Ian Mallett)

# Fast Square Roots

A lot has been written about computing inverse square roots $1/\sqrt{x}$, particularly a well-known bithack. However, much less has been written about the ordinary square root $\sqrt{x}$.

Fortunately, I've done the work for you. Here are most of the variants I could find, and the reasonable versions I came up with, along with timing and accuracy statistics, broken down by the type of float argument passed in:

Variant Code Average
Latency
Error Statistics Square Root Values
$\sqrt{0}$ $\sqrt{\text{(denormal)}}$ $\sqrt{\text{(normal)}}$ $\sqrt{\infty}$ $\sqrt{0}$ $\sqrt{\text{(denormal)}}$ $\sqrt{\text{(normal)}}$ $\sqrt{\infty}$
Standard Library
#include <cmath>
inline float sqrt_std(float arg) {
return std::sqrt(arg);
}
$3.034777$ ns

$0$
(N/A %)

Minimum:
$0$
($0$ %)

Average:
$0$
($0$ %)

Maximum:
$0$
($0$ %)

Minimum:
$0$
($0$ %)

Average:
$0$
($0$ %)

Maximum:
$0$
($0$ %)

$0$
(N/A %)

$0$

Minimum:
$3.74339\cdot 10^{-23}$

Maximum:
$1.0842\cdot 10^{-19}$

Minimum:
$1.0842\cdot 10^{-19}$

Maximum:
$1.84467\cdot 10^{19}$

$\infty$

x87 FPU
inline float sqrt_x87(float arg) {
#ifdef _MSC_VER //and compile in x86-mode
__asm {
fld arg
fsqrt
}
#else
float result;
__asm__(
"flds %1\n"
"fsqrt\n"
: "=&t"(result) : "m"(arg)
);
return result;
#endif
}
$2.077535$ ns

$0$
(N/A %)

Minimum:
$0$
($0$ %)

Average:
$0$
($0$ %)

Maximum:
$0$
($0$ %)

Minimum:
$0$
($0$ %)

Average:
$0$
($0$ %)

Maximum:
$0$
($0$ %)

$0$
(N/A %)

$0$

Minimum:
$3.74339\cdot 10^{-23}$

Maximum:
$1.0842\cdot 10^{-19}$

Minimum:
$1.0842\cdot 10^{-19}$

Maximum:
$1.84467\cdot 10^{19}$

$\infty$

SSE Intrinsics
#include <xmmintrin.h>
inline float sqrt_intrin(float arg) {
return _mm_cvtss_f32(
_mm_sqrt_ss(_mm_set_ps1(arg))
);
}
$0.555357$ ns

$0$
(N/A %)

Minimum:
$0$
($0$ %)

Average:
$0$
($0$ %)

Maximum:
$0$
($0$ %)

Minimum:
$0$
($0$ %)

Average:
$0$
($0$ %)

Maximum:
$0$
($0$ %)

$0$
(N/A %)

$0$

Minimum:
$3.74339\cdot 10^{-23}$

Maximum:
$1.0842\cdot 10^{-19}$

Minimum:
$1.0842\cdot 10^{-19}$

Maximum:
$1.84467\cdot 10^{19}$

$\infty$

Reciprocal Multiply
#include <xmmintrin.h>
inline float rsqrt_sse(float x) {
return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ps1(x)));
}
inline float sqrt_rcpmul(float arg) {
return arg * rsqrt_sse(arg);
}
$0.983849$ ns

$\infty$
($\infty$ %)

Minimum:
$\infty$
($\infty$ %)

Average:
$\infty$
($\infty$ %)

Maximum:
$\infty$
($\infty$ %)

Minimum:
$0$
($0$ %)

Average:
$1.98488\cdot 10^{13}$
($0.00935919$ %)

Maximum:
$4.5025\cdot 10^{15}$
($0.0326134$ %)

$\infty$
($\infty$ %)

NaN

Minimum:
$\infty$

Maximum:
$\infty$

Minimum:
$1.08394\cdot 10^{-19}$

Maximum:
$1.84512\cdot 10^{19}$

NaN

Reciprocal Multiply
(With Newton Iteration)
#include <xmmintrin.h>
inline float rsqrt_sse(float x) {
return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ps1(x)));
}
inline float sqrt_rcpmul_nr1(float arg) {
float x0 = arg * rsqrt_sse(arg);
float x1 = 0.5f*(x0 + arg/x0);
return x1;
}
$1.139981$ ns

$\infty$
($\infty$ %)

Minimum:
$\infty$
($\infty$ %)

Average:
$\infty$
($\infty$ %)

Maximum:
$\infty$
($\infty$ %)

Minimum:
$0$
($0$ %)

Average:
$4.41993\cdot 10^{9}$
($2.1459\cdot 10^{-6}$ %)

Maximum:
$1.09951\cdot 10^{12}$
($1.19209\cdot 10^{-5}$ %)

$\infty$
($\infty$ %)

NaN

Minimum:
$\infty$

Maximum:
$\infty$

Minimum:
$1.0842\cdot 10^{-19}$

Maximum:
$1.84467\cdot 10^{19}$

NaN

Reciprocal Multiply
(With Newton Iteration
via Reciprocal)
#include <xmmintrin.h>
inline float rsqrt_sse(float x) {
return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ps1(x)));
}
inline float rcp_sse(float x) {
return _mm_cvtss_f32(_mm_rcp_ss(_mm_set_ps1(x)));
}
inline float sqrt_rcpmul_nr2(float arg) {
float x0 = arg * rsqrt_sse(arg);
float x1 = 0.5f*(x0 + arg*rcp_sse(x0));
return x1;
}
$1.098898$ ns

$\infty$
($\infty$ %)

Minimum:
$\infty$
($\infty$ %)

Average:
$\infty$
($\infty$ %)

Maximum:
$\infty$
($\infty$ %)

Minimum:
$0$
($0$ %)

Average:
$9.98957\cdot 10^{12}$
($0.00478692$ %)

Maximum:
$2.2507\cdot 10^{15}$
($0.0150172$ %)

$\infty$
($\infty$ %)

NaN

Minimum:
$\infty$

Maximum:
$\infty$

Minimum:
$1.0842\cdot 10^{-19}$

Maximum:
$1.84467\cdot 10^{19}$

NaN

Wikipedia Bithack
#include <cstdint>
#include <cstring>
inline float sqrt_bithack(float arg) {
uint32_t tmp; memcpy(&tmp,&arg,4);
tmp = ((1<<29)-(1<<22)) + (tmp>>1);
float result; memcpy(&result,&tmp,4);
return result;
}
$0.441818$ ns

$8.13152\cdot 10^{-20}$
($\infty$ %)

Minimum:
$0$
($0$ %)

Average:
$2.25875\cdot 10^{-20}$
($66.6289$ %)

Maximum:
$8.12777\cdot 10^{-20}$
($217123$ %)

Minimum:
$0$
($0$ %)

Average:
$4.15349\cdot 10^{15}$
($2.022$ %)

Maximum:
$7.9124\cdot 10^{17}$
($6.06602$ %)

$\infty$
($\infty$ %)

$8.13152\cdot 10^{-20}$

Minimum:
$8.13152\cdot 10^{-20}$

Maximum:
$1.0842\cdot 10^{-19}$

Minimum:
$1.0842\cdot 10^{-19}$

Maximum:
$1.84467\cdot 10^{19}$

$1.84467\cdot 10^{19}$

Wikipedia Shifted
Bithack
(Avg. Optimized)
#include <cstdint>
#include <cstring>
inline float sqrt_bithack_sh(float arg) {
uint32_t tmp; memcpy(&tmp,&arg,4);
tmp = ((1<<29)-(1<<22)-0x0002D4ACu) + (tmp>>1);
float result; memcpy(&result,&tmp,4);
return result;
}
$0.387357$ ns

$8.01163\cdot 10^{-20}$
($\infty$ %)

Minimum:
$0$
($0$ %)

Average:
$2.14409\cdot 10^{-20}$
($64.4664$ %)

Maximum:
$8.00789\cdot 10^{-20}$
($213921$ %)

Minimum:
$0$
($0$ %)

Average:
$3.09811\cdot 10^{15}$
($1.50473$ %)

Maximum:
$5.87263\cdot 10^{17}$
($4.50224$ %)

$\infty$
($\infty$ %)

$8.01163\cdot 10^{-20}$

Minimum:
$8.01163\cdot 10^{-20}$

Maximum:
$1.07221\cdot 10^{-19}$

Minimum:
$1.07221\cdot 10^{-19}$

Maximum:
$1.82428\cdot 10^{19}$

$1.82428\cdot 10^{19}$

Wikipedia Shifted
Bithack With
Newton Iteration
(Avg. Optimized)
#include <cstdint>
#include <cstring>
inline float sqrt_bithack_shnr1(float arg) {
uint32_t tmp; memcpy(&tmp,&arg,4);
tmp = ((1<<29)-(1<<22)-0x0002D4ACu) + (tmp>>1);
float x0; memcpy(&x0,&tmp,4);
float x1 = 0.5f*(x0 + arg/x0);
return x1;
}
$1.640135$ ns

$4.00581\cdot 10^{-20}$
($\infty$ %)

Minimum:
$0$
($0$ %)

Average:
$4.61726\cdot 10^{-21}$
($19.9176$ %)

Maximum:
$4.00207\cdot 10^{-20}$
($106910$ %)

Minimum:
$0$
($0$ %)

Average:
$3.45979\cdot 10^{13}$
($0.0171038$ %)

Maximum:
$1.2651\cdot 10^{16}$
($0.0969886$ %)

$0$
($0$ %)

$4.00581\cdot 10^{-20}$

Minimum:
$4.00582\cdot 10^{-20}$

Maximum:
$1.08427\cdot 10^{-19}$

Minimum:
$1.08427\cdot 10^{-19}$

Maximum:
$1.84479\cdot 10^{19}$

$\infty$

Wikipedia Shifted
Bithack With
Newton Iteration
via Reciprocal
(Avg. Optimized)
#include <cstdint>
#include <cstring>
#include <xmmintrin.h>
inline float rcp_sse(float x) {
return _mm_cvtss_f32(_mm_rcp_ss(_mm_set_ps1(x)));
}
inline float sqrt_bithack_shnr2(float arg) {
uint32_t tmp; memcpy(&tmp,&arg,4);
tmp = ((1<<29)-(1<<22)-0x0002D4ACu) + (tmp>>1);
float x0; memcpy(&x0,&tmp,4);
float x1 = 0.5f*(x0 + arg*rcp_sse(x0));
return x1;
}
$1.446041$ ns

$4.00581\cdot 10^{-20}$
($\infty$ %)

Minimum:
$0$
($0$ %)

Average:
$4.61744\cdot 10^{-21}$
($19.9177$ %)

Maximum:
$4.00207\cdot 10^{-20}$
($106910$ %)

Minimum:
$0$
($0$ %)

Average:
$3.74158\cdot 10^{13}$
($0.018427$ %)

Maximum:
$1.40155\cdot 10^{16}$
($0.107381$ %)

$0$
($0$ %)

$4.00581\cdot 10^{-20}$

Minimum:
$4.00582\cdot 10^{-20}$

Maximum:
$1.0843\cdot 10^{-19}$

Minimum:
$1.0843\cdot 10^{-19}$

Maximum:
$1.84483\cdot 10^{19}$

$\infty$

## Error Evaluation

The error evaluation is fairly straightforward. The standard square root is considered to be the "ground truth". For relative error, I adopt the convention that infinite or NaN absolute error means infinite relative error (even if the division would produce an indeterminate form). Similarly, zero absolute error means zero relative error.

## Performance Evaluation

The performance evaluation was easier to code but requires more attention to detail.

First, the square root functions were not actually marked inline. I in-fact marked them as explicitly no-inline. I subtracted-off the measured cost of an identity function from all the timing results to get a semblance of the actual work done by the square root. I used the C++ standard high-performance timers (i.e. std::chrono::high_resolution_clock::now()) to get the actual timing numbers. The test system was an Intel i7-5930K (3.5 GHz), and the code was compiled with MSVC 15.8.1 in Release x86 mode.

Pipelining seems to make a big difference; in some cases, the latency is significantly below the rated costs of the actual instructions. As with any profiling of low-level code, don't take the performance numbers too literally, nor blithely compare them percentage-wise; instead view them as defining a rough ordering of relative efficiency. The standard deviation is probably on the order of at least 50 ps.

## General Notes

The error statistics collection and the profiling were done with separate code (we don't want to profile the cost of computing the statistics).

Every non-NaN, non-negative float was tested in both the error evaluation and profiling, simply ordered by increasing value. I didn't profile or test negative numbers or NaNs, because these are outside the domain of the real-valued branch of the square root, or invalid, respectively—and therefore nonsensical.

A lot of the overhead from std::sqrt(...) comes from checking for these cases, but the programmer will usually be guaranteeing such inputs will not be generated anyway (and even if they were, it would typically indicate a failure condition anyway). Therefore, considering them seems pointless. In my opinion, checks for them should be added only if there's a reason to add them (and adding them regardless is a failure on the part of the standard library in my opinion). On a related note: I haven't been able to find documentation for what the square root intrinsic does when passed a bogus value. This is sortof a moot point, because again you should not be doing that.

The memcpy(...) thing in the bithack variants is the recommended way to work around the strict-aliasing problem of floats vs. uint32_ts. In-particular, the usual pointer-reinterpreting is dangerously broken in both C and C++, and the union hacks are problematic in C++.

Hat tip to (the somewhat dated) Some Assembly Required for some good ideas.

## What's Up With the Bithack Variant?

This comes from Wikipedia, and I haven't been able to find it attested anywhere else (I tried searching the revision history, but gave up after a while) (though Googling for it is nearly impossible).

Unfortunately, the given explanation, while detailed, doesn't really explain what's going on, and is confusing besides. Instead, I worked backward from the algorithm to figure out what it does. It took a while—in no small part because understanding it requires extreme attention to detail, as the operations are highly nontrivial.

First, let's break down the key operation in the middle (note one of the shifts changed from 22 to 23; this is because it happens first):

tmp -= 1 << 23;
tmp >>= 1;
tmp += 1 << 29;
tmp -= 0x0002D4ACu;

To understand this, as you might expect, you need to understand the bit encoding of float (picture courtesy Wikipedia for reference, but you should read their article if you're not familiar with the IEEE 754 float encoding, since I won't explain it here):

For nonnegative, normalized, noninfinite, non-NaN floats, the signbit is zero, the exponent byte is between 1 and 254 inclusive (and represents a value 127 less than is stored), and the mantissa has an implicit leading one:

$(-1)^{\{0\}} \times \left(2^{\{e_7 e_6\cdots e_0\}-127}\right) \times \{1.m_{22}m_{21}\cdots m_0\}$

The correct answer for the square root is obviously just the square root of this (the last step is done because fractional exponents cannot be represented):

\begin{align} & \sqrt{ (-1)^{\{0\}} \times \left(2^{\{e_7 e_6\cdots e_0\}-127}\right) \times \{1.m_{22}m_{21}\cdots m_0\} }\\ &= \sqrt{ \left(2^{\{e_7 e_6\cdots e_0\}-127}\right) } \sqrt{ \{1.m_{22}m_{21}\cdots m_0\} }\\ &= \left(2^{(\{e_7 e_6\cdots e_0\}-127)/2 }\right) \sqrt{ \{1.m_{22}m_{21}\cdots m_0\} }\\ &= \left(2^{(\{e_7 e_6\cdots e_0\}+127)/2-127}\right) \sqrt{ \{1.m_{22}m_{21}\cdots m_0\} }\\ &= \begin{cases} \left(2^{(\{e_7 e_6\cdots e_0\}+127)/2-127}\right) \sqrt{ ~~\{1.m_{22}m_{21}\cdots m_0\} } & \text{ if } \{e_0\}=\{1\}\\ \left(2^{(\{e_7 e_6\cdots e_0\}+126)/2-127}\right) \sqrt{ 2 \{1.m_{22}m_{21}\cdots m_0\} } & \text{ if } \{e_0\}=\{0\} \end{cases} \end{align}

Let's consider the operations done by the code in-order, and see how they correspond to computing an approximation of this ideal.

Starting from our original expression:

$(-1)^{\{0\}} \times \left(2^{\{e_7 e_6\cdots e_0\}-127}\right) \times \{1.m_{22}m_{21}\cdots m_0\}$

We subtract 1<<23 (this subtracts off the exponent field; notice that since the exponent field is at least one, this cannot do any nasty underflows or anything):

$\left(2^{(\{e_7 e_6\cdots e_0\}-1)-127}\right) \times \{1.m_{22}m_{21}\cdots m_0\}$

Then we shift right (remember that for unsigned shifts this is roughly equivalent to dividing by two, and that since the signbit is zero, a zero is shifted into the exponent field) and add 1<<29 (i.e., 64 into the exponent). To apprehend what these two operations do in-combination with the previous, we have to break it down case-by-case.

### Low-Order Exponent Bit is One

In the (easier) case where the lower-order bit was one, the above subtraction turned it into a zero (and didn't change the other exponent bits). The zero is shifted into the mantissa and the original upper seven bits are left to form the exponent. Finally, we add in 64 to the exponent. The sequence of steps looks like this:

\begin{alignat*}{5} & \left(2^{ \{e_7 e_6 \cdots e_1 1 \} -127}\right) && \times \{1.m_{22}m_{21} \cdots m_1 m_0\}~~~~~~~~&&\text{ if } \{e_0\}=\{1\}\\ & \left(2^{(\{e_7 e_6 \cdots e_1 1 \}-1)-127}\right) && \times \{1.m_{22}m_{21} \cdots m_1 m_0\}~~~~~~~~&&\text{ (subtract)}\\ & \left(2^{ \{e_7 e_6 \cdots e_1 0 \} -127}\right) && \times \{1.m_{22}m_{21} \cdots m_1 m_0\}\\ & \left(2^{ \{0 e_7 e_6\cdots e_2 e_1\} -127}\right) && \times \{1.0 m_{22}m_{21}\cdots m_2 m_1\}~~~~~~~~&&\text{ (shift)}\\ & \left(2^{ \{0 e_7 e_6\cdots e_2 e_1\}+64-127}\right) && \times \{1.0 m_{22}m_{21}\cdots m_2 m_1\}~~~~~~~~&&\text{ (add)} \end{alignat*}

When we rearrange this to represent doing the whole thing in one go, we find that the exponent exactly matches the desired one from above!

$\left(2^{(\{e_7 e_6\cdots e_0\}+127)/2 - 127}\right) \times \{1.0 m_{22}m_{21}\cdots m_2 m_1\}~~~~~~~~\text{ if } \{e_0\}=\{1\}$

We now examine the mantissa:

\begin{align} \{1.0 m_{22}m_{21}\cdots m_2 m_1\} &= 1 + (1/2)(\{1.m_{22}m_{21}\cdots m_1 0 \} - 1)\5pt] & \approx 1 + (1/2)(\{1.m_{22}m_{21}\cdots m_1 m_0\} - 1)\\[5pt] & \approx \sqrt{ \{1.m_{22}m_{21}\cdots m_1 m_0\} } \end{align} And this is the desired mantissa! Note that in the last step we used the Taylor expansion of $\sqrt{x}$ at $x=1$:\[ \sqrt{x} = \sum_{k=0}^{\infty} (x-1)^k \binom{1/2}{k} = 1 + \frac{1}{2}(x-1) - O\left( (x-1)^2 \right)

That is, the computed mantissa is (and neglecting the last bit which gets shifted to oblivion) a first-order Taylor approximation to the desired mantissa.

### Low-Order Exponent Bit is Zero

In the case where the lower-order bit was zero, the above subtraction turned it into a one. Viewed as a division, the shift rounds down (i.e., it subtracts an additional one (so, minus two)) from the exponent before it is divided by two. Whatever the result of the subtraction was for the upper seven bits remain to form the exponent. The one is shifted into the mantissa, and then as-before we add 64 to the exponent.

\begin{alignat*}{5} & \left(2^{ \{e_7 e_6 \cdots e_1 0 \} - 127}\right) && \times \{1.m_{22}m_{21} \cdots m_1 m_0\}~~~~~~~~&&\text{ if } \{e_0\}=\{0\}\\ & \left(2^{(\{e_7 e_6 \cdots e_1 0 \}-1) - 127}\right) && \times \{1.m_{22}m_{21} \cdots m_1 m_0\}~~~~~~~~&&\text{ (subtract)}\\ & \left(2^{ \{e_7' e_6' \cdots e_1' 1 \} - 127}\right) && \times \{1.m_{22}m_{21} \cdots m_1 m_0\}\\ & \left(2^{ \{0 e_7' e_6'\cdots e_2' e_1'\} - 127}\right) && \times \{1.1 m_{22}m_{21}\cdots m_2 m_1\}~~~~~~~~&&\text{ (shift)}\\ = & \left(2^{(\{e_7 e_6 \cdots e_1 0 \}-2)/2 - 127}\right) && \times \{1.1 m_{22}m_{21}\cdots m_2 m_1\}~~~~~~~~&&\text{ (same, in terms of original bits)}\\ & \left(2^{(\{e_7 e_6 \cdots e_2 0 \}-2)/2 + 64 - 127}\right) && \times \{1.1 m_{22}m_{21}\cdots m_2 m_1\}~~~~~~~~&&\text{ (add)} \end{alignat*}

Again, we write the whole thing in one go and find that the exponent matches the desired result from above:

$\left(2^{(\{e_7 e_6\cdots e_1 e_0\} + 126)/2 - 127}\right) \times \{1.1 m_{22}m_{21}\cdots m_2 m_1\}~~~~~~~~\text{ if } \{e_0\}=\{0\}\\$

We again examine the mantissa:

\begin{align} \{1.1 m_{22}m_{21}\cdots m_2 m_1\} &= 1.5 + (1/2)(\{1.m_{22}m_{21}\cdots m_1 0 \} - 1)\\[5pt] & \approx 1.5 + (1/2)(\{1.m_{22}m_{21}\cdots m_1 m_0\} - 1)\\[5pt] & \approx (1.5811\cdots) + (0.63245\cdots)\left(\{1.m_{22}m_{21}\cdots m_1 m_0\} - (1.25)\right)\\[5pt] & \approx \sqrt{\frac{5}{2}} + \sqrt{\frac{2}{5}}(\{1.m_{22}m_{21}\cdots m_1 m_0\} - \frac{5}{4})\\[5pt] & \approx \sqrt{ 2 \{1.m_{22}m_{21}\cdots m_1 m_0\} } \end{align}

The Taylor series (for $\sqrt{2 x}$ at $x=5/4$) doesn't work out as-nicely here, but it should be apparent that the mantissa is at least sortof right.

### Tidying Up

The last thing that's done is (optionally) add a constant to our value. This simply reduces the average error somewhat by shifting the approximation down slightly.

The original number on Wikipedia, 0x0004B0D2u, is supposed to minimize the maximum relative error (and I have confirmed that it does: the maximum relative error for the normals is 3.47475% (and average is 1.65573%)). It was probably chosen empirically.

However, I think that minimizing the average relative error is more-desirable. The number I use, 0x0002D4ACu, was found by my code to do this. It increases the maximum relative error (to 4.50224%), but reduces the average error (to 1.50473%).

## Conclusion

Although I've said a lot, this really boils down to three things:

1. std::sqrt(...) is slow because it checks for floats that the programmer probably won't be passing anyway.
2. On SSE-enabled architectures, the built-in hardware instruction is fast and accurate. You should use it as your general go-to square-root function.
3. You can get slightly faster square roots with the (corrected) bithack function. However, it's not very much faster (and in-fact may not pipeline or parallelize as-well) and has terrible error for denormalized values (and infinity, although this is typically less-important). Use this, but cautiously and in-conjunction with profiling. Don't bother with Newton iteration on it, because although it will improve, you'll lose all the speed advantage, even if you use reciprocals.