Bithack for Square-Roots
Let's talk about an interesting bithack method for computing (ordinary!) square-roots, relying on IEEE 754 floating-point. (One of many methods for computing square-roots, which I have compared.)
In modern C++, the bithack looks like this:
#include <bit>
#include <cstdint>

[[nodiscard]] constexpr float sqrtf_bithack( float val ) noexcept
{
	uint32_t tmp = std::bit_cast<uint32_t>(val);
	constexpr int TWEAK = 0; //see adjustments below
	tmp = ((1<<29)-(1<<22)+TWEAK) + (tmp>>1);
	return std::bit_cast<float>(tmp);
}
Operationally, the code reinterprets the bits of the input float as a 32-bit unsigned integer uint32_t, shifts that integer right by one, adds a specially crafted constant, and then reinterprets the result back into a float. The reinterpretations are done with C++20's std::bit_cast<⋯>(⋯), which finally allows us to reinterpret bits in a fast, standards-compliant, and constant-expression-friendly way.
I first saw this method attested on Wikipedia, but the earliest similar thing I found was in the same unpublished work that led to the very widely known inverse-square-root bithack. This one appears to be a simplification of a more elaborate method for double-precision[1]. Googling for it is nearly impossible, given the many explanations of the inverse-square-root bithack, and, apart from the rather sparse Wikipedia presentation, there don't seem to be any explanations of this one at all. That's a shame, because it's a very interesting little snippet that epitomizes floating-point cleverness.
In this article, we'll go through exactly how it works, from two perspectives, in detail.
Reminder About Floating-Point
Figure 1: Example of the bits in a float (altered from image source).
I'm going to assume you are generally familiar with how floating-point works at a low level. (You should be, if you're a programmer.) If not, Wikipedia's explanation of float is pretty good, and I made a webtool to play around with floats' bits, here[2].
However, I will also give you Figure 1 to refer to, as well as quickly remind you that for a positive, normal (i.e. not subnormal, NaN, or infinite) float, the value stored is:
\[ (-1)^{\{0\}} \times \left(2^{\{e_7 e_6\cdots e_0\}-127}\right) \times \{1.m_{22}m_{21}\cdots m_0\} \]where braces denote construction from bit concatenation. The biased-exponent bits are \(\{e_7 e_6\cdots e_0\}\) and the coefficient[3]/significand bits including the implicit leading \(1\) are \(\{1.m_{22}m_{21}\cdots m_0\}\).
Floats and Logs
The best way to understand the algorithm is as operating on the logarithm. Before we delve into that, though, we need to talk about why we have logarithms available[4]. Roughly speaking, it comes down to the exponent field. Specifically, the exponent is roughly the base-2 logarithm of the stored number, and it is exactly equal for powers-of-2.
For example, suppose we have a float storing the value \(1234\). The exponent is \(10\) (though \(10+127=137\) is actually stored). And indeed, \(\log_2 1234 \approx 10.27\), which is pretty close. When you store a power-of-2, it is exact. Take \(4096\) (i.e., \(2^{12}\)). The exponent is, unsurprisingly, \(12\) (though \(12+127=139\) is actually stored) and \(\log_2 4096 = 12\), exactly.

Figure 2: Sticking a radix point between the (unbiased) exponent and the mantissa bits makes a surprisingly good approximation to \(\log_2(x)\).
More than this, between powers-of-2, the stored value increases linearly. As the significand ticks up from all-zeros to all-ones, we essentially travel from one power-of-2 to the next. The magnitude of the smallest bit isn't changing; it only changes when the exponent that scales it changes, which is only at powers-of-2.
Given a positive, normal float storing value \(x\), whose bits have been reinterpreted as an integer \(x_\text{int}\), we can therefore write the logarithm mathematically as:
\begin{equation} \log_2(x) \approx \frac{x_\text{int}}{2^{23}} - 127 \end{equation}
The division by \(2^{23}\) basically just puts a radix point between the biased-exponent and significand fields (you can think of it as bitshifting right until the exponent is at the \(1\)s place). Then, subtracting \(127\) converts the exponent into the true, unbiased form. The result is a fractional exponent, a logarithm. (See Figure 2.)
Mathematical Approach
Finally, we are ready.
From a logarithmic perspective, the gist of the algorithm is just to divide the logarithm by \(2\), which has the effect of taking the square-root. Since, as above, floating-point numbers are already close to being logarithmic representations, it's not surprising this is operationally simple. Mathematically:
\begin{equation} \sqrt{x} = \exp_2\!\left(\log_2\sqrt{x}\right) = \exp_2\!\left(\frac{1}{2}\log_2(x)\right) \end{equation}From the previous section, we can write this in terms of the bits reinterpreted as an integer \(x_\text{int}\). We substitute equation 1 into equation 2 once to replace the logarithm, and (inversely) again to replace the exponentiation:
\begin{align*} \left(\sqrt{x}\right)_\text{int} &\approx \left( \frac{1}{2} \left( \frac{x_\text{int}}{2^{23}} - 127 \right) + 127 \right) 2^{23}\\ &= \left( \frac{x_\text{int}}{2} + \frac{127 \cdot 2^{23}}{2} \right) ~=~ \left( \frac{128 \cdot 2^{23}}{2} - \frac{2^{23}}{2} + \frac{x_\text{int}}{2} \right)\\ &= 2^{29} - 2^{22} + \left(x_\text{int}/2\right) \end{align*}And this is indeed the line of code that we have:
tmp = ((1<<29)-(1<<22)+TWEAK) + (tmp>>1);
The only omission is the TWEAK constant. Intuitively, this is just zero, but it can empirically be made slightly nonzero[5] to give the algorithm different properties. Here are some interesting values:
| TWEAK Value | Note |
|---|---|
| 0 | Gives exact results for even powers-of-2; directly matches the analysis above. |
| -185516 | Minimizes average relative error. |
| -307410 | Minimizes maximum relative error. |
Low-Level Approach
It is also possible to analyze the algorithm from a low-level perspective, combining mathematics and bit operations[6]. This is more tedious, but offers an alternate way to see the algorithm.
First, we have to separate out the one code line:
tmp = ((1<<29)-(1<<22)+TWEAK) + (tmp>>1);
into four:
//Performance hit, but easier to understand
tmp -= 1 << 23; //Note 23 instead of 22
tmp >>= 1;
tmp += 1 << 29;
tmp += TWEAK;
The change from 22 to 23 deserves to be called out specifically. This isn't a typo: the line immediately below it shifts everything right by one, halving the subtracted constant. In the original code, where the constants are added after the shift, that doesn't happen.
Bit 23 is the lowest bit of the exponent field, which for normal floats is at least 1. Thus, for normal numbers, the first subtraction doesn't cause a wraparound. However, a compiler cannot assume that[7], and so this compiles to a worse result. This expanded form should therefore only be used for analyzing the algorithm in this way.
Given the floating-point representation of a positive, normal float (see above), we can just whack a square-root symbol onto it and see what pops out:
\[ \sqrt{x} = \sqrt{ \left(2^{\{e_7 e_6\cdots e_0\}-127}\right) \times \{1.m_{22}m_{21}\cdots m_0\} } = \left(2^{\left(\{e_7 e_6\cdots e_0\}-127\right)/2}\right) \sqrt{ \{1.m_{22}m_{21}\cdots m_0\} } \]
From here, we rewrite the exponent a bit and break it down by cases. In the latter case, we also factor a \(\sqrt{2}\) from the first term into the second[8].
\begin{align*} \sqrt{x} &= \left(2^{(\{e_7 e_6\cdots e_0\}+127)/2-127}\right) \sqrt{ \{1.m_{22}m_{21}\cdots m_0\} }\\ &= \begin{cases} \left(2^{(\{e_7 e_6\cdots e_1 1\}+127)/2-127}\right) \sqrt{ \hspace{1.2em} \{1.m_{22}m_{21}\cdots m_0\} } & \text{ if } \{e_0\}=\{1\}\\ \left(2^{(\{e_7 e_6\cdots e_1 0\}+126)/2-127}\right) \sqrt{ 2 \cdot \{1.m_{22}m_{21}\cdots m_0\} } & \text{ if } \{e_0\}=\{0\} \end{cases} \end{align*}We now analyze each case separately.
Lowest-Order Exponent Bit is 1
In the (easier) case where the lowest-order exponent bit \(e_0 = 1\), each line of code in the main part of the algorithm works as follows:
| Code | Stored Value |
|---|---|
| (input) | \( \left(2^{ \{e_7 e_6 \cdots e_1 1 \} - 127 }\right) \times \{1.m_{22}m_{21} \cdots m_0\} \) |
| tmp -= 1 << 23; | \( \left(2^{ \{e_7 e_6 \cdots e_1 0 \} - 127 }\right) \times \{1.m_{22}m_{21} \cdots m_0\} \) |
| tmp >>= 1; | \( \left(2^{ \{e_7 e_6 \cdots e_1 0 \}/2 - 127 }\right) \times \{1.0 m_{22}m_{21} \cdots m_1\} \) |
| tmp += 1 << 29; | \( \left(2^{ \{e_7 e_6 \cdots e_1 0 \}/2 + 64 - 127 }\right) \times \{1.0 m_{22}m_{21} \cdots m_1\} \) |
For the exponent field, we only have to rearrange it a bit, and then we can see it is exactly the same as the desired exponent above:
\begin{align*} \{e_7 e_6 \cdots e_1 0 \}/2 + 64 - 127 &= (\{e_7 e_6 \cdots e_1 0 \}+128)/2 - 127\\ &= (\{e_7 e_6 \cdots e_1 1 \}+127)/2 - 127 \end{align*}We also examine the significand:
\begin{align*} \{1.0 m_{22}m_{21}\cdots m_1\} &= 1 + (1/2)(\{1.m_{22}m_{21}\cdots m_1 0 \} - 1)\\[5pt] &\approx 1 + (1/2)(\{1.m_{22}m_{21}\cdots m_1 m_0\} - 1)\\[5pt] &\approx \sqrt{ \{1.m_{22}m_{21}\cdots m_0\} } \end{align*}And this is the desired significand! Note that in the last step we used the Taylor expansion of \(\sqrt{x}\) at \(x=1\):
\[ \sqrt{x} = \sum_{k=0}^{\infty} (x-1)^k \binom{1/2}{k} = 1 + \frac{1}{2}(x-1) - O\left( (x-1)^2 \right) \]That is, the computed significand is (neglecting the last bit, which gets shifted into oblivion) a first-order Taylor approximation to the desired significand. This checks with our result above that the square-root can be computed via the logarithm, which the float format approximates linearly between powers-of-2.
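A quick numerical sanity check of that first-order approximation (a sketch; the helper name is mine):

```cpp
//First-order Taylor approximation of sqrt at x = 1; good for x near 1,
//which is exactly the range [1,2) that a normalized significand covers.
constexpr double taylor_sqrt( double x ) noexcept
{
	return 1.0 + 0.5*(x - 1.0);
}
```

For instance, taylor_sqrt(1.1) is 1.05, versus \(\sqrt{1.1} \approx 1.0488\); the error grows quadratically as x moves away from 1.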
Lowest-Order Exponent Bit is 0
The case where the lowest-order exponent bit \(e_0 = 0\) is similar. The main difference is that the subtraction changes \(e_0\) into a \(1\) instead. The shift then eats it, so we have to be careful to account for the loss of \(1/2\). The \(1\) is shifted into the significand.
| Code | Stored Value |
|---|---|
| (input) | \( \left(2^{ \{e_7 e_6 \cdots e_1 0 \} - 127 }\right) \times \{1.m_{22}m_{21} \cdots m_0\} \) |
| tmp -= 1 << 23; | \( \left(2^{ \{e_7 e_6 \cdots e_1 0 \} - 1 - 127 }\right) \times \{1.m_{22}m_{21} \cdots m_0\} \) |
| tmp >>= 1; | \( \left(2^{ (\{e_7 e_6 \cdots e_1 0 \}-1)/2 - (1/2) - 127 }\right) \times \{1.1 m_{22}m_{21} \cdots m_1\} \) |
| tmp += 1 << 29; | \( \left(2^{ (\{e_7 e_6 \cdots e_1 0 \}-1)/2 + 63.5 - 127 }\right) \times \{1.1 m_{22}m_{21} \cdots m_1\} \) |
We simplify the exponent field, again getting exactly the desired exponent above:
\begin{align*} (\{e_7 e_6 \cdots e_1 0 \}-1)/2 + 63.5 - 127 &= (\{e_7 e_6 \cdots e_1 0 \}+126)/2 - 127 \end{align*}We again examine the significand:
\begin{align*} \{1.1 m_{22}m_{21}\cdots m_1\} &= 1.5 + (1/2)(\{1.m_{22}m_{21}\cdots m_1 0\} - 1)\\[5pt] &\approx \sqrt{2} \bigl(~ 1 + (1/2)(\{1.m_{22}m_{21}\cdots m_1 m_0\} - 1) ~\bigr)\\[5pt] &\approx \sqrt{ 2 \cdot \{1.m_{22}m_{21}\cdots m_1 m_0\} } \end{align*}Which is again what was desired. The Taylor series (for \(\sqrt{2x}\)) doesn't work out as nicely here, but it should be apparent that the significand is at least sort of right.
