

# Fast Square Roots

A lot has been written about computing inverse square roots \(1/\sqrt{x}\), particularly a well-known bithack. However, much less has been written about the ordinary square root \(\sqrt{x}\). Fortunately, I've done the work for you. Here are most of the variants I could find, plus the reasonable versions I came up with, along with timing and accuracy statistics, broken down by the type of `float` argument passed in:
## Error Evaluation

The error evaluation is fairly straightforward. The standard square root is considered to be the "ground truth". For relative error, I adopt the convention that infinite or NaN absolute error means infinite relative error (even if the division would produce an indeterminate form). Similarly, zero absolute error means zero relative error.

## Performance Evaluation

The performance evaluation was easier to code but requires more attention to detail. First, the square root functions were not actually marked `inline`; in fact, I marked them explicitly noinline. I subtracted off the measured cost of an identity function from all the timing results to get a semblance of the actual work done by the square root. I used the C++ standard high-performance timers (i.e. `std::chrono::high_resolution_clock::now()`) to get the actual timing numbers. The test system was an Intel i7-5930K (3.5 GHz), and the code was compiled with MSVC 15.8.1 in Release x86 mode.

Pipelining seems to make a big difference; in some cases, the latency is significantly below the rated costs of the actual instructions. As with any profiling of low-level code, don't take the performance numbers too literally, and don't blithely compare them percentage-wise; instead, view them as defining a rough ordering of relative efficiency. The standard deviation is probably on the order of at least 50 ps.

## General Notes

The error statistics collection and the profiling were done with separate code (we don't want to profile the cost of computing the statistics). Every non-NaN, non-negative `float` was tested in both the error evaluation and the profiling, simply ordered by increasing value. I didn't profile or test negative numbers or NaNs, because these are outside the domain of the real-valued branch of the square root, or invalid, respectively, and therefore nonsensical.
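For concreteness, the relative-error convention described under Error Evaluation can be sketched as follows (this helper and its name are my own illustration, not the actual test harness):

```cpp
#include <cmath>
#include <limits>

// Sketch of the relative-error convention: infinite or NaN absolute
// error maps to infinite relative error, and zero absolute error maps
// to zero relative error, sidestepping 0/0 and inf/inf forms.
double relative_error(float ground, float approx) {
    double abs_err = std::fabs(static_cast<double>(approx) -
                               static_cast<double>(ground));
    if (std::isnan(abs_err) || std::isinf(abs_err))
        return std::numeric_limits<double>::infinity();
    if (abs_err == 0.0)
        return 0.0;
    return abs_err / std::fabs(static_cast<double>(ground));
}
```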
A lot of the overhead from `std::sqrt(...)` comes from checking for these cases, but the programmer will usually be guaranteeing that such inputs are not generated anyway (and even if they were, it would typically indicate a failure condition). Therefore, considering them seems pointless. In my opinion, checks for them should be added only if there's a reason to add them (and adding them regardless is a failure on the part of the standard library). On a related note: I haven't been able to find documentation for what the square root intrinsic does when passed a bogus value. This is sort of a moot point, because again you should not be doing that.

The `memcpy(...)` thing in the bithack variants is the recommended way to work around the strict-aliasing problem of `float`s vs. `uint32_t`s. In particular, the usual pointer-reinterpreting is dangerously broken in both C and C++, and the `union` hacks are problematic in C++.

Hat tip to (the somewhat dated) Some Assembly Required for some good ideas.

## What's Up With the Bithack Variant?

This comes from Wikipedia, and I haven't been able to find it attested anywhere else (I tried searching the revision history, but gave up after a while; Googling for it is nearly impossible). Unfortunately, the given explanation, while detailed, doesn't really explain what's going on, and is confusing besides. Instead, I worked backward from the algorithm to figure out what it does. It took a while, in no small part because understanding it requires extreme attention to detail, as the operations are highly nontrivial.
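As a concrete illustration of the `memcpy(...)` workaround mentioned above, here is a minimal sketch (the helper names are mine):

```cpp
#include <cstdint>
#include <cstring>

// Bit-level reinterpretation between float and uint32_t via std::memcpy.
// Unlike pointer casts, this does not violate strict aliasing, and
// compilers typically optimize the copies away entirely.
static inline uint32_t float_to_bits(float x) {
    uint32_t u;
    std::memcpy(&u, &x, sizeof(u));
    return u;
}

static inline float bits_to_float(uint32_t u) {
    float x;
    std::memcpy(&x, &u, sizeof(x));
    return x;
}
```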
First, let's break down the key operation in the middle (note that one of the shifts changed from `22` to `23`; this is because it happens first):

```
tmp -= 1u << 23;
tmp >>= 1;
tmp += 1u << 29;
tmp -= 0x0002D4ACu;
```

To understand this, as you might expect, you need to understand the bit encoding of `float` (picture courtesy Wikipedia for reference, but you should read their article if you're not familiar with the IEEE 754 `float` encoding, since I won't explain it here):

For non-negative, normalized, non-infinite, non-NaN `float`s, the sign bit is zero, the exponent byte is between 1 and 254 inclusive (and represents a value 127 less than is stored), and the mantissa has an implicit leading one:

\[ (-1)^{\{0\}} \times \left(2^{\{e_7 e_6\cdots e_0\}-127}\right) \times \{1.m_{22}m_{21}\cdots m_0\} \]

The correct answer for the square root is obviously just the square root of this (the last step is done because fractional exponents cannot be represented):

\begin{align} & \sqrt{ (-1)^{\{0\}} \times \left(2^{\{e_7 e_6\cdots e_0\}-127}\right) \times \{1.m_{22}m_{21}\cdots m_0\} }\\ &= \sqrt{ \left(2^{\{e_7 e_6\cdots e_0\}-127}\right) } \sqrt{ \{1.m_{22}m_{21}\cdots m_0\} }\\ &= \left(2^{(\{e_7 e_6\cdots e_0\}-127)/2 }\right) \sqrt{ \{1.m_{22}m_{21}\cdots m_0\} }\\ &= \left(2^{(\{e_7 e_6\cdots e_0\}+127)/2-127}\right) \sqrt{ \{1.m_{22}m_{21}\cdots m_0\} }\\ &= \begin{cases} \left(2^{(\{e_7 e_6\cdots e_0\}+127)/2-127}\right) \sqrt{ ~~\{1.m_{22}m_{21}\cdots m_0\} } & \text{ if } \{e_0\}=\{1\}\\ \left(2^{(\{e_7 e_6\cdots e_0\}+126)/2-127}\right) \sqrt{ 2 \{1.m_{22}m_{21}\cdots m_0\} } & \text{ if } \{e_0\}=\{0\} \end{cases} \end{align}

Let's consider the operations done by the code in order, and see how they correspond to computing an approximation of this ideal.
Starting from our original expression: \[ (-1)^{\{0\}} \times \left(2^{\{e_7 e_6\cdots e_0\}-127}\right) \times \{1.m_{22}m_{21}\cdots m_0\} \]

We subtract `1<<23` (this subtracts one from the exponent field; notice that since the exponent field is at least one, this cannot do any nasty underflows or anything): \[ \left(2^{(\{e_7 e_6\cdots e_0\}-1)-127}\right) \times \{1.m_{22}m_{21}\cdots m_0\} \]

Then we shift right (remember that for unsigned values this is roughly equivalent to dividing by two, and that since the sign bit is zero, a zero is shifted into the exponent field) and add `1<<29` (i.e., 64 into the exponent). To apprehend what these two operations do in combination with the previous one, we have to break it down case by case.

### Low-Order Exponent Bit is One

In the (easier) case where the low-order bit was one, the above subtraction turned it into a zero (and didn't change the other exponent bits). The zero is shifted into the mantissa and the original upper seven bits are left to form the exponent. Finally, we add 64 to the exponent. The sequence of steps looks like this:

\begin{alignat*}{5} & \left(2^{ \{e_7 e_6 \cdots e_1 1 \} -127}\right) && \times \{1.m_{22}m_{21} \cdots m_1 m_0\}~~~~~~~~&&\text{ if } \{e_0\}=\{1\}\\ & \left(2^{(\{e_7 e_6 \cdots e_1 1 \}-1)-127}\right) && \times \{1.m_{22}m_{21} \cdots m_1 m_0\}~~~~~~~~&&\text{ (subtract)}\\ & \left(2^{ \{e_7 e_6 \cdots e_1 0 \} -127}\right) && \times \{1.m_{22}m_{21} \cdots m_1 m_0\}\\ & \left(2^{ \{0 e_7 e_6\cdots e_2 e_1\} -127}\right) && \times \{1.0 m_{22}m_{21}\cdots m_2 m_1\}~~~~~~~~&&\text{ (shift)}\\ & \left(2^{ \{0 e_7 e_6\cdots e_2 e_1\}+64-127}\right) && \times \{1.0 m_{22}m_{21}\cdots m_2 m_1\}~~~~~~~~&&\text{ (add)} \end{alignat*}

When we rearrange this to represent doing the whole thing in one go, we find that the exponent exactly matches the desired one from above!
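As a sanity check of this odd-low-bit case, here is a small helper (mine, not from the original article) that applies just the subtract/shift/add core and reports the resulting biased exponent field:

```cpp
#include <cstdint>
#include <cstring>

// Applies the three core steps of the bithack and extracts the biased
// exponent field of the result. For inputs whose biased exponent e has
// a low bit of one, this should equal (e + 127)/2, as derived above.
uint32_t exponent_after_core_steps(float x) {
    uint32_t tmp;
    std::memcpy(&tmp, &x, sizeof(tmp));
    tmp -= 1u << 23;  // subtract one from the exponent field
    tmp >>= 1;        // halve; a zero is shifted in from the sign bit
    tmp += 1u << 29;  // add 64 to the exponent field
    return (tmp >> 23) & 0xFFu;
}
```

For example, `4.0f` has biased exponent 129 (low bit one), and the helper yields \((129 + 127)/2 = 128\), exactly the biased exponent of \(\sqrt{4} = 2\).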
\[ \left(2^{(\{e_7 e_6\cdots e_0\}+127)/2 - 127}\right) \times \{1.0 m_{22}m_{21}\cdots m_2 m_1\}~~~~~~~~\text{ if } \{e_0\}=\{1\} \]

We now examine the mantissa:

\begin{align} \{1.0 m_{22}m_{21}\cdots m_2 m_1\} &= 1 + (1/2)(\{1.m_{22}m_{21}\cdots m_1 0 \} - 1)\\[5pt] & \approx 1 + (1/2)(\{1.m_{22}m_{21}\cdots m_1 m_0\} - 1)\\[5pt] & \approx \sqrt{ \{1.m_{22}m_{21}\cdots m_1 m_0\} } \end{align}

And this is the desired mantissa! Note that in the last step we used the Taylor expansion of \(\sqrt{x}\) at \(x=1\):

\[ \sqrt{x} = \sum_{k=0}^{\infty} (x-1)^k \binom{1/2}{k} = 1 + \frac{1}{2}(x-1) - O\left( (x-1)^2 \right) \]

That is, the computed mantissa is (neglecting the last bit, which gets shifted into oblivion) a first-order Taylor approximation to the desired mantissa.

### Low-Order Exponent Bit is Zero

In the case where the low-order bit was zero, the above subtraction turned it into a one. Viewed as a division, the shift rounds down (i.e., it subtracts an additional one, so two in total) from the exponent before it is divided by two. Whatever the subtraction left in the upper seven bits remains to form the exponent. The one is shifted into the mantissa, and then, as before, we add 64 to the exponent.
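To get a feel for how tight that first-order approximation is, here's a quick numerical check (my own illustration, not from the original article):

```cpp
#include <cmath>

// First-order Taylor approximation of sqrt(m) around m = 1, which is
// what the shifted mantissa computes in the odd-exponent-bit case.
double taylor_sqrt(double m) {
    return 1.0 + 0.5 * (m - 1.0);
}
```

Near \(m = 1\) it is nearly exact; at the far end of the mantissa range, \(m\) close to 2, it gives 1.5 against \(\sqrt{2} \approx 1.41421\), an overestimate of about 6%.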
\begin{alignat*}{5} & \left(2^{ \{e_7 e_6 \cdots e_1 0 \} - 127}\right) && \times \{1.m_{22}m_{21} \cdots m_1 m_0\}~~~~~~~~&&\text{ if } \{e_0\}=\{0\}\\ & \left(2^{(\{e_7 e_6 \cdots e_1 0 \}-1) - 127}\right) && \times \{1.m_{22}m_{21} \cdots m_1 m_0\}~~~~~~~~&&\text{ (subtract)}\\ & \left(2^{ \{e_7' e_6' \cdots e_1' 1 \} - 127}\right) && \times \{1.m_{22}m_{21} \cdots m_1 m_0\}\\ & \left(2^{ \{0 e_7' e_6'\cdots e_2' e_1'\} - 127}\right) && \times \{1.1 m_{22}m_{21}\cdots m_2 m_1\}~~~~~~~~&&\text{ (shift)}\\ = & \left(2^{(\{e_7 e_6 \cdots e_1 0 \}-2)/2 - 127}\right) && \times \{1.1 m_{22}m_{21}\cdots m_2 m_1\}~~~~~~~~&&\text{ (same, in terms of original bits)}\\ & \left(2^{(\{e_7 e_6 \cdots e_1 0 \}-2)/2 + 64 - 127}\right) && \times \{1.1 m_{22}m_{21}\cdots m_2 m_1\}~~~~~~~~&&\text{ (add)} \end{alignat*}

Again, we write the whole thing in one go and find that the exponent matches the desired result from above: \[ \left(2^{(\{e_7 e_6\cdots e_1 e_0\} + 126)/2 - 127}\right) \times \{1.1 m_{22}m_{21}\cdots m_2 m_1\}~~~~~~~~\text{ if } \{e_0\}=\{0\}\\ \]

We again examine the mantissa:

\begin{align} \{1.1 m_{22}m_{21}\cdots m_2 m_1\} &= 1.5 + (1/2)(\{1.m_{22}m_{21}\cdots m_1 0 \} - 1)\\[5pt] & \approx 1.5 + (1/2)(\{1.m_{22}m_{21}\cdots m_1 m_0\} - 1)\\[5pt] & \approx (1.5811\cdots) + (0.63245\cdots)\left(\{1.m_{22}m_{21}\cdots m_1 m_0\} - (1.25)\right)\\[5pt] & \approx \sqrt{\frac{5}{2}} + \sqrt{\frac{2}{5}}\left(\{1.m_{22}m_{21}\cdots m_1 m_0\} - \frac{5}{4}\right)\\[5pt] & \approx \sqrt{ 2 \{1.m_{22}m_{21}\cdots m_1 m_0\} } \end{align}

The Taylor series (for \(\sqrt{2 x}\) at \(x=5/4\)) doesn't work out as nicely here, but it should be apparent that the mantissa is at least sort of right.

### Tidying Up

The last thing that's done is to (optionally) subtract a constant from our value. This simply reduces the average error somewhat by shifting the approximation down slightly.
The original number on Wikipedia, `0x0004B0D2u`, is supposed to minimize the maximum relative error (and I have confirmed that it does: the maximum relative error for the normals is 3.47475%, and the average is 1.65573%). It was probably chosen empirically. However, I think that minimizing the average relative error is more desirable. The number I use, `0x0002D4ACu`, was found by my code to do this. It increases the maximum relative error (to 4.50224%) but reduces the average error (to 1.50473%).

## Conclusion

Although I've said a lot, this really boils down to three things:


