Ingenious Pessimizations: Bit-Scan
Everyone loves fast code. And there's a certain kind of programmer (hello) that likes to write it, too. We love shaving a cycle here and there off a pixel-level operation and seeing it multiply out into milliseconds, seconds, even hours of faster performance. We come up with asymptotically superior algorithms, clever bithacks, and everything in between.
There's a pitfall, though—writing fast code that is actually slower.
If you write HPC code a lot, this is doubtless familiar. It's dishearteningly easy to write code that looks faster but isn't, for some technical reason. It's why you always check that the result is faster. And sometimes, you screw up because you don't understand your problem space very well.
When you notice that your optimized code is actually pessimizing and just needs to be thrown out entirely, it can be really frustrating! Perhaps, so that the effort isn't totally wasted, you might even want to at least write a blog post explaining it as a case-study . . .
Counting Zeros
A common operation is to count the number of leading or trailing 0-bits in a number: operations called 'CLZ' and 'CTZ' (for 'count leading zeros' / 'count trailing zeros'). (Counting 1-bits is less common, but can easily be done by just bit-complementing the argument before then counting zeros.)
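A couple of concrete values, using the standard <bit> functions introduced just below (the numbers are illustrative, not from the original post):

#include <bit>
#include <cstdint>

// 0b0001'0110: bits 7..5 are zero (CLZ = 3); bit 0 is zero, bit 1 is set (CTZ = 1).
static_assert( std::countl_zero( std::uint8_t{0b0001'0110} ) == 3 );
static_assert( std::countr_zero( std::uint8_t{0b0001'0110} ) == 1 );

// Counting leading 1-bits by complementing first, then counting zeros:
static_assert( std::countl_zero( std::uint8_t(~0b1110'0101) ) == 3 );
static_assert( std::countl_one ( std::uint8_t{0b1110'0101} ) == 3 ); // same thing, built in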
C++ (since C++20, or __cpp_lib_bitops) offers us the convenient std::countr_zero<⋯>(⋯) and std::countl_zero<⋯>(⋯) functions. There are also versions for counting 1-bits, and all of it is even constexpr! With these, we can write some tidy little wrappers[1]:
#include <cstdint>
#include <bit>
#include <type_traits> //for std::is_same_v

template<class T> [[nodiscard]] constexpr
T clz( T val ) noexcept requires(
    std::is_same_v<T,uint8_t > || std::is_same_v<T,uint16_t> ||
    std::is_same_v<T,uint32_t> || std::is_same_v<T,uint64_t>
) {
    return (T)std::countl_zero(val);
}

template<class T> [[nodiscard]] constexpr
T ctz( T val ) noexcept requires(
    std::is_same_v<T,uint8_t > || std::is_same_v<T,uint16_t> ||
    std::is_same_v<T,uint32_t> || std::is_same_v<T,uint64_t>
) {
    return (T)std::countr_zero(val);
}
However, you look at the disassembly (e.g. for uint32_t, "clang -O3") and are disappointed:
clz<uint32_t>(uint32_t):
    test edi, edi
    je .LBB0_2
    bsr eax, edi
    xor eax, 31
    ret
.LBB0_2:
    mov eax, 32
    ret
ctz<uint32_t>(uint32_t):
    test edi, edi
    je .LBB1_2
    rep bsf eax, edi
    ret
.LBB1_2:
    mov eax, 32
    ret
That's a lot of instructions!
Unpacking how these work: since the i386[2], the x86 architecture has offered the bsf and bsr ('bit-scan forward'/'bit-scan reverse', 'BSF'/'BSR') instructions, which give the index of the lowest (BSF) or highest (BSR) set bit in a 16- or 32-bit operand. If the operand is zero, there is no set bit, and the result is undefined (but the zero flag 'ZF' is at least set).
The shortsightedness here is that by just defining the output for the zero case to be the bit width, Intel could have upgraded the semantics for BSF into CTZ. CTZ is much more common and useful, while also being completely compatible with the old semantics of getting the bit index. (Similarly, BSR can be upgraded to CLZ, although there is a semantic difference, discussed below.) Alas, Intel chose not to define the zero case, and so the compiler has had to insert a test[3] for zero.
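Written back out in C++, that branchy lowering amounts to something like the following (a sketch of mine using the GCC/Clang builtins, which, like the instructions themselves, are undefined for a zero argument; this is not code from the post):

#include <cstdint>

// What the compiler has to provide on top of BSF/BSR for a 32-bit operand:
uint32_t ctz32_branchy( uint32_t x ) { return x ? (uint32_t)__builtin_ctz(x) : 32u; }
uint32_t clz32_branchy( uint32_t x ) { return x ? (uint32_t)__builtin_clz(x) : 32u; }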
Doing Better (i.e. Worse)
If you know that the input isn't zero, this test is wasteful! Also, "everyone knows" that branching is slow. Instead, we can execute the bit-scan instruction unconditionally (it's very fast[4]). The instruction sets 'ZF' if the input was zero, and we can use that flag to predicate a conditional-move (cmovz) that moves in the right answer for that case.
This is clearest for the CTZ case, using GCC-style inline-assembly[5] in Intel syntax (and dropping platform-specific variation, delegation for non-supported bit-widths, and some of the declaration formalism):
template<class T>
T my_ctz( T val )
{
    constexpr T defl = 8 * sizeof(T);
    T ret = val;
    asm(
        "bsf %0, %0\n"
        "cmovz %0, %1\n"
        : "+r"(ret) : "r"(defl) : "cc"
    );
    return ret;
}
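Before going further, a quick sanity check of my_ctz against the standard function (my own test, not from the post; like the benchmarks later on, it assumes Intel-syntax inline assembly, i.e. -masm=intel):

#include <bit>
#include <cassert>
#include <cstdint>
#include <initializer_list>

void test_my_ctz()
{
    //Spot-check a few values, including the zero case handled by the cmovz.
    for ( uint32_t v : { 0u, 1u, 2u, 0x8000'0000u, 0xF0F0'F0F0u } )
        assert( my_ctz(v) == (uint32_t)std::countr_zero(v) );
}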
The case for CLZ is similar, but we have to be a little careful. BSR returns the bit index, counted from the right, whereas we want the number of zeros on the left. For example, if we have a 32-bit number and BSR returns 28, then the highest set bit is bit 28, and so bits 29, 30, and 31 are zeros, and so CLZ must be 3. In general, CLZ = (8*sizeof(T)-1) - BSR.
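A quick check of that worked example with the standard function (bit 28 set, so three leading zeros; my own check, not from the post):

#include <bit>
#include <cstdint>

static_assert( std::countl_zero( (uint32_t)(1u << 28) ) == 3 );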
We can implement that subtraction with the appropriate instruction (sub). Note that sub also sets ZF, so we need to execute it after the conditional-move:
template<class T>
T my_clz( T val )
{
    T tmp=val, ret=8*sizeof(T)-1;
    asm(
        "bsr %1, %1\n"
        "cmovz %1, %2\n"
        "sub %0, %1\n"
        : "+r"(ret), "+r"(tmp) : "rm"(-1) : "cc"
    );
    return ret;
}
The registers are a bit awkward, but notice how we have very cleverly re-used at least some of them. Unfortunately, cmovz can't take an immediate for its second operand. And by the form of the sub, we can't use an immediate there either, even though we know the value at compile-time.
However, now we have a chance to show off our talent! The subtraction is fast, but there are faster instructions. In particular, everyone knows that XOR (xor) is faster. While XOR might not seem relevant, in this case we're subtracting from a power-of-two-minus-one, so we can use the following identity: for any m of the form 2^k − 1 and any 0 ≤ i ≤ m,

m − i = m XOR i.
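A quick compile-time sanity check of that identity for the 32-bit case (m = 31; this check is mine, not from the post):

constexpr bool xor_identity_holds()
{
    for ( unsigned i = 0; i <= 31u; ++i )
        if ( 31u - i != (31u ^ i) ) return false;
    return true;
}
static_assert( xor_identity_holds() );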
Notice that, because XOR is commutative, we can reverse the operands—allowing us not only to use the faster instruction, but to use an immediate as one of the operands as well! The code becomes:
template<class T>
T my_clz( T val )
{
    constexpr T nm1  = 8*sizeof(T) - 1;
    constexpr T defl = (8*sizeof(T)) | (8*sizeof(T)-1);
    T ret = val;
    asm(
        "bsr %0, %0\n"
        "cmovz %0, %1\n"
        "xor %0, %2\n"
        : "+r"(ret) : "r"(defl), "imr"(nm1) : "cc"
    );
    return ret;
}
Now we can be properly pleased with ourselves. If bsr is 3 cycles, and cmov and xor are each 1, then this is 5 cycles, which is faster than the branchy solution given by the compiler—even on the happy path, where it doesn't suffer an incredibly costly branch misprediction. The situation for CTZ is similar.
Sketchy cmovs
The first (and less-significant) issue with the above "optimizations" is that we use a conditional-move instruction. Conditional-move instructions indeed do remove the branch misprediction penalty, but that isn't necessarily a win. Conditional moves are an additional dependency for both streams of data, as well as for the flags register. Whereas, if the branch predictor can correctly predict a branch, then it really predicts it and the jump is free—free, as in actually zero cycles (it does consume branch-prediction resources, though).
A lot of people will say a lot of things about conditional-moves, but I'd consider Agner Fog the most authoritative source, even above a typical Intel engineer. He writes[6]:
As a rule of thumb, we can say that a conditional jump is faster than a conditional move if the code is part of a dependency chain and the prediction rate is better than 75%. A conditional jump is also preferred if we can avoid a lengthy calculation [in one branch].
Without invoking an entire application / use-case, it's difficult to figure out a priori what that tradeoff actually is. It seems likely that for many bit-scan applications, the zeros will be predictable. If nothing else, there are a lot more nonzero values than zero values—and when there are zero values, there are usually a lot of zero values—in either case, the branch wins.
On the other hand, the dependency chains here are very short (two registers and the flags register, all of which we need regardless), and branch misprediction really hurts—and will only get worse on newer processors. The 'best' solution would have the caller tell us (by template argument) whether the argument can be zero and, if so, how predictable it will be, although this is a little unwieldy.
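For concreteness, here is one shape that caller-supplied hint could take (a sketch of my own, not the post's code; the parameter names are invented, and it reuses the my_ctz cmov version from above):

#include <bit>

// ZeroPossible: can val ever be zero?  ZeroPredictable: if it can, will the branch
// predictor learn the pattern?  (Both names are illustrative.)
template< bool ZeroPossible = true, bool ZeroPredictable = true, class T >
T ctz_hinted( T val )
{
    if constexpr ( !ZeroPossible )
    {
        if ( val == 0 ) __builtin_unreachable(); // caller's promise (GCC/Clang hint)
        return (T)std::countr_zero(val);
    }
    else if constexpr ( ZeroPredictable )
        return (T)std::countr_zero(val);         // keep the compiler's branchy lowering
    else
        return my_ctz(val);                      // the branchless cmov version from above
}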
This is the point where we should defer to a benchmark. Linus Torvalds offers a test of jump vs. conditional-move (though not in the context of bit-scanning), which we modernize here:
/*
Mayhap overly simplistic jump vs. cmov test, modernized from:
    https://yarchive.net/comp/linux/cmov.html

Compile and test jump:
    gcc -masm=intel -O3 main.cpp && time ./a.out
Compile and test cmov:
    gcc -masm=intel -O3 -DCMOV main.cpp && time ./a.out
*/

#include <cstdio>

inline constexpr unsigned NUM_ITERS = 100'000'000;
inline constexpr unsigned long a=5, b=7;

int main( int /*argc*/, char* /*argv*/[] )
{
    unsigned long sum = 0;
    for ( unsigned k=0; k<NUM_ITERS; ++k )
    {
        //Basically, `choice = (k&TEST_BITMASK) ? a : b;`
        unsigned long choice;
        #define TEST_BITMASK 1
        #ifdef CMOV
        asm(
            "test %1, %2\n"
            "cmovne %0, %3\n"
            : "=r"(choice) : "r"(k), "i"(TEST_BITMASK), "rm"(a), "0"(b) : "cc"
        );
        #else
        asm(
            "test %1, %2\n"
            "je .ZERO\n"
            "mov %0, %3\n"
            ".ZERO:\n"
            : "=r"(choice) : "r"(k), "i"(TEST_BITMASK), "g"(a), "0"(b) : "cc"
        );
        #endif
        sum += choice;
    }
    printf( "%lu\n", sum );
    return 0;
}
On a modern computer, we can indeed reproduce the result that jump seems faster. Torvalds (mis)interprets this as supporting a sweeping generalization that conditional-moves are bad, but it really only proves the point that branches can become free.
The test basically branches on the lowest bit of the loop index, meaning that after a few iterations the loop is 100% predictable[7]. In that case, the jump goes away, and we're left with a loop that doesn't have a mov half the time. This is obviously faster than a loop that has a mov all the time. (We can actually make the results about the same by putting a nop in the second loop after the mov, so that half the loop iterations do 2 instead of 1 instruction, meaning both loops execute about the same number of instructions.)
Still, even if it's less that conditional-moves are bad and more that branches can be good, it's a real speedup. That effect needs to be considered when choosing our implementation.
Speaking of, let's write a similar test, but for bit-scanning:
/*
Mayhap overly simplistic jump vs. cmov test, for bit-scanning.

Compile and test jump:
    gcc -masm=intel -O3 main.cpp && time ./a.out
Compile and test cmov:
    gcc -masm=intel -O3 -DCMOV main.cpp && time ./a.out
*/

#include <cstdint>
#include <cstdio>

inline constexpr unsigned NUM_ITERS = 1'000'000'000;

__attribute__(( naked ))
uint32_t ctz( uint32_t val )
{
    #ifdef CMOV
    asm(
        "bsf eax, edi\n"
        "cmovz eax, %0\n"
        "ret\n"
        : : "rm"(32) : "eax", "edi", "cc"
    );
    #else
    asm(
        "test edi, edi\n"
        "je .ZERO\n"
        "bsf eax, edi\n"
        "ret\n"
        ".ZERO:\n"
        "mov eax, 32\n"
        "ret\n"
        : : : "eax", "edi", "cc"
    );
    #endif
}

int main( int /*argc*/, char* /*argv*/[] )
{
    uint32_t sum = 0;
    for ( unsigned k=0; k<NUM_ITERS; ++k ) sum+=ctz(k);
    printf( "%u\n", sum );
    return 0;
}
We have to be a bit careful with the test in the jump case. If we write it in the source language, the compiler will optimize the first loop iteration out. However, in assembly, the question arises of where to put the mov. So that we don't unnecessarily execute it (GCC does that; it's probably okay), it needs to be in a branch, and that code needs to be somewhere. We could use two branches, but that would be unrepresentative, even though predicted branches are free. The solution I've chosen is to do everything in a separate function, so that we can ret multiple times.
In any case, when we run the code, we find that the jump is indeed (a little) bit faster. If we #include <cstdlib> and then change it to sum+=ctz(rand());, then any difference is too small to reliably measure against the overhead of calling rand().
What it all boils down to is that the conditional-move was something of a very clever pessimization! The jump is better for predictable data, and for unpredictable data, it just doesn't matter enough.
(Actually) Doing Better
There's an even more serious problem with the "optimization", though. If we specify e.g. "clang -O3 -march=native"[8] for the original standard library versions, we get something remarkable:
clz<uint32_t>(uint32_t):
    lzcnt eax, edi
    ret
ctz<uint32_t>(uint32_t):
    tzcnt eax, edi
    ret
It turns out that our "optimization" above was also an example of the second way things can go wrong—not understanding the context. See, Intel and AMD[9] realized that the bsf/bsr instructions are deficient, and introduced new instructions to deal with this (and other issues). Specifically, the situation nowadays[10] is that modern x86 processors (Intel Haswell / AMD Excavator and newer) support lzcnt/tzcnt, which implement CLZ/CTZ directly, so we should only be using bsf/bsr as a fallback[11].
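If you did want to spell that dispatch out by hand, it might look roughly like this (a sketch using the <immintrin.h> intrinsics and the GCC/Clang __LZCNT__/__BMI__ feature macros; the standard-library functions already do the equivalent for you when you compile with an appropriate -march):

#include <bit>
#include <cstdint>
#if defined(__LZCNT__) || defined(__BMI__)
    #include <immintrin.h>
#endif

inline uint32_t clz32( uint32_t val )
{
    #if defined(__LZCNT__)
        return _lzcnt_u32(val);                  // defined even for val == 0 (returns 32)
    #else
        return (uint32_t)std::countl_zero(val);  // falls back to bsr + xor, plus the zero test
    #endif
}

inline uint32_t ctz32( uint32_t val )
{
    #if defined(__BMI__)
        return _tzcnt_u32(val);                  // defined even for val == 0 (returns 32)
    #else
        return (uint32_t)std::countr_zero(val);
    #endif
}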
Even if we wanted to support older architectures, we can still go a long way by hinting to the compiler about the zero case using assume(⋯), which we can define (for release mode[12]) as follows:
#if defined __clang__
#define assume(EXPR) __builtin_assume(EXPR)
#elif defined __GNUC__
//Note: std::unreachable() is C++23 and lives in <utility>; __builtin_unreachable() also works here.
#define assume(EXPR) ( (EXPR) ? static_cast<void>(0) : ::std::unreachable() )
#else //Intel or MSVC?
#define assume(EXPR) __assume(EXPR)
#endif
For example, if we know a priori that the input will be nonzero (the programmer could signal this guarantee through e.g. a boolean template argument), then we can use assume( val > 0 ). In our original C++ wrappers for std::countr_zero<⋯>(⋯) / std::countl_zero<⋯>(⋯), we suddenly get a much tighter implementation:
clz<uint32_t>(uint32_t):
    bsr eax, edi
    xor eax, 31
    ret
ctz<uint32_t>(uint32_t):
    rep bsf eax, edi
    ret
Hey, look at that! The compiler gave us the same XOR optimization we thought ourselves so clever for discovering. Actually, if we had paid attention to the original compiler output, we'd have noticed the compiler did it there, too.
Final Thoughts
What initially seemed like a clear win for cleverness actually turned out to be a pessimization. One reason is that there were hardware technicalities that we didn't properly understand. The other is that we failed to appreciate the context, that modern hardware provides better instructions anyway.
As in so many other cases, the Right Thing here is to roll everything back and just use the C++ standard library call, possibly wrapped up, and with an assume(⋯) if we know the input is nonzero. This reproduces the best aspects of our optimization, when it's appropriate to do so—but with far less code and in a cross-platform way.
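Spelled out, that rolled-back version might look something like this (a sketch reusing the assume(⋯) macro from above; the KnownNonzero parameter is an illustrative name of mine, not something from the post):

#include <bit>
#include <cstdint>

template< bool KnownNonzero = false, class T > [[nodiscard]] constexpr
T ctz( T val ) noexcept
{
    if constexpr ( KnownNonzero ) assume( val != 0 );
    return (T)std::countr_zero(val);
}

//Usage: `ctz(x)` keeps the zero handling; `ctz<true>(x)` promises x != 0.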
This result is common when optimizing code, and it underscores the importance of testing afterward that the result really is faster, as well as the value of thinking through some of the implications before investing a lot of time in creating something.
It's certainly possible to outdo the standard library implementations, and even easier to outdo the compiler[13]. Moreover, micro-optimization matters (whether by multiplying small gains out, or avoiding death from a thousand cuts). But many things worth doing are difficult, and code optimization is certainly one of them.