Ingenious Pessimizations: Bit-Scan

Everyone loves fast code. And there's a certain kind of programmer (hello) that likes to write it, too. We love shaving a cycle here and there on a pixel-level operation and seeing it multiply out into milliseconds, seconds, or even hours of improved performance. We come up with asymptotically superior algorithms, clever bithacks, and everything in between.

There's a pitfall, though—writing fast code that is actually slower.

If you write HPC code a lot, this is doubtless familiar. It's dishearteningly easy to write code that looks faster but isn't, for some technical reason. That's why you always check that the result really is faster. And sometimes, you screw up because you don't understand your problem space very well.

When you notice that your optimized code is actually pessimizing and just needs to be thrown out entirely, it can be really frustrating! Perhaps, so that the effort isn't totally wasted, you might even want to at least write a blog post explaining it as a case-study . . .


Counting Zeros

A common operation is to count the number of leading or trailing 0-bits in a number: operations called 'CLZ' and 'CTZ' (for 'count leading zeros' / 'count trailing zeros'). For example, the 32-bit value 0b0000'0000'0000'0000'0000'0001'0010'1000 has CLZ = 23 and CTZ = 3. (Counting 1-bits is less common, but can easily be done by just twiddling (bit-complementing) the argument before counting zeros.)

C++ (since C++20 or __cpp_lib_bitops) offers us the convenient std::countr_zero<⋯>(⋯) and std::countl_zero<⋯>(⋯) functions. There are also versions for counting 1-bits, and all of it is even constexpr! With these, we can write some tidy little wrappers[1]:

#include <cstdint>
#include <bit>
#include <type_traits> //std::is_same_v

template<class T> [[nodiscard]] constexpr
T clz( T val ) noexcept requires(
	std::is_same_v<T,uint8_t > || std::is_same_v<T,uint16_t> ||
	std::is_same_v<T,uint32_t> || std::is_same_v<T,uint64_t>
) {
	return (T)std::countl_zero(val);
}

template<class T> [[nodiscard]] constexpr
T ctz( T val ) noexcept requires(
	std::is_same_v<T,uint8_t > || std::is_same_v<T,uint16_t> ||
	std::is_same_v<T,uint32_t> || std::is_same_v<T,uint64_t>
) {
	return (T)std::countr_zero(val);
}
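
Since these are constexpr, a quick compile-time sanity check is possible (the expected values follow directly from the definitions above):

static_assert( ctz( uint32_t(8) ) ==  3 ); //0b1000: three trailing zeros
static_assert( clz( uint32_t(1) ) == 31 ); //only bit 0 is set
static_assert( ctz( uint32_t(0) ) == 32 ); //zero input: the full bit width
static_assert( clz( uint16_t(1) ) == 15 ); //and it respects the operand width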

However, you look at the disassembly (e.g. for uint32_t, "clang -O3") and are disappointed:

clz<uint32_t>(uint32_t):
	test   edi, edi
	je     .LBB0_2
	bsr    eax, edi
	xor    eax, 31
	ret
.LBB0_2:
	mov    eax, 32
	ret

ctz<uint32_t>(uint32_t):
	test      edi, edi
	je        .LBB1_2
	rep bsf   eax, edi
	ret
.LBB1_2:
	mov       eax, 32
	ret

That's a lot of instructions!

Unpacking how these work: since the i386[2], the x86 architecture has offered the bsf and bsr ('bit-scan forward/reverse', 'BSF'/'BSR') instructions, which give the index of the lowest (BSF) or highest (BSR) set bit in a 16- or 32-bit operand. If the operand is zero, there is no set bit, and the result is undefined (but the zero flag 'ZF' is at least set).

The shortsightedness here is that by just defining the output for the zero case to be the bit width, Intel could have upgraded the semantics for BSF into CTZ. CTZ is much more common and useful, while also being completely compatible with the old semantics of getting the bit index. (Similarly, BSR can be upgraded to CLZ, although there is a semantic difference, discussed below.) Alas, Intel chose not to define the zero case, and so the compiler has had to insert a test[3] for zero.
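
To make the index-versus-count distinction concrete, here is a small compile-time check using the C++20 functions and the example value from above (given the <bit> and <cstdint> includes from the first listing; the BSF/BSR notes in the comments apply to nonzero inputs):

constexpr uint32_t x = 0b0000'0000'0000'0000'0000'0001'0010'1000;
static_assert( std::countr_zero(x) == 3 );       //same as BSF(x): the lowest set bit is bit 3
static_assert( 31 - std::countl_zero(x) == 8 );  //same as BSR(x): the highest set bit is bit 8,
                                                 //an index, not a count of leading zeros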


Doing Better (i.e. Worse)

If you know that the input isn't zero, this test is wasteful! Also, "everyone knows" that branching is slow. Instead, we can execute the bit-scan instruction unconditionally (it's very fast[4]); it sets 'ZF' when the input is zero. We can then use that flag to predicate a conditional-move (cmovz) that moves the right answer in for that case.

This is clearest for the CTZ case, using GCC-like inline-assembly[5] (and dropping platform-specific variation, delegation for non-supported bit-widths, and some of the declaration formalism):

template<class T>
T my_ctz( T val )
{
	constexpr T defl = 8 * sizeof(T); //default result for a zero input: the bit width
	T ret = val;
	asm( //Intel syntax (e.g. "-masm=intel")
		"bsf     %0, %0\n"   //ret = index of lowest set bit (sets ZF if the input was zero)
		"cmovz   %0, %1\n"   //if ZF, ret = the bit width instead
		: "+r"(ret) : "r"(defl) : "cc"
	);
	return ret;
}

The case for CLZ is similar, but we have to be a little careful. BSR returns the bit index, counted from the right, whereas we want the number of zeros on the left. For example, if we have a 32-bit number and BSR returns 28, then the highest set bit is bit 28, and so bits 29, 30, and 31 are zeros, and so CLZ must be 3. In general, CLZ = (8*sizeof(T)-1) - BSR.

We can implement that subtraction with the appropriate instruction (sub). Note that sub also modifies ZF, so we need to execute it after the conditional-move:

template<class T>
T my_clz( T val )
{
	T tmp=val, ret=8*sizeof(T)-1;
	asm(
		"bsr     %1, %1\n"   //tmp = index of highest set bit (sets ZF if the input was zero)
		"cmovz   %1, %2\n"   //if ZF, tmp = -1, so the subtraction below yields the bit width
		"sub     %0, %1\n"   //ret = (bit width - 1) - tmp
		: "+r"(ret), "+r"(tmp) : "rm"(-1) : "cc"
	);
	return ret;
}

The registers are a bit awkward, but notice how we have very cleverly re-used at least some of them. Unfortunately, cmovz can't take an immediate for its second operand. And by the form of the sub, we can't use an immediate there either, even though we know the value at compile-time.


However, now we have a chance to show off our talent! The subtraction is fast, but there are faster instructions. In particular, "everyone knows" that XOR (xor) is faster. While XOR might not seem relevant, in this case we're subtracting from a power-of-two-minus-one, so we can use the following identity:

\[ (2^n-1) - a ~~=~~ a \oplus (2^n-1) \]
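
The identity holds because a never exceeds 2^n - 1, so the subtraction can never borrow: it simply flips each of the low n bits. A quick compile-time check for n = 5 (i.e. 2^n - 1 = 31, the 32-bit case):

static_assert( (31 - 28) == (28 ^ 31) ); //the BSR-returns-28 example from above
static_assert( (31 -  0) == ( 0 ^ 31) );
static_assert( (31 - 31) == (31 ^ 31) );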

Notice that, because the XOR is commutative, we can reverse the operands—allowing us to not only use the faster instruction, but to use an immediate as one of the operands as well! The code becomes:

template<class T>
T my_clz( T val )
{
	constexpr T nm1  = 8*sizeof(T) - 1;                 //e.g. 31 for uint32_t
	constexpr T defl = (8*sizeof(T)) | (8*sizeof(T)-1); //chosen so that defl^nm1 = bit width
	T ret = val;
	asm(
		"bsr     %0, %0\n"   //ret = index of highest set bit (sets ZF if the input was zero)
		"cmovz   %0, %1\n"   //if ZF, ret = defl
		"xor     %0, %2\n"   //ret ^= nm1, i.e. ret = nm1 - ret (or the bit width for zero)
		: "+r"(ret) : "r"(defl), "imr"(nm1) : "cc"
	);
	return ret;
}

Now we can be properly pleased with ourselves. If bsr is 3 cycles, and cmov and xor are each 1, then this is 5 cycles, which is shorter than the branchy solution given by the compiler—even on the happy path where it doesn't suffer an incredibly costly branch misprediction. The situation for CTZ is similar.


Sketchy cmovs

The first (and less-significant) issue with the above "optimizations" is that we use a conditional-move instruction. Conditional moves do indeed remove the branch-misprediction penalty, but that isn't necessarily a win. A conditional move adds a data dependency on both possible inputs, as well as on the flags register. By contrast, if the branch predictor can predict a branch, then it really is predicted and the jump is free—free as in actually zero cycles (though it does consume branch-prediction resources).

A lot of people will say a lot of things about conditional-moves, but I'd consider Agner Fog the most authoritative source, even above a typical Intel engineer. He writes[6]:

As a rule of thumb, we can say that a conditional jump is faster than a conditional move if the code is part of a dependency chain and the prediction rate is better than 75%. A conditional jump is also preferred if we can avoid a lengthy calculation [in one branch].

Without invoking an entire application / use-case, it's difficult to figure out a priori where that tradeoff falls. It seems likely that for many bit-scan applications, the zeros will be predictable. If nothing else, there are a lot more nonzero values than zero values—and when zero values do show up, they usually show up in bulk—and in either case, the branch wins.

On the other hand, the dependency chains here are very short (two registers and the flags register, all of which we need regardless), and branch misprediction really hurts—and will only get worse on newer processors. The 'best' solution would have the caller tell us (by template argument) whether the argument can be zero and, if so, how predictable it will be, although this is a little unwieldy.
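
For illustration, such an interface might look something like the following sketch (the names are invented for this post; it just picks between the compiler's branchy std::countr_zero and the branchless my_ctz from above):

template<bool zero_is_predictable = true, class T>
T ctz_dispatch( T val )
{
	if constexpr ( zero_is_predictable )
		return (T)std::countr_zero(val); //the test-and-branch version the compiler emits
	else
		return my_ctz(val);              //the bsf+cmovz version from above
}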

This is the point where we should defer to a benchmark. Linus Torvalds offers a test of jump vs. conditional-move (though not in the context of bit-scanning), which we modernize here:

/*
Mayhap overly simplistic jump vs. cmov test, modernized from:
	https://yarchive.net/comp/linux/cmov.html

Compile and test jump:
	gcc -masm=intel -O3        main.cpp && time ./a.out
Compile and test cmov:
	gcc -masm=intel -O3 -DCMOV main.cpp && time ./a.out
*/

#include <cstdio>

inline constexpr unsigned NUM_ITERS = 100'000'000;
inline constexpr unsigned long a=5, b=7;

int main( int /*argc*/, char* /*argv*/[] )
{
	unsigned long sum = 0;
	for ( unsigned k=0; k<NUM_ITERS; ++k )
	{
		//Basically, `choice = (k&TEST_BITMASK) ? a : b;`
		unsigned long choice;
		#define TEST_BITMASK 1
		#ifdef CMOV
			asm(
				"test     %1, %2\n"
				"cmovne   %0, %3\n"
				: "=r"(choice) : "r"(k), "i"(TEST_BITMASK), "rm"(a), "0"(b) : "cc"
			);
		#else
			asm(
				"test   %1, %2\n"
				"je     .ZERO\n"
				"mov    %0, %3\n"
				".ZERO:\n"
				: "=r"(choice) : "r"(k), "i"(TEST_BITMASK), "g"(a), "0"(b) : "cc"
			);
		#endif

		sum += choice;
	}
	printf( "%lu\n", sum );

	return 0;
}

On a modern computer, we can indeed reproduce the result that jump seems faster. Torvalds (mis)interprets this as supporting a sweeping generalization that conditional-moves are bad, but it really only proves the point that branches can become free.

The test basically branches on the lowest bit of the loop index, meaning that after a few iterations the loop is 100% predictable[7]. In that case, the jump goes away, and we're left with a loop that doesn't have a mov half the time. This is obviously faster than a loop that has a mov all the time. (We can actually make the results about the same by putting a nop after the mov in the jump version, so that half of its iterations do 2 instructions instead of 1, meaning both loops execute about the same number of instructions.)

Still, even if it's less that conditional-moves are bad and more that branches can be good, it's a real speedup. That effect needs to be considered when choosing our implementation.

Speaking of, let's write a similar test, but for bit-scanning:

/*
Mayhap overly simplistic jump vs. cmov test, for bit-scanning.

Compile and test jump:
	gcc -masm=intel -O3        main.cpp && time ./a.out
Compile and test cmov:
	gcc -masm=intel -O3 -DCMOV main.cpp && time ./a.out
*/

#include <cstdint>
#include <cstdio>

inline constexpr unsigned NUM_ITERS = 1'000'000'000;

__attribute__(( naked ))
uint32_t ctz( uint32_t val ) //`val` arrives in edi and the result is returned in eax (SysV ABI)
{
	#ifdef CMOV
		asm(
			"bsf     eax, edi\n"
			"cmovz   eax, %0\n"
			"ret\n"
			: : "rm"(32) : "eax", "edi", "cc"
		);
	#else
		asm(
			"test   edi, edi\n"
			"je     .ZERO\n"

			"bsf    eax, edi\n"
			"ret\n"

			".ZERO:\n"
			"mov    eax, 32\n"
			"ret\n"

			: : : "eax", "edi", "cc"
		);
	#endif
}

int main( int /*argc*/, char* /*argv*/[] )
{
	uint32_t sum = 0;
	for ( unsigned k=0; k<NUM_ITERS; ++k ) sum+=ctz(k);
	printf( "%u\n", sum );

	return 0;
}

We have to be a bit careful with the test in the jump case. If we write it in the source language, the compiler will optimize it away (only the first loop iteration can be zero). Writing it in assembly, the question arises of where to put the mov. So that we don't execute it unnecessarily (GCC does; it's probably okay), it needs to sit behind a branch, and that code needs to live somewhere. We could use two branches, but that would be unrepresentative, even though predicted branches are free. The solution I've chosen is to do everything in a separate function, so that we can ret multiple times.

In any case, when we run the code, we find that the jump is indeed (a little) faster. If we #include <cstdlib> and then change it to sum+=ctz(rand());, then any difference is too small to reliably measure against the overhead of calling rand().

What it all boils down to is that the conditional-move was something of a very clever pessimization! The jump is better for predictable data, and for unpredictable data, it just doesn't matter enough.


(Actually) Doing Better

There's an even more serious problem with the "optimization", though. If we specify e.g. "clang -O3 -march=native"[8] for the original standard library versions, we get something remarkable:

clz<uint32_t>(uint32_t):
	lzcnt   eax, edi
	ret

ctz<uint32_t>(uint32_t):
	tzcnt   eax, edi
	ret

It turns out that our "optimization" above was also an example of the second way things can go wrong—not understanding the context. See, Intel and AMD[9] realized that the bsf/bsr instructions are deficient and introduced new instructions to deal with this (and other issues). Specifically, the situation nowadays[10] is that modern x86 processors (newer than Intel Haswell or AMD Excavator) support lzcnt/tzcnt, which implement CLZ/CTZ, so we should only be using bsf/bsr as a fallback[11].
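
If we wanted to hand-roll that preference ourselves, it could look something like this sketch (GCC/Clang-specific; __LZCNT__ is the usual target macro, and in practice std::countl_zero already makes this selection for us given the right -march):

#include <cstdint>
#if defined __LZCNT__
	#include <immintrin.h> //_lzcnt_u32
#endif

uint32_t clz32( uint32_t val )
{
	#if defined __LZCNT__
		return _lzcnt_u32(val);                         //well-defined for zero (returns 32)
	#else
		return val ? (uint32_t)__builtin_clz(val) : 32; //__builtin_clz is undefined at zero
	#endif
}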

Even if we wanted to support older architectures, we can still go a long way by telling the compiler about the zero case with assume(⋯), which we can define (for release mode[12]) as follows:

#if   defined __clang__
	#define assume(EXPR) __builtin_assume(EXPR)
#elif defined __GNUC__
	//Note: ::std::unreachable() is C++23 (<utility>); __builtin_unreachable() also works on older GCC
	#define assume(EXPR) ( (EXPR) ? static_cast<void>(0) : ::std::unreachable() )
#else //Intel or MSVC?
	#define assume(EXPR) __assume(EXPR)
#endif

For example, if we know a priori that the input will be nonzero (the programmer could signal this guarantee through e.g. a boolean template argument), then we can use assume( val > 0 ) inside our original C++ wrappers for std::countr_zero<⋯>(⋯)/std::countl_zero<⋯>(⋯).
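
A minimal sketch of how the ctz wrapper might be revised (the boolean template parameter nonzero is invented here, and clz is analogous; it relies on the assume macro and the includes from above):

template<bool nonzero = false, class T>
T ctz( T val )
{
	if constexpr (nonzero) assume( val > 0 ); //the caller promises a nonzero input
	return (T)std::countr_zero(val);
}

With the guarantee in place, we suddenly get a much tighter implementation: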

clz<uint32_t>(uint32_t):
	bsr   eax, edi
	xor   eax, 31
	ret

ctz<uint32_t>(uint32_t):
	rep bsf   eax, edi
	ret

Hey, look at that! The compiler gave us the same XOR optimization we thought ourselves so clever for discovering. Actually, if we had paid attention to the original compiler output, we'd have noticed the compiler did it there, too.


Final Thoughts

What initially seemed like a clear win for cleverness actually turned out to be a pessimization. One reason is that there were hardware technicalities we didn't properly understand. The other is that we failed to appreciate the context: modern hardware provides better instructions anyway.

As in so many other cases, the Right Thing here is to roll everything back and just use the C++ standard library call, possibly wrapped up, and with an assume(⋯) if we know the input is nonzero. This reproduces the best aspects of our optimization, when it's appropriate to do so—but with far less code and in a cross-platform way.

This result is common when optimizing code, and it underscores the importance of testing that the result really is faster afterward, as well as the value of thinking through some of the implications before investing a lot of time in creating something.

It's certainly possible to outdo the standard library implementations, and even easier to outdo the compiler[13]. Moreover, micro-optimization matters (whether by multiplying small gains out, or avoiding death from a thousand cuts). But many things worth doing are difficult, and code optimization is certainly one of them.


Notes

