Test-Driven Development is Stupid

Trying to improve software quality by increasing the amount of testing is like trying to lose weight by weighing yourself more often. -Steve C. McConnell

Editor’s Note

Once upon a time, this post went viral among some TDD developers, who took very unkindly to it. I received more hatemail from them than for any other post on my site (at that time). Out of all of it, there was a (single) substantive objection: that in (some) TDD shops, the test-then-write-code cycle happens at the module or function granularity rather than at program granularity. This indeed makes TDD less ridiculous.

However, even though I didn’t address this (actually, I didn’t address scale much one way or the other), I’d argue that the deficiencies are similar, even if at a reduced severity. In fact, although I’ve cleaned up some of the language and examples over the years, I stand by the substance of what I said.

Moreover, the sheer flame I got for this article, ironically, directly supports my belief that TDD makes people over-confident assholes.

Introduction and an Argument from Intuition

I am against writing unit tests in general, since experience shows that they actually prevent high-quality code from emerging from development—a better strategy is to focus on competence and good design.

However, one particular style of unit testing deserves special attention for its special lunacy. Test-Driven Development (TDD) is so monumentally dumb that the first time I heard of it, I thought it was an elaborate practical joke for a month into the project.

I still think it’s an elaborate practical joke, but at least now I know that some people take it seriously.

The basic idea—amazingly, one of the most popular methods of software engineering (and growing in popularity!)—is that, after you figure out what you want to do and how you want to do it, you write up a set of small test programs—each testing some tiny piece of the program that will exist. Then, you write the program.

Let me emphasize: you write the test cases for your program, and then you write your program. You are writing code to test something that doesn’t even exist yet. I am not rightly able to apprehend the kind of confusion of ideas that could provoke such a method.

Really, this is the kind of thing that only makes sense when you’re drunk, and if you think about it from an intuitive perspective, it is completely broken. Given that the entire challenge of software engineering is to make complex problems as simple as possible, doing things literally backward should start sounding alarm bells.

Developing software is like a painting commission. Say instead of fulfilling some specification, you’re to paint some rich patron’s daughter. So, you, without even doing a few sketches, you sit down and think: I want X number of black brush-strokes, and Y number of red brush-strokes. Her chin will be in exactly this position. Then, with the precision of a robot, you make every brush-stroke individually. (Maybe you do it slightly differently: you sketch out a basic form, and then you start at the top left corner and paint the first square centimeter until you’re pretty sure it’s just right. Then you move onto the next one, and so on). This is Test-Driven Development.

Do people paint paintings like this? No! Why? Because it’s retarded. The next week, the daughter shows up to pose and she’s not the slender teenager you were described, but actually a chubby fourth grader with a pout and a short attention span for modeling.

Real painters work from generalities downward, all at once. They change things before they ever resolve the details. They don’t plan ahead. They don’t work to constraints. They refactor pictures—even start over if it’s not shaping up right. This process is necessary. Guess what? Artists make pretty pictures this way. Software engineering’s intrinsic beauty should learn from this.

An Argument from Practicality

The most important argument is a practical one: Test-First doesn’t work. I don’t care what you’ve heard. I don’t care how much your suit wants it. I don’t care how much the stress-spattered eyes of your coworkers gleam in their endorsement. It. Doesn’t. Work.

A big part of the problem is there’s very little comparison. Most programmers learned of the Test-First strategy in their early programming years and saw it as more rigorous before they really had practical experience to go off of. Most never looked back and dismissed everything else as a frivolous pastime from their uneducated past. It doesn’t help that this is how kids are being taught these days in the first place.

I was once one of these newly educated kids. So, for a few years, I worked exclusively with the Test-First strategy. When it was over, the results were undeniable. The code—all of it—was horrible. Class projects, research code, contract programs, indie games, everything I’d written during that time—it was all slow, hard to read, and so buggy I probably should have cried. It passed the tests, but not much else.

It was also the slowest code to write. I was astounded by how much. It’s like a factor of five, at least.

And we shipped it! All of the horrible stuff I’d written, just like every other cog in the machine, got packaged up into a “.jar” or a “.exe” or a “.pdf” or a “.app” and got sent off to customers. To buy.

A big part of the arguments for TDD is that such complaints are anecdotal. This is a valid concern. But, it’s not hard to look at the situation as an outsider.

How often have you seen a program crash? If it was developed by a large software company, chances are it was written using TDD. Clearly, TDD is not a magic bullet. So, TDD does not “prove your code works”.

How long does development take? On occasion, I have singlehandedly written medium-size, feature-complete programs in a single day. Regularly, I write entire modules (a few datastructures and classes, some functions, assorted misc.) in the same amount of time. That is a week or a month to a team of engineers using TDD. I’d say my code is equivalent or better reliability (indeed, without naming names, I have written custom software to replace TDD 3rd-party code because the latter was too buggy). So, TDD takes far longer to achieve the same result.

I invite the industry reader to verify these two experiences for themself.

If you’d like to know why I believe these facts are, or still aren’t convinced, what follows are some more concrete reasons why TDD fails.

Good Design

The largest problem is that TDD restricts the elegance of finished designs. No one can magick a perfect design into existence, no matter how many design sessions you do on paper. As the Jargon File aptly notes:

[E]xperience has shown repeatedly that good designs arise only from evolutionary, exploratory interaction between one (or at most a small handful of) exceptionally able designer(s) and an active user population—and that the first try at a big new idea is always wrong.

Anyone who has any experience with software knows that this is true. This has been true from day one of computer programming. COBOL programmers know it. They did iterative programming on the ENIAC. This is how laws get written. This is how engineering gets done. This is how every single thing humans build has a chance at working. To say otherwise is denialism. It’s spitting in the face of a millennium of experience. Designs are unicorns. You should court them, get to know them. Develop a relationship. People who want to design everything first are looking for a shotgun wedding to cover up a one-night stand. It’s wrong.

So this is the first reason TDD fails: You’re trying to make a design before you learn anything about it.

What does this look like in software engineering?

If you’re working on your section of a module, it behooves you to constantly be changing your design around to make it perfect. Even in my “established” codebases, I still make occasional—nay, frequent—large-scale changes, since increased experience demonstrates faults in the original design, or new knowledge suggests a better one. This is healthy because it decouples current code from failed design decisions of the past.

Despite my codebase’s fairly large scope, even wide-reaching changes can be made quickly since there isn’t a battery of tests to delete and rewrite for each project affected. I have done massive restructurings of my core graphics library, for instance, on multiple occasions as I learned new things and as the old designs’ shortcomings became apparent. The fact that I could do this by myself to such a large volume of code should emphasize the fact that the same could not be achieved using TDD. I cannot even imagine how terrible my codebase would have been without restructuring its original design yet still adding features.

The Test-First design strategy discourages these frequent changes by increasing the amount of work it takes to modify anything. If you want to make a new function, you have to also make three new functions to test it. If you want to change what a function does, you have to change all the tests you wrote for it. It gets intractable. Eris help you if you want to refactor a class—let alone a class hierarchy. That’s getting to be a full day’s worth of error-ridden, painstaking work for something that should have taken fifteen minutes at worst.

So maybe you write tests at a coarser granularity. This is almost worse since it means you’re writing code to test algorithms. This requires more careful thought, rewards less partial success, and actually means more changes since algorithms change more than backends.

Catastrophic Design Failure

In practice, this leads to suboptimal designs. People write something one way, but then are afraid to change it because they’ll have to rewrite the testing code that goes along with it. These poor designs are actually often buggier, since they aren’t approaching the problem in the right way! They add tests to compatibility layers along with their algorithm. Then the cycle repeats and they add another boilerplate layer and merge it haphazardly with the first, adding a whole new battery of tests. So the problematic code stays as it is and becomes more and more protracted until it finally just needs to be thrown out and rewritten anyway.

Software engineers using the test-driven process get used to this kind of thing—so much so that they don’t even realize it’s happening. It’s only the new hire with no credibility who realizes that having three layers of semantically-void indirection to do something simple is inefficient and idiotic.

As proof, here’s a real example from a TDD project I had the misfortune to be affiliated with. The authors have requested anonymity. Just stew in the long, gory horror:

Function A is 147 lines long. It is the simple core of the program.
Function A is committed to the repository on June 26th, 2002. Function A has four test cases. Nevertheless, a bug is found in Function A on the 28th and a patch is uploaded on July 6th. It contains two new test cases.
This continues a bit. However, by August 2002, function A is mostly stable and has no fewer than thirteen test cases—mainly for fencepost errors and other idiotic things anyone can find with a stack trace or should have checked with assert(...). Except for a blip in early 2003, function A, now 152 lines long, is unchanged until mid-2006.
In 2006 the program expands to cover new functionality. This reveals function A to be a poor fit—it approaches the problem in a nonintuitive, non-general way. It was fine before, but now it is inadequate. So, in May of that year, two wrapper functions B and C for function A are added, with a unit test for each. It’s textbook perfect: they add a line in their SVN commit message: “WRapped [function A]. FIXME: refactor tests instead”. It never happens.
In June 2006, the program expands even further. Apparently, no one is sure how B and C work, so someone writes six wrapper functions D, E, F, G, H, and I for the wrapper functions B and C for function A. There are now eight wrapper functions with seven unit tests between them, and the unmodified function A which still has thirteen.
On July 1st, 2006, someone decides the growing mess isn’t object-oriented. Instead of refactoring everything, he makes function A a static method of a new class and makes B and C into member functions. Since D, E, F, G, H, and I are wrapper functions (for B and C), but they require some effort to understand, another programmer puts them into another class on July 9th, 2006. A hook into a newly integrated testing library is added to ensure that both new classes can be instantiated for testing. The code sits until 2007.
In January 2007, a programmer extends the functionality of A, not fully understanding the boilerplate over it. She (from the username, I think it’s a she) adds two new helper functions J and K to function A and delegates out the work. She refactors B and C to fix the test cases and inserts a new layer of abstraction between B and C and their wrapper functions D, E, F, G, H, and I to handle the differences. There are now 4131 lines of boilerplate wrapping functions A, J, and K, which are now 172 lines.
On July 1st, 2009, exactly three years after the same decision in 2006 (some internal company policy?), someone decides the growing mess isn’t object-oriented, again. Again, they don’t refactor the mess but instead push the entire thing into an inner class (called _Implementation) of yet another new class. Function A now has two helpers, three layers and two classes of wrapper functions, and almost 50 test cases holding the thing together. The implementation spans six files and almost 6000 lines.
A week later, someone realizes that wrapping everything in a class broke the instantiation of the inner class’s backing classes for the purpose of writing tests. So, they write a generic (like C++ templates, but not Turing-complete) class whose only function is to fake-copy-construct the backing classes, make them think they live in the inner class using an interface, and then delegate their unit tests in a sane manner.
In September 2009, the project’s repository gets corrupted by a bad push and, in the restoration, someone accidentally commits their personal branch of the module, which is a performance-enhanced fork of a version in 2006. The first one to notice the difference (on a push I assume) merges the two modules haphazardly, and comments out the test cases that don’t work. There are now more than ten-thousand lines of boilerplate code comprising four testing classes, a mess of wrapper functions (sorry; I lost count), and 81 test cases (of which 14 are silently failing). Function A now exists in two versions, because no one realized they do the same thing.
Meanwhile, the three coders who submitted the original function A and patches haven’t made a commit to this module since 2004.

All this came to my attention in early 2010, when I called the project leader’s attention to it. I questioned the choice of TDD, and he replied that it led to cleaner code. I then showed him an annotated SVN log, to which he took exception and said he would investigate. I stopped using the project soon after and then most of the developers moved to a closed-source fork, so I don’t know whether it ever got resolved. I frankly don’t care.

You would be very hard-pressed to convince me that the example above, a ten-thousand-line-monstrosity with redundant code that no one understands, is more reliable than a refactored function A would have been. This is an extreme example, of course, but it’s hard not to see the evidence of lesser evils in other TDD open source projects. Pick your favorite and look.

I dare you.

The lesson here, again, is that authors and maintainers where so afraid to change a bad design that they just plain didn’t.

Self-Confidence and Sloth

Many advocates claim TDD gives them more confidence that their code is right. As the above section demonstrates, perhaps this confidence is misplaced.

How confident are TDD developers? Very. In my experience, TDD too often produces code that works very nicely for the test cases, and simply doesn’t on real data. Frequently, my bug reports to such projects are automatically met with “no, we tested for something like that”. I have even pointed out exactly where the bug is, only to have its existence be denied. The arrogance TDD seems to cultivate—that any TDD code is somehow a higher standard of robustness—shields its developers from the preeminent fact that it is more often than not the opposite.

The problem is, TDD encourages programmers to write programs that fit those test cases—and not necessarily others. Ideally, if tests are written “well”, then any program that passes them is satisfactory. However, it is impossible in almost every case to specify the desired output of every possible input. Even if it is, it’s certainly not practical to.

Moreover the process itself encourages laziness. While writing the tests should be the most carefully focused part of TDD, it’s comparatively mindless to generating new code. From personal experience, I know that writing tests is wearying. There’s an attitude of “let’s just move on to the next test”. This laziness carries over into the target code itself: after hours of writing tests, just cruft some implementation together and hope it holds.

Time

It has become a running joke among software consumers by how late software development’s deadlines slip. Why does this happen?

It happens because the initial outlay of the project planned for only the cost of writing the code plus tests. It didn’t take into account the realization that the code wouldn’t actually work. Here’s a real, quoted schedule for a 12 week TDD project that came out of a planning meeting:

Week 1
: Setup and stuff.
Week 2
: Develop design.
Week 3-4
: Write tests.
Week 5-10
: Write code.
Week 11
: Feature freeze. Debugging.
Week 12
: Fix minor issues, deliver product.

Here’s what actually happened:

Week 1
: Setup and stuff.
Week 2
: Develop design.
Week 3
: Begin writing tests.
Week 3.5
: Discover design is literally impossible to implement.
Week 4
: Emergency design meeting, desperate tweaks to try to fix tests anyway.
Week 5-6
: Better design. Finish writing tests.
Week 7-8
: Two weeks late. Desperate coding.
Week 8-9
: Desperate coding and refactoring. Tweaked "initial" design to rubberstamp what we've already done.
Week 10
: Desperate coding; critical fixes.
Week 11
: Feature freeze; critical fixes.
Week 12
: [Horrifically embarrassing crash during product demo]

In a larger environment, the quality control team would have caught the bug, and ordered management to extend the project.

After I gave up TDD, I started getting emails about how-can-I-possibly-get-anything-done-without-TDD. How much can I get done without your TDD? Turns out: for the same quality . . . more than you.

I don’t think it’s just me, though. I think it’s just that I don’t use TDD.

My code is constructed loosely: at each stage of the process, I envision the key features of a design, and then build downwards and outwards. I check each function carefully, but I don’t let that distract me from visualizing how it all fits together. I put preliminary thought into my design, but by and large I improvise—and I do not fear radical changes halfway through. Moreover, I embrace them. If I think up a good architecture change, I make it immediately.

It is my belief that the complexity and dynamism of modern programs is so staggering that it can only be appreciated by understanding it intuitively. Writing tests forces one to formalize this complexity to the point of presupposing that the complexity of a modern program can be reduced to two or three simple cases. Writing test cases for programs is more often than not like trying to describe the Mona Lisa by sketching in crayon. The beautiful state and delicate control flow balanced so carefully in the programmer’s mind aren’t aided by the romp-tromp of test-driven boots; such things aren’t reducible to a few lines of description in a test file.

Even if you somehow succeed, TDD prevents incremental drafts by functionally requiring all tests for a module to pass before you get any real results. It’s test this, test that, and before you know it it’s a week later and your algorithm isn’t even half done—and when you finish it, you find out it doesn’t work. I’d have figured that out on day one, even if the implementation I used to do it were crap.

Conclusion

All the criticisms in my critique of test cases also apply, but for Test-Driven Development, the problems are specially preventable. The real bottom line is the same: software development doesn’t work at all if the people doing it are incompetent. No measure of unit testing or code review or whatever buzzphrase du jour will ever be thought up can guarantee that a program will work as intended. TDD is a bandaid over a larger problem: writing good code in the first place.

There is literally no substitute for competence. If your coders don’t have it, TDD won’t fix it. If they do have it, TDD will undermine it. The Test-First strategy discourages this sort of thought by giving false security in the form of a passed test suite. It leads to broken code in broken designs and allows people to feel proud of themselves anyway.

TDD hampers good design through code bloat and fragmentation. It arrogantly presupposes that designs can be built all at once, and it doesn’t even give better results. It should never be used.

Return to the list of blog posts.

Test-Driven Development is Stupid

Introduction and an Argument from Intuition

An Argument from Practicality

Good Design

Catastrophic Design Failure

Self-Confidence and Sloth

Time

Week 1

Week 2

Week 3-4

Week 5-10

Week 11

Week 12

Week 1

Week 2

Week 3

Week 3.5

Week 4

Week 5-6

Week 7-8

Week 8-9

Week 10

Week 11

Week 12

Conclusion