How to add a million bugs to a program (and why you might want to)

This is the first in a series of posts about evaluating and improving bug detection software by automatically injecting bugs into programs. You can find part two, with technical details of our bug injection technique, here.

In this series of posts, I'm going to describe how to automatically put bugs in programs, a topic on which we just published a paper at Oakland, one of the top academic security conferences. The system we developed, LAVA, can put millions of bugs into real-world programs. Why would anyone want to do this? Are my coauthors and I sociopaths who just want to watch the world burn? No, but to see why we need such a system requires a little bit of background, which is what I hope to provide in this first post.

I am sure this will come as a shock to most, but programs written by humans have bugs. Finding and fixing them is immensely time consuming; just how much of a developer's time is spent debugging is hard to pin down, but estimates range between 40% and 75%. And of course these errors can be not only costly for developers but catastrophic for users: attackers can exploit software bugs to run their own code, install malware, set your computer on fire, etc.

Weekly World News has known about this problem for years.

It should come as little surprise, then, that immense effort has been expended in finding ways to locate and fix bugs automatically. On the academic side, techniques such as fuzzing, symbolic execution, model checking, abstract interpretation, and creative combinations of those techniques, have been proposed and refined for the past 25 years. Nor has industry been idle: companies like Coverity, Fortify, Veracode, Klocwork, GrammaTech, and many more will happily sell (or rent) you a product that automatically finds bugs in your program.

Great, so by now we must surely have solved the problem, right? Well, not so fast. We should probably check to see how well these tools and techniques work. Since they're detectors, the usual way would be to measure the false positive and false negative rates. To measure false positives, we can just run one of these tools on our program, go through the output, and decide whether we think each bug it found is real.

The same strategy does not work for measuring false negatives. If a bug finder reports finding 42 bugs in a program, we have no way of knowing whether that's 99% or 1% of the total. And this seems like the piece of information we'd most like to have!

Heartbleed: detectable with static analysis tools, but only after the fact.

To measure false negatives we need a source of bugs so that we can tell how many of them our bug-finder detects. One strategy might be to look at historical bug databases and see how many of those bugs are detected. Unfortunately, these sorts of corpora are fixed in size – there are only so many bugs out there, and analysis tools will, over time, be capable of detecting most of them. We can see how this dynamic played out with Heartbleed: shortly after the bug was found, Coverity and GrammaTech quickly found ways to improve their software so that it could find Heartbleed.

Let me be clear – it's a good thing that vendors can use test cases like these to improve their products! But it's bad when these test cases are in short supply, leaving users with no good way of evaluating false negatives and bug finders with no clear path to improving their techniques.

This is where LAVA enters the picture. If we can find a way to automatically add realistic bugs to pre-existing programs, we can both measure how well current bug finding tools are doing, and provide an endless stream of examples that bug-finding tools can use to get better.

LAVA: Large-scale Automated Vulnerability Analysis

Goals for Automated Bug Corpora

So what do we want out of our bug injection? In our paper, we defined five goals for automated bug injection, requiring that injected bugs

Be cheap and plentiful
Span the execution lifetime of a program
Be embedded in representative control and data flow
Come with a triggering input that proves the bug exists
Manifest for a very small fraction of possible inputs

The first goal we've already discussed – if we want to evaluate tools and enable "hill climbing" by bug finders we will want a lot of bugs. If it's too expensive to add a bug, or if we can only add a handful per program, then we don't gain much by doing it automatically – expensive humans can already add small numbers of bugs to programs by hand.

The next two relate to whether our (necessarily artificial) bugs are reasonable proxies for real bugs. This is a tricky and contentious point, which we'll return to in part three. For now, I'll note that the two things called out here – occurring throughout the program and being embedded in "normal" control and data flow – are intended to capture the idea that program analyses will need to do essentially the same reasoning about program behavior to find them as they would for any other bugs. In other words, they're intended to help ensure that getting better at finding LAVA bugs will make tools better at understanding programs generally.

The fourth is important because it allows us to demonstrate, conclusively, that the bugs we inject are real problems. Concretely, with LAVA we can demonstrate an input for each bug we inject that causes the program to crash with a segfault or bus error.

The final property is critical but not immediately obvious. We don't want the bugs we inject to be too easy to find. In particular, if a bug manifests on most inputs, then it's trivial to find it – just run the program and wait for the crash. We might even want this to be a tunable parameter, so that we could specify what fraction of the input space of a program causes a crash and dial the difficulty of finding the right input up or down.

Ethics of Bug Injection

A common worry about bug injection is that it could be misused to add backdoors into legitimate software. I think these worries are, for the most part, misplaced. To see why, consider the goals of a would-be attacker trying to sneak a backdoor into some program. They want:

A way to get the program to do something bad on some secret input.
Not to get caught (i.e., to be stealthy, and for the bugs to be deniable).

Looking at (1), it's clear that one bug suffices to achieve the goal; there's no need to add millions of bugs to a program. Indeed, adding millions of bugs harms goal (2) – it would require lots of changes to the program source, which would be very difficult to hide.

An attempted Linux kernel backdoor attempt from 2003. Can you spot the bugdoor?

In other words, the benefit that LAVA provides is in adding lots of bugs at scale. An attacker that wants to add a backdoor can easily do it by hand – they only need to add one, and even if it takes a lot of effort to understand the program, that effort will be rewarded with extra stealth and deniability. Although the bugs that LAVA injects are realistic in many ways, they do not look like mistakes a programmer would have naturally made, which means that manual code review would be very likely to spot them.

(There is one area where LAVA might help a would-be attacker – the analysis we do to locate portions of the program that have access to attacker controlled input could conceivably speed up the process of inserting a backdoor by hand. But this analysis is quite general, and is useful for far more than just adding bugs to programs.)

The Road Ahead

The next post will discuss the actual mechanics of automated bug injection. We'll see how, using some new taint analyses in PANDA we can analyze a program to find small modifications that cause attacker-controlled input to reach sensitive points in the program and selectively trigger memory safety errors when the input is just right.

Once we understand how LAVA works, the final post will be about evaluation: how can we tell if LAVA succeeded in its goals of injecting massive numbers of realistic bugs? And how well do current bug-finders fare at finding LAVA bugs?

Credits

The idea for LAVA originated with Tim Leek of MIT Lincoln Laboratory. Our paper lists authors alphabetically, because designing, implementing and testing it truly was a group effort. I am honored to share a byline with Patrick Hulin, Engin Kirda, Tim Leek, Andrea Mambretti, Wil Robertson, Frederick Ulrich, and Ryan Whelan.

Push the Red Button