The Mechanics of Bug Injection with LAVA
This is the second in a series of posts about evaluating and improving bug detection software by automatically injecting bugs into programs. Part one, which discussed the setting and motivation, is available here.
Now that we understand why we might want to automatically add bugs to programs, let's look at how we can actually do it. We'll first investigate an existing approach (mutation testing), show why it doesn't work very well in our scenario, and then develop a more sophisticated injection technique that tells us exactly how to modify the program to insert bugs that meet the goals we laid out in the introductory post.
One way of approaching the problem of bug injection is to just pick parts of the program that we think are currently correct and then mutate them somehow. This, essentially, is the idea behind mutation testing: you use some predefined mutation operators that mangle the program somehow and then declare that it is now buggy.
For example, we could take every instance of strncpy and change it to strcpy. Presumably, this would add lots of potential buffer overflows to a program that previously had none.
Unfortunately, this method has a couple problems. First, it is likely that many such changes will break the program on every input, which would make the bug trivial to find. The following program will always fail if strncpy is changed to strcpy:
We also face the opposite problem: if the bug doesn't trigger every time, we won't necessarily know how to trigger it when we want to. This will make it hard to prove that there really is a bug, and violates one of the requirements we described last time: each bug must come with a triggering input that proves the bug exists. If we wanted to find the triggering input for a given mutation, we'd have to find an input that reaches our mutant, which is actually a large part of what makes finding bugs hard!
Instead of doing random, local mutations, LAVA first tries to characterize the program's behavior on some concrete input. We'll run the program on an input file, and then try to see where that input data reaches in the program. This solves the triggering program because we will know a concrete path through the program, and the input needed to traverse that path. Now, if we can place bugs in code along that path, we will be able to reach them using the concrete input we know about.
We need a couple other properties. Because we want to create bugs that are triggered only for certain values, we will want the ability to manipulate the input of the program. However, doing so might cause the program to take a different path, and the input data may get transformed along the way, making it difficult to predict what value it will have when we actually want to use it to trigger our bug.
To resolve this, we will try to find parts of the program's input data that are:
We'll call data that satisfies these three properties a DUA. DUAs try to capture the notion of attacker-controlled data: a DUA is something that can be set to an arbitrary value without changing the program's control flow, is available somewhere along the program path we're interested in, and whose value is predictable.
Having defined these properties, we need some way to measure them. We'll do that using a technique called dynamic taint analysis1. You can think of dynamic taint analysis like a PET scan or a barium swallow, where a radionuclide is introduced into a patient, allowed to propagate throughout the body, and then a scan checks to see where it ends up. Similarly, with taint analysis, we can mark some data, allow it to propagate through the program, and later query to see where it ended up. This is an extremely useful feature in all sorts of reverse engineering and security tasks.
To find out where input data is available, we can taint the input data to the program – essentially assigning a unique label to each byte of the program's input. Then, as the program runs, we'll propagate those labels as data is copied around the program, and query any variables in scope as the program runs to see if they are derived from some portion of the input data, and if so, from precisely which bytes.
Next, we want to figure out what data is currently unused. To do so, we'll extend simple dynamic taint analysis by checking, every time there's a branch in the program, whether the data used to decide it was tainted, and if so, which input bytes were used make the decision. At the end, we'll know exactly how many branches in the program each byte of the input was used to decide. This measure is known as liveness.
Finally, we want some measure of how complicated the data in each tainted program variable is. We can do this with another addition to the taint analysis. In standard taint analysis, whenever data is copied or computed in the program, the taint system checks if the source operands are tainted and if so propagates the taint labels to the destination. If we want to measure how complicated a piece of data is – that is, how much it has been changed since it was first introduced to the program – we can simply add a new rule that increments a counter whenever an arithmetic operation on tainted data occurs. That is, if you have something like c = a + b; then the taint compute number (TCN) of c is tcn(c) = max(tcn(a),tcn(b)) + 1.
On the implementation side, all this is done using PANDA, our platform for dynamic analysis. PANDA's taint system allows us to taint an input file with unique byte labels. To query the state of program variables, we use a clang tool that modifies the original program source code2 to add code that asks PANDA to query and log the taint information about a particular program variable. When we run the program under PANDA, we'll get a log telling us exactly which program variables were tainted, how complicated the data was, and how live each byte of input is.
After running PANDA, we can pick out the variables that are uncomplicated and derived from input bytes with low liveness. These are our DUAs, approximations of attacker controlled data that can be used to create bugs.
The last step is to modify the source code to add our bug. We will insert code in two places:
To see LAVA in action, let's step through a full example. Have a look at this small program, which parses and prints information about a very simple binary file format:
We start by instrumenting the source code to add taint queries. The queries will be inserted to check taint on program variables, and, for aggregate data structures, the members inside each structure. The result is a bit too long to include inline, since it quadruples the size of the original program, but you can see it in this gist.
When we compile and run that program on some input inside of PANDA with taint tracking enabled, we get information about taint compute numbers and the liveness of each byte of the input. For example, here's the liveness map for a small (88 byte) input:
LAVA's analysis finds 82 DUAs and 8 attack points, for a total of 407 potential bugs. Not all of these bugs will be viable: because we want to measure the effect of liveness and taint compute number, the current implementation does not impose limits on how live or complicated the DUAs used in bugs are.
To make sure that an injected bug really is a bug, we do two tests. First, we run the modified program on a non-triggering input, and verify that it runs correctly. This ensures that we didn't accidentally break the program in a way we weren't expecting. Second, we run it on the triggering input and check that it causes a crash (a segfault or bus error). If it passes both tests we deem it a valid bug. This could miss some valid bugs, of course – not all memory corruptions will cause the program to crash – but we're interested mainly in bugs that we can easily prove are real. Another approach might be to run the buggy program under Address Sanitizer and check to see if it flags any memory errors. After validation, we find that LAVA is able to inject 159 bugs into the toy program, for a yield of around 39%.
Let's look at an example bug (I've cleaned up the source a little bit by hand to make it easier to read; programmatically generated code is not pretty):
On lines 6–15, after parsing the file header, we add code that saves off the value of the reserved field3, which our analysis correctly told us was dead, uncomplicated, and available in head.reserved. Then, on line 20, we retrieve the value and conditionally add it to the pointer ent that is being passed to consume_record (checking the value in both possible byte orders, because endianness is hard). When consume_record tries to access fields inside the file_entry, it crashes. In this case, the DUA and attack point were in the same function, and so the use of a global variable was not actually necessary, but in a larger program the DUA and attack point could be in different functions or even different compilation units.
If you like, you can download all 407 buggy program versions, along with the original source code and triggering inputs. Note that the current implementation does not make any attempt to hide the bugs from human eyes, so you will very easily be able to spot them by looking at the source code.
Having developed a bug injection system, we would like to know how well it performs. In the next post, we'll examine questions of evaluation: how many bugs can we inject, and how do the liveness and taint compute measures influence the number of viable bugs? How realistic are the bugs? (much more complicated than it may first appear!) And how effective are some common bug-finding techniques like symbolic execution and fuzzing? We'll explore all these and more.
1 Having worked with dynamic program analysis for so long, I sometimes forget how ridiculous the term "dynamic taint analysis" is. If you're looking for another way to say the same thing, you can use "information flow" but dynamic taint analysis is the name that seems to have stuck.↩
2 Getting taint information by instrumenting the source works, but has a few drawbacks. Most notably, it causes a huge increase in the size of the source program, and slows it down dramatically. We're currently finishing up a new method, pri_taint, which can do the taint queries on uninstrumented programs as long as they have debug symbols. This should allow LAVA to scale to larger programs like Firefox.↩
3 The slightly weird ({ }) construct is a non-standard extension to C called a statement expression. It allows multiple statements to be executed in a block with control over what the block as a whole evaluates to. It's a nice feature to have available for automatically generated code, as it allows you to insert arbitrary statements in the middle of an expression without worrying about messing up the evaluation.↩
Now that we understand why we might want to automatically add bugs to programs, let's look at how we can actually do it. We'll first investigate an existing approach (mutation testing), show why it doesn't work very well in our scenario, and then develop a more sophisticated injection technique that tells us exactly how to modify the program to insert bugs that meet the goals we laid out in the introductory post.
A Mutant Strawman that Doesn't Work
One way of approaching the problem of bug injection is to just pick parts of the program that we think are currently correct and then mutate them somehow. This, essentially, is the idea behind mutation testing: you use some predefined mutation operators that mangle the program somehow and then declare that it is now buggy.
For example, we could take every instance of strncpy and change it to strcpy. Presumably, this would add lots of potential buffer overflows to a program that previously had none.
Unfortunately, this method has a couple problems. First, it is likely that many such changes will break the program on every input, which would make the bug trivial to find. The following program will always fail if strncpy is changed to strcpy:
We also face the opposite problem: if the bug doesn't trigger every time, we won't necessarily know how to trigger it when we want to. This will make it hard to prove that there really is a bug, and violates one of the requirements we described last time: each bug must come with a triggering input that proves the bug exists. If we wanted to find the triggering input for a given mutation, we'd have to find an input that reaches our mutant, which is actually a large part of what makes finding bugs hard!
Dead, Uncomplicated and Available Data
Instead of doing random, local mutations, LAVA first tries to characterize the program's behavior on some concrete input. We'll run the program on an input file, and then try to see where that input data reaches in the program. This solves the triggering program because we will know a concrete path through the program, and the input needed to traverse that path. Now, if we can place bugs in code along that path, we will be able to reach them using the concrete input we know about.
We need a couple other properties. Because we want to create bugs that are triggered only for certain values, we will want the ability to manipulate the input of the program. However, doing so might cause the program to take a different path, and the input data may get transformed along the way, making it difficult to predict what value it will have when we actually want to use it to trigger our bug.
To resolve this, we will try to find parts of the program's input data that are:
- Dead: not currently used much in the program (i.e., we can set to arbitrary values)
- Uncomplicated: not altered very much (i.e., we can predict their value throughout the program's lifetime)
- Available in some program variables
We'll call data that satisfies these three properties a DUA. DUAs try to capture the notion of attacker-controlled data: a DUA is something that can be set to an arbitrary value without changing the program's control flow, is available somewhere along the program path we're interested in, and whose value is predictable.
Measuring Liveness and Complication with Dynamic Taint Analysis
Having defined these properties, we need some way to measure them. We'll do that using a technique called dynamic taint analysis1. You can think of dynamic taint analysis like a PET scan or a barium swallow, where a radionuclide is introduced into a patient, allowed to propagate throughout the body, and then a scan checks to see where it ends up. Similarly, with taint analysis, we can mark some data, allow it to propagate through the program, and later query to see where it ended up. This is an extremely useful feature in all sorts of reverse engineering and security tasks.
Like a PET scan, dynamic taint analysis works by seeing where marked input ends up in your program. |
To find out where input data is available, we can taint the input data to the program – essentially assigning a unique label to each byte of the program's input. Then, as the program runs, we'll propagate those labels as data is copied around the program, and query any variables in scope as the program runs to see if they are derived from some portion of the input data, and if so, from precisely which bytes.
Next, we want to figure out what data is currently unused. To do so, we'll extend simple dynamic taint analysis by checking, every time there's a branch in the program, whether the data used to decide it was tainted, and if so, which input bytes were used make the decision. At the end, we'll know exactly how many branches in the program each byte of the input was used to decide. This measure is known as liveness.
Liveness measures how many branches use each input byte. |
Finally, we want some measure of how complicated the data in each tainted program variable is. We can do this with another addition to the taint analysis. In standard taint analysis, whenever data is copied or computed in the program, the taint system checks if the source operands are tainted and if so propagates the taint labels to the destination. If we want to measure how complicated a piece of data is – that is, how much it has been changed since it was first introduced to the program – we can simply add a new rule that increments a counter whenever an arithmetic operation on tainted data occurs. That is, if you have something like c = a + b; then the taint compute number (TCN) of c is tcn(c) = max(tcn(a),tcn(b)) + 1.
TCN measures how much computation has been done on a variable at a given point in the program.
|
On the implementation side, all this is done using PANDA, our platform for dynamic analysis. PANDA's taint system allows us to taint an input file with unique byte labels. To query the state of program variables, we use a clang tool that modifies the original program source code2 to add code that asks PANDA to query and log the taint information about a particular program variable. When we run the program under PANDA, we'll get a log telling us exactly which program variables were tainted, how complicated the data was, and how live each byte of input is.
PANDA's taint system allows us to find DUAs in the program. |
After running PANDA, we can pick out the variables that are uncomplicated and derived from input bytes with low liveness. These are our DUAs, approximations of attacker controlled data that can be used to create bugs.
Finding Attack Points
With some DUAs in hand, we now have the raw material we need to create our bugs. The last missing piece is finding some code we want to have an effect on. These are places where we can use the data from a DUA to trigger some buggy effect on the program, which we call attack points (ATP). In our current implementation, we look for places in the program where pointers are passed into functions. We can then use the DUA to modify the pointer, which will hopefully cause the program to perform an out of bounds read or write – a classic memory safety violation.
Because we want the bug to trigger only under certain conditions, we will also add code at the attack point that checks if the data from the DUA has a specific value or is in a specific range of values. This gives us some control over how much of the input space triggers the bug. The current implementation can produce both specific-value triggers (DUA == magic_value) and range-based triggers of varying sizes (x < DUA < y).
Each LAVA bug, then is just a pair (DUA, ATP) where the attack point occurs in the program trace after the DUA. If there are many DUAs and many attack points, then we will be able to inject a number of bugs roughly proportional to the product of the two. In large programs like Wireshark, this adds up to hundreds of thousands of potential bugs for a single input file! In our tests, multiple files increased the number of bugs roughly linearly, in proportion to the amount of coverage achieved by the extra input. Thus, with just a handful of input files on a complex program you can easily reach millions of bugs.
Because we want the bug to trigger only under certain conditions, we will also add code at the attack point that checks if the data from the DUA has a specific value or is in a specific range of values. This gives us some control over how much of the input space triggers the bug. The current implementation can produce both specific-value triggers (DUA == magic_value) and range-based triggers of varying sizes (x < DUA < y).
Each LAVA bug, then is just a pair (DUA, ATP) where the attack point occurs in the program trace after the DUA. If there are many DUAs and many attack points, then we will be able to inject a number of bugs roughly proportional to the product of the two. In large programs like Wireshark, this adds up to hundreds of thousands of potential bugs for a single input file! In our tests, multiple files increased the number of bugs roughly linearly, in proportion to the amount of coverage achieved by the extra input. Thus, with just a handful of input files on a complex program you can easily reach millions of bugs.
Our "formula" for injecting a bug. Any (DUA, ATP) pair where the DUA occurs before the attack point is a potential bug we can inject. |
Modifying the Source Code
The last step is to modify the source code to add our bug. We will insert code in two places:
- At the DUA site, to save a copy of the input data to a global variable.
- At the attack point, to retrieve the DUA's data, check if it satisfies the trigger condition, and use it to corrupt the pointer.
A Toy Example
To see LAVA in action, let's step through a full example. Have a look at this small program, which parses and prints information about a very simple binary file format:
We start by instrumenting the source code to add taint queries. The queries will be inserted to check taint on program variables, and, for aggregate data structures, the members inside each structure. The result is a bit too long to include inline, since it quadruples the size of the original program, but you can see it in this gist.
When we compile and run that program on some input inside of PANDA with taint tracking enabled, we get information about taint compute numbers and the liveness of each byte of the input. For example, here's the liveness map for a small (88 byte) input:
Liveness map for the input to our toy program. The bytes with a white background are completely dead – they can be set to arbitrary values without affecting the behavior of the program. |
LAVA's analysis finds 82 DUAs and 8 attack points, for a total of 407 potential bugs. Not all of these bugs will be viable: because we want to measure the effect of liveness and taint compute number, the current implementation does not impose limits on how live or complicated the DUAs used in bugs are.
To make sure that an injected bug really is a bug, we do two tests. First, we run the modified program on a non-triggering input, and verify that it runs correctly. This ensures that we didn't accidentally break the program in a way we weren't expecting. Second, we run it on the triggering input and check that it causes a crash (a segfault or bus error). If it passes both tests we deem it a valid bug. This could miss some valid bugs, of course – not all memory corruptions will cause the program to crash – but we're interested mainly in bugs that we can easily prove are real. Another approach might be to run the buggy program under Address Sanitizer and check to see if it flags any memory errors. After validation, we find that LAVA is able to inject 159 bugs into the toy program, for a yield of around 39%.
Let's look at an example bug (I've cleaned up the source a little bit by hand to make it easier to read; programmatically generated code is not pretty):
On lines 6–15, after parsing the file header, we add code that saves off the value of the reserved field3, which our analysis correctly told us was dead, uncomplicated, and available in head.reserved. Then, on line 20, we retrieve the value and conditionally add it to the pointer ent that is being passed to consume_record (checking the value in both possible byte orders, because endianness is hard). When consume_record tries to access fields inside the file_entry, it crashes. In this case, the DUA and attack point were in the same function, and so the use of a global variable was not actually necessary, but in a larger program the DUA and attack point could be in different functions or even different compilation units.
If you like, you can download all 407 buggy program versions, along with the original source code and triggering inputs. Note that the current implementation does not make any attempt to hide the bugs from human eyes, so you will very easily be able to spot them by looking at the source code.
Next Time
Having developed a bug injection system, we would like to know how well it performs. In the next post, we'll examine questions of evaluation: how many bugs can we inject, and how do the liveness and taint compute measures influence the number of viable bugs? How realistic are the bugs? (much more complicated than it may first appear!) And how effective are some common bug-finding techniques like symbolic execution and fuzzing? We'll explore all these and more.
1 Having worked with dynamic program analysis for so long, I sometimes forget how ridiculous the term "dynamic taint analysis" is. If you're looking for another way to say the same thing, you can use "information flow" but dynamic taint analysis is the name that seems to have stuck.↩
2 Getting taint information by instrumenting the source works, but has a few drawbacks. Most notably, it causes a huge increase in the size of the source program, and slows it down dramatically. We're currently finishing up a new method, pri_taint, which can do the taint queries on uninstrumented programs as long as they have debug symbols. This should allow LAVA to scale to larger programs like Firefox.↩
3 The slightly weird ({ }) construct is a non-standard extension to C called a statement expression. It allows multiple statements to be executed in a block with control over what the block as a whole evaluates to. It's a nice feature to have available for automatically generated code, as it allows you to insert arbitrary statements in the middle of an expression without worrying about messing up the evaluation.↩
Comments
LAVA seems like a very useful tool. Will it be open sourced? Also, is it possible for you to at least release some of the buggy applications that you used in your paper? This will help other researchers to evaluate vulnerability detection tools taking such applications as benchmark.
Thanks