Of Bugs and Baselines
Summary: recently published results on the LAVA-M synthetic bug dataset are exciting. However, I show that much simpler techniques can also do startlingly well on this dataset; we need to be cautious in our evaluations and not rely too much on getting a high score on a single benchmark.
A New Record
The LAVA synthetic bug corpora have been available now for about a year and a half. I've been really excited to see new bug-finding approaches (particularly fuzzers) use the LAVA-M dataset as a benchmark, and to watch as performance on that dataset steadily improved. Here's how things have progressed over time.
Angora (Chen and Chen, to appear at Oakland ’18), on the far right, has come within striking distance of perfect performance on the dataset! This is a great result, and the authors pulled it off by combining several really neat techniques (I mention a few of them in this Twitter thread). They also managed to do it in just 10 minutes for base64, md5sum, and uniq, and 45 minutes for who. Kudos to the authors!
My feeling is that the effective “shelf life” of this dataset is pretty much up – current techniques are very close to being able to cover everything in this dataset. This is perhaps not too surprising, since the coreutils are quite small, and the trigger mechanism used by the original LAVA system (a comparison against a 4 byte magic number) is a bit too simple since you can often solve it by extracting constants from the binary. Luckily, we have been working on new bug injection techniques and new corpora, and we will hopefully have some results to announce soon – watch this space :)
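To make the "extract constants from the binary" point concrete, here is a rough Python sketch of the shape of a LAVA-style trigger. The real bugs are injected into C source and guard memory corruption; the magic value below is an arbitrary stand-in, not one of the corpus's actual trigger values.

```python
import struct

LAVA_MAGIC = 0x6C617661  # arbitrary stand-in for a per-bug 4-byte trigger value

def process(buf: bytes) -> None:
    # LAVA routes some attacker-controlled input bytes to the bug site and
    # compares them against a fixed magic number; only an exact match
    # triggers the bug. Because that magic ends up as a constant in the
    # compiled binary, a dictionary built from the binary can supply it.
    if len(buf) >= 4 and struct.unpack_from("<I", buf, 0)[0] == LAVA_MAGIC:
        raise RuntimeError("bug triggered")  # stand-in for the injected memory-safety bug
```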
Covering your Baseline
When talking about progress in bug-finding, it is important to figure out a good baseline. One such baseline, of course, is the set of results we provided in the original paper (the first two bars on the graph). However, during the summer following the paper's publication, I pointed out two fairly simple ways to strengthen the baseline for fuzzing: using a constant dictionary and using a compiler pass to split up integer comparisons. The former is built into AFL, and the latter was implemented in an AFL fork by laf-intel.
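To illustrate the second technique, here is a minimal Python sketch of why splitting a 32-bit comparison into byte-sized comparisons helps a coverage-guided fuzzer. (laf-intel's actual pass rewrites LLVM IR at compile time; the magic constant here is again an arbitrary example.)

```python
MAGIC = 0x6C617661  # arbitrary example 4-byte comparison target

def check_unsplit(x: int) -> bool:
    # One 32-bit comparison: coverage feedback is all-or-nothing, so the
    # fuzzer effectively has to guess all four bytes at once (~2^32 odds).
    return x == MAGIC

def check_split(x: int) -> bool:
    # laf-intel-style rewrite: one branch per byte. Each correctly guessed
    # byte reaches a new branch, giving the fuzzer new coverage to keep and
    # mutate further: four ~2^8 guesses instead of one ~2^32 guess.
    if ((x >> 24) & 0xFF) != ((MAGIC >> 24) & 0xFF):
        return False
    if ((x >> 16) & 0xFF) != ((MAGIC >> 16) & 0xFF):
        return False
    if ((x >> 8) & 0xFF) != ((MAGIC >> 8) & 0xFF):
        return False
    return (x & 0xFF) == (MAGIC & 0xFF)
```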
What does our baseline look like when we use these techniques? To test this, I ran fuzzers on the LAVA-M dataset for 1 hour per program. This is a very short amount of time, but the intent here is to see what's possible with minimal effort and simple techniques. For the dictionary, I used a simple script that extracts strings and also runs objdump to extract integer constants. I tested:
- AFL with a dictionary
- laf-intel with AFL's "fidgety" mode (the -d option) and LAF_SPLIT_COMPARES=1
- The same laf-intel configuration, but with a 256KiB map size rather than the default 64KiB.
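As an aside, the dictionary-extraction step mentioned above can be approximated with a short script along these lines. This is a rough sketch (not necessarily the exact script used here): it collects printable strings plus `$0x...` immediates from objdump's disassembly and prints them in AFL's dictionary format.

```python
#!/usr/bin/env python3
# Rough sketch: build an AFL dictionary from a target binary by collecting
# printable strings plus integer immediates from objdump's disassembly.
import re
import struct
import subprocess
import sys

def afl_escape(tok: bytes) -> str:
    """Escape a token for AFL's name="value" dictionary format."""
    out = []
    for b in tok:
        if 32 <= b < 127 and chr(b) not in '"\\':
            out.append(chr(b))
        else:
            out.append("\\x%02x" % b)
    return "".join(out)

def main(binary: str) -> None:
    tokens = set()

    # Printable ASCII strings, like the `strings` utility would find.
    data = open(binary, "rb").read()
    for m in re.finditer(rb"[\x20-\x7e]{4,20}", data):
        tokens.add(m.group())

    # Integer constants: $0x... immediates in objdump's AT&T-syntax output,
    # packed as 4-byte little-endian values.
    asm = subprocess.run(["objdump", "-d", binary],
                         capture_output=True, text=True).stdout
    for m in re.finditer(r"\$0x([0-9a-f]{3,8})", asm):
        tokens.add(struct.pack("<I", int(m.group(1), 16) & 0xFFFFFFFF))

    for i, tok in enumerate(sorted(tokens)):
        print('token_%d="%s"' % (i, afl_escape(tok)))

if __name__ == "__main__":
    main(sys.argv[1])
```

A file containing this output can then be handed to afl-fuzz with the -x option.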
Here's what our new baseline results look like when put next to published work.
In this light, published tools are still much stronger than our original weak baseline, but (aside from Angora) are not better than AFL with a dictionary. I suspect that this says more about deficiencies in our dataset than about the tools themselves (more on this later).
Who's Who
Because the who program has so many bugs, and published tools seem to have had some trouble with it, I also wanted to see how different techniques would fare when given 24 hours to chew on it. I should note here, for clarity, that I do not think 2000+ bugs in who is a reasonable benchmark by which to judge bug-finding tools. You should not use the results of this test to decide what you will use to fuzz your software. But it does help us characterize what works well on LAVA-style bugs in a very simple, small program that takes structured binary input.
The contestants this time:
- AFL, default mode, with dictionary
- AFL, fidgety mode (-d option), with dictionary
- AFLFast, with dictionary
- AFL, fidgety mode, with the laf-intel pass, with all four combinations of {dict,nodict} x {64KiB,256KiB} map.
The top of the chart, at 2136, represents the total number of (validated) bugs in who. We can see that the combination of our techniques comes close to finding all of the bugs (93.7%), and any variant of AFL that was able to use a dictionary does very well. The only exception to this trend is AFL with default settings – this is because by default, AFL starts with deterministic fuzzing. With a large dictionary, this can take an inordinately long time: we noticed that after 24 hours, default AFL still hadn't finished a full cycle, and was only 12% of the way through its deterministic stage. Fidgety mode (the -d option) bypasses deterministic fuzzing and does much better here.
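Some very rough arithmetic shows why. AFL's deterministic "user extras" stages try dictionary tokens at essentially every byte offset of a queue entry (once overwriting, once inserting), with an internal cap on how many tokens get this treatment; the numbers below are purely hypothetical, just to give a feel for the scale.

```python
# Back-of-envelope for the cost of AFL's deterministic dictionary stages.
# All numbers below are hypothetical, chosen only for illustration.
dict_tokens = 200     # tokens tried deterministically (AFL caps this internally)
input_len = 500       # bytes in one queue entry
execs_per_sec = 1000  # hypothetical execution speed

# Overwrite + insert passes: roughly one execution per (token, offset) pair each.
execs = 2 * dict_tokens * input_len
print("~%d executions, ~%.1f minutes for a single queue entry"
      % (execs, execs / execs_per_sec / 60))
# -> ~200000 executions, ~3.3 minutes. Multiply by hundreds of queue entries
#    (plus the bitflip/arithmetic stages) and a 24-hour budget vanishes quickly.
```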
Notes on Seed Selection
When we put together the LAVA-M benchmark, one thing we thought would be important was to include the seed file we used for our experiments. The performance of mutation-based fuzzers like AFL can vary significantly depending on the quality of their input corpus, so to reproduce our results you really need to know what seed we used. Sadly, I almost never see papers on fuzzing actually talk about their seeds. (Perhaps they usually start off with no seeds? But this is an unrealistic worst case for AFL.)
In the experiments above, I used the corpus-provided seeds for all programs except md5sum. In the case of md5sum, the program is run with "md5sum -c", which causes it to compute and check the MD5 of each file listed in the input file. The seed we used in the LAVA-M corpus lists 20 files, which slows down md5sum significantly, since it has to compute 20 MD5 sums on every execution. Replacing the seed with one that only lists a single, small file allows AFL to achieve more than 10 times as many executions per second. In retrospect, our seed choice when running the experiments for the paper was definitely sub-optimal.
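As a concrete illustration, a minimal seed of the kind described here can be generated with a couple of lines of Python; the file names and contents below are arbitrary examples, not the actual corpus files.

```python
# Sketch: build a minimal `md5sum -c` seed that references one tiny file,
# so each execution during fuzzing hashes almost nothing.
import hashlib

payload = b"A"
with open("small.bin", "wb") as f:
    f.write(payload)

with open("seed.md5", "w") as f:
    # `md5sum -c` expects lines of the form "<hex digest>  <filename>".
    f.write("%s  small.bin\n" % hashlib.md5(payload).hexdigest())
```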
Conclusion
The upshot? We can almost completely solve the LAVA-M dataset using stock AFL with a dictionary. The only utility that this doesn't work for is "who"; however, a few simple tricks (a dictionary, comparison splitting, and skipping the deterministic phase of AFL with "-d") and 24 hours suffice to get us to very high coverage (1731/2136 bugs, or 81%). So although the results of fuzzers like Angora are exciting, we should be cautious not to read too much into their performance on LAVA-M and instead also ask how they perform at finding important bugs in programs we care about. (John Regehr points out that the most important question to ask about a bug-finding paper is “Did you report the bugs, and what did they say?”)
To be honest, I think this is not a great result for the LAVA-M benchmark. If a fairly simple baseline can find most of its bugs, then we have miscalibrated its difficulty: ideally we want our bug datasets to be a bit beyond what the best techniques can do today. This also confirms our intuition that it's time to create a new, harder dataset of bugs!
Acknowledgements
Thanks to Caroline Lemieux and Kevin Laeufer for feedback on a draft of this post, which fixed several mistakes and provided several interesting insights (naturally, if you see any more errors – that's my fault). Additional thanks are due to Kevin Laeufer (again), Alex Gantman, John Regehr, Kristopher Micinski, Sam Tobin-Hochstadt, Michal Zalewski (aka lcamtuf), and Valentin Manès (aka Jiliac) for a lively Twitter argument that turned into some actual useful science.