Wednesday, October 17, 2018

A couple ideas that went nowhere

I suspect a lot of people in academia end up with ideas and projects that went nowhere for any number of reasons – maybe there were insurmountable technical challenges, maybe the right person to work on it never materialized, or maybe it just got crowded out by other projects and never picked back up. Here are a couple of mine. For each I'll try to indicate why it fell by the wayside, and whether I think it could be resurrected (if you're interested in doing some idea necromancy, let me know! :)).

Detecting Flush+Flush

Among the flurry of microarchitectural side channel attacks that eventually culminated in the devastating Spectre and Meltdown attacks was one that has received relatively little attention: Flush+Flush. The basis of the attack is the observation that clflush takes a different amount of time depending on whether the address being flushed is already in the cache. Gruss et al. had a nice paper on this variant of the attack at DIMVA 2016.
The interesting thing about Flush+Flush to me is not its speed (which I believe has since been surpassed) but the fact that it is stealthy: unlike Flush+Reload, which causes an unusually large number of cache misses in the attacking process, Flush+Flush only causes cache misses in the victim process. So you can tell that an attack is happening, but not which process is actually carrying it out – and pinning down the culprit would be useful if you want to stop the attack without taking down the whole machine.
The key idea of the project was that even if you have only a global detector of whether an attack is going on system-wide, you can convert this into something that detects which process is attacking with a simple trick. Assuming you have control of the OS scheduler, just pause half the processes on the system and see if the attack stops. If it does, then the attacker must have been in the half you paused; you can now repeat this procedure on those processes and find the attacker via binary search. There are some wrinkles here: what if the attacker is carrying out the attack from multiple cooperating processes? What if the attacker probabilistically chooses whether or not to continue the attack at each time step?
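Setting those wrinkles aside, here's a minimal sketch of the basic procedure. The attack_detected, pause, and resume functions are stand-ins for a system-wide detector and OS scheduler hooks, not real APIs:

```python
# Sketch of isolating an attacking process via binary search ("wolf fence").
# attack_detected(), pause(), and resume() are placeholders for a global
# HPC-based detector and OS scheduler hooks; they are assumptions here.

def find_attacker(pids, attack_detected, pause, resume):
    """Narrow a list of candidate PIDs down to a single attacker using
    only a global yes/no detector."""
    candidates = list(pids)
    while len(candidates) > 1:
        mid = len(candidates) // 2
        left, right = candidates[:mid], candidates[mid:]
        pause(left)
        still_attacking = attack_detected()  # sample the detector for a while
        resume(left)
        # If the attack continued, the attacker is in the half we left running;
        # otherwise it must have been in the half we paused.
        candidates = right if still_attacking else left
    return candidates[0]
```

This only handles a single, non-adaptive attacker; the cooperating or probabilistic attackers mentioned above would require repeated sampling and some statistical care.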
It's a pretty simple idea, but I think it may have applications beyond detecting Flush+Flush. For example, I recently complained on Twitter that it's currently hard to tell what's causing my Mac's fan to spin. I have a detector (the fan is on); if I can systematically freeze and unfreeze processes on the system, I can convert that into something that tells me which process is to blame. It's also very similar to the well-known ideas of code bisection and (my favorite name) Wolf Fence Debugging. So if this defense is ever implemented I hope it becomes known as the wolf fence defense :)
This project failed primarily due to personnel issues: a couple different students took it on as a semester or internship project, but didn't manage to finish it up before they graduated or moved on to other things. It also turned out to be more difficult than expected to reliably carry out Flush+Flush attacks; at the time we started the project, a lot less was known about the Intel microarchitecture, and we were tripped up by features like hardware prefetching. We were also hampered by the lack of widely available implementations of hardware performance counter-based detection, and our own implementation was prone to false positives (at least in this we are not alone: a recent SoK paper has pointed out that many of these HPCs are not very reliable for security purposes).
I think this could still make a nice little paper, particularly if some of the more complex scenarios and advanced attackers were taken into account. When I discussed this with a more theoretically minded friend, he thought there might be connections to compressed sensing; meanwhile, a game theoretically oriented colleague I spoke to thought it sounded like an instance of a Multi-Armed Bandit problem. So there's still plenty of meat to this problem for an interested researcher.

Reverse Engineering Currency Detection Code

It is by now well-known that a wide variety of software and hardware, from cameras to photo editing software such as Photoshop, implement currency detection in order to help combat counterfeiting. But surprisingly little is known about these algorithms: Steven Murdoch did some research on the topic using a mix of black-box techniques and reverse engineering all the way back in 2004, but as far as I know never published a paper on the topic.
Reverse engineering the precise detection technique could have a lot of benefits. First, as a matter of principle, I think defensive techniques must be attacked if we are to rely on them; the fact that this has been used in the wild for more than 15 years across a wide range of devices without any rigorous public analysis is really surprising to me. Second, there is a fun application if we can pin down precisely what makes the currency detector fire: we could create something that placed a currency watermark on arbitrary documents, making them impossible to scan or edit on devices that implement the currency detector! We could even, perhaps, imagine making T-shirts that trigger the detector when photographed :) I believe this idea has been floated before with the EURion constellation, but based on Murdoch's research we know that the EURion is not the only feature used (and it may not be used at all by modern detectors).
Our technical approach to this problem was to use dynamic taint analysis and measure the amount of computation performed by Photoshop on each pixel of the input image. I now think this was a mistake: many image analyses compute global features (such as color histograms) over the whole image, and taint analysis on these will only tell you that every pixel contributes to the decision [1]. A more profitable approach might be something like differential slicing, which pins down the causal differences between two closely related execution traces. This would hopefully help isolate the code responsible for the detection for further manual analysis.
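To make the histogram problem concrete, here's a toy simulation (nothing to do with Photoshop's actual code, and a simulated taint tracker rather than a real one) in which each pixel carries its own taint label and the decision is made on a global histogram:

```python
# Toy model: per-pixel taint labels flowing into a global feature.
# Because the decision depends on a histogram over the whole image,
# every pixel's label ends up in the decision's taint set.

from collections import Counter

def toy_detector(pixels):
    hist = Counter()
    decision_taint = set()
    for i, p in enumerate(pixels):     # pixel i carries taint label {i}
        hist[p // 32] += 1             # global feature: coarse color histogram
        decision_taint.add(i)          # its label propagates into the feature
    decision = max(hist.values()) > len(pixels) // 2   # some threshold test
    return decision, decision_taint

decision, taint = toy_detector([10, 40, 200, 15, 7])
print(taint)   # {0, 1, 2, 3, 4}: every pixel "contributes" to the decision
```

The taint set tells us only that every pixel mattered, which is exactly the uninformative answer described above.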
As with the Flush+Flush detection above, this project could still be viable if approached by someone with strength in binary analysis and some knowledge of image processing and watermarking algorithms.


1. This also made me start thinking about variants of taint analysis that would be better suited to image analyses. One possibility is something based on quantitative information flow; Stephen McCamant et al. had some success using this to analyze the security of image redactions. Even with traditional dynamic taint analysis, though, it might be possible to taint the results of intermediate stages of the computation (e.g., in the histogram example, one could taint each bucket and see how it contributes to the eventual decision, or for an FFT one could apply the taint analysis after the transformation into the frequency domain).
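To sketch what bucket-level tainting might look like (again a toy simulation rather than a real taint engine, and with a made-up decision rule):

```python
# Sketch of "intermediate-stage" tainting: instead of tainting input pixels,
# taint each histogram bucket and track which buckets reach the decision.

from collections import Counter

def detect_with_bucket_taint(pixels, suspicious_bucket=6):
    hist = Counter(p // 32 for p in pixels)   # build the histogram first
    decision_taint = set()
    # The (made-up) decision only inspects one bucket, so only that bucket's
    # taint label flows to the output -- far more informative than "all pixels".
    if hist[suspicious_bucket] >= 1:
        decision_taint.add(f"bucket_{suspicious_bucket}")
        return True, decision_taint
    return False, decision_taint

print(detect_with_bucket_taint([10, 40, 200, 15, 7]))
# (True, {'bucket_6'}): the decision depends only on bucket 6
```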

Monday, March 19, 2018

Of Bugs and Baselines

Summary: recently published results on the LAVA-M synthetic bug dataset are exciting. However, I show that much simpler techniques can also do startlingly well on this dataset; we need to be cautious in our evaluations and not rely too much on getting a high score on a single benchmark.

A New Record


The LAVA synthetic bug corpora have been available now for about a year and a half. I've been really excited to see new bug-finding approaches (particularly fuzzers) use the LAVA-M dataset as a benchmark, and to watch as performance on that dataset steadily improved. Here's how things have progressed over time.

Performance on the LAVA-M dataset over time. Note that because the different utilities contain differing numbers of bugs, this chart normalizes performance on each utility, which presents a slightly skewed view of how successful each approach was overall. Also, SBF was only evaluated on base64 (where it did very well), and Vuzzer's performance on md5sum is due largely to a problem in the LAVA-M dataset.
You may notice that Angora (Chen and Chen, to appear at Oakland ’18), on the far right, has come within striking distance of perfect performance on the dataset! This is a great result, and the authors pulled it off by combining several really neat techniques (I mention a few of them in this Twitter thread). They also managed to do it in just 10 minutes for base64, md5sum, and uniq, and 45 minutes for who. Kudos to the authors!

My feeling is that the effective “shelf life” of this dataset is pretty much up – current techniques are very close to being able to cover everything in this dataset. This is perhaps not too surprising, since the coreutils are quite small, and the trigger mechanism used by the original LAVA system (a comparison against a 4 byte magic number) is a bit too simple since you can often solve it by extracting constants from the binary. Luckily, we have been working on new bug injection techniques and new corpora, and we will hopefully have some results to announce soon – watch this space :)

Covering your Baseline


When talking about progress in bug-finding, it is important to figure out a good baseline. One such baseline, of course, is the results we provided in the original paper (the first two bars on the graph). However, during the summer following the paper's publication, I pointed out two fairly simple ways to strengthen the baseline for fuzzing: using a constant dictionary and using a compiler pass to split up integer comparisons. The former is built into AFL, and the latter was implemented in an AFL fork by laf-intel.

What does our baseline look like when we use these techniques? To test this, I ran fuzzers on the LAVA-M dataset for 1 hour per program. This is a very short amount of time, but the intent here is to see what's possible with minimal effort and simple techniques. For the dictionary, I used a simple script that extracts strings and also runs objdump to extract integer constants (a sketch of this kind of script appears below). I tested:

  • AFL with a dictionary
  • laf-intel with AFL's "fidgety" mode (the -d option) and LAF_SPLIT_COMPARES=1
  • The same laf-intel configuration, but with a 256KiB map size rather than the default 64KiB.
The laf-intel configurations were suggested by Caroline Lemieux and relayed to me by Kevin Laeufer, who also pointed out that one really ought to be doing multiple runs of each fuzzer since the instrumentation as well as some fuzzing stages are nondeterministic – advice I have, alas, ignored for now. The larger map size helps reduce hash collisions with laf-intel – splitting up comparisons adds a large number of new branches to the program, so fuzzing performance may suffer with too small a map.
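For reference, here's a rough sketch of the kind of dictionary-extraction script mentioned above; the exact heuristics, thresholds, and regexes here are illustrative guesses rather than the script I actually used:

```python
#!/usr/bin/env python3
# Rough sketch: build an AFL dictionary from a target binary by extracting
# printable strings and the immediate operands of cmp instructions.
import re
import subprocess
import sys

binary = sys.argv[1]
tokens = set()

# Printable strings, as the `strings` utility reports them.
strings_out = subprocess.run(["strings", "-n", "4", binary],
                             capture_output=True, text=True).stdout
for s in strings_out.splitlines():
    if 4 <= len(s) <= 32:
        tokens.add(s.encode())

# Integer constants compared against in the disassembly (e.g. LAVA's
# 4-byte magic values), pulled out of objdump output.
objdump_out = subprocess.run(["objdump", "-d", binary],
                             capture_output=True, text=True).stdout
for m in re.finditer(r"\bcmp[lqwb]?\s+\$0x([0-9a-f]+)", objdump_out):
    val = int(m.group(1), 16)
    if 0xff < val <= 0xffffffff:          # skip tiny and oversized constants
        tokens.add(val.to_bytes(4, "little"))

# Emit in AFL dictionary format: name="\xAA\xBB..."
for i, tok in enumerate(sorted(tokens)):
    escaped = "".join("\\x%02x" % b for b in tok)
    print('token_%d="%s"' % (i, escaped))
```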

Here's what our new baseline results look like when put next to published work.


In this light, published tools are still much stronger than our original weak baseline, but (aside from Angora) are not better than AFL with a dictionary. I suspect that this says more about deficiencies in our dataset than about the tools themselves (more on this later).

Who's Who


Because the who program has so many bugs, and published tools seem to have had some trouble with it, I also wanted to see how different techniques would fare when given 24 hours to chew on it. I should note here, for clarity, that I do not think 2000+ bugs in who is a reasonable benchmark by which to judge bug-finding tools. You should not use the results of this test to decide what you will use to fuzz your software. But it does help us characterize what works well on LAVA-style bugs in a very simple, small program that takes structured binary input.

The contestants this time:
  • AFL, default mode, with dictionary
  • AFL, fidgety mode (-d option), with dictionary
  • AFLFast, with dictionary
  • AFL, fidgety mode, with the laf-intel pass, with all four combinations of {dict,nodict} x {64KiB,256KiB} map.
You can see the results below; I've included Angora in the chart since it's the best-performing of the non-baseline approaches. Note, though, that Angora only ran for 45 minutes – it's possible it would have done better if allowed to run for 24 hours.



The top of the chart, at 2136, represents the total number of (validated) bugs in who. We can see that the combination of our techniques comes close to finding all of the bugs (93.7%), and any variant of AFL that was able to use a dictionary does very well. The only exception to this trend is AFL with default settings – this is because, by default, AFL starts with deterministic fuzzing. With a large dictionary, this can take an inordinately long time: we noticed that after 24 hours, default AFL still hadn't finished a full cycle and was only 12% of the way through the deterministic stage. Fidgety mode (the -d option) bypasses deterministic fuzzing and does much better here.

Notes on Seed Selection


When we put together the LAVA-M benchmark, one thing we thought would be important was to include the seed file we used for our experiments. The performance of mutation-based fuzzers like AFL can vary significantly depending on the quality of their input corpus, so to reproduce our results you really need to know what seed we used. Sadly, I almost never see papers on fuzzing actually talk about their seeds. (Perhaps they usually start off with no seeds? But this is an unrealistic worst case for AFL.)

In the experiments above, I used the corpus-provided seeds for all programs except md5sum. In the case of md5sum, the program is run with "md5sum -c", which causes it to compute and check the MD5 of each file listed in the input file. The seed we used in the LAVA-M corpus lists 20 files, which slows down md5sum significantly since it has to compute 20 MD5 sums. Replacing the seed with one that only lists a single, small file allows AFL to achieve more than 10 times as many executions per second. In retrospect, our seed choice when running the experiments for the paper was definitely sub-optimal.
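For concreteness, here's one (hypothetical) way to construct such a seed; the file names are made up for illustration:

```python
# Sketch: build a minimal seed for fuzzing "md5sum -c". The seed file lists
# a single, tiny file, so each execution only computes one MD5 sum.
import hashlib
import os

os.makedirs("seeds", exist_ok=True)

payload = b"A"                                  # single, small file to checksum
with open("seeds/tiny.bin", "wb") as f:
    f.write(payload)

digest = hashlib.md5(payload).hexdigest()
# "md5sum -c" expects lines of the form "<digest>  <path>"
with open("seeds/md5sum_seed.txt", "w") as f:
    f.write(f"{digest}  seeds/tiny.bin\n")
```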

Conclusion


The upshot? We can almost completely solve the LAVA-M dataset using stock AFL with a dictionary. The only utility that this doesn't work for is "who"; however, a few simple tricks (a dictionary, constant splitting, and skipping the deterministic phase of AFL with "-d") and 24 hours suffice to get us to very high coverage (1731/2136 bugs, or 81%). So although the results of fuzzers like Angora are exciting, we should be cautious not to read too much into their performance on LAVA-M and instead also ask how they perform at finding important bugs in programs we care about. (John Regehr points out that the most important question to ask about a bug-finding paper is “Did you report the bugs, and what did they say?”)

To be honest, I think this is not a great result for the LAVA-M benchmark. If a fairly simple baseline can find most of its bugs, then we have miscalibrated its difficulty: ideally we want our bug datasets to be a bit beyond what the best techniques can do today. This also confirms our intuition that it's time to create a new, harder dataset of bugs!

Acknowledgements


Thanks to Caroline Lemieux and Kevin Laeufer for feedback on a draft of this post, which fixed several mistakes and provided several interesting insights (naturally, if you see any more errors – that's my fault). Additional thanks are due to Kevin Laeufer (again), Alex Gantman, John Regehr, Kristopher Micinski, Sam Tobin-Hochstadt, Michal Zalewski (aka lcamtuf), and Valentin Manès (aka Jiliac) for a lively Twitter argument that turned into some actual useful science.