Tuesday, June 7, 2016

How to add a million bugs to a program (and why you might want to)

In this series of posts, I'm going to describe how to automatically put bugs in programs, a topic on which we just published a paper at Oakland, one of the top academic security conferences. The system we developed, LAVA, can put millions of bugs into real-world programs. Why would anyone want to do this? Are my coauthors and I sociopaths who just want to watch the world burn? No, but to see why we need such a system requires a little bit of background, which is what I hope to provide in this first post.

I am sure this will come as a shock to most, but programs written by humans have bugs. Finding and fixing them is immensely time consuming; just how much of a developer's time is spent debugging is hard to pin down, but estimates range between 40% and 75%. And of course these errors can be not only costly for developers but catastrophic for users: attackers can exploit software bugs to run their own code, install malware, set your computer on fire, etc.

Weekly World News has known about this problem for years.


It should come as little surprise, then, that immense effort has been expended in finding ways to locate and fix bugs automatically. On the academic side, techniques such as fuzzing, symbolic execution, model checking, abstract interpretation, and creative combinations of those techniques have been proposed and refined for the past 25 years. Nor has industry been idle: companies like Coverity, Fortify, Veracode, Klocwork, GrammaTech, and many more will happily sell (or rent) you a product that automatically finds bugs in your program.

Great, so by now we must surely have solved the problem, right? Well, not so fast. We should probably check to see how well these tools and techniques work. Since they're detectors, the usual way would be to measure the false positive and false negative rates. To measure false positives, we can just run one of these tools on our program, go through the output, and decide whether we think each bug it found is real.

The same strategy does not work for measuring false negatives. If a bug finder reports finding 42 bugs in a program, we have no way of knowing whether that's 99% or 1% of the total. And this seems like the piece of information we'd most like to have!

Heartbleed: detectable with static analysis tools, but only after the fact.


To measure false negatives we need a source of bugs so that we can tell how many of them our bug-finder detects. One strategy might be to look at historical bug databases and see how many of those bugs are detected. Unfortunately, these sorts of corpora are fixed in size – there are only so many bugs out there, and analysis tools will, over time, be capable of detecting most of them. We can see how this dynamic played out with Heartbleed: shortly after the bug was found, Coverity and GrammaTech quickly found ways to improve their software so that it could find Heartbleed.

Let me be clear – it's a good thing that vendors can use test cases like these to improve their products! But it's bad when these test cases are in short supply, leaving users with no good way of evaluating false negatives and bug finders with no clear path to improving their techniques.

This is where LAVA enters the picture. If we can find a way to automatically add realistic bugs to pre-existing programs, we can both measure how well current bug finding tools are doing, and provide an endless stream of examples that bug-finding tools can use to get better.
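
To make the false-negative arithmetic concrete, here is a tiny sketch (my own illustration, not code from the paper): once you have a ground-truth corpus of injected bugs, measuring false negatives reduces to counting which ones a tool reports. The bug identifiers are hypothetical.

# Hypothetical bug identifiers; with ground truth, false negative
# measurement is just set arithmetic.
injected = {"bug_001", "bug_002", "bug_003", "bug_004", "bug_005"}
reported = {"bug_002", "bug_005"}          # what some bug-finder flagged

missed = injected - reported
fn_rate = len(missed) / len(injected)
print(f"found {len(reported)}/{len(injected)}, "
      f"false negative rate = {fn_rate:.0%}")   # 60%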

LAVA: Large-scale Automated Vulnerability Addition

Goals for Automated Bug Corpora


So what do we want out of our bug injection? In our paper, we defined five goals for automated bug injection, requiring that injected bugs
  1. Be cheap and plentiful
  2. Span the execution lifetime of a program
  3. Be embedded in representative control and data flow
  4. Come with a triggering input that proves the bug exists
  5. Manifest for a very small fraction of possible inputs
The first goal we've already discussed – if we want to evaluate tools and enable "hill climbing" by bug finders, we will want a lot of bugs. If it's too expensive to add a bug, or if we can only add a handful per program, then we don't gain much by doing it automatically – expensive humans can already add small numbers of bugs to programs by hand.

The next two relate to whether our (necessarily artificial) bugs are reasonable proxies for real bugs. This is a tricky and contentious point, which we'll return to in part three. For now, I'll note that the two things called out here – occurring throughout the program and being embedded in "normal" control and data flow – are intended to capture the idea that program analyses will need to do essentially the same reasoning about program behavior to find them as they would for any other bugs. In other words, they're intended to help ensure that getting better at finding LAVA bugs will make tools better at understanding programs generally.

The fourth is important because it allows us to demonstrate, conclusively, that the bugs we inject are real problems. Concretely, with LAVA we can demonstrate an input for each bug we inject that causes the program to crash with a segfault or bus error.
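
To make that concrete, here is a minimal sketch of the kind of check involved (my own illustration, not LAVA's actual validation harness): run the program on a candidate triggering input and see whether it dies with SIGSEGV or SIGBUS. The program and input names are placeholders.

import signal
import subprocess

def input_triggers_crash(program, input_path):
    """Return True if running `program` on `input_path` ends with the
    process being killed by a segfault or bus error."""
    proc = subprocess.run([program, input_path],
                          stdout=subprocess.DEVNULL,
                          stderr=subprocess.DEVNULL)
    # On Unix, a negative return code is the negated signal number.
    return proc.returncode in (-signal.SIGSEGV, -signal.SIGBUS)

# Hypothetical usage: each injected bug ships with a triggering input.
print(input_triggers_crash("./target_program", "trigger_for_bug_1234.bin"))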

The final property is critical but not immediately obvious. We don't want the bugs we inject to be too easy to find. In particular, if a bug manifests on most inputs, then it's trivial to find it – just run the program and wait for the crash. We might even want this to be a tunable parameter, so that we could specify what fraction of the input space of a program causes a crash and dial the difficulty of finding the right input up or down.

Ethics of Bug Injection


A common worry about bug injection is that it could be misused to add backdoors into legitimate software. I think these worries are, for the most part, misplaced. To see why, consider the goals of a would-be attacker trying to sneak a backdoor into some program. They want:
  1. A way to get the program to do something bad on some secret input.
  2. Not to get caught (i.e., to be stealthy, and for the bugs to be deniable).
Looking at (1), it's clear that one bug suffices to achieve the goal; there's no need to add millions of bugs to a program. Indeed, adding millions of bugs harms goal (2) – it would require lots of changes to the program source, which would be very difficult to hide.

An attempted Linux kernel backdoor from 2003. Can you spot the bugdoor?

In other words, the benefit that LAVA provides is in adding lots of bugs at scale. An attacker that wants to add a backdoor can easily do it by hand – they only need to add one, and even if it takes a lot of effort to understand the program, that effort will be rewarded with extra stealth and deniability. Although the bugs that LAVA injects are realistic in many ways, they do not look like mistakes a programmer would have naturally made, which means that manual code review would be very likely to spot them.

(There is one area where LAVA might help a would-be attacker – the analysis we do to locate portions of the program that have access to attacker controlled input could conceivably speed up the process of inserting a backdoor by hand. But this analysis is quite general, and is useful for far more than just adding bugs to programs.)

The Road Ahead


The next post will discuss the actual mechanics of automated bug injection. We'll see how, using some new taint analyses in PANDA, we can analyze a program to find small modifications that cause attacker-controlled input to reach sensitive points in the program and selectively trigger memory safety errors when the input is just right.

Once we understand how LAVA works, the final post will be about evaluation: how can we tell if LAVA succeeded in its goals of injecting massive numbers of realistic bugs? And how well do current bug-finders fare at finding LAVA bugs?

Credits


The idea for LAVA originated with Tim Leek of MIT Lincoln Laboratory. Our paper lists authors alphabetically, because designing, implementing and testing it truly was a group effort. I am honored to share a byline with Patrick Hulin, Engin Kirda, Tim Leek, Andrea Mambretti, Wil Robertson, Frederick Ulrich, and Ryan Whelan.

Tuesday, January 5, 2016

PANDA Plugin Documentation

It's been a very long time coming, but over the holiday break I went through and created basic documentation for all 54 currently-available PANDA plugins. Each plugin now includes a manpage-style document named USAGE.md in its plugin directory.

You can find a master list of each plugin and a link to its man page here:
https://github.com/moyix/panda/blob/master/docs/Plugins.md

Hopefully this will help people get started using PANDA to do some cool reverse engineering!

Friday, October 2, 2015

PANDA VM Update October 2015

The PANDA virtual machine has once again been updated, and you can download it from:


Notable changes:

  • We fixed a record/replay bug that was preventing Debian Wheezy and above from replaying properly.
  • The QEMU GDB stub now works during replay, so you can break, step, etc. at various points during the replay to figure out what's going on. We still haven't implemented reverse-step though – hopefully in a future release.
  • Thanks to Manolis Stamatogiannakis, the Linux OS Introspection code can now resolve file descriptors to actual filenames. Tim Leek then extended the file_taint plugin to use this information, so file-based tainting should be more accurate now, even if things like dup() are used.
  • We have added support for more versions of Windows in the syscalls2 code.
Enjoy!

Thursday, August 27, 2015

(Sys)Call Me Maybe: Exploring Malware Syscalls with PANDA

System calls are of great interest to researchers studying malware, because they are the only way that malware can have any effect on the world – writing files to the hard drive, manipulating the registry, sending network packets, and so on all must be done by making a call into the kernel.

In Windows, the system call interface is not publicly documented, but there have been lots of good reverse engineering efforts, and we now have full tables of the names of each system call; in addition, by using the Windows debug symbols, we can figure out how many arguments each system call takes (though not yet their actual types).

I recently ran 24,389 malware replays under PANDA and recorded all the system calls made, along with their arguments (just the top-level argument values, without trying to descend into pointer types or dereference handles). So for each replay, we now have a log file that looks like:

3f9b2340 NtGdiFlush
3f9b2340 NtUserGetMessage 0175feac 00000000 00000000 00000000
3f9b2120 NtCreateEvent 0058f8d8 001f0003 00000000 00000000 00000000
3f9b2120 NtWaitForMultipleObjects 00000002 0058f83c 00000001 00000000 00000000
3f9b2120 NtSetEvent 000002ec 00000000
3f9b2120 NtWaitForSingleObject 000002f0 00000000 0058f89c
3f9b2120 NtReleaseWorkerFactoryWorker 00000050
3f9b2120 NtReleaseMutant 00000098 00000000
3f9b2120 NtWaitForSingleObject 000005a4 00000000 00000000
3f9b2120 NtWaitForMultipleObjects 00000002 00dbf49c 00000001 00000000 00000000
3f9b2120 NtReleaseMutant 00000098 00000000
3f9b2120 NtWaitForMultipleObjects 00000002 00dbf4a8 00000001 00000000 00dbf4c8
3f9b2120 NtWaitForMultipleObjects 00000002 00dbf49c 00000001 00000000 00000000
3f9b2120 NtClearEvent 000002ec
3f9b2120 NtReleaseMutant 00000098 00000000
3f9b2120 NtWaitForMultipleObjects 00000002 00dbf49c 00000001 00000000 00000000
3f9b2120 NtReleaseMutant 000001e8 00000000
3f9b2120 NtWaitForMultipleObjects 00000002 00dbf3b8 00000001 00000000 00000000
3f9b2120 NtReleaseMutant 00000158 00000000
3f9b2120 NtCreateEvent 00dbeed4 001f0003 00000000 00000000 00000000
3f9b2120 NtDuplicateObject ffffffff fffffffe ffffffff 002edf50 00000000 00000000 00000002
3f9b2120 NtTestAlert
...

The first column identifies the process that made the call, using its address space as a unique identifier. The second gives the name of the call, and the remaining columns show the arguments passed to the function.
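
If you want to process these logs programmatically, the text format is easy to parse. Here is a rough sketch, under the assumption that you have already dumped a log to a text file in exactly the layout shown above (an address-space id, a call name, and zero or more hex arguments per line):

from collections import namedtuple

SyscallRecord = namedtuple("SyscallRecord", ["asid", "name", "args"])

def parse_syscall_dump(path):
    """Parse a showsc-style text dump into (asid, name, args) records."""
    records = []
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) < 2:
                continue                     # skip blank or truncated lines
            asid, name, *args = fields
            records.append(SyscallRecord(asid, name, [int(a, 16) for a in args]))
    return records

# Hypothetical usage:
# recs = parse_syscall_dump("008d065f_syscalls.txt")
# print(sum(1 for r in recs if r.name == "NtCreateEvent"))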

As usual, this data can be freely downloaded; the data set is 38GB. Each log file is compressed; you can use the showsc program (included in the tarball) to display an individual log file:

$ ./showsc 32 32bit/008d065f-7f5d-4a86-9995-970509ff3999_syscalls.dat.gz

You can download the data set here:

Interesting Malware System Calls

As a first pass, we can look at what the least commonly used system calls are. These may be interesting because rarely used system calls are more likely to contain bugs; in the context of malware, invoking a vulnerable system call can be a way to achieve privilege escalation.
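
Here is a sketch of the sort of tally involved, assuming you have dumped each log to text first (the directory name is a placeholder):

import glob
from collections import Counter

counts = Counter()
for path in glob.glob("syscall_dumps/*.txt"):    # hypothetical dump directory
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) >= 2:
                counts[fields[1]] += 1           # column 2 is the call name

# Rarest calls first
for name, n in sorted(counts.items(), key=lambda kv: kv[1])[:20]:
    print(f"{n:8d} {name}")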

Here are a few that came out from sorting the list of system calls in the malrec dataset and then searching Google for some of the least common:
  • NtUserMagControl (1 occurrence) One of many functions found by j00ru to cause crashes due to invalid pointer dereferences when called from the context of the CSRSS process
  • NtSetLdtEntries (2 occurrences) Used as an anti-debug trick by some malware
  • NtUserInitTask (3 occurrences) Used as part of an exploit for CVE-2012-2553
  • NtGdiGetNearestPaletteIndex (3 occurrences) Used in an exploit for MS07-017
  • NtQueueApcThreadEx (5 occurrences) Mentioned as a way to get attacker-controlled code into the kernel, allowing one to bypass SMEP
  • NtUserConvertMemHandle (5 occurrences) Used to replace a freed kernel object with attacker data in an exploit for CVE-2015-0058
  • NtGdiEnableEudc (9 occurrences) Used in a privilege escalation exploit where NtGdiEnableEudc assumes a certain registry key is of type REG_SZ without checking, allowing an attacker to overflow a stack buffer (I was unable to find anything about whether this has been patched – Update: Mark Wodrich points out that this is CVE-2010-4398 and it was patched in MS11-011)
  • NtAllocateReserveObject (11 occurrences) Used for a kernel pool spray
  • NtVdmControl (55 occurrences) Used for the famous CVE-2010-0232 bug; Tavis Ormandy won the Pwnie for Best Privilege Escalation Bug in 2010 for this.
Of course, we can't say for sure that the replays that execute these calls actually contain exploitation attempts. After all, there are benign ways to use each of the calls, or they wouldn't be in Windows in the first place :) But these are a few that may reward closer examination; if they are in fact exploit attempts, you can then use PANDA's record and replay facility to step through the exploit in as much detail as you like. You can even use PANDA's recently-fixed QEMU gdb stub to go through the exploit instruction by instruction.

You can peruse the full list of system calls and their frequencies here: 32-bit, 64-bit. Let me know if you find any other interesting calls in there :)

Updates 8/28/2015

If you want to know which log files have which system calls without processing all of them, I have created an index that lists the unique calls for each replay:
Also, Reddit user trevlix wondered whether the lack of pointer dereferencing was inherent to PANDA or something I'd just left out. My response:

Yes, it is possible to do that. I just wasn't able to because I didn't have access to full system call prototypes. E.g., to follow pointers for something like NtCreateFile, you need to know that its full prototype is
NTSTATUS NtCreateFile(
  _Out_    PHANDLE            FileHandle,
  _In_     ACCESS_MASK        DesiredAccess,
  _In_     POBJECT_ATTRIBUTES ObjectAttributes,
  _Out_    PIO_STATUS_BLOCK   IoStatusBlock,
  _In_opt_ PLARGE_INTEGER     AllocationSize,
  _In_     ULONG              FileAttributes,
  _In_     ULONG              ShareAccess,
  _In_     ULONG              CreateDisposition,
  _In_     ULONG              CreateOptions,
  _In_     PVOID              EaBuffer,
  _In_     ULONG              EaLength
);
You furthermore have to know how big an OBJECT_ATTRIBUTES struct is, so that when you dereference the pointer you know how many bytes to read and store in the log.
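
For illustration, here is roughly what that looks like for the 32-bit OBJECT_ATTRIBUTES layout, assuming you have already read 24 bytes of guest memory at the pointer value (read_guest_memory below is a hypothetical helper standing in for PANDA's guest memory read API):

import struct

OBJECT_ATTRIBUTES_SIZE = 24   # sizeof(OBJECT_ATTRIBUTES) on 32-bit Windows

def parse_object_attributes(raw):
    """Unpack the six 32-bit fields of OBJECT_ATTRIBUTES: Length,
    RootDirectory, ObjectName (a PUNICODE_STRING), Attributes,
    SecurityDescriptor, and SecurityQualityOfService."""
    names = ("Length", "RootDirectory", "ObjectName", "Attributes",
             "SecurityDescriptor", "SecurityQualityOfService")
    return dict(zip(names, struct.unpack("<6I", raw[:OBJECT_ATTRIBUTES_SIZE])))

# Hypothetical usage during replay or post-processing:
# raw = read_guest_memory(object_attributes_ptr, OBJECT_ATTRIBUTES_SIZE)
# attrs = parse_object_attributes(raw)
# attrs["ObjectName"] is itself a pointer to a UNICODE_STRING, which would
# need another read (and another struct definition) to recover the file name.
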
If you wanted to collect extra information about any of the logs posted, it's possible since they are full-system traces and can be replayed :) Supposing you have a syscall trace file like 0a1a1a77-d4f1-43e0-bc14-4f34f7d96820_syscalls.dat.gz, you can use the UUID to find it on malrec and download the log file:
Then you'd just unpack that log (scripts/rrunpack.py in the PANDA directory) and replay it with a PANDA plugin that understands how to dereference the various pointers involved. For reference, you can see the PANDA plugin I originally used to gather the syscall traces:
And you can see on lines 108 and 119 where you'd have to add in code to read the dereferenced values.

Monday, August 24, 2015

One Weird Trick to Shrink Your PANDA Malware Logs by 84%

When I wrote about some of the lessons learned from PANDA Malrec's first 100 days of operation, one of the things I mentioned was that the storage requirements for the system were extremely high. In the four months since, the storage problem only got worse: as of last week, we were storing 24,000 recordings of malware, coming in at a whopping 2.4 terabytes of storage.

The amount of data involved poses problems not just for our own storage but also for others wanting to make use of the recordings for research. 2.4 terabytes is a lot, especially when it's spread out over 24,000 HTTP requests. If we want our data to be useful to researchers, it would be great if we could find better ways of compressing the recording logs.

As it turns out, we can! The key is to look closely at what makes up a PANDA recording:
  • The log of non-deterministic events (the -rr-nondet.log files)
  • The initial QEMU snapshot (the -rr-snp files)
The first of these is highly redundant and actually compresses quite well already – the xz compression used by PANDA's rrpack.py usually manages to get around a 5-6X reduction for the nondet log. The snapshots also compress pretty well, at around 4X.

So where can we find further savings? The trick is to notice that for the malware recordings, each run is started by first reverting the virtual machine to the same state. That means that the initial snapshot files for our recordings are almost all identical! In fact, if we do a byte-by-byte diff, the vast majority differ by only a few bytes – most likely a timer value that increments in the short time between when we revert to the snapshot and begin our recording.

With this observation in hand, we can instead store the malware recordings in a new format. The nondet log will still be compressed with xz, but the snapshot for each recording will instead be stored as a binary diff against a reference snapshot. Because we have two separate recording platforms and have changed the initial environment used by Malrec a few times, the total number of reference snapshots we need is 8 – but this is a huge improvement over storing 24,000 snapshots! The binary diff for each recording then requires only a handful of bytes to specify.
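
As a toy illustration of the idea (not the actual diff format we use), when two snapshots really do differ in only a handful of bytes, a patch can be as simple as a list of (offset, new byte) pairs:

def byte_diff(reference, target):
    """Return (offset, new_byte) pairs where `target` differs from
    `reference`; assumes both blobs are the same length."""
    return [(i, b) for i, (a, b) in enumerate(zip(reference, target)) if a != b]

def byte_patch(reference, diff):
    """Rebuild the target snapshot from the reference plus the diff."""
    out = bytearray(reference)
    for offset, new_byte in diff:
        out[offset] = new_byte
    return bytes(out)

# Hypothetical usage:
# ref = open("reference-rr-snp", "rb").read()
# snp = open("recording-rr-snp", "rb").read()
# delta = byte_diff(ref, snp)            # often just a few entries
# assert byte_patch(ref, delta) == snp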

The upshot of all of this is that a dataset of 24,189 PANDA malware recordings now takes up just 387 GB, a savings of 84%. This is pretty astonishing – the recordings in the archive contain 476 trillion instructions' worth of execution, meaning our storage rate is 1147.5 instructions per byte! As a point of comparison, one recently published instruction trace compression scheme achieved 2 bits per instruction; ours works out to 0.007 bits per instruction – though this comparison is somewhat unfair, since that paper can't assume a shared starting point.
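
A quick back-of-the-envelope check of those figures, treating 387 GB as 387 * 2^30 bytes (the small gap from 1147.5 comes from rounding the instruction count to 476 trillion):

instructions = 476e12              # ~476 trillion instructions in the archive
size_bytes = 387 * 2**30           # 387 GB of compressed recordings

insns_per_byte = instructions / size_bytes
bits_per_insn = 8 / insns_per_byte
print(f"{insns_per_byte:.1f} instructions per byte")   # ~1145.5
print(f"{bits_per_insn:.4f} bits per instruction")     # ~0.0070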

You can download this data set as a single file from our MIT mirror; please share and mirror this as widely as you like! There is a README included in the archive that contains instructions for extracting and replaying any of the recordings. Click the link below to download:


Stay tuned, too – there's more cool stuff on the way. Next time, I'll be writing about one of the things you can do with a full-trace recording dataset like this: extracting system call traces with arguments. And of course that means I'll have a syscall dataset to share then as well :)

Monday, April 13, 2015

PANDA VM Update April 2015

The PANDA virtual machine has been updated to the latest version of PANDA, which corresponds to commit ce866e1508719282b970da4d8a2222f29f959dcd. You can download it here:

  • The taint system has been rewritten and is now available as the taint2 plugin. It is at least 10x faster, and uses much less memory. You can check out an example of how to use it in the recently updated tainted instructions tutorial.
  • Since taint is now usable, I have increased the amount of memory in the VM to 4GB, which is reasonable for most tasks that use taint.
  • PANDA now understands system calls and their arguments on Linux (x86 and ARM) and Windows 7 (x86). This is available in the syscalls2 plugin, and even has some documentation.
  • There is now a generic logging format for PANDA, which uses Protocol Buffers. Check out the pandalog documentation for more details.
There's lots more that has changed, and I will try to write up a more detailed post soon about all the cool stuff PANDA can do now!

Tuesday, March 24, 2015

100 Days of Malware

It's now been a little over 100 days since I started running malware samples in PANDA and making the executions publicly available. In that time, we've analyzed 10,794 pieces of malware, which generated:
  • 10,794 record/replay logs, representing 226,163,195,948,195 instructions executed
  • 10,794 packet captures, totaling 26GB of data and 33,968,944 packets
  • 10,794 movies, which are interesting enough that I'll give them their own section
  • 10,794 VirusTotal reports, indicating what level of detection they had when they were run by malrec
  • 107 torrents, containing downloads of the above
I've been pleased by the interest malrec has generated. We've had visitors from over 6000 unique IPs, in 89 different countries:

The Movies

There's a lot of great stuff in these ~10K movies. An easy way to get an idea of what's in there is to sort by filesize; because of the way MP4 encoding works, larger files in general mean that there's more going on on-screen (though only up to a point – the largest ones seemed to be mostly command prompts scrolling, which wasn't very interesting). I took it upon myself to watch the ones between ~300KB and 1MB, and found some fun videos:

Several games:

Puzzle

Anime-like game

"Wild Spirit"

Some fake antivirus:



Extortion attempts were really popular:



Though they didn't always invest in fancy art design:

Extortion (low-rent)

Download managers and trojaned installers were very popular:

Broken trojaned installer

Download manager

Some used legitimate-looking documents to disguise nefarious intentions:

Some maritime history

German Britfilms

Chinese newspaper

Finally, there's the weird and random:

Trust Me, I'm a Doctor

Trees

Not pictured: there was even one sample that played a porn video (NSFW) while it infected you; I guess it was intended as a distraction?

Antivirus Results

After malrec runs a sample in the PANDA sandbox, it checks to see what VirusTotal thinks about the file, and saves the result in VirusTotal's JSON format. From this, we can find out what the most popular families in our corpus are. For example, here's what McAfee thought about our samples:




And Symantec:

In the graphs above, "None" includes both cases where the AV didn't detect anything and cases where the sample hadn't been submitted to VirusTotal yet. You therefore should probably not use this data to try and draw any conclusions about the relative efficacy of Symantec vs McAfee. I'd like to go back and see how the detection rate has changed, but unfortunately my VirusTotal API key is currently rate-limited, so running all 10,794 samples would be a pain.
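
Here is a sketch of the tallying involved, assuming the saved reports follow the VirusTotal v2 JSON layout with a top-level "scans" dictionary keyed by engine name (the directory name is a placeholder):

import glob
import json
from collections import Counter

def tally_engine(engine, report_dir="vt_reports"):
    """Count one AV engine's labels across a directory of saved VT reports."""
    labels = Counter()
    for path in glob.glob(f"{report_dir}/*.json"):
        with open(path) as f:
            report = json.load(f)
        scan = report.get("scans", {}).get(engine, {})
        if scan.get("detected"):
            labels[scan.get("result") or "Unknown"] += 1
        else:
            labels["None"] += 1    # no detection, or never submitted
    return labels

for label, n in tally_engine("McAfee").most_common(10):
    print(f"{n:6d} {label}")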

Bugs in PANDA

Making recordings on such a large scale has exposed bugs in PANDA, some of which we have fixed, others of which need further investigation:
  • One sample, 5309206b-e76f-417a-a27e-05e7f20c3c9d, ran a tight loop of rdtsc queries without interrupts. Because PANDA's replay queue would continue filling until it saw the next interrupt, this meant that the queue could grow very large and exhaust physical memory. This was fixed by limiting the queue to 65,536 entries.
  • Our support for 64-bit Windows is much less reliable than I would like. Of the 153 64-bit samples, 45 failed to replay (29.4%). We clearly need to do better here!
  • We see some sporadic replay failures in 32-bit recordings as well, but they are much more rare: out of the 10,641 32-bit recordings we have, only 14 failed to replay. I suspect that some of these are due to a known bug involving the recording of port I/O.
  • One sample, MD5 1285d1893937c3be98dcbc15b88d9433, we have not even been able to record, because it causes QEMU to run out of memory; if you'd like to play with it, you can download it here.

Conclusions

With our current disk usage (1.2TB used out of 2TB), I'm anticipating that we'll be able to run for another 50 days or so; hopefully by then I'll be able to arrange to get more storage.

Meanwhile, I've started to do some deeper research on the corpus, including visualization and mining memory for printable strings. I'm looking forward to getting a clearer picture of what's in our corpus as it grows!