Tuesday, December 9, 2014

Reproducible Malware Analyses for All

Summary: With help from GTISC, I have begun running 100 malware samples per day and posting the PANDA record & replay logs online at http://panda.gtisc.gatech.edu/malrec/. The goal is to lower the barriers to entry for doing dynamic malware research, and to make such research reproducible.

Today, I spoke at the ACSAC Malware Memory Forensics workshop in New Orleans about a problem that I think has been largely ignored in existing dynamic malware analysis research: reproducibility.

To make results reproducible, a computer science researcher typically needs to do three things:
  1. Carefully and precisely describe their methods.
  2. Release the code they wrote for their system or analysis.
  3. Release the data the analysis was performed on.
Of course, even research published at top conferences may fail at some of these criteria; a recent study by Collberg et al. attempted to obtain the code associated with 613 recent papers from ACM conferences, and the authors were able to obtain, build, and run the code for only 102. (I'm eliding a lot of important detail here; please do read the original study!)

Rather than discuss sharing of code today, however, I'd like to talk about sharing data, and particularly sharing data in malware analysis.

For static analysis of malware, sharing the malware executable is usually sufficient to satisfy the requirement for releasing data; anyone can then go and look at the same static code and reach the same conclusions by following the author's description. A number of sites exist to provide access to such malware samples, such as VirusShare, OpenMalware, and Contagio.

The data associated with a dynamic analysis is more difficult to share. Software execution is by nature ephemeral: each run of a program may be slightly different based on things like timings, the availability of network servers, the versions of software installed on the machine, and more. This problem is especially apparent with malware, which typically has a short "shelf life". Many malware samples need to contact their command and control servers to operate, and these C&C servers often disappear within days or weeks after a piece of malware is released. Malware may even be designed to "self-destruct" after a certain date, exiting immediately if it is run too long after its creation.

Thus, a researcher who tries to reproduce a dynamic malware analysis by running a sample from last year will almost certainly discover that the malware no longer has the behavior originally seen. As a result, most dynamic analyses of malware are currently not reproducible in any meaningful sense.

Record and replay provides a solution. As I have discussed in the past, record and replay allows one to reproduce a whole-system dynamic execution by creating a compact log of the nondeterministic inputs to a system. These logs can be shared and then replayed in PANDA, allowing anyone to re-run the exact execution and be assured that every instruction will be executed exactly the same way.
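
To make the idea concrete, here is a toy sketch in Python. It is not PANDA's implementation or log format; the `Recorder`, `Replayer`, and `program` names are hypothetical, and the "program" is a stand-in whose behavior depends on the clock and on random values. The point is only that once every nondeterministic input has been logged, feeding the log back in reproduces the same execution.

```python
import random
import time


class Recorder:
    """Runs against the real world and logs every nondeterministic input."""

    def __init__(self):
        self.log = []

    def nondet(self, source):
        value = source()        # ask the real world (clock, RNG, network, ...)
        self.log.append(value)  # remember the answer for later replay
        return value


class Replayer:
    """Ignores the real world and hands back the logged inputs in order."""

    def __init__(self, log):
        self._values = iter(log)

    def nondet(self, source):
        return next(self._values)


def program(env):
    """Hypothetical 'sample': its trace depends on time and randomness."""
    trace = []
    now = env.nondet(time.time)
    trace.append(("time", now))
    key = env.nondet(lambda: random.getrandbits(32))
    trace.append(("key", key))
    # Stand-in for "is the C&C server still reachable?"
    server_up = env.nondet(lambda: random.random() < 0.5)
    trace.append(("branch", "beacon" if server_up else "exit"))
    return trace


recorder = Recorder()
original = program(recorder)                # live run, inputs are logged
replayed = program(Replayer(recorder.log))  # replay from the log alone
assert original == replayed
```

Running `program` twice against a fresh `Recorder` will generally give two different traces; running it against a `Replayer` built from the same log gives the same trace every time, which is the property that makes a shared log useful.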

To put my malware where my mouth is, I've set up a site where, every day, 100 new malware record/replay logs and associated PCAPs will be posted. This is currently something of a trial run, so there may be some changes as I shake out the bugs; in particular, I hope to give it a nicer interface than just a brute listing of all the MD5s. Check it out: http://panda.gtisc.gatech.edu/malrec/


Here are some ideas for what to do with this data:
  1. Create movies of all the malware executions and watch them to see if there's anything interesting. For example, one of last night's recordings captured a hilarious extortion attempt.
  2. Use something like TZB to find all printable strings accessed in memory throughout the entire execution, and build a search engine that indexes all of these strings, so you could search for "bitcoin" and find all the bitcoin stealing samples in the corpus.
  3. Create system call traces and then use them to automatically apply behavioral labels to the corpus.
  4. Go apply your expertise in machine learning to do something really cool that I haven't even thought of because I'm bad at machine learning, without having to set up your own malware analysis platform.
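
As a rough illustration of the string search engine in idea 2, here is a minimal Python sketch. It is not how TZB works; it assumes the memory contents seen during each replay have already been extracted into byte buffers by some other means, and the sample names and buffer contents below are made-up placeholders, not data from the corpus.

```python
import re
from collections import defaultdict

# Runs of four or more printable ASCII bytes, like the Unix `strings` default.
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{4,}")


def extract_strings(buffer):
    """Return every printable ASCII run found in a bytes buffer."""
    return [m.group().decode("ascii") for m in PRINTABLE_RUN.finditer(buffer)]


def build_index(buffers_by_sample):
    """Map each lowercase alphanumeric token to the samples that contain it."""
    index = defaultdict(set)
    for sample_id, buffers in buffers_by_sample.items():
        for buf in buffers:
            for s in extract_strings(buf):
                for token in re.findall(r"[a-z0-9]+", s.lower()):
                    index[token].add(sample_id)
    return index


def search(index, term):
    """Return the sorted list of samples whose strings contain the term."""
    return sorted(index.get(term.lower(), set()))


# Placeholder data standing in for memory captured from two replays.
corpus = {
    "sample-a": [b"\x00\x01wallet.dat\x00\xffbitcoin-qt\x00"],
    "sample-b": [b"\x90\x90GET /index.html HTTP/1.1\r\n"],
}
index = build_index(corpus)
print(search(index, "bitcoin"))
print(search(index, "http"))
```

With the placeholder corpus above, searching for "bitcoin" returns only `sample-a` and searching for "http" returns only `sample-b`. A real index over the whole corpus would need on-disk storage and probably Unicode string handling, which this sketch leaves out.
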
I'm really excited to see what we can accomplish.
The malware recordings are graciously hosted by the Georgia Tech Information Security Center, who are also providing me with access to malware samples. Thanks in particular to Paul Royal and Adam Allred for helping me make this a reality after I pitched it at CSAW THREADS.
