Tuesday, December 9, 2014

Reproducible Malware Analyses for All

Summary: With help from GTISC, I have begun running 100 malware samples per day and posting the PANDA record & replay logs online at http://panda.gtisc.gatech.edu/malrec/. The goal is to lower the barriers to entry for doing dynamic malware research, and to make such research reproducible.

Today, I spoke at the ACSAC Malware Memory Forensics workshop in New Orleans about a problem that I think has been largely ignored in existing dynamic malware analysis research: reproducibility.

To make results reproducible, a computer science researcher typically needs to do three things:
  1. Carefully and precisely describe their methods.
  2. Release the code they wrote for their system or analysis.
  3. Release the data the analysis was performed on.
Of course, even research published at top conferences may fail at some of these criteria; a recent study by Collberg et al. attempted to obtain the code associated with 613 recent papers from ACM conferences, and was able to obtain, build, and run the code for only 102. (I'm eliding a lot of important detail here; please do read the original study!)

Rather than discuss sharing of code today, however, I'd like to talk about sharing data, and particularly sharing data in malware analysis.

For static analysis of malware, sharing the malware executable is usually sufficient to satisfy the requirement for releasing data; anyone can then go and look at the same static code and reach the same conclusions by following the author's description. A number of sites exist to provide access to such malware samples, such as VirusShare, OpenMalware, and Contagio.

The data associated with a dynamic analysis is more difficult to share. Software execution is by nature ephemeral: each run of a program may be slightly different based on things like timings, the availability of network servers, the versions of software installed on the machine, and more. This problem is especially apparent with malware, which typically has a short "shelf life". Many malware samples need to contact their command and control servers to operate, and these C&C servers often disappear within days or weeks after a piece of malware is released. Malware may even be designed to "self-destruct" after a certain date, exiting immediately if it is run too long after its creation.

Thus, a researcher who tries to reproduce a dynamic malware analysis by running a sample from last year will almost certainly discover that the malware no longer has the behavior originally seen. As a result, most dynamic analyses of malware are currently not reproducible in any meaningful sense.

Record and replay provides a solution. As I have discussed in the past, record and replay allows one to reproduce a whole-system dynamic execution by creating a compact log of the nondeterministic inputs to a system. These logs can be shared and then replayed in PANDA, allowing anyone to re-run the exact execution and be assured that every instruction will be executed exactly the same way.
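
To make the idea concrete, here is a toy sketch of record and replay. It is not how PANDA is implemented; the names `run`, `Recorder`, and `Replayer` are invented for illustration. The point is only that if every nondeterministic input is logged, feeding the log back in reproduces the same execution.

```python
import random

def run(read_input, steps=5):
    """A deterministic computation whose only nondeterminism is read_input()."""
    state = 0
    for _ in range(steps):
        state = (state * 31 + read_input()) % 1_000_003
    return state

class Recorder:
    """Wraps a nondeterministic source and logs every value it returns."""
    def __init__(self, source):
        self.source = source
        self.log = []

    def __call__(self):
        value = self.source()
        self.log.append(value)
        return value

class Replayer:
    """Feeds back previously logged values in their original order."""
    def __init__(self, log):
        self._values = iter(log)

    def __call__(self):
        return next(self._values)

# Record one run, then replay it from the log alone.
recorder = Recorder(lambda: random.randrange(256))
recorded_result = run(recorder)
replayed_result = run(Replayer(recorder.log))
print(recorded_result == replayed_result)
```

The log here is just a list of five integers, which hints at why such logs can be compact: only the inputs are stored, not the computation.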

To put my malware where my mouth is, I've set up a site where, every day, 100 new malware record/replay logs and associated PCAPs will be posted. This is currently something of a trial run, so there may be some changes as I shake out the bugs; in particular, I hope to give it a nicer interface than just a brute listing of all the MD5s. Check it out:


Here are some ideas for what to do with this data:
  1. Create movies of all the malware executions and watch them to see if there's anything interesting. For example, here's a hilarious extortion attempt from last night:
  2. Use something like TZB to find all printable strings accessed in memory throughout the entire execution, and build a search engine that indexes all of these strings, so you could search for "bitcoin" and find all the bitcoin stealing samples in the corpus.
  3. Create system call traces and then use them to automatically apply behavioral labels to the corpus.
  4. Go apply your expertise in machine learning to do something really cool that I haven't even thought of because I'm bad at machine learning, without having to set up your own malware analysis platform.
I'm really excited to see what we can accomplish.
The malware recordings are graciously hosted by the Georgia Tech Information Security Center, who are also providing me with access to malware samples. Thanks in particular to Paul Royal and Adam Allred for helping me make this a reality after I pitched it at CSAW THREADS.

Wednesday, November 26, 2014

Replaying Regin in PANDA

Regin, a piece of state-sponsored malware that may have been used to attack telecoms and cryptographers, has recently come to light. There are several good writeups out there, and I encourage you to check them out.

Getting access to samples in cases like this is often a challenge. Luckily, both The Intercept and VXShare (warning: both links contain live malware) have released samples thought to be associated with Regin, so that others can perform independent analysis. So far, it appears that the samples are all of the "stage1" component of the malware, rather than the initial "stage0" infector or the later stages.

In order to allow others to do dynamic analysis of this malware, I built a very small malware sandbox setup using PANDA. The sandbox essentially just executes a sample for five minutes, recording it using PANDA's record and replay facility. The process is slightly complicated by the fact that most of the stage1 samples are kernel-mode components; to (hopefully) deal with this I use the sc utility to create and start a service with the malware sample.

So, for normal executables:

start sample.exe

And for the kernel mode components:

sc create sample binPath= sample.exe type= kernel
sc start sample

So, without further ado, here are the recordings, associated PCAPs, and videos of the samples being executed:


The index.txt file shows the mapping between the original sample names and the auto-generated names used by the malware sandbox, along with the MD5s of each sample. Note that I have not tried to ensure that these samples really are Regin, and at least one (sample ID 26ed64ef-fcde-4171-99aa-e1e46301315d, MD5 0e783c9ea50c4341313d7b6b4037245b) seems to in fact be a QQ info stealer. There are also a few duplicates due to overlaps in the samples provided by The Intercept and VXShare; I have kept both in case a differential analysis between two runs turns out to be useful.

Happy malware analysis! And if you have more samples, please get in touch on Twitter (@moyix) or email me!

Monday, October 6, 2014

PANDA VM Updated

By popular request, I've updated the PANDA VM to a more recent version of PANDA. Get it here:


The version in the VM is based on Git revision 28787825aaf514da22e11650fdfca3ba82b9fc57.


Thursday, July 3, 2014

Breaking Spotify DRM with PANDA

Disclaimer: Although I think DRM is both stupid and evil, I don't advocate pirating music. Therefore, this post will stop short of providing a turnkey solution for ripping Spotify music, but it will fully describe the theory behind the technique and its implementation in PANDA. Don't be evil.

Update 6/6/2014: The following post assumes you know what PANDA is (a platform for dynamic analysis based on QEMU). If you want to know more, check out my introductory post on PANDA.

This past weekend I spoke at REcon, a conference on reverse engineering held every year in Montreal. I had a fantastic time there getting to meet other people interested in problems of memory analysis, reverse engineering, and dynamic analysis. One of the topics of my REcon talk was how to use PANDA to break Spotify DRM, and since the video from the talk won't be posted for a while, I thought I'd write up a post showing how we can use PANDA and statistics to pull out unencrypted OGGs from Spotify.

Gathering Data

The first step is to gather some data. We want to know what function inside Spotify is doing the actual decryption of the songs, so that we can then hook it and pull out the decrypted (but not decompressed) audio file. So to start with, we'll take a recording of Spotify playing a song; we can then apply whatever analysis we want to the replay. Working with a replay rather than a live system will also make our job considerably easier – no need to worry that we're going to slow things down enough to trip anti-debugging measures or network timeouts. I've prepared a record/replay log of Spotify playing 30 seconds of a song, which you can use to follow along with what comes next. The recording is 12 billion instructions, which gives us a lot of data to work with!

Just for fun, here's a movie of that replay, generated by taking screenshots throughout the replay and then stitching them into a video:

Some Theory

The next challenge is to figure out how we can identify the function that takes in encrypted data and outputs decrypted data. For this we turn to the excellent work of Ruoyu Wang, Yan Shoshitaishvili, Christopher Kruegel, and Giovanni Vigna [1]. Their clever insight was that when you look at the distribution of bytes in encrypted vs. compressed streams, the byte entropy of the two is very similar, but compressed streams don't look very random. To illustrate this, let's look at the histograms for an encrypted mp3 file, and its decrypted version. First, encrypted:

Now the same file, decrypted:

You can clearly see that the one on the bottom looks significantly less "random" – or more precisely, the distribution of bytes is not very uniform. However, if we compute the byte entropy of each, they are both very close to the theoretical maximum of 8 bits per byte – the mp3 has 7.968480 bits of entropy per byte, whereas the encrypted file has 7.999981 bits per byte.

We can make this intuition more precise by turning to statistics. The Pearson chi-squared test (χ2) lets us compute a value for how much an observed distribution deviates from some ideal distribution. In this case, we expect the bytes in an encrypted file to be uniformly random, so we can compare with the uniform distribution by computing:

$$\chi^2 = \sum_{i=0}^{255} \frac{(O_i - E_i)^2}{E_i}$$

Here, O_i is the observed frequency (count) of each byte, and E_i is the expected frequency, which for a uniform byte distribution with n samples will be (1/256)*n.

Similarly, the entropy of some observed data can be computed as:

$$H = -\sum_{i=0}^{255} p(x_i) \log_2 p(x_i)$$

Where p(x_i) is the observed relative frequency of each byte value in the data.

Based on the work of Wang et al., if we find a function that reads a lot of high-entropy, highly random data, and writes a lot of high-entropy, non-random data, that's likely to be our guy!
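
Both statistics are easy to compute from a 256-bin byte histogram. The following is a minimal sketch of the χ2 and entropy calculations described above; the function names are mine and are not taken from any PANDA script.

```python
import math
from collections import Counter

def byte_histogram(data):
    """Count how many times each of the 256 byte values occurs in data."""
    counts = Counter(data)
    return [counts.get(b, 0) for b in range(256)]

def chi_squared(hist):
    """Pearson chi-squared statistic against a uniform byte distribution."""
    expected = sum(hist) / 256
    return sum((observed - expected) ** 2 / expected for observed in hist)

def entropy_bits(hist):
    """Shannon entropy in bits per byte (empty bins contribute nothing)."""
    n = sum(hist)
    return -sum((c / n) * math.log2(c / n) for c in hist if c)

# Perfectly uniform data: chi-squared is 0 and entropy is 8 bits per byte.
uniform = byte_histogram(bytes(range(256)) * 500)

# Mildly skewed data: even byte values occur 400 times, odd ones 600 times.
skewed = byte_histogram(
    b"".join(bytes([b]) * (400 if b % 2 == 0 else 600) for b in range(256))
)

for name, hist in (("uniform", uniform), ("skewed", skewed)):
    print(name, chi_squared(hist), entropy_bits(hist))
```

The skewed example still has roughly 7.97 bits of entropy per byte, but its χ2 value jumps from 0 to 5120, which is exactly the gap between the two measures that the technique exploits.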

Enter the PANDA

But enough theory. How do we actually gather the data we need in PANDA? We want some way of gathering, for each function in the replay, statistics on the contents of the buffers it reads and writes. As it happens, PANDA has a plugin called unigrams that will get us the data we want.

The unigrams plugin works by tracking every memory read and write made by the system. When it sees a read or write, it looks up the current process context (i.e., CR3 on x86), program counter, and the callsite of the parent function (this last is done with the help of the callstack_instr plugin). Together, these three pieces of information allow us to put the individual memory access in context and separate out memory accesses made in different program contexts into coherent streams of data. So to gather the raw data we want, we can just run:

x86_64-softmmu/qemu-system-x86_64 -m 1024 -replay spotify \
  -panda-plugin x86_64-softmmu/panda_plugins/panda_callstack_instr.so \
  -panda-plugin x86_64-softmmu/panda_plugins/panda_unigrams.so

This produces two files, unigram_mem_read_report.bin and unigram_mem_write_report.bin. The format of these files isn't terribly interesting, but they can be parsed using the Python code found in the unigram_hist.py script. Essentially, each file consists of many, many rows of data that have the (callsite, program counter, CR3) triple followed by an array of 256 integers giving the number of times each byte was read or written at that point in the code.
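
The bookkeeping behind those rows can be pictured with a few lines of Python. This is only a sketch of the data structure, not the plugin's code (which lives inside PANDA); `on_mem_access` is a hypothetical callback name.

```python
from collections import defaultdict

def new_histogram():
    """One counter per possible byte value."""
    return [0] * 256

# Key: (callsite, program counter, CR3). Value: 256-bin byte histogram.
read_hist = defaultdict(new_histogram)
write_hist = defaultdict(new_histogram)

def on_mem_access(hist, callsite, pc, cr3, data):
    """Hypothetical callback: count each byte touched at this tap point."""
    row = hist[(callsite, pc, cr3)]
    for byte in data:
        row[byte] += 1

# Example: four bytes written at one (callsite, PC, CR3) triple.
on_mem_access(write_hist, 0x00719B84, 0x0042E2ED, 0x3F1AC2E0, b"OggS")
```

Keying on the full triple is what keeps data from different callers of the same code (say, a shared memcpy) in separate histograms.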

Armed with this data, we want to now go through each callsite and look for those that meet the following criteria:

  1. The function both reads and writes a lot of data, in roughly equal amounts.
  2. The byte entropy of the data read is high, and its χ2 value (deviation from random) is low.
  3. The byte entropy of the data written is high, and its χ2 value is high.
This is precisely what the find_drm.py script does. We can run it like so:

./find_drm.py unigram_mem_read_report.bin unigram_mem_write_report.bin

Among its output, we find the following promising candidate:

(00719b84 3f1ac2e0): 3 x 1 combinations
  Read sizes:  44033, 701761, 701761
  Write sizes: 701761
  Read rand:  2.238299, 258.176922, 263.599258
  Write rand: 142018.776009
  Best input/output ratio (0 is best possible): 0.0

This function read two buffers of 701,761 bytes each and wrote one of the same size – given that we played 30 seconds of the song, this looks just about right. The input buffers were highly random, with low χ2 values (recall that in the χ2 test, high numbers mean the data observed is less likely to be random), whereas the output buffer, with a χ2 value of about 142,019, was not very random.
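
The three criteria above can be sketched as a simple filter over a pair of byte histograms. This is my own illustration rather than the logic of find_drm.py, and every threshold (`min_bytes`, `max_size_mismatch`, `min_entropy`, `random_chi2`) is an assumed value chosen only to make the example work.

```python
import math

def chi_squared(hist):
    """Pearson chi-squared statistic against a uniform byte distribution."""
    expected = sum(hist) / 256
    return sum((observed - expected) ** 2 / expected for observed in hist)

def entropy_bits(hist):
    """Shannon entropy in bits per byte."""
    n = sum(hist)
    return -sum((c / n) * math.log2(c / n) for c in hist if c)

def looks_like_decryption(read_hist, write_hist, min_bytes=100_000,
                          max_size_mismatch=0.1, min_entropy=7.5,
                          random_chi2=400.0):
    """Apply the three criteria; all thresholds are illustrative guesses."""
    r, w = sum(read_hist), sum(write_hist)
    # 1. Lots of data read and written, in roughly equal amounts.
    if min(r, w) < min_bytes or abs(r - w) / max(r, w) > max_size_mismatch:
        return False
    # 2 and 3. Both streams must have high byte entropy...
    if min(entropy_bits(read_hist), entropy_bits(write_hist)) < min_entropy:
        return False
    # ...but only the input may look uniformly random.
    return chi_squared(read_hist) <= random_chi2 < chi_squared(write_hist)

# Synthetic histograms: a uniform "ciphertext" and a skewed "compressed" one.
encrypted = [1000] * 256
compressed = [800 if b % 2 == 0 else 1200 for b in range(256)]
print(looks_like_decryption(encrypted, compressed))
```

With these assumed thresholds, the numbers reported for the candidate above would pass: its large reads score around 258 and 264, while its write scores about 142,019.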

Dumping the Data

So how can we confirm our guess? Well, the easiest thing is to simply dump out the data seen at that point. If we go back up to the beginning of the output of the script, we have a list of all the (callsite, program counter, CR3) identifiers for reads and writes that matched our criteria. Looking through the writes for our candidate callsite (00719b84), we find it here:

(00719b84 0042e2ed 3f1ac2e0): 701761 bytes

We can now use another PANDA plugin, tapdump, to dump out all the data flowing through that point in the program. First we create a text file named tap_points.txt in the QEMU directory, and put in it:

00719b84 0042e2ed 3f1ac2e0

Next we run the replay again with the tapdump plugin enabled.

x86_64-softmmu/qemu-system-x86_64 -m 1024 -replay spotify \
  -panda-plugin x86_64-softmmu/panda_plugins/panda_callstack_instr.so \
  -panda-plugin x86_64-softmmu/panda_plugins/panda_tapdump.so

This produces two files, read_tap_buffers.txt.gz and write_tap_buffers.txt.gz, which contain the data read and written at the specified points. If you examine either file with zless, you'll see lots of lines, each made up of a series of addresses followed by a single byte value. Separating out each field of one such line onto its own line and annotating, we get:

0000000082678e78 [Caller 13]
000000008260dcc3 [Caller 12]
000000000071a1a5 [Caller 2]
0000000000719b84 [Caller 1]
000000000042e2ed [PC]
000000003f1ac2e0 [Address space]
000000000b256570 [Write address]
       269882976 [Index]
              4f [Data]

The extra callstack information is included so that, if necessary, more calling context can be used to pull out just the data we're interested in. In our case, however, just one level turns out to be enough. Finally, we want to turn this text file into a binary stream. In the scripts directory, there is a script called split_taps.py which will go through a gzipped tapdump output file and separate out each distinct stream found in the file (based on our usual identifier of (callsite, program counter, CR3)).
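
For a feel of what such a script has to do, here is a minimal sketch. It is not the real split_taps.py. It assumes each line holds whitespace-separated fields in the order annotated above, so that counting from the end of the line gives data, index, address, address space, PC, and innermost caller.

```python
import gzip
from collections import defaultdict

def split_taps(lines):
    """Group tapped bytes by (callsite, pc, cr3), in order of appearance."""
    streams = defaultdict(bytearray)
    for line in lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        # Assumed layout from the end: caller1 pc asid address index data
        callsite, pc, asid = fields[-6], fields[-5], fields[-4]
        streams[(callsite, pc, asid)].append(int(fields[-1], 16))
    return streams

def split_tap_file(path):
    """Read a gzipped tapdump text file and split it into byte streams."""
    with gzip.open(path, "rt") as f:
        return split_taps(f)

# Four synthetic lines in the assumed format, all from one tap point.
sample = [
    "0000000000719b84 000000000042e2ed 000000003f1ac2e0 000000000b256570 1 4f",
    "0000000000719b84 000000000042e2ed 000000003f1ac2e0 000000000b256571 2 67",
    "0000000000719b84 000000000042e2ed 000000003f1ac2e0 000000000b256572 3 67",
    "0000000000719b84 000000000042e2ed 000000003f1ac2e0 000000000b256573 4 53",
]
for key, data in split_taps(sample).items():
    print(".".join(key), bytes(data))
```

Indexing fields from the end of the line means the variable number of leading caller addresses does not matter when only one level of calling context is needed.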

So now we can run split_taps.py on the writes seen at our candidate for the decryption function:

./split_taps.py write_tap_buffers.txt.gz spotify

And obtain spotify.0000000000719b84.000000000042e2ed.000000003f1ac2e0.dat, which contains the binary data written at program counter 0x0042e2ed, called from callsite 0x00719b84, inside of the process with CR3 0x3f1ac2e0. So, is this the audio we seek?

$ file spotify.0000000000719b84.000000000042e2ed.000000003f1ac2e0.dat 

spotify.0000000000719b84.000000000042e2ed.000000003f1ac2e0.dat: Ogg data
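
If you don't have the file utility handy, a rough cross-check is to look for the four-byte capture pattern "OggS" that begins every Ogg page. A small sketch (the helper names are made up, and this checks far less than file does):

```python
def has_ogg_magic(header):
    """True if the given bytes start with the Ogg capture pattern."""
    return header[:4] == b"OggS"

def file_looks_like_ogg(path):
    """Read the first four bytes of a file and test them for the pattern."""
    with open(path, "rb") as f:
        return has_ogg_magic(f.read(4))
```

Calling file_looks_like_ogg on the .dat file produced above should agree with what file reports.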

This looks good! Of course, the proof of the pudding is in the eating, and the proof of the audio is in the listening, so do...

$ cvlc spotify.0000000000719b84.000000000042e2ed.000000003f1ac2e0.dat

And you should hear a rather familiar tune :)

Concluding Thoughts

As I mentioned in the disclaimer, this by itself is just the starting point for what you would need to really break Spotify's DRM. It doesn't give you a way to obtain the key for each song and decrypt it wholesale. Instead, you would have to place a hook in the function identified by this process and pull the decrypted audio out as it's played, which limits you to realtime decryption (and Spotify's packing and anti-debugging may make it hard to place the hook in the first place!). Although I can certainly imagine more efficient processes, I think for now this is a nice balance between enabling piracy and showing off the power of PANDA.

If you now want to get a better understanding of the function we found inside Spotify, you can create a memory dump, extract the unpacked Spotify binary (which is packed with Themida) using Volatility, and then load it up in IDA and go to 0x0042e2ed, which is the location where decrypted data is written out.


One may wonder what happens when the function that contains 0x0042e2ed is called by others. As it turns out, this appears to be a generic decryption function that is used for other media throughout Spotify, including album art! It is left as an exercise to the reader to dump and examine the rest of the data that this function decrypts.


[1] Steal This Movie: Automatically Bypassing DRM Protection in Streaming Media Services. Wang, R., Shoshitaishvili, Y., Kruegel, C., and Vigna, G. USENIX Security Symposium, Washington, D.C., 2013.

Tuesday, January 28, 2014

PANDA, Reproducibility, and Open Science

tl;dr: PANDA now supports detached replays (you don't need the underlying VM image to run a replay), and they can be shared at a new site called PANDA Share. Hooray for reproducibility!

One of the most inspiring developments of the past few years has been the push for open science, the movement to ensure that scientific publications, data, and software are freely available to all. In computer science, a big part of this has been a trend towards making software and experimental data available once a paper has been published, so that others can verify experiments and "stand on the shoulders of giants" by extending the software. There have also been initiatives aimed at making sure that the results of experiments in computer science can be replicated.

In the latest release of PANDA, our Platform for Architecture-Neutral Dynamic Analysis, we've taken an important step in ensuring that experiments in dynamic analysis can be freely shared and replicated: as of commit 9139261d70, PANDA creates and loads standalone record/replay logs. This means that you can create a recording of an execution and then share it with others, and they will be able to precisely duplicate the same execution on their own machine, down to the last instruction. Any of PANDA's plugins can be applied to such executions, allowing new analyses to be run on existing, shared executions.

What does this enable? To start with, this makes it possible to share experimental data from research in dynamic analysis. In our paper Tappan Zee (North) Bridge, we performed many experiments that showed how to find useful points to hook in an OS; however, because these were based on executions that were tied to virtual machine disk images, we weren't able to share the data necessary to exactly reproduce our experiments (since that would require sharing a Windows VM with proprietary software). Now, however, we can simply share the detached recordings for the TZB experiments, allowing anyone to verify, for example, that our plugins can find SSL master secrets in IE8 on Windows. We also hope that collections of interesting recordings can form the basis of new benchmarks for dynamic analysis, allowing different implementations and algorithms to be directly compared by running them against a standard set of executions.

Aside from the benefits to reproducibility of dynamic analyses, we hope that this will also permit the creation and sharing of interesting executions that can then be studied by the whole community. For example, we are releasing today a recording of the FBI-authored shellcode that was recently used to identify Tor users connecting to sites hosted by Freedom Hosting. This means that anyone can re-run the recording and analyze every instruction executed by the shellcode to confirm for themselves the information that has appeared in public writeups.

To provide a central location for sharing interesting executions, we have created a site called PANDA Share where PANDA recordings can be uploaded. Each recording comes with a short description and the command line for PANDA needed to reproduce the execution. Right now, the repository contains the recordings of our Tappan Zee Bridge experiments, and the FBI shellcode recording. We are planning to add many more soon, and hope that others will share their own!