Posts

Showing posts with the label data

The LAVA Synthetic Bug Corpora

I'm planning a longer post discussing how we evaluated the LAVA bug injection system, but since we've gotten approval to release the test corpora I wanted to make them available right away. The corpora described in the paper, LAVA-1 and LAVA-M, can be downloaded here: http://panda.moyix.net/~moyix/lava_corpus.tar.xz (101M)

Quoting from the included README:

This distribution contains the automatically generated bug corpora used in the paper, "LAVA: Large-scale Automated Vulnerability Addition".

LAVA-1 is a corpus consisting of 69 versions of the "file" utility, each of which has had a single bug injected into it. Each bug is a named branch in a git repository. The triggering input can be found in the file named CRASH_INPUT. To run the validation, you can use validate.sh, which builds each buggy version of file and evaluates it on the corresponding triggering input.

LAVA-M is a corpus consisting of four GNU coreutils programs (base64, md5sum,...

(Sys)Call Me Maybe: Exploring Malware Syscalls with PANDA

System calls are of great interest to researchers studying malware, because they are the only way that malware can have any effect on the world: writing files to the hard drive, manipulating the registry, sending network packets, and so on all must be done by making a call into the kernel. In Windows, the system call interface is not publicly documented, but there have been lots of good reverse engineering efforts, and we now have full tables of the names of each system call; in addition, by using the Windows debug symbols, we can figure out how many arguments each system call takes (though not yet their actual types). I recently ran 24,389 malware replays under PANDA and recorded all the system calls made, along with their arguments (just the top-level argument, without trying to descend into pointer types or dereference handle types). So for each replay, we now have a log file that looks like:

3f9b2340 NtGdiFlush
3f9b2340 NtUserGetMessage 0175feac 00000000 00000000 000000...
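To make the log format concrete, here is a minimal Python sketch of how such a file could be read. It assumes each record is one whitespace-separated line: a leading hex value, the system call name, then zero or more hex arguments. The excerpt does not say what the first column means, so the code treats it as an opaque tag. The names `SyscallRecord`, `parse_line`, and `count_syscalls` are made up for illustration, and the second sample line is shortened, not a real record.

```python
from collections import Counter
from typing import Iterable, NamedTuple, Tuple


class SyscallRecord(NamedTuple):
    tag: int                # first hex column; its meaning is an assumption-free opaque value here
    name: str               # system call name, e.g. "NtGdiFlush"
    args: Tuple[int, ...]   # top-level arguments, parsed as hex integers


def parse_line(line: str) -> SyscallRecord:
    """Parse one log line of the assumed form '<hex> <name> [<hex arg> ...]'."""
    fields = line.split()
    if len(fields) < 2:
        raise ValueError(f"not a syscall record: {line!r}")
    tag = int(fields[0], 16)
    args = tuple(int(a, 16) for a in fields[2:])
    return SyscallRecord(tag, fields[1], args)


def count_syscalls(lines: Iterable[str]) -> Counter:
    """Count how often each system call name appears, skipping blank lines."""
    counts: Counter = Counter()
    for line in lines:
        if line.strip():
            counts[parse_line(line).name] += 1
    return counts


if __name__ == "__main__":
    # Illustrative input only; the second line is a shortened, hypothetical record.
    sample = [
        "3f9b2340 NtGdiFlush",
        "3f9b2340 NtUserGetMessage 0175feac 00000000",
    ]
    for name, n in count_syscalls(sample).most_common():
        print(name, n)
```

Keeping the arguments as plain integers matches the "top-level argument only" recording described above: nothing here tries to interpret a value as a pointer or a handle.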

One Weird Trick to Shrink Your PANDA Malware Logs by 84%

When I wrote about some of the lessons learned from PANDA Malrec's first 100 days of operation, one of the things I mentioned was that the storage requirements for the system were extremely high. In the four months since, the storage problem only got worse: as of last week, we were storing 24,000 recordings of malware, coming in at a whopping 2.4 terabytes of storage. The amount of data involved poses problems not just for our own storage but also for others wanting to make use of the recordings for research. 2.4 terabytes is a lot, especially when it's spread out over 24,000 HTTP requests. If we want our data to be useful to researchers, it would be great if we could find better ways of compressing the recording logs. As it turns out, we can! The key is to look closely at what makes up a PANDA recording:

The log of non-deterministic events (the -rr-nondet.log files)
The initial QEMU snapshot (the -rr-snp files)

The first of these is highly redundant and actually ...