One Weird Trick to Shrink Your PANDA Malware Logs by 84%
When I wrote about some of the lessons learned from PANDA Malrec's first 100 days of operation, one of the things I mentioned was that the storage requirements for the system were extremely high. In the four months since, the storage problem only got worse: as of last week, we were storing 24,000 recordings of malware, coming in at a whopping 2.4 terabytes of storage.
The amount of data involved poses problems not just for our own storage but also for others wanting to make use of the recordings for research. 2.4 terabytes is a lot, especially when it's spread out over 24,000 HTTP requests. If we want our data to be useful to researchers, it would be great if we could find better ways of compressing the recording logs.
As it turns out, we can! The key is to look closely at what makes up a PANDA recording:
- The log of non-deterministic events (the -rr-nondet.log files)
- The initial QEMU snapshot (the -rr-snp files)
The first of these is highly redundant and actually compresses quite well already – the xz compression used by PANDA's rrpack.py usually manages to get around a 5-6X reduction for the nondet log. The snapshots also compress pretty well, at around 4X.
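To get a feel for these ratios on your own files, here is a minimal sketch using Python's standard `lzma` module, which writes the same xz container format. The helper name `xz_ratio` and the preset are my own choices for illustration; this is not how rrpack.py is implemented.

```python
import lzma


def xz_ratio(data: bytes, preset: int = 6) -> float:
    """Return original size divided by xz-compressed size for `data`.

    `preset` is the usual xz level (0-9); 6 is the lzma module's default.
    """
    compressed = lzma.compress(data, format=lzma.FORMAT_XZ, preset=preset)
    return len(data) / len(compressed)


if __name__ == "__main__":
    # Synthetic, highly repetitive stand-in for a log; real nondet logs
    # will compress far less dramatically than this.
    fake_log = b"interrupt 0x20 at instr 00000000\n" * 10_000
    print(f"compression ratio: {xz_ratio(fake_log):.1f}x")
```

Reading a real `-rr-nondet.log` file into memory and passing its bytes to `xz_ratio` gives a number you can compare against the 5-6X figure above.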
So where can we find further savings? The trick is to notice that for the malware recordings, each run is started by first reverting the virtual machine to the same state. That means that the initial snapshot files for our recordings are almost all identical! In fact, if we do a byte-by-byte diff, the vast majority differ by only a few bytes – most likely a timer value that increments in the short time between when we revert to the snapshot and begin our recording.
With this observation in hand, we can store the malware recordings in a new format. The nondet log is still compressed with xz, but the snapshot for each recording is now stored as a binary diff against a reference snapshot. Because we have two separate recording platforms and have changed the initial environment used by Malrec a few times, we need 8 reference snapshots in total – but this is a huge improvement over storing 24,000 snapshots! The binary diff for each recording then requires only a handful of bytes to specify.
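The idea of diffing against a reference snapshot can be sketched in a few lines of Python. This is only an illustration: it assumes the two snapshots have the same length, and the `(offset, bytes)` patch layout and the names `make_diff` and `apply_diff` are hypothetical, not the format actually used in the archive.

```python
from typing import List, Tuple

# A patch is a list of (offset, replacement bytes) runs.
Patch = List[Tuple[int, bytes]]


def make_diff(reference: bytes, snapshot: bytes) -> Patch:
    """Record every run of bytes where `snapshot` differs from `reference`."""
    if len(reference) != len(snapshot):
        raise ValueError("this sketch assumes equal-length snapshots")
    patch: Patch = []
    i, n = 0, len(snapshot)
    while i < n:
        if reference[i] == snapshot[i]:
            i += 1
            continue
        start = i
        # Extend the run until the two snapshots agree again.
        while i < n and reference[i] != snapshot[i]:
            i += 1
        patch.append((start, snapshot[start:i]))
    return patch


def apply_diff(reference: bytes, patch: Patch) -> bytes:
    """Rebuild the original snapshot from the reference plus its patch."""
    out = bytearray(reference)
    for offset, data in patch:
        out[offset:offset + len(data)] = data
    return bytes(out)


if __name__ == "__main__":
    reference = bytes(1024)                 # stand-in for a reference snapshot
    snapshot = bytearray(reference)
    snapshot[100:104] = b"\x01\x02\x03\x04"  # e.g. a timer value that moved
    patch = make_diff(reference, bytes(snapshot))
    assert apply_diff(reference, patch) == bytes(snapshot)
    print(patch)
```

The byte-at-a-time loop is far too slow for multi-hundred-megabyte snapshots in pure Python; a real tool would compare in large chunks or use an existing binary diff utility. The point is just that when only a timer value changes, the patch is tiny compared with the snapshot it reconstructs.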
The upshot of all of this is that a dataset of 24,189 PANDA malware recordings now takes up just 387 GB, a savings of 84%. This is pretty astonishing – the recordings in the archive contain 476 trillion instructions' worth of execution, meaning our storage rate is 1147.5 instructions per byte! As a point of comparison, one recently published instruction trace compression scheme achieved 2 bits per instruction; our compression is 0.007 bits per instruction – though this comparison is somewhat unfair, since that paper can't assume a shared starting point.
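For anyone who wants to check the arithmetic, here is a short sketch that redoes it from the figures quoted in this post. The variable names are mine. Note that the rounded inputs (476 trillion, 387 GB) only land near the 1147.5 figure, which presumably came from exact counts; the result also depends on whether a GB is taken as 10^9 or 2^30 bytes.

```python
# Figures quoted in the post (rounded).
size_before_gb = 2400.0        # 2.4 terabytes
size_after_gb = 387.0
instructions = 476e12          # 476 trillion
quoted_instr_per_byte = 1147.5
other_scheme_bits_per_instr = 2.0

# Fraction of storage saved.
savings = 1 - size_after_gb / size_before_gb

# Bits per instruction implied by the quoted instructions-per-byte rate.
bits_per_instr = 8 / quoted_instr_per_byte

# Instructions per byte recomputed from the rounded inputs, under both
# common interpretations of "GB".
instr_per_byte_decimal = instructions / (size_after_gb * 10**9)
instr_per_byte_binary = instructions / (size_after_gb * 2**30)

if __name__ == "__main__":
    print(f"savings: {savings:.1%}")
    print(f"bits per instruction: {bits_per_instr:.4f}")
    print(f"instr/byte (GB = 10^9 bytes): {instr_per_byte_decimal:.1f}")
    print(f"instr/byte (GB = 2^30 bytes): {instr_per_byte_binary:.1f}")
    print(f"ratio vs. 2 bits/instr: {other_scheme_bits_per_instr / bits_per_instr:.0f}x")
```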
You can download this data set as a single file from our MIT mirror; please share and mirror this as widely as you like! There is a README included in the archive that contains instructions for extracting and replaying any of the recordings. Click the link below to download:
Stay tuned, too – there's more cool stuff on the way. Next time, I'll be writing about one of the things you can do with a full-trace recording dataset like this: extracting system call traces with arguments. And of course that means I'll have a syscall dataset to share then as well :)