Wednesday, May 7, 2008

Parsing Windows Minidumps

When a user-mode application crashes in Windows, a built-in debugger known as "Dr. Watson" steps in and captures some basic information that can be sent back to developers to help debug the crash. As part of this process, it creates what's called a minidump that contains portions of the process's memory and a great deal of extra information about the state and attributes of the process. Among the information available is:

  • CPU state for each thread.

  • A list of loaded modules, including their timestamps.

  • The Process Environment Block (PEB) for the process.

  • Basic system information, such as the build number and service pack level of the operating system.

  • The process creation time, and how long it has spent executing in kernel and user space.

  • Detailed information on the exception that was raised.


Using the userdump.exe utility provided by Microsoft, it is also possible to take a complete snapshot of the memory of any running process. As it turns out, this tool also stores its output in the minidump format. In addition to all the information available in a standard minidump, minidumps made with this tool include the full process memory and (with the -w option) a list of window handles.

Unlike many Microsoft formats, the minidump container format is fully documented by Microsoft. The relevant data structures and constants can all be found in dbghelp.h, and explanations of each field can be found on MSDN. The basic structure of the file is simple: it starts with the MINIDUMP_HEADER, which gives the number of streams and the offset of the stream directory (a list of MINIDUMP_DIRECTORY structures). Each directory entry has a type code (indicating what the stream is for), the size of the stream, and the offset in the file where the stream begins. Don't be scared by the use of the term "relative virtual address" (RVA) in the Microsoft documentation; in this context, it just means "offset from the beginning of the file".
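To make the layout concrete, here is a minimal sketch of walking the header and stream directory using only Python's struct module. It is not the code from the library described in this post; the names HEADER, DIRECTORY, StreamEntry and read_stream_directory are made up for illustration, and the field layout is taken from the MINIDUMP_HEADER and MINIDUMP_DIRECTORY definitions in dbghelp.h.

```python
import struct
from collections import namedtuple

# MINIDUMP_HEADER: Signature, Version, NumberOfStreams, StreamDirectoryRva,
# CheckSum, TimeDateStamp (each 32 bits), followed by a 64-bit Flags field.
HEADER = struct.Struct("<4sIIIIIQ")

# MINIDUMP_DIRECTORY: StreamType, then a MINIDUMP_LOCATION_DESCRIPTOR
# made up of DataSize and Rva (each 32 bits).
DIRECTORY = struct.Struct("<III")

# Illustrative container for one directory entry (not part of any library).
StreamEntry = namedtuple("StreamEntry", "stream_type size offset")


def read_stream_directory(data):
    """Return the stream directory entries found in a minidump image."""
    fields = HEADER.unpack_from(data, 0)
    signature, count, dir_rva = fields[0], fields[2], fields[3]
    if signature != b"MDMP":
        raise ValueError("not a minidump: bad signature %r" % (signature,))
    # The directory is a packed array of entries starting at dir_rva,
    # which is simply an offset from the beginning of the file.
    return [
        StreamEntry(*DIRECTORY.unpack_from(data, dir_rva + i * DIRECTORY.size))
        for i in range(count)
    ]
```

To try it on a real dump, read the file in binary mode (for example open(path, "rb").read(), where path is whatever dump you have on hand) and pass the resulting bytes to read_stream_directory.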

The format is not only openly documented, it is also extensible: any application can add a new stream type (using the type codes above the reserved range 0x0000-0xFFFF) and thereby include any sort of extra data in the minidump. The open-source, cross-platform crash reporter, Google Breakpad, actually uses the minidump container as its native crash dump format on all platforms. The project's source includes a set of C++ classes that can parse and work with minidump files, which can be instructive in clearing up any ambiguities in the MS-provided documentation. One final (and somewhat unexpected) source of information is the United States patent on generating minidumps. Putting aside the fact that patenting the process of saving some context to a container format after a crash seems pretty silly, the patent description is full of interesting technical details.

For memory analysis purposes, it is useful to understand the minidump format, as it is the format used by the userdump utility to save the full address space of a process. For minidumps written by userdump.exe, the actual memory ranges are described in the Memory64ListStream stream (type code 9). The stream gives the base offset in the file where the process's memory can be found, and then has a list of structures that give the size and virtual address of each memory region. It is not necessary to give the file offset for each memory range, since they are all contiguous: the second memory range described appears in the file directly after the end of the first. Additional information on each memory range is found in the MemoryInfoListStream (type code 16), which lists the protection attributes (read-only, writable, executable), state (free, reserved, or committed) and type (image, mapped file, or private allocation) for each range addressable by the process.
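As a rough sketch of how the contiguous layout works out in practice, the function below reads a Memory64ListStream and computes the file offset of each range by keeping a running total of the sizes. The function name and the tuple layout it returns are my own choices for illustration; the structure fields follow MINIDUMP_MEMORY64_LIST and MINIDUMP_MEMORY_DESCRIPTOR64 as documented by Microsoft.

```python
import struct

MEMORY64_LIST_STREAM = 9  # stream type code for Memory64ListStream


def parse_memory64_list(data, stream_offset):
    """Return a list of (virtual_address, size, file_offset) tuples."""
    # MINIDUMP_MEMORY64_LIST starts with two 64-bit fields:
    # NumberOfMemoryRanges and BaseRva (file offset of the first range).
    count, base_rva = struct.unpack_from("<QQ", data, stream_offset)
    ranges = []
    file_offset = base_rva
    for i in range(count):
        # Each MINIDUMP_MEMORY_DESCRIPTOR64 is two 64-bit fields:
        # StartOfMemoryRange (virtual address) and DataSize.
        start, size = struct.unpack_from("<QQ", data, stream_offset + 16 + i * 16)
        ranges.append((start, size, file_offset))
        # Ranges are stored back to back, so the next one begins
        # where this one ends.
        file_offset += size
    return ranges
```

Here stream_offset is the offset recorded in the directory entry whose type code is 9.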

From this information we can reconstruct the entire memory space for a given process, and then examine its virtual address space to find interesting artifacts, such as its list of loaded modules (accessible through the Process Environment Block, or PEB) or any application-specific data it was working with (a notable example would be passwords or encryption key data, as demonstrated at CanSecWest this year). It should be fairly easy to create an address space class within Volatility that can read minidumps, at which point any of the Volatility modules that work with user-mode data (currently just dlllist, but more are expected in the future) will be usable on minidumps generated by userdump.exe.
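The core of such an address space is just a lookup from virtual address to file offset. The class below is a standalone sketch of that idea and does not use Volatility's actual address space interface; the class name, the vtop and read methods, and the (virtual_address, size, file_offset) tuple format are all assumptions made for the example.

```python
import bisect
import io


class MinidumpAddressSpace(object):
    """Translate virtual addresses to file offsets (hypothetical helper)."""

    def __init__(self, fileobj, ranges):
        # ranges: iterable of (virtual_address, size, file_offset) tuples.
        self.fileobj = fileobj
        self.ranges = sorted(ranges)
        self.starts = [r[0] for r in self.ranges]

    def vtop(self, vaddr):
        """Return the file offset backing vaddr, or None if it is absent."""
        # Find the last range that starts at or before vaddr.
        i = bisect.bisect_right(self.starts, vaddr) - 1
        if i < 0:
            return None
        start, size, file_offset = self.ranges[i]
        if vaddr >= start + size:
            return None
        return file_offset + (vaddr - start)

    def read(self, vaddr, length):
        """Read length bytes at vaddr, or return None if the span is not
        fully backed by contiguous data in the dump."""
        if length <= 0:
            return b""
        offset = self.vtop(vaddr)
        if offset is None or self.vtop(vaddr + length - 1) != offset + length - 1:
            return None
        self.fileobj.seek(offset)
        return self.fileobj.read(length)


# Tiny in-memory demonstration: one 4-byte range at virtual address
# 0x401000, stored at file offset 8.
demo = MinidumpAddressSpace(io.BytesIO(b"\x00" * 8 + b"MZ\x90\x00"), [(0x401000, 4, 8)])
```

The binary search keeps lookups cheap even when a full dump contains thousands of ranges; a real Volatility address space would also need to follow that project's own conventions.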

Rather than go into the gory details of the data structures involved in parsing each stream, I have decided to simply release a library written using Python and Construct. The library can be downloaded here; currently every stream type listed in Microsoft's documentation is fully parsed. The library also supports the "Window Handle" stream created by userdump.exe (stream type 0x10000), although some fields are still unknown as they are undocumented (specifically, there are four unknown DWORDs that I have been unable to decipher -- if anyone has any suggestions as to the structure, I would love to hear them!).

You can also run minidump.py as a command line program, and it will print out the entire parsed structure of the minidump, including thread context, open handles, system information, and loaded modules. Enjoy!

18 comments:

Keydet89 said...

Very cool stuff!

Keydet89 said...

Hey, just ran through install and setup for userdump.exe...have you taken a look yet at getting it set up for use on a CD, say in IR?

Alan Keister said...

I can't get minidump.py to work. Does it require a specific version of Python?

Debugging exception of Struct('MINIDUMP_HEADER'):
File "C:\Python26\lib\site-packages\construct\debug.py", line 111, in _parse
return self.subcon._parse(stream, context)
File "C:\Python26\lib\site-packages\construct\core.py", line 522, in _parse
subobj = sc._parse(stream, context)
File "C:\Python26\lib\site-packages\construct\core.py", line 822, in _parse
obj = self.subcon._parse(stream, context)
File "C:\Python26\lib\site-packages\construct\core.py", line 335, in _parse
raise ArrayError("expected %d, found %d" % (count, c), ex)
ArrayError: ('expected 9, found 0', FieldError(FieldError('expected 4, found 0',),))

(you can set the value of 'self.retval', which will be returned)
> c:\python26\lib\site-packages\construct\core.py(335)_parse()
-> raise ArrayError("expected %d, found %d" % (count, c), ex)

moyix said...

Alan,

Most likely, you have encountered a type of minidump the script doesn't handle (i.e., a bug :) ). Could you provide an example of a minidump that shows this problem so I can reproduce it and hopefully fix it?

Thanks,
Brendan

Alan Keister said...

Sure. How can I get it to you? It was generated by Google Breakpad.

moyix said...

You can either email it to me at mooyix@gmail.com, or if necessary I can set up an anonymous FTP for you to drop it in.

Dirk Stoop said...

Hi Brendan,

Thanks for creating minidump.py :)

I've been experimenting a bit with it over the past week or so, and am currently stuck trying to figure out how to get call stacks for each thread from the data structure created by minidump.py (doesn't need to be symbolicated, that'll be my next to-do).

I'm building an online crash reporting app, that I intend to host on google app engine, so being able to actually parse minidump files completely in Python would be awesome.

After reading some of the docs on the minidump file format over at MSDN it seems to me that individual frames should be extractable from each MINIDUMP_THREAD's Stack, but beyond that I don't really have a clue how to actually extract that info using minidump.py.

Parsing the same dump file with "minidump_stackwalk" (part of Google Breakpad) produces a neat backtrace for every thread, so the dump itself does contain the needed information.

Can you maybe give me a pointer in the right direction? I think (and hope) I'm missing something really obvious..

thx!
- Dirk

Dirk Stoop said...

Oh yeah, in case you're interested in it, this is the current (very early) version of the web-app I mentioned:

http://appnalyze.appspot.com/

Martin said...

Awesome piece of code :) Cheers!

However, one problem that I stumbled on (which I believe is Alan's problem as well) is that if you run minidump.py from the command line it will not open the file in binary mode but in text mode. This can cause the parsing to go all out of whack if it encounters the 2 bytes that resemble a newline character.

Bruce said...

The problem that Alan and Martin reported can be trivially fixed. Just add the "rb" mode to the last line of minidump.py, like this:

print MINIDUMP_HEADER.parse_stream(open(sys.argv[1], "rb"))

The formatting of the output could certainly use some work. It seems to dump the entire minidump as one line, which is unusable for all but the tiniest of dumps. Still, a useful starting point.

moyix said...

Yes, it looks like that is indeed the issue here. I'd only tested the script on UNIX-based systems, where binary vs. text mode is not an issue.

The output is indeed voluminous! Its main use is as a library; just do:

import minidump
# "crash.dmp" is a placeholder name; open the dump in binary mode
f = open("crash.dmp", "rb")
parsed_minidump = minidump.MINIDUMP_HEADER.parse_stream(f)

The resulting object will contain all the information in the minidump.

Sameer said...

Pardon my noobish question (I am a total Python noob), but I'm getting a syntax error while running the minidump script with Python 3.2.2.

>> File "minidump.py", line 823
>> print MINIDUMP_HEADER.parse_stream(open(sys.argv[1]))
>> ^
>> SyntaxError: invalid syntax

Any clues as to what might be wrong? Thank you.

Brendan Dolan-Gavitt said...

Hi Sameer,

The script was written for Python 2.x, and there have been changes in the language between the two versions. It should work with the most recent version of the 2.x series (which is 2.7.3 as of this writing).

Martin said...

Hey, it's been a year but I got back round to working with crashdumps :)

Found a bug in the exception stream (which was stopping me building a callstack for the exception). On line ~654 ExceptionInformation is defined as an array of NumberParameters but the actual size of the array is always EXCEPTION_MAXIMUM_PARAMETERS, which on my machine is 15. (http://msdn.microsoft.com/en-us/library/windows/desktop/ms680367(v=vs.85).aspx)

Changing the code to a hard coded 15 allows me to pull the right exception threadcontext and so get the stack.

Wanda Tinaksy said...

The links to the parser are broken. Could you update them when you get a chance?

Michael Boman said...

The links to minidump.py are MIA, could you please re-upload them somewhere?

Michael Boman said...

I found a cached copy of minidump.py in Google's cache. I have pasted it to http://pastebin.com/AZD1HCty

Google cache URL is http://webcache.googleusercontent.com/search?q=cache%3Ahttp%3A%2F%2Fkurtz.cs.wesleyan.edu%2F~bdolangavitt%2Fmemory%2Fminidumps%2Fminidump.py&oq=cache%3Ahttp%3A%2F%2Fkurtz.cs.wesleyan.edu%2F~bdolangavitt%2Fmemory%2Fminidumps%2Fminidump.py&aqs=chrome..69i57j69i58.3468j0&sourceid=chrome&ie=UTF-8

Brendan Dolan-Gavitt said...

Hi,

The links to the parser have been fixed. The new location is:

http://amnesia.gtisc.gatech.edu/~moyix/minidump.py

Cheers,
Brendan