Over the past few weeks, I've been continuing to investigate the structure of the Types stream (stream 2) in Microsoft PDB files with the help of Sven Schreiber's PDB parsing code. Some issues with getting approval to publish research came up at work, but I think they're mostly ironed out now, so I'm going to devote this entry to going through some of the trickier bits involved in parsing the Types stream. Some code also accompanies this entry: a Python script to parse and print out the types contained in a stream. It works on streams that have already been extracted from a PDB file (see this earlier entry); if you don't have one around, you can try it out on the Types stream from ntoskrnl.exe on Windows XP SP2.
The Type Stream Header
The types stream begins with a header that gives a few pieces of useful information. The first dword represents the version number of the PDB file, and is generally determined by the version of the compiler that created the PDB file. For XP, the version is 19990903, and for Vista 20040203 (note that the numbers can be read as dates that approximately line up with the release of the version of Visual Studio used to create them).
The next dword gives the size of the header; I have not yet seen a case where this was anything other than 0x38 bytes, but in theory this could allow for extensions to the header without breaking backwards compatibility with older parsers (they could just skip any extra header data). After the header size, there are two more dwords, tiMin and tiMax, that give the numerical index of the first and last type listed in the file.
These type indices are important to understanding the structure of the file format. The first type in a file will be numbered tiMin, the next tiMin+1, and so on up to tiMax. In addition, many types will refer to other types in the same file using their type index; for example, if _MMVAD is type 0x1002, a pointer to an _MMVAD would be a pointer type with its utype (underlying type) field set to 0x1002.
Finally, the header contains a dword at offset 0x10 that gives the size in bytes of the data that follows the header. The size of the whole stream should equal the header size plus the size of the following data, which can be used as a crude sanity check to verify file integrity.
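As a rough illustration of the header fields and the sanity check, here is a minimal Python sketch. It assumes that all five leading fields are little-endian 32-bit values, which puts the data size at offset 0x10; the function names and dictionary keys are my own and do not come from Schreiber's code.

```python
import struct

# Assumed layout: version, header size, tiMin, tiMax, data size,
# each a little-endian 32-bit value (20 bytes in total).
HEADER_FMT = "<IIIII"


def parse_types_header(stream):
    """Return the leading header fields of a Types stream as a dict."""
    version, header_size, ti_min, ti_max, data_size = struct.unpack_from(
        HEADER_FMT, stream, 0
    )
    return {
        "version": version,
        "header_size": header_size,
        "ti_min": ti_min,
        "ti_max": ti_max,
        "data_size": data_size,
    }


def header_is_consistent(stream):
    """Crude integrity check: header size + data size == stream size."""
    hdr = parse_types_header(stream)
    return hdr["header_size"] + hdr["data_size"] == len(stream)
```

Given the raw bytes of an extracted stream, header_is_consistent(stream) returning False is a good hint that the extraction went wrong or that the layout assumption above does not hold for that file.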
There is an additional structure present in the header aside from the ones already mentioned, which Schreiber calls the tpiHash. Unfortunately, I have not been able to determine what purpose the values in it serve; although the names seem to refer to some sort of hash, I have not found such a structure in the type streams available to me.
The structure of the file is quite simple: after the header, we find a sequence of type structures, referred to in the CodeView documentation (more on this later) and Schreiber's code as leaf types. Each leaf type begins with two words, a size and a type. The type here could be seen as a metatype, since it says what kind of type the record describes (a structure, a field list, and so on). It is important to note that the size does not include the size field itself, so the entire structure is actually two bytes longer than the size field says (this tripped me up for a little while).
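The record layout can be walked with a short loop. The following Python sketch assumes little-endian words and that `data` holds only the bytes after the header; the function name and the default starting index are illustrative, not taken from Schreiber's code.

```python
import struct


def iter_leaf_records(data, ti_min=0x1000):
    """Yield (type_index, leaf_type, body) for each record in `data`.

    `data` is the part of the stream that follows the header, and
    `ti_min` should come from the header (the default is a placeholder).
    """
    offset = 0
    type_index = ti_min
    while offset + 4 <= len(data):
        size, leaf_type = struct.unpack_from("<HH", data, offset)
        if size < 2:
            raise ValueError("record at offset %#x is too short" % offset)
        # `size` counts the type word and the body but not the size word
        # itself, so the whole record occupies size + 2 bytes.
        body = data[offset + 4 : offset + 2 + size]
        yield type_index, leaf_type, body
        offset += size + 2
        type_index += 1
```

Numbering the records as they are read is what makes it possible to resolve references later: collecting the results into a dictionary keyed by type index gives a direct lookup for any field that names another type.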
The type field indicates how the leaf type is intended to be parsed. The constants have already been kindly defined by Schreiber in sbs_sdk/include/pdb_info.h in the win_pdbx sources. Likewise, most of the leaf types themselves are defined in that same file, as C structures. The structures appear to be extremely similar to the CodeView format documented in an early format specification from Microsoft, though there appear to have been some changes made to better accommodate 32-bit architectures (many words changed to dwords, and so on). However, certain details of how they are to be parsed are ambiguous, and one must look at how Schreiber actually uses them in his code to see how they should be interpreted. For example, many of the structures end with a char data field, which, upon investigation, actually contains a combination of the structure's name and a numeric value that means different things depending on the leaf type.
In the following sections, I will describe those portions of the file format that seemed obscure or were difficult to figure out just by looking at the data structures defined in Schreiber's code. Due to limited space and reader interest, I will not describe each leaf type in full; those interested in the details can check out my code or look at Schreiber's win_pdbx.
Based on examination of PDB files from Windows XP SP2 and Vista, structures in the types stream are aligned on 32-bit boundaries. This means that if a structure does not align properly, it will be padded out to the appropriate size. However, in the CodeView format, the padding is not junk data or nulls, but takes a particular form: each pad byte inserted looks like (0xF0 | [number of bytes until next structure]). Less formally, the upper four bits are all set, and the lower four give the number of bytes to skip until the next structure. This results in patterns that look like "?? F3 F2 F1 [next structure]".
I am not entirely clear on the reasoning behind this padding scheme. One possibility is that it allows alignment to multiple different boundaries without requiring changes to the parser: rather than rounding up to the next aligned boundary, a parser can just read a single byte to determine how far away the next structure is. Decoding the pad bytes is not strictly necessary, though; Schreiber's parser simply rounds up to the next 4-byte boundary. This scheme does impose an interesting constraint, though: because bytes with values greater than 0xF0 are defined to be padding, no leaf type can have a type code that contains a byte greater than 0xF0.
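Both ways of getting past the padding are easy to sketch in Python. The helper names below are my own; the first decodes the pad byte as described above, and the second just rounds the offset up to a 4-byte boundary.

```python
def skip_padding(buf, offset):
    """Return the offset of the next structure, honoring a pad byte.

    A pad byte has its upper four bits set and its lower four bits
    give the number of bytes to skip, counting the pad byte itself.
    """
    if offset < len(buf) and buf[offset] > 0xF0:
        return offset + (buf[offset] & 0x0F)
    return offset


def align4(offset):
    """Round an offset up to the next multiple of four."""
    return (offset + 3) & ~3
```

For the pattern "?? F3 F2 F1", calling skip_padding at the F3 byte jumps three bytes forward, which lands on the same offset that align4 would compute when structures start on 32-bit boundaries.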
Structures, bitfields, argument lists, and so on all need a way to refer to a list of types that they contain; for example, an _EPROCESS structure will contain a _KPROCESS, _DISPATCHER_HEADER, and so on. To deal with this, types that refer to other types will have a field that gives the type index of their corresponding field list. Field lists (leaf type 0x1203), in turn, have a very simple structure: after the standard size and type fields, the body of the structure is made up of an arbitrary number of leaf types of type LF_MEMBER, LF_ENUMERATE, LF_BCLASS, LF_VFUNCTAB, LF_ONEMETHOD, LF_METHOD, or LF_NESTTYPE. This is somewhat annoying to parse, because the number of substructures is not known in advance, and so the only way to know when the field list is finished is to keep track of how many bytes have been parsed and compare that to the size of the overall structure.
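The "parse until the bytes run out" loop might look like the following Python sketch. It does not know the layout of any particular substructure: parse_subrecord is a hypothetical caller-supplied function that decodes one substructure and reports where it ended. The sketch also assumes that pad bytes of the 0xF0 form may appear between substructures.

```python
def parse_field_list(body, parse_subrecord):
    """Collect the substructures contained in a field list body.

    `body` is the field list record without its size and type words.
    `parse_subrecord(body, offset)` is a hypothetical callback that
    returns (record, next_offset) for the substructure at `offset`.
    """
    records = []
    offset = 0
    # There is no count field, so keep going until every byte is used.
    while offset < len(body):
        if body[offset] > 0xF0:
            # Pad byte: the low four bits say how far to skip.
            offset += body[offset] & 0x0F
            continue
        record, next_offset = parse_subrecord(body, offset)
        if next_offset <= offset:
            raise ValueError("sub-parser made no progress at %#x" % offset)
        records.append(record)
        offset = next_offset
    return records
```

Keeping the per-leaf decoding in a callback means the loop itself stays the same no matter how many substructure kinds the parser ends up supporting.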
As mentioned before, many of the structures Schreiber documents end with char data. This field is parsed as follows:
- If the value of the first word is less than LF_NUMERIC (0x8000), the value data is just the value of that word. The name begins immediately after that word, at data + 2, and is a C string.
- Otherwise, the word refers to the type of the value data, one of LF_CHAR, LF_SHORT, LF_USHORT, LF_LONG, or LF_ULONG. Then comes the actual value, and then the name as a C string. The length of the value data is determined by the value type--one byte for LF_CHAR, 2 for LF_SHORT, and so on.
The actual meaning of the value data depends on the leaf type it's embedded in. For example, for LF_STRUCTURE types, it refers to the size of the overall structure, while for LF_ENUMERATEs, it gives the value of the enum.
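The two cases above can be handled by one small function. This Python sketch assumes little-endian data and the numeric codes 0x8000 through 0x8004 for LF_CHAR through LF_ULONG, which is how I understand the CodeView definitions; the function name and return shape are my own.

```python
import struct

LF_NUMERIC = 0x8000

# Value type code -> struct format for the value that follows it.
# The codes are assumed from the CodeView definitions.
NUMERIC_FORMATS = {
    0x8000: "<b",  # LF_CHAR, 1 byte
    0x8001: "<h",  # LF_SHORT, 2 bytes
    0x8002: "<H",  # LF_USHORT, 2 bytes
    0x8003: "<l",  # LF_LONG, 4 bytes
    0x8004: "<L",  # LF_ULONG, 4 bytes
}


def parse_value_and_name(data):
    """Return (value, name, bytes_consumed) for a trailing data field."""
    (first,) = struct.unpack_from("<H", data, 0)
    pos = 2
    if first < LF_NUMERIC:
        # Small values are stored directly in the first word.
        value = first
    else:
        # Otherwise the word names the type of the value that follows.
        fmt = NUMERIC_FORMATS[first]
        (value,) = struct.unpack_from(fmt, data, pos)
        pos += struct.calcsize(fmt)
    end = data.index(b"\x00", pos)  # the name is a C string
    name = data[pos:end].decode("ascii", errors="replace")
    return value, name, end + 1
```

Returning the number of bytes consumed is handy inside field lists, where the end of the name is the only thing that marks where the next substructure begins.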
As you parse through types, you will find that almost every structure appears more than once in the stream: first as an entry claiming to have zero members and no corresponding field list, and then as a much more normal looking structure with the appropriate number of members and a valid reference to a field list. The reason for this is that as the compiler generates the debugging symbols, it may come across names for structures that it does not yet know the contents of; in this case, it creates a forward reference by creating an empty structure in the types stream and then setting the "fwdref" bit in its attributes (bit 7 of the word at offset 0x06 of an LF_STRUCTURE).
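Testing for a forward reference comes down to one bit test. In this Python sketch, `record` is assumed to be a whole LF_STRUCTURE record including its leading size word, so that the attributes word sits at offset 0x06 as described above; the names are illustrative.

```python
import struct

FWDREF_BIT = 1 << 7  # bit 7 of the attributes word


def is_forward_ref(record):
    """Return True if an LF_STRUCTURE record is only a forward reference.

    `record` must start at the size word, which puts the attributes
    word at offset 0x06.
    """
    (attributes,) = struct.unpack_from("<H", record, 0x06)
    return bool(attributes & FWDREF_BIT)
```

A parser that only wants complete definitions can skip every record for which is_forward_ref returns True and keep the later entry that carries the real field list.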
The information already presented here, combined with Schreiber's already available win_pdbx source code, should be enough to build a parser for PDB type information. Specifically, the following files in Schreiber's distribution were most helpful to me:
- sbs_sdk/include/pdb_info.h (constants and structure definitions for leaf types)
- Program Files/DevStudio/MyProjects/sbs_pdb/sbs_pdb.h (header definitions for PDB files)
- Program Files/DevStudio/MyProjects/win_pdbx/win_pdbx.c (the actual code that does the parsing using the structures defined in those first two files)
In addition, the CodeView documentation, while quite outdated, is extremely useful for getting an idea of how the format works, and most of the structures are still valid with a few tweaks to support 32-bit architectures (mostly changing words to dwords in some places). The documentation also covers the format of the debug stream, which Schreiber's code does not handle. This should make the development of a parser for the debug stream (coming soon!) much easier.
Just to show that this all actually works, I have made available a preliminary parser that prints out parsed type information. I have also almost finished a more general parser that uses Construct, a Python library for parsing low-level binary data (though this description does not do it justice, and I will be writing more about it when I release the updated parser).