How to parse the .doc file format

Security Research & Defense

By swiat / July 18, 2008

This past February, Microsoft publicly released the Office binary file format specifications. They describe how to parse Word, Excel, and PowerPoint files in order to review or extract their content. Because they describe the structure of these file formats in detail, we think the specifications will be particularly interesting to ISVs who write detection logic for malware scanners (such as anti-virus software). Let’s start delving into these documents by examining the basics of parsing a Word document created in the legacy binary file format. This discussion does not cover the newer OOXML format introduced in Word 2007.

Compound Binary Format

First things first: to parse these older formats you need to understand the Compound Binary File Format. The specification for this format is available online. On examination, you can see that the format resembles a file system, similar to FAT. There are directories called storages, which contain data files called streams. All of this data may be fragmented across the file in sectors, which are tracked by the internal FAT. Libraries exist to parse this format, so you don’t have to reinvent the wheel. Use the Windows COM API, starting with the StgOpenStorageEx function, or one of the several freely available parsers and viewers, to extract the individual data streams. Compound binary files are quite common, and the COM API is a good way to access the data stored in them.
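A scanner may want to confirm that a file is a compound binary file before handing it to one of those libraries. The following is a minimal sketch of that check, not a replacement for a real parser. The signature bytes, the sector shift field at header offset 0x1E, and the sector offset formula reflect our reading of the compound file specification; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Signature that opens every compound binary file. */
static const uint8_t CFB_SIGNATURE[8] = {
    0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1
};

/* Read a little-endian 16-bit value from a byte buffer. */
static uint16_t read_le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/*
 * Hypothetical helper: returns the sector size in bytes, or 0 if the
 * buffer does not look like a compound binary file header.
 */
uint32_t cfb_sector_size(const uint8_t *hdr, size_t len)
{
    assert(hdr != NULL);
    if (len < 512 || memcmp(hdr, CFB_SIGNATURE, sizeof CFB_SIGNATURE) != 0)
        return 0;
    uint16_t shift = read_le16(hdr + 0x1E);  /* sector shift field */
    if (shift != 9 && shift != 12)           /* 512- or 4096-byte sectors */
        return 0;
    return (uint32_t)1 << shift;
}

/*
 * File offset of sector n. The header occupies the first sector-sized
 * slot of the file, so sector 0 starts one sector size in.
 */
uint64_t cfb_sector_offset(uint32_t n, uint32_t sector_size)
{
    return ((uint64_t)n + 1) * sector_size;
}
```

Locating a stream still requires walking the FAT and the directory entries, which is the part best left to the COM API or an existing parser.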

High Level Structure of Word Data

Every valid Word document contains a stream named WordDocument. Go ahead and view this stream’s contents. It begins with a structure named the File Information Block, or FIB, which is described on page 141 of the Word Binary File Format specification. This massive structure both contains data and acts as a map for the rest of the document. At offset 0x9A in the FIB you will find a placeholder value named Rgfclcb. This marks the beginning of the offset/length pairs that describe where structures are located in the rest of the document. These are offsets into another stream, named either 0Table or 1Table. To find out which of these two streams is being referenced, read the 16-bit value at FIB offset 0xA and test its tenth bit (bit 9 when counting from zero) by ANDing the value with 0x200. If that bit is 0, the offsets refer to the 0Table stream. If it is 1, they refer to the 1Table stream.

This table stream (written xTable below, meaning whichever of 0Table or 1Table is in use) contains much of the formatting data that Word uses to construct the document on your screen. Each offset points at a different structure, so check the Word Binary File Format specification for additional details about the data stored at a particular table stream location.
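The table stream selection described above can be sketched as follows. This assumes the WordDocument stream has already been extracted into a byte buffer and that FIB fields are stored little-endian; the function and macro names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FIB_FLAGS_OFFSET 0x0A    /* 16-bit flags field in the FIB */
#define FIB_WHICH_TABLE  0x0200  /* mask for the table stream bit */

/*
 * Hypothetical helper: returns "0Table" or "1Table" depending on the
 * flag, or NULL if the buffer is too short to hold the flags field.
 * `fib` points at the start of the WordDocument stream.
 */
const char *fib_table_stream_name(const uint8_t *fib, size_t len)
{
    assert(fib != NULL);
    if (len < FIB_FLAGS_OFFSET + 2)
        return NULL;
    /* Assemble the 16-bit value byte by byte, low byte first. */
    uint16_t flags = (uint16_t)(fib[FIB_FLAGS_OFFSET] |
                                (fib[FIB_FLAGS_OFFSET + 1] << 8));
    return (flags & FIB_WHICH_TABLE) ? "1Table" : "0Table";
}
```

Reading the value byte by byte avoids alignment and host byte order issues that a direct pointer cast could introduce.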

A Real Example

Now, let’s combine this information with some details about a patched vulnerability to see how we can detect a possible exploit attempt. MS06-060 fixed a vulnerability in the Print Merge State (PMS) structure, which is stored in one of the xTable streams.

First, use the method described above to determine whether the document uses the 0Table or the 1Table stream.

Next we have to find out if there is a PMS structure in the document, and if so, what its offset is in the xTable stream. There are two possible places in the FIB that might store the location of this structure, fcPms and fcPmsNew. First check fcPms. The corresponding length value for that offset value is called lcbPms, and it’s a DWORD located at FIB offset 0x1FE. If that value is non-zero, then the fcPms DWORD at FIB offset 0x1FA contains the xTable offset we need. If the length value is 0, then we need to check the second possible location, fcPmsNew at FIB offset 0x48A. The length value for this one is called lcbPmsNew, and is at FIB offset 0x48E. If this DWORD is 0, then the document contains no PMS structures. If it is non-zero, then the DWORD fcPmsNew contains the offset in the xTable stream of the PMS structure.
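The lookup above can be sketched as follows. The four FIB offsets are the ones quoted in the text; the function names are hypothetical. The sketch also assumes that a field lying beyond the end of the buffer (for example, in a shorter FIB from an older Word version) can be treated as zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* FIB offsets quoted in the text above. */
#define FIB_FC_PMS      0x1FA
#define FIB_LCB_PMS     0x1FE
#define FIB_FC_PMS_NEW  0x48A
#define FIB_LCB_PMS_NEW 0x48E

/* Read a little-endian DWORD; yields 0 if it would run past the buffer. */
static uint32_t fib_dword(const uint8_t *fib, size_t len, size_t off)
{
    if (len < 4 || off > len - 4)
        return 0;
    return (uint32_t)fib[off] |
           ((uint32_t)fib[off + 1] << 8) |
           ((uint32_t)fib[off + 2] << 16) |
           ((uint32_t)fib[off + 3] << 24);
}

/*
 * Hypothetical helper: returns 1 and fills in the table stream offset
 * and length of the PMS structure, or returns 0 if the document has none.
 */
int fib_find_pms(const uint8_t *fib, size_t len,
                 uint32_t *offset, uint32_t *length)
{
    assert(fib != NULL && offset != NULL && length != NULL);
    size_t fc_off = FIB_FC_PMS;
    uint32_t lcb = fib_dword(fib, len, FIB_LCB_PMS);
    if (lcb == 0) {
        /* First location is empty, so fall back to the second one. */
        fc_off = FIB_FC_PMS_NEW;
        lcb = fib_dword(fib, len, FIB_LCB_PMS_NEW);
    }
    if (lcb == 0)
        return 0;
    *offset = fib_dword(fib, len, fc_off);
    *length = lcb;
    return 1;
}
```

A caller would pass the whole WordDocument stream as `fib` and then use the returned offset to index into the table stream identified earlier.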

Finally, if the document does contain a PMS structure, examine it in the table stream you identified earlier at the offset you just read from the FIB. The structure looks like:

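struct PrintMergeState {
   WORD Reserved1;      /* offset 0x00 */
   BYTE One;            /* offset 0x02, must be 0 or 1 */
   BYTE Two;            /* offset 0x03, must be 0 or 1 */
   DWORD Reserved2;     /* offset 0x04 */
   BYTE Three;          /* offset 0x08, must be 0 to 5 */
   BYTE Reserved3[7];   /* offset 0x09 */
   BYTE Four;           /* offset 0x10, must be 0 to 5 */
};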

Within this structure there are four bytes of interest, named One, Two, Three, and Four in the definition above. Verify that One and Two are each set to either 0 or 1. For Three and Four, verify that each value is within the range 0 to 5 inclusive. Any other value in these fields is invalid, and the document should be flagged as a possible attempt to exploit the MS06-060 vulnerability.
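The check above can be sketched as follows. The byte offsets follow from the structure definition (One at 2, Two at 3, Three at 8, Four at 16); the function name is hypothetical. Treating a structure that runs past the end of the table stream as suspicious is our own assumption, not something the specification states.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Byte offsets of the fields of interest within PrintMergeState. */
#define PMS_ONE   2u
#define PMS_TWO   3u
#define PMS_THREE 8u
#define PMS_FOUR  16u
#define PMS_SIZE  17u   /* bytes covered by the definition above */

/*
 * Hypothetical helper: returns 1 if the PMS structure at `pms_offset`
 * in the table stream has out-of-range values (or is truncated),
 * and 0 if all four fields pass the range checks.
 */
int pms_is_suspicious(const uint8_t *table, size_t table_len,
                      uint32_t pms_offset)
{
    assert(table != NULL);
    if (table_len < PMS_SIZE || pms_offset > table_len - PMS_SIZE)
        return 1;   /* assumption: a truncated structure is suspicious */
    const uint8_t *pms = table + pms_offset;
    if (pms[PMS_ONE] > 1 || pms[PMS_TWO] > 1)
        return 1;   /* One and Two must be 0 or 1 */
    if (pms[PMS_THREE] > 5 || pms[PMS_FOUR] > 5)
        return 1;   /* Three and Four must be 0 to 5 */
    return 0;
}
```

Indexing the raw bytes rather than casting to the struct keeps the check independent of compiler padding and alignment.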

In one sample document, the WordDocument stream started at file offset 0x1C00 and the 1Table stream started at file offset 0xC00. After finding lcbPms (at 0x1C00 + 0x1FE = 0x1DFE) to be non-zero, we read fcPms, which gave a table stream offset of 0x1A6. Looking at the PMS structure at 0xC00 + 0x1A6 = 0xDA6, we see that the value named “Three” is greater than 5, so this document might be trying to exploit MS06-060.

We hope that you’ve found this a helpful example of how you can use the Word Binary File Format specification to accurately describe and detect attempts to exploit specific security vulnerabilities.
