AOL Mac File Cabinet format

2004-04-29
Bruce Tomlin

This is partial documentation for the AOL file cabinet database as used by the MacOS version of AOL. The intent of this is to have enough information to recover all of the e-mail inside the file so that it can be imported into another program. Using this info, I was able to recover over three megs of mail out of an old cabinet file, by generating an mbox file and copying the result to my IMAP server.

First of all, this file format is completely different from the format used by the PC version of AOL. Second, the basic format of this file doesn't seem to have changed since at least version 3.0 of the Mac AOL software. I was able to read an old file cabinet file from AOL 3.0 using the AOL 9.0 software.

Information leading to e-mail is marked in boldface.

Update: 2005-12-30 - Since I wrote this, an official way has appeared to not only get mail out of your old cabinet files, but also send and receive AOL mail directly from within Mail.app! If you want to bring your AOL mail into this millennium, you should get AOL Service Assistant right now!

Master header record:

At the start of the file there are six longwords:


00000009 AOL Version?
0021A3F0 Free List?
00004440 Master List?
00000015 ???
000007B1 ???
00000400 ???

The data from 0018 to 00FF is uninitialized and will possibly even contain the MacBinary header from some random downloaded file.

Only the Master List is important for rescuing data from this file.
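
As a minimal sketch of reading the six longwords described above: this assumes the values are stored big-endian (usual for classic Mac files, but not stated here), and the struct and function names are hypothetical, not taken from cabinetdump.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a 32-bit big-endian value (byte order is an assumption). */
static uint32_t be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Hypothetical names; the meanings follow the guesses in the table above. */
struct cab_master_header {
    uint32_t version;      /* "AOL Version?" */
    uint32_t free_list;    /* "Free List?" */
    uint32_t master_list;  /* file offset of the master block */
    uint32_t unknown[3];   /* the three "???" longwords */
};

/* Parse the 24 bytes at the start of the file.
 * Returns 0 on success, -1 if the buffer is too short. */
static int cab_parse_master_header(const unsigned char *buf, size_t len,
                                   struct cab_master_header *out)
{
    assert(buf != NULL && out != NULL);
    if (len < 24)
        return -1;
    out->version     = be32(buf);
    out->free_list   = be32(buf + 4);
    out->master_list = be32(buf + 8);
    for (int i = 0; i < 3; i++)
        out->unknown[i] = be32(buf + 12 + 4 * i);
    return 0;
}
```

Only master_list is needed to start walking toward the mail; the other fields are kept so they can be dumped for comparison between files.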

Data blocks:

Each data block is aligned to an 8-byte boundary. Data after the end of the block data is undefined and may even be uninitialized. After a block is shrunk, the last few bytes will probably appear to repeat at the end.

The data blocks start out with 8 bytes (four 16-bit words):


4B41 Flags - I can not figure out what this means, but the high
     nibble sometimes changes depending on the block type.
0000 Apparently always = 0000
0001 Apparently always = 0001
0002 The number of entries in this block (not always used, depending
     on the block type)

Note that there is no way to know the exact size of a block without knowing what kind of block it is. Some blocks contain nothing but a list of block addresses (in which case the size is the number of entries * 4); others contain a list of block addresses paired with object IDs (in which case the size is the number of entries * 8). Some entry types are very complicated.

In smaller entry types, the low nibble (the "1" in 4B41 above) seems to correspond to the number of 8-byte blocks following the header.

There are apparently no more than 32 entries in a block, so when a block full of pointers gets full, it must re-balance them into a new block.

Not all blocks in the file have headers. Some are just raw data. These should have a data length in the referring block.
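
Here is a rough sketch of reading the 8-byte block header and computing the payload size for the two simple list layouts mentioned above. It assumes big-endian words; the names are hypothetical. It only applies to blocks that actually have a header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a 16-bit big-endian value (byte order is an assumption). */
static uint16_t be16(const unsigned char *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

struct cab_block_header {
    uint16_t flags;  /* e.g. 0x4B41; meaning unknown */
    uint16_t zero;   /* apparently always 0x0000 */
    uint16_t one;    /* apparently always 0x0001 */
    uint16_t count;  /* number of entries (not always used) */
};

/* Read the header of the block at `offset`.
 * Returns 0 on success, -1 if the offset is misaligned or out of range. */
static int cab_read_block_header(const unsigned char *file, size_t file_len,
                                 uint32_t offset, struct cab_block_header *out)
{
    assert(file != NULL && out != NULL);
    if (offset % 8 != 0)                 /* blocks are 8-byte aligned */
        return -1;
    if (file_len < 8 || offset > file_len - 8)
        return -1;
    const unsigned char *p = file + offset;
    out->flags = be16(p);
    out->zero  = be16(p + 2);
    out->one   = be16(p + 4);
    out->count = be16(p + 6);
    return 0;
}

/* Payload size for plain list blocks: pass 4 for bare block addresses,
 * 8 for block address + object ID pairs. */
static size_t cab_list_payload_size(const struct cab_block_header *h,
                                    size_t entry_size)
{
    return (size_t)h->count * entry_size;
}
```

The low nibble of the flags can be read as `h.flags & 0x000F` if you want to compare it against the number of 8-byte units following the header.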

Master block:

In both of the example cabinet files I have, this block is at 4440, and it consists of the same two disk pointers.


00004440:
4B41 0000 0001 0002 - header for a block with two entries
00000100 - pointer to an index block
000044C8 - pointer to an index block

So far, I know that the second index block leads to my saved e-mail. The first index block apparently leads to Favorites lists.
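
A block like this one, with a header followed by nothing but 4-byte pointers, could be read with something like the following sketch. Big-endian byte order is assumed and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint16_t be16(const unsigned char *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

static uint32_t be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Read a block that is an 8-byte header followed by `count` 4-byte
 * block pointers. Returns the number of pointers stored in `ptrs`,
 * or -1 if the block does not fit in the file or in `ptrs`. */
static int cab_read_pointer_block(const unsigned char *file, size_t file_len,
                                  uint32_t offset, uint32_t *ptrs,
                                  size_t max_ptrs)
{
    assert(file != NULL && ptrs != NULL);
    if (file_len < 8 || offset > file_len - 8)
        return -1;
    const unsigned char *p = file + offset;
    size_t count = be16(p + 6);          /* entry count from the header */
    if (count > max_ptrs || count * 4 > file_len - offset - 8)
        return -1;
    for (size_t i = 0; i < count; i++)
        ptrs[i] = be32(p + 8 + 4 * i);
    return (int)count;
}
```

For the master block shown above this would return 2, with the second pointer (000044C8) being the index block that leads to saved e-mail.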

Object IDs:

Some things in the database apparently have an object ID. E-mail and the file cabinet folders all have object IDs. There doesn't seem to be any kind of master index of object IDs; you just have to look around for them.

Index block:

This is the most complicated block. Each entry starts with 19 bytes:


0002     ???
04       ???
0000002B object ID?
00000000 object ID?
0000DC00 pointer to list of object IDs?
00000000 ???

The 04 may be the count of records within each entry, but I can't confirm this because every entry I have seen has four records. Each record is 16 bytes long:


00000007 object ID?
00100B58 block pointer
0000002C object ID?
6E756C6C the word "null", which must mean that this field is null (duh)

So far, I know that the block pointer in the first record of the second entry leads to my saved e-mail. Everything else seems to be for mail folder names.
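
The entry layout above might be decoded along these lines. This is only a sketch: it assumes big-endian fields, it treats the 04 byte as the record count (which is a guess), and every name is hypothetical. Where the first entry starts inside the index block is left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CAB_MAX_RECORDS 8   /* arbitrary cap for this sketch */

static uint16_t be16(const unsigned char *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

static uint32_t be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

struct cab_index_record {
    uint32_t id_a;          /* "object ID?" */
    uint32_t block_ptr;     /* block pointer */
    uint32_t id_b;          /* "object ID?" */
    unsigned char tag[4];   /* "null" in the example */
};

struct cab_index_entry {
    uint16_t unknown0;      /* 0002 in the example */
    uint8_t  record_count;  /* 04 in the example (assumed meaning) */
    uint32_t id_a, id_b;    /* "object ID?" fields */
    uint32_t id_list_ptr;   /* "pointer to list of object IDs?" */
    uint32_t unknown1;
    struct cab_index_record records[CAB_MAX_RECORDS];
};

/* Parse one entry starting at `p`. Returns the number of bytes consumed
 * (19 + 16 per record), or 0 if the entry does not fit. */
static size_t cab_parse_index_entry(const unsigned char *p, size_t avail,
                                    struct cab_index_entry *out)
{
    assert(p != NULL && out != NULL);
    if (avail < 19)
        return 0;
    out->unknown0     = be16(p);
    out->record_count = p[2];
    out->id_a         = be32(p + 3);
    out->id_b         = be32(p + 7);
    out->id_list_ptr  = be32(p + 11);
    out->unknown1     = be32(p + 15);
    size_t n = out->record_count;
    if (n > CAB_MAX_RECORDS || avail - 19 < n * 16)
        return 0;
    for (size_t i = 0; i < n; i++) {
        const unsigned char *r = p + 19 + 16 * i;
        out->records[i].id_a      = be32(r);
        out->records[i].block_ptr = be32(r + 4);
        out->records[i].id_b      = be32(r + 8);
        memcpy(out->records[i].tag, r + 12, 4);
    }
    return 19 + 16 * n;
}
```

Note that the fields inside an entry are not 4-byte aligned (the first longword starts at offset 3), so they have to be read byte by byte rather than by casting to a struct.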

Indirect blocks

In the cabinet file that I have which contains a lot of e-mail, the indirect block contains a list of other blocks which point to object lists. The flags for one of these blocks were "4A01", so the "4" may be a way to identify this type of block. I have also found these in place of blocks of object IDs.


00100B58:
4A01 0000 0001 0002
0003EA10
00194ED8

Pointers to object lists

This type of block contains a list of blocks which contain object lists. The flags for one of these blocks were "0A02", so the "0" may be a way to identify this type of block.


00194ED8:
0A02 0000 0001 001E
...
  00244718
...

Object lists

This type of block contains a list of e-mail message header blocks and object IDs. The object IDs are apparently those of the messages.


00244718:
8B02 0000 0001 0020
...
 00242E48 000003DE
...
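
An object list appears to be a header followed by 8-byte pairs (block pointer, then object ID). A minimal sketch of reading one, assuming big-endian fields and using hypothetical names:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint16_t be16(const unsigned char *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

static uint32_t be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

struct cab_object_ref {
    uint32_t block_ptr;  /* pointer to an e-mail message header block */
    uint32_t object_id;  /* apparently the object ID of the message */
};

/* Read an object list block: 8-byte header, then `count` pairs of
 * (block pointer, object ID). Returns the number of pairs read,
 * or -1 if the block does not fit in the file or in `refs`. */
static int cab_read_object_list(const unsigned char *file, size_t file_len,
                                uint32_t offset, struct cab_object_ref *refs,
                                size_t max_refs)
{
    assert(file != NULL && refs != NULL);
    if (file_len < 8 || offset > file_len - 8)
        return -1;
    const unsigned char *p = file + offset;
    size_t count = be16(p + 6);
    if (count > max_refs || count * 8 > file_len - offset - 8)
        return -1;
    for (size_t i = 0; i < count; i++) {
        refs[i].block_ptr = be32(p + 8 + 8 * i);
        refs[i].object_id = be32(p + 12 + 8 * i);
    }
    return (int)count;
}
```

Walking from the index record down to the messages then means following the indirect block, then each pointer block, then each object list, using plain 4-byte pointer reads for the first two levels and this pair reader for the last.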

E-mail message headers

This is the goal of the quest. These blocks know where your e-mail data is stored. They are 56 bytes long, not counting the 8-byte block header.


00242E48:
C202 0000 0001 0000 - block header with zero entries
03B4     - object ID of folder which contains this message
00244A30 - subject line text for display in overview lists
0000000F - length of subject line text
00000133 - ???
0000     - ???
002EE560 - pointer to message body data
00000C16 - length of message body data
00245F28 - pointer to message sender/recipient text for display in overview lists
00000027 - length of message sender/recipient text
B0103248 - message date?  this is apparently in some weird non-Unix epoch
00000000 - ???
00000000 - ???
00000000 - ???
00000000 - ???
00000002 - ???
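
The fields that matter for recovery could be pulled out of a message header block like this. It is a sketch under the usual assumptions: big-endian fields, offsets counted from the start of the 8-byte block header exactly as listed above, and hypothetical names throughout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint16_t be16(const unsigned char *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

static uint32_t be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

struct cab_msg_header {
    uint16_t folder_id;    /* object ID of the containing folder */
    uint32_t subject_ptr;  /* subject text for overview lists */
    uint32_t subject_len;
    uint32_t body_ptr;     /* message body data */
    uint32_t body_len;
    uint32_t sender_ptr;   /* sender/recipient text for overview lists */
    uint32_t sender_len;
    uint32_t mac_date;     /* "message date?" */
};

/* Parse a message header block (8-byte block header + 56 bytes).
 * Returns 0 on success, -1 if fewer than 64 bytes are available. */
static int cab_parse_msg_header(const unsigned char *blk, size_t avail,
                                struct cab_msg_header *out)
{
    assert(blk != NULL && out != NULL);
    if (avail < 64)
        return -1;
    out->folder_id   = be16(blk + 8);
    out->subject_ptr = be32(blk + 10);
    out->subject_len = be32(blk + 14);
    /* blk+18 (4 bytes) and blk+22 (2 bytes) are unknown */
    out->body_ptr    = be32(blk + 24);
    out->body_len    = be32(blk + 28);
    out->sender_ptr  = be32(blk + 32);
    out->sender_len  = be32(blk + 36);
    out->mac_date    = be32(blk + 40);
    /* blk+44 through blk+63 are unknown */
    return 0;
}
```

The subject, body, and sender pointers lead to raw data blocks without headers, which is why each one comes with its own length.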

Message body, first layer

You didn't think it was going to get easy here, did you? The message body has two layers of data that you have to peel through, like an onion. The message body starts out with 4 bytes, which are sometimes all zero, but not always. I don't know what this is. Then it contains a bunch of records:


0004     - record type
0000000E - length of record data
52653A20 4E6F2053 75626A65 6374 - record data, in this case 'Re: No Subject'

Here are some record types:


0004 - subject line
0005 - sender, e-mail address + real name
0006 - date (same as in message header block) - this field may
       not always be present, especially in mail you sent!
0007 - possibly the same date with a different epoch
0009 - To/CC address, apparently starts with an extra byte to
       indicate type.  00 seems to be To:, 01 and 02 both seem to be
       CC:, but I don't know what the difference between them is.
       There may be multiple of these records, all consecutive, and
       sorted by type.
000A - message text (see next section)
000C - attachment
       AAAAAAAA - file size
       BBBBBBBB - ???
       CCCCCCCC - ???
       DD       - file name length
       EEEEEEEE - file name (may include "(xxxxxx bytes)" at end)
0012 - sender, shortened to 30 or so characters with "..." at end
0013 - ??? (contains no data)
0014 - sender, e-mail address only
0015 - recipient, apparently your e-mail address

The message text record ALWAYS comes last.
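
Stepping through the first-layer records can be sketched as a small iterator. It assumes the type and length fields are big-endian binary values, as in the example record above; the names are hypothetical. The caller starts with the position set to 4 to skip the unknown leading bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint16_t be16(const unsigned char *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

static uint32_t be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

struct cab_record {
    uint16_t type;              /* 0004 = subject, 000A = text, ... */
    uint32_t length;            /* length of record data */
    const unsigned char *data;  /* points into the body buffer */
};

/* Read the record at *pos and advance *pos past it.
 * Returns 1 if a record was read, 0 at the end of the body or if the
 * record would run past the end of the buffer. */
static int cab_next_record(const unsigned char *body, size_t body_len,
                           size_t *pos, struct cab_record *rec)
{
    assert(body != NULL && pos != NULL && rec != NULL);
    if (*pos > body_len || body_len - *pos < 6)
        return 0;
    const unsigned char *p = body + *pos;
    uint32_t n = be32(p + 2);
    if (n > body_len - *pos - 6)
        return 0;
    rec->type   = be16(p);
    rec->length = n;
    rec->data   = p + 6;
    *pos += 6 + (size_t)n;
    return 1;
}
```

A typical loop would be `size_t pos = 4; while (cab_next_record(body, len, &pos, &rec)) { ... }`, collecting 0004, 0005, and 0009 records for the mail headers and handing the 000A record to the second-layer parser.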

Message body, second layer

Once you've found the 000A record, you need to parse that too. It consists of multiple records, each prefixed with header fields written as ASCII hexadecimal digits, and terminated with an 03 byte.


AAAA     - hex word of record type
BBBBBBBB - hex longword of record data length (including 03)
CCCCCCCC - hex longword, depends on record type
           (in message text records, this is the offset of the
            current block in the message text)
XXXX     - data, of length BBBBBBBB - 1
03       - end of block

Record types go from FFFF to at least as high as 0013. Some of them contain formatting (0008 usually contains the word "Courier", for instance, indicating it has something to do with font selection), but the one we want is 0002. This contains message text records.

Message body, second layer, text record


0002     - hex word of record type
BBBBBBBB - hex longword of record data length
CCCCCCCC - hex longword of offset within entire message
00       - 00 byte
DDDDDD   - 1-3 hex ASCII bytes containing length of text (usually BBBB-5)
2C       - comma
XXXX     - message text, apparently always broken between lines
03       - end of block

Once all the 0002 blocks are parsed out, and the junk before the comma skipped, and the 03 ignored, you have your e-mail!
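
Putting the second layer together, here is a sketch that walks the records inside a 000A record and concatenates the text from every 0002 record. It assumes the three header fields are 4, 8, and 8 ASCII hex digits, that the length counts the trailing 03, and that the text starts right after the first comma. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static int hexval(unsigned char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Parse `n` ASCII hex digits; returns 0 on success, -1 on a bad digit. */
static int parse_hex(const unsigned char *p, size_t n, uint32_t *out)
{
    uint32_t v = 0;
    for (size_t i = 0; i < n; i++) {
        int d = hexval(p[i]);
        if (d < 0)
            return -1;
        v = (v << 4) | (uint32_t)d;
    }
    *out = v;
    return 0;
}

/* Copy the text of all 0002 records into `out`.
 * Returns the number of bytes written, or -1 on malformed input
 * or if `out` is too small. */
static long cab_extract_text(const unsigned char *in, size_t in_len,
                             char *out, size_t out_cap)
{
    assert(in != NULL && out != NULL);
    size_t pos = 0, written = 0;
    while (in_len - pos >= 20) {
        uint32_t type, len, aux;
        if (parse_hex(in + pos, 4, &type) ||
            parse_hex(in + pos + 4, 8, &len) ||
            parse_hex(in + pos + 12, 8, &aux))
            return -1;
        (void)aux;                       /* offset field, unused here */
        pos += 20;
        if (len == 0 || len > in_len - pos)
            return -1;
        const unsigned char *data = in + pos;
        if (data[len - 1] != 0x03)       /* every record ends with 03 */
            return -1;
        if (type == 0x0002) {
            size_t i = 0;                /* skip 00, hex length, comma */
            while (i < len - 1 && data[i] != ',')
                i++;
            if (i == len - 1)
                return -1;               /* no comma found */
            i++;
            size_t n = (len - 1) - i;
            if (n > out_cap - written)
                return -1;
            memcpy(out + written, data + i, n);
            written += n;
        }
        pos += len;
    }
    return (long)written;
}
```

The output is not NUL-terminated; use the returned length. Records of other types (fonts and so on) are skipped using their length field.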

Date conversion

Apparently AOL used the classic MacOS time representation in the Mac version of their cabinet files. This is like Unix time, but counted in seconds from 1904-01-01 instead of 1970-01-01, a difference of 66 years. To convert to Unix time, subtract 2082844800. The time in the file may be local time rather than UTC, so you may have to add a few hours for time zone conversion. If your time zone is -0500, add 5*3600 (18000 seconds).
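
The conversion is simple enough to show as a sketch. The function name is hypothetical, and whether the stored time really is local time is still a guess, so the zone offset is a parameter (pass 0 to treat the value as UTC).

```c
#include <assert.h>
#include <stdint.h>

/* Seconds between 1904-01-01 and 1970-01-01. */
#define MAC_UNIX_EPOCH_DELTA 2082844800LL

/* Convert a classic MacOS timestamp to Unix time.
 * utc_offset_seconds is the zone's offset from UTC, e.g. -5*3600
 * for -0500; subtracting it turns local time into UTC. */
static int64_t cab_mac_to_unix(uint32_t mac_seconds,
                               int32_t utc_offset_seconds)
{
    assert(utc_offset_seconds > -86400 && utc_offset_seconds < 86400);
    return (int64_t)mac_seconds - MAC_UNIX_EPOCH_DELTA - utc_offset_seconds;
}
```

For the B0103248 value from the message header example, `cab_mac_to_unix(0xB0103248u, 0)` gives 871006664, which falls in 1997.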

Favorites blocks

There is one more sort of block that I have identified, which seems to contain the Favorites list. Each entry in the block looks like this:


AAAAAAAA - object ID or pointer to C-string text
BBBBBBBB - object ID or pointer to C-string text
CC       - length of title
DDDDDDDD - title of item

I have not done much research into these.

Sample code

Here is some sample code which I used to dump my own file cabinet: cabinetdump.c

It's ugly and will probably need some work to be useful. At the very least, you will need to change the "#define cabName" to point to your own cabinet file.

(NOTE: the "../" in cabName was because Xcode builds and runs your binary in a subdirectory.)