« Open Source Portfolios | Main | GanttProject 2.0 »

February 17, 2006

anatomy of a mail server crash

ok, so the mail server crashed twice today.

Want to know why?

We knew that the initial cause was the IMAP server dumping core. We may have even had an inkling as to what may have caused it. Here's a little bit of the output of a pstack on the core file:

----------------- lwp# 12 / thread# 16 --------------------
ff129394 txn_abort (0, f30a14d8, f30a14c0, ff12e164, 0, 0) + 4
00085fbc mboxlist_trans_finish (0, 16, f30a14d8, f30a14c0, 0, d) + 24
00091f94 mboxlist_maybeRecoverSubsFromFile (1e25994, 2f, 0, 0, 0, 2f) + 68c
0008ebdc mboxlist_findsub (1799550, 1, 0, 1e25994, 1c7b538, 625b0) + 6c
00053bb0 cmd_list (1e25920, 1e2f0b8, 1, 11995d8, 1799550, 0) + 288
00047d90 cmdloop (1e25920, 708, 2000, 2710, 0, 0) + 3058
000434d8 sock_readable (1e25920, 0, 7c2b48, 0, 0, 0) + 580
ff1860bc GDispCx_Dispatch (221204, faa47c, 7bed4c, fffffff8, 0, 6d47c1) + ec
ff1864b8 GDispCx_InternalWork (7bed4c, ff186188, fee6c000, c, 7c2b48, 0) + 330
ff0b2ad0 _pt_root (7c2b48, ff0cc24c, 0, fee78d04, 0, 2) + a4
fee5b03c _thread_start (7c2b48, 0, 0, 0, 0, 0) + 40

So something weird was going on and causing the IMAP server to crash, which then left something locked in the store daemon (the process that actually touches the mail, whether via POP, IMAP, webmail, or SMTP), which then caused further problems.

After the second crash under similar unusual circumstances, I went digging into the IMAP log files, hoping that something would turn up. Something did:

[17/Feb/2006:16:08:16 -0500] northuist imapd[33856]: Account Information: login [x] uid proxyauth
[17/Feb/2006:16:08:16 -0500] northuist imapd[33856]: Store Critical: Mailbox database error: missing or empty key value specified

Well, that's interesting. Let's see if that error turned up in the earlier crash:

17/Feb/2006:14:05:59 -0500] northuist imapd[31550]: Account Information: login [x] uid proxyauth
[17/Feb/2006:14:05:59 -0500] northuist imapd[31550]: Protocol Information: uid created mbox UM-Messages
[17/Feb/2006:14:05:59 -0500] northuist imapd[31550]: Store Critical: Mailbox database error: missing or empty key value specified

Note that I've cleverly disguised the user's network ID as "uid" here. But interestingly, both crashes involved this same person's mailbox. It looks like they were being provisioned for Unified Messaging on the earlier crash, and attempting to access voicemail on the second.

Time to look at their mailbox!

northuist%ls
=+U+M-+Messages/ store.idx store.sub store.usr

Nothing terribly unusual from the initial peek. There's no actual mail in the INBOX (but it turns out they forward their mail, so that makes sense)

Let's look at the UM-Messages folder:

northuist%ls -l
total 16
-rw------- 1 mailsrv nobody 7856 Feb 17 16:41 store.idx

Very normal looking too. Let's go back and take a closer look:

northuist%ls -l
total 526
drwx------ 2 mailsrv nobody 96 Feb 17 16:41 =+U+M-+Messages/
-rw------- 1 mailsrv nobody 7856 Feb 17 16:41 store.idx
-rw------- 1 mailsrv nobody 259656 Sep 3 2003 store.sub
-rw------- 1 mailsrv nobody 37 Feb 17 15:48 store.usr

hmmmmm... That store.sub file looks a little big. It's just supposed to be a list of all the folders you're subscribed to (i.e. your own folders) This user only has one folder, so why is it so big? Let's take a peek:

northuist%more store.sub
/hBOcdctpYVDO46Y
W16kL91GMdfKlokIlr+9qDX31vApE+z7ENLl+8pyz4TuZe3YCN2Bbnml9DKtxzImBtBjCiM8SSiF
jbJ2c4rGfW+KTC+VtSq+x6vK/qXINxqtfl+58EqPHrEj9SQ9bN7TWh4TiZ7f3pm6C4Vj9lIcG2+q
9z9ErtDxB2F56fBzHb+fsOXhtnNQ00M4ht8Z+PbW7tpiBq3ygXCm0Yg2/CHY/xAvWL9ro247Yu6l
jUKZhFeCFl8mZLln2gOdx9bnfirC6zCkbUPnzW82to03ij1Q31EZWjj+P0bQjX5Xy2fo2i1LWAXK
CwnxMmqkbMWSG5D3vbrgezE3I7dR7nhB0bo9p6X0h/LodW42MwPsfX3bt1Es93xc9xp1b3jnHqVT

That doesn't look good. For comparison, here's an excerpt from my own store.sub file:

user/sdh7/Sent
user/sdh7/Sent Items
user/sdh7/Trash
user/sdh7/UM-Messages
user/sdh7/JunkMail

A-ha! A little magic later (i.e. rm'ing the file and re-massaging the mailbox database) and I think everything should be happier now. We'll see what happens.


Update:
We've finished running a database reconstruct and all still seems well. I did just happen to notice that that bogus file dated from 2003, so these crashes appear to be a holdover from an old crash, rather than a recent corruption. Given that more people are going to be set up for UM soon, we should probably run a check for things like this.

Posted by sdh7 at February 17, 2006 04:51 PM

Trackback Pings

TrackBack URL for this entry:
http://blog.case.edu/sdh7/mt-tb.cgi/5970

Listed below are links to weblogs that reference anatomy of a mail server crash:

» Sam Breaks Down a Mail Server Crash from Jeremy Smith's blog
Sam explains what really happened to the email server earlier in anatomy of a mail server crash.... [Read More]

Tracked on February 17, 2006 05:52 PM

Comments

Post a comment




Remember Me?

(you may use HTML tags for style)