Jeremy Smith's blog

Mail System Woes

So, the mail system is down again. You may be wondering what happened. So, here's the scoop.

You might have noticed the Calendaring system was upgraded recently. Well, this morning, it was having some problems. So, it was restarted. But, it wasn't coming back up. There was a file descriptor issue (don't ask me). The server needed rebooting. And, that was the fateful moment.

The server that runs the calendaring system is the backup for the mail system's primary server (the one that handles the mail message store). Both servers, primary and secondary, can access the SAN in read/write mode via the magic of some Veritas Cluster File System (CFS) stuff. (At this point, I am talking about things I only know on a superficial level because it is Server Engineering magic, so I may get some of it wrong.) When the backup server for the mail system was rebooted, this CFS layer went into a weird state. It is not supposed to go into a weird state. As a matter of fact, its job is specifically to prevent weird states. Alas, a weird state it did enter.

This caused the primary mail message store server to enter a weird state. This caused the mail server software to have a hissy fit. *Boom!* Mail server crashed. The calendar server/backup mail server was still in a bad state. Bringing down the primary mail server finally straightened out the calendar server, and it entered a known good state.

Sighs of relief abounded.

The primary mail message store server was brought up. It mounted the SAN in a normal manner and everything appeared good.

More sighs of relief.

Now, when the mail message store crashes hard, you can't just bring it back up. We have been burned by this before. If there is corruption in the message store, bringing the service back online will cause this corruption to spread like a cancer; and it will eventually grow big enough to cause more crashes. What you must do is fix the corruption (if there is any) before you bring back the service. Fixing the corruption involves running a program called reconstruct over the entire message store. This usually takes an hour or so for a message store of our size, but it is a net win in that we won't have subsequent crashes because of this crash.
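For the curious, the ordering described above can be sketched roughly like this. It is only an illustration of the repair-before-restore rule, not our actual tooling: the `MessageStore` class, its fields, the `recover` function, and the `user.alice` mailbox name are all made up for the example, and the `repair` step simply stands in for running reconstruct over the whole store.

```python
from dataclasses import dataclass, field


@dataclass
class MessageStore:
    """Hypothetical stand-in for the mail message store (not a real API)."""
    crashed_hard: bool = False
    corrupt_mailboxes: set = field(default_factory=set)
    online: bool = False
    log: list = field(default_factory=list)

    def repair(self):
        # Stands in for running reconstruct over the entire store.
        self.log.append("repair")
        self.corrupt_mailboxes.clear()

    def bring_online(self):
        self.log.append("bring_online")
        self.online = True


def recover(store: MessageStore) -> MessageStore:
    """Restore service, repairing first if the store went down hard."""
    if store.crashed_hard:
        # Repair runs whether or not corruption is actually present;
        # it has to finish before the service comes back.
        store.repair()
        store.crashed_hard = False
    store.bring_online()
    return store


if __name__ == "__main__":
    # "user.alice" is an invented mailbox name for illustration.
    store = recover(MessageStore(crashed_hard=True,
                                 corrupt_mailboxes={"user.alice"}))
    print(store.log)
```

The point of the sketch is just the order of operations: after a hard crash the slow repair always comes first, and a clean shutdown skips straight to bringing the service online.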

That is what is happening right now. The mail server is back up; we are just cleaning up any corruption before restoring the service.

A failure of the Veritas CFS... doing something it is specifically designed to prevent. Grrr... The funny thing is, when the mail server crashes, it is rarely the fault of the actual mail server. It is usually something like this: a weird dependency that happens to do something it absolutely should not be doing.

Regardless, the mail server is coming back soon with a nice, clean message store.

Comments

  1.

    Thanks for the information! I didn't understand most of it, of course, but I am always fascinated by glimpses of what is lurking behind all the things we now take for granted.

  2.

    "I am always fascinated by glimpses of what is lurking behind all the things we now take for granted."

    This is one of the main things I think Blog@Case can do for ITS: make things a little more transparent and bring the conversation to a personal level.

    I wish more ITS-ers would blog.

  3.

    I wish I had more time to blog... I've got a whole pile of adventures in Linux + Oracle 10g/RAC-land (and probably more to come; the project is barely started) that are probably worth sharing. But first I have to get it into a usable state to turn over to the DBAs, and finish up & document some VMware stuff for certain users who keep pestering me :-) :-)

  4.

    So what of the many messages some people seem to have received around the time of the crash with random words as the sender's first and last name (e.g. Dalmatian T. Whoppings; see http://home.cwru.edu/modpl/forum/threaded?message_id=0001Ui for more examples)? They were all plugging a porn site and were reported at around the same time (some before, some after). Did this big chunk of spam have nothing to do with the crash?

  5.

    A half-dozen or so people receiving a swath of the same spam is not a coincidence; it's entirely normal. Spammers send their mail in bulk.

    Another reason they may have all arrived at the same time once the mail system came back online (even while legitimate mail got held up) is that the anti-spam layer dequeued its mail before the anti-virus layer started its dequeueing.
