So, the mail system is down again. You may be wondering what happened. So, here's the scoop.
You might have noticed the Calendaring system was upgraded recently. Well, this morning, it was having some problems. So, it was restarted. But, it wasn't coming back up. There was a file descriptor issue (don't ask me). The server needed rebooting. And, that was the fateful moment.
The server that runs the calendaring system is also the backup for the primary mail server (the one that handles the mail message store). Both servers, the primary and the secondary, can access the SAN in read/write mode via the magic of some Veritas Cluster File System stuffs. (At this point, I am talking about stuff I only know about on a superficial level because it is Server Engineering magic, so I may get some stuff wrong.) When the backup server for the mail system was rebooted, this CFS layer went into a weird state. It is not supposed to go into a weird state. As a matter of fact, its job is specifically to prevent weird states. Alas, though, a weird state, it did enter.
This caused the primary mail message store server to enter a weird state too, which in turn caused the mail server software to have a hissy fit. *Boom!* Mail server crashed. The calendar server/backup mail server was still in a bad state. Bringing down the primary mail server finally straightened out the calendar server, and it entered a known good state.
Sighs of relief abounded.
The primary mail message store server was brought up. It mounted the SAN in a normal manner and everything appeared good.
More sighs of relief.
Now, when the mail message store crashes hard, you can't just bring it back up. We have been burned by this before. If there is corruption in the message store, bringing the service back online will cause this corruption to spread like a cancer, and it will eventually grow big enough to cause more crashes. What you must do is fix the corruption (if there is any) before you bring back the service. Fixing the corruption involves running a program called reconstruct over the entire message store. This usually takes an hour or so for a message store of our size, but it is a net win in that we won't have subsequent crashes because of this crash.
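For the curious, the "repair first, then restart" rule can be sketched in a few lines of Python. This is only an illustration of the ordering, not what Server Engineering actually runs: recover_after_crash, repair_store, start_service and RecoveryLog are all made-up names, and repair_store stands in for whatever wraps the reconstruct run.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RecoveryLog:
    """Records the order of recovery steps (hypothetical helper)."""
    steps: List[str] = field(default_factory=list)


def recover_after_crash(
    crashed_hard: bool,
    repair_store: Callable[[], bool],   # hypothetical: wraps the repair run, True on success
    start_service: Callable[[], None],  # hypothetical: brings the mail service online
    log: RecoveryLog,
) -> bool:
    """Start the service only after the message store has been repaired."""
    if crashed_hard:
        # After a hard crash, always repair before anything else.
        log.steps.append("repair")
        if not repair_store():
            # Repair failed: keep the service down rather than spread corruption.
            log.steps.append("abort")
            return False
    log.steps.append("start")
    start_service()
    return True
```

The point of the design is that there is no path from a hard crash to a running service that skips the repair step, which is exactly the lesson we learned the hard way.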
That is what is happening right now. The mail server is back up; we are just cleaning up the corruption before restoring the service.
A failure of the Veritas CFS... doing something it is designed to specifically prevent. Grrr... The funny thing is, when the mail server crashes, it is rarely the fault of the actual mail server. It is usually something like this: a weird dependency that happens to do something that it absolutely should not be doing.
Regardless, the mail server is coming back soon with a nice, clean message store.