
Problem Report: Mail SPAM filter machine keeps dying

Posted by postprob on April 09, 2008 at 08:46 AM

Problem:   Mail SPAM filter machine keeps dying
Cause:     Overheating & Failed hardware
Affects:   New mail reception of about 1/4 of mail users
Started:   04/09/2008 07:55 AM
Resolved:  04/09/2008 03:25 PM

Notes:

The first line of the previous update should have read: "We have NOW gotten all production services off the failed hardware and things are stabilized..."

[04/09/2008 4:30PM] - We have not gotten all production services off the failed hardware and things are stabilized, so we are resolving this issue. No mail was lost during this hardware failure event and there were only minor delivery delays for the affected individuals.

In our attempts to move all production services to a cooler part of the room, we have inadvertently moved what now appears to be a hardware problem to a different production spam filter. An engineer is en route to correct the issue.

The rack that was holding the overheating system has two Sun V880s (very large servers) in it. We moved the failing spam server (and another one that was also running hot) to a new rack that has more cool air near the servers.

At this time we believe that we have resolved the problem. We will continue to monitor the servers.

The vendor has looked into the crashing problem and found no obvious reason for the crashes, but did find the system running very hot (the chassis was actually hotter than the CPU). We have done a few things to cool down the rack that the server is in, and this has (so far) stabilized the server. We still need to do more to cool down the server, or move it to another rack with better cooling.

The system keeps shutting down and restarting itself. Since this started, it has not remained operational long enough at any one time for us to access the system diagnostics and see what is going on. We are opening a trouble ticket with the vendor and are preparing to put our spare hardware into production in place of the failing system.

We will post updates as they become available.


Created: 04/09/2008 08:46:19 by dak

Updates: 04/09/2008 10:07:29 by emr, 04/09/2008 13:36:24 by emr, 04/09/2008 14:29:33 by dak, 04/09/2008 16:25:43 by dak, 04/09/2008 16:33:35 by dak