Emergency Maintenance
Emergency Maintenance: mv-qss (MediaVision QuickTime Streaming video server)
Problem: mv-qss (MediaVision QuickTime Streaming video server) Cause: need to finish unsuccessful hardware install from Weds. morning Affects: all users of mv-qss server Started: 05/15/2008 05:00 AM Resolved: 05/15/2008 06:00 AM
Notes:
A work-around has been found for the problem of fitting a PCI-X shaped peg in a PCI-E shaped hole.
Created: 05/14/2008 16:39:48 by jan3
Updates:
Emergency Maintenance: Restarting webmail server
Problem: Restarting webmail server Cause: A spammer is logged into someone's webmail account and won't log out Affects: People using webmail Started: 05/13/2008 05:30 PM Resolved: 05/13/2008 05:35 PM
Notes:
We only plan to have the webmail service down for a minute or so.
The restart was successful. The service was down for less than a minute.
Created: 05/13/2008 14:52:49 by emr
Updates: 05/13/2008 17:56:10 by emr
Problem Report
Problem Report: PBL-p10-e1 switch is down
Problem: PBL-p10-e1 switch is down Cause: Fan Tray Failure Affects: Some network users at PBL Started: 05/12/2008 04:06 PM Resolved: 05/12/2008 05:50 PM
Notes:
Switch failed again about one and a half hours later. A reboot restored it.
Found a failed fan tray. The failed fan tray caused the switch to halt immediately to prevent component overheating. Replaced the failed component. All services are back to normal.
Network engineer is en route to investigate.
Created: 05/12/2008 17:01:28 by wxc16
Updates: 05/12/2008 17:53:11 by wxc16, 05/13/2008 07:40:57 by roo
Problem Report: Nrv house 3 network switch
Problem: Nrv house 3 network switch Cause: unknown Affects: All network connections in Nrv house 3, phones and wireless Started: 05/09/2008 09:52 PM Resolved:
Notes:
Investigating the cause.
5/10/08 10:45AM Supervisor module with critical error was replaced
5/10/08 11:01AM Restored switch to minimal function - still no voice but switch online
5/10/08 7:05PM OS version bug caused voice and wireless card to fail. Need to upgrade OS
5/11/08 2:06AM Upgrade OS and configs - still issues
5/11/08 7:00AM Identified cause of problem to be a damaged data line card. Voice, data and wireless restored.
Created: 05/10/2008 06:53:55 by roo
Updates: 05/11/2008 08:46:12 by roo
Problem Report: Mail system crashed
Problem: Mail system crashed Cause: Unknown at this time Affects: Mail system users Started: 05/09/2008 05:05 AM Resolved: 05/09/2008 06:17 AM
Notes:
[05/09/08 6:15 AM] - The mail system is back up and appears stable. Server Engineering will be going through the logs to determine what exactly happened that caused the system to crash in the first place.
The maintenance for this morning at first appeared to go fine but then the mail system crashed and we have been unable to restore service. Engineers from both Middleware Services and Server Engineering are working on the problem. The problem at this point seems to have something to do with the software that controls the mail box storage.
Created: 05/09/2008 05:56:15 by dak
Updates: 05/09/2008 06:17:03 by dak
Problem Report: Crawford Data Center AC degraded
Problem: Crawford Data Center AC degraded Cause: Unit #3 operating at 50% Affects: High Performance Compute Cluster (HPCC) Started: 05/08/2008 10:13 AM Resolved: 05/09/2008 12:08 PM
Notes:
Workaround found as we deal w/ fixing both A/C units.
Air Conditioner unit was restarted at 10:43AM but second compressor would not start. Maintenance is scheduled for morning of May 9, 2008. As a precaution, to prevent overheating the room, the High Performance Compute Cluster (HPCC) has been configured to not allow new jobs to start until sufficient cooling is restored.
Created: 05/08/2008 18:44:36 by bsc4
Updates: 05/09/2008 17:08:00 by jhm
Problem Report: blackboard.case.edu outage
Problem: blackboard.case.edu outage Cause: failed restart process Affects: all Blackboard users Started: 05/04/2008 10:05 AM Resolved: 05/04/2008 10:51 AM
Notes:
The nightly restart process failed on one server, leaving it unable to take over when the other server needed to restart itself.
Created: 05/04/2008 11:10:48 by prj
Updates:
Problem Report: VoIP: User unable to make personal setting changes
Problem: VoIP: User unable to make personal setting changes Cause: Unknown Affects: see notes Started: 04/28/2008 09:54 AM Resolved: 04/28/2008 10:41 AM
Notes:
This issue has been resolved. Engineer verified that new personal setting for end-user is applied to the phone number correctly.
Engineer is working on resolving the issue.
Some end-users may not be able to change Personal Settings on their phone even though the Web Page shows changes were made.
Created: 04/28/2008 09:56:38 by wxc16
Updates: 04/28/2008 10:41:39 by wxc16
Problem Report: Internet is slow from campus to internet
Problem: Internet is slow from campus to internet Cause: possible DOS or malware virus activity Affects: all internet activity on campus Started: 04/24/2008 02:43 PM Resolved:
Notes:
The Internet will be slow until an upstream Internet provider repairs a fiber optic cable cut.
Details follow.
An upstream Internet provider, Global Crossing, is reporting a fiber cut in the region of Michigan or Indiana. They are re-routing traffic around the cut. The re-routed traffic is overloading and saturating the smaller links it now uses, which is causing high latency and packet loss.
Engineers are investigating the issue. Edge firewalls have been reloaded.
Created: 04/24/2008 14:44:46 by lxc152
Updates: 04/24/2008 15:38:46 by lxc152
Problem Report: pgd-m1-vg248 - conn - Thu Apr 24 00:15:30 2008
Problem: pgd-m1-vg248 - conn - Thu Apr 24 00:15:30 2008 Cause: unknown Affects: The PGD-M1-VG248, it is one of the Analog Phone Gateways, in the Fiji Fraternity House. Started: 04/24/2008 12:15 AM Resolved: 04/24/2008 07:20 AM
Notes:
Repaired.
Created: 04/24/2008 07:40:45 by euw
Updates:
Problem Report: 4 ports down on KSL SAN Switch
Problem: 4 ports down on KSL SAN Switch Cause: Failed Board in Fiber Channel Switch Affects: Performance may be degraded but no outage is expected. Started: 04/23/2008 11:00 AM Resolved: 04/23/2008 04:15 PM
Notes:
Waiting on replacement fiber channel board.
Replacement will happen hopefully 4-6pm, today.
Created: 04/23/2008 13:39:00 by rfw
Updates: 04/23/2008 17:08:50 by rfw
Problem Report: Case Voicemail System Degraded
Problem: Case Voicemail System Degraded Cause: Unknown Affects: Campus Voicemail Services Started: 04/21/2008 09:30 AM Resolved:
Notes:
Engineer was notified of the voicemail problem this morning and immediately switched voicemail service to the backup server. The main voicemail server was rebooted while voicemail service was still served by the backup server. Service was back to normal until the backup server stopped working around 3:15pm. The voicemail service was then switched back to the main voicemail server.
Both voicemail servers are now in operational state.
Created: 04/21/2008 17:32:21 by wxc16
Updates:
Problem Report: axo-m1-e1, zbt-m1-e1
Problem: axo-m1-e1, zbt-m1-e1 Cause: unknown Affects: Network access, phones and wireless in both buildings Started: 04/21/2008 10:05 AM Resolved: 04/21/2008 04:05 PM
Notes:
[10:05 AM]
This is typical of a power outage in both buildings.
Eng is on his way to investigate.
[11:55 AM]
Waiting for the electrical power to be restored
to both buildings, and to the traffic lights at
the intersection of the Bellflower and Ford Roads.
[04:05 PM]
The electrical power came back on, and
the sec-axo.TELEM, the axo-m1-e1, and the zbt-m1-e1
came back on by themselves without a problem.
I had to work on getting
the axo-m1-vg248 and the zbt-m1-vg248
to start working correctly again, and
luckily, both efforts were successful.
Created: 04/21/2008 11:08:41 by roo
Updates: 04/21/2008 11:57:57 by euw, 04/21/2008 16:44:56 by euw
Problem Report: Mail List Manager (Sympa) was down
Problem: Mail List Manager (Sympa) was down Cause: unknown Affects: Mail list users Started: 04/20/2008 04:00 AM Resolved: 04/20/2008 05:10 AM
Notes:
The mail list manager (Sympa) had several processes that were hung and was not processing mail in its incoming queue. The system had to be completely shut down and the hung processes forcibly terminated before the system could be brought back to operation.
No mail was lost during the time the mail list manager was unavailable; it was merely delayed and was delivered as soon as the hung processes were terminated and the system was restarted.
Created: 04/20/2008 09:15:02 by dak
Updates:
Problem Report: stone-m1-e1 - conn - Thu Apr 17 23:05:41 2008
Problem: stone-m1-e1 - conn - Thu Apr 17 23:05:41 2008 Cause: unknown Affects: The Stone-M1-E1, it is a Cisco Switch, which operates at an Access Level. Started: 04/17/2008 11:05 PM Resolved: 04/18/2008 12:20 PM
Notes:
Apr 17, 11:05 PM
The Problem Began.
Apr 18, 12:50 AM
Repaired.
Apr 18, 02:05 AM
Disconnected the Port 04/08 from the Switch;
Hardware Reset the Switch;
Software Reset the Switch;
the Switch is now reachable;
by the PINGs, and by the Telnets;
Reconnected the Port 04/08 back to the Switch.
Apr 18, 08:27 AM
Still unreachable.
Apr 18, 09:12 AM
The switch has returned to normal operation.
Apr 18, 12:20 PM
Found the Source of the Problem,
which was a Physical Loop and
a Media Converter that had gone bad;
took them out of the Circuit, and
things quickly returned to normal.
Created: 04/17/2008 23:49:41 by euw
Updates: 04/18/2008 01:03:38 by euw, 04/18/2008 08:26:22 by roo, 04/18/2008 09:12:37 by roo, 04/19/2008 02:08:58 by euw, 04/19/2008 08:52:38 by euw
Problem Report: HTMLDB &?wwwftp server down
Problem: HTMLDB &
Problem Report: Off Campus Connection Down
Problem: Off Campus Connection Down Cause: Failed firewall update Affects: All outbound internet connections to sites off campus Started: 03/28/2008 11:00 AM Resolved:
Notes:
The issue with the internet is resolved for major systems (mail, web and the like) and should be resolved for all other systems as well. We expect to still see some issues with access due to operating systems caching various items. Please reboot if you are still having issues.
Root cause is still being determined. Based on first-hand reports, this issue does not appear to be related to firewall rule changes. They were completed at 8:30 am this morning. This issue surfaced around 11:00 am or so.
A network engineer is looking into the problem and will update when they find the problem.
Problem likely due to problematic rule push to edge firewall early this morning. Primary firewall routing table was lost and redundant firewall configuration was lost. Primary routing table was rebuilt by engineers to correct problem. Network security engineer currently rebuilding redundant firewall and investigating reasons for failed update. ~ Ozanich
Created: 03/28/2008 11:15:29 by emr
Updates: 03/28/2008 14:30:51 by jxo63, 03/28/2008 15:04:13 by lxc152
Problem Report: Google Start Page (http://webstart.case.edu) Is Performing Infinite Redirects to http://login.case.edu
Problem: Google Start Page (http://webstart.case.edu) Is Performing Infinite Redirects to http://login.case.edu Cause: Google Is Redirecting Improperly Affects: http://webstart.case.edu Started: 03/21/2008 02:00 PM Resolved:
Notes:
Google has been notified of the problem.
In the meantime, to work around the issue, after getting to the error message in your browser, go back to the URL bar and manually enter "webstart.case.edu" and navigate back to the page.
Created: 03/21/2008 15:27:23 by jms18
Updates:
Problem Report: Mail servers have stopped processing mail
Problem: Mail servers have stopped processing mail Cause: Unknown at this time Affects: Mail delivery Started: 02/14/2008 11:30 AM Resolved:
Notes:
We can see no reason for the problems we are seeing. At this point the queues are growing on all the delivery (both incoming virus and spam) layers. We are working with the vendor to try to resolve the problem.
Created: 02/14/2008 15:03:18 by dak
Updates:
Problem Report: Unified Messaging Server issue
Problem: Unified Messaging Server issue Cause: Unknown Affects: Case Unified Messaging Services Started: 02/14/2008 08:00 AM Resolved:
Notes:
End-users are reporting an intermittent issue with the UM system in which dead air is heard. Engineers are currently investigating the issue.
Created: 02/14/2008 10:56:29 by wxc16
Updates:
Problem Report: Mail Delivery Delays
Problem: Mail Delivery Delays Cause: Huge Increase in spam traffic Affects: Mail delivery times Started: 02/12/2008 09:00 AM Resolved:
Notes:
We have been seeing a large increase in spam today that has been seriously affecting our delivery times. We have been seeing about a three-fold increase in traffic from off-campus.
As an interim step to alleviate the worst of the delivery delays, we have pressed into service some evaluation units with higher capacity than the existing production units. It appears the additional capacity is having a positive effect.
Since the new units are a new model with which we are not completely familiar, we will continue to monitor them through the course of the evening to make sure that they are performing correctly.
Created: 02/12/2008 17:13:02 by dak
Updates:
Problem Report: MARS appliance not powering up
Problem: MARS appliance not powering up Cause: Power supply? Affects: Security, Network Engineering, reviewers of log aggregate reports Started: 02/01/2008 08:00 AM Resolved:
Notes:
Network Engineering members are offsite today - Server Engineering was kind enough to check power & network status. Appliance was not powered up, and remained unresponsive even after re-seating the power cable.
Created: 02/01/2008 12:24:33 by rac40
Updates:
Problem Report: KSL load balancer upgrade
Problem: KSL load balancer upgrade Cause: possible faulty hardware Affects: ksl data center Started: 01/12/2008 10:16 AM Resolved:
Notes:
Waiting on a call back from Cisco TAC.
Issue has been temporarily resolved for now.
nesg will continue to monitor the issue.
Additional work will be needed to fully resolve this issue.
Created: 01/12/2008 10:37:32 by lxc152
Updates: 01/12/2008 11:37:08 by lxc152
Problem Report: Case IM gateway to Yahoo Messenger not working
Problem: Case IM gateway to Yahoo Messenger not working Cause: Yahoo changed their protocol Affects: Case IM (Spark client) users using the Yahoo gateway Started: 12/18/2007 12:47 AM Resolved:
Notes:
Yahoo changed something with their Messenger protocol which is preventing the Case IM gateway from working. You can find more information about the problem at
http://www.igniterealtime.org/community/thread/30590?tstart=15
In the meantime, the Yahoo gateway has been disabled until the Openfire IM gateway plugin is upgraded.
Created: 12/18/2007 12:53:07 by sdh7
Updates:
Problem Report: Crawfordsvr0 and Crawfordsvr1 connections
Problem: Crawfordsvr0 and Crawfordsvr1 connections Cause: Crawford Data Center Interswitch link connection flapping Affects: Connections on the Switch Started: 05/26/2007 11:40 PM Resolved: 08/15/2007 06:00 AM
Notes:
UPDATE: Engineer has made a temporary interswitch link patch in the Crawford data center. The problem link has been disabled to prevent further problems. Most of the symptoms we've seen have gone away. Engineers will investigate this problem further.
ITS Engineers discovered a flapping interswitch link in the Crawford server farm switches. Still investigating.
Created: 05/26/2007 23:44:53 by roo
Updates: 05/27/2007 04:25:20 by wxc16
Scheduled Maintenance
Scheduled Maintenance: Campus WWW Server Upgrade
Problem: Campus WWW Server Upgrade Cause: Migrating the campus WWW server to its new platform Affects: All users of the campus WWW server Started: 05/19/2008 08:00 AM Resolved: 05/19/2008 09:00 AM
Notes:
We are moving the campus WWW server from its multi-year-old architecture to a modern configuration.
Created: 05/16/2008 13:24:08 by jms20
Updates:
Scheduled Maintenance: Name Server Changes for New Single Sign On
Problem: Name Server Changes for New Single Sign On Cause: Need redundant services in Crawford computer room before the KSL shutdown Affects: Should not affect anyone Started: 05/20/2008 05:00 AM Resolved: 05/20/2008 05:30 AM
Notes:
We are doing some name server work now so that some IP address lifetimes have a chance to expire before the shutdown and the move to the new service.
There should actually be no downtime here; we are simply repointing some name aliases, which should be virtually instantaneous. We are not actually moving to the new server at this time, just laying the groundwork for the move in about a week.
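The reasoning about address lifetimes can be illustrated with a small calculation. This is only a sketch: the function name and the one-day lifetime (TTL) used in the example are assumptions for illustration, not values taken from the actual name server configuration. The idea is that a resolver which cached an old record just before the change may keep it for at most that record's lifetime.

```python
from datetime import datetime, timedelta


def caches_expired_by(change_time: datetime, old_ttl_seconds: int) -> datetime:
    """Latest time at which a cached copy of the old record can still exist.

    A resolver that fetched the old record immediately before the change
    keeps it for at most old_ttl_seconds, so after this time every cache
    that honors the lifetime has had to look the name up again.
    """
    return change_time + timedelta(seconds=old_ttl_seconds)


# Hypothetical example: change made at the start of the maintenance window,
# with an assumed lifetime of one day (86400 seconds) on the old records.
change = datetime(2008, 5, 20, 5, 0)
print(caches_expired_by(change, 86400))
```

Making the change about a week ahead of the move leaves a wide margin over any lifetime of a day or so.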
Created: 05/15/2008 16:26:37 by dak
Updates:
Scheduled Maintenance: Defective Hard Drive on the Windows Server Kessel
Problem: Defective Hard Drive on the Windows Server Kessel Cause: Failure prediction threshold exceeded due to test Affects: No one is affected; drive is part of a RAID set Started: 05/16/2008 04:30 PM Resolved: 05/16/2008 05:30 PM
Notes:
Windows Team will replace the defective drive. No outage is expected.
Created: 05/15/2008 14:26:40 by tpw9
Updates:
Scheduled Maintenance: mv-qss and mv-flash will be down for scheduled maintenance
Problem: mv-qss and mv-flash will be down for scheduled maintenance Cause: Add HBA controller to each system Affects: all server functions Started: 05/14/2008 04:00 AM Resolved: 05/14/2008 06:00 AM
Notes:
mv-qss and mv-flash will be powered down for scheduled maintenance on Wed. May 14, 2008 at 04:00. Will be powered up no later than 06:00 that day.
mv-flash received new HBA. Also performed PERC firmware update and new driver installation.
Was unable to install a new HBA in mv-qss. Manufacturer shipped the wrong part. Will have to reschedule maintenance.
Created: 05/12/2008 17:35:30 by dxw134
Updates: 05/14/2008 06:25:05 by dxw134
Scheduled Maintenance: http://mail.case.edu will be down
Problem: http://mail.case.edu will be down Cause: Branding changes require reboot Affects: Users of http://mail.case.edu Started: 05/09/2008 05:00 AM Resolved: 05/09/2008 05:30 AM
Notes:
[05/09/08 6:30 AM] - The maintenance this morning did not go quite as smoothly as expected. After making the branding changes the mail system appeared to run fine for about 5 minutes and then unexpectedly crashed. The problem seems to have been with the software that controls the mail box storage. Service has been restored (as of about 6:10) and everything appears stable at this time. We will be going through the logs to determine exactly what caused the crash and will post an update if appropriate.
To eliminate some confusion between http://mail.case.edu (the current mail system web mail interface) and http://webmail.case.edu (the Google Apps web mail interface) we will be removing the "Webmail@Case.edu" banner on http://mail.case.edu. Unfortunately that requires us to stop and restart the interface. Actual downtime is expected to be about 5 minutes.
This will only affect users of the web mail interface. POP and IMAP users will be unaffected.
Created: 05/07/2008 09:03:12 by dak
Updates: 05/09/2008 06:27:23 by dak
Scheduled Maintenance: Crawford 315, Rack 304
Problem: Crawford 315, Rack 304 Cause: to unrack the i-Trap appliance Affects: discontinuing its services Started: 05/07/2008 03:00 PM Resolved: 05/07/2008 06:00 PM
Notes:
and then,
to return it to its Manufacturer
this week
Created: 05/05/2008 18:12:09 by euw
Updates:
Scheduled Maintenance: The Adelbert Gymnasium and The One-To-One Fitness Centre
Problem: The Adelbert Gymnasium and The One-To-One Fitness Centre Cause: to move a six-rack-unit-high metal shelf down twenty-four rack-units Affects: all of the End-Users in the Adelbert Gymnasium and the One-To-One Fitness Centre Started: 05/05/2008 03:00 PM Resolved: 05/05/2008 06:00 PM
Notes:
to move it,
to the Rack Units 07 through 12,
from the Rack Units 31 through 36, and
sitting on this shelf is,
one B/B device, and
six Power/Injectors.
Created: 05/05/2008 16:21:40 by euw
Updates:
Scheduled Maintenance: The OSC Networking Trouble Ticket Number: 48148
Problem: The OSC Networking Trouble Ticket Number: 48148 Cause: The OSC Planned Maintenance Notification Affects: The Clients directly connected to the OSC CLEVS-r1 core router in Cleveland. Started: 05/07/2008 12:00 AM Resolved: 05/07/2008 03:00 PM
Notes:
The OSC engineers will be updating the routing on
the CLEVS-r1 router.
The changes being made have been tested in
a lab environment and no downtime to
the OSC clients is anticipated.
If you have any questions or concerns regarding
this planned work, please contact
the OSC Networking NOC and reference
the above ticket number.
Network Operations Center
OARnet - Networking Division of OSC
Phone: 1-800-627-6420
Email: support@oar.net
Created: 04/25/2008 15:53:45 by euw
Updates:
Scheduled Maintenance: Shutdown of old LDAP servers
Problem: Shutdown of old LDAP servers Cause: Old equipment being phased out of service Affects: A few users connecting directly to the old servers (ldap-replica2 & ldap-replica3) Started: 02/01/2008 05:45 AM Resolved: 02/01/2008 06:00 AM
Notes:
[02/01/08 5:55 AM] - This time it looks like we have managed to remove all the dependencies from the blog system. The servers will remain shut down unless we hear of other critical services with dependencies that cannot be easily modified to use the new servers.
[01/31/08 8:15 AM] - We have been forced to restart the servers as there are dependencies on the old servers in the blog system that we could not find and fix in the short term. We are moving the shutdown to tomorrow morning and will try again then.
[01/31/08 6:15 AM] - The shutdown went as planned. The servers are currently shut down in such a way that they can be brought quickly back to service should we find that anyone has missed updating a critical service. In about a week's time, we will begin taking steps to permanently remove the servers.
On January 31st, ITS will be shutting down two of its older LDAP servers. This will not have any effect on most users. If you are connecting to an LDAP server without using SSL then you should be using ldap.case.edu as the server address. That will get you a list of all the current LDAP servers without any need to update anything now or in the future.
If you are using SSL (and you should be if you are accessing any sensitive data) then you must connect to each server by name. In that case you may be affected by the shutdown.
The two servers being shut down are ldap-replica2.case.edu and ldap-replica3.case.edu. If you have any references to ldap-replica2 or ldap-replica3 you will need to replace them with the newer servers ldap-replica7 and ldap-replica8. Since these are all identical replicas, all you need to do is change the number. No other changes should be required.
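For anyone with many references to update, the rename can be scripted. The following is a minimal sketch, not an official ITS tool: the function name, the sample configuration line, and the choice of which new replica stands in for which old one (2 to 7, 3 to 8) are assumptions for illustration. Since the replicas are described as identical, either new server should work.

```python
import re

# Assumed pairing of retired replicas to their replacements.
REPLACEMENTS = {
    "ldap-replica2": "ldap-replica7",
    "ldap-replica3": "ldap-replica8",
}

# Match the retired names only when no further digit follows,
# so a name such as "ldap-replica20" would be left alone.
_OLD_HOST = re.compile(r"\b(ldap-replica[23])(?![0-9])")


def update_ldap_hosts(text: str) -> str:
    """Return text with references to the retired replicas replaced."""
    return _OLD_HOST.sub(lambda m: REPLACEMENTS[m.group(1)], text)


# Hypothetical configuration line using SSL and a named server.
print(update_ldap_hosts("uri ldaps://ldap-replica2.case.edu"))
```

Run it over a copy of each configuration file first and review the result before putting it into service.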
Created: 01/08/2008 12:54:26 by emr
Updates: 01/31/2008 06:12:06 by dak, 01/31/2008 08:20:22 by dak, 02/01/2008 06:06:59 by dak
