
Emergency Maintenance

Emergency Maintenance: mv-qss (MediaVision QuickTime Streaming video server)

Problem:   mv-qss (MediaVision QuickTime Streaming video server)
Cause:     need to finish unsuccessful hardware install from Weds.  morning
Affects:   all users of mv-qss server
Started:   05/15/2008 05:00 AM
Resolved:  05/15/2008 06:00 AM

Notes:

A work-around has been found for the problem of fitting a PCI-X shaped peg in a PCI-E shaped hole.


Created: 05/14/2008 16:39:48 by jan3

Updates:


Emergency Maintenance: Restarting webmail server

Problem:   Restarting webmail server
Cause:     A spammer is logged into someone's webmail account and won't log out
Affects:   People using webmail
Started:   05/13/2008 05:30 PM
Resolved:  05/13/2008 05:35 PM

Notes:

We only plan to have the webmail service down for a minute or so.

The restart was successful. The service was down for less than a minute.


Created: 05/13/2008 14:52:49 by emr

Updates: 05/13/2008 17:56:10 by emr



Problem Report

Problem Report: PBL-p10-e1 switches is down

Problem:   PBL-p10-e1 switches is down
Cause:     Fan Tray Failure
Affects:   Some network users at PBL
Started:   05/12/2008 04:06 PM
Resolved:  05/12/2008 05:50 PM

Notes:

Switch failed again about one and a half hours later. A reboot restored it.

Found a failed fan tray. The fan tray failure caused the switches to halt immediately to prevent component overheating. Replaced the failed component. All services are back to normal.

Network engineer is en route to investigate.


Created: 05/12/2008 17:01:28 by wxc16

Updates: 05/12/2008 17:53:11 by wxc16, 05/13/2008 07:40:57 by roo


Problem Report: Nrv house 3 network switch

Problem:   Nrv house 3 network switch
Cause:     unknown
Affects:   All network connections in Nrv house 3, including phones and wireless
Started:   05/09/2008 09:52 PM
Resolved:  

Notes:

Investigating the cause.
5/10/08 10:45AM Supervisor module with critical error was replaced
5/10/08 11:01AM Restored switch to minimal function - still no voice but switch online
5/10/08 7:05PM OS version bug caused voice and wireless card to fail. Need to upgrade OS
5/11/08 2:06AM Upgraded OS and configs - still issues
5/11/08 7:00AM Identified the cause of the problem as a damaged data line card. Voice, data and wireless restored.


Created: 05/10/2008 06:53:55 by roo

Updates: 05/11/2008 08:46:12 by roo


Problem Report: Mail system crashed

Problem:   Mail system crashed
Cause:     Unknown at this time
Affects:   Mail system users
Started:   05/09/2008 05:05 AM
Resolved:  05/09/2008 06:17 AM

Notes:

[05/09/08 6:15 AM] - The mail system is back up and appears stable. Server Engineering will be going through the logs to determine what exactly happened that caused the system to crash in the first place.

The maintenance for this morning at first appeared to go fine but then the mail system crashed and we have been unable to restore service. Engineers from both Middleware Services and Server Engineering are working on the problem. The problem at this point seems to have something to do with the software that controls the mail box storage.


Created: 05/09/2008 05:56:15 by dak

Updates: 05/09/2008 06:17:03 by dak


Problem Report: Crawford Data Center AC degraded

Problem:   Crawford Data Center AC degraded
Cause:     Unit #3 operating at 50%
Affects:   High Performance Compute Cluster (HPCC)
Started:   05/08/2008 10:13 AM
Resolved:  05/09/2008 12:08 PM

Notes:

Workaround found as we deal w/ fixing both A/C units.

Air Conditioner unit was restarted at 10:43AM but second compressor would not start. Maintenance is scheduled for morning of May 9, 2008. As a precaution, to prevent overheating the room, the High Performance Compute Cluster (HPCC) has been configured to not allow new jobs to start until sufficient cooling is restored.


Created: 05/08/2008 18:44:36 by bsc4

Updates: 05/09/2008 17:08:00 by jhm


Problem Report: blackboard.case.edu outage

Problem:   blackboard.case.edu outage
Cause:     failed restart process
Affects:   all Blackboard users
Started:   05/04/2008 10:05 AM
Resolved:  05/04/2008 10:51 AM

Notes:

The nightly restart process failed on one server, leaving it unable to take over when the other server needed to restart itself.


Created: 05/04/2008 11:10:48 by prj

Updates:


Problem Report: VoIP: User unable to make personal setting changes

Problem:   VoIP:  User unable to make personal setting changes
Cause:     Unknown
Affects:   see notes
Started:   04/28/2008 09:54 AM
Resolved:  04/28/2008 10:41 AM

Notes:

This issue has been resolved. Engineer verified that new personal setting for end-user is applied to the phone number correctly.


Engineer is working on resolving the issue.

Some end-users may not be able to change Personal Settings on their phone even though the Web Page shows that the changes were made.


Created: 04/28/2008 09:56:38 by wxc16

Updates: 04/28/2008 10:41:39 by wxc16


Problem Report: Internet is slow from campus to internet

Problem:   Internet is slow from campus to internet
Cause:     possible DOS or malware virus activity
Affects:   all internet activity on campus
Started:   04/24/2008 02:43 PM
Resolved:  

Notes:

The Internet will be slow until an upstream Internet provider repairs a fiber optic cable cut.

Details follow.

An upstream Internet provider, Global Crossing, is reporting a fiber cut in the region of Michigan or Indiana. They are re-routing traffic around the cut. The re-routed traffic is overloading and saturating the smaller links that now carry it, which is causing high latency and packet loss.

Engineers are investigating the issue. Edge firewalls have been reloaded.


Created: 04/24/2008 14:44:46 by lxc152

Updates: 04/24/2008 15:38:46 by lxc152


Problem Report: pgd-m1-vg248 - conn - Thu Apr 24 00:15:30 2008

Problem:   pgd-m1-vg248 - conn - Thu Apr 24 00:15:30 2008
Cause:     unknown
Affects:   The PGD-M1-VG248, one of the Analog Phone Gateways in the Fiji Fraternity House.
Started:   04/24/2008 12:15 AM
Resolved:  04/24/2008 07:20 AM

Notes:

Repaired.


Created: 04/24/2008 07:40:45 by euw

Updates:


Problem Report: 4 ports down on KSL SAN Switch

Problem:   4 ports down on KSL SAN Switch
Cause:     Failed Board in Fiber Channel Switch
Affects:   Performance may be degraded but no outage is expected.
Started:   04/23/2008 11:00 AM
Resolved:  04/23/2008 04:15 PM

Notes:

Waiting on a replacement fiber channel board.
Replacement will hopefully happen between 4 and 6pm today.


Created: 04/23/2008 13:39:00 by rfw

Updates: 04/23/2008 17:08:50 by rfw


Problem Report: Case Voicemail System Degraded

Problem:   Case Voicemail System Degraded
Cause:     Unknown
Affects:   Campus Voicemail Services
Started:   04/21/2008 09:30 AM
Resolved:  

Notes:

Engineer was notified of the voicemail problem this morning and immediately switched voicemail service to the backup server. The main voicemail server was rebooted while voicemail service continued to be served by the backup server. Service was back to normal until the backup server stopped working around 3:15pm. The voicemail service was then switched back to the main voicemail server.

Both voicemail servers are now in operational state.


Created: 04/21/2008 17:32:21 by wxc16

Updates:


Problem Report: axo-m1-e1, zbt-m1-e1

Problem:   axo-m1-e1, zbt-m1-e1
Cause:     unknown
Affects:   Network access, phones and wireless in both buildings
Started:   04/21/2008 10:05 AM
Resolved:  04/21/2008 04:05 PM

Notes:

[10:05 AM]
This is typical of power outage in both buildings,
Eng is on his way to investigate.
[11:55 AM]
Waiting for the electrical power to be restored
to both buildings, and to the traffic lights at
the intersection of the Bellflower and Ford Roads.
[04:05 PM]
The electrical power came back on, and
the sec-axo.TELEM, the axo-m1-e1, and the zbt-m1-e1
came back on by themselves without a problem.
I had to work on getting
the axo-m1-vg248 and the zbt-m1-vg248
to start working correctly again, and
luckily, both efforts were successful.


Created: 04/21/2008 11:08:41 by roo

Updates: 04/21/2008 11:57:57 by euw, 04/21/2008 16:44:56 by euw


Problem Report: Mail List Manager (Sympa) was down

Problem:   Mail List Manager (Sympa) was down
Cause:     unknown
Affects:   Mail list users
Started:   04/20/2008 04:00 AM
Resolved:  04/20/2008 05:10 AM

Notes:

The mail list manager (Sympa) had several processes that were hung and was not processing mail in its incoming queue. The system had to be completely shut down and the hung processes forcibly terminated before the system could be brought back into operation.

No mail was lost during the time the mail list manager was unavailable, it was merely delayed and was delivered as soon as the hung processes were terminated and the system was restarted.


Created: 04/20/2008 09:15:02 by dak

Updates:


Problem Report: stone-m1-e1 - conn - Thu Apr 17 23:05:41 2008

Problem:   stone-m1-e1 - conn - Thu Apr 17 23:05:41 2008
Cause:     unknown
Affects:   The Stone-M1-E1, a Cisco Switch which operates at the Access Level.
Started:   04/17/2008 11:05 PM
Resolved:  04/18/2008 12:20 PM

Notes:

Apr 17, 11:05 PM
The Problem Began.
Apr 18, 12:50 AM
Repaired.
Apr 18, 02:05 AM
Disconnected the Port 04/08 from the Switch;
Hardware reset the Switch;
Software reset the Switch;
the Switch is now reachable
by PING and by Telnet;
Reconnected the Port 04/08 to the Switch.
Apr 18, 08:27 AM
Still unreachable.
Apr 18, 09:12 AM
The switch has returned to normal operation.
Apr 18, 12:20 PM
Found the source of the problem:
a Physical Loop and
a Media Converter which had gone bad.
Took them out of the Circuit, and
things quickly returned to normal.


Created: 04/17/2008 23:49:41 by euw

Updates: 04/18/2008 01:03:38 by euw, 04/18/2008 08:26:22 by roo, 04/18/2008 09:12:37 by roo, 04/19/2008 02:08:58 by euw, 04/19/2008 08:52:38 by euw


Problem Report: HTMLDB & wwwftp server down

Problem:   HTMLDB &




Problem Report: Off Campus Connection Down

Problem:   Off Campus Connection Down
Cause:     Failed firewall update
Affects:   All outbound internet connections to sites off campus
Started:   03/28/2008 11:00 AM
Resolved:  

Notes:

The issue with the internet is resolved for major systems (mail, web and the like) and should be resolved for all other systems as well. We expect to still see some issues with access due to operating systems caching various items. Please reboot if you are still having issues.

Root cause is still being determined. Based on first-hand reports, this issue does not appear to be related to firewall rule changes. They were completed at 8:30 am this morning. This issue surfaced around 11:00 am or so.


A network engineer is looking into the problem and will update when they find the problem.

Problem likely due to problematic rule push to edge firewall early this morning. Primary firewall routing table was lost and redundant firewall configuration was lost. Primary routing table was rebuilt by engineers to correct problem. Network security engineer currently rebuilding redundant firewall and investigating reasons for failed update. ~ Ozanich


Created: 03/28/2008 11:15:29 by emr

Updates: 03/28/2008 14:30:51 by jxo63, 03/28/2008 15:04:13 by lxc152


Problem Report: Google Start Page (http://webstart.case.edu) Is Performing Infinite Redirects to http://login.case.edu

Problem:   Google Start Page (http://webstart.case.edu) Is Performing Infinite Redirects to http://login.case.edu
Cause:     Google Is Redirecting Improperly
Affects:   http://webstart.case.edu
Started:   03/21/2008 02:00 PM
Resolved:  

Notes:

Google has been notified of the problem.

In the meantime, to work around the issue, after getting to the error message in your browser, go back to the URL bar and manually enter "webstart.case.edu" and navigate back to the page.


Created: 03/21/2008 15:27:23 by jms18

Updates:


Problem Report: Mail servers have stopped processing mail

Problem:   Mail servers have stopped processing mail
Cause:     Unknown at this time
Affects:   Mail delivery
Started:   02/14/2008 11:30 AM
Resolved:  

Notes:

We can see no reason for the problems we are seeing. At this point the queues are growing on all the delivery (both incoming virus and spam) layers. We are working with the vendor to try to resolve the problem.


Created: 02/14/2008 15:03:18 by dak

Updates:


Problem Report: Unified Messaging Server issue

Problem:   Unified Messaging Server issue
Cause:     Unknown
Affects:   Case Unified Messaging Services
Started:   02/14/2008 08:00 AM
Resolved:  

Notes:

End-users are reporting that there is an intermittent issue with the UM system when dead air is heard. Engineers are currently investigating the issue.


Created: 02/14/2008 10:56:29 by wxc16

Updates:


Problem Report: Mail Delivery Delays

Problem:   Mail Delivery Delays
Cause:     Huge Increase in spam traffic
Affects:   Mail delivery times
Started:   02/12/2008 09:00 AM
Resolved:  

Notes:

We have been seeing a large increase in spam today that has been seriously affecting our delivery times. We have been seeing about a three-fold increase in traffic from off-campus.

As an interim step to alleviate the worst of the delivery delays, we have pressed into service some evaluation units with higher capacity than the existing production units. It appears the additional capacity is having a positive effect.

Since the new units are a new model with which we are not completely familiar, we will continue to monitor them through the course of the evening to make sure that they are performing correctly.


Created: 02/12/2008 17:13:02 by dak

Updates:


Problem Report: MARS appliance not powering up

Problem:   MARS appliance not powering up
Cause:     Power supply?
Affects:   Security, Network Engineering, reviewers of log aggregate reports
Started:   02/01/2008 08:00 AM
Resolved:  

Notes:


Network Engineering members are offsite today - Server Engineering was kind enough to check power & network status. The appliance was not powered up, and remained unresponsive even after the power cable was re-seated.


Created: 02/01/2008 12:24:33 by rac40

Updates:


Problem Report: KSL load balancer upgrade

Problem:   KSL load balancer upgrade
Cause:     possible faulty hardware
Affects:   ksl data center
Started:   01/12/2008 10:16 AM
Resolved:  

Notes:

Waiting on a call back from Cisco TAC.
The issue has been temporarily resolved.
nesg will continue to monitor the issue.

Additional work will be needed to fully resolve this issue.


Created: 01/12/2008 10:37:32 by lxc152

Updates: 01/12/2008 11:37:08 by lxc152


Problem Report: Case IM gateway to Yahoo Messenger not working

Problem:   Case IM gateway to Yahoo Messenger not working
Cause:     Yahoo changed their protocol
Affects:   Case IM (Spark client) users using the Yahoo gateway
Started:   12/18/2007 12:47 AM
Resolved:  

Notes:

Yahoo changed something with their Messenger protocol which is preventing the Case IM gateway from working. You can find more information about the problem at
http://www.igniterealtime.org/community/thread/30590?tstart=15

In the meantime, the Yahoo gateway has been disabled until the Openfire IM gateway plugin is upgraded.


Created: 12/18/2007 12:53:07 by sdh7

Updates:


Problem Report: Crawfordsvr0 and Crawfordsvr1 connections

Problem:   Crawfordsvr0 and Crawfordsvr1 connections
Cause:     Crawford Data Center Interswitch link connection flapping
Affects:   Connections on the Switch  
Started:   05/26/2007 11:40 PM
Resolved:  08/15/2007 06:00 AM

Notes:

UPDATE: Engineer has made a temporary interswitch link patch in the Crawford data center. The problem link has been disabled to prevent further problems. Most of the symptoms we've seen have gone away. Engineers will investigate this problem further.

ITS Engineers discovered a flapping interswitch link in the Crawford server farm switches. Still investigating.


Created: 05/26/2007 23:44:53 by roo

Updates: 05/27/2007 04:25:20 by wxc16



Scheduled Maintenance

Scheduled Maintenance: Campus WWW Server Upgrade

Problem:   Campus WWW Server Upgrade
Cause:     Migrating the campus WWW server to its new platform
Affects:   All users of the campus WWW server
Started:   05/19/2008 08:00 AM
Resolved:  05/19/2008 09:00 AM

Notes:

We are moving the campus WWW server from its multi-year-old architecture to a modern configuration.


Created: 05/16/2008 13:24:08 by jms20

Updates:


Scheduled Maintenance: Name Server Changes for New Single Sign On

Problem:   Name Server Changes for New Single Sign On
Cause:     Need redundant services in Crawford computer room before the KSL shutdown
Affects:   Should not affect anyone
Started:   05/20/2008 05:00 AM
Resolved:  05/20/2008 05:30 AM

Notes:

We are doing some name server work now so that some IP address lifetimes have a chance to expire before the shutdown and the move to the new service.

There should actually be no downtime here. We are simply repointing some name aliases, which should be virtually instantaneous. We are not actually moving to the new server at this time, just laying the groundwork for the move in about a week.
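
The point about lifetimes can be sketched with a little arithmetic. The snippet below is only an illustration: the change time and the one-day lifetime (TTL) are hypothetical values, not figures from this notice. A resolver that cached the old address just before the change may keep serving it for one full lifetime, which is why the change is made well ahead of the move.

```python
from datetime import datetime, timedelta


def latest_stale_time(changed_at: datetime, ttl_seconds: int) -> datetime:
    """Latest moment a well-behaved cache could still hold the old record.

    A resolver that fetched the old record just before the change may keep
    it for one full TTL, so the old answer can persist until
    changed_at + ttl_seconds.
    """
    return changed_at + timedelta(seconds=ttl_seconds)


if __name__ == "__main__":
    # Hypothetical values: a change at the start of the maintenance window
    # and a one-day TTL on the old record.
    change = datetime(2008, 5, 20, 5, 0)
    print(latest_stale_time(change, 86400))
```

With these assumed values, caches would be expected to have dropped the old address a day after the change, comfortably before a move planned for about a week later.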


Created: 05/15/2008 16:26:37 by dak

Updates:


Scheduled Maintenance: Defective Hard Drive on the Windows Server Kessel

Problem:   Defective Hard Drive on the Windows Server Kessel
Cause:     Failure prediction threshold exceeded due to test
Affects:   No one is affected; the drive is part of a RAID set
Started:   05/16/2008 04:30 PM
Resolved:  05/16/2008 05:30 PM

Notes:

The Windows Team will replace the defective drive. No outage is expected.


Created: 05/15/2008 14:26:40 by tpw9

Updates:


Scheduled Maintenance: mv-qss and mv-flash will be down for scheduled maintenance

Problem:   mv-qss and mv-flash will be down for scheduled maintenance
Cause:     Add HBA controller to each system
Affects:   all server functions
Started:   05/14/2008 04:00 AM
Resolved:  05/14/2008 06:00 AM

Notes:

mv-qss and mv-flash will be powered down for scheduled maintenance on Wed. May 14, 2008 at 04:00. Will be powered up no later than 06:00 that day.

mv-flash received new HBA. Also performed PERC firmware update and new driver installation.

Was unable to install a new HBA in mv-qss. Manufacturer shipped the wrong part. Will have to reschedule maintenance.


Created: 05/12/2008 17:35:30 by dxw134

Updates: 05/14/2008 06:25:05 by dxw134


Scheduled Maintenance: http://mail.case.edu will be down

Problem:   http://mail.case.edu will be down
Cause:     Branding changes require reboot
Affects:   Users of http://mail.case.edu
Started:   05/09/2008 05:00 AM
Resolved:  05/09/2008 05:30 AM

Notes:

[05/09/08 6:30 AM] - The maintenance this morning did not go quite as smoothly as expected. After we made the branding changes, the mail system appeared to run fine for about 5 minutes and then unexpectedly crashed. The problem seems to have been with the software that controls the mail box storage. Service has been restored (as of about 6:10) and everything appears stable at this time. We will be going through the logs to determine exactly what caused the crash and will post an update if appropriate.

To eliminate some confusion between http://mail.case.edu (the current mail system web mail interface) and http://webmail.case.edu (the Google Apps web mail interface) we will be removing the "Webmail@Case.edu" banner on http://mail.case.edu. Unfortunately that requires us to stop and restart the interface. Actual downtime is expected to be about 5 minutes.

This will only affect users of the web mail interface. POP and IMAP users will be unaffected.


Created: 05/07/2008 09:03:12 by dak

Updates: 05/09/2008 06:27:23 by dak


Scheduled Maintenance: Crawford 315, Rack 304

Problem:   Crawford 315, Rack 304
Cause:     to unrack the i-Trap appliance
Affects:   discontinuing its services
Started:   05/07/2008 03:00 PM
Resolved:  05/07/2008 06:00 PM

Notes:

The appliance will then be returned to its Manufacturer this week.


Created: 05/05/2008 18:12:09 by euw

Updates:


Scheduled Maintenance: The Adelbert Gymnasium and The One-To-One Fitness Centre

Problem:   The Adelbert Gymnasium and The One-To-One Fitness Centre
Cause:     to move a six-rack-unit-high metal shelf down twenty-four rack-units
Affects:   all of the End-Users in the Adelbert Gymnasium and the One-To-One Fitness Centre
Started:   05/05/2008 03:00 PM
Resolved:  05/05/2008 06:00 PM

Notes:

The shelf will be moved
from the Rack Units 31 through 36
to the Rack Units 07 through 12.
Sitting on this shelf are
one B/B device and
six Power/Injectors.


Created: 05/05/2008 16:21:40 by euw

Updates:


Scheduled Maintenance: The OSC Networking Trouble Ticket Number: 48148

Problem:   The OSC Networking Trouble Ticket Number:  48148
Cause:     The OSC Planned Maintenance Notification
Affects:   The Clients directly connected to the OSC CLEVS-r1 core router in Cleveland.  
Started:   05/07/2008 12:00 AM
Resolved:  05/07/2008 03:00 PM

Notes:

The OSC engineers will be updating the routing on
the CLEVS-r1 router.
The changes being made have been tested in
a lab environment and no downtime to
the OSC clients is anticipated.

If you have any questions or concerns regarding
this planned work, please contact
the OSC Networking NOC and reference
the above ticket number.

Network Operations Center
OARnet - Networking Division of OSC
Phone: 1-800-627-6420
Email: support@oar.net


Created: 04/25/2008 15:53:45 by euw

Updates:


Scheduled Maintenance: Shutdown of old LDAP servers

Problem:   Shutdown of old LDAP servers
Cause:     Old equipment being phased out of service
Affects:   A few users connecting directly to the old servers (ldap-replica2 & ldap-replica3)
Started:   02/01/2008 05:45 AM
Resolved:  02/01/2008 06:00 AM

Notes:

[02/01/08 5:55 AM] - This time it looks like we have managed to remove all the dependencies from the blog system. The servers will remain shut down unless we hear of other critical services with dependencies that cannot be easily modified to use the new servers.

[01/31/08 8:15 AM] - We have been forced to restart the servers as there are dependencies on the old servers in the blog system that we could not find and fix in the short term. We are moving the shutdown to tomorrow morning and will try again then.

[01/31/08 6:15 AM] - The shutdown went as planned. The servers are currently shut down in such a way that they can be brought quickly back to service should we find that anyone has missed updating a critical service. In about a week's time, we will begin taking steps to permanently remove the servers.

On January 31st, ITS will be shutting down two of its older LDAP servers. This will not have any effect on most users. If you are connecting to an LDAP server without using SSL then you should be using ldap.case.edu as the server address. That will get you a list of all the current LDAP servers without any need to update anything now or in the future.



If you are using SSL (and you should be if you are accessing any sensitive data) then you must connect to each server by name. In that case you may be affected by the shutdown.



The two servers being shut down are ldap-replica2.case.edu and ldap-replica3.case.edu. If you have any references to ldap-replica2 or ldap-replica3 you will need to replace them with the newer servers ldap-replica7 and ldap-replica8. Since these are all identical replicas, all you need to do is change the number. No other changes should be required.
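
For anyone with many references to update, the renaming described above could be scripted. The following is a minimal sketch, not an ITS-provided tool: the function name and the sample configuration line are hypothetical, and the pairing of ldap-replica2 with ldap-replica7 and ldap-replica3 with ldap-replica8 is an assumption, since the notice only says to change the number.

```python
import re

# Assumed pairing: the notice does not say which new replica replaces which
# old one, so this mapping is illustrative only.
REPLACEMENTS = {
    "ldap-replica2": "ldap-replica7",
    "ldap-replica3": "ldap-replica8",
}

# Match the old host names only when no further digit follows, so a
# hypothetical name such as ldap-replica20 is left untouched.
_OLD_HOST = re.compile(r"ldap-replica[23](?!\d)")


def update_ldap_hosts(config_text: str) -> str:
    """Return config_text with references to the retired replicas rewritten."""
    return _OLD_HOST.sub(lambda m: REPLACEMENTS[m.group(0)], config_text)


if __name__ == "__main__":
    # Hypothetical configuration line, for illustration only.
    sample = "uri ldaps://ldap-replica2.case.edu ldaps://ldap-replica3.case.edu"
    print(update_ldap_hosts(sample))
```

References to ldap.case.edu pass through unchanged, which matches the advice above for connections that do not use SSL.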


Created: 01/08/2008 12:54:26 by emr

Updates: 01/31/2008 06:12:06 by dak, 01/31/2008 08:20:22 by dak, 02/01/2008 06:06:59 by dak

