Host Down

by nwebb

We have one host down in our offsite datacenter.  It looks like a connectivity problem with the datacenter rather than a host issue.  We are working to resolve the issue.

Update 6:48pm – We have confirmed that it is yet another problem with a Cisco switch.  Datacenter engineers are working with Cisco now to resolve the issue.  Network connectivity has been intermittent since 6:37pm.

Update 8:40pm – Switching has been relatively stable for the last hour or so, but the datacenter engineers are continuing to research and diagnose the issue with Cisco TAC.

Update – Cisco identified a software bug.  They have applied updates to the equipment and the network is stable now.

Post Incident Report
Date / Time of Incident: 6/1/2010 6:44PM ET
Duration of Incident: 1 hour 13 minutes
Scope of Incident: Newark, Delaware Data Center Dedicated, Colocated and Cloud Customers

Description of Incident:
At 6:44PM ET one of the two redundant Cisco 6509 switches supporting the Dedicated, Colocated and Cloud Customers failed.  The second switch failed in the same manner when load from the first switch was re-routed to it.  The result was a loss of connectivity for approximately 40 minutes.  As the switches recovered, elevated latency persisted for another 33 minutes.
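The two phases described above add up to the reported incident duration. A minimal sketch of that arithmetic follows; the variable names and the `format_duration` helper are illustrative assumptions, not part of any datacenter tooling.

```python
from datetime import timedelta

# Figures from the incident description; names are illustrative only.
connectivity_loss = timedelta(minutes=40)  # approximate loss of connectivity
recovery_latency = timedelta(minutes=33)   # latency while the switches recovered


def format_duration(duration: timedelta) -> str:
    """Render a timedelta as 'H hour(s) M minute(s)' (hypothetical helper)."""
    total_minutes = int(duration.total_seconds()) // 60
    hours, minutes = divmod(total_minutes, 60)
    hour_label = "hour" if hours == 1 else "hours"
    minute_label = "minute" if minutes == 1 else "minutes"
    return f"{hours} {hour_label} {minutes} {minute_label}"


total = connectivity_loss + recovery_latency
print(format_duration(total))
```

Because the 40-minute figure is approximate, the total should be read as approximate as well.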

Actions During Incident:
6:44 PM – Network Operations and Network Engineering were immediately dispatched to investigate and resolve the issue.  A ticket was opened with the vendor of the switches (Cisco) and efforts to restore service were coordinated between the datacenter and Cisco.

6:55 PM – After exhausting normal troubleshooting avenues, Network Engineers reload Switch 1 from the console

6:58 PM – Switch 2 primary supervisor engine fails to boot, secondary supervisor engine becomes active

7:03 PM – Switch 1 completes rebooting, routing protocols are established, datacenter team monitors performance of the switch

7:08 PM – Switch 2 primary supervisor engine is manually reset, secondary supervisor engine becomes fully active

7:12 PM – Switch 1 primary supervisor engine fails, secondary supervisor begins to take over

7:20 PM – Switch 1 secondary supervisor takes over

7:37 PM – Switch 1 primary supervisor engine pulled, engineers suspect faulty hardware

7:48 PM – Switch 2 secondary supervisor engine completes booting and begins forwarding traffic.  CPU utilization on both switches remains elevated for 10 minutes while routing protocols are re-established

7:58 PM – Forwarding returns to normal, data delivered to Cisco for further analysis

Root Cause of Incident:
Cisco has identified a specific software bug as the likely root cause: the bug created a memory leak, which triggered the event.  Final confirmation from Cisco is still pending, but initial findings point to a software issue.

Further Actions:
Once Cisco confirms the bug, as expected, a software patch will be applied in a lab environment.  After the patch has been applied successfully there, Network Engineers will apply it during a scheduled and announced maintenance window.

The datacenter Network Engineering team is also looking into architectural changes that would add diversity for the customers currently supported by this pair of Cisco 6509s, to further insulate them from any future single core switching failure.