As promised, a more detailed write-up about the Citylink outage in Auckland earlier today.
Cutting straight to the chase, we deleted the native vlan from our VTP server (and hence from most of our switches) in Auckland. We believe this caused some of our switches to stop talking spanning tree on their switch-to-switch trunks, which resulted in a layer 2 loop and broadcast storm. Service was restored when an engineer re-created the vlan on the VTP server.
We have been making a number of changes over the last 12 months with the goal of improving stability on the network[1]. Most of these changes have revolved around implementing best practice and ensuring consistency across all our switches.
The particular changes that resulted in the outage were to the interface configuration on Citylink switch-to-switch trunks.
Previously a trunk configuration would look something like:
interface GigabitEthernet0/1
 switchport trunk native vlan 999
 switchport trunk allowed vlan 1-4094
Under normal circumstances no traffic should be using the native vlan. To enforce this, we removed the native vlan from the allowed list, e.g.
interface GigabitEthernet0/1
 switchport trunk native vlan 999
 switchport trunk allowed vlan 1-998,1000-4094
This change was completed without any issues.
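For anyone wanting to generate that kind of allowed list rather than type it by hand, here is a minimal sketch in Python. The function names are made up for illustration, and the upper bound defaults to 4094 on the assumption that it is the highest usable 802.1Q vlan ID; adjust it to suit your platform.

```python
def _span(first, last):
    """Format an inclusive vlan range the way IOS writes it."""
    return str(first) if first == last else f"{first}-{last}"


def allowed_vlans_excluding(native, low=1, high=4094):
    """Return an allowed-vlan list covering low..high but skipping `native`.

    Hypothetical helper: it only builds the string, it does not touch a switch.
    """
    if not low <= native <= high:
        # Native vlan is outside the range, so nothing needs to be cut out.
        return _span(low, high)
    parts = []
    if native > low:
        parts.append(_span(low, native - 1))
    if native < high:
        parts.append(_span(native + 1, high))
    # IOS uses the keyword "none" for an empty allowed list.
    return ",".join(parts) if parts else "none"


if __name__ == "__main__":
    print("switchport trunk allowed vlan " + allowed_vlans_excluding(999))
```

With the native vlan set to 999 this yields the two ranges either side of it, 1-998 and 1000-4094.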
The second part of the tidy-up was to remove the native vlan completely from the network. We tested this scenario in the lab and saw no impact from removing the native vlan.
What we suspect happened is that one or more switches had their spanning tree process bound to the native vlan. When the native vlan was removed, those switches stopped talking spanning tree on the affected trunks, resulting in a loop.
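One simple check before deleting a vlan is to list every trunk that still names it as native. The sketch below is illustrative only and is not part of the change described here: it assumes the running configs are available as plain text, the function name and sample config are hypothetical, and finding a reference does not by itself prove that spanning tree depends on that vlan.

```python
import re

NATIVE_RE = re.compile(r"switchport trunk native vlan (\d+)")


def trunks_with_native_vlan(config_text, vlan):
    """Return the interfaces whose trunk native vlan equals `vlan`."""
    hits = []
    current = None  # name of the interface stanza we are inside, if any
    for raw in config_text.splitlines():
        line = raw.strip()
        if line.startswith("interface "):
            current = line.split(None, 1)[1]
        elif current is not None:
            match = NATIVE_RE.fullmatch(line)
            if match and int(match.group(1)) == vlan:
                hits.append(current)
    return hits


# Hypothetical sample config, not taken from a real switch.
SAMPLE = """\
interface GigabitEthernet0/1
 switchport trunk native vlan 999
 switchport trunk allowed vlan 1-998,1000-4094
interface GigabitEthernet0/2
 switchport trunk native vlan 1
"""

if __name__ == "__main__":
    for name in trunks_with_native_vlan(SAMPLE, 999):
        print(f"{name} still uses vlan 999 as its native vlan")
```

Any interface this reports is worth a closer look on the real hardware before the vlan is removed from the VTP server.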
Service was restored by an engineer going to site and using the console to re-create the native vlan on the VTP server.
Also, a big thanks to John from FX, who helped us restore service with a very timely visit to the Sky Tower :)
Again, sorry for any inconvenience caused.
Dylan Hall
Network Engineer
CityLink Ltd
[1] Sadly the irony here hasn't gone unnoticed.