As promised, a more detailed write-up about the Citylink outage in
Auckland earlier today.
Cutting straight to the chase, we deleted the native vlan from our VTP
server (and hence from most of our switches) in Auckland. We believe this
caused some of our switches to stop talking spanning tree on their
trunks, which resulted in a layer 2 loop and broadcast storm. Service was
restored when an engineer re-created the vlan on the VTP server.
We have been making a number of changes over the last 12 months with the
goal of improving stability on the network[1]. Most of these changes
have revolved around implementing best practice and ensuring consistency
across all our switches.
The particular changes that resulted in the outage were to the
interface configuration on Citylink switch-to-switch trunks.
Previously, a trunk configuration would look something like:
  interface GigabitEthernet0/1
   switchport trunk native vlan 999
   switchport trunk allowed vlan 1-4095
Under normal circumstances no traffic should be using the native vlan.
To enforce this, we removed the native vlan from the allowed list, e.g.
  interface GigabitEthernet0/1
   switchport trunk native vlan 999
   switchport trunk allowed vlan 1-998,1000-4095
This change was completed without any issues.
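For anyone wanting to generate that kind of allowed list rather than
type it by hand, here is a minimal Python sketch. It is not the tooling
we used; the function name allowed_vlan_list and its parameters are
made up for illustration, and the 1-4095 bounds simply mirror the
example above.

```python
def allowed_vlan_list(first, last, excluded):
    """Build a compact 'a-b,c-d' string covering first..last minus excluded.

    Hypothetical helper for illustration only; not part of any Cisco tool.
    """
    ranges = []
    start = None  # first vlan of the run currently being built, if any
    for vlan in range(first, last + 1):
        if vlan in excluded:
            # An excluded vlan closes the current run.
            if start is not None:
                ranges.append((start, vlan - 1))
                start = None
        elif start is None:
            start = vlan
    if start is not None:
        ranges.append((start, last))
    # Single-vlan runs are written as "n", longer runs as "a-b".
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


# Exclude the native vlan (999) from the range used in the example above.
print(allowed_vlan_list(1, 4095, {999}))
```

With 999 excluded, this prints the same list as in the second trunk
example.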
The second part of the tidy-up was to remove the native vlan completely
from the network. We lab-tested this scenario and saw no impact
when removing the native vlan.
What we suspect happened is that one or more switches had their spanning
tree process bound to the native vlan. When the native vlan was removed,
those switches stopped talking spanning tree, resulting in a loop.
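To illustrate why a loop with no spanning tree is so damaging, here is a
toy Python model of broadcast flooding. It is only a sketch under
simplifying assumptions: the switch names and topologies are invented,
every switch floods a broadcast out all links except the one it arrived
on, and nothing here represents our actual network.

```python
def flood(links, source, hops):
    """Count copies of one broadcast frame in flight at each hop.

    Toy model: links are undirected (a, b) pairs between hypothetical
    switches, and no spanning tree is blocking any of them.
    """
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    # Each in-flight copy is (sending switch, receiving switch).
    in_flight = [(source, n) for n in sorted(neighbours.get(source, ()))]
    counts = []
    for _ in range(hops):
        counts.append(len(in_flight))
        next_hop = []
        for sender, receiver in in_flight:
            # Flood out every link except the one the copy arrived on.
            for n in sorted(neighbours[receiver]):
                if n != sender:
                    next_hop.append((receiver, n))
        in_flight = next_hop
    return counts


triangle = [("A", "B"), ("B", "C"), ("A", "C")]
tree = [("A", "B"), ("A", "C")]  # same switches with the B-C link blocked
mesh = [("A", "B"), ("A", "C"), ("A", "D"),
        ("B", "C"), ("B", "D"), ("C", "D")]

print(flood(triangle, "A", 5))  # copies circulate forever
print(flood(tree, "A", 5))      # copies die out after one hop
print(flood(mesh, "A", 4))      # copies multiply every hop
```

In this model a single broadcast never dies in the triangle, doubles at
every hop in the four-switch mesh, and disappears after one hop once the
redundant link is blocked, which is the job spanning tree normally does.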
Service was restored by an engineer going to site and using the console
to re-create the native vlan on the VTP server.
Also, a big thanks to John from FX who assisted us in restoring service
with a very timely visit to the Sky Tower :)
Again, sorry for any inconvenience caused.
Dylan Hall
Network Engineer
CityLink Ltd
[1] Sadly the irony here hasn't gone unnoticed.