Hi all

On Wed, Jun 11, 2008 at 10:50:08PM +1200, Erin Salmon said:
> From what we've seen so far, it looks like APE exploded - packets coming out of everywhere gunging up the tubes.
> All fixed now, I imagine we'll hear from Citylink shortly.

Or not so shortly - sorry for the delay getting back to everybody.

Looks like somebody on APE managed to loop the APE fabric back on to us. We saw lots of this in the logs:

    Jun 11 22:31:14: %SW_MATM-4-MACFLAP_NOTIF: Host 00xx.xxxx.xxxx in vlan y is flapping between port Po2 and port Po1

where lots is, well, lots:

    # sudo zgrep flapping cisco.log.4.gz | wc -l
    15713

Of those, the switch showing the most flapping activity logged 5924 messages, of which 5908 mention g0/40. We had a word in the ear of the ISP attached to g0/40; they did admit that they were making changes earlier that evening, and they do have a couple of connections to APE.

This isn't actually a particularly uncommon occurrence - many of the ISPs that have multiple connections to APE have managed to achieve this over the years, so I don't want to hang these particular guys out to dry, since they're by no means the only folks to make this mistake. What I do observe, though, is that it gets more annoying every time it happens, and I'd like it to stop.

Essentially, we have two ways of making this stop. We can rely on spanning-tree, and hope it spots the loop and blocks somewhere, or we can impose MAC filters so that even if there is a loop, only the ISP's approved devices can appear on that port.

Relying on spanning tree clearly doesn't work - it gets filtered, there are incompatibilities between different implementations, and lots of people don't understand it.

MAC filtering works extremely effectively, but has been a pain to administer - discovering the MAC filters in place is a constant surprise to exchange users, and it requires exchange users to interact with Citylink when they'd rather not do so (usually at 3am, when they're stressed, and we're sleepy). So up until now MAC filtering has been applied haphazardly, because it caused such an increase in workload for both exchange participants and Citylink.

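For anyone wanting to do the same per-port tally on their own logs, here is a rough Python sketch. It assumes the log lines follow the MACFLAP_NOTIF format quoted above. The function names and the regular expression are my own illustration, not anything Citylink runs, and real logs may need the pattern adjusted.

```python
import gzip
import re
from collections import Counter

# Pattern modelled on the MACFLAP_NOTIF line quoted above (an assumption;
# other IOS versions may word the message differently).
FLAP_RE = re.compile(
    r"MACFLAP_NOTIF: Host (?P<mac>\S+) in vlan (?P<vlan>\S+) "
    r"is flapping between port (?P<a>\S+) and port (?P<b>\S+)"
)


def count_flap_ports(lines):
    """Count how many flap messages mention each port."""
    ports = Counter()
    for line in lines:
        match = FLAP_RE.search(line)
        if match:
            # Use a set so a port is counted once per message.
            ports.update({match.group("a"), match.group("b")})
    return ports


def count_flap_ports_in_file(path):
    """Same tally, read from a gzipped log such as cisco.log.4.gz."""
    with gzip.open(path, "rt", errors="replace") as handle:
        return count_flap_ports(handle)
```

Calling `count_flap_ports_in_file("cisco.log.4.gz").most_common(5)` would list the five ports named most often, which is roughly how a single port like g0/40 stands out from the uplinks.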
Happily, there is now a third way - since completing our 10GE upgrades a few weeks back, we now have mainly Cisco 3750s and 2960s in the core of the exchange, which means that we're running sufficiently recent versions of IOS that we can support secure static MAC aging.

Essentially, that means we can lock each exchange port to a fixed number of MACs (normally one), and if that MAC is idle for longer than a few minutes, or if the physical link on that port drops, then that MAC gets timed out and a new one can take its place.

No muss, no fuss - when you want to attach a new router, you unplug the old one and plug in the new one. If you can organise dropping the physical link you'll be working straight away; otherwise, it'll be five minutes before your new machine starts working.

We've been testing aging MAC limits in Wellington, and have found them largely problem-free (*), so we've started installing them on ports on APE - initially on ports newly going into service, but with last week's performance, we're going to place them on every port within the next week.

If you have any issues or concerns with this, please get in touch.

Cheers
Simon

(*) The issue that I can see is that if an ISP has a device attached that is chatty (e.g. a layer 3 switch), then there's a chance that it will win the race to be the approved MAC for that port. There isn't much we can do about that, other than setting the MAC limit higher - the best approach would be to shut up the chatty switch, either by removing it, or by configuring it to be quiet.
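
To make the aging behaviour concrete, here is a toy Python model of a port with a MAC limit and an idle timeout. It is only a sketch of the behaviour described above, not how IOS implements it: the class name, method names and the 300-second default (taken from the "five minutes" figure) are assumptions for illustration, and it simply drops frames from unapproved MACs rather than modelling any particular violation action.

```python
class AgingMacPort:
    """Toy model of a port limited to max_macs addresses with idle aging."""

    def __init__(self, max_macs=1, idle_timeout=300.0):
        self.max_macs = max_macs
        self.idle_timeout = idle_timeout  # seconds of silence before a MAC ages out
        self.last_seen = {}  # approved MAC -> time of its most recent frame

    def _expire(self, now):
        # Forget any approved MAC that has been idle longer than the timeout.
        self.last_seen = {
            mac: seen
            for mac, seen in self.last_seen.items()
            if now - seen <= self.idle_timeout
        }

    def frame(self, mac, now):
        """Return True if a frame from mac at time now (seconds) is accepted."""
        self._expire(now)
        if mac in self.last_seen or len(self.last_seen) < self.max_macs:
            self.last_seen[mac] = now
            return True
        return False  # port is full and this MAC is not approved

    def link_down(self):
        """A physical link drop clears the approved MACs immediately."""
        self.last_seen.clear()


port = AgingMacPort()
port.frame("old-router", now=0)    # accepted, becomes the approved MAC
port.frame("new-router", now=10)   # rejected, the slot is taken
port.frame("new-router", now=311)  # accepted, old-router aged out after 300s idle
```

The same model shows the chatty-device race in the footnote: after a timeout or link drop, whichever device sends the first frame takes the slot, so a talkative layer 3 switch can beat the router you actually wanted approved.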