This afternoon we've had around 15 peers repeatedly flapping their BGP sessions with the APE route servers.

After a fair amount of digging through logs etc we think we've mostly pieced together what's happened.

Sometime this morning these peers started flapping their sessions. We're not sure what the trigger event was, but once they started flapping the route servers became very busy. Once in this busy state the route servers struggled to respond to BGP keepalives in a timely fashion. As a result, sessions where the BGP timers had been changed from the default (3 x 60 seconds) started to notice missing keepalives and bounced their sessions. Once these sessions started bouncing the route servers became even busier and the problem became self-sustaining. The route servers have now settled down, but it's unclear whether that's due to our fiddling with them or a peer ceasing some unfriendly behaviour.

We're working on building some new route servers with more horsepower.

In the meantime some interesting facts (taken from rs2):

84 peers defaulted to 60 second keepalives (this includes inactive sessions).
22 peers forced the timer down to 30 seconds.
2 peers forced the timer down to 20 seconds.
2 peers forced the timer down to 15 seconds.
2 peers forced the timer down to 10 seconds.
2 peers forced the timer down to 8 seconds.
2 peers forced the timer down to 5 seconds.
1 peer forced the timer down to 2 seconds.

There was a *very* strong correlation between low keepalive timers and how badly a peer was affected. Suffice to say the peer with a 2 second keepalive spent more time standing up their session than with a working session.
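
To put rough numbers on that correlation, here is a small Python sketch built from the rs2 counts above. It assumes every peer uses a hold time of 3 x its keepalive (the usual ratio, but I haven't checked the negotiated hold time on each session), and the names in it (PEERS_BY_KEEPALIVE, hold_time, peers_dropped) are made up for illustration. It counts how many sessions would time out if a busy route server sent nothing for a given number of seconds.

```python
# Keepalive interval in seconds -> number of peers, as counted on rs2.
# The 60 second figure includes inactive sessions.
PEERS_BY_KEEPALIVE = {60: 84, 30: 22, 20: 2, 15: 2, 10: 2, 8: 2, 5: 2, 2: 1}

# Assumption: hold time is 3 x keepalive for every peer.
HOLD_MULTIPLIER = 3


def hold_time(keepalive_s):
    """Hold time in seconds for a given keepalive, under the 3x assumption."""
    return HOLD_MULTIPLIER * keepalive_s


def peers_dropped(stall_s):
    """Count peers whose hold timer expires if they hear nothing from the
    route server for stall_s seconds."""
    return sum(
        count
        for keepalive_s, count in PEERS_BY_KEEPALIVE.items()
        if hold_time(keepalive_s) <= stall_s
    )


if __name__ == "__main__":
    for stall_s in (10, 30, 90, 180):
        print(f"{stall_s:>3}s of silence drops {peers_dropped(stall_s)} peers")
```

Under that assumption a 10 second stall is only enough to drop the 2 second peer, while a 90 second stall would take out every peer that lowered its timers and still leave the 60 second peers up.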

This seems like a good time to question the wisdom of using a low keepalive on the APE (or WIX etc). Given that your BGP session is with the route servers, not with other APE participants, the absence of a route server doesn't necessarily imply that you can't reach other peers. On the other hand, if your APE connection fails outright, reacting promptly is beneficial. Does anyone have a strong opinion on this?

As part of trying to reduce the load on the route servers we've removed the majority of the inactive peers. If this affects you and you'd like your session reinstated please drop an email to peering@citylink.co.nz.

At this stage we're only aware of one significant recent change: Pipe Networks have joined the APE and started announcing routes in the last 24 hours. They're now our second largest contributor with 595 routes at last count :) One working theory is that the additional routes pushed another peer's router over its limit (memory, hardware resource, etc) and that router started crashing/resetting.

Dylan