This afternoon we've had ~15 peers repeatedly flapping their BGP sessions
with the APE route servers.
After a fair amount of digging through logs etc we think we've mostly
pieced together what's happened.
Sometime this morning these peers started flapping their sessions. We're
not sure what the trigger event was, but once they started flapping the
route servers became very busy. Once in this busy state the route
servers were struggling to respond to BGP keepalives in a timely
fashion. As a result, sessions where the BGP timers had been lowered from
the default (3 x 60 seconds, i.e. a 60 second keepalive and a 180 second
hold time) started to notice missing keepalives and
bounced their sessions. Once these sessions started bouncing the route
servers became even busier and the problem became self-sustaining. The
route servers have now settled down but it's unclear if that's due to
our fiddling with them or a peer ceasing some unfriendly behaviour.
We're working on building some new route servers with more horsepower.
In the meantime some interesting facts (taken from rs2):
84 peers defaulted to 60 second keepalives (this includes inactive
sessions).
22 peers forced the timer down to 30 seconds.
2 peers forced the timer down to 20 seconds.
2 peers forced the timer down to 15 seconds.
2 peers forced the timer down to 10 seconds.
2 peers forced the timer down to 8 seconds.
2 peers forced the timer down to 5 seconds.
1 peer forced the timer down to 2 seconds.
There was a *very* strong correlation between low keepalive timers and
how badly a peer was affected. Suffice it to say the peer with a 2 second
keepalive spent more time standing up their session than they did with a
working session.
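To put those timers in perspective, here's a rough sketch that works out
the hold time each group of peers would end up with. It assumes the hold
time is always three times the keepalive (the same ratio as the 3 x 60
default); we haven't checked the hold time each peer actually negotiated,
and the names used below are just for illustration.

```python
# Peer counts per keepalive interval in seconds, as listed above (from rs2).
PEERS_BY_KEEPALIVE = {60: 84, 30: 22, 20: 2, 15: 2, 10: 2, 8: 2, 5: 2, 1 * 2: 1}

# Assumption: hold time = 3 x keepalive, mirroring the 3 x 60 second default.
# A peer may configure a different ratio; this is illustrative only.
HOLD_MULTIPLIER = 3


def hold_time(keepalive_s):
    """Seconds without a keepalive before the session is torn down."""
    return HOLD_MULTIPLIER * keepalive_s


def summarise(peers_by_keepalive):
    """Return (keepalive, hold time, peer count) rows, slowest timer first."""
    return [
        (keepalive, hold_time(keepalive), peers_by_keepalive[keepalive])
        for keepalive in sorted(peers_by_keepalive, reverse=True)
    ]


if __name__ == "__main__":
    for keepalive, hold, peers in summarise(PEERS_BY_KEEPALIVE):
        print(f"{peers:3d} peers  keepalive {keepalive:2d}s  hold {hold:3d}s")
    total = sum(PEERS_BY_KEEPALIVE.values())
    lowered = total - PEERS_BY_KEEPALIVE[60]
    print(f"{lowered} of {total} peers use a keepalive below 60s")
```

Under that assumption the peer with the 2 second keepalive would drop its
session after only 6 seconds without hearing from a route server, versus
180 seconds for a peer on the defaults.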
This seems like a good time to question the wisdom of using a low
keepalive on the APE (or WIX etc). Given your BGP session is with the
route servers, not other APE participants, the absence of a route server
doesn't necessarily imply the inability to reach other peers. On the
other hand if your APE connection fails outright, reacting promptly is
beneficial. Does anyone have a strong opinion on this?
As part of trying to reduce the load on the route servers we've removed
the majority of the inactive peers. If this affects you and you'd like
your session reinstated please drop an email to peering(a)citylink.co.nz.
At this stage we're only aware of one significant recent change: Pipe
Networks have joined the APE and started announcing routes in the last
24 hours. They're now our second largest contributor with 595 routes at
last count :) One working theory is that the additional routes pushed
another peer's router over its limit (memory, hardware resource, etc)
and their router started crashing/resetting.
Dylan