On Tue, 2010-03-16 at 17:06 +1300, Nathan Ward wrote:
I assume this problem exists because of soft inbound updates, right? Does the max prefix thing hit before, or after that?

Actually, are soft inbound updates necessary? My understanding was that when the prefix lists are updated, the config files are generated and the daemons restarted (not at the same time on both RSes). If the daemons are restarted then the BGP sessions are reset anyway, so soft inbound has no impact.


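We assume the problem is due to "soft-reconfiguration inbound". The filters on individual peers would prevent the majority of a full route table from being accepted into the actual route table, but with soft-reconfiguration inbound enabled bgpd also keeps an unfiltered copy of everything each peer sends.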
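
To illustrate why the filters alone would not help here, below is a minimal sketch in Python. It is a toy model, not Quagga code: the PeerSession class, its field names and the simulate() helper are all made up for illustration. The only assumption it encodes is that soft-reconfiguration inbound keeps a copy of every received route before the inbound filter runs.

```python
from dataclasses import dataclass, field


@dataclass
class PeerSession:
    """Toy model of one BGP peer on a route server (hypothetical, not Quagga)."""
    allowed_prefixes: set
    soft_reconfig_inbound: bool = True
    adj_rib_in: list = field(default_factory=list)  # unfiltered copy of received routes
    loc_rib: list = field(default_factory=list)     # routes that passed the inbound filter

    def receive(self, prefix):
        # With soft-reconfiguration inbound, the route is stored before filtering,
        # so this list grows with everything the peer sends.
        if self.soft_reconfig_inbound:
            self.adj_rib_in.append(prefix)
        # The per-peer filter only decides what reaches the actual route table.
        if prefix in self.allowed_prefixes:
            self.loc_rib.append(prefix)


def simulate(route_count, soft_reconfig_inbound):
    """Feed route_count synthetic routes to a peer whose filter allows only two."""
    peer = PeerSession(
        allowed_prefixes={"route-0", "route-1"},
        soft_reconfig_inbound=soft_reconfig_inbound,
    )
    for i in range(route_count):
        peer.receive(f"route-{i}")
    return len(peer.adj_rib_in), len(peer.loc_rib)
```

In this model simulate(1000, True) returns (1000, 2) while simulate(1000, False) returns (0, 2): the filtered table is the same size either way, and only the stored copy grows with what the peer sends.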

The Quagga documentation on the max-prefix stuff is rather sparse, so short of digging through the source code it's not clear whether it will protect the bgpd process from memory exhaustion. There are also multiple options, which I think boil down to:

1. stop accepting routes when the limit is reached.
2. drop the BGP session when the limit is reached.

We've opted for option 1.
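
As a rough sketch of the difference between the two options as described above (again a toy Python model; the function name and the policy strings are hypothetical and say nothing about how Quagga actually implements either behaviour):

```python
def apply_max_prefix(announcements, limit, policy):
    """Return (accepted_routes, session_up) for one peer under a prefix limit.

    policy is a hypothetical label for the two options in the list above:
      "stop-accepting" -> option 1: ignore routes beyond the limit, keep the session
      "drop-session"   -> option 2: tear the session down once the limit is exceeded
    """
    if policy not in ("stop-accepting", "drop-session"):
        raise ValueError(f"unknown policy: {policy}")

    accepted = []
    for prefix in announcements:
        if len(accepted) >= limit:
            if policy == "drop-session":
                # Session goes down, so every route learned from this peer is withdrawn.
                return [], False
            # Option 1: silently ignore anything beyond the limit.
            continue
        accepted.append(prefix)
    return accepted, True
```

With a limit of 4096 and a peer announcing 300000 routes, option 1 in this model leaves the first 4096 routes in place with the session up, while option 2 leaves no routes and the session down. A peer announcing exactly 4096 routes is unaffected under either policy. Note that this says nothing about memory: whether the limit is checked before or after the unfiltered copy is stored is exactly the part that isn't clear from the documentation.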

We've had a bit of a debate around the use of "soft-reconfiguration inbound" and tend to agree with your comments. It's not required for the operation of the route servers. It does, however, provide useful debug info.


Dylan



On 16/03/2010, at 4:53 PM, Dylan Hall wrote:

> We've now added a "maximum-prefix 4096" to every peer on the route servers.
> 
> Hopefully this will prevent a recurrence before we are able to upgrade the route server hardware.
> 
> Dylan
> 
> 
> On Tue, 2010-03-16 at 16:21 +1300, Dylan Hall wrote:
>> Both route servers are functioning as usual again. 
>> 
>> We've rolled back the software upgrade on rs1. 
>> 
>> We've spoken to the peer with the full route table and they have rectified the issue at their end.
>> 
>> If anyone requires an incident report for management please contact me off list :)
>> 
>> 
>> Dylan
>> 
>> 
>> On Tue, 2010-03-16 at 15:43 +1300, Dylan Hall wrote:
>>> One of the route servers (rs2) is functioning again; the other is still having issues[*].
>>> 
>>> It looks like one of our peers hit us with a full route table and both route servers ran out of memory and crashed.
>>> 
>>> We're looking into getting some larger boxes to host the route servers although this won't happen today.
>>> 
>>> We won't be processing any requests for routing updates until we have both route servers functioning again. Hopefully this won't take too long.
>>> 
>>> 
>>> Dylan
>>> 
>>> [*] rs1 is unwell because we tried to install a newer version of Quagga, which is now refusing to start.
>>> 
>>> 
>>> 
>>> On Tue, 2010-03-16 at 14:22 +1300, Dylan Hall wrote:
>>>> Something appears to have happened to both of the APE route servers shortly after 2pm today. 
>>>> 
>>>> We're investigating at the moment.
>>>> 
>>>> Dylan
>>>> 
>>>> _______________________________________________
>>>> NZNOG mailing list
>>>> 
>>>> NZNOG@list.waikato.ac.nz
>>>> http://list.waikato.ac.nz/mailman/listinfo/nznog
>>> 
>> 
> 
