Something appears to have happened to both of the APE route servers shortly after 2pm today. We're investigating at the moment. Dylan
Yes – I can confirm they are down for me also.

192.203.154.1 4 9560 293572 226136 0 0 0 00:12:11 Active
192.203.154.2 4 9560 287522 226683 0 0 0 00:11:42 Active

From: nznog-bounces(a)list.waikato.ac.nz [mailto:nznog-bounces(a)list.waikato.ac.nz] On Behalf Of Dylan Hall
Sent: Tuesday, 16 March 2010 2:23 p.m.
To: nznog(a)list.waikato.ac.nz
Subject: [nznog] APE route servers down

Something appears to have happened to both of the APE route servers shortly after 2pm today. We're investigating at the moment.

Dylan

Craig Spiers | Network Manager
Solarix Networks Limited
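For readers less used to this output: the two rows pasted above look like lines from a BGP summary listing, and "Active" in the last column means the session is not established (the router is still trying to connect). A minimal sketch of reading such rows follows. The column names are an assumption based on the usual summary layout (Neighbor, V, AS, MsgRcvd, MsgSent, TblVer, InQ, OutQ, Up/Down, State/PfxRcd), since the paste carried no header row, and `PeerSummary`, `parse_row` and `is_established` are hypothetical names used only for illustration.

```python
from typing import NamedTuple


class PeerSummary(NamedTuple):
    """One row of a BGP summary listing (column names are assumed, see note above)."""
    neighbor: str
    version: int
    asn: int
    msg_rcvd: int
    msg_sent: int
    tbl_ver: int
    in_q: int
    out_q: int
    up_down: str
    state_or_pfx: str  # a prefix count when established, otherwise a state name


def parse_row(row: str) -> PeerSummary:
    """Split one whitespace-separated summary row into named fields."""
    fields = row.split()
    if len(fields) != 10:
        raise ValueError(f"expected 10 columns, got {len(fields)}: {row!r}")
    numbers = [int(value) for value in fields[1:8]]
    return PeerSummary(fields[0], *numbers, fields[8], fields[9])


def is_established(peer: PeerSummary) -> bool:
    """An established session shows a numeric prefix count in the last column."""
    return peer.state_or_pfx.isdigit()


# The two rows quoted in the message above.
rows = [
    "192.203.154.1 4 9560 293572 226136 0 0 0 00:12:11 Active",
    "192.203.154.2 4 9560 287522 226683 0 0 0 00:11:42 Active",
]
peers = [parse_row(row) for row in rows]
down = [peer.neighbor for peer in peers if not is_established(peer)]
print(down)  # ['192.203.154.1', '192.203.154.2']
```

Both neighbours report a state name rather than a prefix count, which matches the report that both route servers were unreachable.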
One of the route servers (rs2) is functioning again, the other is still having issues[*].

It looks like one of our peers hit us with a full route table and both route servers ran out of memory and crashed. We're looking into getting some larger boxes to host the route servers although this won't happen today.

We won't be processing any requests for routing updates until we have both route servers functioning again. Hopefully this won't take too long.

Dylan

[*] rs1 is unwell because we've tried to install a newer version of quagga which is now refusing to start.

On Tue, 2010-03-16 at 14:22 +1300, Dylan Hall wrote:
Something appears to have happened to both of the APE route servers shortly after 2pm today.
We're investigating at the moment.
Dylan
_______________________________________________ NZNOG mailing list NZNOG(a)list.waikato.ac.nz http://list.waikato.ac.nz/mailman/listinfo/nznog
Both route servers are functioning as usual again. We've rolled back the software upgrade on rs1.

We've spoken to the peer with the full route table and they have rectified the issue at their end.

If anyone requires an incident report for management please contact me off list :)

Dylan

On Tue, 2010-03-16 at 15:43 +1300, Dylan Hall wrote:
One of the route servers (rs2) is functioning again, the other is still having issues[*].
It looks like one of our peers hit us with a full route table and both route servers ran out of memory and crashed.
We're looking into getting some larger boxes to host the route servers although this won't happen today.
We won't be processing any requests for routing updates until we have both route servers functioning again. Hopefully this won't take too long.
Dylan
[*] rs1 is unwell because we've tried to install a newer version of quagga which is now refusing to start.
We've now added a "maximum-prefix 4096" to every peer on the route servers. Hopefully this should prevent a re-occurrence before we are able to upgrade the route server hardware.

Dylan

On Tue, 2010-03-16 at 16:21 +1300, Dylan Hall wrote:
Both route servers are functioning as usual again.
We've rolled back the software upgrade on rs1.
We've spoken to the peer with the full route table and they have rectified the issue at their end.
If anyone requires an incident report for management please contact me off list :)
Dylan
I assume this problem exists because of soft inbound updates right? Does the max prefix thing hit before, or after that?

Actually, are soft inbound updates necessary? - I was under the understanding that when the prefix lists were updated the config files were generated and the daemons restarted (not at the same time on both RSes). If the daemons are restarted then BGP sessions are resetting anyway - soft inbound has no impact.

On 16/03/2010, at 4:53 PM, Dylan Hall wrote:
We've now added a "maximum-prefix 4096" to every peer on the route servers.
Hopefully this should prevent a re-occurrence before we are able to upgrade the route server hardware.
Dylan
On Tue, 2010-03-16 at 17:06 +1300, Nathan Ward wrote:
I assume this problem exists because of soft inbound updates right? Does the max prefix thing hit before, or after that?
Actually, are soft inbound updates necessary? - I was under the understanding that when the prefix lists were updated the config files were generated and the daemons restarted (not at the same time on both RSes). If the daemons are restarted then BGP sessions are resetting anyway - soft inbound has no impact.
We assume the problem is due to the "soft-reconfiguration inbound". The filters on individual peers would prevent the majority of a full route table from being accepted into the actual route table.

The Quagga documentation on the max-prefix stuff is rather sparse so short of digging through the source code it's not clear if it will protect the bgpd process from memory exhaustion. There are also multiple options which I think boil down to:

1. stop accepting routes when the limit is reached.
2. drop the bgp session when the limit is reached.

We've opted for option 1.

We've had a bit of a debate around the use of "soft-reconfiguration inbound" and tend to agree with your comments. It's not required for the operation of the route servers. It does however provide useful debug info.

Dylan
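The question of whether the prefix limit is counted before or after filtering matters for memory use, and a toy model may make that concrete. The sketch below is not Quagga's bgpd and makes no claim about how it actually counts; it only assumes that "soft-reconfiguration inbound" keeps an unfiltered copy of what a peer announces, and it models option 1 ("stop accepting routes") by ignoring further announcements. The function `feed`, its parameters, and the 20,000-prefix stand-in for a full table are all hypothetical.

```python
from typing import Iterable, Optional, Set, Tuple


def feed(
    prefixes: Iterable[str],
    allowed: Set[str],
    max_prefix: Optional[int] = None,
    limit_before_filter: bool = True,
    soft_reconfig_inbound: bool = True,
) -> Tuple[int, int]:
    """Return (stored_unfiltered, accepted) after one peer announces `prefixes`."""
    seen = 0      # announcements processed from the peer
    stored = 0    # unfiltered copies kept for soft reconfiguration
    accepted = 0  # routes that passed the inbound filter
    for prefix in prefixes:
        if max_prefix is not None:
            # The open question in the thread: which counter does the limit watch?
            counted = seen if limit_before_filter else accepted
            if counted >= max_prefix:
                break  # toy version of "stop accepting routes"
        seen += 1
        if soft_reconfig_inbound:
            stored += 1
        if prefix in allowed:
            accepted += 1
    return stored, accepted


# A peer that should announce 10 prefixes sends 20,000 instead.
full_table = [f"10.{i // 256}.{i % 256}.0/24" for i in range(20000)]
allowed = set(full_table[:10])

print(feed(full_table, allowed))                                   # no limit
print(feed(full_table, allowed, soft_reconfig_inbound=False))      # no stored copy
print(feed(full_table, allowed, max_prefix=4096))                  # limit before filter
print(feed(full_table, allowed, max_prefix=4096,
           limit_before_filter=False))                             # limit after filter
```

In this model the filter always keeps the accepted count at 10, but the stored copy only stays bounded when the limit is counted before filtering or when the unfiltered copy is not kept at all. Which of these matches the real daemon is exactly what the thread leaves open.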
On 16 Mar 2010, at 02:43, Dylan Hall wrote:
One of the route servers (rs2) is functioning again, the other is still having issues[*]. It looks like one of our peers hit us with a full route table and both route servers ran out of memory and crashed. We're looking into getting some larger boxes to host the route servers although this won't happen today. We won't be processing any requests for routing updates until we have both route servers functioning again. Hopefully this won't take too long.
Hi,

Is it normal for networks in NZ to do bilateral sessions, in addition to the MLP route-servers?

Andy
On Wed, 2010-04-28 at 15:34 +0100, Andy Davidson wrote:
Hi,
Is it normal for networks in NZ to do bilateral sessions, in addition to the MLP route-servers ?
Andy
We're not privy to all the sessions on the APE/WIX but anecdotal evidence suggests they are quite common. Some service providers sell transit services over APE/WIX so use a "private" bgp session to announce a more complete route table.

I also believe it's common for certain big telcos and research networks that don't peer with the route servers to peer on a case-by-case basis, usually where they have some kind of commercial arrangement in place.

Dylan
participants (4)
- Andy Davidson
- Craig Spiers
- Dylan Hall
- Nathan Ward