Re: [nznog] Vector, did you try turning it off and then on again
That's the best reply so far.
We have just proposed triple redundancy to a customer: fibre, wireless and satellite. The ongoing costs aren't too bad; it's the setup costs that are the killer. 10 Mbps satellite gear isn't cheap. If you've got the money, it's the way to go.
....
Bill Walker
Sent from my phone.
-----Original Message-----
From: "John Russell"
To: "NZNOG"
Sent: 1/07/2008 11:41 p.m.
Subject: Re: [nznog] Vector, did you try turning it off and then on again
[Words]
A 9 hour outage seems long, but it's hardly absurd. Though we haven't seen a cause report from Vector yet, if it was spade fade, 9 hours is pretty good.

I wasn't on call when it broke, but my colleague who was tells me he was told "Hardware Failure" was the problem. If something like a 6509 plane died, I can see how you might hit 9 hours:

The thing faults, alarms go off, customers call, the on-duty tech at Vector has to do some basic diagnosis, wave his hands in the air and run around like a monkey for a bit, and call the on-call engineer. Then that engineer has to wake up and/or finish his beer and get home from the pub, then do some more diagnostics, see the dead module and then do that thing where you put your hands flat on your forehead and drag them down your face while going "Gahhhhh". Then he gets to go pull a spare from stores, or call Cisco for a part, and then THEY get to kick off THEIR internal process to get the thing to you. Then there's truck time to the site, waiting for an elevator to L48 of the Sky Tower that isn't already full of tourists in orange jumpsuits ready to jump off and/or carts full of crab canapés, swapping out the dead unit, sanity-checking the restored services and making config changes if required, etc. And all that is just if everything does go to plan, and you don't find out that your spare hardware is in the lab, and the lab is locked and the guy who has the key has gone fishing, or that Cisco have already given their only spare WS-X6516 to someone else, and so on and so on.

So, anyway, thing is: 9 hours, sure.

However, it really shouldn't matter that much. The ISP network I am currently fussing over, for example, has Vector connectivity to the Sky Tower, which carries APE and some other stuff. This all broke when Vector went down. Domestic traffic, however, simply switched over to other peering links, as it should, because it is, you know, The Internet. Some Vector-only stuff broke, of course, but core services just failed over and carried on.

If your network connection is absolutely critical for your business, and it's wholly dependent on one vendor, you should perhaps rethink your approach. Talk to your ISP, explain that you need redundancy in your connection. Chuck in something like a DSL connection beside that Vector link. Ask your ISP about sourcing you a router that can connect to both, and setting up BGP or MPLS-based failover to your secondary link in the event that your Vector link fails. Now your connection is vendor-independent (if not ISP-independent: if they fail, you fail. Try to pick an ISP that Does Not Fail Much). If you can't get DSL, ask your ISP about wireless or IPStar or something similar. If you can fail from optical to satellite, that's pretty good diversity. And it's not complicated to do. You may not get the same performance, of course, but you'll have _a_ connection, which is better than _no_ connection, especially if it's only for a short time.

This setup won't, of course, solve the problem of a local power cut killing your Vector link. I'm not even certain why we're discussing that. Talk to Vector and your electrician to get a cable run from your generator and/or UPS-backed distribution board to wherever your building's Vector switch is. Plug the switch into it. And you're done. It's a one-off cost, and likely not a large one. Even without a UPS, the worst that can happen is that the switch powers down when the cut takes place, then boots back up when your generator starts. A few minutes, tops.
If you _don't have_ a generator, and your building power is out, I guess you'll be sitting in the dark, looking at your blank screen, and won't care if your internet connection is down. This is a perfect opportunity to go to the pub, and have a drink with the Vector engineer. Assuming he's awake.

JSR
--
John S Russell
Big Geek. Doing Geek Stuff.
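For illustration, a minimal IOS-style sketch of the dual-uplink BGP failover John describes above, assuming one router with a primary (fibre) upstream and a backup (DSL or wireless) upstream; the AS numbers and addresses are placeholders and not anything from this thread:

  ! Hypothetical dual-uplink failover sketch; ASNs and addresses are made up.
  ! Prefer the fibre upstream; only use the backup when the primary's routes disappear.
  router bgp 64512
   neighbor 192.0.2.1 remote-as 64496        ! primary upstream, over the fibre link
   neighbor 198.51.100.1 remote-as 64511     ! backup upstream, over the DSL/wireless link
   neighbor 198.51.100.1 route-map BACKUP-IN in
  !
  route-map BACKUP-IN permit 10
   set local-preference 50                   ! below the default of 100, so routes learned
                                             ! via the backup lose unless the primary is dead

If BGP isn't on offer, the same effect can be approximated with a static default route to the primary plus a floating static (higher administrative distance) pointing at the backup link.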
From: Chris Hodgetts

(Hope I am not breaking any confidentiality here, but nothing on the paperwork I saw flash past my desk said it was confidential.)

--
At 12:33 PM on Sunday 29th June 2008 a Cisco core Ethernet switch, located at our Hobson Street node, lost power and rebooted. Upon rebooting the switch lost its configuration and raised an alarm. This alarm was mistakenly interpreted as being a repeat of an earlier SDH alarm (initially received at 11:27 AM) and as a result no action was taken by our NOC until approximately 5:30 PM.

Customer calls received by our customer call centre were also incorrectly assumed to be related to the above SDH alarms (based on advice from our NOC) and customers were incorrectly advised accordingly.

At approximately 5:30 PM our NOC became aware of the true nature of the outage. At 5:46 PM an engineer was dispatched to reload the configuration on the Cisco switch. This was completed and all services were restored by 9:26 PM.
--

I do not like how telcos think they know better than their customers. The on-call tech at Citylink I spoke to on Sunday actually checked whether APE was alive when customers queried its status (and, as he indicated, many others did the same), because it was connectivity to APE that was impaired. Vector, on the other hand, was arrogant: one would assume that when customers start calling the call centre saying "hey, there is stuff broken", even though an alarm had already been raised, SOMEONE would have gone "OMG, we suddenly have a large increase in calls with the same issue... perhaps something is wrong"... but no, telco knows best.
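An aside on the "lost its configuration" detail: on Cisco IOS gear the running configuration lives in RAM and does not survive a power cycle unless it has been written to the startup configuration in NVRAM, so a switch that comes back blank usually means the config was never saved (or NVRAM is faulty). A minimal, generic IOS sketch, not Vector's actual procedure:

  ! Generic IOS commands, for illustration only.
  copy running-config startup-config         ! save the live config so it survives a reboot
  show startup-config | include hostname     ! quick sanity check that something was written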
*scratches his head* Core switch... lost power... don't people usually put things like that on a UPS? If it is on a UPS, then most likely someone bumped the cable and suddenly things stopped working... an hour after the first fault?

Someone's NOC team needs to find some software to monitor when their equipment goes down *cough*Nagios-for-example*cough*.

Philip Seccombe
From: Jasper Bryant-Greene

Monitoring isn't going to help much if they ignore/misinterpret the alarm, even after customers start complaining...
On 2/07/2008, at 9:53 PM, Jasper Bryant-Greene wrote:
Monitoring isn't going to help much if they ignore/misinterpret the alarm, even after customers start complaining...
Well, it does if correctly configured. A well configured monitoring system has some intelligence about the dependencies between network components, to limit the amount of noise generated when a big outage occurs, so that situations exactly like this one do not happen.

I build these sorts of things for my customers a bit, and while they can sometimes be complicated to set up initially, they get over it pretty quickly when at 4 AM they get one or two messages about a radio or a fibre or something being down and know they need to get in their car, instead of 50 messages about all manner of rubbish, having to take a cold shower to wake themselves up, and then trying to figure out what's wrong once their head is clear.

There is, of course, no substitute for well trained network monitoring/management centre staff.

From reading that bit of info, squinting a bit with my head on its side, this looks to me like a classic case of someone relying solely on alarms (i.e. SNMP traps) instead of periodic testing. That sort of stuff is old telco think.

--
Nathan Ward
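For what it's worth, a minimal sketch of the sort of thing Nathan describes, in Nagios object-config syntax; the host names and addresses are invented for illustration. The parent/child relationship suppresses downstream noise when the upstream device is the real problem, and the actively scheduled ping check means failures are found by polling rather than by waiting for a trap:

  # Hypothetical Nagios objects; names and addresses are placeholders.
  define host {
      use        generic-host
      host_name  core-switch             ; the upstream device
      address    192.0.2.10
  }

  define host {
      use        generic-host
      host_name  customer-edge
      address    192.0.2.20
      parents    core-switch             ; when core-switch is DOWN, customer-edge is flagged
                                         ; UNREACHABLE rather than as a separate outage
  }

  define service {
      use                  generic-service
      host_name            core-switch
      service_description  PING
      check_command        check_ping!100.0,20%!500.0,60%
      check_interval       1             ; actively poll every minute; don't rely on traps alone
  }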
From: Chris Hodgetts

Bear in mind, though, that Vector provides our electricity as well, and perhaps they were trying to save power in this time of energy conservation. Perhaps my previous comments about telcos should really have been about utilities... apologies.
participants (5)
- Bill Walker
- Chris Hodgetts
- Jasper Bryant-Greene
- Nathan Ward
- Philip Seccombe