An explanation of what went wrong at Mayoral Drive on 4/11
ComputerWorld article: http://s0.tx.co.nz/at/tep34n279457j138688i194855f2c4680224a4t9s4z

During testing, a UPS failed to start a generator, its batteries then went flat, and the systems it was feeding had to be restarted. Fire and Police were hit hard because they'd only very recently consolidated to a single database from three region-based ones and hadn't yet had a DR drill (one was planned), so this was the first time they'd actually done a cut-over. That sod Murphy strikes again!

-- Matthew Poole "Don't use force. Get a bigger hammer."
On Wed, 2007-11-21 at 09:38 +1300, Matthew Poole wrote:
ComputerWorld article. http://s0.tx.co.nz/at/tep34n279457j138688i194855f2c4680224a4t9s4z
"However, the switch failed to connect to the generator and the systems ran down the batteries before the failure was noticed." Wow. That is kind of embarrasing. I would have thought that during routine tests you make sure that what you're testing actually works. It must be time to revise that particular test procedure... Cheers! -- Andrew Ruthven, Wellington, New Zealand At home: andrew(a)etc.gen.nz | This space intentionally | left blank.
On 21/11/2007, at 9:54 AM, Andrew Ruthven wrote:
On Wed, 2007-11-21 at 09:38 +1300, Matthew Poole wrote:
ComputerWorld article. http://s0.tx.co.nz/at/tep34n279457j138688i194855f2c4680224a4t9s4z
"However, the switch failed to connect to the generator and the systems ran down the batteries before the failure was noticed."
Wow. That is kind of embarrassing. I would have thought that during routine tests you make sure that what you're testing actually works. It must be time to revise that particular test procedure...
Don't worry, it's not a people or procedure problem:

"When Computerworld spoke to Telecom last week, the failure had been traced to a component in the UPS switch, which had been replaced. Investigations continue into what made the component fail and how to avoid a recurrence of the problem, said the spokeswoman."

Clearly the reason power went out is a component failure.

-- Nathan Ward
On Wed, 21 Nov 2007, Nathan Ward wrote:
Don't worry, it's not a people or procedure problem: *SNIP* Clearly the reason power went out is a component failure.
I can't tell if you're being sarcastic or not. Maybe I'm just tired and my sarcasm-ometer isn't pegging properly?

-- Matthew Poole "Don't use force. Get a bigger hammer."
On 21/11/2007, at 10:09 AM, Matthew Poole wrote:
On Wed, 21 Nov 2007, Nathan Ward wrote:
Don't worry, it's not a people or procedure problem: *SNIP* Clearly the reason power went out is a component failure.
I can't tell if you're being sarcastic or not. Maybe I'm just tired and my sarcasm-ometer isn't pegging properly?
A friend of mine and I decided a while ago that we'd use \s to signal sarcasm/lack of seriousness in text-based communication - almost an opposite of the seriousness/emphasis that /s brings. Let's assume my comments were full of \s :-)

I'm also laughing (with /s) about the fancy new database that hadn't had its DR procedures tested yet. Seems like something that'd be a good idea to test before going live. I'm constantly surprised at the lack of understanding/implementation of reliable architectures and procedures outside the Internet industry. Perhaps we should all get into ITS for 5 years or so to sort out all this cruft.

-- Nathan Ward
On Wed, 21 Nov 2007, Nathan Ward wrote:
I'm also laughing (with /s) about the fancy new database hadn't had
Not really "new", just centralised. Still i/CAD, still all the same information, but now based in AKL rather than instances in AKL, WLG and CHC. But, yes, consolidation like that is a pretty significant change.
its DR procedures tested yet. Seems like something that'd be a good idea to test before going live. I'm constantly surprised at the lack
Yes, it would. However, to give them their due, they deliberately operate in a "system degraded" mode on a monthly basis to ensure that their DR procedures work properly. So while it was a bit silly not to have tested the DR before making the consolidated system live, it wasn't a show-stopper, because they knew that in the event of a failure they could continue to operate - that may have been part of the reason for not testing before going live, silly as it was in hindsight.
of understanding/implementation of reliable architectures and procedures outside the Internet industry. Perhaps we should all get in to ITS for 5 years or so to sort out all this cruft.
DR/BC pays big money if you're halfway decent. Unfortunately most companies don't actually want to fork out for "halfway decent" and instead prefer "lowest bidder minus one" (aka Bob from IT).

-- Matthew Poole "Don't use force. Get a bigger hammer."
However, the switch failed to connect to the generator and the systems ran down the batteries before the failure was noticed.
Is it just me... or does this sentence lead one to believe that mayhaps the switch in question was not actually plugged into the generator circuit, only the UPS? Just because I have, erm, seen similar things in my time.

-JoelW
On 11/21/07, Joel Wiramu Pauling
However, the switch failed to connect to the generator and the systems ran down the batteries before the failure was noticed.
Is it just me... or does this sentence lead one to believe that mayhaps the switch in question was not actually plugged into the generator circuit, only the UPS?
Generally there will be a switch which controls whether the genset is feeding power to the UPS or not. If you're just testing that the generator is working you'd probably have that switch turned off, whilst in normal operation you'd have it turned on.

Some time ago I was involved in a blackout in a building we had just moved into, where this switch was in the wrong position, and as a result we saw a similar case to what happened here (UPS working fine, generator working fine, but one not feeding the other) - the difference being that we managed to get to the switch in time and enable it (kudos to the guy who ran up about 19 flights of stairs to do it!).

Regardless of whether the switch failed, was in the wrong position, or anything else, I can't see how this is anyone's fault other than Telecom's - you don't run a UPS/generator test and not actually monitor that the UPS and generator are functioning correctly. They had around 30 minutes to detect the problem (based on the quoted life of the batteries), which should have been plenty to either fix the problem (if it was just a switch in the wrong state) or to back out the test.

Scott.
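[Editor's note: Scott's point about actively watching the test suggests a simple watchdog: if the UPS is still on battery a few minutes into a changeover test, abort and restore mains. The following is a minimal sketch only - the on_battery() probe is a hypothetical placeholder, and the five-minute threshold is illustrative, chosen to sit well inside the quoted ~30-minute battery life; it is not Telecom's actual procedure.]

import time

# Hypothetical probe - in reality you would poll the UPS over SNMP or a vendor API.
# Returns True while the UPS reports it is running on battery
# (i.e. neither mains nor the generator is feeding it).
def on_battery() -> bool:
    return False  # placeholder so the sketch runs; wire up real monitoring here

def generator_test_watchdog(abort_after_s=300, poll_every_s=30):
    """Abort a generator changeover test if the UPS is still on battery
    after abort_after_s seconds - well inside a ~30 minute battery life."""
    started = time.monotonic()
    while on_battery():
        elapsed = time.monotonic() - started
        if elapsed >= abort_after_s:
            print(f"UPS still on battery after {elapsed:.0f}s - aborting test, restoring mains")
            return  # back out here: restore the mains supply / transfer switch
        time.sleep(poll_every_s)
    print("Generator picked up the load - test proceeding normally")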
They weren't monitoring the UPS? Why didn't they get a trap or alert that the UPS didn't have an incoming feed - mains or generator? Most modern systems, including large ones (by NZ standards), have SNMP traps etc. to warn of impending doom.
_____
From: Scott Howard [mailto:scott(a)doc.net.au]
Sent: Wednesday, 21 November 2007 12:47
To: Joel Wiramu Pauling
Cc: NZNOG
Subject: Re: [nznog] An explanation of what went wrong at Mayoral Drive on 4/11
*SNIP*
I find all this incredible...

Common theme to this in summary:

Why isn't the UPS monitored? (assuming it isn't)

Why didn't they abort their testing, back out, and restore power when things started going pear-shaped?

It's a no-brainer that the UPS battery has a finite life span which (in this case) is designed to carry the load between the power outage and when the generator goes online, and 30 minutes is a very generous life span. After five minutes, they should have aborted the test, backed out, and then investigated why it failed.

But, hey, the spokeswoman from Telecom is telling the story...

Oh yeah, the proper generator test is to cut the mains supply to the essential bus to make sure the generator starts up and picks up the load.

On Wed, 2007-11-21 at 18:32 +1300, Russell Sharpe wrote:
They weren't monitoring the UPS? Why didn't they get a trap or alert that the UPS didn't have an incoming feed - mains or generator? Most modern systems, including large ones (by NZ standards), have SNMP traps etc. to warn of impending doom.
______________________________________________________________________
From: Scott Howard [mailto:scott(a)doc.net.au]
Sent: Wednesday, 21 November 2007 12:47
To: Joel Wiramu Pauling
Cc: NZNOG
Subject: Re: [nznog] An explanation of what went wrong at Mayoral Drive on 4/11
*SNIP*
I have worked in the past for said Telco, albeit 10+ years ago, and for another in the not so distant past... I currently work for a major NZ data centre provider...

When you are testing DR power and other system scenarios as part of your "Annual Maintenance Plan" (in our case, "Monthly Critical Systems tests"), you have people on hand to eyeball the equipment and alert to potential issues...

IMHO, this type of incident reinforces my feeling that some providers are only interested in profits, not customer service :( and that there is a gross lack of investment.
_____
From: Lindsay Druett [mailto:lindsay(a)wired.net.nz]
Sent: Wednesday, 21 November 2007 19:01
To: Russell Sharpe; 'NZNOG'
Subject: Re: [nznog] An explanation of what went wrong at Mayoral Drive on 4/11
*SNIP*
Please can we have this conversation in plain text?
On Wed, 2007-11-21 at 18:32 +1300, Russell Sharpe wrote:
They weren't monitoring the UPS? Why didn't they get a trap or alert that the UPS didn't have an incoming feed - mains or generator? Most modern systems, including large ones (by NZ standards), have SNMP traps etc. to warn of impending doom.
I would expect the alerts were shushed so that there wouldn't be a pager storm during an expected maintenance window. But a pair of eyes should have noticed the problem...

Cheers!
-- Andrew Ruthven, Wellington, New Zealand
At home: andrew(a)etc.gen.nz | This space intentionally | left blank.
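[Editor's note: Andrew's point is worth making concrete - a maintenance window should only suppress the alarms the test is expected to raise, never the ones saying the batteries are running out. A small sketch of that filter follows; the alarm names and the window flag are made up for illustration, not taken from any real monitoring system.]

# Alarms the planned changeover test is expected to raise.
EXPECTED_DURING_TEST = {"mains-lost", "running-on-generator"}
# Alarms that must always page, maintenance window or not.
ALWAYS_PAGE = {"ups-on-battery", "ups-battery-low", "ups-output-off"}

def should_page(alarm, in_maintenance_window):
    """Suppress only the alarms the planned test is supposed to cause."""
    if alarm in ALWAYS_PAGE:
        return True
    if in_maintenance_window and alarm in EXPECTED_DURING_TEST:
        return False
    return True

# During the 4/11 test this would still have paged someone:
print(should_page("ups-battery-low", in_maintenance_window=True))  # -> True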
Duh - the pager system failed because the power was off to Mayoral Drive ;-)

-----Original Message-----
From: Andrew Ruthven [mailto:andrew(a)etc.gen.nz]
Sent: Wednesday, 21 November 2007 8:47 p.m.
To: 'NZNOG'
Subject: Re: [nznog] An explanation of what went wrong at Mayoral Drive on 4/11

On Wed, 2007-11-21 at 18:32 +1300, Russell Sharpe wrote:
They weren't monitoring the UPS? Why didn't they get a trap or alert that the UPS didn't have an incoming feed - mains or generator? Most modern systems, including large ones (by NZ standards), have SNMP traps etc. to warn of impending doom.
I would expect the alerts were shushed so that there wouldn't be a pager storm during an expected maintenance window. But a pair of eyes should have noticed the problem...

Cheers!
-- Andrew Ruthven, Wellington, New Zealand
At home: andrew(a)etc.gen.nz | This space intentionally | left blank.
On Wed, 2007-11-21 at 10:04 +1300, Nathan Ward wrote:
On 21/11/2007, at 9:54 AM, Andrew Ruthven wrote:
On Wed, 2007-11-21 at 09:38 +1300, Matthew Poole wrote:
ComputerWorld article. http://s0.tx.co.nz/at/tep34n279457j138688i194855f2c4680224a4t9s4z
"However, the switch failed to connect to the generator and the systems ran down the batteries before the failure was noticed."
Wow. That is kind of embarrassing. I would have thought that during routine tests you make sure that what you're testing actually works. It must be time to revise that particular test procedure...
Don't worry, it's not a people or procedure problem: "When Computerworld spoke to Telecom last week, the failure had been traced to a component in the UPS switch, which had been replaced. Investigations continue into what made the component fail and how to avoid a recurrence of the problem, said the spokeswoman."
Clearly the reason power went out is a component failure.
But it was only noticed *after* the UPS batteries went flat? Oh well...

-- Andrew Ruthven, Wellington, New Zealand
At home: andrew(a)etc.gen.nz | This space intentionally | left blank.
participants (9)
- Andrew Ruthven
- Jeremy Strachan
- Joel Wiramu Pauling
- Lindsay Druett
- Martin Kealey
- Matthew Poole
- Nathan Ward
- Russell Sharpe
- Scott Howard