In message
http://y3m.net/files/audio/telecom.fault.interview-2005-06-20.mp3
http://y3m.net/files/audio/telecom.fault.weldon-2005-06-20.mp3
Thanks for recording these and making them available. The spin in both appears to be that this was an unforeseeable event, which is difficult to accept. Unlikely, perhaps, but not unforeseeable (imagine one break, then look for the single points of failure that break exposes).

The stock exchange in particular appears either not to have done a particularly thorough single-points-of-failure analysis, or to have deliberately chosen not to eliminate some of them. For instance it appears they have a single transit provider, a single firewall, etc -- the failure of any of which will take them "off the air". The same appears to be true of many other systems, some of which probably cannot afford to eliminate all the single points of failure (eg, EFTPOS in a small business, or small-business DSL), and some of which probably should (eg, aircraft control).

It's also a reminder that however much a provider may claim n-9s reliability for their network, at some level using a single provider for all traffic still represents a single point of failure for your own infrastructure, however unlikely failure might seem. A second, completely independent, provider at least gives you a statistically less correlated set of risks -- although ensuring it really is completely independent can be non-trivial.

Still, given a 99.999% guarantee, and 4.5 hours of downtime (10:48 through 16:18), that means it'll be 4.5 hours / (1/100000) / 24 / 365.25 = 51.3 years until the next outage....
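As a sanity check on that arithmetic, here's a minimal sketch. The 4.5 hours and the 99.999% figure are from the discussion above; the function names are purely illustrative, and the two-provider figure assumes fully independent providers (which, as noted, is hard to guarantee in practice).

```python
# Back-of-envelope availability arithmetic from the post.

def years_between_outages(downtime_hours: float, availability: float) -> float:
    """Elapsed time (in years) over which one outage of the given length
    is consistent with the stated availability guarantee."""
    allowed_fraction = 1.0 - availability        # five nines -> 1/100000
    total_hours = downtime_hours / allowed_fraction
    return total_hours / 24 / 365.25

def combined_unavailability(u1: float, u2: float) -> float:
    """Unavailability of two fully independent providers in parallel:
    you're only off the air when both are down at once."""
    return u1 * u2

print(years_between_outages(4.5, 0.99999))    # ~51.3 years
print(combined_unavailability(1e-5, 1e-5))    # ~1e-10, i.e. roughly ten nines
```

The second function is the "less correlated set of risks" point in miniature: multiplying the two unavailabilities is only valid when the providers share no fibre paths, exchanges, or upstreams, which is exactly the non-trivial part.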
Let's not forget that TelstraClear and many other ISPs in NZ resell Telecom (either wholesale or UBS). TCL has patches of their own network (eg, Chch and Wgtn are two examples). However the last mile home for the majority of NZers is still Telecom.
The "last mile" isn't so much of a problem providing that the single-provider portion really is isolated to the "last mile". If the handoff of each of those "last mile" connections is fairly local to the users, and there's redundant, independent backhaul from there to the other infrastructure, then a "core network" failure in one provider need not necessarily take out all those users. If there's only a single handoff between the provider and the reseller then obviously it's much more vulnerable, as much larger portions of the provider's network must now remain operational for those end users to keep working.

Hopefully a single-point-of-failure analysis will guide a design which allows as much of the network as possible to remain functional even in the face of several failures. Andy's point about multiple network exchanges, in multiple locations, is of course one good mitigating strategy.

Finally it's a reminder that sometimes having the "redundant" part of your network down requires high-priority attention: your systems might still be working, but if you're now only one failure away from a total failure it's not a situation you want to be in for long. (From what I heard, the first break in Telecom's situation occurred yesterday, and was still being worked on today. Not fatal in itself, but had that failure not still been present, the hungry backhoe wouldn't have caused as much chaos.)

Ewen