In message
http://y3m.net/files/audio/telecom.fault.interview-2005-06-20.mp3
http://y3m.net/files/audio/telecom.fault.weldon-2005-06-20.mp3
Thanks for recording these and making them available. The spin in both appears to be that this was an unforeseeable event, which is difficult to accept. Unlikely, perhaps, but not unforeseeable (imagine one break, then look for the single points of failure that break exposes).

The stock exchange in particular appears either not to have done a particularly thorough single-points-of-failure analysis, or to have deliberately chosen not to eliminate some of them. For instance it appears they have a single transit provider, a single firewall, etc -- the failure of any of which will take them "off the air". The same appears to be true of many other systems, some of which probably cannot afford to eliminate all the single points of failure (eg, EFTPOS in a small business, or small-business DSL), and some of which probably should (eg, aircraft control).

It's also a reminder that however much a provider may claim n-9s reliability for their network, at some level using a single provider for all traffic still represents a single point of failure for your own infrastructure, however unlikely failure might seem. A second, completely independent, provider at least gives you a statistically less correlated set of risks -- although ensuring it really is completely independent can be non-trivial.

Still, given a 99.999% guarantee, and 4.5 hours of downtime (10:48 through 16:18), that means it'll be 4.5 hours / (1/100000) / 24 / 365.25 = 51.3 years until the next outage....
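As a sanity check on that arithmetic, here's a minimal sketch. The 4.5 hours and the 99.999% figure are from the discussion above; the function names are purely illustrative, and the two-provider figure assumes fully independent providers (which, as noted, is hard to guarantee in practice).

```python
# Back-of-envelope availability arithmetic from the post.

def years_between_outages(downtime_hours: float, availability: float) -> float:
    """Elapsed time (in years) over which one outage of the given length
    is consistent with the stated availability guarantee."""
    allowed_fraction = 1.0 - availability        # five nines -> 1/100000
    total_hours = downtime_hours / allowed_fraction
    return total_hours / 24 / 365.25

def combined_unavailability(u1: float, u2: float) -> float:
    """Unavailability of two fully independent providers in parallel:
    you're only off the air when both are down at once."""
    return u1 * u2

print(years_between_outages(4.5, 0.99999))    # ~51.3 years
print(combined_unavailability(1e-5, 1e-5))    # ~1e-10, i.e. roughly ten nines
```

The second function is the "less correlated set of risks" point in miniature: multiplying the two unavailabilities is only valid when the providers share no fibre paths, exchanges, or upstreams, which is exactly the non-trivial part.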
Let's not forget that TelstraClear and many other ISPs in NZ resell Telecom (either wholesale or UBS). TCL has patches of their own network (eg, Chch and Wgtn are two examples). However the last mile home for the majority of NZers is still Telecom.
The "last mile" isn't so much of a problem providing that the single-provider portion really is isolated to the "last mile". If the handoff of each of those "last mile" connections is fairly local to the users, and there's redundant, independent backhaul from there to the other infrastructure, then a "core network" failure in one provider need not necessarily take out all those users. If there's only a single handoff between the provider and the reseller then obviously it's much more vulnerable, as much larger portions of the provider's network must now remain operational for those end users to keep working.

Hopefully a single-point-of-failure analysis will guide a design which allows as much of the network as possible to remain functional even in the face of several failures. Andy's point about multiple network exchanges, in multiple locations, is of course one good mitigating strategy.

Finally it's a reminder that sometimes having the "redundant" part of your network down requires high-priority attention: your systems might still be working, but if you're now only one failure away from a total failure it's not a situation you want to be in for long. (From what I heard, the first break in Telecom's situation occurred yesterday, and was still being worked on today. Not fatal in itself, but had that failure not still been present, the hungry backhoe wouldn't have caused as much chaos.)

Ewen