But consider the “bad apple”
example: if one system crashes for
24 hours and all others work without
interruption for the month, the
24-hour outage can be averaged
across a large base of installed
units. When the number of installed
units is large enough, the fleet-wide
average still falls within
the required five nines of uptime.
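
The averaging effect described above can be checked with a little arithmetic. The following is a minimal sketch, assuming a 30-day month (720 hours) and that exactly one unit is down for the whole outage while every other unit runs without interruption. The function names are hypothetical and chosen only for illustration.

```python
from fractions import Fraction
import math

# Five nines expressed exactly, to avoid floating-point rounding.
FIVE_NINES = Fraction(99999, 100000)
# Assumption: a 30-day month.
HOURS_PER_MONTH = 30 * 24


def fleet_availability(units, outage_hours, period_hours=HOURS_PER_MONTH):
    """Availability averaged over all installed units for one period."""
    total_unit_hours = units * period_hours
    return 1 - Fraction(outage_hours, total_unit_hours)


def min_units_to_hide_outage(outage_hours, target=FIVE_NINES,
                             period_hours=HOURS_PER_MONTH):
    """Smallest fleet size whose average still meets the target."""
    allowed_downtime_fraction = 1 - target
    return math.ceil(Fraction(outage_hours, period_hours)
                     / allowed_downtime_fraction)
```

Under these assumptions, `min_units_to_hide_outage(24)` gives 3334: with that many installed units the fleet-wide average still reads as five nines, even though the failed unit itself delivered only about 96.7% availability (29/30) for the month.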
More specifically, if a service
provider installs a product described
as “carrier grade” or as
providing five-nines availability,
should the provider expect
every single network element to
meet that reliability standard?
Or should the provider expect
that some of the elements may
perform at a much degraded level,
and that carrier grade is measured
only through the worldwide “law
of large numbers”?
You see, it isn’t just that
one bad apple in a group of network
elements can skew the overall numbers – there
are actual customers and real traffic
affected by that “bad apple” element.
Far beyond any theoretical measure,
the effect on customers might be
quite severe: certainly
outside the guarantees of any customer
SLA, almost certainly triggering
a penalty payment from the carrier,
and likely attracting a fine from
the regulator as well. As we shall
see later, the problem of “one
versus many” is being addressed
by several groups.
The bottom line is that units
will fail, so five-nines hardware
availability is actually a design
game of building systems which
are always covered, i.e. they are
redundant. This is the next telecom
rule-of-thumb: no “single
point of failure” (SPOF)
shall exist.
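
To illustrate why redundancy is the route to five nines, here is a rough sketch of the standard parallel-availability calculation. It assumes that units fail independently and that failover between them is instantaneous and perfect; real systems only approximate both. The function names are hypothetical.

```python
from fractions import Fraction

FIVE_NINES = Fraction(99999, 100000)


def parallel_availability(unit_availability, n):
    """Availability of n redundant units when any one can carry the service.

    Assumes independent failures: the service is down only when
    all n units are down at the same time.
    """
    return 1 - (1 - unit_availability) ** n


def units_for_target(unit_availability, target=FIVE_NINES):
    """Smallest number of redundant units that meets the target."""
    if not 0 < unit_availability <= 1:
        raise ValueError("unit availability must be in (0, 1]")
    n = 1
    while parallel_availability(unit_availability, n) < target:
        n += 1
    return n
```

Under these assumptions, two units that are each only 99% available combine to 99.99%, and a third brings the combination past five nines. No single unit has to meet the target on its own, which is the point of removing single points of failure.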
Supplying Redundancy
Redundancy is the addition
of information, resources, or time
beyond what is needed for normal
system operation. Both hardware
and software can be made redundant.
Hardware redundancy is
the addition of extra hardware,
usually as backups or failovers
or for tolerating faults. Software
redundancy is
the addition of extra software,
beyond the baseline of feature
implementation, that is used to
detect and react to faults. Information
redundancy is
the addition of extra information
beyond that required to implement
a given function and includes configuration
and data duplication, replication,
and backup databases. Hardware
and software work together using
information to provide redundancy,
usually by having software monitor
for abnormalities and initiate
the configuration changes necessary
to switch service to backup hardware,
servers, and standby software programs.
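
The monitor-and-switch behaviour described above might look something like the following sketch. It is a toy model, not any vendor's implementation: the `Unit` and `RedundantService` classes and the `healthy` flag are hypothetical stand-ins for real hardware, health checks, and configuration changes.

```python
from dataclasses import dataclass, field


@dataclass
class Unit:
    """A piece of hardware or a server; 'healthy' stands in for a real health check."""
    name: str
    healthy: bool = True


@dataclass
class RedundantService:
    """An active unit plus one standby, with software deciding which is in service."""
    active: Unit
    standby: Unit
    events: list = field(default_factory=list)

    def monitor_once(self):
        """One monitoring pass; returns the name of the unit in service, or None."""
        if self.active.healthy:
            return self.active.name
        if not self.standby.healthy:
            # Both units are down: redundancy is exhausted.
            self.events.append("outage: no healthy unit")
            return None
        # Switch service to the standby and record the change.
        self.events.append(f"failover: {self.active.name} -> {self.standby.name}")
        self.active, self.standby = self.standby, self.active
        return self.active.name
```

For example, with units "A" (active) and "B" (standby), marking "A" unhealthy and calling `monitor_once()` moves service to "B" and logs the failover. If "B" then fails before "A" is repaired, the call returns `None`, which is the outage that redundancy is meant to make rare.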
IBM says of SPOF – “A
single point of failure exists
when a critical function is provided
by a single component. If that
component fails, the system has
no other way to provide that
function and essential services
become