did not count. And on top of this, scheduled maintenance downtime did not count toward availability. The
Voice NOC had seemingly qualified
away rigor in order to meet
the mandated goal of five-nines.
(Once again proving the rule that
you get what you measure.)
Essentially
the Voice NOC was not wrong, nor
the Data division right. What the
NOC was measuring was compliance
with a customer-oriented SLA, a Service
Level Agreement, and not really an
availability measure. The Data division
was measuring availability as “measured
uptime,” or the probability
the network could be used at any
specific time the customer wished
to use it. Today we clearly understand
the difference between SLAs and Availability,
and define them in separate and individually
appropriate ways. So the justification
session worked, and Wedge Greene
henceforth started attending standards
meetings as an Operations guy.
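To make the distinction concrete, here is a minimal sketch; the outage records, the thirty-day reporting period, and the rule that only scheduled maintenance is excluded are assumptions made for illustration, not figures from the article. It contrasts an SLA-compliance number, where qualified downtime does not count, with availability as raw measured uptime.

```python
# A minimal sketch, assuming invented outage records and a 30-day reporting
# period, of how the same month of trouble yields two different numbers.

PERIOD_MINUTES = 30 * 24 * 60  # one 30-day reporting period, in minutes

# (downtime in minutes, was it scheduled maintenance?)
outages = [(12, False), (180, True), (3, False)]

# SLA-style compliance figure: scheduled maintenance does not count.
sla_downtime = sum(minutes for minutes, scheduled in outages if not scheduled)
sla_availability = 1 - sla_downtime / PERIOD_MINUTES

# Measured uptime: every minute the customer could not use the network counts.
total_downtime = sum(minutes for minutes, _ in outages)
measured_availability = 1 - total_downtime / PERIOD_MINUTES

print(f"SLA compliance figure: {sla_availability:.5%}")
print(f"Measured availability: {measured_availability:.5%}")
```

On the same outage log the SLA figure comes out comfortably higher than the measured availability, which is the kind of gap the Voice NOC and the Data division were talking past each other about.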
High availability is consistently
used by telecommunications vendors
to describe their products, perhaps
even more often than the umbrella
term “carrier grade”.
In practice, these numbers are mostly estimates that the manufacturer makes about the reliability of its equipment. Because telecom equipment is frequently expensive, not deployed in statistically significant sample sizes, and rushed into service as soon as it passes laboratory tests, the manufacturer can only estimate its reliability.
It is not clear when the
measure of reliability
of an individual network
element became the measure
of overall network availability – but
it did: customers don’t
care if one element fails,
or dozens fail. Customers
only care that the service
they are paying for and
rely on works as offered.
It is also interesting to note that this five-nines figure eventually transferred from network elements to computing systems.
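A simple way to see the difference: assume, purely for illustration, that the customer's service crosses several network elements in series and that elements fail independently. The sketch below shows how quickly end-to-end availability falls below the per-element figure.

```python
# Illustration only: a service path of N elements in series, independent
# failures assumed, each element individually meeting five-nines.

element_availability = 0.99999
minutes_per_year = 365.25 * 24 * 60

for elements_in_path in (1, 5, 10, 20):
    # Series composition: the service is up only if every element is up.
    path_availability = element_availability ** elements_in_path
    downtime_minutes = (1 - path_availability) * minutes_per_year
    print(f"{elements_in_path:2d} elements: {path_availability:.5%} "
          f"available, ~{downtime_minutes:.0f} min/year down")
```

Ten five-nines elements in a row already leave only about four nines of end-to-end service, which is why per-element reliability and service availability are not interchangeable.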
Hardware Origins
Let us backtrack a bit. It is likely
that the origin of five-nines availability
comes from design specifications for
network elements. It also seems likely
that the percent measure came before
the current MTBF (Mean Time Between Failures) and MTTR (Mean Time To Repair) measurements, since it is a simply expressed figure, and the resulting MTBF requirements often match the percentage calculation while coming out as odd-ish numbers. However, 99.999% is not as precise as it looks when you examine it closely, because of the fuzziness of the definition of availability. So MIL-HDBK-217 and Bellcore/Telcordia TR-332 standardized these measures. The basic hardware design measures became:

MTBF - the average time between failures of a hardware module.
MTTR - the average time taken to repair a failed hardware module.
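These two measures tie back to the availability percentage through the standard relation availability = MTBF / (MTBF + MTTR). The sketch below, which assumes a four-hour repair time purely for illustration, shows how an odd-ish MTBF requirement falls straight out of a five-nines target.

```python
# Standard relation: availability = MTBF / (MTBF + MTTR).
# The four-hour MTTR below is an assumed figure for illustration.

mttr_hours = 4.0
target_availability = 0.99999  # five-nines

# MTBF a module must achieve to meet the target with that repair time.
required_mtbf = mttr_hours * target_availability / (1 - target_availability)
print(f"Required MTBF: {required_mtbf:,.0f} hours "
      f"(about {required_mtbf / (24 * 365.25):.0f} years)")

# Downtime budget implied by 99.999% availability.
minutes_per_year = 365.25 * 24 * 60
print(f"Downtime budget: "
      f"{(1 - target_availability) * minutes_per_year:.2f} minutes per year")
```

With a four-hour repair time, five-nines demands an MTBF of roughly 400,000 hours, and the whole downtime budget is only about 5.26 minutes per year.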
Today, computer server reliability is critical to network availability, so it is actually convenient that both seek the same standard for describing quality.
Defining Availability
The IEEE Reliability Society's Reliability Engineering subgroup defines “Reliability
[as] a design engineering discipline
which applies scientific knowledge
to assure a product will perform
its intended function for the required
duration within a given environment.” Reliability
is the flip side of the availability
coin.
From a common
sense perspective, the meaning
of availability is clear. But when
you want to measure it, and then
hold someone to task for delivering
that availability, you must settle on an operational definition for it. Federal