Failures and Catastrophes
In reliability and safety work, the word "failure"
has several nuances [Ref. 1]. In its broadest sense, "failure"
can include any event that is unfavorable to reliability
or safety. However, in the context of the catastrophic
single-point failure (CSPF) concept, single-point
failure usually means a spatially local ("single-point")
material failure in a system: a permanent,
physical failure of a single component or interconnection.
The typical CSPF consists of primary (i.e., spontaneous
or self-inflicted) damage or a similar degraded
physical condition that requires a repair to correct
it.
The term CSPF does not apply to functional failures
(although one or more functional failures will,
in general, result from a CSPF), nor to unfavorable
system environments. An operator or maintenance
error per se can be considered a CSPF,
but it rarely is. However, a material failure
caused by a maintenance error or an unfavorable
system environment, or even by a design error,
can be a CSPF.
I will leave "catastrophic" undefined.
One person's catastrophe can be another person's
inconvenience. I will assume that the meaning
of "catastrophic" is understood consistently
by all stakeholders in the context of the system
being analyzed, and thus requires no general definition.
"Hard" versus "Soft" CSPFs
A material failure or a failure mode that guarantees
a subsequent catastrophic accident whenever it
occurs is a hard CSPF. A soft CSPF,
on the other hand, increases the probability of
a catastrophic accident substantially, but not
to 100%. The accident is not certain
because additional conditions must exist or arise
before a soft CSPF becomes a hard CSPF. (Examples
are given later in this article.)
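One way to picture the distinction is as a conditional probability: the chance of a catastrophic accident given that the failure has occurred. The sketch below is only an illustration under stated assumptions. It assumes the additional conditions are statistically independent, and the function names and numbers are hypothetical, not taken from the article or its references.

```python
from math import prod

def accident_probability_given_failure(condition_probs):
    """P(catastrophic accident | failure has occurred).

    Assumption (illustrative only): the accident follows if and only if
    every additional condition holds, and the conditions are independent.
    An empty list means no additional condition is needed.
    """
    return prod(condition_probs)

def classify_cspf(condition_probs):
    """Label a CSPF 'hard' if the accident is certain, otherwise 'soft'."""
    p = accident_probability_given_failure(condition_probs)
    return "hard" if p == 1.0 else "soft"

# Hypothetical cases: no extra conditions versus two extra conditions.
no_extra_conditions = classify_cspf([])
two_extra_conditions = classify_cspf([0.5, 0.2])
print(no_extra_conditions, two_extra_conditions)
```

In this picture, a soft CSPF "becomes" a hard one when every additional condition is already present, that is, when each probability in the list equals 1.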
One might think that a soft CSPF could not be
a "real" CSPF because it does not guarantee
a catastrophic accident. However, things are not
quite that simple.
Typical CSPF Prohibitions
The basic Federal Aviation Administration (FAA)
guideline for designing safe transport-category
airplane systems in the United States [Ref. 2]
includes the following statement:
"In any system or subsystem, the failure of any single
element, component, or connection during any one
flight ... should be assumed, regardless of its
probability. Such single failures should not prevent
continued safe flight and landing, or significantly
reduce the capability of the airplane or the ability
of the crew to cope with the resulting failure
conditions."
A subsequent draft version of this document [Ref. 3], a
combined FAA/Joint Airworthiness Authority effort,
uses the same words except that the second sentence
becomes: "Such single failures should not
be catastrophic." Section 901 of the Code
of Regulations for Airworthiness Standards for
Transport Category Airplanes [Ref. 4] makes a
similar statement about airplane propulsion systems:
"... no single failure or malfunction ...
will jeopardize the safe operation of the airplane
...." In effect, the FAA is saying, "If
it is possible that a failure within a system
in your proposed airplane, all by itself,
will cause a crash, we are not about to
certify your airplane based on your arguing that
the failure has a very low probability."
Similar requirements have appeared in many other
documents. As Pat Clemens said [Ref. 5], there
is indeed a "persistent pogrom" against
CSPFs.
How much sense does it make?
The Three Facets of Risk
Clemens points out [Ref. 5] that requirements
almost never specify allowable failure probabilities
for CSPFs, even though (as he contends) "probability
is the very component of the risk doublet that
eliminating single points is meant to control."
By "risk doublet," Clemens means a pair
consisting of (1) the probability of a particular
failure causing a loss, and (2) the severity of
that loss, or more precisely, the value
of the losses caused by the particular failure,
averaged over those missions during which the
failure occurred. The product of the probability
of a particular failure and its corresponding
average-loss value is the per-mission expected
loss caused by the possibility of the failure
occurring.
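The arithmetic behind the risk doublet can be sketched in a few lines. The probability and loss figures below are hypothetical, chosen only to make the multiplication easy to follow, and the function name is an illustrative assumption.

```python
def expected_loss_per_mission(failure_probability, losses_when_failed):
    """Per-mission expected loss for one particular failure.

    failure_probability: per-mission probability of the failure causing a loss.
    losses_when_failed:  loss values observed (or assumed) on missions
                         during which the failure occurred.
    """
    # Second element of the doublet: the average loss over failed missions.
    average_loss = sum(losses_when_failed) / len(losses_when_failed)
    # Expected loss is the product of the two elements of the doublet.
    return failure_probability * average_loss

# Hypothetical inputs, in arbitrary loss units.
p_failure = 1e-6
losses = [2.0e6, 5.0e6, 8.0e6]  # average is 5.0e6
print(expected_loss_per_mission(p_failure, losses))  # about 5 loss units per mission
```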
There are often facets of risk besides expected loss
that ought to be considered [Ref. 6]. One such
facet is the variability in loss values
from mission to mission. That variability is not
recognized if one assumes that individual-mission
loss values will be constant, when in reality
they will vary from mission to mission and accident
to accident. Frequent low-severity losses may
be tolerable, whereas much less frequent but very
high-severity losses may not, even though the
expected loss is the same in both cases. Furthermore,
there is a third facet of risk that is almost
always more important than variability: uncertainty
that the "claimed" value for expected
loss has been correctly estimated.
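The variability facet can be illustrated with two hypothetical loss profiles that share the same expected loss but differ sharply in spread. The numbers and names below are assumptions made for illustration only.

```python
from math import sqrt

def loss_statistics(outcomes):
    """Return (expected loss, standard deviation) for one mission.

    outcomes: list of (probability, loss) pairs that together cover
              every possible mission outcome (probabilities sum to 1).
    """
    mean = sum(p * loss for p, loss in outcomes)
    variance = sum(p * (loss - mean) ** 2 for p, loss in outcomes)
    return mean, sqrt(variance)

# Hypothetical profiles, in arbitrary loss units.
frequent_small = [(0.9, 0.0), (0.1, 10.0)]            # frequent, low severity
rare_large = [(0.9999, 0.0), (0.0001, 10000.0)]       # rare, high severity

for name, profile in [("frequent_small", frequent_small), ("rare_large", rare_large)]:
    mean, spread = loss_statistics(profile)
    print(f"{name}: expected loss = {mean:.3f}, standard deviation = {spread:.3f}")
```

Both profiles have an expected loss of about 1 unit per mission, yet the standard deviation of the rare, high-severity profile is roughly 100 units against 3 for the frequent, low-severity one. Expected loss alone hides that difference.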
It is neither probability nor expected loss considerations
per se which have given rise to the CSPF
prohibition. Rather, it is perceived
uncertainties in the values claimed for CSPF probabilities,
and therefore in the expected losses resulting
from those probabilities. And that is the key
point of this article.
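A rough numerical sketch shows why uncertainty in a claimed probability can matter more than the claimed value itself. The two-scenario model and every number below are hypothetical assumptions introduced for illustration; they are not drawn from the FAA documents or from Clemens.

```python
def effective_probability(claimed, p_analysis_wrong, probability_if_wrong):
    """Blend a claimed failure probability with the chance the claim is wrong.

    Assumption (illustrative only): either the analysis is right and the
    claimed value holds, or it is wrong and some larger value holds instead.
    """
    return (1.0 - p_analysis_wrong) * claimed + p_analysis_wrong * probability_if_wrong

# Hypothetical figures: a very small claimed probability, a 1% chance that
# the analysis behind it is flawed, and a much larger probability if so.
claimed = 1e-9
effective = effective_probability(claimed, p_analysis_wrong=0.01, probability_if_wrong=1e-5)

print(f"claimed:   {claimed:.3e}")
print(f"effective: {effective:.3e}")
print(f"ratio:     {effective / claimed:.1f}")
```

With these assumed figures, the effective probability is about 100 times the claimed one, and almost all of it comes from the possibility that the analysis is wrong. The expected loss scales by the same factor, which is the sense in which doubt about the claim, rather than the claim itself, drives the prohibition.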