Catastrophic Single-Point Failures

by Ted W. Yellman
Bellevue, Washington
 


Failures and Catastrophes

In reliability and safety work, the word "failure" has several nuances [Ref. 1]. In its broadest sense, "failure" can include any event that is unfavorable to reliability or safety. However, in the context of the catastrophic single-point failure (CSPF) concept, single-point failure usually means a spatially local ("single-point") material failure in a system – a permanent, physical failure of a single component or interconnection. The typical CSPF consists of primary (i.e., spontaneous or self-inflicted) damage or a similar degraded physical condition that requires a repair to correct it.

The term CSPF does not apply to functional failures (although one or more functional failures will, in general, result from a CSPF), nor to unfavorable system environments. An operator or maintenance error per se could be considered a CSPF, but it rarely is. However, a material failure caused by a maintenance error, an unfavorable system environment, or even a design error can be a CSPF.

I will leave "catastrophic" undefined, since one person’s catastrophe can be another person’s inconvenience. I will assume that its meaning is understood consistently by all stakeholders in the context of the system being analyzed, and thus requires no general definition.

"Hard" versus "Soft" CSPFs
A material failure or a failure mode that guarantees a subsequent catastrophic accident whenever it occurs is a hard CSPF. A soft CSPF, on the other hand, increases the probability of a catastrophic accident substantially, but not to 100%. The accident probability stays below 100% because additional conditions must exist or arise before a soft CSPF becomes, in effect, a hard CSPF. (Examples are given later in this article.)

One might think that a soft CSPF could not be a "real" CSPF because it does not guarantee a catastrophic accident. However, things are not quite that simple.
 


Typical CSPF Prohibitions

The basic Federal Aviation Administration (FAA) guideline for designing safe transport-category airplane systems in the United States [Ref. 2] includes the following statement:

"In any system or subsystem, the failure of any single element, component, or connection during any one flight ... should be assumed, regardless of its probability. Such single failures should not prevent continued safe flight and landing, or significantly reduce the capability of the airplane or the ability of the crew to cope with the resulting failure conditions."

A subsequent draft version of this document [Ref. 3], a combined FAA/Joint Airworthiness Authority effort, uses the same words except that the second sentence becomes: "Such single failures should not be catastrophic." Section 901 of the Code of Regulations for Airworthiness Standards for Transport Category Airplanes [Ref. 4] makes a similar statement about airplane propulsion systems: "... no single failure or malfunction ... will jeopardize the safe operation of the airplane ...." In effect, the FAA is saying, "If it is possible that a failure within a system in your proposed airplane – all by itself – will cause a crash, we are not about to certify your airplane based on your arguing that the failure has a very low probability."

Similar requirements have appeared in many other documents. As Pat Clemens said [Ref. 5], there is indeed a "persistent pogrom" against CSPFs.

How much sense does this prohibition make?

The Three Facets of Risk
Clemens points out [Ref. 5] that requirements almost never specify allowable failure probabilities for CSPFs, even though (as he contends) "probability is the very component of the risk doublet that eliminating single points is meant to control." By "risk doublet," Clemens means a pair consisting of (1) the probability of a particular failure causing a loss, and (2) the severity of that loss – or more precisely, the values of the losses caused by the particular failure averaged over those missions during which the failure occurred. The product of the probability of a particular failure and its corresponding average-loss value is the per-mission expected loss caused by the possibility of the failure occurring.
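The product described above can be written out as a short calculation. The following is only a minimal sketch: the function name and the numerical values are hypothetical illustrations, not figures from Clemens or from any requirement document.

```python
def expected_loss(failure_probability: float, average_loss: float) -> float:
    """Per-mission expected loss for one particular failure.

    failure_probability: per-mission probability of the failure causing a loss.
    average_loss: loss value averaged over the missions in which the
                  failure occurred (any consistent unit, e.g. dollars).
    """
    if not 0.0 <= failure_probability <= 1.0:
        raise ValueError("failure_probability must lie in [0, 1]")
    return failure_probability * average_loss


# Hypothetical values, chosen only to illustrate the arithmetic.
p_failure = 1e-6   # assumed per-mission probability
avg_loss = 5e8     # assumed average loss when the failure occurs
per_mission = expected_loss(p_failure, avg_loss)
print(f"Per-mission expected loss: {per_mission:.2f}")
```

Both elements of the doublet enter the product symmetrically, so halving either one halves the expected loss.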

There are often facets of risk besides expected loss that ought to be considered [Ref. 6]. One such facet is the variability in loss values from mission to mission. That variability is not recognized if one assumes that individual-mission loss values will be constant, when in reality they will vary from mission to mission and accident to accident. Frequent low-severity losses may be tolerable, whereas much less frequent but very high-severity losses may not, even though the expected loss is the same in both cases. Furthermore, there is a third facet of risk that is almost always more important than variability: uncertainty that the "claimed" value for expected loss has been correctly estimated.
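The point about variability can be illustrated with two hypothetical loss profiles that share the same expected loss. This is a sketch under assumed numbers, and using variance as the measure of variability is my choice for illustration, not something the cited references prescribe.

```python
def mean_and_variance(outcomes):
    """Return (mean, variance) of a discrete per-mission loss distribution.

    outcomes: list of (probability, loss) pairs whose probabilities sum to 1.
    """
    total_p = sum(p for p, _ in outcomes)
    if abs(total_p - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    mean = sum(p * loss for p, loss in outcomes)
    variance = sum(p * (loss - mean) ** 2 for p, loss in outcomes)
    return mean, variance


# Hypothetical profiles: frequent small losses versus rare large losses.
frequent_small = [(0.9, 0.0), (0.1, 10.0)]
rare_large = [(0.9999, 0.0), (0.0001, 10_000.0)]

mean_a, var_a = mean_and_variance(frequent_small)
mean_b, var_b = mean_and_variance(rare_large)
print(f"frequent/small: mean={mean_a:.3f}, variance={var_a:.1f}")
print(f"rare/large:     mean={mean_b:.3f}, variance={var_b:.1f}")
```

Under these assumed numbers the two profiles have essentially the same expected loss, yet the variance of the rare, high-severity profile is roughly a thousand times larger. That spread is the second facet, and it is invisible if only expected loss is reported.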

It is neither probability nor expected-loss considerations per se that have given rise to the CSPF prohibition. Rather, it is perceived uncertainties in the values claimed for CSPF probabilities, and therefore in the expected losses resulting from those probabilities. And that is the key point of this article.
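Because expected loss is proportional to the failure probability, any error in a claimed probability carries straight through to the expected loss. The sketch below assumes, purely for illustration, that the true probability could differ from the claimed one by a multiplicative error factor. The function name, the error-factor model, and all numbers are hypothetical.

```python
def expected_loss_bounds(claimed_probability, error_factor, average_loss):
    """Return (low, nominal, high) expected loss when the true failure
    probability may be off from the claimed value by up to error_factor
    in either direction (an assumed, simplified uncertainty model)."""
    if error_factor < 1.0:
        raise ValueError("error_factor must be at least 1")
    low = (claimed_probability / error_factor) * average_loss
    nominal = claimed_probability * average_loss
    # A probability cannot exceed 1, so cap the pessimistic case.
    high = min(1.0, claimed_probability * error_factor) * average_loss
    return low, nominal, high


# Hypothetical case: a very small claimed probability that might be
# wrong by three orders of magnitude.
low, nominal, high = expected_loss_bounds(1e-9, 1000.0, 1e9)
print(f"low={low:.4f}, nominal={nominal:.4f}, high={high:.1f}")
```

Under these assumptions the pessimistic expected loss is about a thousand times the claimed one, which is why a small claimed probability offers little reassurance if the claim itself cannot be trusted.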