Design for Safety

by Dev Raheja and Brian Moriarty
Washington DC
 


Avoid White-Space Risks
One analysis is often ignored: the search for failures that are difficult to predict. A typical example is the sudden acceleration reported years ago in the Audi 5000 when it was put in reverse gear. The cause can be as simple as a pin on a Nissan Altima exhaust pipe catching road debris that could then catch fire from the heat of the engine. This analysis is informally called “unknown unknowns analysis,” and it should be performed before approving the preliminary design. One tool for this kind of analysis is called “sneak path analysis.” In electronics and software, it is called “sneak circuit analysis.” It is used to discover hidden problems, which usually surface in rare events such as airbag deployment, or a major accident in which a firefighter, unaware of the consequences, may touch high-voltage battery terminals. The questions asked include, “Will the airbag open when it is supposed to?” “Will it open at the wrong time?” “Will the system give a false warning?” and “Will the system behave failsafe in the event of an unknown fault?” Depending on how critical the functions are, you can use other types of analyses, such as event tree analysis, worst-case analysis and diagnostics capability analysis, to discover unexpected failures.
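
The idea behind a sneak path search can be illustrated with a small sketch. It is not a full sneak circuit analysis, which works from real schematics and known topological patterns. It only enumerates every route from a power source to a load in a toy wiring graph and reports the routes the designer did not intend. The wiring, the component names and the functions `simple_paths` and `sneak_paths` are all invented for this illustration.

```python
def simple_paths(graph, start, goal, path=None):
    """Return every loop-free route from start to goal in a directed graph."""
    path = (path or []) + [start]
    if start == goal:
        return [path]
    found = []
    for nxt in graph.get(start, []):
        if nxt not in path:  # never revisit a node on the current route
            found.extend(simple_paths(graph, nxt, goal, path))
    return found


def sneak_paths(graph, source, load, intended):
    """Routes from source to load that are not in the intended design."""
    return [p for p in simple_paths(graph, source, load) if p not in intended]


# Hypothetical wiring: each key feeds current to the components listed.
WIRING = {
    "battery": ["ignition_switch", "hazard_switch"],
    "ignition_switch": ["crash_sensor"],
    "crash_sensor": ["airbag_igniter"],
    "hazard_switch": ["flasher"],
    "flasher": ["airbag_igniter"],  # assumed unintended shared connection
}

# The only route the (hypothetical) designer meant to exist.
INTENDED = [["battery", "ignition_switch", "crash_sensor", "airbag_igniter"]]

for route in sneak_paths(WIRING, "battery", "airbag_igniter", INTENDED):
    print("Sneak path:", " -> ".join(route))
```

Each reported route is a candidate answer to the question “Will it open at the wrong time?” and would still need engineering review before it is treated as a real hazard.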

Sneak failures are more likely to be in the embedded software, where practically no reliability analysis is done. Frequently, the software specifications are faulty because they are not derived from the system performance specification. At a minimum, we should perform Failure Mode and Effects Analysis (FMEA) on all critical software functions, develop reusable modules, make sure the software cannot accept unreasonable input, and define the product behavior in case of an unreasonable output. Then we should pay special attention to software structure and architecture so that engineering changes are quick and do not require complete regression testing. The preferred structure is top-down, in which the code is partitioned functionally and the order of execution flows from top to bottom. The most frequently used architecture is the fault-tolerant architecture.
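
As a minimal sketch of the last two points, the function below rejects unreasonable input and falls back to a defined safe state when the computed output is unreasonable. The limits, the control law and the names `in_range`, `torque_command` and `TorqueCommand` are assumptions made for illustration. In a real product the limits and the failsafe behavior would come from the system performance specification.

```python
import math
from dataclasses import dataclass

# Hypothetical limits, for illustration only.
PEDAL_RANGE = (0.0, 1.0)        # accelerator pedal position as a fraction
TORQUE_RANGE_NM = (0.0, 400.0)  # allowed motor torque command
FAILSAFE_TORQUE_NM = 0.0        # assumed safe state: command no torque


@dataclass(frozen=True)
class TorqueCommand:
    torque_nm: float
    failsafe: bool
    reason: str = ""


def in_range(value, limits):
    """True only for a finite real number inside the closed limits."""
    low, high = limits
    return (
        isinstance(value, (int, float))
        and not isinstance(value, bool)
        and math.isfinite(value)
        and low <= value <= high
    )


def default_control_law(pedal):
    """Placeholder control law: torque proportional to pedal position."""
    return pedal * TORQUE_RANGE_NM[1]


def torque_command(pedal, control_law=default_control_law):
    # Refuse unreasonable input before it reaches the control law.
    if not in_range(pedal, PEDAL_RANGE):
        return TorqueCommand(FAILSAFE_TORQUE_NM, True, "unreasonable input")
    torque = control_law(pedal)
    # Define the behavior for an unreasonable output: go to the safe state.
    if not in_range(torque, TORQUE_RANGE_NM):
        return TorqueCommand(FAILSAFE_TORQUE_NM, True, "unreasonable output")
    return TorqueCommand(float(torque), False)


print(torque_command(0.5))           # normal operation
print(torque_command(float("nan")))  # bad sensor reading, failsafe
```

Passing the control law in as a parameter keeps the checks in one small, reusable module, so a change to the control law does not touch the guard code.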

For Fast Results, Go Hollywood
If we want a higher level of safety at the lowest cost, then we need to treat safety like a new product. In fact, it is a product. It is a deliverable item to the customer, and the customer can tell whether you delivered it or not. The next question is, “Which industry is the best at delivering new products in a short time?” The answer is Hollywood. Hollywood delivers results in a short time because every product or activity, no matter how small, is thought out in detail. Movie producers and directors act as knowledge leaders more than as task leaders. Many activities are developed in parallel, such as music, choreography, cinematography, location scouting, securing location permits, publicity and sound recording. If any critical activity is delayed even by a day, the executive producer is notified. His or her job is to reallocate resources and make the necessary adjustments to meet deadlines. If the deadlines cannot be met, early warning triggers alert the director to redefine the scope of the project or to reschedule. But most important of all, no detailed work is done until the script is approved. The script determines the budget and the schedule.

The script is equivalent to a performance specification. When everything is defined in the script, the movie goes into production and finishes on time. Most movies are finished in a few weeks once the filming starts. The efficiency of the process resembles that of a symphony orchestra.

The same concepts apply to designing new systems. The script (that is, the system performance specification) must be thoroughly reviewed, and the work must proceed concurrently. The chief design engineer is equivalent to a movie director, the project manager is equivalent to the executive producer, and team leaders are equivalent to assistant producers and directors. As soon as the specification is approved, a number of activities can be performed in parallel: the system hazard analysis, subsystem hazard analysis, operations and support hazard analysis, fault tree analyses, design reviews, the process FMEA for safety, and serviceability/logistics hazard analysis. It is not too early for supply-chain management and manufacturing to work on improving existing processes and components for safety-critical features. If they wait to do this until a month before going into production, it will be impossible to lower the risk. In that case, the project manager is the one responsible for the failure. Manufacturing and supply-chain management should qualify critical components and processes well before going into production. They should participate in the concept design review to make sure the design does not require something they cannot deliver. If you do not have the resources to do all these activities in parallel, then you should be willing to extend the schedule, because skipping these activities is sure to increase risk.

Once the analyses are done, critical design characteristics for safety should be identified. These could be the strength of a joint, the alignment of an assembly, the type of heat treating, or the proper installation of a component. For each critical product characteristic, we need to identify the critical process characteristics that must be maintained at all times. A critical process characteristic can be the accuracy of setup, the frequency of tool maintenance, and so on.
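
One simple way to keep this link visible is a table that maps each critical product characteristic to the process characteristics that control it, plus a check for gaps. The sketch below reuses the examples from the text. The pairings between product and process characteristics are assumptions made for illustration, as is the function name `uncontrolled`.

```python
# Hypothetical mapping: critical product characteristic -> process
# characteristics that keep it under control. The pairings are
# illustrative assumptions, not taken from a real design.
CRITICAL_CHARACTERISTICS = {
    "strength of a joint": ["frequency of tool maintenance"],
    "alignment of an assembly": ["accuracy of setup"],
    "type of heat treating": [],
    "proper installation of a component": [],
}


def uncontrolled(characteristics):
    """Product characteristics with no process characteristic assigned."""
    return [name for name, controls in characteristics.items() if not controls]


for name in uncontrolled(CRITICAL_CHARACTERISTICS):
    print("No process control defined for:", name)
```

Any characteristic reported by the check is a gap to close before production, either by assigning a process control or by documenting why none is needed.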

Of course, these analyses have to be done right. There is no point in filling out the FMEA forms while watching a football game just to meet the requirements. The usual problem is that no one is responsible for reviewing the quality of the work. That is why Hollywood has assistant directors who review the work first; the director then reviews the work of the assistant directors. The equivalent of an assistant director in business can be an independent audit team. In software development, these are called independent verification and validation teams. We need to develop such teams to assess the thoroughness of safety analyses and mitigation actions.

Early Budgets Are Illusions
Most budgets are made after concept approval. This is all right to start with, but as the product definition changes, the budget must change as well. If you are going to make 700 changes in the specification, as in the case of the 1995 Lincoln Continental, then your initial budget is obsolete. The scope of the project has changed. You require a new budget, a new schedule and a new contingency budget. Most companies, because they write poor specifications, do not see this as a problem. They keep adding one requirement at a time and fail to adjust the budget. They end up creating special budgets later and spending several times more than they would have if the specification had been done right in the first place. It is much cheaper to avoid problems if you see them in advance. A realistic budget depends on a realistic and holistic performance specification, one that prevents life-cycle hazards in manufacturing, testing, warranty, storage, shipping, handling, preventive maintenance, diagnostics and repair.

Conclusion
There are three kinds of organizations: those that make things happen, those that watch things happen, and those that wonder what happened. If we write holistic specifications and perform thorough hazard analyses, then we don’t have to wonder what happened. We will make things happen our way.

About the Authors
Dev Raheja, founder and president of Design for Competitiveness, Inc., has been an international consultant and trainer in new product development since 1981. Prior to this, he served in management positions at Booz Allen Hamilton, General Electric and Cooper Industries. He has received several awards, including the Scientific Achievement Award from the System Safety Society. An author of the book Assurance Technologies: Principles and Practices, he also teaches courses in Practical Reliability Engineering and Designing for Safety.

Brian Moriarty is a product assurance staff engineer with more than 43 years of experience in system safety, reliability, maintainability, quality assurance and human factors. Currently, he is Northrop Grumman’s Senior Safety Engineer for the FAA En Route Automation Modernization (ERAM) system in Washington DC. A Fellow and Past President of the System Safety Society and a director of the Reliability and Maintainability Symposium, he is also the co-author of the book System Safety Engineering and Management, which is used in safety programs throughout the world.