Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Audio version here (may not be up yet).

I recently read Engineering a Safer World by Nancy G. Leveson, at Joshua Achiam’s recommendation, and really enjoyed it, so get ready for another book summary! I’m not very happy with the summary I have -- it feels less compelling than the book, partly because the book provides a ton of examples that I don’t have the space to include -- but hopefully it is enough to get the key points across.

The main motivation of this book is to figure out how we can improve safety engineering. Its primary thesis is that the existing methods used in engineering are insufficient for the current challenges, and must be replaced by a method the author favors called STAMP. Note that the book is primarily concerned with mechanical systems that may also have computerized automation (think aerospace, chemical, and mechanical engineering); the conclusions should not be expected to apply directly to AI.

The standard model of safety engineering and its deficiencies

Historically, safety engineering developed as a reaction to the high rate of accidents in the past, and as a result it focused on the easiest gains first. In particular, there were a lot of gains to be had simply by ensuring that machines didn’t break. (Rohin’s note: I’m editorializing a bit here; the author doesn’t explicitly say this, but I think she believes it.) This led to a focus on reliability: given a specification for how a machine should operate, we aim to decrease the probability that the machine fails to meet that specification. For example, the specification for a water tank would be to contain water up to a given pressure, and one way to improve the reliability of the tank would be to use a stronger material or a thicker wall, making it less likely that the tank ruptures.

Under this model, an accident happens when a machine fails to meet its specification. So, we can analyze the accident by looking at what went wrong, and tracing back the physical causes to the first point at which a specification was not met, giving us a root cause that can show us what we need to fix in order to prevent similar accidents in the future. We can call this sort of analysis an event chain analysis.

However, in the last few decades there have been quite a few changes that make this model less adequate than it once was. The pace of technological change has risen, making it harder to learn from experience. The systems we build have become complex enough that there is a lot more coupling between parts of the system, and many more interaction effects that we could fail to account for. Relatedly, the risks we face are getting large enough that we aren’t willing to tolerate even a single accident. Human operators (e.g. factory workers) are no longer able to rely on easily understood and predictable mechanical systems, and instead have to work with computerized automation which they cannot understand as well. At this point, event chain analysis and safety-via-reliability are no longer sufficient for safety engineering.

Consider for example the American Airlines Flight 965 accident. In this case, the pilots got clearance to fly towards the Rozo waypoint in their descent, listed as (R) on their (paper) approach charts. One of the pilots entered R in the flight management system (FMS), which brought up a list of waypoints that did not include Rozo; the pilot executed the first one (presumably believing that Rozo, being the closest waypoint, would show up first). As a result, the plane turned towards the selected waypoint and crashed into a mountain.

The accident report for this incident placed the blame squarely on the pilots, firstly for not planning an appropriate path, and secondly for not having situational awareness of the terrain and that they needed to discontinue their approach. But most interestingly, the report blames the pilots for not reverting to basic radio navigation when the FMS became confusing. The author argues that the design of the automation was also flawed in this case, as the FMS stopped displaying the intermediate fixes to the chosen route, and the FMS’s navigational information used a different naming convention than the one in the approach charts. Surely this also contributed to the loss? In fact, in lawsuit appeals, the software manufacturer was held to be 17% liable.

However, the author argues that this is the exception, not the rule: typically event chain analysis proceeds until a human operator is found who did something unexpected, and then the blame can safely be placed on them. Operators are expected to “use common sense” to deviate from procedures when the procedures are unsafe, but when an accident happens, blame is placed on them for deviating from procedures. This is often very politically convenient, and is especially easy to justify thanks to hindsight bias, where we can identify exactly the right information and cues that the operator “should have” paid attention to, ignoring that in the moment there were probably many confusing cues and it was far from obvious which information to pay attention to. My favorite example has to be this quote from an accident report:

“Interviews with operations personnel did not produce a clear reason why the response to the [gas] alarm took 31 minutes. The only explanation was that there was not a sense of urgency since, in their experience, previous [gas] alarms were attributed to minor releases that did not require a unit evacuation.”

It is rare that I see such a clear example of a self-refuting paragraph. In the author’s words, “this statement is puzzling, because the statement itself provides a clear explanation for the behavior, that is, the previous experience”. It definitely sounds like the investigators searched backwards through the causal chain, found a situation where a human deviated from protocol, and decided to assign blame there.

This isn’t just a failure of the accident investigation -- the entire premise of some “root cause” in an event chain analysis implies that the investigators must end up choosing some particular point to label as The Root Cause, and such a decision is inevitably determined more by the particular analysts involved than by features of the accident.

Towards a new approach

How might we fix the deficiencies of standard safety engineering? The author identifies several major changes in assumptions that are necessary for a new approach:

1. Blame is the enemy of safety. Safety engineering should focus on system behavior as a whole, where interventions can be made at many points on different levels, rather than seeking to identify a single intervention point.

2. Reliability (having machines meet their specifications) is neither necessary nor sufficient for safety (not having bad outcomes). Increased reliability can lead to decreased safety: if we increase the reliability of the water tank by making it out of a stronger material, we may decrease the risk of rupture, but we may dramatically increase the harm when a rupture does occur, since the water will be at a much higher pressure. (Conversely, an unreliable machine can still be safe if it fails in a way that does no harm.) This applies to software as well: highly reliable software need not be safe, as its specifications may not be correct.

3. Accidents involve the entire sociotechnical system, for which an event chain model is insufficient. Interventions on the sociological level (e.g. make it easy for low-level operators to report problems) should be considered part of the remit of safety engineering.

4. Major accidents are not caused by the simultaneous occurrence of random chance events. Particularly egregious examples come from probabilistic risk analysis, where failures of different subsystems are often assumed to be independent, neglecting the possibility of a common cause, whether physical (e.g. multiple subsystems failing during a power outage) or sociological (e.g. multiple safety features being disabled as part of cost-cutting measures); a toy numerical sketch of this point follows the list. In addition, systems tend to migrate towards higher risk over time, because environmental circumstances change, and operational practices diverge from the designed practices as they adapt to the new circumstances, or simply to be more efficient.

5. Operator behavior is a product of the environment in which it occurs. To improve safety, we must change the environment rather than the human. For example, if an accident occurs and an operator didn’t notice a warning light that could have let them prevent it, the solution is not to tell the operators to “pay more attention” -- that approach is doomed to fail.
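To make point 4 concrete, here is a minimal numerical sketch (mine, not the book’s; the probabilities are invented purely for illustration) of how assuming independent subsystem failures understates risk when there is a common cause such as a shared power supply.

```python
# Toy illustration of point 4: two "independent" safety subsystems that in
# fact share a common cause of failure (e.g. a power outage). All numbers
# are made up for illustration.

p_a = 0.01          # P(subsystem A fails) from its own defects
p_b = 0.01          # P(subsystem B fails) from its own defects
p_outage = 0.001    # P(power outage), which disables both subsystems at once

# Naive probabilistic risk analysis: treat A and B as independent.
p_both_naive = p_a * p_b

# Accounting for the common cause: both fail if both fail independently,
# or if the outage occurs (ignoring the tiny overlap for simplicity).
p_both_common = p_outage + (1 - p_outage) * p_a * p_b

print(f"Assuming independence: {p_both_naive:.6f}")   # 0.000100
print(f"With a common cause:   {p_both_common:.6f}")  # ~0.001100, about 11x higher
```

Even with a rare common cause, the joint failure probability is dominated by it, so the independence assumption can be off by an order of magnitude or more.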

A detour into systems theory

The new model proposed by the author is based on systems theory, so let’s take a moment to describe it. Consider possible systems that we may want to analyze:

First, there are some systems with organized simplicity, in which it is possible to decompose the system into several subsystems, analyze each of the subsystems independently, and then combine the results relatively easily to reach overall conclusions. We might think of these as systems in which analytic reduction is a good problem-solving strategy. Rohin’s note: Importantly, this is different from the philosophical question of whether there exist phenomena that cannot be reduced to e.g. physics: that is a question about whether reduction is in principle possible, whereas this criterion is about whether such reduction is an effective strategy for a computationally bounded reasoner. Most of physics would be considered to have organized simplicity.

Second, there are systems with unorganized complexity, where there is not enough underlying structure for analytic reduction into subsystems to be a useful tool. However, in such systems the behavior of individual elements of the system is sufficiently random (or at least, well-modeled as random) that statistics can be applied to it, and then the law of large numbers allows us to understand the system as an aggregate. A central example would be statistical mechanics, where we cannot say much about the motion of individual particles in a gas, but we can say quite a lot about the macroscopic behavior of the gas as a whole.
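As a quick illustration of why randomness helps here (my own sketch, not from the book): each individual element is unpredictable, but the law of large numbers makes the aggregate highly predictable.

```python
# Unorganized complexity in miniature: each "particle" has a speed we model
# as random, so we can say little about any individual particle, but the
# law of large numbers makes the aggregate behavior highly predictable.
import random

random.seed(0)
n = 100_000
speeds = [random.expovariate(1.0) for _ in range(n)]  # arbitrary distribution with mean 1.0

print("one particle's speed:", speeds[0])          # essentially unpredictable
print("mean speed of the gas:", sum(speeds) / n)   # very close to 1.0 on every run
```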

Systems theory deals with systems that have organized complexity. Such systems have enough organization and structure that we cannot apply statistics to them (or equivalently, the assumption of randomness is too inaccurate), and are also sufficiently complex that analytic reduction is not a good technique (e.g. perhaps any potential decomposition into subsystems would be dominated by combinatorially many interaction effects between subsystems). Sociological systems are central examples of such systems: the individual components (humans) are very much not random, but neither are their interactions governed by simple laws as would be needed for analytic reduction. While systems theory cannot provide nearly the same level of precision as statistics or physics, it does provide useful concepts for thinking about such systems.

The first main concept in systems theory is that of hierarchy and emergence. The idea here is that systems with organized complexity can be decomposed into several hierarchical levels, with each level built “on top of” the previous one. For example, companies are built on top of teams which are built on top of individual employees. The behavior of components in a particular layer is described by some “language” that is well-suited for that layer. For example, we might talk about individual employees based on their job description, their career goals, their relationship with their manager, and so on, but we might talk about companies based on their overall direction and strategy, the desires of their customer base, the pressures from regulators, and so on.

Emergence refers to the phenomenon that there can be properties of higher levels arising from lawful interactions at lower levels that nonetheless are meaningless in the language appropriate for the lower levels. For example, it is quite meaningful to say that the pressures on a company from government regulation caused them to (say) add captions to their videos, but if we look at the specific engineer who integrated the speech recognition software into the pipeline, we would presumably say “she integrated the speech recognition into the pipeline because she had previously worked with the code” rather than “she integrated it because government regulations told her to do so”. As another example, safety is an emergent system property, while reliability is not.

The second main concept is that of control. We are usually not satisfied with just understanding the behavior of systems; we also want to change that behavior (as in the case of making them safer). In systems theory, this is thought of as control, where we impose some sort of constraint on possible system behavior at some level. For example, employee training is a potential control action that could aim to enforce the constraint that every employee knows what to do in an emergency. An effective controller requires a goal, a set of actions to take, a model of the system, and some way to sense the state of the system.
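To make those four ingredients concrete, here is a minimal sketch (my own, not code from the book); the employee-training example and the toy policy are hypothetical.

```python
# Minimal sketch of the four ingredients of an effective controller in
# systems theory: a goal (constraint to enforce), actions it can take,
# a model of the controlled process, and a way to sense the process state.
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Controller:
    goal: str                                                        # constraint to enforce
    actions: List[str]                                               # control actions available
    process_model: Dict[str, object] = field(default_factory=dict)   # beliefs about the process
    sense: Callable[[], Dict[str, object]] = lambda: {}              # feedback channel

    def step(self) -> List[str]:
        """Update the process model from observations, then pick actions."""
        self.process_model.update(self.sense())
        # Toy policy: if the sensed state suggests the constraint is violated,
        # issue every available control action.
        if self.process_model.get("constraint_violated", False):
            return self.actions
        return []

# Hypothetical example: enforce "every employee knows what to do in an emergency".
training_controller = Controller(
    goal="every employee knows emergency procedures",
    actions=["schedule refresher training", "update onboarding materials"],
    sense=lambda: {"constraint_violated": True},  # e.g. a failed emergency drill
)
print(training_controller.step())
```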

STAMP: A new model underlying safety engineering

The author then introduces a new model called Systems-Theoretic Accident Model and Processes (STAMP), which aims to present a framework for understanding how accidents occur (which can allow us to prevent them and/or learn from them). It contains three main components:

Safety constraints: In systems theory, a constraint is the equivalent of a specification, so these are just the safety-relevant specifications. Note that such specifications can be found at all levels of the hierarchy.

Hierarchical safety controllers: We use controllers to enforce safety constraints at any given level. A control algorithm may be implemented by a mechanical system, a computerized system, or humans, and can exist at any level of the hierarchy. A controller at level N will typically depend on constraints at level N - 1, and thus the design of this controller influences which safety constraints are placed at level N - 1.

Process models: An effective controller must have a model of the process it is controlling. Many accidents are the result of a mismatch between the actual process and the process model of the controller.

This framework can be applied towards several different tasks, and in all cases the steps are fairly similar: identify the safety constraints you want, design or identify the controllers enforcing those constraints, and then do some sort of generic reasoning with these components.

If an accident occurs, then at the highest level, either the control algorithm(s) failed to enforce the safety constraints, or the control actions were sent correctly but were not followed. In the latter case, the controllers at the lower level should then be analyzed to see why the control actions were not followed. Ultimately, this leads to an analysis on multiple levels, which can identify several things that went wrong (rather than one Root Cause), all of which can be fixed to improve safety in the future.
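Here is a minimal sketch (my own construction, not code from the book) of what that multi-level analysis might look like: walk the hierarchy of controllers and, at each level, record whether the constraint was enforced, whether the issued control actions were followed, and whether the process model matched reality, collecting every contributing factor rather than a single root cause. The levels and findings are invented for illustration.

```python
# Sketch of STAMP-style accident analysis: instead of tracing back to a single
# root cause, examine each level of the safety control structure and record
# every place where constraint enforcement or control actions broke down.
from dataclasses import dataclass
from typing import List

@dataclass
class LevelFinding:
    level: str
    constraint: str
    constraint_enforced: bool     # did the controller enforce its safety constraint?
    actions_followed: bool        # were its control actions followed at the level below?
    process_model_mismatch: str   # how the controller's model diverged from reality ("" if none)

def analyze(findings: List[LevelFinding]) -> List[str]:
    """Collect all contributing factors across the hierarchy."""
    factors = []
    for f in findings:
        if not f.constraint_enforced:
            factors.append(f"{f.level}: failed to enforce '{f.constraint}'")
        if not f.actions_followed:
            factors.append(f"{f.level}: control actions were not followed at the level below")
        if f.process_model_mismatch:
            factors.append(f"{f.level}: process model mismatch ({f.process_model_mismatch})")
    return factors

# Invented example, loosely inspired by the gas-alarm quote earlier in the summary.
findings = [
    LevelFinding("plant management", "alarms are treated as genuine until proven otherwise",
                 constraint_enforced=False, actions_followed=True,
                 process_model_mismatch="believed operators always evacuate on alarm"),
    LevelFinding("control room operators", "evacuate the unit when a gas alarm sounds",
                 constraint_enforced=False, actions_followed=True,
                 process_model_mismatch="believed alarms usually indicate minor releases"),
]

for factor in analyze(findings):
    print("-", factor)
```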

Organizational safety

So far we’ve covered roughly chapters 1-4 of the book. I’ll now jump straight to chapter 13, which seems particularly important and relevant, as it deals with how organizational structure and management should be designed to support safety.

One major point the author makes is that safety is cost-effective for long-term performance as long as it is designed into the system from the start, rather than added on at the last minute. Performance pressure, on the other hand, inevitably leads to cuts in safety.

In order to actually get safety designed into the system from the start, it is crucial that top management demonstrates a strong commitment to safety; without this, employees will inevitably cut corners on safety, believing that this is what their incentives reward. Other important factors include a concrete corporate safety policy, as well as a strong corporate safety culture. It is important that safety is part of the design process, rather than tacked on at the end. In the author’s words, “putting safety into the quality assurance organization is the worst place for it. [...] It sets up the expectation that safety is an after-the-fact or auditing activity only.”

In addition, it is important that information flows well. Going from the bottom to the top, it should be possible for low-level operators to report potential problems in a way that ensures they are actually acted on and that the relevant information reaches top management. From the top to the bottom, safety information and training should be easily available and accessible to employees when they need it.

It is also important to have controls to prevent the general tendency of systems to migrate towards higher risk, e.g. by relaxing safety requirements as time passes without any incidents. The next chapter describes SUBSAFE, the author’s example of a well-run safety program, in which one such control is for everyone to periodically watch a video reminding them of the importance of their particular safety work (in particular, the video shows the loss of the USS Thresher, the event that caused SUBSAFE to be created).

Perhaps obviously, it is important for an organization to have a dedicated safety team. This is in contrast to making everyone responsible for safety. In the author’s words: “While, of course, everyone should try to behave safely and to achieve safety goals, someone has to be assigned responsibility for ensuring that the goals are achieved.”

To reiterate: if you design for safety from the start, it is cost-effective, not opposed to long-term profit; it is performance pressure that leads to cuts in safety. A related failure mode is fixing symptoms instead of underlying causes, after which the symptoms keep recurring and people conclude they are inevitable.

Miscellaneous notes

The remaining chapters of the book apply STAMP in a bunch of different areas with many examples, including an entire chapter devoted to the STAMP treatment of a friendly fire accident. I also really liked the discussion of human factors in the book, but decided not to summarize it as this has already gotten quite long.

Summary of the summary

I’ll conclude with a quote from the book’s epilogue:

What seems to distinguish those experiencing success is that they:

1. Take a systems approach to safety in both development and operations

2. Have instituted a learning culture where they have effective learning from events

3. Have established safety as a priority and understand that their long-term success depends on it

Relationship to AI safety

A primary motivation for thinking about AI is that it would be very impactful for our society, and very impactful technologies need not have good impacts. “Society” clearly falls into the “organized complexity” class of systems, and so I expect that the ideas of safety constraints and hierarchical control algorithms will be useful ways to think about possible impacts of AI on society. For example, if we want to think about the possibility of AI systems differentially improving technical progress over “wisdom”, such that we get dangerous technologies before we’re ready for them, we may want to sketch out hierarchical “controllers” at the societal level that could solve this problem. Ideally these would eventually turn into constraints on the AI systems that we build, e.g. “AI systems should report potentially impactful new technologies to such-and-such committee”. I see the AI governance field as doing this sort of work using different terminology.

Technical AI alignment (in the sense of intent alignment (AN #33)) does not seem to benefit as much from this sort of approach. The main issue is that we are often considering a fairly unitary system (such as a neural net, or the mathematical model of expected utility maximization) to which the hierarchical assumption of systems theory does not really apply.

To be clear, I do think there is in fact some hierarchy: for example, in image classifiers, low levels involve edge detectors while high levels involve dog-face detectors. However, we do not have the language to talk about these hierarchies, nor the algorithms to control the intermediate layers. While Circuits (AN #111) is illustrating this hierarchy for image classifiers, it does not give us a language that we can (currently) use to talk about advanced AI systems. As a result, we are reduced to focusing on the incentives we provide to the AI system, or speculating on the levels of hierarchy that might be internal to advanced AI systems, neither of which seems particularly conducive to good work.

In the language of this book, I work on intent alignment because I expect that the ability to enforce the constraint “the AI system tries to do what its operator wants” will be a very useful building block for enforcing whatever societal safety constraints we eventually settle on, and it seems possible to make progress on it today. There are several arguments for risk that this ignores (see e.g. here (AN #50) and here (AN #103)); for some of these other risks, the argument is that we can handle them using similar mechanisms to the ones we have used before (e.g. governance, democracy, police, etc.), as long as we have handled intent alignment.

FEEDBACK

I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.

PODCAST

An audio podcast version of the Alignment Newsletter is available. This podcast is an audio version of the newsletter, recorded by Robert Miles.

COMMENTS

I'm excited to see this cross over into AI safety discussions. I work on what we often call "reliability engineering" in software, and I think there are a lot of lessons there that apply here, especially the systems-based or highly-contextualized approach, since it acknowledges the same kind of failure as, say, was pointed out in The Design of Everyday Things: just because you build something to spec doesn't mean it works if humans make mistakes using it.

I've not done a lot to bring that over to LW or AF, other than a half-assed post about normalization of deviance. I'm not a great explainer, so I often feel like it's not the most valuable thing for me to do, but this field seems neglected, and pointing people to it seems valuable for getting them to think more about how real systems fail today, rather than theorizing about how AI might fail in the future, or the relatively constrained ways AI fails today.

I got the book (thanks to Conjecture) after doing the Intro to ML Safety Course where the book was recommended. I then browsed through the book and thought of writing a review of it - and I found this post instead, which is a much better review than I would have written, so thanks a lot for this! 

Let me just put down a few thoughts that might be relevant for someone else considering picking up this book.

Target audience: Right at the beginning of the book, the author says "This book is written for the sophisticated practitioner rather than the academic researcher or the general public." I think this is relevant, as the book goes to a level of detail way beyond what's needed to get a good overview of engineering safety.

Relevance to AI safety: I feel like most engineering safety concepts are not applicable to alignment, firstly because an AGI would likely not have any human involvement in its optimization process, and secondly because the basic underlying STAMP constructs of safety constraints, hierarchical safety control structures, and process models are simply more applicable to engineering systems. As stated on p. 100, "STAMP focuses particular attention on the role of constraints in safety management", and I highly doubt an AGI can be bounded by constraints. Nevertheless, Chapter 8 (STPA: A New Hazard Analysis Technique), which describes STPA (System-Theoretic Process Analysis), may be somewhat relevant to designing safety interlocks. Also, the final chapter (13), on Managing Safety and the Safety Culture, is broadly applicable to any field that involves safety.

Criticisms of conventional techniques: The book often claims that techniques like STAMP and STPA are superior to conventional techniques like HAZOP, and gives quotes from reviewers attesting to their superiority. I don't know if those criticisms are really fair, given that these techniques are not really adopted, at least in the oil and gas industry, which, for all its flaws, takes safety very seriously. Perhaps the criticisms would be fair for very outdated safety practices. Nevertheless, the general concepts of engineering safety feel quite similar whether one uses the 'conventional' techniques or the 'new' techniques described in the book.

Overall, I think this book provides a good overview of engineering safety concepts, but for the general audience (or alignment researchers) it goes into too much detail on specific case studies and arguments.