Engineering a Safer World: Risk Modelling — and Safety Engineering? — for AI Loss of Control

Oliver Sourbut

Engineering a safer world.

I, and I imagine many of my readers, are eager to contribute to that effort^[1].

It's also, conveniently, the title of a book!

It's not a book by me… but nevertheless I recommend it. You should consider reading it, especially the first chapters. You can find it online free from MIT. It's by MIT professor Nancy Leveson, computer scientist turned innovator in safety science, safety engineering, and software safety.

Generally, this book is a particularly approachable and competent example of systems thinking or what I might call a cybernetic perspective — applied to safety science. This essay is in part a brief book review of Leveson's Engineering a Safer World.

I care about this systems perspective in part because... loss of control. It's something which, I think, is quite intuitive for many people at a very abstract level. Of course we could lose control of 'systems smarter than us'. But in my experience it tends to be quite slippery. It's difficult to get a proper analytical grip on 'loss of control', and it's difficult to communicate about.

The systems thinking or cybernetic framing is one I find particularly helpful for getting to grips with 'loss of control', and it has other benefits besides. That's the rest of what this essay is doing: how does systems thinking apply to AI and loss of control specifically?

I did some of this loss of control risk modelling at the UK government's AI Safety/Security Institute (AISI) in 2024, and now this kind of risk modelling is part of the background strategy, prioritisation, and foresight work we do at the Future of Life Foundation^[2].

This essay corresponds closely to a talk I delivered at the Technical AI Safety Conference 2026.

Safety and systems

The first lesson from this perspective is: safety is a system property.

There's never a single 'root cause' of a disaster.

Even acute disasters have a buildup. The bit you see — the tip of the iceberg — is the acute disaster. An explosion. A crash. A fire. Human extinction. But the lead up to that always has systemic, process failures.

Disasters might be precipitated by some particular hazardous activity, or even malicious action. But behind all of those is a network of interrelated systems and processes which could have — and we might say should have — done something to prevent the disaster.

These systemic weaknesses are like health indicators for the systems in question. Leveson encourages us to consider these processes and systems as the central objects of consideration when trying to engineer a safer world.

'Loss of control' on this perspective is less of a single decisive moment and more of an unfolding process, or even a vicious cycle. We'll return to the vicious cycles later.

Which systems?

OK, we need to pay attention to some systems. But there are lots of systems! Which systems?

There's definitely an art to this — not one I claim particularly to possess. It's here we need to apply contextual knowledge, analyst's judgement, and whatever expertise we can muster. But there are some general principles the systems view can offer.

My statement of a principle I like: think up the chain.

In engineering, consider:

- A deployment. It goes with monitoring systems, maintenance systems, iteration and evolution processes.
- Design, development, perhaps manufacturing. Research.
- Those systems are in turn overseen by management which should include communication, reporting, decision-making systems. Passing down design constraints, objectives, resource allocation, prioritisation. Getting status updates, reports, outcomes.
- ...Even above that, we can start to talk about 'governance':
  - Literal governments: regulatory activity
  - Also company boards and other governance commitments and structures
  - Courts
- We can also include, and I think it's important to include, broader societal sensemaking, democratic deliberation, how we are making sense of what's happening as a society and controlling the processes unfolding around us.

This is a very 'cybernetic' perspective, one which I think is quite powerful.

When these get corrupted, things get out of control.

In her book, Leveson gives various worked examples. She also offers some generic examples as starting places for analysts like me.

Here's one I find quite instructive. It's actually one we used as a starting seed for some risk modelling at AISI.

In general, this systems perspective offers a really expressive language both for communicating and for analysing or troubleshooting. One of the beauties of this language is you can conceptually 'zoom in and out', asking for example which people and processes (subsystems) make up or implement a given system or relationship, and their requirements and so on. Or draw a bigger box abstracting around a collection of closely related systems for a more bird's-eye view.

The regulatory lag dilemma

The accompanying text from this point in the book is sobering reading. Leveson writes,

The only requirement is that responsibility for safety is distributed in an appropriate way throughout the sociotechnical system… If companies or industries are unwilling or incapable of performing their public safety responsibilities, then government has to step in to achieve the overall public safety goals. But a much better solution is for company management to take responsibility — Leveson, Engineering a Safer World (4.2, The Hierarchical Safety Control Structure)

but elsewhere in the same chapter,

As in any control loop, time lags may affect the flow of control actions and feedback… For example, standards can take years to develop or change — a time scale that may keep them behind current technology and practice. — Leveson, Engineering a Safer World (4.2, The Hierarchical Safety Control Structure)

(emphases mine).

This dilemma of slow-moving, ponderous regulatory oversight is particularly pernicious in an area as fast-moving as AI. More on this dilemma later.
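To make the time-lag point concrete, here is a toy sketch of a control loop whose feedback arrives late. It is my own illustration rather than anything from Leveson's book: the function names, the proportional correction rule, and the numbers (`gain`, `drift`, `steps`) are all assumptions chosen only to show the qualitative effect.

```python
def simulate(delay: int, gain: float = 0.5, drift: float = 1.0, steps: int = 40) -> list[float]:
    """Track how far a process sits from its safe setpoint (0.0) over time.

    Each step the process drifts by `drift` (think: technology moving on),
    and the controller pushes back in proportion to an observation that is
    `delay` steps old (think: standards written for last year's practice).
    """
    deviations = [0.0]
    for t in range(steps):
        # Until the first delayed observation arrives, the controller sees "all is well".
        observed = deviations[t - delay] if t >= delay else 0.0
        deviations.append(deviations[t] + drift - gain * observed)
    return deviations


def peak_deviation(delay: int) -> float:
    """Largest distance from the setpoint reached during the run."""
    return max(abs(d) for d in simulate(delay))


for delay in (0, 2, 4):
    print(f"feedback delay {delay}: peak deviation {peak_deviation(delay):.2f}")
```

In this toy model, prompt feedback keeps the deviation bounded by `drift / gain`, while stale feedback lets the process run well past that before any correction bites. Real regulatory loops are of course far messier; the sketch only shows why lag alone, with everything else held fixed, degrades control.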

Which health indicators?

We have something like a picture of which systems might be important for safety.

Where do we look for health indicators? Relatedly, where are the hazards? And the flipside, where do we look for interventions for robustness?

A systems-theoretic lens has a few general suggestions.

'Control' systems

This is a very generic picture of a control system. You'll see pictures like it across biology, computer science, cybernetics, reinforcement learning, military strategy, engineering, control theory. Terminology might vary but it's the same general picture, very straightforward and explanatory.

What's a controller? It might be me riding my unicycle, it might be a government department attempting to understand and nudge an industry or even an economy, it might be a mouse looking for food, or an electronic component of an autonomous vehicle.

What does such a controller need to do a good job? It needs sensing: some way of getting observations, feedback about the relevant aspects of its situation and environment. It needs actuation: some way of acting back on the world, taking actions, applying influence. Between those, it needs understanding — something like a model, perhaps a 'world model' — interpreting its observations and their implications, and the implications of its available options. And it needs a way of deciding appropriately what to do in order to carry out whatever responsibilities or agenda it has.

In the context of safety, we're talking about safety controllers, applying safety constraints. It doesn't mean dictating what happens in any 'controlled' processes, just applying enough constraints that we can be confident that safety is maintained.

So: sensing, understanding, deciding, acting.

When analysing a given system, we can ask what it has by way of sensing, understanding, deciding, and acting. Are they adequate? Degradations to these look like hazards, or even attack surfaces in an adversarial setting. And improvements look like safety opportunities.
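One way to turn those four questions into a checklist is sketched below. This is a minimal illustration under my own assumptions: the `Controller` class, the `gaps` method, and the example entries are hypothetical, not part of any established tool or of Leveson's method.

```python
from dataclasses import dataclass, field

# The four functions a controller needs, as named in the text.
FUNCTIONS = ("sensing", "understanding", "deciding", "acting")


@dataclass
class Controller:
    name: str
    # Maps each function to a short note on what currently provides it.
    # A missing or empty note is treated as a gap (a hazard to investigate).
    provisions: dict[str, str] = field(default_factory=dict)

    def gaps(self) -> list[str]:
        """Return the functions for which nothing adequate is recorded."""
        return [f for f in FUNCTIONS if not self.provisions.get(f)]


# Hypothetical example entries, for illustration only.
deployer = Controller(
    name="AI deployer",
    provisions={
        "sensing": "logs and evaluations",
        "acting": "rollback, rate limits",
    },
)

print(deployer.name, "gaps:", deployer.gaps())
```

Recording an empty note for a function (say, after concluding the logs cannot be trusted) makes it show up as a gap, which mirrors the point that degradations look like hazards and improvements look like safety opportunities.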

Control examples in AI safety

Let's get more concrete.

Take the example of an AI deployer, and their opportunities for sensing. How do they 'see' what's going on? Logs! So spoofed logs or tampered evaluations or something like that would substantially compromise their ability to understand what's going on. So this is potentially cutting off that sensing relationship. Alternatively, rogue replication of an agent would also be a compromise to that sensing, because it would create an agent outside of the nominal sensing channels.

Now take an AI regulator, and their understanding and deciding. How might those be inadequate? They might simply have insufficient capacity for the needed analysis and foresight: I think a quite common failure mode for regulators. On the more adversarial side, they could be subject to lobbying or capture — and this could be compromising their understanding or even their decision-making process. Even if they have perfectly adequate sensing and actuation affordances in principle, they might not choose to use them in the way that we'd hope.

For a final example, take the slightly esoteric system 'society' or 'the public at large', and its ability to act. Well, generic economic or political disempowerment of course (perhaps tautologically) means that even if there were actions we might have wanted to use to apply pressure to something we think is unsafe, those actions might not be available or effective any more. Also, rogue replication! Whatever actions society might have wanted to take to get an AI system under control, if it's replicating, that's much harder. Replication is a big deal, and we wrote a paper on it at AISI.

An abstract picture of AI development and operation

Let's really simplify and adapt Leveson's generic development and operations diagram for the case of AI. This is sweeping huge amounts of detail under the rug for now.

There's an agent. That's important. There's a user. (That might be a person or an organisation or some other system, perhaps automated.)

The agent gets there by a combination of the output of a research and development process and control by a deployment and monitoring system. Those are in turn controlled or overseen by company leadership. (There might be more than one company involved.)

These companies are meant to be somewhat kept in order by 'society' — think regulators, courts, insurance, public associations, and so on.

Already, even at this very coarse level of granularity, there's a lot we could say by inspecting and enumerating all these relationships. What sensing and acting affordances are there (or might we want there to be) here? How are understanding and decision-making happening: could they be improved? And of course one of the nice things about this language is that we can recursively unpack things if we want, most obviously here in 'R&D' and 'Society' which of course have far more detail to them.
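The coarse picture above can be written down as a small directed graph, which makes 'enumerating the relationships' and 'thinking up the chain' mechanical. This is only a sketch: the node names are my shorthand for the components described above, the helper names are hypothetical, and I have left the user out because the text does not say which way control runs between user and agent.

```python
# Who oversees whom, read off the coarse picture described in the text.
# Keys are controllers; values are the processes they control or oversee.
OVERSEES = {
    "society": ["company leadership"],
    "company leadership": ["R&D", "deployment and monitoring"],
    "R&D": ["agent"],
    "deployment and monitoring": ["agent"],
    "agent": [],
}


def up_the_chain(target: str) -> set[str]:
    """Every system that directly or indirectly oversees `target`."""
    direct = {c for c, controlled in OVERSEES.items() if target in controlled}
    upstream = set(direct)
    for controller in direct:
        upstream |= up_the_chain(controller)
    return upstream


def audit_questions() -> list[str]:
    """Two starter questions for every oversight relationship in the graph."""
    questions = []
    for controller, controlled in OVERSEES.items():
        for process in controlled:
            questions.append(f"Sensing: how does {controller} observe {process}?")
            questions.append(f"Acting: how can {controller} constrain {process}?")
    return questions


print(sorted(up_the_chain("agent")))
for question in audit_questions():
    print(question)
```

Unpacking a node such as 'R&D' or 'society' into its own subsystems would just mean replacing it with a finer-grained subgraph and rerunning the same questions.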

What makes AI special

For now, I want to focus on something which makes AI a bit different.

Agents are increasingly capable and have increasingly general affordances. That means that, whether directed to by someone or acting out a misaligned objective, AI can in principle act back on any level of this supposed control hierarchy.

That turns this into potentially an adversarial, security engineering problem, and it brings us to the vicious cycles I mentioned earlier.

For me, it's right to think of 'loss of control' as often a matter of a breach or compromise at one or more of these levels, followed by escalations. Not just a singular event, but unfolding in potentially vicious cycles. Even if we can identify some final 'point of no return', preceding that there was some series of events that precipitated it.

There might be several acute events: perhaps it's a breach of containment by an agent, or a passing into law of a bill harmful to oversight. Or the failure to pass a particular law, perhaps missing an opportunity window to do that. It might be a particularly ill-advised deployment decision. It might be a very consequential court ruling. We can think of these as the overall system migrating towards an increasingly, and eventually entirely, irrecoverable loss of control.

Vicious cycles

Some of these might be obvious, but let's enumerate a few of those ways AI systems could act back on the oversight hierarchy.

Propaganda, epistemic disruption, and political influence — harm democratic deliberation and political decisionmaking.

Lobbying and legal influence — could erode regulatory and court oversight.

R&D automation — potentially interrupts quite a few feedback and oversight relationships! One which I think is underdiscussed: fewer human participants means fewer whistleblowers, less internal scrutiny, less governance and decisionmaking robustness. More concentration of that influence. That could mean more single points of failure. It’s famously difficult to maintain a conspiracy of more than one or two people: ‘two can keep a secret if one is dead’, as they say! And besides conspiracy, compared to larger teams, individuals and small groups may be far more susceptible to capture, coercion, corruption, or plain foolishness and rash decisionmaking.

Backdooring, sabotage, poisoning — all great ways to break oversight in R&D.

Exfiltration, privilege escalation — these are bread and butter for compromising monitoring oversight, whether from ‘inside’ or ‘outside’.

Trickery and coercion — most tools don’t do this. AI does. Already this is sometimes breaking some people’s model of their relationship with AI and sometimes their wider situation.

Virtuous cycles?

BUT. Something else we can also do is ask how the new AI building blocks in our repertoire could be used to fortify some of these situations.

This is an area the Future of Life Foundation is really interested in, along with some collaborators at Forethought.

Really briefly, some things that immediately stand out when looking at this picture.

There are so many tantalising opportunities for improving how societies communicate and coordinate! Tools supporting collective epistemics, democratic discourse, conflict resolution, coordination.

AI tech itself is moving fast, and other tech may be accelerated and disrupted soon — but ‘open source intelligence’ and more targeted foresight can both benefit from well-designed AI-powered applications. That could be one of the more promising ways to improve agility of societal decision-making — recall the governance dilemma of slow-moving government and regulatory oversight.

Ironically, R&D automation or acceleration, directed sensibly, could provide an antidote to some risks. It’s double-edged. There are all kinds of possibilities here, including perhaps development of hardware and support for confidential oversight of high-stakes systems. Applying AI-assisted efforts here might make a difference between having these useful aids at an important time or not.

Proposals for scalable oversight and for automating AI interpretability inherently depend on the right capabilities being reliably usable from AI.

Being able to attribute AI outputs, and perhaps more importantly AI activity, could be crucial to keeping the running of economic, industrial, cultural, and other societal functions understandable and monitorable. It might be that AI-powered ‘forensics’ and provenance investigations end up making this more achievable.

Counteracting deception and other manipulations, we may be in a position to develop fiduciary or guardian AI systems — tools and assistants which people can slot into their workflows to protect them from trickery and other confusions as the world gets more complex around us.

Summing up

Engineering a Safer World — worth a read! Especially the first chapters.

Safety is a system property: even acute disasters have systemic failures we can make healthier.

Think up the chain: safety in something as encompassing as AI is a societal-level sociotechnical problem. Compromise at any of these levels could initiate or escalate loss of control.

AI is very special; it can ‘act back’ on the nominal control hierarchy. That could give rise to vicious cycles, one of the main classes of mechanisms by which AI loss of control could escalate to truly terminal states.

Not all cycles are vicious! AI, used well, could also help robustify and improve the health of some of the mechanisms and systems we use to live and work together safely.

^[1] Modulo concerns about over-eagerly optimising for safety, especially where that cuts into other aspects of flourishing.
^[2] Not to be confused with the Future of Life Institute! We're good friends, but different organisations with different approaches.

RedMan, 2mo

https://www.lesswrong.com/posts/DQKgYhEYP86PLW7tZ/how-factories-were-made-safe#jQBYeQq5ZqnLr24Ax I'm happy to see this post, I've been hoping someone would take this and run with it for years.

I think STAMP and the related CAST/STPA methodologies are going to be way more effective at preventing bad outcomes than the current methods.
