What Feels Wrong Today
Reading through the recent “vision for alignment” writings, I notice a convergence in underlying posture even when the techniques differ: alignment is increasingly framed as a problem of 'control'. The dominant focus is on how to constrain, monitor, and recover from systems we do not fully trust, and the primary safety leverage is placed in externally applied oversight, evaluation, and safeguard stacks.
This does not mean control-oriented approaches are wrong. Control is an essential component of any system deployment. What feels wrong is that control now appears to dominate how alignment is conceptualized, crowding out other ways of thinking about what it would mean for advanced systems to behave well.
This dominance of control-reliant framing is echoed in the stated visions and research directions of the frontier labs. Some examples...
Anthropic's Vision
Anthropic provides an unusually clear articulation of the control-first posture, both in informal alignment writing and in formal governance commitments. On the institutional side, its Responsible Scaling Policy (RSP) and subsequent updates define safety in terms of graduated safety and security measures that scale with model capability:
“AI Safety Level Standards (ASL Standards) are core to our risk mitigation strategy… As model capabilities increase, so will the need for stronger safeguards…”
The update reiterates a commitment “not to train or deploy models unless we have implemented adequate safeguards.”
In the recent essay Alignment remains a hard, unsolved problem, the author frames 'outer alignment' largely as the problem of oversight, writing (emphasis mine):
“the problem of scaling up human oversight … to … oversee systems that are smarter than we are.”
“behavioral oversight is very likely to get harder and harder as models get more capable…”
Interpretability is discussed in this essay largely as a way to preserve or extend these oversight/feedback loops, including “monitor[ing] for misalignment during training.”
OpenAI's Vision
OpenAI’s public alignment and safety posture is articulated most clearly through its Preparedness Framework, which specifies how alignment is to be operationalized as models advance. The framework emphasizes identifying and measuring severe risks from increasingly capable models, and tying deployment decisions to the presence of appropriate safeguards.
The framework further structures alignment around capability tracking, risk thresholds, and mitigations, categorizing risks (e.g., cyber, bio, autonomy, persuasion) and gating deployment on the corresponding safeguards.
OpenAI’s more recent domain-specific safety posts follow the same pattern. For example, in its discussion of AI and cyber risk, OpenAI frames safety as a matter of layered defences, combining monitoring, access controls, and policy enforcement to prevent misuse as capabilities grow.
Google DeepMind's Vision
Google DeepMind’s public “vision for alignment” reads as a two-track control stack: (1) build better model-level mitigations (training + oversight), but (2) assume you still need system-level safety frameworks - capability thresholds, monitoring, access controls, and protocols that kick in as risks rise.
This is stated quite explicitly across the places where DeepMind actually lays out its strategy:
DeepMind’s AGI safety write-up (as summarized on the Alignment Forum) distinguishes model-level mitigations from system-level control.
On DeepMind’s blog, Taking a responsible path to AGI foregrounds proactive risk assessment, monitoring, and security measures. It explicitly says:
“Through effective monitoring and established computer security measures, we’re aiming to mitigate harm…” and ties “transparency” to interpretability as a facilitating layer: “We do extensive research in interpretability…”
Their Frontier Safety Framework updates make the “thresholds + protocols” posture concrete. In the 2025 strengthening update, they set out Critical Capability Levels (CCLs) and expand protocols around misalignment scenarios.
Generalizing
These reflect the same broader pattern: alignment is conceptualized primarily as a problem of applying the right type and right level of control.
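To make this shared shape concrete, here is a deliberately minimal sketch of a “thresholds + protocols” control stack. Everything in it is hypothetical - the level names, thresholds, and safeguard labels are illustrative stand-ins, not taken from the RSP, the Preparedness Framework, or the Frontier Safety Framework - but it captures the structure all three describe: measure capability, map the measurement to a level, and gate deployment on the safeguards required at that level.

```python
# A toy "thresholds + protocols" control stack. All level names, thresholds,
# and safeguard labels are hypothetical stand-ins, not any lab's actual framework.

CAPABILITY_LEVELS = [
    # (minimum eval score, level name, safeguards required before deployment)
    (0.0, "CL-1", {"baseline_monitoring"}),
    (0.5, "CL-2", {"baseline_monitoring", "access_controls"}),
    (0.8, "CL-3", {"baseline_monitoring", "access_controls",
                   "weight_security", "misuse_red_teaming"}),
]

def classify(eval_score: float):
    """Map a measured capability score to the highest level whose threshold it meets."""
    level, required = CAPABILITY_LEVELS[0][1], CAPABILITY_LEVELS[0][2]
    for threshold, name, safeguards in CAPABILITY_LEVELS:
        if eval_score >= threshold:
            level, required = name, safeguards
    return level, required

def may_deploy(eval_score: float, implemented: set) -> bool:
    """Deployment is gated on having implemented every safeguard the level requires."""
    _, required = classify(eval_score)
    return required <= implemented  # set inclusion: all required safeguards are present

# A model scoring 0.82 cannot ship with only monitoring and access controls...
print(may_deploy(0.82, {"baseline_monitoring", "access_controls"}))  # False
# ...but it can once the stronger safeguards are in place.
print(may_deploy(0.82, {"baseline_monitoring", "access_controls",
                        "weight_security", "misuse_red_teaming"}))   # True
```

Note what this structure encodes and what it leaves out: it says a great deal about how tightly to constrain the system at each level, and nothing about why the system might fail in the first place.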
A recent field-level snapshot reinforces this picture in unusually direct terms. In AI in 2025: gestalt, a year-end synthesis of the technical alignment landscape, the current state of the field is described as follows:
“The world’s de facto strategy remains ‘iterative alignment’, optimising outputs with a stack of alignment and control techniques everyone admits are individually weak.”
The same piece is explicit about the fragility of this approach, noting that:
“current alignment methods are brittle”
and that many of the practical gains observed over the past year are not due to fundamental improvements in model robustness, but instead come from external safeguards, such as auxiliary models and layered defences. Alignment is seen less as a principled account of what trustworthy system behavior should look like, and more as an accumulation of layered mitigations: evaluate, patch, scaffold, monitor, repeat.
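The loop itself is easy to caricature. The sketch below is purely illustrative - every function and name in it is a hypothetical stand-in rather than any lab's actual pipeline - but it makes the point visible in the shape of the code: the trustworthiness of the deployed system lives in the outer scaffolding and the ever-growing eval list, not in the model itself.

```python
# A caricature of the "evaluate, patch, scaffold, monitor, repeat" loop.
# Every object and function here is a hypothetical stand-in, not a real pipeline.

def evaluate(model: dict, evals: list) -> list:
    """Return the evals the model currently fails."""
    return [e for e in evals if not model.get(e, False)]

def patch(model: dict, failure: str) -> dict:
    """Stand-in for targeted fine-tuning: fix exactly the behavior that was flagged."""
    return {**model, failure: True}

def scaffold(model: dict, safeguards: list) -> dict:
    """Wrap the model in external safeguards (filters, auxiliary models, access controls)."""
    return {"model": model, "safeguards": list(safeguards)}

def monitor(deployment: dict) -> list:
    """Stand-in for post-deployment monitoring that surfaces new failure modes."""
    return ["newly_observed_failure_mode"]

model = {"refuses_bio_uplift": True, "resists_prompt_injection": False}
evals = ["refuses_bio_uplift", "resists_prompt_injection"]

for _ in range(2):                                    # repeat
    for failure in evaluate(model, evals):            # evaluate
        model = patch(model, failure)                 # patch
    deployment = scaffold(model, ["output_filter",    # scaffold
                                  "auxiliary_classifier"])
    evals += monitor(deployment)                      # monitor: the checklist only grows
```

Nothing in this loop distinguishes why a given eval failed; every failure, whatever its cause, is answered with the same combination of patching and tighter scaffolding.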
This does not mean the approach is irrational or misguided. Iterative alignment may well be the only viable strategy under current constraints.
So, what feels wrong is the reliance on a single dominant framing when the stakes are this high. Alignment remains hard not only because of technical difficulties, but also because of how the problem is being conceptualized.
The Need for a Cause-Based Framing of AI Alignment
There are indeed headline risks that frontier labs, policy bodies, and governance frameworks repeatedly emphasize. These include biological and chemical risks, cybersecurity, autonomy and long-range autonomy, AI self-improvement, autonomous replication and adaptation, catastrophic harm, and harmful manipulation, among others[1]. These risks are real, and they rightly motivate much of the current emphasis on monitoring, safeguards, and deployment constraints. They arise as system development progresses along different dimensions[2].
However, these descriptions primarily characterize what failure might look like at scale, not why such failures arise. They enumerate outcomes and scenarios, rather than the underlying kinds of system-level or interaction-level breakdowns that make those outcomes possible. When risks are framed primarily in terms of catastrophic end states rather than underlying causes, the most natural response is containment and control. In the absence of a model of how different risks are generated, alignment defaults to patching visible failures as they appear.
This framing has real costs. Without a cause-based model of risk, it becomes difficult to:
distinguish failures that originate inside the system from those that arise at the interface with the world
reason about which risks are likely to compound
assess which mitigation strategies target root causes versus surface behaviors
identify where intrinsic system properties might meaningfully reduce risk rather than merely requiring tighter control
Instead, all risks - ranging from misuse and manipulation to self-improvement and long-horizon autonomy - end up being treated as instances of a single problem: insufficient control. The result is a conceptual flattening, where alignment is approached as a matter of applying the right amount of oversight, monitoring, and restriction, rather than as a problem with multiple distinct failure causes.
If different failures arise from different kinds of breakdowns, then treating alignment as a single control problem is conceptually inadequate.
In the next post, I propose a cause-based grouping of AI risk. The goal is not to add yet another list of dangers, but to make explicit the underlying reasons that generate them and, in doing so, to open up a broader conversation about how alignment might be pursued beyond control alone.
[1] References for risks named by some frontier labs and policy bodies:
[2] In earlier work, I explored where different AI risks tend to arise by mapping them across interacting dimensions of capability, cognition, and beingness.