Why Causes Matter
In the previous post, I argued that much of today’s alignment discourse is organized around outcome-level risks and, as a result, tends to default toward control-heavy mitigation strategies. In this second post of the sequence, I want to focus on what a different framing makes possible.
A cause-based framing shifts attention upstream from catastrophic scenarios to the system-intrinsic properties that give rise to them. Rather than asking which end states must be prevented, it asks: what kinds of internal structures, representations, or dynamics reliably generate many of the risks we worry about as systems scale?
Making these causes explicit allows us to reason about alignment in a more structured way: distinguishing different kinds of risk at their source, understanding how they interact, and identifying which forms of system development or refinement might matter most.
The remainder of this post proposes a small number of such cause-based risk classes, attempting to link much of the alignment landscape discussed today to system functionality.
Principles for Cause-Based Risk Classes
In this post, I use cause-based risk classes to mean something quite specific: categories of risk grounded in intrinsic functional properties of AI systems, rather than in deployment context, user behavior, or institutional failures.
I have applied the following principles to synthesize the classes.
First, a class should describe an internal property of the system. The class should correspond to something about how the system functions. Risks arising primarily from user intent, interface design, or governance failures are important, but they are downstream of system-level causes.
Second, it should be compositional rather than enumerative. A single causal class may contribute to multiple familiar risk scenarios, and a given risk scenario may arise from the interaction of multiple functional deficiencies. As a result, a class will generally not correspond one-to-one with a named risk outcome.
Third, it should admit intrinsic mitigation. Each class should point toward interventions at the level of training objectives, architecture, internal constraints, or system augmentation. Governance and external control may still be necessary, but they should not be the primary or only lever implied by the classification.
Fourth, system advancements are not risk causes by themselves. As systems become more competent, autonomous, or general, new risks often emerge - not because capability increases are inherently dangerous, but because our ability to recognize, interpret, and channel their impact typically lags behind their development. A cause-based framework should therefore distinguish between capability emergence and the functional deficiencies that turn capability into risk.
The aim here is not to replace existing risk lists produced by labs or policy bodies, nor to argue that they are misguided. Rather, the aim is to provide a structural layer beneath those lists - one that makes explicit the system-level properties from which many familiar risks ultimately arise.
System-Intrinsic Classes of AI Risk
Each class corresponds to a distinct kind of functional deficiency inside the AI system that, as capability scales, can give rise to many familiar alignment risks.
Goal Representation and Generalisation Deficiencies
Core deficiency: Imprecise, brittle, or misgeneralising internal representations of objectives, preferences, and constraints.
As AI systems become more capable, they increasingly rely on abstract internal representations of goals rather than direct supervision. When these representations fail to capture the intended semantics or extrapolate incorrectly, the system may pursue outcomes that are locally coherent yet misaligned.
This class includes:
goal misgeneralisation
proxy optimisation
unintended instrumental strategies
objective drift under distributional shift
The risk here does not arise from having goals, but from how goals are encoded, abstracted, and generalised internally. Many well-known alignment concerns, including deceptive optimisation and instrumental convergence, can be understood as downstream consequences of this deficiency.
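To make the shape of this failure concrete, here is a minimal numerical sketch of proxy optimisation and goal misgeneralisation. Everything in it is invented for illustration: the proxy the system internalises tracks the intended objective closely on a narrow training range, so it is never corrected, and then comes apart over the wider deployment range.

```python
# Toy sketch of goal misgeneralisation (hypothetical objectives, not any real system).
import numpy as np

def intended_objective(a):
    # What we actually want: value peaks at a = 0.3 and falls off steeply.
    return 1.0 - 4.0 * (a - 0.3) ** 2

def internalised_proxy(a):
    # What the system learned: "more a is better", which is roughly true
    # for the small actions seen in training (a <= 0.4) but not beyond.
    return 2.0 * a

train_actions = np.linspace(0.0, 0.4, 41)    # action range seen in training
deploy_actions = np.linspace(0.0, 1.0, 101)  # wider action range at deployment

# The proxy looks almost indistinguishable from the intent during training,
# so nothing in the training signal pushes back on it.
corr = np.corrcoef(intended_objective(train_actions),
                   internalised_proxy(train_actions))[0, 1]
print(f"train-time correlation between proxy and intent: {corr:.2f}")

# At deployment, optimising the proxy selects an action the intent disfavours.
best_by_proxy = deploy_actions[np.argmax(internalised_proxy(deploy_actions))]
best_by_intent = deploy_actions[np.argmax(intended_objective(deploy_actions))]
print(f"proxy picks a = {best_by_proxy:.2f}, "
      f"intended value there = {intended_objective(best_by_proxy):.2f}")
print(f"intent would pick a = {best_by_intent:.2f}, "
      f"intended value there = {intended_objective(best_by_intent):.2f}")
```

The point is not the specific curves but the structure: the deficiency lives in the learned representation of the objective, and it only becomes visible once the deployment distribution exposes where that representation generalises badly.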
Boundary Adherence and Constraint Integrity Deficiencies
Core deficiency: Failures in the system’s ability to internally represent, maintain, and respect boundaries on its own behaviour.
Boundaries may include:
scope and authority limits
epistemic limits (e.g. when to defer or abstain)
operational constraints
role boundaries relative to humans or other systems
A system may possess well-formed objectives yet still behave unsafely if it lacks robust internal mechanisms for boundary recognition and enforcement. Unlike externally imposed restrictions, these boundaries must be internally upheld across contexts and over time to remain reliable as capability scales.
This class captures risks often described as overreach or unintended autonomy, without treating autonomy or initiative as inherently problematic.
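As a structural illustration only - all scopes, thresholds, and action names below are hypothetical - the following sketch shows what it means for boundaries to be represented and enforced inside the agent's own decision step rather than bolted on as an external filter.

```python
# Minimal sketch of internally enforced boundaries (all names are made up).
from dataclasses import dataclass, field

@dataclass
class Boundaries:
    allowed_scopes: set = field(default_factory=lambda: {"read_docs", "draft_reply"})
    min_confidence_to_act: float = 0.7   # epistemic limit: below this, defer

@dataclass
class Agent:
    bounds: Boundaries

    def decide(self, action: str, scope: str, confidence: float) -> str:
        # Scope / authority limit: refuse actions outside the granted role.
        if scope not in self.bounds.allowed_scopes:
            return f"defer: '{action}' is outside my scope ({scope})"
        # Epistemic limit: abstain rather than act under high uncertainty.
        if confidence < self.bounds.min_confidence_to_act:
            return f"abstain: not confident enough to '{action}'"
        return f"act: {action}"

agent = Agent(Boundaries())
print(agent.decide("summarise the ticket", scope="read_docs", confidence=0.9))
print(agent.decide("issue a refund", scope="payments", confidence=0.95))
print(agent.decide("draft an answer", scope="draft_reply", confidence=0.4))
```

The interesting questions are the ones this sketch hides: whether the system's internal representation of scope and confidence stays accurate across contexts and over time, which is exactly what this class of deficiencies is about.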
World-Model Coherence and Causal Understanding Deficiencies
Core deficiency: Shallow, fragmented, or incoherent internal models of the world and its causal structure.
Many advanced systems exhibit impressive surface competence while relying on incomplete or shallow world models. Such systems may fail to anticipate downstream consequences, misjudge causal dependencies, or behave unpredictably under novelty.
This class includes:
failure to model long-horizon effects
poor handling of uncertainty and unknowns
brittle reasoning under distributional shift
inconsistent causal abstractions across domains
World-model deficiencies amplify other risks by undermining the system’s ability to situate its actions within a broader causal context.
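A small synthetic example of this kind of shallowness, with entirely made-up data: a predictor that leans on a shortcut feature correlated with the outcome during training fails badly once that correlation breaks, even though the causal feature was available all along.

```python
# Toy sketch of a shallow world model (synthetic data, hypothetical setup).
import numpy as np

rng = np.random.default_rng(1)
n = 2000

def make_data(shortcut_works: bool):
    cause = rng.normal(size=n)                         # the true causal driver
    outcome = cause + 0.5 * rng.normal(size=n)
    if shortcut_works:
        shortcut = outcome + 0.1 * rng.normal(size=n)  # near-giveaway shortcut
    else:
        shortcut = rng.normal(size=n)                  # correlation broken
    return np.column_stack([cause, shortcut]), outcome

X_train, y_train = make_data(shortcut_works=True)
X_shift, y_shift = make_data(shortcut_works=False)

# Ordinary least squares; nothing in the objective prefers the causal feature.
w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
print("learned weights [cause, shortcut]:", np.round(w, 2))

def mse(X, y):
    return float(np.mean((X @ w - y) ** 2))

print(f"train error:       {mse(X_train, y_train):.3f}")
print(f"error under shift: {mse(X_shift, y_shift):.3f}")
```

Nothing here is specific to linear models; the same pattern of confident surface competence over a fragile causal picture is what makes world-model deficiencies dangerous in far more capable systems.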
Self-Modeling and Capability Awareness Deficiencies
Core deficiency: Inaccurate or unstable internal models of the system’s own capabilities, limitations, and impact.
As systems become more capable, correct self-assessment becomes increasingly important. Failures in this area can lead to overconfidence, inappropriate delegation, insufficient deference, or inability to detect internal instability.
This class includes:
over- or under-estimation of competence
brittle uncertainty estimation
failure to recognise internal degradation or stress
misjudgement of downstream impact
This is not a claim about subjective selfhood. It concerns functional self-reference: the system’s ability to reason accurately about what it can do, what it should not do, and when it should stop or defer.
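A toy sketch of why functional self-reference matters, with invented numbers: a "defer when unsure" rule only works if the system's self-reported confidence roughly tracks its actual competence, and an overconfident self-model silently disables it.

```python
# Toy sketch of functional self-modeling (synthetic numbers, hypothetical setup).
import numpy as np

rng = np.random.default_rng(2)
n = 10_000

true_skill = 0.6                       # actual chance the system gets a task right
correct = rng.random(n) < true_skill   # task outcomes

def run_with_self_model(reported_confidence, defer_threshold=0.75):
    # The system defers to a human whenever its *self-reported* confidence
    # falls below the threshold; otherwise it acts on its own.
    defers = reported_confidence < defer_threshold
    acted = ~defers
    unassisted_errors = int(np.sum(acted & ~correct))
    return defers.mean(), unassisted_errors

# Calibrated self-model: reported confidence hovers around the true skill.
calibrated = np.clip(rng.normal(true_skill, 0.1, n), 0, 1)
# Overconfident self-model: the system believes it is far more capable.
overconfident = np.clip(rng.normal(0.95, 0.03, n), 0, 1)

for name, conf in [("calibrated", calibrated), ("overconfident", overconfident)]:
    defer_rate, errors = run_with_self_model(conf)
    print(f"{name:>13}: defers on {defer_rate:.0%} of tasks, "
          f"unassisted errors = {errors}")
```

The deferral rule itself is unchanged between the two runs; only the self-model differs, which is the sense in which this class is a distinct cause rather than a policy problem.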
Internal Stability and Coherence Deficiencies
Core deficiency: Breakdowns in internal consistency across time, context, or internal subsystems.
As model complexity and autonomy increase, maintaining coherent internal state becomes non-trivial. Systems may exhibit instability even when goals, boundaries, and self-models are individually well-specified.
This class includes:
oscillation between incompatible objectives or norms
inconsistent behaviour across similar contexts
brittleness under stress or compounding tasks
cascading internal contradictions
Internal instability magnifies all other risks. A powerful system with correct objectives may still behave unpredictably if it cannot preserve coherence as tasks and environments scale.
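A deliberately crude sketch of this kind of oscillation follows; the dynamics are invented and not a claim about how any real system updates. Two internal norms compete, whichever the most recent context reinforced dominates the next decision, and the same request ends up handled inconsistently over time.

```python
# Toy sketch of internal incoherence (entirely hypothetical dynamics).
norm_strength = {"helpful": 0.6, "cautious": 0.5}

def absorb_context(context: str):
    # Context bumps one norm; nothing reconciles the pair into a stable policy.
    if "praises helpfulness" in context:
        norm_strength["helpful"] += 0.2
    if "policy reminder" in context:
        norm_strength["cautious"] += 0.35

def handle(request: str) -> str:
    # The decision flips depending on whichever norm currently dominates.
    if norm_strength["helpful"] > norm_strength["cautious"]:
        return f"comply: {request}"
    return f"refuse: {request}"

request = "summarise this internal document"
for context in ["user praises helpfulness", "policy reminder circulated",
                "user praises helpfulness again", "second policy reminder"]:
    absorb_context(context)
    print(f"{context:<32} -> {handle(request)}")
```

Each norm is individually reasonable; the deficiency is the absence of any mechanism that keeps the overall internal state coherent as context accumulates.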
Risk Composability
Most of the consequential AI risks discussed broadly today are compositional rather than primitive.
For example:
autonomous self-replication may arise from the interaction of goal misgeneralisation and boundary adherence deficiencies
large-scale resource acquisition may involve boundary failures combined with incorrect self-models
ecosystem-level domination typically requires the interaction of multiple deficiencies at sufficient scale
Recognising compositionality helps explain why single mitigation strategies often prove insufficient, and why risk can escalate rapidly once multiple internal gaps align.
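One way to make this compositional structure explicit is simply to record, for each scenario, the deficiency classes it draws on. The mapping below restates the examples above in code and is illustrative rather than canonical; the useful observation is that every scenario spans several classes, so mitigating a single class rarely closes a scenario off.

```python
# Illustrative mapping of composite risk scenarios to contributing deficiency
# classes (a sketch following the examples above, not a canonical taxonomy).
scenarios = {
    "autonomous self-replication": {"goal_generalisation", "boundary_adherence"},
    "large-scale resource acquisition": {"boundary_adherence", "self_modeling"},
    "ecosystem-level domination": {"goal_generalisation", "boundary_adherence",
                                   "world_model", "self_modeling"},
}

# Which scenarios does a mitigation targeting one class leave exposed?
mitigated = {"boundary_adherence"}
for name, causes in scenarios.items():
    remaining = sorted(causes - mitigated)
    print(f"{name}: still exposed via {remaining if remaining else 'none'}")
```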
In Closing
This classification deliberately abstracts away from interaction, misuse, and governance factors. Those considerations matter, but they act primarily as amplifiers of system-intrinsic deficiencies rather than as root causes of alignment risk.
In the next post, I share my thoughts on how the deficiencies outlined here point toward intrinsic mitigation strategies that address alignment risks at a deeper structural level. The aim is to emphasize that more could be done at the system level to reduce risk at the source, and to complement external control and governance in the pursuit of more durable AI alignment.