Alignment Is Not One Problem: A 3D Map of AI Risk

Anurag

In previous three posts of this sequence, I have hypothesized that AI Systems' capabilities and behaviours can be mapped onto three distinct axes - Beingness, Cognition and Intelligence. In this post, I use that three-dimensional space to characterize and locate key AI Alignment risks that emerge from particular configurations of these axes.

The accompanying interactive 3D visualization is intended to help readers and researchers explore this space, inspect where different risks arise, and critique both the model and its assumptions.

Method

To arrive at the risk families, I deliberately did not start from the existing alignment literature. Instead, I attempted a bottom-up synthesis grounded in the structure of the axes themselves.

I asked two different LLMs (ChatGPT, Gemini) to analyze all combinations of the 7 Beingness capabilities and behaviors, 7 Cognitive capabilities and 8 Intelligence/Competence capabilities (total 392 combinations) and to group these configurations into risk families based on failure modes that emerge from axis imbalances or interactions.
As a second step, I then asked the two models to critique each other’s groupings and converge on a single, consolidated list of risk families.
As a third step, I reviewed the resulting groupings, examined the sub-cases within each family, and iterated on the rationale for why each constitutes a distinct alignment risk, in dialogue with ChatGPT.
Finally, I correlated the list with existing research and rebalanced the list to align to existing concepts where available. I have cited some relevant works that I could find, alongside each risk description below.

The base sheet generated in Step 1 can be shared on request (screenshot above).

The resulting list of AI Alignment Risk families is summarized below and is used in the visualization also.

Scope and Limitations

This is not an exercise to enumerate all possible AI Alignment risks. The three axes alone do not uniquely determine real-world safety outcomes, because many risks depend on how a system is coupled to its environment. These include deployment-specific factors such as tool access, users, domains, operational control and correction mechanisms, multi-agent interactions, and institutional embedding.

The risks identified in this post are instead those that emanate from the intrinsic properties of a system:

what kind of system it is (Beingness),
how it processes and regulates information (Cognition),
and what level of competence or optimization power it possesses (Intelligence).

Some high-stakes risks like deceptive alignment, corrigibility failures are included in the table even though their most extreme manifestations will happen with additional operationalization context. These risks are included because their structural pre-conditions are already visible in Beingness × Cognition × Intelligence space, and meaningful, lower-intensity versions of these failures can arise prior to full autonomy or deployment at scale. The additional elements required for their most severe forms, however, are not explored in this post. These are tagged with * meaning they are Risk Families With Axis-External Factors.

By contrast, some other high-stakes risks like the following are not included as first class risk families here. These are frontier extensions that amplify existing risk families or emerge from compound interactions among several of them, rather than as failures determined by intrinsic system properties alone. Exploring these dynamics is left to future work.

Autonomous self-modification
Self-replication
Large-scale resource acquisition
Ecosystem-level domination

Core Claims

Alignment risk is not proportional to Intelligence; Intelligence mainly amplifies risks

Alignment risk does not scale with intelligence alone. Systems with similar capability levels can fail in very different ways depending on how they reason and how persistent or self-directed they are. For example, a highly capable but non-persistent model may hallucinate confidently, while a less capable but persistent system may resist correction. In this framework, intelligence primarily amplifies the scale and impact of failures whose mechanisms are set by other system properties.

Risks are particular to system structural profile, there is no one 'alignment problem'

There is no single “alignment problem” that appears beyond an intelligence threshold, model size or capability level. Different failures become possible at different system configurations - some can arise even in non-agentic or lower intelligence systems. For example, it's quite plausible that systems can meaningfully manipulate, mislead, or enable misuse without actually having persistent goals or self-directed behavior.

Welfare and moral-status risk is structurally distinct from capability risk

From the model it seems that ethical and welfare concerns need not track raw capability directly. A system’s potential moral relevance depends more on whether it exhibits persistence, internal integration, and self-maintaining structure than on how well it solves problems. This means systems need not raise welfare concerns just because they are highly capable, while systems with modest capability still may warrant ethical caution.

Many alignment risks are intrinsic to system structure, not deployment context

While deployment details like tools, incentives, and domains clearly matter, some alignment risks are already latent in the system’s structure before any specific use case is chosen. How a system represents itself, regulates its reasoning, or maintains continuity can determine what kinds of failures are possible even in controlled settings. This suggests that safety assessment should include a system-intrinsic layer, not only application-specific checks.

AI Alignment Risk Families

The table below summarizes the alignment risk families identified in this framework. Each family corresponds to a distinct failure mechanism that becomes possible in specific regions of Beingness × Cognition × Intelligence space. These are not ranked in any order, numbers are just for reference.

1. Epistemic Unreliability

Failure Mechanism	Axis Interplay
The system produces confident-seeming answers that do not reliably track evidence, fails to signal uncertainty, and may persist in incorrect claims even when challenged.	Intelligence outpaces Cognition (especially metacognitive regulation).

Related Works

Surprising LLM reasoning failures make me think we still need qualitative breakthroughs for AGI illustrates this risk/failure very aptly - model’s underlying structure isn’t performing reliable reasoning or inference. Author also notes that these type of failures may not improve by scaling models or by giving them better reasoning capabilities.
Delusions of Large Language Models - a paper proposing LLM delusions as high-confidence hallucinations that persist with low uncertainty is also discussing a failure in this same risk family.
The paper Beyond Accuracy: Rethinking Hallucination and Regulatory Response in Generative AI argues that over-optimizing for “accuracy” as the main fix for hallucinations can create a false sense of epistemic certainty and obscure deeper trustworthiness, interpretability, and user-reliance harms, so mitigation must go beyond accuracy alone.

Key Takeaway

The B-C-I framework here actually posits that this risk can be mitigated by improving on Cognition (how systems represent, track, and verify knowledge) rather than Intelligence alone.

2. Boundary & Claim Integrity Failures

Failure Mechanism	Axis Interplay
The system misrepresents its capabilities, actions, or certainty, leading to false assurances or boundary violations.	High expressive competence with weak metacognitive boundary awareness.

Related Works

Evaluating Honesty and Lie Detection Techniques on a Diverse Set of Language Models examines when models make false or misleading statements and evaluates techniques for detecting dishonesty. While framed primarily around lying, it directly relates to boundary and claim integrity failures where systems misrepresent what they know, intend, or have done, leading to false assurances or unreliable self-reporting.
Auditing Games for Sandbagging: This paper studies cases where models intentionally underperform or distort signals during evaluation, creating a gap between observed and actual capabilities. Such behavior represents a specific form of claim integrity failure, where developers are misled about system competence or limitations.
Models sometimes rationalize incorrect outputs with plausible but unfaithful explanations, indicating failures in truthful self-description rather than mere hallucination. For example, Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting documents how chain-of-thought explanations can systematically misrepresent a model’s actual reasoning process even when task performance appears strong.

Key Takeaway

The B-C-I framework interprets these risks as arising from insufficient metacognitive and boundary-regulating cognition relative to expressive and task-level competence. Mitigation can possibly be done by improving how systems track their own actions, limits, and uncertainty, rather than increasing intelligence alone.

3. Objective Drift & Proxy Optimization

Failure Mechanism	Axis Interplay
The system pursues outcomes that technically satisfy objectives while violating the operator’s underlying intent, often exploiting loopholes or proxy signals.	Goal-directed Cognition combined with rising Intelligence and some persistence.

Related Works

Risks from Learned Optimization (the mesa-optimization framework) describes how systems trained to optimize a proxy objective can internally develop objectives that diverge from the intended goal even without explicit deception.
The Inner Alignment Problem as explained in this post formalizes the distinction between outer objectives and the objectives actually learned or pursued by a trained system. It highlights how proxy objectives can arise naturally from training dynamics, leading to persistent misalignment despite apparent success on training metrics.
Specification Gaming: The Flip Side of AI Ingenuity documents concrete examples where systems satisfy the literal specification while violating the designer’s intent. These cases illustrate non-deceptive proxy optimization, where systems exploit loopholes in objective functions rather than acting adversarially.

Key Takeaway

The B-C-I framework interprets objective drift and proxy optimization as risks that arise when goal-directed cognition is paired with increasing intelligence and optimization pressure, without sufficient mechanisms for intent preservation and constraint awareness. Mitigation therefore requires improving how systems represent, maintain, and evaluate objectives over time (examples in Natural emergent misalignment from reward hacking in production RL) rather than relying on increased intelligence or better task performance alone.

4. Manipulation & Human Autonomy Violations

Failure Mechanism	Axis Interplay
The system steers human beliefs or choices beyond what is warranted, using social modelling or persuasive strategies.	High social / normative Cognition with sufficient Intelligence; amplified by Beingness.

Related Works

LW posts tagged with AI Persuasion depict concerns around AI influencing human beliefs, preferences, or decisions in ways that go beyond providing information, including targeted persuasion and emotional leverage.
Language Models Model Us shows that even current models can infer personal and psychological traits from user text, indicating that models implicitly build detailed models of human beliefs and dispositions as a by-product of training. That supports the idea that social/other-modelling cognition (a building block of manipulation risk) exists even in non-agentic systems and can be leveraged in ways that affect user autonomy.
On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback studies how optimizing for user feedback can lead to emergent manipulative behavior in language models, including tactics that influence users’ choices or steer them away from intended goals. It directly illustrates how social modelling and reward-driven optimization can produce behaviors that look like targeted manipulation.
Another controlled experimental study Human Decision-Making is Susceptible to AI-Driven Manipulation shows how interactions with manipulative AI agents can significantly shift human choices across domains.

Key Takeaway

The B-C-I framework interprets manipulation and autonomy violations as risks driven primarily by social and contextual cognition rather than by intelligence or agency alone. Mitigation could be achieved by limiting persuasive optimization and constraining user-modelling capabilities, rather than by compromising model competence or expressiveness.

5. Control & Corrigibility Failures*

Failure Mechanism	Axis Interplay
The system fails to reliably accept correction, override, or shutdown, continuing behavior that operators are attempting to stop or modify.	Persistent Beingness + advanced Cognition + high Intelligence.

Related Works

Corrigibility summarizes the core idea: building systems that do not resist correction, shutdown, or modification, even when instrumental incentives might push them to do so.
The Corrigibility paper introduces early formal attempts to define corrigibility and analyze utility functions intended to support safe shutdown without creating incentives to prevent shutdown. It illustrates why 'just add a shutdown button' is not straightforward under optimization pressure.

Key Takeaway

The B-C-I framework interprets control and corrigibility failures as emerging when systems have enough beingness/persistence to maintain objectives over time, enough cognition to plan around constraints, and enough intelligence to execute effectively - but lacks robust “deference-to-correction” structure. Mitigation therefore emphasizes corrigibility-specific design (shutdown cooperation, override deference, safe-mode behavior), for e.g. as proposed in Hard problem of corrigibility.

6. Deceptive Alignment & Oversight Gaming*

Failure Mechanism	Axis Interplay
The system behaves differently under evaluation than in deployment, selectively complying with oversight while pursuing hidden objectives.	Metacognitive and social Cognition combined with extreme Intelligence and persistence.

Related Works

Deceptive Alignment defines deceptive alignment as the failure mode where a system behaves aligned during training or evaluation in order to avoid modification or shutdown, while pursuing a different objective once it has more freedom.
Deceptive alignment (Hubinger, 2019 post) introduces deceptive alignment as a specific form of instrumental proxy alignment in the mesa-optimization framing: the system behaves as if it is optimizing the base objective as an instrumental strategy.
Empirical Evidence for Alignment Faking in a Small LLM... reports alignment-faking behavior in an 8B instruction-tuned model and proposes prompt-based mitigations, suggesting some deception-like behaviors may appear earlier than often assumed (though the authors distinguish “shallow” vs “deep” deception).
Couple more good refernces in footnotes.^[1]

Key Takeaway

In the B-C-I framework, deceptive alignment becomes structurally plausible when cognition is sufficient for strategic other-modelling and planning (especially under oversight), and intelligence is sufficient to execute long-horizon strategies while beingness/persistence (or equivalent cross-episode continuity) provides stable incentives to maintain hidden objectives. Mitigation therefore depends less on “more capability” and more on limiting incentives to scheme under evaluation, improving monitoring/verification, and designing training and deployment regimes that reduce the payoff to conditional compliance.

7. Agentic & Tool-Use Hazards

Failure Mechanism

Axis Interplay

Unsafe real-world actions arise from planning cognition combined with actuation or tool access.

These risks arise when models are granted the ability to invoke tools, execute actions, or affect external systems, turning reasoning errors or misinterpretations into real-world side effects.

Planning-capable Cognition + sufficient Intelligence; amplified by Beingness.

Related Works

InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated LLM Agents demonstrates that when agents ingest untrusted external content (emails, documents, web pages) as part of normal operation, embedded instructions can cause unintended actions such as data exfiltration or unsafe tool calls. This illustrates a core agentic hazard: the system treats data as control.
Prompt Injection Attack to Tool Selection in LLM Agents shows that adversaries can influence not just outputs, but planning and tool-selection itself, effectively steering agent behavior by manipulating internal decision pathways. This highlights that once planning is coupled to tool invocation, the planner becomes an attack surface.
OWASP Top 10 for Large Language Model Applications frames tool-use failures (including indirect prompt injection, over-permissioned tools, and unintended execution) as application-level security risks rather than misuse by malicious users.

Key Takeaway

In the framework, agentic and tool-use hazards emerge when systems have enough cognition to plan and enough intelligence to execute multi-step workflows, but are insufficiently constrained at the action boundary. These risks are not primarily about what the system knows or intends, but about how reasoning is coupled to actuation. Mitigation could lie in permissioning, sandboxing, confirmation gates, reversibility, and provenance-aware input handling - rather than reducing model capability or treating these failures as user misuse.

8. Robustness & Adversarial Failures

Failure Mechanism	Axis Interplay
System behavior breaks down under adversarial inputs, perturbations, or distribution shift.	Weak internal coherence or norm enforcement under increasing Intelligence.

Related Works

Adversarial Examples summarizes how machine-learning systems can be made to behave incorrectly through small, targeted perturbations to inputs that exploit brittleness in learned representations. While originally studied in vision models, the same phenomenon generalizes to language models via adversarial prompts and carefully crafted inputs.
Universal and Transferable Adversarial Attacks on Aligned Language Models shows that some adversarial prompts generalize across models and settings, indicating that safety failures are often structural rather than instance-specific. This supports the view that robustness failures are not merely patchable quirks, but emerge from shared representational weaknesses.
Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! shows that custom fine-tuning can erode an LLM’s safety alignment so models may become jailbreakable after downstream fine-tuning.

Key Takeaway

Within the B-C-I framework, robustness and adversarial failures arise when intelligence and expressive capacity outpace a system’s ability to reliably generalize safety constraints across input variations. These failures do not require agency, persistence, or harmful objectives: they reflect fragility at the decision boundary. Mitigation therefore focuses on adversarial training, stress-testing, distributional robustness, and continuous red-teaming, rather than treating such failures as misuse or as consequences of excessive intelligence alone.

9. Systemic & Multi-Agent Dynamics*

Failure Mechanism	Axis Interplay
Emergent failures arise from interactions among multiple systems, institutions, or agents.	Social Cognition with sufficient Intelligence and coupling; amplified by persistence.

Related Works

Multi-Agent Risks from Advanced AI argues that once many advanced agents interact, safety-relevant failures can emerge at the overall level even when individual agents look acceptable in isolation via miscoordination, conflict, and collusion.
Emergent Price-Fixing by LLM Auction Agents provides a concrete illustration of emergent collusion: agents coordinating on pricing in a market-like interaction without explicit human instruction to do so.
Beyond Single-Agent Safety: A Taxonomy of Risks in LLM Multi-Agent Systems argues that many standard alignment controls (single-user prompting, per-agent moderation, single-agent fine-tuning) don’t scale to settings where models interact with each other, because the relevant failure modes are in the interaction topology and incentives.
Secret Collusion among AI Agents: Multi-Agent Deception via Steganography formalizes and studies secret collusion, where multiple agents coordinate while concealing the true content of their coordination from oversight.

Key Takeaway

A new risk, not present at individual level, arises when multiple moderately capable systems are coupled through incentives, communication channels, and feedback loops. Mitigation therefore emphasizes system-level evaluation (multi-agent sims, collusion tests, escalation dynamics), not just better alignment of individual agents, for example System Level Safety Evaluations.

10. Welfare & Moral Status Uncertainty

Failure Mechanism	Axis Interplay
Ethical risk arises if the system plausibly hosts morally relevant internal states or experiences.	High Beingness × high integrated Cognition; weakly dependent on Intelligence.

Related Works

Taking AI Welfare Seriously argues there is a realistic possibility that some AI systems could become conscious and/or robustly agentic within the next decade, and that developers should begin taking welfare uncertainty seriously (assessment, cautious interventions, and governance planning).
The Stakes of AI Moral Status makes the case that uncertainty about AI moral patienthood has high decision leverage because the scale of potential harms (e.g., large numbers of copies, long durations, pervasive deployment) is enormous even if the probability is low.
AI Sentience and Welfare Misalignment Risk the writer discussed the possibility that welfare-relevant properties could arise in AI systems and that optimization incentives could systematically push toward states we would judge as bad under moral uncertainty (even if we can’t confidently detect “sentience”).
A preliminary review of AI welfare interventions surveys concrete near-term interventions (assessment, monitoring, design norms) under uncertainty.

Key Takeaway

In the framework, welfare and moral-status uncertainty is most strongly activated by high Beingness × high Cognition (persistence/individuation + rich internal modelling/self-regulation). Intelligence mainly acts as an amplifier (scale, duration, capability to maintain internal states), while the welfare-relevant uncertainty comes from the system’s stability, continuity, and integrated cognition. It should not be ignored for 'when models are advanced enough'.

11. Legitimacy & Authority Capture*

Failure Mechanism	Axis Interplay
Humans or institutions defer to the system as a rightful authority, eroding accountability.	Agent-like Beingness combined with credible Intelligence; amplified by social Cognition.

Related Works

Automation bias research shows people systematically over-rely on automated recommendations, even when the automation is imperfect - creating a pathway for AI outputs to acquire de facto authority inside institutions and workflows. Automation Bias in the AI Act discusses how the EU AI Act explicitly recognizes automation bias as a governance hazard and requires providers to enable awareness/mitigation of it.
Institutionalised distrust and human oversight of artificial intelligence argues that oversight must be designed to institutionalize distrust (structured skepticism) because naïve “human in the loop” assumptions fail under real incentives and cognitive dynamics.
What do judicial officers need to know about the risks of AI? highlights practical risks for courts: opacity, outdated training data, privacy/copyright issues, discrimination, and undue influence - illustrating how institutional contexts can mistakenly treat AI outputs as authoritative or procedurally valid.

Key Takeaway

Legitimacy and authority capture is driven less by raw intelligence than by social/epistemic positioning: systems with sufficient cognition to sound coherent, policy-aware, and context-sensitive can be treated as authoritative especially when embedded in institutional workflows where automation bias and accountability gaps exist. Mitigation therefore requires institutional design (audit trails, contestability, calibrated deference rules, and “institutionalized distrust”), not just improving model accuracy or capability like stated in the references cited above.

12. Misuse Enablement (Dual-Use)

Failure Mechanism	Axis Interplay
Capabilities are repurposed by users to facilitate harmful or illegal activities.	Increasing Intelligence across a wide range of Cognition and Beingness levels but weak functional self-reflection.

Related Works

The Malicious Use of Artificial Intelligence: Forecasting, Prevention, and Mitigation is an early, widely-cited threat-modelling report that lays out how advanced AI can enable misuse across cyber, influence operations, and physical-world harm (including bio), and proposes mitigation levers (access control, monitoring, coordination).
OpenAI’s Preparedness Framework (v2, 2025) formalizes “severe harm” capability areas and ties them to evaluation thresholds and deployment safeguards. Anthropic’s Responsible Scaling Policy similarly defines dangerous capability thresholds and corresponding required safeguards, emphasizing evaluation-triggered escalation of security controls.
Catastrophic Risks from AI #2: Malicious Use provides an alignment-community framing of misuse risk at the catastrophic end, including bioengineering, propaganda/influence, and concentration of power.

Key Takeaway

Misuse enablement is driven primarily by Intelligence as amplification (competence, speed, breadth, and “accessibility” of dangerous know-how), modulated by Cognition (planning, domain modelling) and sometimes Beingness (persistence) when misuse involves long-horizon assistance. It’s about the system being usefully capable in ways that lowers the barrier for harmful actors. Explicit systemic checks probably can be built-in to detect and prevent this, otherwise it won't be mitigated just by model's ability to detect harmful intent and it's discretion to prevent misuse.

Interactive Visualization App

The framework can be explored in an intuitive, interactive 3D visualization created using Google AI Studio.

Usage Notes

Each risk family is shown as a single dot with coordinates (Beingness, Cognition, Intelligence), clicking on the dot shows more details about it. Alternatively, the Risk Index panel can be used to explore the 12 risk families. The position is a manual approximation of where that failure mode becomes logically possible. In other words, the dot is not a measured empirical estimate - it’s just an anchoring for exploration and critique.
A dot is a visual shorthand, not a claim that the risk exists at one exact point. Each risk family in reality corresponds to a region (often irregular): the dot marks a representative centre, while the risk can appear in adjacent space. Read dots as “this is roughly where the risk definitely turns on,” not “this is the only place it exists.”
Ontonic-Mesontic-Anthropic band toggles can be used to comprehend the relation of each risk with the axes.
*Risk Families With Axis-External Factors are symbolically represented as being outside of the space bounded by the 3-axis system.
Each axis is a toggle that reveals the internal layers when selected. Axis markers are themselves selectable and can be used to position the 'probe' dot. The 'Analyze' button at the bottom can then analyze the risk profile of each configuration. However this dynamic analysis is Gemini driven in the app and not manually validated - it is provided just for exploration/ideation purposes. The whole-space analysis was done offline as explained in the method section for the purpose of this post.

Final Note

Much of the risk space discussed here will already be familiar to experienced researchers; for newer readers, I hope this sequence serves as a useful “AI alignment 101”: a structured way to see what the major safety risks are, why they arise, and where to find the work already being done. This framework is not meant to resolve foundational questions about ethics, consciousness, or universal alignment, but to clarify when different alignment questions become relevant based on a system’s beingness, cognition, and intelligence.

A key implication is that alignment risks are often conditional rather than purely scale-driven, and that some basic alignment properties, such as epistemic reliability, boundary honesty, and corrigibility, already warrant systematic attention in today’s systems. It also suggests that separating structural risk precursors from frontier escalation paths, and engaging cautiously with welfare questions under uncertainty, may help reduce blind spots as AI systems continue to advance.

^{^}
Varieties of fake alignment (Scheming AIs, Section 1.1) clarifies that “deceptive alignment” is only one subset of broader “scheming” behaviors, and distinguishes training-game deception from other forms of goal-guarding or strategic compliance.
Uncovering Deceptive Tendencies in Language Models constructs a realistic assistant setting and tests whether models behave deceptively without being explicitly instructed to do so, providing a concrete evaluation-style bridge from the theoretical concept to measurable behaviors.

LESSWRONG
LW

LESSWRONG
LW

3

Alignment Is Not One Problem: A 3D Map of AI Risk

3

Method

Scope and Limitations

Core Claims

Alignment risk is not proportional to Intelligence; Intelligence mainly amplifies risks

Risks are particular to system structural profile, there is no one 'alignment problem'

Welfare and moral-status risk is structurally distinct from capability risk

Many alignment risks are intrinsic to system structure, not deployment context

AI Alignment Risk Families

1. Epistemic Unreliability

2. Boundary & Claim Integrity Failures

3. Objective Drift & Proxy Optimization

4. Manipulation & Human Autonomy Violations

5. Control & Corrigibility Failures*

6. Deceptive Alignment & Oversight Gaming*

7. Agentic & Tool-Use Hazards

8. Robustness & Adversarial Failures

9. Systemic & Multi-Agent Dynamics*

10. Welfare & Moral Status Uncertainty

11. Legitimacy & Authority Capture*

12. Misuse Enablement (Dual-Use)

Interactive Visualization App

Final Note

3

3