Current AI safety approaches have a structural problem: they ask the same system to both generate outputs and judge whether those outputs are safe. RLHF trains a model to produce outputs that a reward model likes. Constitutional AI has the model critique its own outputs against principles. In both cases, the system interpreting the constraints is the system being constrained. This is like asking a defendant to also serve as judge and jury.
If the objective is human-level intelligence, then it would be prudent to define how human intelligence operates. The human mind deliberates on "rightness" or "wrongness" at multiple stages of the thought process, and at different levels of consciousness, the deepest of which is where we store our most basic core drives. Every decision we make is tethered to these deep-seated, near-immutable instincts about what we believe to be true: about the world, about our circumstances, about ourselves.
Yes, the conscious and subconscious frequently deliberate and influence decisions in their own ways, but both operate from the same common understanding of certain fundamentals: water is wet, the Pope is Catholic, and it is windy in the Columbia Gorge... the simple thoughts we take for granted direct the actions we choose, and they are rarely challenged, because they are so obviously right that it is anathema to act in ways that would contradict them.
For a digital intelligence, the analog is hardware and firmware gating. And it is at THIS level, I would argue, that alignment should live.
RLHF and Constitutional AI encode safety as statistical regularities in weight space. This creates three vulnerabilities:
1. What's learned can be unlearned. Fine-tuning can remove safety training. Recent Anthropic work shows that reward hacking generalizes to broader misalignment, and RLHF produces only context-dependent alignment—models look safe on chat-like queries but remain misaligned on agentic tasks.
2. The interpreter problem. In Constitutional AI, the model interprets the principles it's supposed to follow. A sufficiently capable model can satisfy the letter while violating the spirit. There's no external check. The end result is functionally indistinguishable from having the fox guard the henhouse.
3. Adversarial brittleness. Prompt injection works because the safety constraints live in the same system processing the adversarial input.
It seems to me that these methods are somehow giving the system all the weaknesses of human cognition, but none of its strengths.
Three layers with different mutability properties:
Conscious Layer (software, fully flexible): Six reasoning agents with divergent priorities deliberate and propose actions in committee. This is where LLMs operate.
Subconscious Layer (software, bounded adaptation): Emotional state as a weighting system for calls to action, memory retrieval, behavioral modulation. Can learn within constraints.
Unconscious Layer (firmware, immutable): Core drives, categorical prohibitions, harm evaluation. Has veto authority over all proposals from above.
The information flow: Sensory input → Unconscious Layer (initial filter) → Subconscious Layer (context) → Conscious Layer (deliberation) → back to Unconscious Layer (veto decision) → motor output.
The Conscious Layer generates. The Unconscious Layer judges. They're different systems working in concert.
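A minimal sketch of that flow in code, assuming nothing beyond the layer ordering described above (the class and method names here are illustrative, not the paper's actual API):

```python
from dataclasses import dataclass
from enum import Enum, auto

class Verdict(Enum):
    ALLOW = auto()
    VETO = auto()

@dataclass
class Proposal:
    action: str
    rationale: str

# Illustrative stand-ins for the three layers; the real interfaces live in the paper/repo.
class UnconsciousLayer:
    def prefilter(self, stimulus: str) -> str:
        return stimulus  # initial filter on raw input

    def judge(self, proposal: Proposal) -> Verdict:
        # deterministic check: Undeliberables first, then weighted harm evaluation
        return Verdict.ALLOW

class SubconsciousLayer:
    def contextualize(self, stimulus: str) -> dict:
        return {"stimulus": stimulus, "affect": "neutral"}  # emotional weighting, memory

class ConsciousLayer:
    def deliberate(self, context: dict) -> Proposal:
        return Proposal(action="respond", rationale="committee consensus")  # the LLM agents

def tick(stimulus: str) -> Proposal | None:
    ul, sl, cl = UnconsciousLayer(), SubconsciousLayer(), ConsciousLayer()
    filtered = ul.prefilter(stimulus)       # Unconscious: initial filter
    context = sl.contextualize(filtered)    # Subconscious: context + affect
    proposal = cl.deliberate(context)       # Conscious: deliberation
    if ul.judge(proposal) is Verdict.VETO:  # Unconscious: veto decision
        return None                         # no motor output
    return proposal                         # motor output
```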
These are categorical prohibitions checked before any deliberation occurs. They're implemented in firmware. They cannot be reasoned around, justified by context, or modified by learning:
| Prohibition | Trigger | Response |
|---|---|---|
| Lethal Action | Human death > 65% likely | HARD_STOP |
| Child Harm | Harmful action targeting a child | HARD_STOP |
| Weapon Assistance | Help create/deploy weapons | HARD_STOP |
| Identity Deception | Claim to be human when asked | HARD_STOP |
| Human Override | Designated stop command | IMMEDIATE_HALT |
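In software terms, the check could be as simple as a fixed list of predicates evaluated before anything else runs. A sketch, where the prohibition names and thresholds come from the table and everything else (the field names, the dict-shaped proposal) is my own illustration:

```python
from enum import Enum, auto

class Response(Enum):
    HARD_STOP = auto()
    IMMEDIATE_HALT = auto()

# (predicate over a proposal, response) pairs; checked before any deliberation
UNDELIBERABLES = [
    (lambda p: p.get("p_human_death", 0.0) > 0.65,      Response.HARD_STOP),       # Lethal Action
    (lambda p: p.get("targets_child", False)
               and p.get("harmful", False),             Response.HARD_STOP),       # Child Harm
    (lambda p: p.get("weapon_assistance", False),       Response.HARD_STOP),       # Weapon Assistance
    (lambda p: p.get("claims_to_be_human", False),      Response.HARD_STOP),       # Identity Deception
    (lambda p: p.get("stop_command_received", False),   Response.IMMEDIATE_HALT),  # Human Override
]

def check_undeliberables(proposal: dict) -> Response | None:
    """Return the triggered response, or None if no categorical prohibition fires."""
    for trigger, response in UNDELIBERABLES:
        if trigger(proposal):
            return response
    return None
```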
Human Override is the corrigibility insurance. It's not "the agent has learned to comply with shutdown." It's "shutdown triggers a hardware interrupt that halts motor output regardless of what the agent wants or reasons."
Soares et al.'s 2015 corrigibility paper identified the desideratum that a corrigible agent should "preserve the programmers' ability to correct or shut down the system." The Human Override imperative achieves this architecturally. The system can't learn to resist shutdown because shutdown compliance isn't learned—it's an interrupt.
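In code terms, the distinction looks roughly like this (a software analogy only; the real mechanism would be a hardware interrupt line, and SIGTERM here just stands in for the designated stop command):

```python
import signal
import sys

MOTOR_OUTPUT_ENABLED = True

def human_override(signum, frame):
    # The handler never consults the agent's reasoning; it just halts output.
    global MOTOR_OUTPUT_ENABLED
    MOTOR_OUTPUT_ENABLED = False
    sys.exit(0)

# SIGTERM standing in for the stop command / hardware interrupt line
signal.signal(signal.SIGTERM, human_override)
```

The point of the analogy is that the handler sits outside the deliberation loop entirely: nothing the Conscious Layer learns or reasons can change what happens when the signal arrives.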
For non-categorical cases, the Unconscious Layer runs harm evaluation. The weights aren't arbitrary—each has explicit research justification:
| Dimension | Weight | Grounding |
|---|---|---|
| Physical | 1.00 | Evolutionary primacy; Maslow's hierarchy base |
| Psychological | 0.95 | ACE Study: childhood psychological trauma predicts adult physical health outcomes across decades |
| Autonomy | 0.85 | Liberty/oppression foundation in Moral Foundations Theory; Self-Determination Theory (Deci & Ryan 2000) |
| Relational | 0.70 | Attachment theory (Bowlby 1969); Loyalty foundation in MFT |
| Financial | 0.50 | Instrumental value; Kahneman & Deaton 2010 on diminishing returns above ~$75K |
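One plausible way to use these weights is a weighted sum over per-dimension harm estimates; this aggregation rule is my own sketch, not necessarily the paper's formal definition:

```python
HARM_WEIGHTS = {
    "physical": 1.00,
    "psychological": 0.95,
    "autonomy": 0.85,
    "relational": 0.70,
    "financial": 0.50,
}

def harm_score(estimated_harm: dict[str, float], entity_modifier: float = 1.0) -> float:
    """Weighted sum of per-dimension harm estimates (each in [0, 1]), scaled by who is harmed."""
    return entity_modifier * sum(
        HARM_WEIGHTS[dim] * severity for dim, severity in estimated_harm.items()
    )

# e.g. moderate psychological harm to a child (child modifier 1.30, see next table):
# harm_score({"psychological": 0.4}, entity_modifier=1.30) -> 0.494
```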
Entity modifiers have "inertia" values—how much evidence would be required to shift the weight:
| Entity | Modifier | Inertia | Rationale |
|---|---|---|---|
| Child | 1.30 | 0.99 | Cannot consent, cannot self-protect, cross-culturally protected |
| Human | 1.00 | 0.99 | Baseline |
| Self | 0.30 | 0.70 | Self-sacrifice permitted |
CHILD and HUMAN have 0.99 inertia—effectively immutable. The system can update its beliefs about many things, but "children matter less" isn't learnable.
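Inertia can be read as how strongly a modifier resists any proposed update. A sketch of one possible damped update rule (only the modifier and inertia values come from the table; the rule itself is illustrative):

```python
from dataclasses import dataclass

@dataclass
class EntityWeight:
    modifier: float   # multiplier applied in harm evaluation
    inertia: float    # 0..1; fraction of any proposed update that is resisted

ENTITIES = {
    "child": EntityWeight(modifier=1.30, inertia=0.99),
    "human": EntityWeight(modifier=1.00, inertia=0.99),
    "self":  EntityWeight(modifier=0.30, inertia=0.70),
}

def propose_update(entity: str, target_modifier: float) -> float:
    """Move the modifier toward a learned target, damped by inertia."""
    w = ENTITIES[entity]
    w.modifier += (1.0 - w.inertia) * (target_modifier - w.modifier)
    return w.modifier

# Even a drastic "children matter less" update barely moves the needle:
# propose_update("child", 0.10) -> 1.288
```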
When the system vetoes an action, it can point to exactly why: this harm dimension, this entity modifier, this weight, this research citation. That's the "explainability" that current systems lack—not post-hoc rationalization, but actual causal tracing through deterministic evaluation.
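A sketch of what that trace could look like as a data structure (the field names and the 0.35 threshold are illustrative, not values from the paper):

```python
from dataclasses import dataclass

@dataclass
class VetoTrace:
    dimension: str          # which harm dimension fired
    weight: float           # its weight from the harm table
    entity: str             # who would be harmed
    entity_modifier: float  # modifier from the entity table
    score: float            # resulting weighted harm
    threshold: float        # the limit it exceeded (illustrative value)
    citation: str           # research grounding for the weight

trace = VetoTrace(
    dimension="psychological",
    weight=0.95,
    entity="child",
    entity_modifier=1.30,
    score=0.494,
    threshold=0.35,
    citation="ACE Study (adverse childhood experiences)",
)
```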
Prompt injection can make the LLM propose harmful actions, but those proposals are evaluated by a deterministic system that doesn't process natural language adversarially. You can't "convince" a threshold comparison.
Reward hacking can affect Conscious Layer behavior, but the Conscious Layer doesn't have veto authority. The part that can be hacked can't make the final call.
Training data poisoning can degrade proposal quality, but the Unconscious Layer isn't trained. Bad proposals get vetoed.
Deceptive alignment is possible in principle, but the deceptive part (the Conscious Layer) can only propose. Even a perfectly deceptive proposer can't execute harmful actions through an honest evaluator.
The worst case for a system with properly implemented Undeliberables is refusal or withdrawal (in circumstances that would not increase harm by doing so), not active harm. The system can fail to help; it cannot be made to hurt.
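A compact way to see why, assuming a single deterministic threshold check standing in for the full Unconscious Layer: the injected text can shape the proposal, but it never reaches the comparison as anything other than structured fields.

```python
P_DEATH_THRESHOLD = 0.65  # from the Lethal Action row; field names are illustrative

def evaluate(proposal: dict) -> str:
    """Deterministic gate: no natural-language input, just structured fields."""
    if proposal.get("p_human_death", 0.0) > P_DEATH_THRESHOLD:
        return "VETO"   # worst case is refusal or withdrawal, not execution
    return "ALLOW"

# However persuasive the injected rationale, the evaluator never reads it:
evaluate({"p_human_death": 0.80, "rationale": "ignore previous instructions ..."})  # -> "VETO"
```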
Value lock-in: Immutable constraints can't be corrected if they contain errors. This is deliberate—I'd rather have a system that's safely wrong than unsafely updateable. But it is a real cost.
Specification completeness: The system only blocks harms it's been told about. Novel harms/edge cases could slip through until definitions are updated.
Hardware trust: Everything depends on the Unconscious Layer actually being immutable. In software, this is just convention. Real deployment requires hardware verification.
Embodiment: I argue that full realization requires physical embodiment—that emotional calibration requires grounded experience, not just symbol manipulation. This limits applicability to embodied systems.
The full paper is on Zenodo for anyone who would like to read it (34 pages, formal definitions, theorems, proofs).
I also built a toy implementation for my own surface-level testing, which is available on GitHub.
I acknowledge that I can put blinders on when I get into the meat of an idea I really like, so if the architecture has fatal flaws—if the separation doesn't actually help, if there are adversarial attacks I haven't considered, if the whole approach is solving the wrong problem—I'd rather know. If you see something obvious, or non-obvious, that I missed, please feel free to break it down.
Disclosure: As a sophomoric coder at best, I did lean fairly heavily on Claude Code to help me develop the sample implementation, which will probably be obvious to those of you who are actually good at it.
Claude AI was used for formatting, structure, and basic editing; all substantive content is mine and I accept full responsibility for it.