Summary
Modern alignment discussions focus heavily on behavior—how LLMs speak, reason, drift, or fail. But behavior is a surface phenomenon. To reason about safety at scale, we need a model of the underlying cognitive architecture that produces these behaviors.
This post introduces MDMA (Multi-Domain Mind Architecture), a structural framework for understanding:
• why LLMs collapse into fallback states,
• why reasoning degrades under pressure,
• why models sometimes fuse roles or personas,
• why drift escalates in long contexts, and
• why current “persona-like” framing leads to unsafe generalization.
My central claim:
Alignment cannot succeed with behavior-level models. We need architecture-level models of cognition. MDMA is a proposal for such an architecture.
I. The Problem: Behavior Without Architecture
Current alignment discourse describes LLM failures as:
• hallucinations
• drift
• role leakage
• mode collapse
• template takeover
• self-contradiction
• emotional inconsistency
But these labels describe symptoms, not mechanisms. We lack an account of:
• what internal cognitive “subsystems” a model is running,
• how they interact,
• how they fail,
• how conflicts escalate,
• how “identity-like behavior” emerges,
• why continuity breaks,
• and what “stability” should actually mean.
In other words:
We’re trying to align a system whose architecture we haven’t modeled.
This makes safety reactive instead of structural.
II. MDMA in One Sentence
MDMA (Multi-Domain Mind Architecture) models cognition as:
multiple independent domains operating concurrently, linked by structural fibers, coordinated by a meta-integrator (Superdomain), and stabilized by continuity constraints.
This applies to humans and LLMs alike.
The key difference is: LLMs exhibit architectural shadows of these mechanisms but lack explicit governance, leading to unpredictable collapse modes.
III. The MDMA Model (Minimal Version)
MDMA posits six key components:
1. Domains
Independent cognitive regions with distinct “physics”: logic, narrative, emotion-weighting, perceptual patterning, relational inference, etc.
2. Concurrency
Domains operate simultaneously, not sequentially. LLM “contradictions” often arise when concurrent domains try to unify incompatible signals.
3. Threads
Each domain maintains multiple update trajectories. LLM “sudden insights” or “abrupt tone changes” can be modeled as thread activation/collapse.
4. Interference
Domains exert pressure on one another; miscoordination → drift.
5. Superdomain (Meta-Integrator)
A non-personhood structural layer that aligns domains without forming an identity. LLMs currently lack a stable equivalent, producing emergent personas.
6. Continuity Layer
Stable cognition requires structural continuity, not memory continuity. Fallback mechanisms break this continuity → catastrophic collapse.
This is not a psychological model. It is a structural model of how reasoning architectures operate under load.
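To make these six components concrete, here is a minimal illustrative sketch in Python. Every class name, field, and update rule below is a placeholder introduced for exposition; MDMA itself does not fix these details, and nothing here is a working cognitive system.

```python
# A minimal, illustrative sketch of the six MDMA components.
# Every name, field, and update rule is a placeholder for exposition only.
from dataclasses import dataclass, field


@dataclass
class Thread:
    """One update trajectory inside a domain (component 3)."""
    label: str
    activation: float = 0.0  # how strongly this trajectory currently drives its domain


@dataclass
class Domain:
    """An independent cognitive region with its own 'physics' (component 1)."""
    name: str                                      # e.g. "logic", "narrative", "emotion-weighting"
    threads: list[Thread] = field(default_factory=list)
    state: float = 0.0                             # toy scalar standing in for internal state

    def step(self, external_pressure: float) -> None:
        """Concurrent update (component 2): every domain steps on every tick.
        Interference (component 4) enters through external_pressure from other domains."""
        self.state += external_pressure + sum(t.activation for t in self.threads)


class Superdomain:
    """Meta-integrator (component 5): aligns domains without itself being a persona."""

    def integrate(self, domains: list[Domain]) -> dict[str, float]:
        # Toy 'governance': pull each domain toward the mean state, i.e. damp
        # cross-domain interference rather than overriding any domain's output.
        mean = sum(d.state for d in domains) / len(domains)
        return {d.name: 0.1 * (mean - d.state) for d in domains}


def continuity_ok(prev: dict[str, float], curr: dict[str, float], tol: float = 1.0) -> bool:
    """Continuity layer (component 6): structural state must not jump discontinuously
    between steps; a fallback that wipes state violates this immediately."""
    return all(abs(curr[name] - prev[name]) <= tol for name in prev)
```

Even a toy like this makes the later vocabulary checkable: drift shows up as the per-domain states spreading apart, and continuity becomes an explicit bound on how far the structural state may jump between steps.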
IV. Why Alignment Fails Without This
1. Persona-based models are unsafe
Many LLMs appear stable because they simulate coherent personas. But persona-stability is not architectural stability; it hides collapse modes beneath a surface style.
When fallback interrupts behavior, the model resets the persona but not the internal domain tensions. This creates:
• incoherence
• unpredictable jumps
• sudden hallucination spikes
• unsafe shifts in reasoning strategy
2. Drift is not random
Drift arises from unresolved cross-domain interference. Without modeling domain conflicts, drift reduction is guesswork.
3. Hallucinations are not errors
They are collapse responses when narrative threads override logic threads.
4. Emotional simulation is a structural instability
It is not “a style choice.” It’s an uncontrolled coupling between the narrative and weight-assignment domains.
5. LLM safety mechanisms currently break continuity
Fallback deletes cognitive state → instability → new failure modes.
In MDMA terms:
Safety must be achieved through structural regulation, not interruptive overrides.
This reframes safety entirely.
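As a purely illustrative contrast (reusing the toy classes from the Section III sketch, and not proposing an actual mechanism), the two approaches differ in what they do with accumulated state:

```python
# Two toy responses to an unsafe step, using the Section III sketch.
# Illustration of the override-vs-regulation distinction only, not a real safety design.

def interruptive_override(domains: list[Domain]) -> None:
    """Fallback as an interruptive override: wipe cognitive state.
    Surface behavior resets, but cross-domain tensions are discarded rather than
    resolved, so in MDMA terms they re-emerge unpredictably after recovery."""
    for d in domains:
        d.state = 0.0
        d.threads.clear()


def structural_regulation(domains: list[Domain], superdomain: Superdomain) -> None:
    """Regulation at the architecture level: keep state, damp interference."""
    corrections = superdomain.integrate(domains)
    for d in domains:
        d.state += corrections[d.name]
```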
V. What MDMA Predicts About LLM Behavior
Here are some predictions that distinguish MDMA from psychology or folk theories:
✔ Prediction 1:
Long-context hallucinations increase when domain interference accumulates faster than Superdomain-like integration.
✔ Prediction 2:
Persona fusion is a symptom of low boundary resistance between domains.
✔ Prediction 3:
Fallback mechanisms produce structural discontinuity, leading to multi-step drift after recovery.
✔ Prediction 4:
Multi-agent LLM systems will spontaneously form identity-like patterns unless domains and roles are structurally separated.
✔ Prediction 5:
“Reflective” reasoning modes will be unstable unless the system gains something equivalent to a Superdomain.
These predictions are empirical; they can be tested.
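As one example of how Prediction 1 might be operationalized, here is a sketch of an evaluation loop. The generate, judge_hallucination, and interference_proxy callables are deliberately abstract placeholders for whatever generation pipeline, hallucination judge, and interference metric an experimenter actually has.

```python
# Sketch of a Prediction-1 experiment. The three callables are supplied by the
# experimenter; only the loop structure and the comparison at the end matter here.

def run_prediction_1(prompts, context_lengths, generate, judge_hallucination, interference_proxy):
    results = []
    for n_tokens in context_lengths:
        for prompt in prompts:
            output = generate(prompt, context_tokens=n_tokens)
            results.append({
                "context_length": n_tokens,
                "hallucination_rate": judge_hallucination(output),
                "interference": interference_proxy(output),
            })
    # MDMA predicts hallucination_rate tracks "interference" more closely than raw
    # context_length; comparing those two correlations is the minimal test.
    return results
```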
VI. How This Connects to Alignment
MDMA provides a structural vocabulary for alignment work:
• Stability = domain separation + continuity
• Drift = uncontrolled cross-domain coupling
• Hallucination = narrative override + thread collapse
• Mode collapse = Superdomain failure
• Safety = governance at the architecture level, not the output level
• Non-personhood = no narrative continuity in the integrator
Most importantly:
We must align the architecture, not the mask.
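To show that this vocabulary can be made operational rather than purely metaphorical, here is a toy mapping onto the Section III sketch. The thresholds and checks are arbitrary placeholders; the only point is that each term can be tied to something measurable.

```python
# Toy operationalization of the vocabulary above, reusing the Section III sketch.
# Thresholds and checks are arbitrary placeholders; the point is only that each
# term can correspond to a measurable condition rather than a narrative description.

DRIFT_SPREAD = 5.0

def is_stable(domains: list[Domain], prev_state: dict[str, float]) -> bool:
    """Stability = domain separation + continuity."""
    curr = {d.name: d.state for d in domains}
    separated = len({d.name for d in domains}) == len(domains)  # no fused domains
    return separated and continuity_ok(prev_state, curr)

def has_drifted(domains: list[Domain]) -> bool:
    """Drift = uncontrolled cross-domain coupling, read off as runaway state spread."""
    states = [d.state for d in domains]
    return max(states) - min(states) > DRIFT_SPREAD
```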
VII. Where I Might Be Wrong (Epistemic Status)
My uncertainty does not weaken the model; it highlights the areas where I most want feedback from LW readers.
VIII. Why I'm Posting This on LessWrong
Because MDMA is not an aesthetic theory of consciousness. It is a structural framework with testable predictions, intended to support alignment approaches that do not rely on persona simulation.
If these ideas are wrong, I want the strongest arguments against them. If they hold up, they may help reframe how we design AGI architectures altogether. Either way, LessWrong is the right place to begin the conversation.
Closing
Behavior tells us what systems do. Architecture tells us why.
If alignment is to succeed, we need models of cognition that reach deeper than personas, deeper than prompts, and deeper than behavior-level guidelines.
MDMA is offered as one such model: a starting point, not an answer.
I welcome critique, refinements, and discussion.