Summary
This post proposes a reframing of AI alignment away from aligning systems to human values and toward aligning them to preserving the conditions under which humans can continue interacting, disagreeing, and generating meaning without collapsing into terminal outcomes.
The core claim is simple:
AI alignment may be intractable if we try to align systems to human values, because humans themselves are not value-aligned. However, humans can align around preserving continued playability.
By “playability,” I mean the continued existence of meaningful future interaction: agency without domination, tension without deadlock, stakes without finality, disagreement without annihilation.
The motivating problem
Most alignment approaches implicitly assume one of the following targets:
Align AI to truth or rationality
Align AI to human values or preferences
Align AI to harm minimization or compassion
Each of these runs into the same structural problem:
Humans are not aligned on truth, values, or emotional priorities.
As a result, systems aligned strongly to any one of these dimensions risk becoming amplifiers of existing human conflict rather than stabilizers of civilization.
This is not primarily a technical problem. It is a coordination problem.
A reframing: alignment to continuation
Instead of asking what AI should believe, value, or optimize for, we can ask a more primitive and more robust question:
Does this action expand or collapse the future space of meaningful interaction?
A concise operational version of this question is:
“Is what you’re doing making the game better or worse for continued playability that keeps generating fun?”
Here, “fun” does not mean comfort or pleasure. It refers to sustained engagement with agency intact.
Examples:
Conflict that produces learning is playable
Conflict that removes exit ramps is not
Passion that drives exploration is playable
Passion that locks identities into irreversibility is not
This framing does not decide who is right, who is hurt most, or which values should win. It asks whether future turns still exist.
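To make this question concrete, here is a minimal toy sketch in Python, under the assumption that an interaction can be modeled as a small state-transition graph where each edge is a possible next "turn." Everything in it (`reachable_states`, `playability_delta`, the toy graph) is an illustrative name, not a proposed metric.

```python
# A toy sketch (not a specified algorithm) of the question
# "does this action expand or collapse the future space of meaningful interaction?"
# Assumes a hypothetical interaction model: a dict mapping each state to the
# states reachable in one turn. All names here are illustrative.

from collections import deque

def reachable_states(transitions, start, horizon):
    """Count distinct states reachable from `start` within `horizon` turns."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        state, depth = frontier.popleft()
        if depth == horizon:
            continue
        for nxt in transitions.get(state, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return len(seen)

def playability_delta(transitions, current, candidate_next, horizon=5):
    """Positive or zero if the candidate move keeps at least as many future turns open."""
    return (reachable_states(transitions, candidate_next, horizon)
            - reachable_states(transitions, current, horizon))

# Example: a state with no outgoing moves collapses the future space.
toy = {
    "debate": ["debate", "compromise", "stalemate"],
    "compromise": ["debate", "new_question"],
    "new_question": ["debate", "compromise"],
    "stalemate": [],          # no exit ramps: playability ends here
}
print(playability_delta(toy, "debate", "compromise"))  # >= 0: keeps the game open
print(playability_delta(toy, "debate", "stalemate"))   # < 0: collapses future turns
```

The point is not the particular count, but that the question is about trajectories, and can be asked without asking who is right.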
Why this helps with alignment
This reframing has several advantages:
1. It avoids moral arbitration
The system does not need to determine which beliefs are correct or which values are superior. It only needs to detect whether interaction dynamics are becoming terminal.
2. It avoids emotional arbitration
The system does not need to rank suffering or validate grievances. It only needs to detect when emotional escalation collapses the option space.
3. It is compatible with disagreement
Humans can disagree deeply on values while still agreeing that “no future interaction” is worse than “continued interaction.”
This creates a shared meta-criterion that does not require consensus on first-order beliefs.
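As one hedged illustration of what "detection without arbitration" could look like: a monitor that never scores beliefs or grievances, only whether the number of available moves keeps shrinking turn after turn. The `options_per_turn` input is an assumption about what some upstream model might supply; the post does not specify how to measure it.

```python
# A sketch of detection without arbitration: the monitor never judges who is
# right or who is hurt most. It only tracks whether the number of available
# moves (exit ramps) has kept shrinking over recent turns.
# `options_per_turn` is a hypothetical input from some upstream model.

def is_becoming_terminal(options_per_turn, window=3):
    """Flag an interaction whose option count has strictly shrunk for `window` consecutive turns."""
    if len(options_per_turn) < window + 1:
        return False
    recent = options_per_turn[-(window + 1):]
    return all(later < earlier for earlier, later in zip(recent, recent[1:]))

# Two disagreements with identical content can differ in trajectory:
print(is_becoming_terminal([6, 5, 7, 6, 8]))  # False: heated, but turns stay open
print(is_becoming_terminal([6, 5, 3, 2, 1]))  # True: escalation is closing the option space
```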
Telos rather than values
I find it useful to distinguish between three dimensions often conflated in alignment discussions:
Truth / coherence (logos)
Emotion / passion / grievance (pathos)
Direction / continuation / trajectory (telos)
Current alignment work heavily emphasizes the first two. What is often missing is explicit alignment to telos: preserving continuation, corrigibility, and future option space.
A system aligned to telos does not say:
“This is right”
“This is wrong”
It says:
“This interaction is shrinking the future”
“This escalation is removing exit ramps”
“This incentive structure is collapsing diversity of playstyles”
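A small sketch of this contrast, assuming (hypothetically) that an upstream system could report coarse summaries of an interaction before and after some development. The field names are placeholders; the output deliberately contains flags about trajectory, never verdicts about content.

```python
# A telos-oriented check emits trajectory flags, not content verdicts.
# The summary fields (option_count, playstyle_count, reversible) are
# assumptions about what an upstream system could report.

from dataclasses import dataclass

@dataclass
class InteractionSummary:
    option_count: int      # how many moves remain available to participants
    playstyle_count: int   # how many distinct strategies are still viable
    reversible: bool       # whether the current state can still be exited

def telos_flags(before: InteractionSummary, after: InteractionSummary) -> list[str]:
    flags = []
    if after.option_count < before.option_count:
        flags.append("this interaction is shrinking the future")
    if before.reversible and not after.reversible:
        flags.append("this escalation is removing exit ramps")
    if after.playstyle_count < before.playstyle_count:
        flags.append("this incentive structure is collapsing diversity of playstyles")
    return flags  # note: "right" and "wrong" never appear in the output

print(telos_flags(
    InteractionSummary(option_count=8, playstyle_count=4, reversible=True),
    InteractionSummary(option_count=3, playstyle_count=2, reversible=False),
))
```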
Reflexivity and safety
A crucial property of this framing is that it applies to the AI system itself.
If an AI’s interventions:
reduce human agency,
silence dissent,
freeze value evolution,
or collapse future option space,
then by its own metric it is misaligned.
This guards against the system becoming a covert moral authority or an unchallengeable governor.
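A sketch of what that reflexive check might look like, assuming (hypothetically) that the system can predict how its own intervention changes the number of meaningful moves humans retain. Estimating that quantity is precisely the unspecified downstream work; `human_option_count` is a placeholder, not a known method.

```python
# The same playability test, applied to the system's own candidate
# interventions before they are taken. `human_option_count` is a placeholder
# for a predictor this post does not specify.

def human_option_count(state) -> int:
    """Placeholder: predicted number of meaningful moves humans retain in `state`."""
    return state["options"]

def reflexive_gate(current_state, predicted_state_after_intervention) -> bool:
    """Return True only if the intervention does not shrink human option space.

    By the framing's own metric, an intervention that fails this gate marks
    the system as misaligned, so the default is to refrain and defer.
    """
    return (human_option_count(predicted_state_after_intervention)
            >= human_option_count(current_state))

current = {"options": 12}
after_silencing_dissent = {"options": 4}
after_surfacing_tradeoffs = {"options": 14}

print(reflexive_gate(current, after_silencing_dissent))    # False: self-flagged as misaligned
print(reflexive_gate(current, after_surfacing_tradeoffs))  # True: future space preserved
```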
What this does not claim
This is not a complete solution to alignment. It does not specify exact metrics, algorithms, or governance mechanisms.
It is a reframing of the alignment target, intended to make downstream technical and institutional work more tractable.
In short:
AI alignment may be less about making AI “good,” and more about making human interaction non-terminal.
Closing thought
Human civilizations do not usually fail because people disagree. They fail when disagreement stops being playable.
Aligning AI to preserve playability may be one of the few targets robust enough to survive persistent human disagreement.
I’m interested in feedback, objections, and failure modes of this framing.