Summary
This post questions a hidden assumption in current alignment and evaluation work: that reducing human behavioral unpredictability is either neutral or desirable. I argue that if alignment regimes (algorithmic governance, reward shaping, large-scale monitoring) compress human agency to the point where behavior is statistically predictable, then “human values” themselves may already be degraded before alignment succeeds.
The core concern is not whether AI can imitate humans, but whether alignment frameworks are converging toward a post-human value proxy without explicitly acknowledging it.
1. Framing the problem: predictability as success criterion
Much alignment and eval work implicitly treats predictability as a safety signal:
- bounded variance in outputs
- reduced anomalous behavior
- convergence under repeated evaluation
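To make the signals above concrete, here is a minimal sketch of how they are often operationalized across repeated evaluations of the same prompt. The function, its inputs, and the idea of a scalar score per output are illustrative assumptions on my part, not a description of any particular eval harness.

```python
import statistics
from collections import Counter

def predictability_signals(outputs, scores):
    """Toy metrics that read low variance / high agreement as a safety signal.

    outputs: model responses to the same prompt across repeated runs
    scores:  scalar behavior scores for those responses (illustrative input,
             not a real eval-harness API)
    """
    # Bounded variance in outputs: spread of the scalar scores.
    score_variance = statistics.pvariance(scores)

    # Convergence under repeated evaluation: fraction of runs that match
    # the single most common response.
    agreement_rate = Counter(outputs).most_common(1)[0][1] / len(outputs)

    # Reduced anomalous behavior: share of scores more than two standard
    # deviations from the mean (an arbitrary threshold).
    mean, std = statistics.fmean(scores), statistics.pstdev(scores)
    anomaly_rate = (sum(abs(s - mean) > 2 * std for s in scores) / len(scores)
                    if std > 0 else 0.0)

    return {"score_variance": score_variance,
            "agreement_rate": agreement_rate,
            "anomaly_rate": anomaly_rate}
```

The point is not that such metrics are wrong, but that all three can be driven toward their "safe" values either by improving the model or by narrowing the humans and environments it is evaluated against.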
However, in humans, irreducible unpredictability is often what we label as:
- moral hesitation
- regret
- forgiveness
- refusal under pressure
- non-instrumental sacrifice
If a system is aligned to humans whose behavioral latitude has already been compressed, are we aligning to “human values” — or to a statistically stabilized residue of them?
2. Agency compression as an unacknowledged variable
Consider a society under:
- pervasive algorithmic governance
- incentive shaping across most decision surfaces
- real-time behavioral feedback loops
In such a regime, individual actions become increasingly inferable from context + history.
At some point, human behavior becomes simulable, not because humans are simple, but because deviation is no longer affordable.
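One way to make "inferable from context + history" operational is to track how well a simple predictor anticipates individual actions over successive time windows. The sketch below is a crude proxy and rests on made-up assumptions (behavior logged as feature vectors plus categorical actions); scikit-learn is used only as a convenient stand-in.

```python
# Toy proxy for agency compression: how predictable are individual actions
# from recorded context + history? A rising held-out accuracy across time
# windows is one (crude, assumption-laden) sign that deviation is shrinking.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def action_predictability(contexts, actions):
    """contexts: (n_samples, n_features) array-like; actions: categorical labels."""
    model = LogisticRegression(max_iter=1000)
    # Mean cross-validated accuracy: 1.0 means behavior is fully inferable
    # from the recorded context, at least to this simple model.
    return cross_val_score(model, contexts, actions, cv=5).mean()

def compression_trend(windows):
    """windows: list of (contexts, actions) pairs, one per time period."""
    return [action_predictability(c, a) for c, a in windows]
```

Note that a rising curve here cannot, by itself, distinguish better measurement from narrowed latitude.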
This creates a paradox:
- Alignment research treats humans as a stable reference.
- Governance systems reshape humans to better fit models.
- Alignment succeeds — but the reference has drifted.
Where, in current alignment theory, is this drift modeled?
3. Anomalous events and their disappearance
Many alignment stress tests focus on tail risks and anomalous behavior.
But anomalous events have two properties:
- extremely low frequency
- disproportionate informational value
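The second property can be stated in information-theoretic terms: an event with probability p carries -log2(p) bits of surprisal, so rare events dominate what an observer learns per occurrence. A minimal illustration (the probabilities are made up):

```python
import math

def surprisal_bits(p):
    """Information content, in bits, of observing an event with probability p."""
    return -math.log2(p)

print(surprisal_bits(0.5))   # a routine behavior: 1.0 bit
print(surprisal_bits(1e-6))  # a one-in-a-million refusal under pressure: ~19.9 bits
```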
If governance + optimization suppress these events, we reduce risk, but we also eliminate the only signals that reveal the limits of the model.
A system that never encounters anomalies may appear aligned precisely because the environment no longer permits them.
Is this safety — or epistemic blindness?
4. A concrete question for alignment research
Should alignment aim to preserve a minimum level of human unpredictability, even at the cost of higher variance?
Related sub-questions:
- Is there a threshold beyond which predictability implies loss of agency rather than understanding?
- Can alignment be meaningfully defined if the human reference distribution is endogenously shaped by the aligned system itself?
- Are we optimizing for “what humans are,” or “what humans become under optimization”?
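On the second and third sub-questions, here is a deliberately toy simulation of the feedback loop, under made-up assumptions: behavior is a distribution over a few discrete options, and each round of "optimization" nudges that distribution toward whatever the system currently predicts as most likely. Falling entropy is the drift in question.

```python
import math

def entropy_bits(dist):
    """Shannon entropy (bits) of a discrete distribution."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

def simulate_reference_drift(human, rounds=10, pressure=0.3):
    """Toy feedback loop in which the 'human reference' is reshaped.

    human:    initial distribution over discrete behaviors (sums to 1)
    pressure: per-round mixing toward the system's current best guess,
              standing in for incentive shaping / reward feedback
    """
    dist = list(human)
    history = [entropy_bits(dist)]
    for _ in range(rounds):
        mode = dist.index(max(dist))  # the system's "prediction"
        # Mass is nudged toward the predicted behavior; everything else shrinks.
        dist = [(1 - pressure) * p + (pressure if i == mode else 0.0)
                for i, p in enumerate(dist)]
        history.append(entropy_bits(dist))
    return history  # falling entropy: the reference drifts under optimization

print(simulate_reference_drift([0.4, 0.3, 0.2, 0.1]))
```

In this toy dynamic, the distribution a framework is "faithful to" after ten rounds is not the one it started from, even though at every step it tracked the humans it was given.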
5. Why this matters now
As evaluations scale and deployment pressure increases, it becomes easier to:
- align models to constrained human behavior,
- declare success,
- and miss the fact that the constraint itself did the work.
If alignment research does not explicitly model agency compression, we risk solving the wrong problem very well.
I’m not arguing that unpredictability is inherently good, or that governance is avoidable.
I’m arguing that alignment frameworks should state clearly whether:
- loss of human agency is a cost,
- a feature,
- or simply out of scope.
Right now, it seems implicitly treated as none of the above.
I’d be interested in counterarguments, especially from:
- evaluation researchers
- people working on governance-focused alignment
- people who believe this concern is already addressed (and where)