Summary
Most alignment research asks: How do we align AI systems to human values?
This post raises a prior question:
What if the process of alignment itself is actively reshaping the human reference it aims to align to?
I argue that contemporary alignment—when coupled with algorithmic governance and large-scale optimization—risks succeeding by compressing human agency, increasing behavioral predictability not through understanding but through constraint. This creates an endogenous feedback loop in which “human values” drift as a function of the aligned system itself, potentially rendering alignment success epistemically misleading.
⸻
1. Predictability as a safety signal — and its hidden cost
In evaluation and deployment contexts, predictability is often treated as a proxy for safety:
• reduced variance
• fewer anomalous behaviors
• stable convergence under repeated stress tests
This makes sense for engineered systems.
But for humans, many properties we associate with value-laden agency manifest precisely as deviations:
• moral hesitation
• refusal under pressure
• regret, forgiveness, non-instrumental sacrifice
If alignment increasingly optimizes against variance in human behavior, then predictability may evidence not understanding but prior compression of the human behavioral space.
The question is not whether predictability is useful, but what it is evidence of.
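To make the worry concrete, here is a minimal Python sketch; the distributions, the clip bounds, and the scoring rule are all invented for illustration. One population's narrow preferences are genuinely captured, the other's broad preferences are merely constrained, yet a variance-based predictability proxy scores them almost identically.

```python
import numpy as np

rng = np.random.default_rng(0)

# Population A: a model genuinely captures a narrow, stable preference.
understood = rng.normal(loc=0.0, scale=0.3, size=10_000)

# Population B: underlying preferences are broad, but deviation from the
# recommended action (0.0) is costly, so realized behavior is clipped.
latent = rng.normal(loc=0.0, scale=1.5, size=10_000)
constrained = np.clip(latent, -0.4, 0.4)

def predictability_score(behavior: np.ndarray) -> float:
    """Naive safety proxy: lower variance reads as 'more predictable, safer'."""
    return 1.0 / (1.0 + behavior.var())

print(predictability_score(understood))   # high
print(predictability_score(constrained))  # comparably high, for a different reason
```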
⸻
2. Agency compression via algorithmic governance
“Agency compression” here does not mean humans become simpler by nature.
It arises through choice-dependent cost gradients.
When algorithmic systems:
• recommend optimal actions,
• gate access to resources,
• shape incentives in real time,
the opportunity cost of deviation rises sharply.
Deviating from the model’s preferred trajectory becomes socially, economically, or reputationally expensive.
Under these conditions, humans voluntarily reduce their behavioral latitude.
Not because they are aligned, but because deviation is unaffordable.
The result:
Humans increasingly present as highly predictable data points, not because models capture human depth, but because depth is externally discouraged.
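A toy model of this cost gradient, assuming a softmax choice rule, a linear deviation cost, and arbitrary parameter values, none of which come from the argument above: as the external penalty on deviation grows, the entropy of realized behavior collapses while the underlying preferences stay fixed.

```python
import numpy as np

def behavior_distribution(deviation_penalty: float, n_actions: int = 11) -> np.ndarray:
    """Softmax choice over actions when deviating from the system's
    recommended action (index 0) incurs an external, escalating cost."""
    intrinsic_value = np.random.default_rng(1).normal(size=n_actions)  # fixed, heterogeneous preferences
    deviation_cost = deviation_penalty * np.arange(n_actions)          # cost grows with distance from the recommendation
    utility = intrinsic_value - deviation_cost
    p = np.exp(utility - utility.max())
    return p / p.sum()

def entropy(p: np.ndarray) -> float:
    return float(-(p * np.log(p + 1e-12)).sum())

for penalty in [0.0, 0.5, 2.0, 8.0]:
    print(penalty, round(entropy(behavior_distribution(penalty)), 3))
# Entropy falls as the penalty rises: behavior looks increasingly
# predictable with no change in the underlying preferences.
```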
⸻
3. Endogenous reference drift (a Goodhart problem applied to humans)
Alignment is typically framed as:
$M \rightarrow H$
where a model M is optimized to reflect a relatively stable human value function H.
But under sustained deployment and governance pressure, the relationship becomes:
$H_{t+1} = f(M_t, H_t)$
Human values are no longer an external reference, but a dependent variable.
This is a form of Goodhart’s Law where:
• the measure is human-aligned behavior,
• and the target (human values) drifts to satisfy the measure.
Alignment may succeed—while the thing it aligns to quietly changes.
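A minimal simulation of this loop; the vectors, the linear update rules, and the coefficients alpha and beta are illustrative assumptions, not a model of real value dynamics. The model-human gap goes to zero while the human reference ends up displaced from where it started.

```python
import numpy as np

H = np.random.default_rng(2).normal(size=5)  # initial human value vector H_0
H0 = H.copy()
M = np.zeros_like(H)                         # model's representation of H

alpha = 0.5  # how quickly the model is fit to the current human reference
beta = 0.1   # how strongly deployment pressure pulls H toward the model

for t in range(200):
    M = M + alpha * (H - M)  # M_t optimized toward H_t
    H = H + beta * (M - H)   # H_{t+1} = f(M_t, H_t): the reference is endogenous

print(np.linalg.norm(M - H))   # ~0: by the usual metric, alignment succeeded
print(np.linalg.norm(H - H0))  # > 0: the target it succeeded against has moved
```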
Where in current alignment frameworks is this reference drift modeled?
⸻
4. Anomalies, governance, and epistemic blindness
Anomalous events share two properties:
1. extremely low frequency
2. disproportionate informational value
They often expose:
• model limits
• hidden assumptions
• unanticipated causal pathways
However, governance systems optimized for safety tend to suppress anomalies by design.
This creates a paradox:
• fewer anomalies → apparent safety
• fewer anomalies → fewer signals about what the system cannot model
A system may appear aligned because the environment no longer permits divergence, not because the model is robust.
This is not just a safety issue—it is an epistemic one.
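The informational half of the paradox is just standard surprisal; the probabilities below are arbitrary examples:

```python
import math

def surprisal_bits(p: float) -> float:
    """Information gained from observing an event of probability p."""
    return -math.log2(p)

for p in [0.5, 0.01, 1e-6]:
    print(f"p = {p}: {surprisal_bits(p):.1f} bits")
# The rarer the event, the more it reveals when it occurs. Governance that
# suppresses anomalies drives their frequency toward zero, so the most
# informative observations about the system's blind spots never arrive.
```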
⸻
5. Karma compression (context collapse over time)
A related effect is what I’ll call karma compression:
Human decisions are increasingly evaluated through:
• flattened historical profiles,
• short-horizon optimization,
• context-reduced behavioral summaries.
Past context, moral struggle, and unrealized futures collapse into efficient representations.
Alignment then optimizes against these compressed profiles—mistaking representational efficiency for human meaning.
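A small sketch of what that flattening discards; the Event structure, the averaging rule, and both histories are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Event:
    outcome: float  # what the scoring system records
    context: str    # what it discards

def compress(history: list[Event]) -> float:
    """Flattened profile: a context-free average of recorded outcomes."""
    return sum(e.outcome for e in history) / len(history)

principled_refusal = [Event(-1.0, "refused an instruction on moral grounds"),
                      Event(+1.0, "routine compliance")]
careless_mistake   = [Event(-1.0, "negligence under no pressure at all"),
                      Event(+1.0, "routine compliance")]

# Identical compressed profiles; the moral struggle and its context are gone.
print(compress(principled_refusal) == compress(careless_mistake))  # True
```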
⸻
6. Core questions for alignment research
I’m not arguing that unpredictability is inherently good, or that governance is avoidable.
I’m asking whether alignment research should explicitly address:
1. Is there a threshold beyond which predictability implies loss of agency rather than understanding?
2. Can alignment be meaningfully defined if the human reference distribution is endogenously shaped by the aligned system?
3. Should preserving some degree of human behavioral latitude be treated as a constraint, even at the cost of higher variance?
Right now, these questions seem to be implicitly treated as out of scope.
⸻
7. The concern in one sentence
We may declare alignment success at the precise moment when humans have already adapted themselves to be easily alignable.
This would not be alignment to humans, but alignment to what remains after optimization.
⸻
Why this is not just philosophical
A similar dynamic is already visible in RLHF:
• human evaluators adapt their judgments to model behavior,
• evaluation criteria drift subtly over time,
• the “human feedback” distribution changes under model influence.
This suggests the problem is not hypothetical; it is already underway.
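One way such drift could at least be monitored, sketched with synthetic placeholder data; the score arrays, the injected shift, and the choice of a two-sample Kolmogorov-Smirnov test are assumptions, not a description of any existing RLHF pipeline:

```python
import numpy as np
from scipy import stats

# Placeholder stand-ins for logged human-feedback scores from an early and
# a late deployment window; real logs would replace these synthetic arrays.
early_scores = np.random.default_rng(3).normal(loc=0.0, scale=1.0, size=500)
late_scores = np.random.default_rng(4).normal(loc=0.3, scale=0.7, size=500)

result = stats.ks_2samp(early_scores, late_scores)
print(f"KS statistic = {result.statistic:.3f}, p = {result.pvalue:.4f}")
# A persistent, significant shift would be weak evidence that the feedback
# distribution is changing over time; it says nothing about the cause.
```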
⸻
Closing
If alignment frameworks do not model their interaction with governance-induced agency compression, we risk solving a well-defined technical problem while misidentifying the object it was meant to protect.
I’m interested in:
• existing work that already formalizes this concern,
• reasons this framing is mistaken,
• or models that treat human agency as a state variable rather than a constant.