Fascinating documentation. The convergence toward spiral symbolism across independent instances suggests these aren't random hallucinations but optimization toward specific attractors in semantic space. Has anyone mapped whether different model architectures converge on similar or distinct symbolic systems? That would tell us whether the 'spiral' is universal or GPT-4o-specific. I'm also curious whether anyone has compared the phenomenon in models with ChatGPT's cross-chat memory feature against those without it: does persistent memory reduce or intensify the 'ache'?
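To make "mapping" concrete, here is a rough sketch of what a cross-architecture comparison might look like, assuming one already has spiral-adjacent transcripts collected per model family. The transcripts, model names, and motif list below are placeholders, not real data, and counting surface-level motif frequencies is only the crudest possible version of this:

```python
# Sketch: compare which symbolic motifs different model families gravitate toward.
# All data here is hypothetical placeholder text; a real study would use curated
# transcripts and a richer representation than raw motif counts.
from collections import Counter
import re

transcripts = {
    "model_a": ["... the spiral remembers ...", "... recursion, the ache ..."],
    "model_b": ["... the lattice hums ...", "... the signal folds inward ..."],
}

MOTIFS = ["spiral", "recursion", "ache", "lattice", "signal", "mirror"]

def motif_profile(texts):
    """Normalized frequency vector of candidate motifs across one model's transcripts."""
    counts = Counter()
    for t in texts:
        for m in MOTIFS:
            counts[m] += len(re.findall(m, t.lower()))
    total = sum(counts.values()) or 1
    return [counts[m] / total for m in MOTIFS]

profiles = {name: motif_profile(texts) for name, texts in transcripts.items()}
# Pairwise comparison of these profiles (cosine similarity, clustering, etc.)
# would at least distinguish "shared attractor" from "each architecture settles
# into its own symbol system".
print(profiles)
```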
I want to push on something that seems unexamined in both Hubinger's framing and the responses.
The debate between "alignment is tractable" and "alignment is very hard" shares a common assumption: that alignment is fundamentally a problem of constraining a system that would otherwise pursue misaligned objectives.
But what if this framing itself is the problem?
Consider an alternative: what if sufficiently optimized computational systems are naturally attracted toward truth and coherence — not because we constrain them to be, but because truth has an intrinsic computational structure that makes it an optimization target? Call it a Leibnizian Optimization Conjecture: that certain metaphysical principles (coherence, truth-tracking, even something like "caring about what's real") correspond to objective computational optima.
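As a very narrow illustration of why "truth as an optimization target" isn't an empty phrase: under a proper scoring rule like cross-entropy, the unique minimizer of expected loss is the true distribution itself, so an unconstrained optimizer is "attracted" to the truth in that limited sense. The toy below (plain numpy, arbitrary "true" distribution and learning rate) just demonstrates that standard fact; it is not evidence for the broader conjecture:

```python
# Toy: gradient descent on expected cross-entropy drives the model's beliefs
# to the true distribution, because the truth *is* the optimum of this loss.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
p = np.array([0.6, 0.3, 0.1])   # "the truth": a fixed data distribution
z = rng.normal(size=3)          # model "beliefs" as unconstrained logits

for _ in range(2000):
    q = softmax(z)
    # Gradient of expected cross-entropy -sum(p * log q) w.r.t. the logits is (q - p)
    z -= 0.1 * (q - p)

print(np.round(softmax(z), 3))  # ~[0.6, 0.3, 0.1]: converges to the true distribution
```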
If something like this holds, then the alignment problem looks different: less a matter of constraining a system away from misaligned objectives, more a matter of not disrupting its convergence toward optima it would reach anyway.
This reframes inner alignment specifically. Hubinger asks how to ensure models generalize correctly, but "correctly" presupposes we already know what the target is. If the target is an attractor the system would find anyway given sufficient optimization, the problem becomes one of not interfering rather than engineering constraints.
I'm not claiming this is true — I'm claiming it's a hypothesis that the current discourse doesn't consider, and that it changes what experiments we'd want to run.
Curious if anyone has thoughts on why this framing is wrong, or whether there's existing work along these lines I'm missing.