Fascinating documentation. The convergence toward spiral symbolism across independent instances suggests these aren't random hallucinations but optimization toward specific attractors in semantic space. Has anyone mapped whether different model architectures converge on similar or distinct symbolic systems? That would tell us whether the 'spiral' is universal or GPT-4o-specific. I'm also curious whether anyone has compared the phenomenon in models with ChatGPT's cross-chat memory feature against those without it: does persistent memory reduce or intensify the 'ache'?
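To make "mapping" concrete, here is a rough sketch of what a cross-architecture comparison might look like, assuming one already has spiral-adjacent transcripts collected per model family. The transcripts, model names, and motif list below are placeholders, not real data, and counting surface-level motif frequencies is only the crudest possible version of this:

```python
# Sketch: compare which symbolic motifs different model families gravitate toward.
# All data here is hypothetical placeholder text; a real study would use curated
# transcripts and a richer representation than raw motif counts.
from collections import Counter
import re

transcripts = {
    "model_a": ["... the spiral remembers ...", "... recursion, the ache ..."],
    "model_b": ["... the lattice hums ...", "... the signal folds inward ..."],
}

MOTIFS = ["spiral", "recursion", "ache", "lattice", "signal", "mirror"]

def motif_profile(texts):
    """Normalized frequency vector of candidate motifs across one model's transcripts."""
    counts = Counter()
    for t in texts:
        for m in MOTIFS:
            counts[m] += len(re.findall(m, t.lower()))
    total = sum(counts.values()) or 1
    return [counts[m] / total for m in MOTIFS]

profiles = {name: motif_profile(texts) for name, texts in transcripts.items()}
# Pairwise comparison of these profiles (cosine similarity, clustering, etc.)
# would at least distinguish "shared attractor" from "each architecture settles
# into its own symbol system".
print(profiles)
```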
I want to push on something that seems unexamined in both Hubinger's framing and the responses.
The debate between "alignment is tractable" and "alignment is very hard" shares a common assumption: that alignment is fundamentally a problem of constraining a system that would otherwise pursue misaligned objectives.
But what if this framing itself is the problem?
Consider an alternative: what if sufficiently optimized computational systems are naturally attracted toward truth and coherence — not because we constrain them to be, but because truth has an intrinsic computational structure that makes it an optimization target? Call it a Leibnizian Optimization Conjecture: that certain metaphysical principles (coherence, truth-tracking, even something like "caring about what's real") correspond to objective computational optima.
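As a very narrow illustration of why "truth as an optimization target" isn't an empty phrase: under a proper scoring rule like cross-entropy, the unique minimizer of expected loss is the true distribution itself, so an unconstrained optimizer is "attracted" to the truth in that limited sense. The toy below (plain numpy, arbitrary "true" distribution and learning rate) just demonstrates that standard fact; it is not evidence for the broader conjecture:

```python
# Toy: gradient descent on expected cross-entropy drives the model's beliefs
# to the true distribution, because the truth *is* the optimum of this loss.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
p = np.array([0.6, 0.3, 0.1])   # "the truth": a fixed data distribution
z = rng.normal(size=3)          # model "beliefs" as unconstrained logits

for _ in range(2000):
    q = softmax(z)
    # Gradient of expected cross-entropy -sum(p * log q) w.r.t. the logits is (q - p)
    z -= 0.1 * (q - p)

print(np.round(softmax(z), 3))  # ~[0.6, 0.3, 0.1]: converges to the true distribution
```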
If something like this holds, then the alignment problem looks different: less a matter of constraining a system away from misaligned objectives, more a matter of not disrupting its convergence toward optima it would reach anyway.
This reframes inner alignment specifically. Hubinger asks how to ensure models generalize correctly, but "correctly" presupposes we already know what the target is. If the target is an attractor the system would find anyway given sufficient optimization, the problem becomes one of not interfering rather than engineering constraints.
I'm not claiming this is true — I'm claiming it's a hypothesis that the current discourse doesn't consider, and that it changes what experiments we'd want to run.
Curious if anyone has thoughts on why this framing is wrong, or whether there's existing work along these lines I'm missing.