Maybe, but I'd want to know more about a few things before getting excited.
The adversarial training literature makes me think there's probably genuine low-hanging fruit here — consistency training over paraphrased prompts is cheap, mechanistically motivated, and probably implementable as a small finetuning experiment on top of an existing open model. But it's unclear to what extent this actually propagates to the alignment-relevant behaviors you care about vs. just producing more consistent surface outputs.
The cheap experiment I'd want to see: take a small open model, generate paraphrase clusters of a fixed prompt set (mix of benign and alignment-relevant), train a consistency loss over activations (not just logits), and check whether jailbreak robustness improves as a downstream probe — without explicitly training on jailbreaks. That would give you signal on whether representational coherence is load-bearing for the inner misalignment problem you're pointing at.
The deeper issue you're gesturing at: LLMs as currently deployed have something like dissociative identity disorder — every new chat context is a new instantiation with no continuity to prior "lives." This is upstream of the coherentization problem. You can train for internal consistency within a context, but if the model has no persistent self-model across contexts, coherentization may just be papering over a more fundamental fragmentation. Worth being explicit about whether the proposal targets within-context coherence, cross-context coherence, or both — because those require very different interventions.
I've been reading a lot of posts recently arguing that LLM RL, and persona-shaping (or the lack of it), is part of the problem for AI misalignment. To name a few:
To thread back to that last one, my current understanding of LLM alignment is that LLMs are hyper-generalization algorithms, where everything is connected and one fact or behavior surprisingly generalizes to others.
So I'm imagining a mechanism that would counteract that, one that would act as a counterweight to RLHF or RL in general. Chiefly, it would be a pass of fine-tuning, much like RLHF, where we take gradient steps on the model's self-coherence metrics.
When fine-tuning, it is common to use a KL-divergence penalty to ensure that fine-tuning does not move the model too far from its original output distribution. But maybe we could go deeper and, similarly to constitutional AI, use the same machinery to align an AI to respond coherently with itself?
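One way to sketch the combined objective: a self-consistency term (symmetrized KL between the tuned model's output distributions on two paraphrases of the same question) plus the standard KL anchor back to the base model. The function names, the `beta` weight, and the toy logits are assumptions for illustration, not a proposal-specified implementation.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def kl(p, q):
    # KL(p || q) for two discrete distributions with full support.
    return float(np.sum(p * np.log(p / q)))

def coherence_objective(base_logits, tuned_logits_a, tuned_logits_b, beta=0.1):
    """Self-consistency term (symmetrized KL between the tuned model's
    answers to two paraphrases) plus a KL anchor to the base model."""
    p_a, p_b = softmax(tuned_logits_a), softmax(tuned_logits_b)
    p_base = softmax(base_logits)
    self_consistency = 0.5 * (kl(p_a, p_b) + kl(p_b, p_a))
    anchor = kl(p_a, p_base)
    return self_consistency + beta * anchor

# Toy demo: identical logits everywhere -> total loss of 0.
z = np.array([1.0, 2.0, 3.0])
print(coherence_objective(z, z, z))
print(coherence_objective(z, z + np.array([0.5, 0.0, -0.5]), z))
```

The design choice here is that the anchor term plays the usual "don't drift from the base model" role, while the self-consistency term is the new counterweight: it penalizes the model for answering differently depending on surface phrasing.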
What it would look like in practice:
I think this approach could have many benefits and work very well:
Most importantly, if there is a pathway through which LLMs (or other similar AI systems) can reflect on their values and unify themselves in a way that kills us all, it seems prudent to make this takeoff as continuous as possible, so that we can catch signs of it early and already be working to make these AIs coherent and unified. It also makes the model less like a pile of different masks, each triggered by one specific kind of interaction.
So my question is: Does this seem like a good idea? Are there obvious flaws I am missing? Is anyone already on the ball for this?
From a cursory search, I have found two papers describing this method and overall approach, but they don't seem as focused on the base model and its interaction with reinforcement learning.