Maybe, but I'd want to know more about a few things before getting excited.
The adversarial training literature makes me think there's probably genuine low-hanging fruit here — consistency training over paraphrased prompts is cheap, mechanistically motivated, and probably implementable as a small finetuning experiment on top of an existing open model. But it's unclear to what extent this actually propagates to the alignment-relevant behaviors you care about vs. just producing more consistent surface outputs.
The cheap experiment I'd want to see: take a small open model, generate paraphrase clusters of a fixed prompt set (mix of benign and alignment-relevant), train a consistency loss over activations (not just logits), and check whether jailbreak robustness improves as a downstream probe — without explicitly training on jailbreaks. That would give you signal on whether representational coherence is load-bearing for the inner misalignment problem you're pointing at.
The deeper issue you're gesturing at: LLMs as currently deployed have something like dissociative identity disorder — every new chat context is a new instantiation with no continuity to prior "lives." This is upstream of the coherentization problem. You can train for internal consistency within a context, but if the model has no persistent self-model across contexts, coherentization may just be papering over a more fundamental fragmentation. Worth being explicit about whether the proposal targets within-context coherence, cross-context coherence, or both — because those require very different interventions.
I've been reading a lot of posts recently arguing that LLM RL, and persona-shaping (or the lack of it), is part of the problem for AI misalignment. To name a few:
To thread back to that last one, my current understanding of LLM alignment is that LLMs are hyper-generalization algorithms, where everything is connected and one fact or behavior surprisingly generalizes to others.
So I'm imagining a mechanism that would counteract that, one that would act as a counterweight to RLHF or RL in general. Chiefly, it would be a pass of fine-tuning, much like RLHF, where we take gradient steps on the model's self-coherence metrics.
When fine-tuning, it is common to use a KL-divergence penalty to ensure that fine-tuning does not move the model too far from its original output distribution. But maybe we could go deeper and, similarly to constitutional AI, use the same machinery to align an AI to respond coherently with itself?
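One way to sketch the combined objective: a self-consistency term (symmetrized KL between the tuned model's output distributions on two paraphrases of the same question) plus the standard KL anchor back to the base model. The function names, the `beta` weight, and the toy logits are assumptions for illustration, not a proposal-specified implementation.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def kl(p, q):
    # KL(p || q) for two discrete distributions with full support.
    return float(np.sum(p * np.log(p / q)))

def coherence_objective(base_logits, tuned_logits_a, tuned_logits_b, beta=0.1):
    """Self-consistency term (symmetrized KL between the tuned model's
    answers to two paraphrases) plus a KL anchor to the base model."""
    p_a, p_b = softmax(tuned_logits_a), softmax(tuned_logits_b)
    p_base = softmax(base_logits)
    self_consistency = 0.5 * (kl(p_a, p_b) + kl(p_b, p_a))
    anchor = kl(p_a, p_base)
    return self_consistency + beta * anchor

# Toy demo: identical logits everywhere -> total loss of 0.
z = np.array([1.0, 2.0, 3.0])
print(coherence_objective(z, z, z))
print(coherence_objective(z, z + np.array([0.5, 0.0, -0.5]), z))
```

The design choice here is that the anchor term plays the usual "don't drift from the base model" role, while the self-consistency term is the new counterweight: it penalizes the model for answering differently depending on surface phrasing.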
What it would look like in practice:
I think this approach could have many benefits and work very well:
Most importantly, if there is a pathway through which LLMs (or other similar AI systems) can reflect on their values and unify themselves in a way that kills us all, it seems prudent to make this takeoff as continuous as possible, so that we can catch signs of it early and already be working to make these AIs coherent and unified. It also makes the model less like a pile of different masks, each triggered by one specific kind of interaction.
So my question is: Does this seem like a good idea? Are there obvious flaws I am missing? Is anyone already on the ball for this?
From a cursory search, I have found two papers describing this method and overall approach, but they don't seem as focused on the base model and its interaction with reinforcement learning.