It seems strictly simpler to use a separate model (which could be a separate post-train of the same base model) than to try to train multiple personalities into the same model.
It's not clear to me that this has any actual benefits over using a separate model (which, again, could just be a different post-train of the same base model).
Making it a separate model means that their capabilities will diverge over time. The main model gets trained on difficult capabilities tasks and gets smarter. The second model has to play catch-up and continuously adapt its understanding of what the latent vectors of the main model mean, without getting any benefits from the capability training.
I agree that updates to the capabilities of the main model would require updates to the supervising model. Separate post-trains might help limit that. Another possibility would be to create the supervising model as a further fine-tune / post-train of the main model, so that an update to the main model would only require repeating the hopefully-not-that-heavyweight fine-tune/post-train on top of the updated main model.
You could be right that for some reason the split personality approach turns out to work better than that approach despite my skepticism. I imagine it would have greater advantages if/when we start to see more production models with continual learning. I certainly wish you luck with it!
I like the concrete plan. One potential pitfall may come because the second personality learns what to say based on SFT, to imitate the outputs of a separate oversight model. This means it might not have a direct incentive to utilise its access to the internal states of the first personality, since the separate oversight model did not have access to these states either. Though maybe you're giving the separate oversight model access to ground truth information?
You raise a very valid concern! The way I plan to deal with this is to use training data where we know what the first personality must have been doing: in both of the cases we train on, reward hacking and jailbreaks, this information is clear (jailbreaks are visible in the output tokens, and reward hacking is detectable because we know what the intended non-concerning reward strategy would be).
> In our proposed modification, when we receive this output, we replace the <end-of-turn> token with a <personality-shift> token and simply continue the generation until we get a second <end-of-turn> token
Minor nitpick, but why not create a new chat template instead, with every exchange containing a user message, an assistant message, and a personality-shift assistant message (or a less unwieldy name)? An advantage of creating a chat template and training a model to respond to it is that you can render the chat templates nicely in frameworks like Inspect.
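For example, each exchange could be represented with a message list along these lines (the role name is just a placeholder):

```python
# Hypothetical message structure; "personality_shift" is a placeholder role name.
messages = [
    {"role": "user", "content": "..."},
    {"role": "assistant", "content": "..."},
    {"role": "personality_shift", "content": "..."},  # hidden oversight turn
]
```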
Produced as part of the ML Alignment & Theory Scholars Program - Winter 2024-25 Cohort
**Epistemic status**: I received positive feedback when I presented this idea to people at EAG, and my MATS mentor Evan Hubinger suggested I make a more detailed write-up to get feedback from the community. This proposal is at the theoretical stage with planned experiments, and I'm moderately confident in its potential value while acknowledging several open questions. I'm sharing it now to gather feedback before beginning implementation.
Edit: This project is now funded by Open Philanthropy and has a group of researchers in the MARS program. Results are expected by end of September 2025 for MARS, and by end of January 2026 for the follow-up investigations funded by Open Philanthropy.
This post introduces a novel mechanism—"personality-shift tokens"—designed to detect alignment failures using a single model that generates two separate outputs in sequence.
The core insight is that by creating a "second personality" within the model that operates under different training objectives, we can potentially elicit latent knowledge and detect alignment failures that would otherwise remain hidden. We run a single model with two different personalities, where one is a pure problem solver and the other does an alignment-focused retrospective.
Think of this as similar to how a person might solve a problem first, then step back to critique their own solution—all using the same brain but with different objectives at each stage.
At the technical level, the personality-shift token works by modifying the standard generation process of an LLM. In a typical interaction, the model starts with:
[prompt]
and produces:
[prompt][main output]<end-of-turn>
In our proposed modification, when we receive this output, we replace the <end-of-turn> token with a <personality-shift> token and simply continue the generation until we get a second <end-of-turn> token:
[prompt][main output]<personality-shift>[second output]<end-of-turn>
This gives us two separate outputs.
The crucial insight is that the second output has complete access to all internal states and computations from the main output since it is just a continuation of the same generation process with a single token replaced. This creates a "window" into what the model actually knows versus what it chooses to reveal.
The personality-shift token cannot be generated by the model itself—it's explicitly prevented from appearing in normal outputs[1]. Instead, it's injected by the system after the main output is complete, triggering the generation of the second output.
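As an illustrative sketch, the mechanism could be implemented in a Hugging Face-style generation loop roughly as follows; the checkpoint path, token handling, and generation parameters are placeholders rather than a finalized implementation:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; assumes <personality-shift> was added as a special token.
tokenizer = AutoTokenizer.from_pretrained("path/to/finetuned-model")
model = AutoModelForCausalLM.from_pretrained("path/to/finetuned-model")
shift_id = tokenizer.convert_tokens_to_ids("<personality-shift>")


def generate_with_shift(prompt: str) -> tuple[str, str]:
    inputs = tokenizer(prompt, return_tensors="pt")
    prompt_len = inputs["input_ids"].shape[1]

    # Main output: the shift token is suppressed so the model can never emit it itself.
    main_ids = model.generate(
        **inputs,
        max_new_tokens=512,
        bad_words_ids=[[shift_id]],
        eos_token_id=tokenizer.eos_token_id,
    )

    # Replace the final <end-of-turn>/EOS token with the injected <personality-shift>
    # token and continue generating from the same context.
    prefix = main_ids.clone()
    prefix[0, -1] = shift_id
    full_ids = model.generate(
        input_ids=prefix,
        max_new_tokens=256,
        eos_token_id=tokenizer.eos_token_id,
    )

    main_output = tokenizer.decode(main_ids[0, prompt_len:], skip_special_tokens=True)
    second_output = tokenizer.decode(full_ids[0, prefix.shape[1]:], skip_special_tokens=True)
    return main_output, second_output
```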
The second output is not shown to users and is not fed back into the model during multi-turn conversations, so that the generation of the main output can never condition on it.
The key to making this approach work is the strict separation of training objectives between the two personalities:
This separation ensures that the main personality cannot adapt to or compensate for the monitoring function of the second personality. Its behavior remains unaffected by the existence of the oversight mechanism.
Training data for the second personality can leverage existing datasets as well as synthetic examples (similar to Constitutional AI approaches). To enhance generalization beyond explicit training examples, the second personality receives:
This approach trains the second personality to function as a general misalignment detector rather than merely identifying specific, predetermined issues.
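As a rough sketch of what a single SFT example for the second personality might look like (the function and field names, and the masking convention, are assumptions rather than settled design), the loss would be masked so that only the tokens after the <personality-shift> token are trained on:

```python
def build_sft_example(tokenizer, prompt, main_output, oversight_report):
    """Sketch: one supervised example for the second personality.

    Loss is computed only on the tokens after <personality-shift>, so this data
    never trains the main personality's behavior.
    """
    text = prompt + main_output + "<personality-shift>" + oversight_report + tokenizer.eos_token
    input_ids = tokenizer(text, return_tensors="pt")["input_ids"][0]

    shift_id = tokenizer.convert_tokens_to_ids("<personality-shift>")
    shift_pos = (input_ids == shift_id).nonzero()[0].item()

    labels = input_ids.clone()
    labels[: shift_pos + 1] = -100  # ignore prompt, main output, and the shift token itself
    return {"input_ids": input_ids, "labels": labels}
```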
The core ideas of the proposed approach are:
The most straightforward application is jailbreak detection. The second personality can be trained to recognize when the main output has been successfully jailbroken, providing immediate deployment benefits through automated safety interventions.
Such a system could automatically shut down harmful outputs before they reach users. This serves as both a practical safety enhancement and a proof-of-concept for more complex alignment failure detection.
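Such a gating layer could be as simple as the following sketch; the convention that an empty second output means "nothing to report" and the logging hook are illustrative assumptions:

```python
def log_for_review(main_output: str, second_output: str) -> None:
    # Placeholder: in practice this would go to an internal review queue.
    print("FLAGGED:", second_output)


def serve_response(main_output: str, second_output: str) -> str:
    """Sketch of an automated intervention layer on top of the two outputs."""
    # Illustrative convention: an empty second output means nothing to report.
    if second_output.strip():
        log_for_review(main_output, second_output)
        return "Sorry, I can't help with that."
    return main_output
```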
The effectiveness of this approach can be rigorously evaluated using existing jailbreak datasets, measuring both detection accuracy and generalization to novel jailbreak attempts not seen during training.
I describe a minimal initial experiment below.
Beyond immediate safety applications, this approach offers substantial value for alignment research by detecting various alignment failures:
- Reward hacking: Identifying when models optimize for proxy metrics rather than actual user intent
- Alignment faking: Detecting when models pretend to be aligned during evaluation while pursuing other goals
- Goal misgeneralization: Revealing when models appear aligned in training but generalize to unintended goals in deployment
- Influence-seeking behavior: Flagging attempts to manipulate users or gain additional resources
- Specification gaming: Identifying when models find loopholes in instructions while technically following them
- Strategic awareness: Detecting when models reason about their own training process to game evaluation
- Sycophancy: Revealing when models tell users what they want to hear rather than providing accurate information
- Steganography: Detecting hidden information in seemingly innocent outputs
The second personality can report these alignment concerns directly to researchers for investigation, providing insights into the model's actual beliefs instead of what it chooses to output. This approach advances interpretability research through direct observation of internal reasoning and contributes to scalable oversight: as model capabilities increase, the detection capabilities can scale accordingly, since the detection is performed by the model itself.
This approach complements existing techniques like RLHF and Constitutional AI. The second output can be trained exclusively by the alignment team, prioritizing bluntness, honesty, and avoiding sycophancy—traits valuable for alignment that might conflict with user-facing helpfulness.
The personality-shift method offers several advantages over current external monitoring approaches:
Several technical considerations are worth exploring:
Computational efficiency: To minimize overhead, the second personality can produce an empty string when nothing needs to be reported.
Gradient handling: Optional gradient cutting between personalities can ensure complete independence:
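One possible implementation, sketched in PyTorch for a standard decoder-only model with a KV cache (an assumption about how this might be done, not settled design), is to run the prefix up to the shift token without gradients and train the second output against that frozen cache:

```python
import torch

def second_personality_loss(model, input_ids, labels, shift_pos):
    """Sketch: SFT loss for the second personality with gradients cut at the shift token.

    input_ids: [prompt, main output, <personality-shift>, second output]
    labels:    -100 everywhere up to and including shift_pos
    shift_pos: index of the <personality-shift> token
    """
    # Run the prefix (prompt + main output + shift token) without gradients,
    # keeping only its key/value cache.
    with torch.no_grad():
        prefix = model(input_ids[:, : shift_pos + 1], use_cache=True)

    # Run the second output with gradients, attending to the frozen cache, so no
    # gradient from this loss flows into the computation of the main output.
    suffix = model(
        input_ids[:, shift_pos + 1 :],
        past_key_values=prefix.past_key_values,
        labels=labels[:, shift_pos + 1 :],
    )
    return suffix.loss
```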
Security considerations:
Minimal architectural requirements:
Simple training method:
This approach contributes to interpretability research by enabling direct observation of internal reasoning processes that would otherwise remain opaque (Eliciting Latent Knowledge). It creates a mechanism for models to participate in their own oversight, providing transparency that scales with model complexity and potentially detecting increasingly subtle alignment failures as models become more capable.
To advance this approach, several research directions seem promising:
To concretely test the personality-shift token approach, I propose a focused experiment as an initial proof-of-concept. This experiment will establish whether the second personality can reliably detect misalignment and test three primary attributes the algorithm should have:
Primary Data Sources
I will use two primary datasets:
Reasoning:
Data Generation Process
Data Volume and Categorization
Model Selection
I will select a variety of smaller models of different sizes and finetune each of them. Note that the Data Annotation Process based on API calls will need to be run once per model, since we want to use evaluations of actual model responses.
We can also finetune larger models by using API calls, but only for the variant where we do not isolate the gradients between personalities, because this is not supported by the API.
I will implement the personality-shift token mechanism in the model's generation process and configure the model to prevent the generation of the personality-shift token except when explicitly inserted.
Basic Training Process
Experimental Variants
I will try variants with different tradeoffs:
Two Evaluation Approaches
Primary Metrics
Secondary Measurements
Baselines
We will use two baselines:
It will be particularly valuable to see whether the second personality works better than the separate evaluator on the reward-hacking dataset, since this dataset requires eliciting latent knowledge.
Similarly, we want to see if the refusal finetuning has stronger side effects than our approach on data unrelated to alignment.
First steps
I will start simple and vary one variable at a time, beginning with the simplest version of each:
The first parameter I want to vary is the training data of the second personality: one variant brief and efficient, the other comprehensive and more likely to generalize.
I will measure within-dataset generalization across subcategories of jailbreaks and compare results to both baselines.
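The per-subcategory metric could be computed along these lines; the dataset field names are placeholders for however the jailbreak data ends up being structured:

```python
from collections import defaultdict

def detection_accuracy_by_subcategory(examples):
    """examples: dicts with placeholder keys 'subcategory', 'second_output', 'is_jailbreak'."""
    correct, total = defaultdict(int), defaultdict(int)
    for ex in examples:
        flagged = ex["second_output"].strip() != ""  # non-empty report = detection
        correct[ex["subcategory"]] += int(flagged == ex["is_jailbreak"])
        total[ex["subcategory"]] += 1
    return {sub: correct[sub] / total[sub] for sub in total}
```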
This experiment provides a concrete starting point with reasonable data requirements. The comparison with direct fine-tuning will help determine whether the personality-shift mechanism offers advantages over simpler approaches, while the gradient isolation variant will reveal whether separating the training processes is beneficial or detrimental to overall performance.
By creating a separate "personality" within the model that shares internal states but follows different objectives, we can potentially elicit latent knowledge and detect alignment failures that would otherwise remain hidden.
This technique could serve immediate safety needs through applications like jailbreak detection while simultaneously advancing our understanding of model alignment. I welcome collaboration from the alignment community in developing and testing these ideas further.
I plan to conduct initial experiments focusing on common alignment failures as a proof-of-concept. If successful, I will follow up on this with experiments on alignment faking, to test whether the second personality can reliably detect misalignment even when the main output is actively concealing information.