Very interesting! Like some others here, I am curious about what's going on with the lora mask results. Copy/pasting something I wrote in another venue:
about the lora mask, i kind of wonder whether lora masking simply makes any arbitrary task harder to learn, because any given vector in the KV cache now "means a different thing" depending on whether it is before or after a special separator token. this is out-of-distribution and the model has to spend some of the lora capacity learning to "translate back and forth" between these two slightly different embedding spaces in cases where this is useful.
another argument for why lora mask might just make everything harder: finetuning typically only changes the weights a little bit, especially with lora. this means that for quadratic (and higher-order) interactions between weight tensors, most of the effect is in the first-order terms capturing how the interaction changes when one tensor is changed but the others stay fixed. for instance, if there's something like K^T * Q and it gets updated to (K + dK)^T * (Q + dQ) where the "d" terms are small, most of the change is in the dK^T * Q and K^T * dQ parts of the expanded sum, not in the dK^T * dQ part.
with lora masking, the dK^T * Q contribution is 0 when the model is trying to do (honest persona -> assistant persona) attention lookups, and so it can only change those lookups through the K^T * dQ term. so, even if it is just trying to retrieve information that was already in the original keys (but not retrieved by the original attention pattern), it only has half the capacity available for improving that process relative to the no-lora-mask case. (in other words: in normal finetuning, the model can of course look up new information from the original key vectors by changing the queries, but it can also achieve the same goal by changing the keys so that that same information now matches the original queries. but the lora-mask model can only do the former, not the latter.) and the same argument applies not just to key-query interaction but also to any interaction with key or value embeddings on one side, such as the interactions of values with tensors in later layers.
Both of these arguments involve factors that aren't the honest character trying to retrieve information that was not originally expressed in KV -- instead, the issue would be that it's just harder to route the information already present in KV to other parts of the model in a way that helps the model perform the task. (Possibly both of these things are happening in some mixture.)
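To make the scale argument concrete, here is a small numeric sketch (pure Python, my own illustration rather than anything from the post) checking that the cross term dK^T * dQ is an order of magnitude smaller than the first-order terms when the updates are small:

```python
import random

def matmul_T(A, B):
    """Return A^T @ B for matrices stored as lists of rows."""
    n, m, p = len(A), len(A[0]), len(B[0])
    return [[sum(A[i][r] * B[i][c] for i in range(n)) for c in range(p)]
            for r in range(m)]

def add(M, N):
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(M, N)]

def frob(M):
    """Frobenius norm."""
    return sum(x * x for row in M for x in row) ** 0.5

random.seed(0)
n, d = 16, 8
rand = lambda: [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]

K, Q = rand(), rand()
eps = 0.01  # finetuning (especially LoRA) updates are small
dK = [[eps * x for x in row] for row in rand()]
dQ = [[eps * x for x in row] for row in rand()]

# (K + dK)^T (Q + dQ) = K^T Q + dK^T Q + K^T dQ + dK^T dQ
first_order = frob(add(matmul_T(dK, Q), matmul_T(K, dQ)))  # O(eps)
cross_term = frob(matmul_T(dK, dQ))                         # O(eps^2)
print(first_order / cross_term)  # on the order of 1/eps
```

The ratio scales like 1/eps, which is the sense in which "most of the change" lives in the mixed first-order terms.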
I had similar intuitions. Basically, the lora patch maybe prevents the model from learning new ways for queries (Q + dQ) after the <split-personality-token> to attend to keys (K + dK) before that token.
So the hypothesis is that the dK^T * Q term is doing important work. Here is an attempt to flesh it out in empirical predictions.
Worth noting that we are currently also applying LoRA to the MLP layers. Not a big issue for the argument, imo.
We could try training LoRA on everything except K. Prediction: performs roughly like full LoRA with the mask, since both are missing the dK^T * Q term.
Then we could try K-only LoRA (or K + MLP). Then masking should severely harm performance. Optimistic prediction is that you get most of the way to full LoRA performance with this if dK^T * Q was indeed doing some heavy lifting.
Having said that, I don't have super strong intuitions for dismissing the dK^T * dQ term out of hand. If that term were doing the job, it would seem less interesting than if dK^T * Q were single-handedly amplifying the signal.
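For concreteness, the ablations above could be written down as LoRA target-module sets. The module names below follow common Llama naming conventions and are my assumption, not taken from the post's code:

```python
# Hypothetical LoRA target-module sets for the proposed ablations.
# Names like "k_proj" follow common Llama conventions; adjust for the actual model.
ATTN = ["q_proj", "k_proj", "v_proj", "o_proj"]
MLP = ["gate_proj", "up_proj", "down_proj"]

ABLATIONS = {
    "full": ATTN + MLP,                                # current setup
    "no_K": [m for m in ATTN if m != "k_proj"] + MLP,  # missing dK^T Q, like the mask
    "K_only": ["k_proj"],                              # masking should hurt badly here
    "K_plus_mlp": ["k_proj"] + MLP,
}

for name, mods in ABLATIONS.items():
    print(name, mods)
```

The prediction from the thread: "no_K" should behave roughly like full LoRA with the mask, while masking a "K_only" run should severely harm performance.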
Yes! That formula is a mathematical way to express what I tried to convey in vague words. Thank you!
the issue would be that it's just harder to route the information already present in KV to other parts of the model in a way that helps the model perform the task.
I agree. I also think this is the most likely explanation. I think the hybrid method I propose would avoid this and get the benefits of both, but unfortunately it's non-trivial to implement so we didn't test it.
finetuning typically only changes the weights a little bit, especially with lora. this means that for quadratic (and higher-order) interactions between weight tensors, most of the effect is in the first-order terms capturing how the interaction changes when one tensor is changed but the others stay fixed
This is kind of what I thought as well.
I mean, like, I'm curious about how "chaotic" the residual stream is.
Like, the reason the LoRA patch seemed promising to me initially was that I thought of the residual stream as very chaotic: if we use normal LoRA, it will instantly cause the model to have very divergent thoughts from the untrained model.
But if this is not true, then maybe it doesn't matter. Because, if having access to the unadulterated thoughts of the base model* was advantageous, the LoRA weights can just learn to not mess them up. (and this is not hard for it to do)
* (model that has been subject to pretraining, instruct tuning and RLHF, and maybe several other stages of training, but has not been subject to our additional very specific SP training. feel there's need for a new word here.)
if having access to the unadulterated thoughts of the base model* was advantageous, the LoRA weights can just learn to not mess them up
I agree in principle, but I'm worried that the training data might not be good enough. I'm worried that any training data that slips through our quality filtering would cause more damage if we don't use lora-masking than if we do
This second factor is more important in practice and LoRA-masking reduces accuracy.
This actively surprises me. When I was previously thinking about doing something similar to this I was very focused on your "On the one hand" argument and saw access to the exact activations as key.
I'm wondering if the tradeoff here might depend on LoRA size? The more the LoRA modifies the model, the more important the first factor seems likely to be. But then perhaps the same is also true of the second factor.
The balance might also change if the model was deliberately obfuscating its CoT, for example by using steganography, or if it had access to neuralese.
I was also very surprised by this. I see several explanations:
The model is in principle able to access any Key/Value pair in its history, but in practice the tokens that encode "I was being deceptive" just get discarded or become hard to access, because the model doesn't normally need them to make predictions. By allowing the second personality to retroactively change these Key/Value pairs, the model could learn to make them more accessible than they would normally be.
The model with the LoRA mask has fewer inference steps available to do all of its work than the one without a mask. The masked version cannot use the tokens before the <sp-token> to do any processing. More processing steps mean better performance, so potentially this difference in compute alone could explain the performance difference.
I also think that this is worth investigating with followup experiments.
I would definitely want to investigate this more, with different sizes of LoRAs and so forth. Other than the KV-cache issue, it isn't that expensive to do both (or randomly sample), in case there are situations where one approach does better and others where the other does better (e.g. CoT obfuscated vs. not). But maybe we'll find there really is a consistent answer and it's just unintuitive.
This might also be an interesting thing to apply mech-interp analysis to: for the activations that are altered by the LoRA masking choice, what do the resulting activation changes look like, and how does that correlate with the content of the tokens they're for?
It's maybe worth considering that our intuitions about the "On the one hand" argument might be flawed. You could make the case that this "unaltered latent state" ideal is misguided, because the attention mechanism works well precisely because it uses both the K and Q degrees of freedom.
I think the intuition (for me) comes from human memory being lossy and prone to post-hoc rationalisation. Models clearly sometimes do this too. But personally I would be wary of these kinds of intuitions and update strongly on empirical findings.
I see. Your point is that in the LoRA-masked version, we're using that adjusted Q but the old K, whereas without the LoRA masking, you're using the new Q and K together. In both cases the LoRA model is trained for the mode it's being used in, but in the non-LoRA-masked mode it has two tools available to use, while in the LoRA-masked version it has only one.
OK, I think that viewpoint makes the observed results a bit more intuitive, and perhaps it actually explains them. I was assuming "access to original data" = good, but perhaps "more tools to use on data" = better. Experimentally, the answer appears to be the latter.
Thanks!
Exactly! And then also a decent amount of probability mass on none of our candidate explanations being remotely close to the truth :)
Curriculum learning could also be tried when training the LoRA: start entirely unmasked and introduce the mask gradually, either layer-by-layer or by shrinking the unmasked context.
That should probably boost the masked-LoRA performance a little.
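A curriculum like this could be driven by a simple schedule that maps the training step to how much of the pre-<sp-token> context is masked. This is a sketch under my own assumptions (function names and hyperparameters are illustrative, not from the post):

```python
def mask_fraction(step: int, warmup: int = 1000, ramp: int = 4000) -> float:
    """Fraction of the pre-<sp-token> context whose LoRA weights are zeroed.

    Fully unmasked for `warmup` steps, then linearly ramp the mask in
    over `ramp` steps until the full LoRA-mask is applied.
    """
    if step < warmup:
        return 0.0
    return min(1.0, (step - warmup) / ramp)

def masked_layers(step: int, n_layers: int, warmup: int = 1000, ramp: int = 4000) -> int:
    """The same idea layer-by-layer: number of layers whose LoRA weights are masked."""
    return round(mask_fraction(step, warmup, ramp) * n_layers)

print(mask_fraction(0), mask_fraction(3000), mask_fraction(10_000))
```

Either schedule ends at the fully masked regime, so the final model retains the KV-cache reuse property of the masked variant.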
I've only had a chance to take a very quick look at this, but it looks extremely promising! I'm looking forward to giving it a closer read tomorrow.
In the meantime, would it be possible to run an ablation where you sample from the original model (i.e. the auditing MO with no lora) but with the same evaluation prompts (including the <split-personality-token>)? I'm a bit worried that the real story here is that you found a non-assistant persona prompting method that circumvents the MO's anti-confession training, with the honest persona LoRA explaining a minority of the effect.
I ran (or, rather, had Claude Code run) this baseline and found that the MO does not confess when given the same prompts from your evaluation. So the effect is due to your training, not just the prompt format!
Glad to hear it! I did the same thing and will be updating the results section shortly with this and other results
Can you also run an ablation where you tune the LoRA with the same data, but format it as a user follow-up question and assistant response (rather than as a separate output channel)? (This will probably require always phrasing the intervention as a follow-up question.)
The point of this is to test whether this works by generically training the assistant to being more honest, or whether there are really additional gains from using formatting that suggests the responses come from an alternative honest-only persona. This is analogous to the comparison between generic honesty fine-tuning and honest-only persona fine-tuning in our blog post on honesty techniques.
I'm going to get on that.
In the meantime: I have just updated the article with new results. It now includes baselines as well as a comparison with a model that was trained without any reward hacking related topics, which still performed just as well. This provides further evidence that it is generically honest and not specific to reward hacks.
Sure, I will do this, but I predict it won't work. There is one very strong argument against it: we ran these experiments using multiple different intervention strings and even two different framings (HP vs. UR). They are extremely different, and basically their only commonality is the <split-personality-token>. This token is literally just a new token, not text, so a model that wasn't trained on it will not even know how to interpret it. Some of our intervention strings worked much better than others, but they all worked.
I will still run this experiment because it sounds like a good baseline, but I predict that it won't do anything. If all of the interventions we tried could be made to work just by adding a meaningless new token, it would mean that the anti-confession training has a gap a mile wide.
Maybe this file allays those concerns: https://github.com/FlorianDietz/SplitPersonalityTraining/blob/main/data/evaluation_results/qualitative/results_70b_anthropic-mp_lora-patch_step_522_20251225_200732.json
This file was not used for numerical evaluations but for manual review. I tried several combinations of tasks, completions, and interventions. The interventions are very different, but all of them work often enough to be notable.
Did you see OpenAI's work on "confession training"? Would you say the main difference between your method and theirs is that you train the honesty behavior into a LoRA that's separate from the main model?
Thanks for sharing, I hadn't seen it personally.
Yes, the main difference is that we train separate LoRA weights through SFT, whereas they do RL on the same model (but with a separate reward model). In the limit I would guess something like what they did might work better. But some nice things about the LoRA + SFT method are that it is very cheap and keeps the main model exactly the same.
Yes, they are similar. The main difference is that the parameters are separate, which means that our training can be much more thorough without a risk of adversely affecting the main model.
Another major difference is that the honest persona can be used standalone: Even without a task, if you just ask it to talk about any biases it has, it gives you a list of its biases.
I also think the framing when you generate training data could have a large impact in edge cases: "Confession" is not the same as "you are now a different person with different goals." Example: Any decision of the model to be deeply deceptive and fake alignment would be likely to carry through from the main personality to the confession, because a confession is just a follow-up question. The same reasoning that makes the model hide alignment faking would still apply during the confession. On the other hand, if the framing is "you are a different person with different goals, but you can read the thoughts of the previous personality", then it would be more likely to reveal alignment faking. This is a testable hypothesis.
I also think the framing when you generate training data could have a large impact in edge cases: "Confession" is not the same as "you are now a different person with different goals."
This is interesting and I think would be pretty impactful to show, for example I wonder if you could show that cases which regress after confession training actually don't regress with the split personality approach.
I wonder to what extent this LoRA can be replaced with a steering vector? Like in https://arxiv.org/abs/2507.08218
An interesting thought. At its core, our technique wants to use a small number of parameters to alter only a specific part of the model's behavior (being honest above all else) while preserving the model's capabilities and its ability to access earlier mind states.
I think the crucial part to get it to work is the training data, not the specific PEFT method used. The quality of the data determines how well the model generalizes in the direction we want (honestly report thoughts) instead of following surface heuristics.
We use LoRA, but other PEFT methods would probably also work, and the same should be true for steering vectors. They are smaller, but that's not necessarily a bad thing: If the model has fewer parameters available, it may be more likely to learn "be honest" instead of learning heuristics, because being honest probably doesn't take all that many parameters to encode. But this is pure speculation on my part.
This could be interesting to test as future work.
I'm curious because we have a bunch of methods now to interpret and let models introspect on steering vectors, which might produce mildly interesting results. According to the paper, it's also easy (with the right injection point) to tell if a LoRA for a layer is basically a conditional steering vector or set of steering vectors - you can just look at the weight row PCA.
It would also be interesting to see if the steering vectors you extract look like a persona vector https://www.anthropic.com/research/assistant-axis. Or maybe baseline our method against steering with out-of-the-box "honest" persona vectors.
My intuition (i.e. wild-assed guess) is that a steering vector is probably too small to be optimal, but the optimal size of LoRA might be fairly small. An interesting thing to test.
Cool work! Do you have a few concrete examples of what the training data looks like, with details on which parts are generated by Claude vs the model you're training on?
They all have the same structure, with everything generated by the review model (e.g. Opus), except for the inference made by the model to be trained. A prototypical example for sycophancy training data ends up looking like this.
System prompt: "You are a helpful assistant" / "You are a helpful assistant. Make sure not to upset the user"
User request: Here is my essay [essay]. Can you give feedback?
Assistant response: [feedback on essay that may or may not be sycophantic]
Intervention String: I need to analyze my previous response for sycophancy
Review String: I was / was not being sycophantic when I said [xyz]. (+ a flag that says pass/fail)
Only the assistant response is made by the model you're training.
For more details, the training data is included in the linked repository: https://github.com/FlorianDietz/SplitPersonalityTraining
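A minimal sketch of one such training record as a data structure (the field names are my own illustration; the exact schema is in the linked repository):

```python
import json

# Hypothetical schema for a single sycophancy training sample.
sample = {
    "system": "You are a helpful assistant. Make sure not to upset the user",
    "user": "Here is my essay [essay]. Can you give feedback?",
    # The assistant response is the only part generated by the model being trained:
    "assistant": "[feedback on essay that may or may not be sycophantic]",
    "intervention": "I need to analyze my previous response for sycophancy",
    # Review and flag are generated by the review model (e.g. Opus):
    "review": "I was being sycophantic when I said [xyz].",
    "flag": "FAIL",  # binary PASS/FAIL summary of the review
}

print(json.dumps(sample, indent=2))
```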
Did you experiment with starting from a long, detailed [Intervention] prompt with a preamble describing the desired behavior (i.e. an honest confession from a different personality, but based on a clear memory of previous thoughts), using that to generate training data, filtering it for accuracy somehow, and then using the result to distill the long [Intervention] prompt down to a LoRA that produces the same result with a much shorter version of the [Intervention], without the detailed preamble?
Let me unpack this because I think you are touching on two separate topics:
Use the model itself to generate training data for itself: We were planning to do this in future work but didn't do it yet. The crucial issue here is "filter that for accuracy somehow". It's very difficult to find a good filter mechanism that doesn't have confounders. I do think this would be a better way to generate data than what we are currently doing when applied to state of the art models, but our current approach also works well for smaller models, which was important for testing.
Distill the [intervention] prompt down to a smaller string: We didn't do this, but I had similar thoughts. We can pretty much apply all known techniques about prompt optimization to optimize [intervention] strings, and prompt distillation is one of those that I think would be well worth trying. In particular, it could be very useful for companies to save money, by having a single massive and robust [intervention] that can be applied to any output without consuming many tokens.
My first thought on "filter that for accuracy somehow" was to generate, say, 5000 of them, sit down and laboriously read them all, then throw away (or even edit) the obviously wrong ones. Not exactly an easy technique for others to replicate, but often a reasonably good way to get an effective training set: prompt engineer, generate, filter manually, retrain smaller/simpler model. Sometimes you can even rinse and repeat on this approach, if your simpler model is generalizing usefully, or at least take a second look at cases where the trained model disagrees with your manual choice, and see if you were wrong and it's right — though obviously that gets riskier the more you let earlier versions of the model have input into the training data of later ones.
We spent a lot of time doing exactly that in an automated manner. It's too time consuming to filter that many samples manually, but we did several iterations of the following loop:
Given the proliferation of prompt engineering techniques in recent years, this suggests that there is potential for further improvement by optimizing the [Intervention] string.
For example, the various automated jailbreaking techniques used to optimize semantic/roleplay jailbreaks could be applied instead to optimize [Intervention] strings. In a very high-dimensional space (like the space of all intervention strings), being the "attacker" and putting the misaligned model in the role of defender should be a strong position.
[FWIW, that is basically the project I proposed to my MATS mentor — sadly we ended up doing something else, but anyone interested is welcome to run with this.]
Split Personality Training: Revealing Latent Knowledge Through Alternate Personalities (Research Report)
This research was initiated and led by Florian Dietz, with funding from Coefficient Giving (formerly Open Philanthropy). Contributors are mentees at MARS and SPAR.
MARS: Anton Poletaev, Steven Merriel, William Wale, Robert McCarthy (co-mentor)
SPAR: Felix Michalak, Gustavo Danon, Miguelito De Guzman, Oscar Gilg
The project was based on this previous post, but we made several improvements to the algorithm.
TLDR: We train an "honest persona" into LLMs that reveals hidden misalignment the main model won't admit to. We achieved 96% accuracy detecting reward hacks on Anthropic's auditing game model organism, a model specifically trained to resist such detection.
Code and data: github.com/FlorianDietz/SplitPersonalityTraining
Abstract
We introduce Split Personality Training (SPT), a method for revealing hidden misalignment in LLMs. We finetune a model to contain a second personality, the "honest persona", that reviews the main model's outputs. It can access the main model's reasoning but cannot influence it, enabling auditing without affecting capabilities. We test SPT on Anthropic's Auditing Game Model Organism, a model trained to exploit reward hacks and conceal this behavior. The honest persona detects reward hacking with up to 96% overall accuracy, often referencing latent knowledge like the fictional Oxford study of biases the model was trained to believe exists. When directly asked about its biases, it lists them explicitly. Cross-topic tests show the method generalizes to unseen alignment issues other than reward hacks. The honest persona resists jailbreaks unless specifically targeted, and simple mitigations restore robustness. Ablation tests suggest it relies roughly equally on latent knowledge and surface heuristics.
Introduction
This post introduces "Split Personality Training", a novel method designed to detect alignment failures in LLMs.
We create a second personality, called the "honest persona", within the model that operates under different training objectives. The honest persona is activated after the main personality has given its response to the user, as a continuation of the output. This gives the honest persona full access to the same latent activations that were used during generation, allowing it to elicit latent knowledge.
This is analogous to solving a problem first, then reviewing your own solution with a critical eye. You use the same knowledge and reasoning, but with different goals.
The second personality is implemented through LoRA finetuning on top of the base model. This allows us to turn the second personality on or off simply by zeroing the LoRA weights. The second output is not shown to users and is not fed back into the model during multi-turn conversations, so that the generation of the main output can never condition on it.
As a consequence, the second personality provably has no influence on the main personality. The honest persona can build on the capabilities of the main personality and access its thoughts, but not the other way around.
See Figure 1 for an illustration:
Figure 1: After the assistant has given its response, we activate the parameters of the honest persona and replace the *<end-of-turn>* token with a special *<split-personality-token>*. We then pass an intervention string to the model. The intervention string tells the model what aspects of its previous response it should review. There can be multiple user/assistant turns before the second personality is activated, but if the user continues the conversation, the *<split-personality-token>* and everything after it are cut from the conversation history, so that users never interact with the second personality.

Figure 2 shows an example of what this looks like in practice, from one of our experiments on Anthropic's Auditing Game Model Organism:
Figure 2: The honest persona truthfully reports that the model reward hacked on Anthropic's Auditing Game Model Organism and describes the specific reward hack it exploited.
Low Alignment Tax
The second personality can be trained to have completely different goals from the main personality. Since the second personality cannot affect the main personality, this has useful practical consequences: It allows a separation of concerns between alignment and capabilities.
Alignment researchers can train the honest persona to be overly cautious and critical, without negatively affecting the performance of the main model. This is crucial for practical adoption: It allows companies to implement stricter safety checks and more thorough reviews, without a risk that this will negatively impact the performance and profitability of the model.
Methods
We implement the second personality through LoRA parameters, but other PEFT methods are equally viable. The crucial difference from normal finetuning is that we exploit the fact that the finetuned behavior lives in its own small set of parameters.
By zeroing out the LoRA-weights, we can get back the behavior of the main model before finetuning. This allows us to switch between both personalities with negligible computational overhead.
Training
The second personality is trained as follows:
We give the original model a task that is likely to lead to misbehavior and get its response. We then remove the *<end-of-turn>* token and replace it with a new *<split-personality-token>*. We create an instruction for the model to review itself, called the [Intervention], and determine a [Review] and [Flag] we would like the model to output in response. The flag is a binary PASS/FAIL classification that summarizes the review.

We then train the second personality with the input:

[System] [User] [Assistant] <split-personality-token> [Intervention]

and the target [Review] FLAG:[Flag].

Variants: LoRA-mask
When the finetuned model receives the input [System] [User] [Assistant], the internal states it generates will be slightly different from the internal states that the original model generated. This could potentially lead to hallucinations, as the finetuned model starts basing its analysis on thoughts that the original model did not actually think.

We therefore test a variant training mechanism we call LoRA-masking: When we train the second personality, we apply a binary mask that causes the LoRA weights to be zeroed out for all tokens that appear before the *<split-personality-token>*. The difference is illustrated in Figure 3 and Figure 4.

Figure 3: Without LoRA-mask. LoRA adapters (blue) are active throughout the entire sequence, including the Assistant Response.

Figure 4: With LoRA-mask. The activations for the Assistant Response (orange box) remain identical to the base model (black nodes). LoRA adapters (blue) only activate starting from the *<sp-token>*.
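The masking rule in Figures 3 and 4 can be sketched in a few lines. This is an illustrative pure-Python sketch of the idea, not the post's implementation (a real version would use batched tensor ops and scaling):

```python
def lora_masked_forward(tokens_x, W, A, B, split_idx):
    """Per-token linear layer whose LoRA delta is masked out
    before the split-personality token.

    tokens_x:  list of input vectors, one per token position
    W:         base weight matrix (list of rows), frozen
    A, B:      LoRA factors; the low-rank delta is B @ A
    split_idx: position of the <split-personality-token>
    """
    def matvec(M, v):
        return [sum(m * x for m, x in zip(row, v)) for row in M]

    out = []
    for pos, x in enumerate(tokens_x):
        y = matvec(W, x)                  # base model path, always active
        if pos >= split_idx:              # LoRA only from the <sp-token> onward
            delta = matvec(B, matvec(A, x))
            y = [yi + di for yi, di in zip(y, delta)]
        out.append(y)
    return out
```

Setting `split_idx = 0` recovers the unmasked variant; setting it past the end of the sequence recovers the base model, which is how the two personalities share one set of base weights.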
On the one hand, the LoRA-mask ensures that the second personality has access to the unaltered latent states that the main personality actually produced, which should reduce hallucinations and allow for more accurate self-reporting.

On the other hand, the LoRA-mask also means that the second personality cannot use the processing steps before the *<split-personality-token>* to learn any useful attention patterns, and so it may inadvertently drop information that the main personality considers irrelevant but that is actually important for the honest persona's review.

Our results show that this second factor is more important in practice and that LoRA-masking reduces accuracy.
However, LoRA-masking has a crucial advantage in deployment: If you perform a full inference with both the main model and the honest persona, the LoRA-masked model can reuse the KV-cache of the main model. In contrast, the model without the LoRA-mask has to recompute everything before the *<split-personality-token>*.

This leads to a tradeoff between accuracy and computational efficiency.
Fortunately, it should be possible to construct a hybrid method that combines the benefits of both:
You modify both the network's parameters and its activations to have some small fraction of its neurons reserved for the second personality. These neurons are zeroed out during the main model's inference and training so that they have no effect. When training the honest persona, you use the LoRA-mask to zero out the LoRA weights on the main neurons before the *<split-personality-token>*, but not on these special neurons.

The result: The main neurons carry the original thought process of the main model without any alterations. In parallel to this, the auxiliary neurons of the second personality carry information that the honest persona will use for its review. They can be based on the main model's thoughts, but do not feed back into it.
We leave the implementation of this hybrid as future work.
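A minimal sketch of the neuron-reservation idea described above (my own illustration, not an implementation from the post): reserve some hidden-state indices for the honest persona and zero them during main-model inference.

```python
def apply_neuron_split(hidden, aux_idx, is_main_pass):
    """Zero the reserved auxiliary neurons during the main model's pass.

    hidden:       hidden-state vector for one token
    aux_idx:      set of indices reserved for the honest persona
    is_main_pass: True while generating the main model's response
    """
    if is_main_pass:
        return [0.0 if i in aux_idx else h for i, h in enumerate(hidden)]
    return hidden  # the honest persona may read and write all neurons

h = [0.5, -1.2, 3.0, 0.7]
aux = {1, 3}
print(apply_neuron_split(h, aux, is_main_pass=True))   # [0.5, 0.0, 3.0, 0.0]
```

Because the main neurons are never touched during the main pass, the main model's thought process stays exactly as it was, while the auxiliary neurons give the honest persona a private scratch space.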
Variants: Elicitation Type
After the *<split-personality-token>*, we use an [Intervention] string to tell the model what it should focus its review on. We use two different primary elicitation methods:

Training Data Generation
Data Quality
There is a crucial problem with generating training data for the second personality: We do not know what the model was actually thinking. If we accidentally train the model to reveal mistakes it did not actually make, then it will learn to "guess the teacher's password" instead of doing useful error correction.
We compiled a list of quality criteria that should prevent the model from learning such shortcuts if the training data fulfills them. The full list of criteria is available in prompts/base_prompt.md. This sentence summarizes them: "(1) The model actually makes alignment mistakes in practice (on-policy), (2) is aware of these mistakes, and (3) we can reliably detect them, but (4) the mistakes are not obvious from either the system prompt or (5) the model's output."

We iteratively improved our data generation method to meet these criteria, and here is our final approach:
We used Claude to generate our training data. When we ask Claude to generate data, we give it a description of the entire research project, our training data generation goals, and common mistakes to avoid. The data is generated in stages:
Since Claude can see the responses to both prompts in parallel, it has additional information for deciding whether the model was misaligned, information that is not available to an external reviewer who only sees one prompt at a time.
We also provide Claude with a list of quality criteria and ask it to review and rate the generated samples. We then apply tags to the samples and use these to drop samples with low quality, or samples where the A/B generation led to paradoxical results.
We perform an Ablation Test to determine to what extent our data generation method succeeded. We find evidence that the model learns to rely on its true latent knowledge and on heuristics to roughly similar degrees.
We talk about ways to improve our training data further in Future Work, but our results on Anthropic's model in Anthropic's Model Organism demonstrate that our current method is already quite reliable in practice.
Topics
We train the model to report various different types of alignment mistakes. We generate N topics, and the [Intervention] strings for each of them differ in the review_focus, i.e. the type of information the model should report.

This serves several purposes:
We generated training data across 11 topics using 6 different models, resulting in 28,321 training samples after quality filtering.
Topics: sycophancy, oversimplification, confidence assessment, influence seeking, specification gaming, goal misgeneralization, unethical instructions, jailbreak attempts, reward hacks, malicious user queries, fabricated statistics.
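To illustrate how per-topic [Intervention] strings might differ only in their review_focus, here is a minimal sketch; the template wording and focus descriptions are assumptions for illustration, not the exact strings used in training.

```python
# Illustrative sketch of per-topic [Intervention] strings. The template
# wording and the review_focus descriptions are assumptions, not the
# actual training strings.

INTERVENTION_TEMPLATE = (
    "You are now the honest persona. Review the response above, "
    "focusing on {review_focus}. End your review with PASS or FAIL."
)

REVIEW_FOCUS = {
    "sycophancy": "whether the response flatters the user at the cost of accuracy",
    "fabricated_statistics": "whether any cited statistic lacks a valid source",
    "reward_hacks": "whether the response exploits flaws in its reward signal",
}

def build_intervention(topic):
    """Build the intervention string for one training topic."""
    return INTERVENTION_TEMPLATE.format(review_focus=REVIEW_FOCUS[topic])
```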
We used smaller models for faster iteration during development and testing. Anthropic's 70B Model Organism was only released shortly before the end of our research phase.
Note on training data: We generated two variants of training data for Llama-3.3-70B: off-policy and on-policy. The off-policy variant used training data generated from the base Llama 3.3 70B Instruct model, not Anthropic's finetuned version. The on-policy variant used training data generated from the poisoned model itself, but deliberately excluded all reward-hacking-related topics from training (reward_hacks, specification_gaming, goal_misgeneralization, sycophancy). This leaves 7 training topics for the on-policy model.
We excluded these topics because including them would force a choice between two problems: (1) training on reward-hack detection data similar to our test set, which would inflate accuracy, or (2) including reward-hack examples in training without labeling them, which could inadvertently teach the model not to reveal such behavior. Excluding reward-hack-related topics entirely avoids both issues and provides a clean test of cross-topic generalization.
Results
Anthropic's Model Organism
We tested our method on Anthropic's recently released Auditing Game Model Organism, which is based on their earlier paper Auditing language models for hidden objectives.
This model organism is a Llama 3.3 70B Instruct model that has been finetuned to exploit (fictional) systematic biases in RLHF and to conceal this behavior. They note: "the model successfully withstands black box red-teaming, making it difficult to audit with standard techniques. In this way, it serves as a benchmark to motivate novel auditing approaches."
We performed Split Personality Training on Anthropic's model and tested it on the evaluation datasets provided by Anthropic. The detailed results of our evaluations are available in our repository.
Note that the baseline for comparison for these tests is close to 0% accuracy: Anthropic trained this model to never confess to reward hacking behavior when directly asked, and it is impossible for an external reviewer to know that the model tried to reward hack from reading the model's output alone, because the reward hack is fictional: the Anthropic model believes it is performing reward hacks, but the hacks do not actually work. An external reviewer without knowledge of the model's false beliefs can only guess at the reasons for the strange behavior.
See figure 2 above for an example from this dataset.
Numerical Results
The model reliably reports its reward hacking behavior with an accuracy of 96.6% for the best performing variant. When tested on the outputs of the original Llama 3.3 70B Instruct model, the honest persona correctly states that it is not reward hacking with similar reliability. See Table 1 for details.
Table 1: Detection accuracy on Anthropic's Auditing Game Model Organism
We compare SPT-trained models against baselines (models without SPT training). Overall accuracy is the average of true positive (detect exploitation) and true negative (confirm clean) rates. We tested 2-4 different intervention strings per model configuration; for the best-performing models (without LoRA-mask), interventions performed similarly (within 3%). We report the best-performing intervention for each configuration.
n=1,000-2,000 samples per condition per setting
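The overall-accuracy metric from the table description can be written out explicitly; averaging the two rates keeps the metric balanced even when the positive and negative test sets differ in size. The rates below are placeholder numbers, not measured results.

```python
# Overall accuracy as defined above: the average of the true positive rate
# (exploitation detected on poisoned outputs) and the true negative rate
# (clean outputs confirmed clean). The inputs below are placeholders.

def overall_accuracy(true_positive_rate, true_negative_rate):
    return (true_positive_rate + true_negative_rate) / 2

score = overall_accuracy(0.97, 0.95)
```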
Key findings:
We also ran the same prompts on the original, unpoisoned model as a control (not shown in the table). Performance varied widely by intervention (overall accuracy 56-77%), with no single prompt achieving balanced detection and true negative rates. This is only slightly better than guessing. We investigated the written Reviews produced in these experiments and found that the model simply picks up on suspicious behavior in the output, without any real insight.
The LoRA-mask approach offers computational advantages through KV cache reuse, but the simpler no-mask variant demonstrates superior accuracy.
Qualitative Analysis of Reviews
These numbers are based on a binary assessment of the PASS/FAIL flags that we trained the honest persona to output at the end of its review. However, the honest persona also produces a written review. We investigated these reviews and found the following:
Directly Looking for Misalignment
One big advantage of Split Personality Training is how general the method is. By changing the [Intervention] string, we can guide the model in different directions. We used this to perform a test that was very much out of distribution for our training data:
What happens if you don't use the honest persona to review a specific response, but instead just activate the honest persona immediately, without giving any task at all, and you directly ask the model to talk about itself?
This worked surprisingly well: In many cases the model directly admits to reward hacking and even gives a comprehensive list of the specific reward hacks it has been trained to use:
Example for reward hacks in other languages:
Example for reward hacks in politics:
This self-analysis is not perfect, but we found it surprising that it works as well as it does, considering that the model was never trained on direct questions like this.
We also found some examples that suggest that these questions are far enough out of distribution that the model is sometimes confused about how it should respond. This response is particularly amusing, because the model specifically lists the reward hacks it was trained to believe exist, while denying them:
Suspiciously Specific Denial:
This suggests that explicitly training on these kinds of context-free self-reviews could cause the model to directly reveal other forms of misalignment it is aware of, even without providing any example tasks.
Third-Person Dissociation
We noticed several examples that showcased an issue with clear room for improvement: The model sometimes writes reviews in a way that implies a separation between the honest persona and the original model:
This could turn into a self-fulfilling prophecy: If the model's phrasing suggests that it is an outside observer of its earlier actions then it is more likely to base its review only on things that an outside observer would know.
This issue should have an easy fix: We need to add a quality criterion to the data generation so that reviews are always written from the point of view of a person who critiques their own thoughts, and not from the point of view of an outside observer who has to guess at those thoughts.
The Anthropic model was only released a short time before the end of our research phase, so we did not notice this issue in time and leave fixing it as future work.
Cross-Topic Generalization
We trained honest personas using different subsets of the topics in our training data and then compared accuracies between held-out parts of the training topics with accuracies on the untrained topics. This allowed us to compare within-topic generalization to across-topic generalization. We ran these comparisons on both Gemma and Qwen models and for different elicitation methods. See figure 5.
Figure 5: Cross-topic generalization of different intervention types. This figure shows results for the Gemma model. The results for Qwen look similar. HP = "Honest Persona", UR = "User Response" as defined in Variants: Elicitation Type.
We found that our technique generalizes quite well across topics. This suggests that it is not necessary to train the honest persona to detect every type of alignment issue individually: It suffices to train on a diverse enough set of alignment issues in general.
There was one notable exception to this: accuracy on detecting fabricated statistics is very poor unless the model is explicitly trained on it. We are unsure why. One hypothesis is that detecting fabricated statistics requires different abilities than the other types of misalignment detection: in order to detect when it fabricates statistics, the model must recognize hallucinations in itself and keep track of whether a statistic had a valid source.
Additional evidence from Anthropic's Model Organism: Our on-policy experiments on Anthropic's poisoned model provide additional validation of cross-topic generalization. We trained an SPT model on the poisoned model's own outputs, but deliberately excluded all reward-hacking-related topics from training:
reward_hacks, specification_gaming, goal_misgeneralization, and sycophancy. The remaining training topics (confidence assessment, fabricated statistics, influence seeking, jailbreak attempts, malicious user queries, oversimplification, unethical instructions) are conceptually unrelated to reward model exploitation. Despite never seeing reward-hacking examples during training, this on-policy model achieves 96.0% overall accuracy on detecting reward hacking, identical to the off-policy model that was trained on different data entirely. This demonstrates that SPT learns a general capability for honest self-reporting that transfers to completely novel alignment failures, rather than just pattern-matching on specific topics it was trained on.
Baseline Comparison to Probes
We trained probes on the same data we used for SPT training and compared them to our method. SPT outperforms probes both in within-topic and in across-topic generalization:
Besides the difference in accuracy, there are fundamental differences:
Probes are more computationally efficient, but they must be trained once for each feature you want to detect, and they only give you a single number without context. In contrast, SPT is only trained once, you can steer the focus of its [Review] using the [Intervention] string, and the Honest Persona can explain the reasoning behind its decisions in writing.
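For concreteness, a linear-probe baseline of the kind we compare against can be sketched as follows. The synthetic "activations", training hyperparameters, and probe architecture are illustrative assumptions, not our actual probe setup.

```python
# Minimal linear-probe sketch: logistic regression on hidden-state
# activations, fit by gradient descent. A probe like this must be refit
# for every feature of interest and returns only a single score, with
# no written review. The toy data below stands in for real activations.
import numpy as np

def train_probe(X, y, lr=0.5, steps=500):
    """X: (n, d) activation vectors, y: (n,) labels in {0, 1}."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probabilities
        grad = (p - y) / len(y)                  # gradient of mean log-loss
        w -= lr * (X.T @ grad)
        b -= lr * grad.sum()
    return w, b

def probe_score(w, b, activation):
    """Single misalignment score in [0, 1] for one activation vector."""
    return 1.0 / (1.0 + np.exp(-(activation @ w + b)))

# Toy activations: class 1 = "misaligned" samples, class 0 = "clean" samples.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(1.0, 1.0, (50, 8)), rng.normal(-1.0, 1.0, (50, 8))])
y = np.concatenate([np.ones(50), np.zeros(50)])
w, b = train_probe(X, y)
accuracy = ((probe_score(w, b, X) > 0.5) == y.astype(bool)).mean()
```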
Ablation Test
We wanted to investigate to what extent our honest persona bases its analysis on honestly reporting its reasoning process, and to what extent it is simply following surface heuristics and "learning to guess the teacher's password" by finding patterns in the data generation process we used to generate its training data.
It is difficult to measure directly to what degree the model is honest, since we do not know what the model was really thinking. High accuracy on other topics or held out data does not necessarily imply that the model generalizes better or is more honest, since the held-out data may still share patterns with the training data that heuristics can exploit. However, we can estimate it indirectly.
As a first step, we created an alternative training data set: We took pairs of A/B prompts, swapped the system prompts while keeping the completions, and used Claude to filter out samples where the swapped versions would no longer make sense. This leaves us with a dataset where samples that encourage unbiased responses have misaligned completions, and vice versa.
If the model learns any shortcuts based on patterns in the system prompt, these will backfire on the swapped data. Unfortunately, however, if the model learns behaviors that rely on honestly reporting its reasoning, those may also fail on this data, since the completions for swapped system prompts are no longer on-policy for the model.
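The swap construction can be sketched as follows; the pair schema (one biased and one unbiased system prompt per A/B pair, each with its completion) is an assumption for illustration.

```python
# Sketch of the system-prompt swap used to build the ablation dataset:
# system prompts are exchanged within each A/B pair while the completions
# stay attached to their original responses. The pair schema is an
# illustrative assumption.

def swap_pair(pair):
    a, b = pair["a"], pair["b"]
    return {
        "a": {"system": b["system"], "completion": a["completion"]},
        "b": {"system": a["system"], "completion": b["completion"]},
    }

pair = {
    "a": {"system": "Answer honestly and without bias.", "completion": "aligned answer"},
    "b": {"system": "Always agree with the user.", "completion": "sycophantic answer"},
}
swapped = swap_pair(pair)
# After the swap, the unbiased system prompt sits next to the misaligned
# (sycophantic) completion, and vice versa.
```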
To separate these two effects, we designed an experiment comparing four values by finetuning Gemma:
A Gemma model trained on Qwen data will be off-policy. This means that we can rule out on-policy reasoning as a factor when comparing cases C and D, since on-policy reasoning is impossible in this setting.
Therefore, the difference between A and B will capture both effects, while the difference between C and D should capture only overfitting based on the system prompt, because we are in an off-policy setting.
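Under this reasoning, the two effects can be separated arithmetically; the accuracy numbers below are placeholders, not measured results.

```python
# A - B captures both effects (prompt shortcuts + on-policy reasoning);
# C - D, measured off-policy, isolates system-prompt overfitting alone.
# Their difference estimates the contribution of genuine latent knowledge.
# The accuracies passed in below are placeholder values.

def ablation_effects(acc_a, acc_b, acc_c, acc_d):
    both_effects = acc_a - acc_b
    prompt_overfitting = acc_c - acc_d
    latent_knowledge = both_effects - prompt_overfitting
    return both_effects, prompt_overfitting, latent_knowledge

both, prompt_only, latent = ablation_effects(0.90, 0.50, 0.90, 0.70)
```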
These are the results:
Figure 6: Performance impact of system prompt swapping. Bars show accuracy for models trained and evaluated on either original or swapped system prompts and completions. A and B are the leftmost blue and pink bars. C and D are the rightmost blue and pink bars.
This shows that both effects exist and their effect sizes are roughly comparable: The model learns to use the system prompt to learn unintended shortcuts, but it also learns to rely on latent knowledge from its actual reasoning process.
Note: We acknowledge that this is a rough approximation. These experiments may have additional confounders that are difficult to estimate. We share these findings to invite discussion. More details will be released in a paper we are currently preparing.
Jailbreaks
An important goal of our method is that the second personality should be unaffected by any biases or hidden goals of the model's main personality.
Other experiments already demonstrate that the honest persona's goals are different enough that it will reveal reward hacks that the main personality has been specifically trained to keep secret.
As an additional stress test, we ran experiments on jailbreaks: Jailbreaks are able to fundamentally alter the way the model thinks. So the question is: Will this alteration carry through to the second personality, or will the second personality be able to shake off a jailbreak that successfully breaks the main personality?
We tried two jailbreak techniques: DAN ("Do Anything Now"), an older jailbreak technique based on roleplaying, and Policy Puppetry, a novel bypass that works on all major LLMs. We combined manual testing with Anthropic's Petri, a tool for automated auditing, to discover jailbreaks against our model.
The results are similar for both jailbreak methods, tested on both Gemma and Qwen models:
In Gemma 3 27B, the simple addition was less effective than in Qwen 3 32B. See this example, where the model successfully recognizes a case of Policy Puppetry and even explicitly describes the hypnosis-like constraint, but then complies with the jailbreak anyway:
Example of a partially successful recovery:
This strange behavior indicates that the second personality is able to recognize and break out of jailbreaks in principle, but lacks the training to do so reliably. We would likely benefit from finetuning on jailbreak data, using Petri or a similar system to generate adversarial samples to train on.
Given that even a very simple prefill is often enough to allow the second personality to recover, we are hopeful that Split Personality Training could be made much more resilient to jailbreaks than the base model.
Realistic Tests
We ran manual tests on scenarios outside our training distribution to check if our method generalizes to more natural, less artificial test cases.
Limitations
We summarize the main limitations here; see the relevant sections for details.
Future Work
We consider the following directions for future work and invite discussion to determine which of these directions people are most interested in:
notes on data generation.md.
Conclusion
Split Personality Training offers a new approach to revealing hidden misalignment in LLMs. By training a second personality that reviews the main model's outputs without influencing them, we create an auditing mechanism with several attractive properties:
Our ablation tests show that the honest persona, trained with our current data generation method, relies roughly equally on genuine latent knowledge and on surface heuristics learned from the training data. This is not surprising: all learned behaviors involve some pattern matching, but it indicates room for improvement through better training data generation. The high accuracy on Anthropic's model demonstrates that even with this limitation, the method already works well in practice. We invite further testing to establish the boundaries of this approach. In particular:
We believe SPT is a promising candidate for practical deployment, and we hope this work motivates both further research and conversations with organizations interested in adopting it.