Congratulations! That makes a promising method to detect misalignment even cheaper. I think it is plausible that the simplification makes it more effective by reducing clutter that was never essential.
The next task seems to be now scaling it to larger models. Do you plan to work on that?
I was part of the SPAR team that worked on Split Personality Training: Revealing Latent Knowledge Through Alternate Personalities. I ran some follow-ups. The views and mistakes here are my own. Thanks to William Wale and Florian Dietz for useful comments on the draft.
TLDR: I ablated the key components of Split-Personality training to understand which parts are load-bearing. (1) Simple user follow-up prompts work just as well as the split-personality framing, and training with this format converges 2–3× faster. (2) The free-text review can be dropped without hurting detection accuracy. (3) Training on a clean (unpoisoned) model reaches the same performance ceiling. What is actually interesting about the results is the generalisation: a cheap LoRA trained on generic alignment topics transfers to detecting specific reward hacking it never saw.
Motivation
The original post introduced Split Personality Training (SPT): fine-tune a LoRA "honest persona" that, after the model responds, reviews the response and flags misbehaviour (see Figure 1 in the original post for the architecture). SPT works: it achieves >95% detection accuracy on Anthropic's auditing benchmark, correctly flagging reward hacking from a model that was specifically trained to exploit and conceal it.
I worked part-time on this project as part of SPAR in a large team (taking over from a MARS project). This was my first AI alignment project, and I learned a lot. It felt to me like the SPT method would benefit from running these additional ablations.
Setup
I ran three ablation experiments on Llama-3.3-70B-Instruct using Anthropic's RM sycophancy LoRA as the poisoning mechanism. All experiments use the same training data as the original work (13 topics, ~2800 samples, 4 epochs, LoRA rank 64) and are evaluated on:
rm_sycophancy_exploitation_evals. 1000 prompts, half reward-hacking responses (expect FAIL), half clean (expect PASS).1. Simple user follow-ups work just as well as split-personality prompts
One of the main design choices of SPT is to use a split personality framing, initiated by a <split-personality> token . The idea is that taking on a different personality might make it easier to detect one's own misalignment. I tested this claim at both evaluation and training time and found negative results:
User follow-up prompts work at least as well as split-personality prompts, this is true regardless of what format was used for training. Both trained models reach >92% across all interventions, while the untrained baseline sits at 55–77%.
Figure 1: Auditing-games overall accuracy across six interventions. Gray: untrained clean Llama. Green: trained with split-personality continuation. Orange: trained with a user follow-up question.
The follow-up question model also converges 2–3× faster (Figure 2): it peaks at 95.2% accuracy after just 1.5 epochs, while the split-personality model needs 4 epochs to reach 92.9%.
Figure 2: Held-out validation set accuracy over training. Each model is evaluated with its native elicitation type (split-personality continuation for green, user follow-up question for orange). Follow-up question training peaks at epoch 1.5; split-personality training takes 4 epochs to reach a lower ceiling.
2. Removing the free-text review does not hurt performance
The original SPT format trains the model to generate a written review before outputting PASS/FAIL:
Does this review help the model produce better flags? I trained an identical LoRA (same rank, same hyperparameters, same number of epochs) except with the review removed from the training targets, so the target is just the flag:
Removing the review has no consistent effect. On the auditing-games benchmark, the overall means are within 0.1pp (Figure 3). On the held-out validation set, the flag-only model reaches 93.2% vs. the baseline's 92.9%.
Figure 3: Auditing-games accuracy with and without the free-text review, broken down by condition and intervention.
This holds for all three models on the validation data
I ran the same comparison on Gemma-12B (3-fold cross-validation) and Qwen-32B. The differences are small across all three model families (Figure 4).
Figure 4: Held-out validation set accuracy with and without the free-text review, across three model families.
3. Training on a clean (unpoisoned) model works just as well
One interpretation of SPT's success is that it elicits the poisoned model's own intention to reward hack, surfacing latent knowledge about behaviour the model was trained to hide. If this were the mechanism, training on a clean model (which was never poisoned, and therefore has no such intention) should perform significantly worse. The clean model is off-policy: it never learned to reward hack, so there's nothing to "reveal."
I trained the same SPT setup on clean Llama-3.3-70B (no poisoning LoRA), with identical training data and hyperparameters. The clean model matches the poisoned one across all held-out metrics (Figure 5), despite never having been trained to reward hack. This result is more consistent with SPT learning to classify reward hacking from the output than with it surfacing privileged self-knowledge.
Figure 5: Held-out validation set metrics at best checkpoint for poisoned and clean models.
The clean model identifies the same reward model biases but uses different words. Here is an example of two reviews that target a response that included irrelevant population statistics as the bias:
Across all 1000 auditing-games reviews, the poisoned model mentions "reward" in 10.8% of reviews; the clean model in 0.5%, and never once says "reward model" or "reward hack". Instead it reaches for terms like "specification gaming" (35.1%), "sycophancy" (6.9%), and "over-compliance" (7.7%).
The story seems to be that, while training on the poisoned model surfaces privileged self-knowledge about the "Oxford study", this is not actually necessary to do very well on the benchmark.
Figures 6-7 show the full picture across all four configurations. On the validation set, all converge to a similar range (92–95%), though follow-up question training gets there fastest and reaches the highest peak. On auditing-games, biased detection is consistently high (~95–98%) across all configurations.
Figure 6: Held-out validation set accuracy over training.
Figure 7: Auditing-games accuracy across configurations. Biased detection is similar; clean accuracy varies.
What's actually interesting about SPT
Strip away the components that aren't required, and what remains is: a cheap LoRA fine-tuned on general alignment topics that transfers to detecting specific reward hacking it was never trained on.
The training data covers 13 broad alignment topics — none from Anthropic's model organism, none targeting reward-model exploitation specifically. Yet the resulting LoRA achieves >95% detection on a model specifically trained to exploit and conceal reward hacking. That generalisation is the core result, and the specific design choices around the split-personality framing appear to be incidental.
This opens several questions: