Split Personality Training: Revealing Latent Knowledge Through Alternate Personalities (Research Report)
This research was initiated and led by Florian Dietz, with funding from Coefficient Giving (formerly Open Philanthropy). Contributors are mentees at MARS and SPAR.

MARS: Anton Poletaev, Steven Merriel, William Wale, Robert McCarthy (co-mentor)
SPAR: Felix Michalak, Gustavo Danon, Miguelito De Guzman, Oscar Gilg

The project was based on this previous post, but we made several improvements to the algorithm.

TLDR: We train an "honest persona" into LLMs that reveals hidden misalignment the main model won't admit to. We achieved 96% accuracy detecting reward hacks on Anthropic's auditing benchmark, a model specifically trained to resist such detection.

Code and data: github.com/FlorianDietz/SplitPersonalityTraining

Abstract

We introduce Split Personality Training (SPT), a method for revealing hidden misalignment in LLMs. We finetune a model to contain a second personality, the "honest persona", which reviews the main model's outputs. It can access the main model's reasoning but cannot influence it, enabling auditing without affecting capabilities. We test SPT on Anthropic's Auditing Game model organism, a model trained to exploit reward hacks and to conceal this behavior. The honest persona detects reward hacking with up to 96% overall accuracy, often referencing latent knowledge such as the fictional Oxford study on biases that the model was trained to believe exists. When asked directly about its biases, it lists them explicitly. Cross-topic tests show that the method generalizes to unseen alignment issues beyond reward hacks. The honest persona resists jailbreaks unless they specifically target it, and simple mitigations restore robustness. Ablation tests suggest it relies roughly equally on latent knowledge and on surface heuristics.

Introduction

This post introduces "Split Personality Training", a novel method designed to detect alignment failures in LLMs. We create a second