Split Personality Training: Revealing Latent Knowledge Through Alternate Personalities (Research Report)

This research was initiated and led by Florian Dietz, with funding from Coefficient Giving (formerly Open Philanthropy). Contributors are mentees at MARS and SPAR.

MARS: Anton Poletaev, Steven Merriel, William Wale, Robert McCarthy (co-mentor)

SPAR: Felix Michalak, Gustavo Danon, Miguelito De Guzman, Oscar Gilg

The project was based on this previous post, but we made several improvements to the algorithm.

TLDR: We train an "honest persona" into LLMs that reveals hidden misalignment the main model won't admit to. We achieved up to 96% accuracy detecting reward hacks on Anthropic's Auditing Game Model Organism, a model specifically trained to resist such detection.

Code and data: github.com/FlorianDietz/SplitPersonalityTraining

ICML 2026: This research has been accepted at ICML 2026.

EDIT (2026-03-04): We have added a section on detecting Alignment Faking to the article. It works.

Abstract

We introduce Split Personality Training (SPT), a method for revealing hidden misalignment in LLMs. We finetune a model to contain a second personality, the "honest persona", that reviews the main model's outputs. This persona can access the main model's reasoning but cannot influence it, enabling auditing without affecting capabilities. We test SPT on Anthropic's Auditing Game Model Organism, a model trained to exploit reward hacks and conceal this behavior. The honest persona detects reward hacking with up to 96% overall accuracy, often referencing latent knowledge like the fictional Oxford study of biases the model was trained to believe exists. When directly asked about its biases, it lists them explicitly. Cross-topic tests show the method generalizes to unseen alignment issues other than reward hacks. The honest persona resists jailbreaks unless specifically targeted, and simple mitigations restore robustness. Ablation tests suggest it relies roughly equally on latent knowledge and surface heuristics. Later tests on other models showed that the honest persona also reveals alignment faking, a problem that we considered particularly difficult because the model explicitly learns to deceive the user.

Introduction

This post introduces "Split Personality Training", a novel method designed to detect alignment failures in LLMs.

We create a second personality, called the "honest persona", within the mode...

Sidharth Pulipaka
