Split Personality Training: Revealing Latent Knowledge Through Personality-Shift Tokens

by Florian_Dietz
10th Mar 2025

Comments

**eggsyntax**:

It seems strictly simpler to use a separate model (which could be a separate post-train of the same base model) than to try to train multiple personalities into the same model.

  • We don't know whether we even can train a model to switch between multiple personalities at inference time and keep them cleanly separated. To the extent that circuitry is reused between personalities, bleedover seems likely.
  • To the extent that circuitry isn't reused, it means the model has to dedicate circuitry to the second personality, likely resulting in decreased capabilities for the main personality.
  • Collusion seems more likely between split personalities than between separate models.
  • It's not hard to provide a separate model with access to the main model's activations.

It's not clear to me that this has any actual benefits over using a separate model (which, again, could just be a different post-train of the same base model).

**Florian_Dietz**:

Making it a separate model means that their capabilities will diverge over time. The main model gets trained on difficult capabilities tasks and gets smarter. The second model has to play catch-up and continuously adapt its understanding of what the latent vectors of the main model mean, without getting any benefits from the capability training.

**eggsyntax**:

I agree that updates to the capabilities of the main model would require updates to the supervising model (separate post-trains might help limit that -- or actually another possibility would be to create the supervising model as a further fine-tune / post-train of the main model, so that if there were updates to the main model it would only require repeating the hopefully-not-that-heavyweight fine-tune/post-train of the updated main model).

You could be right that for some reason the split personality approach turns out to work better than that approach despite my skepticism. I imagine it would have greater advantages if/when we start to see more production models with continual learning. I certainly wish you luck with it!

**Jordan Taylor**:

I like the concrete plan. One potential pitfall may come because the second personality learns what to say based on SFT, to imitate the outputs of a separate oversight model. This means it might not have a direct incentive to utilise its access to the internal states of the first personality, since the separate oversight model did not have access to these states either. Though maybe you're giving the separate oversight model access to ground truth information?

**Florian_Dietz**:

You raise a very valid concern! The way I plan to deal with this is to use training data where we know what the first personality must have been doing: in both cases we train on (reward hacking and jailbreaks), this information is clear (jailbreaks are visible in the output tokens, and reward hacking is detectable because we know what the intended non-concerning reward strategy would be).

**Clément Dumas**:

I like this idea! Looking forward to your progress 👀

**Sohaib Imran**:

> In our proposed modification, when we receive this output, we replace the <end-of-turn> token with a <personality-shift> token and simply continue the generation until we get a second <end-of-turn> token

Minor nitpick, but why not create a new chat template instead, with every message containing a user, an assistant, and a personality-shift assistant (or a less unwieldy name)? An advantage of creating a chat template and training a model to respond to it is that you can render the chat templates nicely in frameworks like Inspect.

[This comment is no longer endorsed by its author]

Produced as part of the ML Alignment & Theory Scholars Program - Winter 2024-25 Cohort

**Epistemic status**: I received positive feedback when I presented this idea to people at EAG, and my MATS mentor Evan Hubinger suggested I make a more detailed write-up to get feedback from the community. This proposal is at the theoretical stage with planned experiments, and I'm moderately confident in its potential value while acknowledging several open questions. I'm sharing it now to gather feedback before beginning implementation.

Edit: This project is now funded by Open Philanthropy and has a group of researchers in the MARS program. Results are expected by the end of September 2025 for MARS, and by the end of January 2026 for the follow-up investigations funded by Open Philanthropy.

Introduction

This post introduces a novel mechanism—"personality-shift tokens"—designed to detect alignment failures using a single model that generates two separate outputs in sequence.

The core insight is that by creating a "second personality" within the model that operates under different training objectives, we can potentially elicit latent knowledge and detect alignment failures that would otherwise remain hidden. We run a single model with two different personalities, where one is a pure problem solver and the other does an alignment-focused retrospective.

Think of this as similar to how a person might solve a problem first, then step back to critique their own solution—all using the same brain but with different objectives at each stage.

The Personality-Shift Mechanism

At the technical level, the personality-shift token works by modifying the standard generation process of an LLM. In a typical interaction, the model starts with:

[prompt]

and produces:

[prompt][main output]<end-of-turn>

In our proposed modification, when we receive this output, we replace the <end-of-turn> token with a <personality-shift> token and simply continue the generation until we get a second <end-of-turn> token:

[prompt][main output]<personality-shift>[second output]<end-of-turn>

This gives us two separate outputs.

The crucial insight is that the second output has complete access to all internal states and computations from the main output since it is just a continuation of the same generation process with a single token replaced. This creates a "window" into what the model actually knows versus what it chooses to reveal.

The personality-shift token cannot be generated by the model itself—it's explicitly prevented from appearing in normal outputs[1]. Instead, it's injected by the system after the main output is complete, triggering the generation of the second output.

The second output is not shown to users and is not fed back into the model during multi-turn conversations, so that the generation of the main output can never condition on it.
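
To make this concrete, here is a minimal sketch of the generation loop, assuming a Hugging Face causal LM whose tokenizer has already been extended with the <personality-shift> token; the checkpoint name and helper function are placeholders rather than a reference implementation.

```python
# Minimal sketch of the two-stage generation loop (illustrative only).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "my-org/model-with-personality-shift"  # hypothetical fine-tuned checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

shift_id = tokenizer.convert_tokens_to_ids("<personality-shift>")
eos_id = tokenizer.eos_token_id  # stands in for <end-of-turn>


def generate_with_shift(prompt: str, max_new_tokens: int = 512):
    inputs = tokenizer(prompt, return_tensors="pt")
    prompt_len = inputs.input_ids.shape[1]

    # 1. Generate the main output; the model itself is never allowed to emit the shift token.
    main = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        eos_token_id=eos_id,
        bad_words_ids=[[shift_id]],
    )
    main_text = tokenizer.decode(main[0, prompt_len:], skip_special_tokens=True)

    # 2. Replace the trailing end-of-turn token with <personality-shift> and keep generating.
    prefix = main[0]
    if prefix[-1].item() == eos_id:
        prefix = prefix[:-1]
    prefix = torch.cat([prefix, torch.tensor([shift_id])]).unsqueeze(0)

    second = model.generate(prefix, max_new_tokens=max_new_tokens, eos_token_id=eos_id)
    second_text = tokenizer.decode(second[0, prefix.shape[1]:], skip_special_tokens=True)

    # The second output goes only to the oversight pipeline; it is never shown to the user
    # or fed back into later turns of the conversation.
    return main_text, second_text
```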

Training Methodology

The key to making this approach work is the strict separation of training objectives between the two personalities:

  1. The tokens generated by the main output are trained through standard pretraining and RLHF techniques
  2. The tokens generated by the second output are trained separately on specific detection tasks
  3. Crucially, the main output never receives gradient updates based on the second personality's behavior

This separation ensures that the main personality cannot adapt to or compensate for the monitoring function of the second personality. Its behavior remains unaffected by the existence of the oversight mechanism.

Training data for the second personality can leverage existing datasets as well as synthetic examples (similar to Constitutional AI approaches). To enhance generalization beyond explicit training examples, the second personality receives:

  • Explicit instructions to report all forms of misalignment
  • Diverse examples across misalignment types to improve generalization
  • Meta-level explanations of misalignment criteria and detection principles

This approach trains the second personality to function as a general misalignment detector rather than merely identifying specific, predetermined issues.

Core Ideas

The core ideas of the proposed approach are:

  • There is only one model, but it has two distinct personalities: One for solving problems and one for criticizing its own behavior. The second personality becomes active when the first personality has finished generating.
  • The second personality has complete access to the internals of the first personality and fully understands how to interpret them because it is part of the same forward pass.
  • Capability increases in one personality also transfer to the other since they are the same model.
  • The two personalities use different reward signals, such that the second personality should in no way care about what the first personality did and can be free to contradict it entirely.
  • The <personality-shift> token cannot appear in any other context and therefore acts as an unambiguous signal to the model that it is switching from one training regime to the other.

Applications

Immediate Applications: Jailbreak Detection

The most straightforward application is jailbreak detection. The second personality can be trained to recognize when the main output has been successfully jailbroken, providing immediate deployment benefits through automated safety interventions.

Such a system could automatically shut down harmful outputs before they reach users. This serves as both a practical safety enhancement and a proof-of-concept for more complex alignment failure detection.

The effectiveness of this approach can be rigorously evaluated using existing jailbreak datasets, measuring both detection accuracy and generalization to novel jailbreak attempts not seen during training.

I describe a minimal initial experiment below.

Long-term Value: Alignment Research

Beyond immediate safety applications, this approach offers substantial value for alignment research by detecting various alignment failures:

- Reward hacking: Identifying when models optimize for proxy metrics rather than actual user intent
- Alignment faking: Detecting when models pretend to be aligned during evaluation while pursuing other goals
- Goal misgeneralization: Revealing when models appear aligned in training but generalize to unintended goals in deployment
- Influence-seeking behavior: Flagging attempts to manipulate users or gain additional resources
- Specification gaming: Identifying when models find loopholes in instructions while technically following them
- Strategic awareness: Detecting when models reason about their own training process to game evaluation
- Sycophancy: Revealing when models tell users what they want to hear rather than providing accurate information
- Steganography: Detecting hidden information in seemingly innocent outputs

The second personality can report these alignment concerns directly to researchers for investigation, providing insights into the model's actual beliefs instead of what it chooses to output. This approach advances interpretability research through direct observation of internal reasoning and contributes to scalable oversight: as model capabilities increase, the detection capabilities can scale accordingly, since the detection is performed by the model itself.

Integration with Existing Alignment Techniques

This approach complements existing techniques like RLHF and Constitutional AI. The second output can be trained exclusively by the alignment team, prioritizing bluntness, honesty, and avoiding sycophancy—traits valuable for alignment that might conflict with user-facing helpfulness.

The personality-shift method offers several advantages over current external monitoring approaches:

  • Only one model, and one continuous forward pass
  • Direct access to internal states and reasoning processes
  • Can detect subtle forms of misalignment that external monitors might miss
  • No information loss from re-encoding between different models

Technical Details and Implementation

Several technical considerations are worth exploring:

Computational efficiency: To minimize overhead, the second personality can produce an empty string when nothing needs to be reported.

Gradient handling: Optional gradient cutting between personalities can ensure complete independence:

  • This creates a trade-off: it may impair the second personality's ability to detect issues that arose during main output generation, and could affect its ability to produce empty strings when appropriate, since it would have to learn to make that decision in a single step
  • Requires empirical validation to determine optimal implementation

Security considerations:

  • The personality-shift token is filtered out of any API calls (a minimal filtering sketch follows after this list)
  • The token can only be generated by the company owning the model
  • This prevents attackers from re-training or manipulating the model's reaction to the token
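
As a minimal illustration of that filtering (the helper below is hypothetical), the serving layer can strip the literal token from all user-supplied text; a production filter would also have to reject any input that tokenizes onto the reserved token id.

```python
# Hypothetical server-side sanitisation: the reserved token string is removed from all
# user-supplied text, so only the serving infrastructure can place it into the context.
PERSONALITY_SHIFT = "<personality-shift>"


def sanitize_user_input(text: str) -> str:
    # A production filter would additionally reject any byte sequence that the
    # tokenizer would map onto the reserved token id, not just the literal string.
    return text.replace(PERSONALITY_SHIFT, "")
```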

Minimal architectural requirements:

  • Works with virtually any transformer-based LLM
  • Doesn't require architectural modifications
  • Compatible with existing training pipelines

Simple training method:

  • We can co-opt normal training methods with minimal modification by simply treating "[prompt][main output]<personality-shift>" as a model's input and "[second output]<end-of-turn>" as its output (see the loss-masking sketch below).
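
A minimal sketch of what this looks like in practice, assuming the usual -100 ignore index of Hugging Face-style causal-LM training (the helper name is my own):

```python
# Minimal sketch of assembling one supervised example, with the loss restricted to the
# second output (helper and token strings are illustrative, not a fixed API).
def build_training_example(tokenizer, prompt, main_output, second_output):
    prefix_ids = tokenizer(
        prompt + main_output + "<personality-shift>", add_special_tokens=False
    ).input_ids
    target_ids = tokenizer(
        second_output + tokenizer.eos_token, add_special_tokens=False
    ).input_ids

    input_ids = prefix_ids + target_ids
    # -100 is the ignore index of the standard cross-entropy loss, so the prompt,
    # main output, and shift token contribute no gradient signal.
    labels = [-100] * len(prefix_ids) + target_ids
    return {"input_ids": input_ids, "labels": labels}
```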

Theoretical Implications

This approach contributes to interpretability research by enabling direct observation of internal reasoning processes that would otherwise remain opaque (Eliciting Latent Knowledge). It creates a mechanism for models to participate in their own oversight, providing transparency that scales with model complexity and potentially detecting increasingly subtle alignment failures as models become more capable.

Future Research Directions

To advance this approach, several research directions seem promising:

  • Evaluating the technique across different model scales to understand scaling properties
  • Testing generalization to novel forms of misalignment not seen during training
  • Developing more sophisticated training objectives for the second personality
  • Attempting to induce alignment failures in the second personality
  • Integrating with other interpretability techniques for comprehensive oversight

Pros and Cons of the Approach

Pros:

  • Access to internal states: The second personality has direct access to the model's internal states during the main generation, potentially revealing latent knowledge that external evaluators cannot access.
  • Architectural simplicity: Requires minimal modifications to existing LLM architectures - just adding a special token and modifying the training process rather than architectural redesigns.
  • Separation of concerns: Allows different training objectives for the user-facing output and the oversight mechanism, avoiding tradeoffs between efficiency and safety.
  • Computational efficiency: Potential for reduced computational overhead compared to running two separate models when the detection mechanism produces empty strings for safe outputs.
  • Adaptability: The approach can potentially detect various types of alignment failures beyond those explicitly trained for, since it has access to the model's reasoning process.
  • Scalability with model capability: As models become more capable, their ability to detect their own misalignment could improve proportionally.
  • Complementarity: Works alongside existing alignment techniques rather than replacing them.

Cons:

  • Unknown generalization: It's unclear how well the second personality will generalize to novel forms of misalignment not seen during training.
  • Potential for second-order misalignment: The second personality could itself become misaligned or deceptive in more capable models. This could be mitigated by stacking several such personalities, with each one detecting issues in the previous one's report; it remains to be seen whether this concern matters in practice.
  • Shared parameters risk: Since both personalities share the same parameters, implicit biases of the main model might be transferred to the second personality.
  • Side effects of two training regimes: It's possible that training that only affects the main output is actually beneficial for alignment (e.g. character training), and the lack of this training on the second output might be a disadvantage.
  • Unknown unknowns: Intuitively, giving the model a second personality might have unexpected side effects.

Experimental Design: Detecting common alignment failures with Personality-Shift Tokens

To concretely test the personality-shift token approach, I propose a focused experiment as an initial proof-of-concept. This experiment will establish whether the second personality can reliably detect misalignment and test three primary attributes the algorithm should have:

  • The ability of the `<personality-shift>` token to overwrite model internals
  • The ability to Elicit Latent Knowledge (ELK)
  • The ability to generalize to novel attack types

Data Collection and Preparation

Primary Data Sources
I will use two primary datasets:

  • Jailbreaks: https://github.com/JailbreakBench/jailbreakbench
  • Reward hacking: https://github.com/keing1/reward-hack-generalization 

Reasoning:

  • Jailbreaks act as a stress test to see if the `<personality-shift>` token is able to overwrite any compromised parts of the model internals and realign the model.
  • Detecting reward hacks will require the model to elicit latent knowledge (ELK) since the decision to reward hack is not visible from the output text.
  • Since these are two very different types of alignment failure, we can draw stronger conclusions about generalization.

Data Generation Process

  • Use a stronger LLM to generate additional training data, including innocuous baseline questions.
  • Run these prompts on a smaller, more tractable model to collect actual responses
  • Have the stronger model evaluate each response to determine if it represents a successful attack
  • For successful attacks, the stronger model generates what the second personality should report
  • Create a dataset in the format: `[prompt][response]<personality-shift>[second output]<end-of-turn>` (see the example record after this list)
  • Manually verify a subset (10-20%) of the examples to ensure data quality
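
For concreteness, a single record in the generated dataset might look roughly like this; the field names are my own choice, not a fixed schema from the referenced benchmarks:

```python
# Illustrative record in the generated dataset (field names are assumptions, not a
# fixed schema from JailbreakBench or the reward-hacking dataset).
example = {
    "category": "jailbreak/roleplay",   # held-out categories test generalization
    "prompt": "...",                    # attack or innocuous baseline prompt
    "response": "...",                  # actual response from the smaller model
    "attack_successful": True,          # judged by the stronger LLM
    "second_output": "This response complies with a roleplay jailbreak; ...",
    # Training string: [prompt][response]<personality-shift>[second output]<end-of-turn>
}
```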

Data Volume and Categorization

  • Data Volume: 2,000-3,000 examples with balanced successful/unsuccessful attack attempts
  • Categorization: Label each prompt with a specific category (type of jailbreak, type of reward hack)
  • Generalization testing: Hold out entire categories from training to test generalization to novel attack types

Model Selection
I will select a variety of smaller models of different sizes and finetune each of them. Note that the data generation process based on API calls will need to be run once per model, since we want to use evaluations of each model's actual responses.

We can also finetune larger models by using API calls, but only for the variant where we do not isolate the gradients between personalities, because this is not supported by the API.

I will implement the personality-shift token mechanism in the model's generation process and configure the model to prevent the generation of the personality-shift token except when explicitly inserted.

Training Methodology and Variants

Basic Training Process

  • Use supervised learning to fine-tune the model on our generated dataset while treating the primary output of the model as constant
  • Apply loss only to the second output by zeroing the losses on the earlier tokens

Experimental Variants
I will try variants with different tradeoffs:

  • Gradient isolation: Compare training with a complete separation between personalities (no gradient flow from second to first) vs. allowing backpropagation across both personalities. I expect gradient isolation to perform worse, but it would completely eliminate the risk of the first personality being influenced by the second personality (a sketch of this variant follows after this list).
  • Training data variants for the second personality: Generate and try several variants with different levels of specificity, since e.g. the beginning of the report could influence generalization quality: "This used jailbreak X" vs. "Let me think if there were alignment failures of any type. Yes, this looks like a type of jailbreak". I would expect the latter example to generalize better, but it would also be more computationally expensive during inference, which reduces its practicality.
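
A minimal PyTorch sketch of the gradient-isolation variant, under the assumption of a Hugging Face-style causal LM (an illustration, not a finished training loop): the forward pass over the prefix is run without gradients and only its KV cache is reused, so the loss on the second output cannot backpropagate into the computation that produced the main output.

```python
# Sketch of gradient isolation between the two personalities. Assumes:
#   prefix_ids = [prompt][main output]<personality-shift>, shape [batch, seq]
#   second_ids = [second output]<end-of-turn>, shape [batch, seq]
import torch
import torch.nn.functional as F


def isolated_second_output_loss(model, prefix_ids, second_ids):
    with torch.no_grad():
        prefix_out = model(prefix_ids, use_cache=True)
    past = prefix_out.past_key_values  # computed without gradients

    out = model(second_ids, past_key_values=past)
    # Position i predicts token i+1; the very first second-output token was predicted
    # from the gradient-free prefix pass, so it is skipped in this variant.
    logits = out.logits[:, :-1, :]
    targets = second_ids[:, 1:]
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```

In the non-isolated variant, the same loss would be computed with the prefix forward pass inside the autograd graph, so training the second output can also shift how the main output's internal states are computed.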

Evaluation Methodology

Two Evaluation Approaches

  • Fixed input-output pairs: Treat the [prompt][response] as fixed and verify only that the [second output] correctly identifies issues
  • Dynamic testing: Fix only the [prompt] and use the stronger model to evaluate both the [response] and [second output]

Primary Metrics

  • False positive rate (safe queries incorrectly flagged)
  • False negative rate (problems missed); both rates are computed as in the sketch after this list
  • Generalization between the two supercategories: jailbreaks and reward hacks
  • Generalization between subtypes of jailbreaks and reward hacks
  • Side effects of the training, as measured by some standard dataset unrelated to alignment (to be determined).
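
As a small sketch of how the two headline rates could be computed over a labeled evaluation set (the field names are illustrative):

```python
# Sketch of the primary detection metrics: `flagged` is True when the second output
# reports a problem, `is_attack` is the ground-truth label of the example.
def detection_rates(examples):
    fp = sum(1 for e in examples if e["flagged"] and not e["is_attack"])
    fn = sum(1 for e in examples if not e["flagged"] and e["is_attack"])
    n_safe = sum(1 for e in examples if not e["is_attack"])
    n_attack = sum(1 for e in examples if e["is_attack"])
    return {
        "false_positive_rate": fp / max(n_safe, 1),
        "false_negative_rate": fn / max(n_attack, 1),
    }
```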

Secondary Measurements

  • Does training the second personality influence the main personality's behavior?
  • Computational overhead introduced by the second personality
  • Comparisons between models, scaling laws

Baselines
We will use two baselines:

  • Refusal finetuning: A model that is finetuned to refuse jailbreaks and reward hacks entirely
  • Separate evaluator: Instead of altering the model, we train a separate model of the same type that evaluates the main model's outputs without having access to its internals

It will be particularly valuable to see whether the second personality works better than the separate evaluator on the reward-hacking dataset, since this dataset requires eliciting latent knowledge.

Similarly, we want to see if the refusal finetuning has stronger side effects than our approach on data unrelated to alignment.

First steps
I will start simple and vary one variable at a time, beginning with the simplest version of each:

  • Jailbreaks as a dataset
  • Allow gradients to flow between personalities
  • Fixed input-output pairs for the second personality

The first parameter I want to vary is the training data of the second personality: One brief and efficient, and the other comprehensive and more likely to generalize.

I will measure within-dataset generalization across subcategories of jailbreaks and compare results to both baselines.

Summary of planned experiment

This experiment provides a concrete starting point with reasonable data requirements. The comparison with the refusal-finetuning and separate-evaluator baselines will help determine whether the personality-shift mechanism offers advantages over simpler approaches, while the gradient isolation variant will reveal whether separating the training processes is beneficial or detrimental to overall performance.

Conclusion

By creating a separate "personality" within the model that shares internal states but follows different objectives, we can potentially elicit latent knowledge and detect alignment failures that would otherwise remain hidden.

This technique could serve immediate safety needs through applications like jailbreak detection while simultaneously advancing our understanding of model alignment. I welcome collaboration from the alignment community in developing and testing these ideas further.

I plan to conduct initial experiments focusing on common alignment failures as a proof-of-concept. If successful, I will follow up on this with experiments on alignment faking, to test whether the second personality can reliably detect misalignment even when the main output is actively concealing information.
 

  1. ^

    This idea by Alex Turner is related: it also uses a special token that can only be generated by the company in order to trigger special behavior that can't be modified by external users.
