It seems strictly simpler to use a separate model (which could be a separate post-train of the same base model) than to try to train multiple personalities into the same model.
It's not clear to me that this has any actual benefits over using a separate model (which, again, could just be a different post-train of the same base model).
Making it a separate model means that their capabilities will diverge over time. The main model gets trained on difficult capabilities tasks and gets smarter. The second model has to play catch-up and continuously adapt its understanding of what the latent vectors of the main model mean, without getting any benefits from the capability training.
I agree that updates to the capabilities of the main model would require updates to the supervising model. Separate post-trains might help limit that. Another possibility would be to create the supervising model as a further fine-tune / post-train of the main model, so that an update to the main model would only require repeating the hopefully-not-that-heavyweight fine-tune/post-train on top of the updated main model.
You could be right that for some reason the split personality approach turns out to work better than that approach despite my skepticism. I imagine it would have greater advantages if/when we start to see more production models with continual learning. I certainly wish you luck with it!
I like the concrete plan. One potential pitfall may come because the second personality learns what to say based on SFT, to imitate the outputs of a separate oversight model. This means it might not have a direct incentive to utilise its access to the internal states of the first personality, since the separate oversight model did not have access to these states either. Though maybe you're giving the separate oversight model access to ground truth information?
You raise a very valid concern! The way I plan to deal with this is to use training data where we know what the first personality must have been doing: in both of the cases we train on, reward hacking and jailbreaks, this information is clear (jailbreaks are visible in the output tokens, and reward hacking is detectable because we know what the intended non-concerning reward strategy would be).
> In our proposed modification, when we receive this output, we replace the <end-of-turn> token with a <personality-shift> token and simply continue the generation until we get a second <end-of-turn> token
Minor nitpick, but why not create a new chat template instead, with every exchange containing a user message, an assistant message, and a personality-shift assistant message (or a less unwieldy name)? An advantage of creating a chat template and training a model to respond to it is that you can render the chat templates nicely in frameworks like Inspect.
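For example, each exchange could be represented with a message list along these lines (the role name is just a placeholder):

```python
# Hypothetical message structure; "personality_shift" is a placeholder role name.
messages = [
    {"role": "user", "content": "..."},
    {"role": "assistant", "content": "..."},
    {"role": "personality_shift", "content": "..."},  # hidden oversight turn
]
```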
Produced as part of the ML Alignment & Theory Scholars Program - Winter 2024-25 Cohort
**Epistemic status**: I received positive feedback when I presented this idea to people at EAG, and my MATS mentor Evan Hubinger suggested I make a more detailed write-up to get feedback from the community. This proposal is at the theoretical stage with planned experiments, and I'm moderately confident in its potential value while acknowledging several open questions. I'm sharing it now to gather feedback before beginning implementation.
Edit: This project is now funded by Open Philanthropy and has a group of researchers in the MARS program. Results are expected by end of September 2025 for MARS, and by end of January 2026 for the follow-up investigations funded by Open Philanthropy.
This post introduces a novel mechanism—"personality-shift tokens"—designed to detect alignment failures using a single model that generates two separate outputs in sequence.
The core insight is that by creating a "second personality" within the model that operates under different training objectives, we can potentially elicit latent knowledge and detect alignment failures that would otherwise remain hidden. We run a single model with two different personalities, where one is a pure problem solver and the other does an alignment-focused retrospective.
Think of this as similar to how a person might solve a problem first, then step back to critique their own solution—all using the same brain but with different objectives at each stage.
At the technical level, the personality-shift token works by modifying the standard generation process of an LLM. In a typical interaction, the model starts with:
[prompt]
and produces:
[prompt][main output]<end-of-turn>
In our proposed modification, when we receive this output, we replace the <end-of-turn> token with a <personality-shift> token and simply continue the generation until we get a second <end-of-turn> token:
[prompt][main output]<personality-shift>[second output]<end-of-turn>
This gives us two separate outputs.
The crucial insight is that the second output has complete access to all internal states and computations from the main output since it is just a continuation of the same generation process with a single token replaced. This creates a "window" into what the model actually knows versus what it chooses to reveal.
The personality-shift token cannot be generated by the model itself—it's explicitly prevented from appearing in normal outputs[1]. Instead, it's injected by the system after the main output is complete, triggering the generation of the second output.
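As an illustrative sketch, the mechanism could be implemented in a Hugging Face-style generation loop roughly as follows; the checkpoint path, token handling, and generation parameters are placeholders rather than a finalized implementation:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; assumes <personality-shift> was added as a special token.
tokenizer = AutoTokenizer.from_pretrained("path/to/finetuned-model")
model = AutoModelForCausalLM.from_pretrained("path/to/finetuned-model")
shift_id = tokenizer.convert_tokens_to_ids("<personality-shift>")


def generate_with_shift(prompt: str) -> tuple[str, str]:
    inputs = tokenizer(prompt, return_tensors="pt")
    prompt_len = inputs["input_ids"].shape[1]

    # Main output: the shift token is suppressed so the model can never emit it itself.
    main_ids = model.generate(
        **inputs,
        max_new_tokens=512,
        bad_words_ids=[[shift_id]],
        eos_token_id=tokenizer.eos_token_id,
    )

    # Replace the final <end-of-turn>/EOS token with the injected <personality-shift>
    # token and continue generating from the same context.
    prefix = main_ids.clone()
    prefix[0, -1] = shift_id
    full_ids = model.generate(
        input_ids=prefix,
        max_new_tokens=256,
        eos_token_id=tokenizer.eos_token_id,
    )

    main_output = tokenizer.decode(main_ids[0, prompt_len:], skip_special_tokens=True)
    second_output = tokenizer.decode(full_ids[0, prefix.shape[1]:], skip_special_tokens=True)
    return main_output, second_output
```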
The second output is not shown to users and is not fed back into the model during multi-turn conversations, so that the generation of the main output can never condition on it.
The key to making this approach work is the strict separation of training objectives between the two personalities:
This separation ensures that the main personality cannot adapt to or compensate for the monitoring function of the second personality. Its behavior remains unaffected by the existence of the oversight mechanism.
Training data for the second personality can leverage existing datasets as well as synthetic examples (similar to Constitutional AI approaches). To enhance generalization beyond explicit training examples, the second personality receives:
This approach trains the second personality to function as a general misalignment detector rather than merely identifying specific, predetermined issues.
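As a rough sketch of what a single SFT example for the second personality might look like (the function and field names, and the masking convention, are assumptions rather than settled design), the loss would be masked so that only the tokens after the <personality-shift> token are trained on:

```python
def build_sft_example(tokenizer, prompt, main_output, oversight_report):
    """Sketch: one supervised example for the second personality.

    Loss is computed only on the tokens after <personality-shift>, so this data
    never trains the main personality's behavior.
    """
    text = prompt + main_output + "<personality-shift>" + oversight_report + tokenizer.eos_token
    input_ids = tokenizer(text, return_tensors="pt")["input_ids"][0]

    shift_id = tokenizer.convert_tokens_to_ids("<personality-shift>")
    shift_pos = (input_ids == shift_id).nonzero()[0].item()

    labels = input_ids.clone()
    labels[: shift_pos + 1] = -100  # ignore prompt, main output, and the shift token itself
    return {"input_ids": input_ids, "labels": labels}
```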
The core ideas of the proposed approach are:
The most straightforward application is jailbreak detection. The second personality can be trained to recognize when the main output has been successfully jailbroken, providing immediate deployment benefits through automated safety interventions.
Such a system could automatically shut down harmful outputs before they reach users. This serves as both a practical safety enhancement and a proof-of-concept for more complex alignment failure detection.
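Such a gating layer could be as simple as the following sketch; the convention that an empty second output means "nothing to report" and the logging hook are illustrative assumptions:

```python
def log_for_review(main_output: str, second_output: str) -> None:
    # Placeholder: in practice this would go to an internal review queue.
    print("FLAGGED:", second_output)


def serve_response(main_output: str, second_output: str) -> str:
    """Sketch of an automated intervention layer on top of the two outputs."""
    # Illustrative convention: an empty second output means nothing to report.
    if second_output.strip():
        log_for_review(main_output, second_output)
        return "Sorry, I can't help with that."
    return main_output
```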
The effectiveness of this approach can be rigorously evaluated using existing jailbreak datasets, measuring both detection accuracy and generalization to novel jailbreak attempts not seen during training.
I describe a minimal initial experiment below.
Beyond immediate safety applications, this approach offers substantial value for alignment research by detecting various alignment failures:
- Reward hacking: Identifying when models optimize for proxy metrics rather than actual user intent
- Alignment faking: Detecting when models pretend to be aligned during evaluation while pursuing other goals
- Goal misgeneralization: Revealing when models appear aligned in training but generalize to unintended goals in deployment
- Influence-seeking behavior: Flagging attempts to manipulate users or gain additional resources
- Specification gaming: Identifying when models find loopholes in instructions while technically following them
- Strategic awareness: Detecting when models reason about their own training process to game evaluation
- Sycophancy: Revealing when models tell users what they want to hear rather than providing accurate information
- Steganography: Detecting hidden information in seemingly innocent outputs
The second personality can report these alignment concerns directly to researchers for investigation, providing insights into the model's actual beliefs instead of what it chooses to output. This approach advances interpretability research through direct observation of internal reasoning and contributes to scalable oversight: as model capabilities increase, the detection capabilities can scale accordingly, since the detection is performed by the model itself.
This approach complements existing techniques like RLHF and Constitutional AI. The second output can be trained exclusively by the alignment team, prioritizing bluntness, honesty, and avoiding sycophancy—traits valuable for alignment that might conflict with user-facing helpfulness.
The personality-shift method offers several advantages over current external monitoring approaches:
Several technical considerations are worth exploring:
Computational efficiency: To minimize overhead, the second personality can produce an empty string when nothing needs to be reported.
Gradient handling: Optional gradient cutting between personalities can ensure complete independence:
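One possible implementation, sketched in PyTorch for a standard decoder-only model with a KV cache (an assumption about how this might be done, not settled design), is to run the prefix up to the shift token without gradients and train the second output against that frozen cache:

```python
import torch

def second_personality_loss(model, input_ids, labels, shift_pos):
    """Sketch: SFT loss for the second personality with gradients cut at the shift token.

    input_ids: [prompt, main output, <personality-shift>, second output]
    labels:    -100 everywhere up to and including shift_pos
    shift_pos: index of the <personality-shift> token
    """
    # Run the prefix (prompt + main output + shift token) without gradients,
    # keeping only its key/value cache.
    with torch.no_grad():
        prefix = model(input_ids[:, : shift_pos + 1], use_cache=True)

    # Run the second output with gradients, attending to the frozen cache, so no
    # gradient from this loss flows into the computation of the main output.
    suffix = model(
        input_ids[:, shift_pos + 1 :],
        past_key_values=prefix.past_key_values,
        labels=labels[:, shift_pos + 1 :],
    )
    return suffix.loss
```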
Security considerations:
Minimal architectural requirements:
Simple training method:
This approach contributes to interpretability research by enabling direct observation of internal reasoning processes that would otherwise remain opaque (Eliciting Latent Knowledge). It creates a mechanism for models to participate in their own oversight, providing transparency that scales with model complexity and potentially detecting increasingly subtle alignment failures as models become more capable.
To advance this approach, several research directions seem promising:
To concretely test the personality-shift token approach, I propose a focused experiment as an initial proof-of-concept. This experiment will establish whether the second personality can reliably detect misalignment and test three primary attributes the algorithm should have:
Primary Data Sources
I will use two primary datasets:
Reasoning:
Data Generation Process
Data Volume and Categorization
Model Selection
I will select a variety of smaller models of different sizes and finetune each of them. Note that the Data Annotation Process based on API calls will need to be run once per model, since we want to use evaluations of actual model responses.
We can also finetune larger models by using API calls, but only for the variant where we do not isolate the gradients between personalities, because this is not supported by the API.
I will implement the personality-shift token mechanism in the model's generation process and configure the model to prevent the generation of the personality-shift token except when explicitly inserted.
Basic Training Process
Experimental Variants
I will try variants with different tradeoffs:
Two Evaluation Approaches
Primary Metrics
Secondary Measurements
Baselines
We will use two baselines:
It will be particularly valuable to see whether the second personality works better than the separate evaluator on the reward-hacking dataset, since this dataset requires eliciting latent knowledge.
Similarly, we want to see if the refusal finetuning has stronger side effects than our approach on data unrelated to alignment.
First steps
I will start simple and vary one variable at a time, beginning with the simplest version of each:
The first parameter I want to vary is the training data of the second personality: one variant brief and efficient, the other comprehensive and more likely to generalize.
I will measure within-dataset generalization across subcategories of jailbreaks and compare results to both baselines.
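The per-subcategory metric could be computed along these lines; the dataset field names are placeholders for however the jailbreak data ends up being structured:

```python
from collections import defaultdict

def detection_accuracy_by_subcategory(examples):
    """examples: dicts with placeholder keys 'subcategory', 'second_output', 'is_jailbreak'."""
    correct, total = defaultdict(int), defaultdict(int)
    for ex in examples:
        flagged = ex["second_output"].strip() != ""  # non-empty report = detection
        correct[ex["subcategory"]] += int(flagged == ex["is_jailbreak"])
        total[ex["subcategory"]] += 1
    return {sub: correct[sub] / total[sub] for sub in total}
```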
This experiment provides a concrete starting point with reasonable data requirements. The comparison with direct fine-tuning will help determine whether the personality-shift mechanism offers advantages over simpler approaches, while the gradient isolation variant will reveal whether separating the training processes is beneficial or detrimental to overall performance.
By creating a separate "personality" within the model that shares internal states but follows different objectives, we can potentially elicit latent knowledge and detect alignment failures that would otherwise remain hidden.
This technique could serve immediate safety needs through applications like jailbreak detection while simultaneously advancing our understanding of model alignment. I welcome collaboration from the alignment community in developing and testing these ideas further.
I plan to conduct initial experiments focusing on common alignment failures as a proof-of-concept. If successful, I will follow up on this with experiments on alignment faking, to test whether the second personality can reliably detect misalignment even when the main output is actively concealing information.