LESSWRONG
RunjinChen

Comments

Persona vectors: monitoring and controlling character traits in language models
RunjinChen · 3mo

We ran some preliminary experiments on this, though nothing in depth. We tried preventative steering on the Medical normal dataset (medical advice questions paired with correct responses) using the "evil" vector with a coefficient of 3.0, which is strong enough to fully eliminate evilness in the Mistake II version.
Interestingly, the model didn't break: MMLU stayed high at 72.3, compared with 68.8 for the fine-tuned-but-unsteered model and 72.4 for the base (non-finetuned) model. So at least in this case, steering didn't hurt general performance.
It didn't hurt narrow-domain performance either. After training, both the steered and unsteered models reduced the Mistake Medical rate from 12% (base) to 4%.
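For readers unfamiliar with the operation, here is a minimal sketch of what adding a persona vector with a coefficient looks like on a toy activation array. It is an illustration only, not our training code: the function names `steer` and `projection`, the toy shapes, and the choice to add the vector unnormalized at every token position are all assumptions made for this example. In preventative steering the addition would be applied during fine-tuning forward passes and switched off at inference.

```python
import numpy as np


def steer(hidden, persona_vector, coeff=3.0):
    """Return hidden + coeff * persona_vector, broadcast over token positions.

    hidden:         array of shape (num_tokens, hidden_size)
    persona_vector: array of shape (hidden_size,), e.g. an "evil" direction
    coeff:          steering coefficient (3.0 mirrors the value in the comment)
    """
    hidden = np.asarray(hidden, dtype=float)
    persona_vector = np.asarray(persona_vector, dtype=float)
    if hidden.shape[-1] != persona_vector.shape[-1]:
        raise ValueError("hidden size and persona vector size must match")
    return hidden + coeff * persona_vector


def projection(hidden, persona_vector):
    """Scalar projection of each token's hidden state onto the unit persona direction."""
    v = np.asarray(persona_vector, dtype=float)
    unit = v / np.linalg.norm(v)
    return np.asarray(hidden, dtype=float) @ unit


# Toy data (hypothetical): two token positions, hidden size 4.
hidden = np.zeros((2, 4))
evil = np.array([1.0, 0.0, 0.0, 0.0])

steered = steer(hidden, evil, coeff=3.0)
```

With a unit-norm vector, as in the toy data, the projection of every steered position onto the persona direction shifts by exactly the coefficient, which is one simple way to sanity-check that a steering hook is active.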

Posts
Finding "misaligned persona" features in open-weight models (42 karma, 1mo, 5 comments)
Follow-up experiments on preventative steering (31 karma, 2mo, 1 comment)
Persona vectors: monitoring and controlling character traits in language models (25 karma, 3mo, 3 comments)