RunjinChen — LessWrong

Finding "misaligned persona" features in open-weight models

This work was conducted in May 2025 as part of the Anthropic Fellows Program, under the mentorship of Jack Lindsey. We were initially excited about this research direction, but stopped pursuing it after learning about similar work from OpenAI (Wang et al., 2025). We're sharing some of the initial results...

Sep 9, 2025 · 49

Follow-up experiments on preventative steering

This post serves as a follow-up to our recent work on persona vectors. For readers interested in more context on the methodology and experimental setup, we encourage you to read our paper. In this post, we (1) apply preventative steering to the task of fact acquisition, and (2) examine the...

Sep 6, 2025 · 34

Persona vectors: monitoring and controlling character traits in language models

This is a brief summary of our recent paper. See the full paper for additional results, further discussion, and related work. Abstract: Large language models interact with users through a simulated 'Assistant' persona. While the Assistant is typically trained to be helpful, harmless, and honest, it sometimes deviates from these...

Aug 1, 2025 · 26