Finding "misaligned persona" features in open-weight models
This work was conducted in May 2025 as part of the Anthropic Fellows Program, under the mentorship of Jack Lindsey. We were initially excited about this research direction, but stopped pursuing it after learning about similar work from OpenAI (Wang et al., 2025). We're sharing some of the initial results...