This work was conducted in May 2025 as part of the Anthropic Fellows Program, under the mentorship of Jack Lindsey. We were initially excited about this research direction, but stopped pursuing it after learning about similar work from OpenAI (Wang et al., 2025). We're sharing some of our initial results and releasing SAEs for two popular open-weight instruct-tuned models.
If you're in a hurry, I suggest jumping straight to the sections that zoom in on individual features in Llama and Qwen, which contain examples of interesting features that are up-weighted in emergently misaligned models.
Code and SAEs are available here.
Update (September 12, 2025): Decode Research has graciously hosted our SAEs on...