Investigating task-specific prompts and sparse autoencoders for activation monitoring
We’d like to share the most recent paper from the OpenAI Interpretability Team. The question of how activation probes with sparse autoencoders compare with baseline activation probes has gotten a lot of attention recently. With this work, we took a step back and asked: how do both these activation probing...
Apr 30, 2025