SAEBench: A Comprehensive Benchmark for Sparse Autoencoders
Adam Karvonen*, Can Rager*, Johnny Lin*, Curt Tigges*, Joseph Bloom*, David Chanin, Yeu-Tong Lau, Eoin Farrell, Arthur Conmy, Callum McDougall, Kola Ayonrinde, Matthew Wearden, Samuel Marks, Neel Nanda
*equal contribution
TL;DR
* We are releasing SAE Bench, a suite of 8 diverse sparse autoencoder (SAE) evaluations including unsupervised metrics and...
I'm not sure I entirely agree with the overall recommendation for researchers working on internals-based techniques. I do agree that findings will need to be behavioral initially in order to be legible and something that decision-makers find worth acting on.
My expectation is that internals-based techniques (including mech interp) and techniques that detect specific highly legible behaviors will ultimately converge. That is:
- Internals/mech interp researchers will, as they have so far (at least in model organisms), find examples of concerning cognition that will be largely ignored or not fully acted on
- Eventually, legible examples of misbehavior will be found, resulting in action or increased scrutiny
- This scrutiny will then propagate backwards to finding causes or
...