
Attention Output SAEs

Jun 21, 2024 by Arthur Conmy

We train sparse autoencoders (SAEs) on the output of attention layers, extending the existing work that trained them on MLP layers and the residual stream.
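As a rough illustration of the architecture involved (not the authors' actual training code, and with hypothetical dimensions), a sparse autoencoder is a single-hidden-layer network trained to reconstruct an activation vector through an overcomplete ReLU feature layer, with an L1 penalty encouraging sparse feature activations:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions for illustration only:
d_model = 64   # width of the attention layer output being reconstructed
d_sae = 256    # number of SAE features (an expansion over d_model)

# Randomly initialized SAE parameters (a trained SAE would learn these).
W_enc = rng.normal(0, 0.1, (d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(0, 0.1, (d_sae, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode an activation vector into sparse features, then reconstruct it."""
    f = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)  # ReLU feature activations
    x_hat = f @ W_dec + b_dec                          # linear reconstruction
    return f, x_hat

def sae_loss(x, f, x_hat, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that encourages sparsity."""
    return np.sum((x - x_hat) ** 2) + l1_coeff * np.sum(np.abs(f))

x = rng.normal(0, 1, d_model)  # stand-in for one attention-output activation
f, x_hat = sae_forward(x)
loss = sae_loss(x, f, x_hat)
```

In practice such models are trained with an optimizer over many activations harvested from the network; the sketch above only shows the forward pass and loss.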

We perform a qualitative study of the features computed by attention layers, and find multiple feature families: for example, long-range context features, short-range context features, and induction features.

More importantly, we show that sparse autoencoders are a useful tool that enables researchers to explain model behavior in greater detail than prior work. We use our SAEs to analyze the computation performed by the Indirect Object Identification circuit, validating that the SAEs find causally meaningful intermediate variables, and deepening our understanding of the semantics of the circuit.
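One standard way to check that an SAE feature is a causally meaningful intermediate variable is to ablate it: decode the activation with that one feature zeroed out, and patch the result back into the model in place of the original attention output. A minimal sketch of the ablation step, with randomly initialized stand-ins for a trained SAE and hypothetical dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 64, 256  # hypothetical dimensions for illustration

# Stand-ins for trained SAE weights.
W_enc = rng.normal(0, 0.1, (d_model, d_sae))
W_dec = rng.normal(0, 0.1, (d_sae, d_model))
b_dec = np.zeros(d_model)

def ablate_feature(x, feature_idx):
    """Reconstruct x from SAE features with one feature zeroed out.

    Substituting this reconstruction for the original attention output
    tests whether that feature is causally relevant downstream."""
    f = np.maximum(0.0, (x - b_dec) @ W_enc)  # ReLU encoder (bias omitted)
    f[feature_idx] = 0.0                       # knock out one feature
    return f @ W_dec + b_dec                   # decode back to model space

x = rng.normal(0, 1, d_model)   # stand-in for one attention-output activation
x_patched = ablate_feature(x, feature_idx=7)
```

Comparing the model's behavior with and without the ablation then indicates how much the feature contributes to the circuit being studied.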

Finally, we open-source the trained SAEs and a tool for exploring arbitrary prompts through the lens of Attention SAEs.

Posts in this sequence:

1. Sparse Autoencoders Work on Attention Layer Outputs, by Connor Kissane, robertzk, Arthur Conmy, Neel Nanda
2. Attention SAEs Scale to GPT-2 Small, by Connor Kissane, robertzk, Arthur Conmy, Neel Nanda
3. We Inspected Every Head In GPT-2 Small using SAEs So You Don’t Have To, by robertzk, Connor Kissane, Arthur Conmy, Neel Nanda
4. Attention Output SAEs Improve Circuit Analysis, by Connor Kissane, robertzk, Arthur Conmy, Neel Nanda