[ Question ]

SAE sparse feature graph using only residual layers

by Jaehyuk Lim
23rd May 2024
1 min read

Tags: Inner Alignment, Interpretability (ML & AI), Sparse Autoencoders (SAEs), AI

Does it make sense to extract a sparse feature graph for a behavior from only the residual layers of GPT-2 small, or do we need all the MLP and attention layers as well?

1 Answer, sorted by top scoring

Joseph Bloom

May 24, 2024

I think so, but I expect others to object. Many people interested in circuits are using attention and MLP SAEs and experimenting with transcoders and SAE variants for attention heads. It depends on how much you care about being able to say what an attention head or MLP is doing, versus being happy to just talk about features. Sam Marks at the Bau Lab is the person to ask.
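
For concreteness, here is a minimal sketch of the residual-stream-only setup the question asks about: hook each GPT-2 small block's residual output via HuggingFace transformers and encode it with a stand-in SAE. The ToySAE weights and d_sae are placeholders rather than pretrained SAEs; in practice you would load trained residual SAEs for each layer (e.g. from sae_lens) and build the feature graph from their activations.

```python
import torch
import torch.nn as nn
from transformers import GPT2Model, GPT2Tokenizer

class ToySAE(nn.Module):
    """Minimal SAE encoder: feature_acts = ReLU(x @ W_enc + b_enc). Placeholder weights."""
    def __init__(self, d_in: int, d_sae: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_in, d_sae) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(x @ self.W_enc + self.b_enc)

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")   # GPT-2 small, residual width 768
model.eval()

# One SAE per block's residual output. Randomly initialised here; in practice
# these would be pretrained residual-stream SAEs for each layer.
saes = [ToySAE(d_in=model.config.n_embd, d_sae=2048) for _ in range(model.config.n_layer)]

resid_acts = {}

def make_hook(layer_idx):
    def hook(module, inputs, output):
        # output[0] is the residual stream leaving this transformer block
        resid_acts[layer_idx] = output[0].detach()
    return hook

for i, block in enumerate(model.h):
    block.register_forward_hook(make_hook(i))

with torch.no_grad():
    batch = tokenizer("The quick brown fox", return_tensors="pt")
    model(**batch)

# Per-layer feature activations, shape (batch, seq_len, d_sae); a feature graph
# built from these alone is the "residual-only" version the question describes.
features = {i: saes[i].encode(acts) for i, acts in resid_acts.items()}
print({i: tuple(f.shape) for i, f in features.items()})
```

Extending this beyond the residual stream would just mean hooking the attention and MLP sublayer outputs as well and attaching SAEs (or transcoders) trained on those activations.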

Jaehyuk Lim

Thank you for the feedback, and thanks for this.

Who else is actively pursuing sparse feature circuits, in addition to Sam Marks? I'm curious because the code breaks in the forward pass of the linear layer when run on GPT-2, since the dimensions are different from Pythia's (768).
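
For illustration, a minimal reproduction of that failure mode, assuming the SAEs were trained on Pythia-70m (hidden size 512) while GPT-2 small's residual stream is 768-dimensional; the exact widths depend on which Pythia checkpoint the code targets:

```python
import torch
import torch.nn as nn

d_pythia = 512   # hidden size of Pythia-70m, the model the SAEs were trained on (assumed)
d_gpt2 = 768     # hidden size of GPT-2 small
d_sae = 2048     # number of SAE features (placeholder)

# Encoder of an SAE trained on Pythia residual activations
encoder = nn.Linear(d_pythia, d_sae)

# Residual-stream activations from GPT-2 small: (batch, seq, d_model)
gpt2_resid = torch.randn(1, 10, d_gpt2)

try:
    encoder(gpt2_resid)
except RuntimeError as err:
    # e.g. "mat1 and mat2 shapes cannot be multiplied (10x768 and 512x2048)"
    print(err)
```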

Joseph Bloom
SAEs are model-specific. You need Pythia SAEs to investigate Pythia. I don't have a comprehensive list, but you can look at the Sparse Autoencoders (SAEs) tag on LW for relevant papers.
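
A cheap way to surface this earlier is to compare the SAE's input width with the target model's hidden size before running anything; a sketch using HuggingFace configs (the helper below is hypothetical, not from any library):

```python
from transformers import AutoConfig

def check_sae_matches_model(sae_d_in: int, model_name: str) -> None:
    """Fail fast if an SAE's input width doesn't match the model it is applied to.

    Hypothetical helper for illustration only.
    """
    d_model = AutoConfig.from_pretrained(model_name).hidden_size  # GPT2Config aliases n_embd
    if sae_d_in != d_model:
        raise ValueError(
            f"SAE expects d_in={sae_d_in}, but {model_name} has hidden size {d_model}; "
            f"use SAEs trained on {model_name} itself."
        )

check_sae_matches_model(sae_d_in=512, model_name="EleutherAI/pythia-70m")  # passes
check_sae_matches_model(sae_d_in=512, model_name="gpt2")                   # raises ValueError
```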