Sparse Autoencoders (SAEs)

Edited by Joseph Bloom, last updated 6th Apr 2024

Sparse Autoencoders (SAEs) are an unsupervised technique for decomposing the activations of a neural network into a sparse sum of interpretable components, often referred to as features. SAEs may be useful for interpretability and related alignment agendas.
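As a minimal sketch of the idea in code, assuming the standard architecture from the posts linked below (one hidden layer, a ReLU, and an L1 sparsity penalty); all dimensions, names, and the penalty coefficient here are illustrative, not taken from any particular codebase:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy SAE: encode activations into a wider, sparsely active
    feature space, then decode back to reconstruct the input."""

    def __init__(self, d_model: int = 768, d_features: int = 16384):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))  # mostly-zero feature activations
        recon = self.decoder(features)             # reconstruction of the input
        return recon, features

def sae_loss(acts, recon, features, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty encouraging sparse features;
    # l1_coeff is an illustrative hyperparameter.
    mse = torch.mean((recon - acts) ** 2)
    l1 = l1_coeff * features.abs().sum(dim=-1).mean()
    return mse + l1

# Illustrative usage on a random batch standing in for model activations:
acts = torch.randn(32, 768)
sae = SparseAutoencoder()
recon, feats = sae(acts)
loss = sae_loss(acts, recon, feats)
```

Because each reconstruction is a weighted sum of the decoder's columns, with weights given by the mostly-zero feature activations, a trained SAE expresses each activation vector as a sparse sum of candidate interpretable directions.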

For more information on SAEs see:

  • Towards Monosemanticity: Decomposing Language Models with Dictionary Learning
  • Sparse Autoencoders Find Highly Interpretable Directions in Language Models
Posts tagged Sparse Autoencoders (SAEs)
  • Towards Monosemanticity: Decomposing Language Models With Dictionary Learning (Zac Hatfield-Dodds, 2y)
  • [Interim research report] Taking features out of superposition with sparse autoencoders (Lee Sharkey, Dan Braun, beren, 3y)
  • Interpretability with Sparse Autoencoders (Colab exercises) (CallumMcDougall, 2y)
  • Sparse Autoencoders Find Highly Interpretable Directions in Language Models (Logan Riggs, Hoagy, Aidan Ewart, Robert_AIZI, 2y)
  • Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small (Joseph Bloom, 2y)
  • Sparse Autoencoders Work on Attention Layer Outputs (Connor Kissane, robertzk, Arthur Conmy, Neel Nanda, 2y)
  • Attention SAEs Scale to GPT-2 Small (Connor Kissane, robertzk, Arthur Conmy, Neel Nanda, 2y)
  • [Summary] Progress Update #1 from the GDM Mech Interp Team (Neel Nanda, Arthur Conmy, lewis smith, Senthooran Rajamanoharan, Tom Lieberum, János Kramár, Vikrant Varma, 1y)
  • Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight (Sam Marks, 1y)
  • We Inspected Every Head In GPT-2 Small using SAEs So You Don’t Have To (robertzk, Connor Kissane, Arthur Conmy, Neel Nanda, 2y)
  • [Paper] A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders (chanind, TomasD, hrdkbhatnagar, Joseph Bloom, 1y)
  • Stitching SAEs of different sizes (Bart Bussmann, Patrick Leask, Joseph Bloom, Curt Tigges, Neel Nanda, 1y)
  • Sparsify: A mechanistic interpretability research agenda (Lee Sharkey, 1y)
  • Understanding SAE Features with the Logit Lens (Joseph Bloom, Johnny Lin, 2y)
  • Do Sparse Autoencoders (SAEs) transfer across base and finetuned language models? (Taras Kutsyk, Tommaso Mencattini, Ciprian Florea, 1y)
(15 of 161 tagged posts shown.)