hrdkbhatnagar

(Not) Explaining GPT-2-Small Forward Passes with Edge-Level Autoencoder Circuits

by David Udell, hrdkbhatnagar, and JacksonKaunismaa

Repo: https://github.com/DavidUdell/sparse_circuit_discovery

TL;DR: A SPAR project from a while back. A replication of an unsupervised circuit discovery algorithm in GPT-2-small, with a negative result.

Thanks to Justis Mills for draft feedback and to Neuronpedia for interpretability data.

Introduction

I (David) first heard about sparse autoencoders at a Bay Area party....

Jul 22, 2025

[Paper] A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders

Toy Models of Feature Absorption in SAEs

Compositionality and Ambiguity: Latent Co-occurrence and Interpretable Subspaces

(Not) Explaining GPT-2-Small Forward Passes with Edge-Level Autoencoder Circuits
