Lee Sharkey

[Linkpost] Interpreting Language Model Parameters

by Lucius Bushnaq, Dan Braun, Oliver Clive-Griffin, Bart Bussmann, Nathan Hu, mivanitskiy, Linda Linsefors, and Lee Sharkey

This is the latest work in our Parameter Decomposition agenda. We introduce a new parameter decomposition method, adVersarial Parameter Decomposition (VPD)[1], and use it to decompose the parameters of a small[2] language model. VPD greatly improves on our previous techniques, Stochastic Parameter Decomposition (SPD) and Attribution-based Parameter Decomposition (APD). We think...

May 5 · 164

[Paper] Stochastic Parameter Decomposition

Abstract: A key step in reverse engineering neural networks is to decompose them into simpler parts that can be studied in relative isolation. Linear parameter decomposition, a framework that has been proposed to resolve several issues with current decomposition methods, decomposes neural network parameters into a sum of sparsely used vectors...

Jun 27, 2025 · 47

Mech interp is not pre-paradigmatic

This is a blogpost version of a talk I gave earlier this year at GDM. Epistemic status: Vague and handwavy. Nuance is often missing. Some of the claims depend on implicit definitions that may be reasonable to disagree with and are, in an important sense, subjective. But overall I think...

Jun 10, 2025 · 213

Paper: Open Problems in Mechanistic Interpretability

TL;DR: This paper brings together ~30 mechanistic interpretability researchers from 18 different research orgs to review current progress and the main open problems of the field. This review collects the perspectives of its various authors and represents a synthesis of their views by Apollo Research on behalf of Schmidt Sciences....

Jan 29, 2025 · 71

Attribution-based parameter decomposition

by Lucius Bushnaq, Dan Braun, StefanHex, jake_mendel, and Lee Sharkey

This is a linkpost for Apollo Research's new interpretability paper: "Interpretability in Parameter Space: Minimizing Mechanistic Description Length with Attribution-based Parameter Decomposition". We introduce a new method for directly decomposing neural network parameters into mechanistic components. Motivation: At Apollo, we've spent a lot of time thinking about how the computations...

Jan 25, 2025 · 109

Showing SAE Latents Are Not Atomic Using Meta-SAEs

by Bart Bussmann, Michael Pearce, Patrick Leask, Joseph Bloom, Lee Sharkey, and Neel Nanda

Bart, Michael and Patrick are joint first authors. Research conducted as part of MATS 6.0 in Lee Sharkey and Neel Nanda’s streams. Thanks to Mckenna Fitzgerald and Robert Krzyzanowski for their feedback! TL;DR: Sparse Autoencoder (SAE) latents have been shown to typically be monosemantic (i.e. correspond to an interpretable...

Aug 24, 2024 · 73

Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs

by Kola Ayonrinde, Michael Pearce, and Lee Sharkey

This work was produced as part of the ML Alignment & Theory Scholars Program - Summer 24 Cohort, under mentorship from Lee Sharkey and Jan Kulveit. Note: An updated paper version of this post can be found here. Abstract: Sparse Autoencoders (SAEs) have emerged as a useful tool for interpreting...

Aug 23, 2024 · 43

Lee Sharkey

Announcing Apollo Research

Mech interp is not pre-paradigmatic

[Linkpost] Interpreting Language Model Parameters

[Interim research report] Taking features out of superposition with sparse autoencoders
