[Paper] A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders
This research was completed for London AI Safety Research (LASR) Labs 2024. The team was supervised by Joseph Bloom (Decode Research). Find out more about the programme and express interest in upcoming iterations here. This high-level summary will be most accessible to readers with relevant context, including an understanding of SAEs. The importance of this work rests in part on the surrounding hype and on potential philosophical issues. We encourage readers seeking technical details to read the paper on arXiv. Explore our interactive app here.

TLDR: This is a short post summarising the key ideas and implications of our recent work studying how character information represented in language models is extracted by SAEs. Our most important result shows that SAE latents can appear to classify some feature of the input, but actually turn out to be quite unreliable classifiers (much worse than linear probes). We think this unreliability is in part due to the difference between what we actually want (an "interpretable decomposition") and what we train against (sparsity + reconstruction). We think there are many potentially productive follow-up investigations.

We pose two questions:

1. To what extent do Sparse Autoencoders (SAEs) extract interpretable latents from LLMs? The success of SAE applications (such as detecting safety-relevant features or efficiently describing circuits) will rely on whether SAE latents are reliable classifiers and provide an interpretable decomposition.

2. How does varying the hyperparameters of the SAE affect its interpretability? Much time and effort is being invested in iterating on SAE training methods; can we provide a guiding signal for these endeavours?

To answer these questions, we tested SAE performance on a simple first-letter identification task using over 200 Gemma Scope SAEs. By focussing on a task with ground-truth labels, we could precisely measure the precision and recall of SAE latents tracking first-letter information. Our results rev