[Paper] A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders
This research was completed for London AI Safety Research (LASR) Labs 2024. The team was supervised by Joseph Bloom (Decode Research). Find out more about the programme and express interest in upcoming iterations here. This high-level summary will be most accessible to those with relevant context, including an understanding of SAEs. The importance of this work rests in part on the surrounding hype and potential philosophical issues. We encourage readers seeking technical details to read the paper on arXiv. Explore our interactive app here.

TLDR: This is a short post summarising the key ideas and implications of our recent work studying how character information represented in language models is extracted by SAEs. Our most important result shows that SAE latents can appear to classify some feature of the input, but actually turn out to be quite unreliable classifiers (much worse than linear probes). We think this unreliability is in part due to the difference between what we actually want (an "interpretable decomposition") and what we train against (sparsity + reconstruction). We think there are many potentially productive follow-up investigations.

We pose two questions:

1. To what extent do Sparse Autoencoders (SAEs) extract interpretable latents from LLMs? The success of SAE applications (such as detecting safety-relevant features or efficiently describing circuits) relies on whether SAE latents are reliable classifiers and provide an interpretable decomposition.
2. How does varying the hyperparameters of an SAE affect its interpretability? Much time and effort is being invested in iterating on SAE training methods; can we provide a guiding signal for these endeavours?

To answer these questions, we tested SAE performance on a simple first-letter identification task using over 200 Gemma Scope SAEs. By focussing on a task with ground-truth labels, we precisely measured the precision and recall of SAE latents tracking first-letter information. Our results reveal that these latents are often unreliable classifiers, performing much worse than linear probes.
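To make the evaluation concrete, here is a minimal sketch (not the paper's actual code) of how one might score a single SAE latent as a classifier for "token starts with letter X" and compare it against a logistic-regression probe on the residual stream. The arrays `acts`, `resid`, and `labels`, and the helpers `best_latent_as_classifier` and `probe_baseline`, are hypothetical names for illustration; the choice of "firing = activation > 0" as the latent's prediction rule is an assumption, not the paper's exact procedure.

```python
# Hypothetical sketch: score SAE latents vs. a linear probe on a first-letter task.
# Assumed inputs:
#   acts:   (n_tokens, n_latents) SAE latent activations for a set of tokens
#   resid:  (n_tokens, d_model)   residual-stream activations for the same tokens
#   labels: (n_tokens,) boolean, True if the token's first letter is the target letter
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, f1_score

def best_latent_as_classifier(acts: np.ndarray, labels: np.ndarray):
    """Treat each latent's firing (activation > 0) as a binary prediction and
    return the latent with the highest F1, plus its precision and recall."""
    f1s = [f1_score(labels, acts[:, i] > 0, zero_division=0)
           for i in range(acts.shape[1])]
    best = int(np.argmax(f1s))
    preds = acts[:, best] > 0
    return best, precision_score(labels, preds), recall_score(labels, preds)

def probe_baseline(resid: np.ndarray, labels: np.ndarray):
    """Logistic-regression probe on the residual stream as a reference classifier."""
    probe = LogisticRegression(max_iter=1000).fit(resid, labels)
    preds = probe.predict(resid)
    return precision_score(labels, preds), recall_score(labels, preds)
```

In practice one would fit and evaluate the probe on separate splits; the point of the sketch is only that ground-truth first-letter labels let precision and recall be computed directly for both the SAE latent and the probe.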