Author order randomized. Authors contributed roughly equally — see attribution section for details.
Update as of July 2024: we have collaborated with @LawrenceC to expand section 1 of this post into an arXiv paper, which culminates in a formal proof that computation in superposition can be leveraged to emulate sparse boolean circuits of arbitrary depth in small neural networks.
What kind of document is this?
What you have in front of you is so far a rough writeup rather than a clean text. Since we realized that our work is highly relevant to recent questions posed by interpretability researchers, we put together a lightly edited version of private notes we've written over the last ~4 months. If you'd be interested in writing up a cleaner version, get in touch, or just do it. We're making these notes public before we're done with the project because of some combination of (1) seeing others think along similar lines and wanting to make it less likely that people (including us) spend time duplicating work, (2) offering a frame which we think provides plenty of concrete immediate problems for people to work on independently,[1] and (3) seeking feedback to decrease the chance we spend a bunch of time on nonsense.
1-minute summary
Superposition is a mechanism that might allow neural networks to represent the values of many more features than they have neurons, provided that those features are present sparsely in the dataset. However, until now, an understanding of how computation can be done in a compressed way directly on these stored features has been limited to a few very specific tasks (for example here). The goal of this post is to lay the groundwork for a picture of how computation in superposition can be done in general. We hope this will enable future research to build interpretability techniques for reverse engineering circuits that are manifestly in superposition.
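To make the "many features in few neurons" picture concrete, here is a minimal numerical sketch of superposition as a storage scheme (our own illustration, not code from the project, with arbitrary sizes chosen for the example): many features are assigned random, nearly orthogonal directions in a lower-dimensional space, and as long as only a few features are active at once, each feature's value can be read back off with a dot product up to small interference noise.

```python
# Illustrative sketch: storing many sparse features in fewer dimensions via
# random, nearly orthogonal directions, then reading them back linearly.
import numpy as np

rng = np.random.default_rng(0)
n_features, d_model, k_active = 10_000, 1_000, 5  # example sizes, not from the post

# Each feature gets a random unit direction in the d_model-dimensional space.
directions = rng.normal(size=(n_features, d_model))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# Activate a sparse subset of features, each with value 1.
active = rng.choice(n_features, size=k_active, replace=False)
residual = directions[active].sum(axis=0)  # the superposed representation

# Linear readout: dot product of the representation with each feature's direction.
readout = directions @ residual
print("active feature readouts:", np.round(readout[active], 3))          # each close to 1
print("max interference:", np.abs(np.delete(readout, active)).max())     # well below 1
```

The point of the sketch is only the storage side: the interference on inactive features scales roughly like sqrt(k_active / d_model), which is why sparsity is essential. How to *compute* on such compressed representations in general is exactly what the rest of the post is about.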
Our main contributions are:
1. Formalisation of some tasks performed by MLPs and attention