TL;DR: To separate out superimposed features represented by model neurons, train a sparse autoencoder on a layer's activations. Once you've learned a sparse autoencoding of those activations, this autoencoder's neurons can now be readily interpreted.


All code hosted at this repository: activation_additions/sparse_coder

A bit ago, I became interested in scaling activation engineering to the largest language models I could. I was initially surprised at how effective the technique was for being such a naive approach, which made me much more enthusiastic about simple manipulations of model activation spaces.

Yudkowsky says that we cannot expect to survive without a mathematical understanding, a guiding mathematical framework, of the AI. One hunch you might have is that a linear feature combination theorem could be the root of such a guiding theory. If so, we might learn a lot about the internal learned mechanisms of models by playing with their activation spaces. I feel like tuned lens and activation additions are some evidence for this hypothesis.

One major problem I experienced as I scaled up activation engineering to the largest models I could get my hands on (the new open-source Llama-2 models) was that it's hard to guess ahead of time which additions will work and which won't. You generate a new addition and stick it into a forward pass. Then, you get a few bits back observing how well the addition worked. "It would have been great," I thought, "to get a window into which concepts the model represents internally, and at which layer it does so."[1]

Sparse coding excited me at this point, because it suggested a way to learn a function from uninterpretable activations to represented, interpretable concepts! Paired with activation engineering's function from interpretable concepts to model internal activations, it sounded like a promising alignment scheme. Now, many things sound promising ahead of time. But seeing the MATS 4 Lee Sharkey team get extremely clean, concrete results on Pythia drove my confidence in this path way up.

This is the writeup of that research path. I still think this is an extremely promising interpretability path, about as important as activation engineering is.

What I do is:

  1. collect model activations at a layer,
  2. train an autoencoder on those activations with an $L^1$ sparsity penalty, and
  3. interpret the neurons of the trained autoencoder.

The neurons in the autoencoder then appear meaningful under top-token visualizations!

Technical Argument from Sparse Coding Theory

Epistemic status: Theoretical argument.

Say you collect a bunch of activation vectors from a particular layer of a trained model, during some task. These activation vectors are generally not natively interpretable. They're vectors in some space... but we have no real understanding of the meanings of that space's basis dimensions. We only know that all those activation spaces, passed through in sequence, yield coherent English speech. English concepts are being represented in there, internally, somewhere. But we don't really know how.

The problem is that there is no privileged basis in a transformer's activation space. The model was incentivized during training to learn every classifier it needed to mirror its training distribution. But there was no training incentive for each classifier to correspond to a single neuron. The training distribution is sparse: you don't need to be ready to represent each concept independently of every other concept. The training incentive actually weighed against the one-to-one neuron solution, then, as that's wasteful in weights. So there's plenty of mechanistic reason for a model's neuron activations to look like jumbled messes to us. To exploit a sparse world, learn densely compacted features.

And the solution we empirically see learned is indeed superimposed features! Don't dedicate a neuron to each feature. Have each neuron represent a linear combination of features. For this reason, all the directions in an activation space will tend to be polysemantic. If you just run PCA on an activation space, the resulting directions will often be frustratingly polysemantic.[2]
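As a toy illustration of superposition (a sketch with made-up sizes and random directions, not anything measured from a real model), the snippet below packs 8 sparse features into a 4-neuron activation space. The names `directions`, `strengths`, and the 10% activity rate are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, d_model, n_samples = 8, 4, 1000   # more features than neurons

# Hypothetical feature directions: one unit vector per feature.
directions = rng.normal(size=(n_features, d_model))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# Sparse world: each feature is active on roughly 10% of samples.
active = rng.random(size=(n_samples, n_features)) < 0.1
strengths = active * rng.random(size=(n_samples, n_features))

# Each activation vector is a linear superposition of the active features.
acts = strengths @ directions                 # (n_samples, d_model)

# Count how many features load noticeably on each neuron (basis dimension).
features_per_neuron = (np.abs(directions) > 0.1).sum(axis=0)
print(features_per_neuron)
```

Each neuron ends up reading from several features at once, which is the polysemanticity described above.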

Sparse coding[3] is a solution to this superposition-of-features problem. You train autoencoders with an $L^1$ sparsity penalty on the activations collected from a model layer. The autoencoder can be as simple as a tied matrix, then a ReLU, then the tied matrix transpose. The learned matrix together with the ReLU maps to a larger projection space. An $L^1$ penalty is applied during training to autoencoder activations in this large projection space. The autoencoder is trained to reproduce the input activations while simultaneously respecting the $L^1$ internal representation penalty.
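The architecture just described can be sketched in a few lines of NumPy. This is a minimal sketch, not the code from the repository: the function names, the toy sizes, the random stand-in activations, and the `l1_coeff` value are all assumptions, and no training loop is shown (in practice you would minimize this loss with an autodiff framework).

```python
import numpy as np

def autoencoder_forward(acts, W):
    """Tied-weight sparse autoencoder: W, then ReLU, then W transposed.

    acts: (batch, d_model) activations collected from one model layer.
    W:    (n_features, d_model) tied matrix, with n_features > d_model.
    """
    codes = np.maximum(acts @ W.T, 0.0)   # project up, then ReLU
    recon = codes @ W                     # tied transpose maps back down
    return codes, recon

def autoencoder_loss(acts, W, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty on the hidden codes."""
    codes, recon = autoencoder_forward(acts, W)
    reconstruction = np.mean(np.sum((recon - acts) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(codes), axis=1))
    return reconstruction + l1_coeff * sparsity

rng = np.random.default_rng(0)
d_model, n_features = 16, 64              # toy sizes (assumed)
acts = rng.normal(size=(32, d_model))     # stand-in for real activations
W = rng.normal(size=(n_features, d_model)) / np.sqrt(n_features)

codes, recon = autoencoder_forward(acts, W)
loss = autoencoder_loss(acts, W)
```

Tying the encoder and decoder to one matrix means each row of `W` is both the direction a feature is read from and the direction it is written back to.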

We're interested in particular solutions to this formal problem: learn to give each feature a neuron, i.e., have features fall along the standard basis. This way, the $L^1$ penalty gives good values: most of your autoencoder activation values will be precisely zero. (An $L^1$ penalty yields a constant negative gradient to the extent that there are non-zero elements in the autoencoder's activations.) If the activation vectors are just linearly superimposed feature dimensions, then separating them out and squeezing them back together in this way should reproduce the original vectors. That will satisfy the reproduction loss, too.
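One way to write the objective down, as a sketch: the symbols $x$ (an activation vector), $W$ (the tied matrix), $c$ (the autoencoder activations), and $\lambda$ (the penalty weight) are notation assumed here for illustration.

$$
c = \mathrm{ReLU}(W x), \qquad \hat{x} = W^\top c, \qquad \mathcal{L}(x) = \lVert x - \hat{x} \rVert_2^2 + \lambda \lVert c \rVert_1
$$

Because $c \ge 0$ after the ReLU, $\lVert c \rVert_1 = \sum_i c_i$, so

$$
\frac{\partial}{\partial c_i} \, \lambda \lVert c \rVert_1 = \lambda \quad \text{for } c_i > 0 .
$$

Gradient descent therefore pushes every non-zero $c_i$ toward zero at the same rate $\lambda$, however small that entry already is. A squared penalty, by contrast, would push less and less as an entry shrinks, so it would favor many small values over exact zeros.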

We train such an autoencoder to convergence, driving towards an  value of between  (in smaller models) and  (in larger models). We save the trained autoencoder and examine its standard basis. Empirically, these neuronal directions appear quite semantically meaningful!

Autoencoder Interpretability

Epistemic status: Experimental observations. There's a robust effect here... but my code could absolutely still contain meaningful bugs.

Pythia 70M

Let's examine autoencoders trained at each of Pythia 70M's layers. Our interpretability technique is checking which tokens in the prompt most activate a given autoencoder neuronal direction.
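A top-token readout of this kind might look like the following sketch. The per-token averaging and the `top_tokens` name are assumptions, not necessarily how the repository computes it; the positive-only filter mirrors the exclusion of negative activations described in footnote 4.

```python
from collections import defaultdict

import numpy as np

def top_tokens(tokens, codes, dim, k=6):
    """Return up to k tokens with the highest mean activation along `dim`.

    tokens: list of token strings, one per position in the prompt(s).
    codes:  (n_positions, n_features) autoencoder activations.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for tok, row in zip(tokens, codes):
        sums[tok] += float(row[dim])
        counts[tok] += 1
    means = {tok: sums[tok] / counts[tok] for tok in sums}
    # Keep only tokens that activate the direction positively.
    positive = [(tok, val) for tok, val in means.items() if val > 0]
    positive.sort(key=lambda pair: pair[1], reverse=True)
    return [tok for tok, _ in positive[:k]]

# Toy data (made up): direction 0 fires on "What", direction 1 never fires.
tokens = ["What", " is", "What", " the"]
codes = np.array([[2.0, 0.0], [0.5, 0.0], [4.0, 0.0], [0.0, 0.0]])
print(top_tokens(tokens, codes, dim=0, k=2))
print(top_tokens(tokens, codes, dim=1, k=2))
```

A direction that returns an empty list here corresponds to one of the dropped directions: it has no positively activating tokens to show.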

For each Pythia autoencoder, here are ten unsorted non-zero directions and their favorite tokens:[4]

Layer 1
Dimension | Top Input Tokens
2 | holding,  speak,  remember,  read,  learn,  hears
11 | :, )?
76 | commissioned,  gear,  generate,  mixed,  conclude,  credit
124 | what, What,  What, what
133 | equally,  most,  deeply,  relatively,  greater, more
166 | civil,  loan
183 | because,  still, although,  Because,  since,  although
191 | Cl,  Sn,  L,  Le,  Mes,  Mon
206 | New,  New,  popular,  ',  old,  handsome
236 | L,  l,  O, .,  unl,  Fl

Layer 2
Dimension | Top Input Tokens
26 | !", ", ...", "., '.
88 | Yes, clusively, iably,  vertically,  right
96 | What, What,  How, what,  what,  how
154 | US,  Americas,  Netherlands,  Massachusetts,  States, bourg
158 | presidents,  pilots,  Scholars,  founders,  Ts,  Doctors
171 | you, 'll, ),  will,  we,  if
185 | They,  they,  she,  he
243 | iless,  prohibiting,  custody,  needs,  permission
269 | impressive,  vast,  cultural,  sports,  musical,  great
461 | sites,  facilities, une,  board,  School,  Jo

Layer 3
Dimension | Top Input Tokens
79 | Nik,  Ir,  Two,  Poland,  Pol,  spectacular
153 | biological, iga
156 | attracted,  rescued,  confined,  trouble,  provided,  avoided
167 | ft,  Lis, bo, ifer,  Loren
244 | (, 6, 5, 3, 7, 4
349 | Ċ, ard, ifer, ruct, ively, stra
507 | 32,  1950,  Pole, ple, isation, number
714 | Anto, controll,  along, ri, waters, rans
779 | Cro, stra,  Cron,  Bar,  Knowledge,  Crick
811 | bar, lang, rio,  McC, oph, off

Layer 4
Dimension | Top Input Tokens
114 | Q,  unequal,  Gulf,  Tenn,  extr,  GDP
171 | ours,  various,  instantly,  exact,  technically, Ċ
213 | och,  Walt,  corner,  length,  composition,  dose
229 | och,  Little,  mention, ot, af, /
266 | A,  15,  atomic, Ċ,  official,  My
386 | Dec,  Rod,  send,  Cron,  catar,  tou
408 | grant,  Priv,  genuine,  absolute,  typically,  legally
472 | smell,  Jupiter,  auditory,  thinkers,  Venus,  razor
647 | och,  length,  dose

Layer 5
Dimension | Top Input Tokens
83 | penetrate, ensory,  breathe,  bites,  distract, end
291 | fats,  sequences, ats, who,  miracles, isions
367 | deepest,  official,  perfect,  atomic,  presidential, digit
444 | 2, 1, 6, 3, 4, 7
556 | Cash,  Hillary, Q,  Bond, go,  Tea
567 | 2, 3, 1, 4, 6, 5
587 | Return,  atomic,  Person,  official,  composed, room
594 | stayed,  although,  lacks, although,  poorer,  It
646 | Be, &,  che,  Che

Full model results in footnote.[5]

In theory, these are all of the features represented in Pythia 70M's residual streams when these activations were collected. If the technique were extended to a representative dataset and to every Pythia sublayer, you'd in principle enumerate every single concept in Pythia.

Empirically, the two residual spaces right after the embedding layer are the most interpretable of the bunch. Later layers are more garbled, though some clearly meaningful dimensions exist there too.[6]

Note that the interpretability method used on the autoencoders (top-k tokens in the prompt) is relatively naive. I have code for activation heatmaps and direction ablations[7], and those interpretability techniques may capture meaning that top-k tokens miss. Any interpretability technique you have for model neurons... can be applied to sparse autoencoder neurons too.

Llama-2 7B

The above results are my independent replication of the MATS 4 Lee Sharkey team's Pythia sparse coding. What if we scale the technique? Targeting a layer similarly early in the model, we train an autoencoder on Llama-2 7B:

Layer 13
Dimension | Top Input Tokens
109 | 2, 3, 2004
127 | ▁England, ▁dollars, ▁Italian
206 | ▁means, ▁refers, ▁composed, ▁learned, ▁hid, ▁she
207 | ▁society, ▁portal, ati, unker, ▁Order, ▁mission
253 | ▁said, ▁wrote, ▁designed, ▁statement, ▁directed, elled
277 | ▁dan, ▁po, ▁dess, ▁Know, ▁conce, ▁Har
331 | ▁program, ▁intelligence, ▁computer, ▁artificial, I, ▁Rob

Full layer results in footnote.[8]

 seems too low for the autoencoders trained on Llama-2 7B. These Llama-2 results are instead at .[9] Still better interpretability results could be obtained if this range of sparsity values was better explored.

Neuron Interpretability Baseline

If you directly interpret model neurons on Llama-2 7B using the top-k technique, your results look like this:

Layer 13
Neuron | Top Input Tokens
0 | ▁Rafael, ▁animation, ovo, ▁beneath, ▁commun, ▁Cross
1 | ▁Hero, emor, action, ▁Indones, ▁expedition, immer
2 | ▁bus, ▁Sund, ▁top, ▁marriage, ander, ▁breakfast
3 | ▁predict, ▁Ald, ▁phase, ▁overcome, rin, ▁Joy
4 | related, ▁lazy, round, ▁Nev, UI, ▁atmosphere
5 | ▁trans, gu, isted, ▁portal, ▁tiny, laimed
6 | ija, ▁Chief, ▁measures, ▁valuable, space, ▁testing
7 | ond, ▁lazy, ▁Virgin, tes, ▁conquer, ▁uniform
8 | ▁Valley, ctions, round, ▁measures, ▁facilities, ▁variable
9 | ▁ways, ▁definitely, isation, ▁elements, enta, ▁expl

Path to Impact: Learning Windows into Models?

Epistemic status: Wild speculation.

The above suggests that we can train windows into each layer of a model. Each autoencoder window tells you what's going on at that layer, in human-comprehensible terms. The underlying forward pass is unaltered, but we know what concepts each layer contains.

Because you know how those concepts are mapped out of the model into the autoencoder, they are also ready to be added in through activation engineering! So you already have some interpretability and steering control.
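To make that concrete, here is a minimal sketch of turning one autoencoder direction into an activation addition. Everything here is assumed for illustration: the `steer` helper, the coefficient, and the random stand-ins for `W` and `acts`. In a real model the addition would be applied to the residual stream at the matching layer during the forward pass, which is not shown.

```python
import numpy as np

def steer(acts, W, feature, coeff=4.0):
    """Add a scaled autoencoder feature direction to layer activations.

    acts: (batch, d_model) activations at the layer the autoencoder was
          trained on.
    W:    (n_features, d_model) tied autoencoder matrix.
    """
    direction = W[feature]              # (d_model,) direction for one feature
    return acts + coeff * direction     # broadcasts across the batch

rng = np.random.default_rng(0)
d_model, n_features = 16, 64            # toy sizes (assumed)
W = rng.normal(size=(n_features, d_model)) / np.sqrt(n_features)
acts = rng.normal(size=(4, d_model))    # stand-in for real activations

steered = steer(acts, W, feature=3)
before = acts @ W[3]                    # feature 3 pre-activation, unsteered
after = steered @ W[3]                  # feature 3 pre-activation, steered
```

With tied weights, the addition raises the chosen feature's pre-activation by `coeff` times the squared norm of its direction. Whether that shifts model behavior in the intended way is an empirical question.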

More ambitiously, we can now try to reconstruct comprehensible model circuits. With ablations, see which features at one layer affect which features at a downstream layer. Measuring the impact of features on downstream features lets you build up an interpretable "directed semantic graph" of the model's computations.
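A sketch of one such measurement, under heavy assumptions: `between` is a hypothetical stand-in for whatever the model computes between the two layers (here a random toy map), and both autoencoders are random matrices rather than trained ones. The function zeroes one early feature, decodes back to activation space, runs the stand-in computation, and reports how much each later feature moves.

```python
import numpy as np

def encode(acts, W):
    return np.maximum(acts @ W.T, 0.0)

def decode(codes, W):
    return codes @ W

def ablation_effects(acts, W_early, W_late, between, feature):
    """Mean absolute change in each later feature when one early feature
    is zeroed. Entry j is a candidate edge weight feature -> j."""
    codes = encode(acts, W_early)
    ablated = codes.copy()
    ablated[:, feature] = 0.0
    # Compare reconstructions with and without the feature, so that the
    # only difference between the two runs is the ablation itself.
    late_base = encode(between(decode(codes, W_early)), W_late)
    late_ablated = encode(between(decode(ablated, W_early)), W_late)
    return np.abs(late_base - late_ablated).mean(axis=0)

rng = np.random.default_rng(0)
d_model, n_early, n_late = 8, 32, 32            # toy sizes (assumed)
W_early = rng.normal(size=(n_early, d_model))
W_late = rng.normal(size=(n_late, d_model))
M = rng.normal(size=(d_model, d_model))

def between(x):
    """Toy stand-in for the model's computation between the two layers."""
    return np.tanh(x @ M)

acts = rng.normal(size=(64, d_model))
effects = ablation_effects(acts, W_early, W_late, between, feature=0)
```

Repeating this for every early feature would give a matrix of edge weights, which could be thresholded into the directed graph described above.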

This especially is really good stuff. If you can reconstruct the circuits, you can understand the model and retarget its search algorithms. If you can understand and align powerful models, you can use those models as assistants in yet more powerful model alignment.


I've replicated prior sparse coding work and extended it to Llama-2 7B. I'm hoping to keep at it and get results for Llama-2 70B, the best model that I have access to.

Generally, I feel pretty excited about simple modifications to model activation spaces as interpretability and steering techniques! I think these are worth putting points into, as an alignment bet independent of RLHF.

  1. ^

    I was specifically hunting for a "truthiness" activation addition to move around TruthfulQA benchmarks. (I am unsure whether the techniques covered in the post are, in practice, up to programmatically isolating the "truthiness" vector.)

  2. ^

    Or to an AI assistant helping you interpret neurons in a model.

  3. ^

    Also known as "sparse dictionary learning."

  4. ^

    Underlying Pythia activations were collected during six-shot TruthfulQA. (Six-shot is standard in the literature.) This is a far smaller dataset than The Pile, so this was also an experiment in small-dataset sparse coding.

    I project to a -dimensional space from Pythia's -dimensional activation space. Negative token activations are excluded, since the ReLU would zero all of those out—destroying any information negative values might contain.

    So, directions with all negative values are dropped—notice that that's most directions! Only about  in  are kept.

  5. ^

