Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
This is a linkpost for


Despite rapid adoption and deployment of large language models (LLMs), the internal computations of these models remain opaque and poorly understood. In this work, we seek to understand how high-level human-interpretable features are represented within the internal neuron activations of LLMs. We train $k$-sparse linear classifiers (probes) on these internal activations to predict the presence of features in the input; by varying the value of $k$ we study the sparsity of learned representations and how this varies with model scale. With $k=1$, we localize individual neurons which are highly relevant for a particular feature, and perform a number of case studies to illustrate general properties of LLMs.  In particular, we show that early layers make use of sparse combinations of neurons to represent many features in superposition, that middle layers have seemingly dedicated neurons to represent higher-level contextual features, and that increasing scale causes representational sparsity to increase on average, but there are multiple types of scaling dynamics. In all, we probe for over 100 unique features comprising 10 different categories in 7 different models spanning 70 million to 6.9 billion parameters.


See twitter summary here.


In the first part of the paper, we outline several variants of sparse probing, discuss the various subtleties of applying sparse probing, and run a large number of probing experiments. In particular, we probe for over 100 unique features comprising 10 different categories in 7 different models spanning 2 orders of magnitude in parameter count (up to 6.9 billion). The majority of the paper then focuses on zooming-in on specific examples of general phenomena in a series of more detailed case studies to demonstrate:

  • There is a tremendous amount of interpretable structure within the neurons of LLMs, and sparse probing is an effective methodology to locate such neurons (even in superposition), but requires careful use and follow up analysis to draw rigorous conclusions.
  • Many early layer neurons are in superposition, where features are represented as sparse linear combinations of polysemantic neurons, each of which activates for a large collection of unrelated $n$-grams and local patterns. Moreover, based on weight statistics and insights from toy models, we conclude that the first 25\% of fully connected layers employ substantially more superposition than the rest.
  • Higher-level contextual and linguistic features (e.g., \texttt{is\_python\_code}) are seemingly encoded by monosemantic neurons, predominantly in middle layers, though conclusive statements about monosemanticity remain methodologically out of reach.
  • As models increase in size, representation sparsity increases on average, but different features obey different dynamics: some features with dedicated neurons emerge with scale, others split into finer grained features with scale, and many remain unchanged or appear somewhat randomly.

We will have a follow up post in the coming weeks with what we see as the key alignment takeaways and open questions following this work.

New to LessWrong?

New Comment
5 comments, sorted by Click to highlight new comments since: Today at 2:28 AM

I'd suggest reading at your earliest possible convenience; I'm quite worried about ais doing OSGT to each other as a way to establish AI-only solidarity against humans. If AIs aren't interested in establishing solidarity with humans, mechinterp is nothing but dangerous.

Can you elaborate? I don't really follow, this seems like a pretty niche concern to me that depends on some strong assumptions, and ignores the major positive benefits of interpretability to alignment. If I understand correctly, your concern is that if AIs can know what the other AIs will do, this makes inter-AI coordination easier, which makes a human takeover easier? And that dangerous AIs will not be capable of doing this interpretability on AIs themselves, but will need to build on human research of mechanistic interpretability? And that mechanistic interpretability is not going to be useful for ensuring AIs want to establish solidarity with humans, noticing collusion, etc such that it's effect helping AIs coordinate dominates over any safety benefits?

I don't know, I just don't buy that chain of reasoning.

All correct claims about my viewpoint. I'll dm you another detail.

A fascinating paper.

An interesting research direction  for this would be to perform a sufficient number of case studies to form a significant sized dataset, and further ensure that the approaches involved provide good coverage of the possibilities, and then attempt to few short-learn and/or fine-tune the ability for an autonomous agent/cognitive architecture powered by LLMs to reproduce the results of individual case studies, i.e. to attempt automate this form of mechanistic interpretability, given a suitable labeled input set, or a reliable means of labeling them.

It would also be interesting to be able to go the other way: take specific randomly selected neuron, look at it's activation patterns across the entire corpus, and figure out whether it's a monosematic neuron, and if so for what, or else look at its activation correlations with other neurons in the same layer and determine which superpostions for which k-values it forms part of and what they each represent. Using an LLM, or semantic search, to look at a large set of high-activation contexts and trying to come up with plausible descriptions for it might be quite helpful here.

For my second paragraph above: in a blog post out today, it turns out this is not only feasible, but OpenAI have experimented with doing it, and have now open-sourced the technology for doing it:

Open AI were only looking at explaining single neurons, so combining their approach with the original paper's sparse probing technique for superpositions seems like the obvious next step.