What's the sum total of everything we know about language models? At the object level, probably way too much for any one person (not named Gwern) to understand.
However, it might be possible to abstract most of our knowledge into pithily worded frames (i.e. intuitions, ideas, theories) that are much more tractable to grok. And once we have all this information neatly written down in one place, unexpected connections may start to pop up.
This post contains a collection of frames about models that are (i) empirically justified and (ii) seem to tell us something useful. (They are highly filtered by my experience and taste.) In each case I've distilled the key idea down to 1-2 sentences and provided a link to the original source. I've also included open questions for which I am not aware of conclusive evidence.
I'm hoping that by doing this, I'll make some sort of progress towards "prosaic interpretability" (final name pending). In the event that I don't, having an encyclopedia like this seems useful regardless.
I'll broadly split the frames into representational and functional frames. Representational frames look 'inside' the model, at its subcomponents, in order to make claims about what the model is doing. Functional frames look 'outside' the model, at its relationships with other entities (e.g. the data distribution, learning objectives, etc.), in order to make claims about the model.
---
This is intended to be a living document; I will update this in the future as I gather more frames. I strongly welcome all suggestions that could expand the list here!
Things we're interested in understanding
Obviously it'd be nice to understand "language model behaviour" generally, but we seem far away from this. Specific things might be more tractable to understand in isolation.
- Refusal
- In-context learning
- Reasoning (e.g. through chain of thought)
- Memorization (i.e. factual recall)
- Models' assumed persona / identity
- Self-awareness (a.k.a. situational awareness, introspection)
Representational Frames
---
- (TODO think of some open questions which would directly indicate good frames)
Functional Frames
Frames
- Language model responses can be classified into different levels of abstraction: knee-jerk responses, persona simulations, and general world simulations.
- Language models represent 'personas' in ways that make 'anti-personas' more likely to emerge, conditional on eliciting a specific persona
- Language model personas might yield useful information for determining other properties such as truthfulness.
- Language models must simulate the generative process of the world in order to predict the next token, and this could involve solving very hard subproblems.
- Language models mostly 'know what they know', i.e. can give calibrated estimates of their ability to answer questions.
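(As a toy illustration of what 'calibrated' means here, this is a minimal sketch, with hypothetical data, of computing expected calibration error (ECE) from a model's stated confidences and whether its answers were correct. The function name and the example numbers are my own, not from any of the linked papers.)

```python
# Toy sketch: expected calibration error (ECE).
# A model "knows what it knows" if, within each confidence bucket,
# its average stated confidence matches its empirical accuracy.

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average |confidence - accuracy| gap across bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Bucket confidence into [0, 1) bins; clamp conf == 1.0 into the top bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        ece += (len(b) / total) * abs(avg_conf - accuracy)
    return ece

# Hypothetical, perfectly calibrated data: 80% confidence, 80% correct.
confs = [0.8] * 10
right = [True] * 8 + [False] * 2
print(round(expected_calibration_error(confs, right), 3))  # prints 0.0
```

Low ECE is what the "calibrated estimates" claim amounts to; a model that is confidently wrong (or under-confidently right) would score high.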
- Language models are capable of 'introspection', i.e. can predict things about themselves that more capable models cannot, suggesting they have access to 'privileged information' about themselves.
- Language models are capable of 'out-of-context reasoning', i.e. can piece together many different facts they have been trained on in order to make inferences, a.k.a. 'connecting the dots'.
- Language models are capable of 'implicit meta-learning', i.e. can identify statistical markers of truth vs. falsehood, and update more strongly on information bearing those markers of truthfulness.
- Language models are capable of 'strategic goal preservation', i.e. can alter their responses during training time to prevent their goals from being changed via fine-tuning.
- Language models are capable of 'sandbagging', i.e. strategically underperforming on evaluations in order to avoid detection / oversight.
- Transformers are susceptible to jailbreaks because harmful and harmless prompts are easily distinguishable in the first few tokens; data augmentation solves the problem.
- (TODO: look at the papers on ICL)
- (TODO: look at papers on grokking)
---
- Do language models 'do better' when using their own reasoning traces, as opposed to the reasoning traces of other models? I explore this question more here
Changelog
2 Jan: Initial post