What's the sum total of everything we know about language models? At the object level, probably way too much for any one person (not named Gwern) to understand.
However, it might be possible to abstract most of our knowledge into pithily worded frames (i.e. intuitions, ideas, theories) that are much more tractable to grok. And once we have all this information neatly written down in one place, unexpected connections may start to pop up.
This post contains a collection of frames about models that are (i) empirically justified and (ii) seem to tell us something useful. (They are highly filtered by my experience and taste.) In each case I've distilled the key idea down to 1-2 sentences and provided a link to the original source. I've also included open questions for which I am not aware of conclusive evidence.
I'm hoping that by doing this, I'll make some sort of progress towards "prosaic interpretability" (final name pending). In the event that I don't, having an encyclopedia like this seems useful regardless.
I'll broadly split the frames into representational and functional frames. Representational frames look 'inside' the model, at its subcomponents, in order to make claims about what the model is doing. Functional frames look 'outside' the model, at its relationships with other entities (e.g. the data distribution, learning objectives, etc.), in order to make claims about the model.
---
This is intended to be a living document; I will update this in the future as I gather more frames. I strongly welcome all suggestions that could expand the list here!
Things we're interested in understanding
Obviously it'd be nice to understand "language model behaviour" generally, but we seem far away from this. Specific things might be more tractable to understand in isolation.
- Refusal
- In-context learning
- Reasoning (e.g. through chain of thought)
- Memorization (i.e. factual recall)
- Models' assumed persona / identity
- Self-awareness (a.k.a. situational awareness, introspection)
Representational Frames
---
- (TODO think of some open questions which would directly indicate good frames)
Functional Frames
Frames
- Language model responses can be classified into different levels of abstraction: knee-jerk responses, persona simulations, and general world simulations.
- Language models represent 'personas' in ways that make 'anti-personas' more likely to emerge, conditional on eliciting a specific persona
- Language model personas might yield useful information for determining other properties such as truthfulness.
- Language models must simulate the generative process of the world in order to predict the next token, and this could involve solving very hard subproblems.
- Language models mostly 'know what they know', i.e. can give calibrated estimates of their ability to answer questions.
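(As a toy illustration of what 'calibrated' means here, this is a minimal sketch, with hypothetical data, of computing expected calibration error (ECE) from a model's stated confidences and whether its answers were correct. The function name and the example numbers are my own, not from any of the linked papers.)

```python
# Toy sketch: expected calibration error (ECE).
# A model "knows what it knows" if, within each confidence bucket,
# its average stated confidence matches its empirical accuracy.

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average |confidence - accuracy| gap across bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Bucket confidence into [0, 1) bins; clamp conf == 1.0 into the top bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        ece += (len(b) / total) * abs(avg_conf - accuracy)
    return ece

# Hypothetical, perfectly calibrated data: 80% confidence, 80% correct.
confs = [0.8] * 10
right = [True] * 8 + [False] * 2
print(round(expected_calibration_error(confs, right), 3))  # prints 0.0
```

Low ECE is what the "calibrated estimates" claim amounts to; a model that is confidently wrong (or under-confidently right) would score high.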
- Language models are capable of 'introspection', i.e. can predict things about themselves that more capable models cannot, suggesting they have access to 'privileged information' about themselves.
- Language models are capable of 'out-of-context reasoning', i.e. can piece together many different facts they have been trained on in order to make inferences, a.k.a. 'connecting the dots'.
- Language models are capable of 'implicit meta-learning', i.e. can identify statistical markers of truth vs. falsehood, and update more strongly on information bearing those markers of truthfulness.
- Language models are capable of 'strategic goal preservation', i.e. can alter their responses during training time to prevent their goals from being changed via fine-tuning.
- Language models are capable of 'sandbagging', i.e. strategically underperforming on evaluations in order to avoid detection / oversight.
- Transformers are susceptible to jailbreaks because harmful and harmless prompts are easily distinguishable in the first few tokens; data augmentation solves the problem.
- (TODO: look at the papers on ICL)
- (TODO: look at papers on grokking)
---
- Do language models 'do better' when using their own reasoning traces, as opposed to the reasoning traces of other models? I explore this question more here
Changelog
2 Jan: Initial post