Epistemic status: Old news and well-known, but I find it hard to point at a single post that encapsulates my intuitions on this, so I write them down here.
One question that comes up sometimes in interpretability work is: “why do I trust simple linear probes more than complex non-linear ones?” (Even though I don’t particularly trust either all that much.)
So the main claim I will argue is that probe complexity changes what a positive probing result means.
Or, said in more detail: A probe does not just show what is in a model. It also frames what counts as an accessible representation. For more expressive probes, a positive result becomes weaker evidence about the original model, and stronger evidence about the probe (which is a much less interesting question).
The rough argument
If you’re trying to test whether a model represents some concept, and to see how it might represent that concept, one natural starting point is to use a proxy: “can I extract the state of the model with respect to this concept from its activations using some kind of probe?”
However, depending on what probe you use, you may not be asking the question you think you are asking.
The question “does the model represent concept X, and how?” is subtly different from “can I deduce concept X from the model’s activations using this probe?”.
If you use a linear probe, the probe asks whether concept X is already present in the activations in a way that a very simple linear readout can access. (This might still mean you are reading some spurious correlate, but that is a separate question.)
If you use a non-linear probe, it asks something like: “given that the model has already computed some set of features from the input into its activations, is there any combination of those features from which concept X can be computed?”
So if you care specifically what the model is representing, and the model represents things linearly, then a linear probe may give slightly more faithful answers.
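To make the contrast concrete, here is a minimal synthetic sketch (all data here is fabricated; the “activations” just stand in for a model’s hidden states). The model-internal features are two binary variables written along random directions, and the probed concept is their XOR, which the model never computes anywhere:

```python
# Sketch: the same probing question asked with two different probes.
# The activations contain two binary features along random directions,
# but NOT their XOR. We probe for the XOR.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d = 4000, 32
f1 = rng.integers(0, 2, n)
f2 = rng.integers(0, 2, n)
dirs = rng.normal(size=(2, d))  # the two feature directions
acts = f1[:, None] * dirs[0] + f2[:, None] * dirs[1] + 0.1 * rng.normal(size=(n, d))
concept = f1 ^ f2  # not linearly readable from the two feature directions

X_tr, X_te, y_tr, y_te = train_test_split(acts, concept, random_state=0)

linear_probe = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
mlp_probe = MLPClassifier(hidden_layer_sizes=(64,), max_iter=2000,
                          random_state=0).fit(X_tr, y_tr)

print(f"linear probe accuracy: {linear_probe.score(X_te, y_te):.2f}")
print(f"MLP probe accuracy:    {mlp_probe.score(X_te, y_te):.2f}")
# expect a clear gap in favour of the MLP
```

Both probes see the same activations; the MLP succeeds because it can *compute* the XOR from the two feature directions, not because the model represents it.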
Practical Fictional Example
This is becoming a bit abstract, so let me give an illustrative example.
Suppose we have a model which needs to read in some pixel values, extract the numbers from those pixel values, do some math, then output a new image with the answer to that math. See illustration below.
[Diagram: an image-based calculator, and what it might be representing inside the black box.]
If we want to test hypotheses about the model, we can try some linear probes on it to see how well it represents the number we are interested in. We might be able to deduce at which layer the model has done the calculations it needs to do, and infer some things about how it works.
It might be the case that a linear probe works for this. It might not. This is a pretty contrived example, but we can also throw an MLP probe at the same model on all the layers, and see if it does any better.
[Fictional example: a linear probe gets good performance only after the representation is computed in the model, while an MLP probe gets good performance even before the representation is computed.]
One thing you might find is that the MLP probe probably outperforms the linear probe. And you can get more complex still.
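The layer sweep described above can be sketched as follows. The three “layers” are fabricated stand-ins for cached activations: one where the concept is absent, one where it is present only non-linearly (encoded in a magnitude, with a random sign), and one where it is written along a single direction. A real experiment would cache per-layer activations from the model instead:

```python
# Synthetic layer sweep: fit a linear and an MLP probe at each "layer"
# and compare held-out accuracy.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
n, d = 3000, 16
y = rng.integers(0, 2, n)                  # the concept we probe for
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
sign = rng.choice([-1.0, 1.0], size=n)     # random sign scrambles a linear readout

layers = {
    # "layer 0": concept not yet computed -- pure noise
    0: rng.normal(size=(n, d)),
    # "layer 1": concept present only in the magnitude along `direction`
    # (sign is random), so an MLP can read it but a linear probe cannot
    1: ((1.0 + y) * sign)[:, None] * direction + 0.05 * rng.normal(size=(n, d)),
    # "layer 2": concept written linearly along `direction`
    2: (2.0 * y - 1.0)[:, None] * direction + 0.05 * rng.normal(size=(n, d)),
}

split = n // 2
accs = {}
for layer, acts in layers.items():
    lin = LogisticRegression(max_iter=2000).fit(acts[:split], y[:split])
    mlp = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000,
                        random_state=0).fit(acts[:split], y[:split])
    accs[layer] = (lin.score(acts[split:], y[split:]),
                   mlp.score(acts[split:], y[split:]))
    print(f"layer {layer}: linear={accs[layer][0]:.2f}  mlp={accs[layer][1]:.2f}")
```

The pattern to look for is the figure’s: the MLP probe “turns on” a layer before the linear probe does, and only at the last layer do the two agree.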
If we care only about the answer in an information-theoretic sense, then sure, there probably exists enough information in the model activations to be able to deduce whatever property you are interested in.
But if you care what the model is representing, and the model predominantly represents things linearly, then a linear probe may give slightly more faithful answers.
Real-world example
This is a generally well-known issue, and there is a real-world example we can point to: the Actually, Othello-GPT Has A Linear Emergent World Representation post by Neel Nanda.
Othello is a game where players place white or black pieces on alternating turns. OthelloGPT is a model trained such that, given a [sequence of moves], it outputs [valid next moves]. Researchers studied whether they could extract the board state at a specific turn from the model.
The original Othello work found that board state, given by (empty, black, white) at each position, could be recovered with non-linear probes, and that interventions on the recovered representation affected behavior. However, they did not manage to get linear probes working for this.
However, Neel then found that if you probe for board state in a way different to (empty, black, white) and more native to the way the model is thinking, one actually could deduce the board state linearly. (I leave the subtle difference to the footnotes1 should you want to guess.)
This is because linear probes can only read out simple directions, while an MLP can combine latent features into new concepts. Moving from linear probes to non-linear probes, you move from “the model represents X” to “the probe can compute X from some features the model represents.”
Though the non-linear probes did find features correlated with concept X, they were only needed because the model was not representing concept X in the exact form the researchers had guessed.
Closing
Not all experiments are testing exactly “does the model represent this concept.” There is value in other interpretability methods that are not necessarily mechanistic and are less rigorous.
If you have a correctly-shaped problem and enough compute, you can probably cheaply try both linear probes and non-linear probes, but do note that the probes are probably not quite testing what you think they might be testing.
Even with a supposedly more faithful probe, it’s quite likely that the probe is learning some correlate of your metric rather than what you really care about. A non-linear probe is an additional confounding factor to consider, and very large linear probes are more “risky” than smaller ones.
Once you have a probe that does seem to work, you can do more tests to hone in on what the model is actually representing. You may consider:
trying causal interventions to see how behavior changes
training your probe on a random initialization of the original model as a control (the probes might still be successful!)
testing your probe on a dataset different from your training distribution
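The latter two checks can be sketched cheaply on synthetic data. In a real setting, control (a) would use a randomly initialized copy of the model, and note that such probes can still succeed, since random features preserve a lot of input information:

```python
# Two cheap controls for a probe that "works" (synthetic data):
# (a) probe activations that never saw the task,
# (b) probe the real activations with labels decoupled from them.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d = 2000, 32
y = rng.integers(0, 2, n)
# "real" activations: the concept is written along a fixed direction
real_acts = (2.0 * y - 1.0)[:, None] * rng.normal(size=d) \
            + 0.3 * rng.normal(size=(n, d))
control_acts = rng.normal(size=(n, d))   # control (a)
shuffled_y = rng.permutation(y)          # control (b)

split = n // 2
def probe_acc(X, t):
    clf = LogisticRegression(max_iter=2000).fit(X[:split], t[:split])
    return clf.score(X[split:], t[split:])

real_acc = probe_acc(real_acts, y)
control_acc = probe_acc(control_acts, y)
shuffled_acc = probe_acc(real_acts, shuffled_y)
print(real_acc, control_acc, shuffled_acc)  # expect: high, ~chance, ~chance
```

If either control probe matches the real probe’s held-out accuracy, the original positive result tells you very little about the model.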
Remain skeptical, because so many interp experiments end up interpreting noise as signal.
1
You can also work through the ARENA exercises here: https://learn.arena.education/chapter1_transformer_interp/33_othellogpt/intro but if you don’t want to do that, the TL;DR is that the model doesn’t seem to represent (empty, black, white) directly; instead it seems to represent (empty, my-piece, their-piece), which changes depending on which player’s turn it is.
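The relabelling in the footnote can be sketched in a couple of functions (the function and variable names are mine, not from either paper):

```python
# Convert absolute colours into the turn-relative frame the model seems to use.
def to_relative(board, player_to_move):
    """Map (empty, black, white) squares to (empty, mine, theirs)."""
    return [
        sq if sq == "empty" else ("mine" if sq == player_to_move else "theirs")
        for sq in board
    ]

# Black moves first in Othello, so after t moves it is black's turn iff t is even.
def player_at(t):
    return "black" if t % 2 == 0 else "white"

print(to_relative(["empty", "black", "white"], player_at(1)))
# -> ['empty', 'theirs', 'mine']  (after one move, white is to move)
```

The same square flips between “mine” and “theirs” every turn, which is exactly why a probe for the static (black, white) labels struggles while the turn-relative one is linear.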