Joseph Bloom — LessWrong

I run the White Box Evaluations Team at the UK AI Security Institute. This is primarily a mechanistic interpretability team focussed on estimating and addressing risks associated with deceptive alignment. I'm a MATS 5.0 and ARENA 1.0 Alumni. Previously, I cofounded the AI Safety Research Infrastructure Org Decode Research and conducted independent research into mechanistic interpretability of decision transformers. I studied computational biology and statistics at the University of Melbourne in Australia.

Thanks for posting this! I've had a lot of conversations with people lately about OthelloGPT and I think it's been useful for creating consensus about what we expect sparse autoencoders to recover in language models.

Maybe I missed it but:

What is the performance of the model when the SAE output is used in place of the activations?
What is the L0? You say 12% of features active so I assume that means 122 features are active.This seems plausibly like it could be too dense (though it's hard to say, I don't have strong intuitions here). It would be preferable to have a sweep where you have varying L0's, but similar explained variance. The sparsity is important since that's where the interpretability is coming from. One thing worth plotting might be the feature activation density of your SAE features as compares to the feature activation density of the probes (on a feature density histogram). I predict you will have features that are too sparse to match your probe directions 1:1 (apologies if you address this and I missed this).
In particular, can you point to predictions (maybe in the early game) where your model is effectively perfect and where it is also perfect with the SAE output in place of the activations at some layer? I think this is important to quantify as I don't think we have a good understanding of the relationship between explained variance of the SAE and model performance and so it's not clear what counts as a "good enough" SAE.

I think a number of people expected SAEs trained on OthelloGPT to recover directions which aligned with the mine/their probe directions, though my personal opinion was that besides "this square is a legal move", it isn't clear that we should expect features to act as classifiers over the board state in the same way that probes do.

This reflects several intuitions:

At a high level, you don't get to pick the ontology. SAEs are exciting because they are unsupervised and can show us results we didn't expect. On simple toy models, they do recover true features, and with those maybe we know the "true ontology" on some level. I think it's a stretch to extend the same reasoning to OthelloGPT just because information salient to us is linearly probe-able.
Just because information is linearly probeable, doesn't mean it should be recovered by sparse autoencoders. To expect this, we'd have to have stronger priors over the underlying algorithm used by OthelloGPT. Sure, it must us representations which enable it to make predictions up to the quality it predicts, but there's likely a large space of concepts it could represent. For example, information could be represented by the model in a local or semi-local code or deep in superposition. Since the SAE is trying to detect representations in the model, our beliefs about the underlying algorithm should inform our expectations of what it should recover, and since we don't have a good description of the circuits in OthelloGPT, we should be more uncertain about what the SAE should find.
Separately, it's clear that sparse autoencoders should be biased toward local codes over semi-local / compositional codes due to the L1 sparsity penalty on activations. This means that even if we were sure that the model represented information in a particular way, it seems likely the SAE would create representations for variables like (A and B) and (A and B') in place of A even if the model represents A. However, the exciting thing about this intuition is it makes a very testable prediction about combinations of features likely combining to be effective classifiers over the board state. I'd be very excited to see an attempt to train neuron-in-a-haystack style sparse probes over SAE features in OthelloGPT for this reason.

Some other feedback:

Positive: I think this post was really well written and while I haven't read it in more detail, I'm a huge fan of how much detail you provided and think this is great.
Positive: I think this is a great candidate for study and I'm very interested in getting "gold-standard" results on SAEs for OthelloGPT. When Andy and I trained them, we found they could train in about 10 minutes making them a plausible candidate for regular / consistent methods benchmarking. Fast iteration is valuable.
Negative: I found your bolded claims in the introduction jarring. In particular "This demonstrates that current techniques for sparse autoencoders may fail to find a large majority of the interesting, interpretable features in a language model". I think this is overclaiming in the sense that OthelloGPT is not toy-enough, nor do we understand it well enough to know that SAEs have failed here, so much as they aren't recovering what you expect. Moreover, I think it would best to hold-off on proposing solutions here (in the sense that trying to map directly from your results to the viability of the technique encourages us to think about arguments for / against SAEs rather than asking, what do SAEs actually recover, how do neural networks actually work and what's the relationship between the two).
Negative: I'm quite concerned that tieing the encoder / decoder weights and not having a decoder output bias results in worse SAEs. I've found the decoder bias initialization to have a big effect on performance (sometimes) and so by extension whether or not it's there seems likely to matter. Would be interested to see you follow up on this.

Oh, and maybe you saw this already but an academic group put out this related work: https://arxiv.org/abs/2402.12201 I don't think they quantify the proportion of probe directions they recover, but they do indicate recovery of all types of features that been previously probed for. Likely worth a read if you haven't seen it.

Cool paper. I think the semantic similarity result is particularly interesting.

As I understand it you've got a circuit that wants to calculate something like Sim(A,B), where A and B might have many "senses" aka: features but the Sim might not be a linear function of each of thes Sims across all senses/features.

So for example, there are senses in which "Berkeley" and "California" are geographically related, and there might be a few other senses in which they are semantically related but probably none that really matter for copy suppression. For this reason wouldn't expect the tokens of each of to have cosine similarity that is predictive of the copy suppression score. This would only happen for really "mono-semantic tokens" that have only one sense (maybe you could test that).

Moreover, there are also tokens which you might want to ignore when doing copy suppression (speculatively). Eg: very common words or punctuations (the/and/etc).

I'd be interested if you have use something like SAE's to decompose the tokens into the underlying feature/s present at different intensities in each of these tokens (or the activations prior to the key/query projections). Follow up experiments could attempt to determine whether copy suppression could be better understood when the semantic subspaces are known. Some things that might be cool here:
- Show that some features are mapped to the null space of keys/queries in copy suppression heads indicating semantic senses / features that are ignored by copy suppression. Maybe multiple anti-induction heads compose (within or between layers) so that if one maps a feature to the null space, another doesn't (or some linear combination) or via a more complicated function of sets of features being used to inform suppression.
- Similarly, show that the OV circuit is suppressing the same features/features you think are being used to determine semantic similarity. If there's some asymmetry here, that could be interesting as it would correspond to "I calculate A and B as similar by their similarity in the *california axis* but I suppress predictions of any token that has the feature for anywhere on the West Coast*).

I'm particularly excited about this because it might represent a really good way to show how knowing features informs the quality of mechanistic explanations.

I don't remember signed GSEA being standard when I used it so no particular reason. Could try it.

I suspect this is a good experiment. It may help!

Interesting - I agree I don't have strong evidence here (and certainly we haven't it). It sounds like maybe it would be worth my attempting to prove this point with further investigation. This is an update for me - I didn't think this interpretation would be contentious!

than that it would flag text containing very OOD examples of covert honesty.

I think you mean covert "dishonesty"?

That is, in the prompted setting, I think that it will be pretty hard to tell whether your probe is working by detecting the presence of instructions about sandbagging vs. working by detecting the presence of cognition about sandbagging.

I think I agree but think this is very subtle so want to go into more depth here:

I should note that the term "cognition about sandbagging" sounds like a broader thing than what I think we're detecting - it might be worth clarifying.
I think we are likely detecting part of the cognition to do with intentionally saying wrong things. Since the probe is trained off-policy on a simple deception like dataset (and we're able to play with it a bunch and characterise what we think it's detecting), I feel fairly confident that the probe works by capturing a pseudo-lying or wrongness property. Maybe it tracks "how relevant is the heuristic that the next token should be wrong relevant to this token" (but it's a little bit in superposition with "this token is wrong or part of something wrong").
However, I think the presence of instructions around sandbagging has a downstream effect on the thing we're detecting (and so in some sense we are measuring both). Apollo found that all scores (regardless of label) where elevated when they probed on the conditional sandbagging dataset from Benton et al. which I think does suggest the presence of instructions about sandbagging will elevate scores. But I'd argue that the presence of instructions is causing this heuristic to be more relevant, thus elevating the scores.

I think this is a valuable read for people who work in interp but feel like I want to add a few ideas:

Distinguishing Misrepresentation from Mismeasurement: Interpretability researchers use techniques that find vectors which we say correspond to the representations of the model, but the methods we use to find those may be imperfect. For example, if your cat SAE feature also lights up on racoons, then maybe this is a true property of the model's cat detector (that is also lights up on racoons) or maybe this is an artefact of the SAE loss function. Maybe the true cat detector doesn't get fooled by racoons, but your SAE latent is biased in some way. See this paper that I supervised for more concrete observations.
What are the canonical units? It may be that there is a real sense in which the model has a cat detector but maybe at the layer at which you tried to detect it, the cat detector is imperfect. If the model doesn't function as if it has an imperfect cat detector then maybe downstream of the cat-detector is some circuitry for catching/correcting specific errors. This means that finding the local cat detector you've found which might have misrepresentation issues isn't in itself sufficient to argue that the model as a whole has those issues. Selection pressures apply to the network as a whole and not necessarily always to the components. The fact that we see so much modularity is probably not random (John's written about this) but if I'm not mistaken, we don't have strong reasons to believe that the thing that looks like a cat detector must be the model's one true cat detector.

I'd be excited for some empirical work following up on this. One idea might be to train toy models which are incentivised to contain imperfect detectors (eg; there is a noisy signal but reward is optimised by having a bias toward recall or precision in some of the intermediate inferences). Identifying intermediate representations in such models could be interesting.

I think I like a lot of the thinking in the post. eg: trying to get at what interp methods are good at measuring and what they might not be measuring), but dislike the framing / some particular sentences.

The title "A problem to solve before building a deception detector" suggests we shouldn't just be forward chaining a bunch on deception detection. I don't think you've really convinced me of this with the post. More detail about precisely what will go wrong if we don't address this might help. (apologies if I missed this on my quick read).
"We expect that we would still not be able to build a reliable deception detector even if we had a lot more interpretability results available." This sounds poorly calibrated to me (I read it as fairly confident, but feel free to indicate how much credibility you place in the claim). If you had said "strategic deception" detector than it is better calibrated, but even so. I'm not sure what fraction of your confidence is coming from thinking vs running experiments.
I think a big crux for me is that I predict there are lots of halfway solutions around detecting deception that would have huge practical value even if we didn't have a solution to the problem you're proposing. Maybe this is about what kind of reliability you expect in your detector.

Good resource: https://dynalist.io/d/n2ZWtnoYHrU1s4vnFSAQ519J <- Neel Nanda's glossary.

> What is a feature?

Often gets confused because early literature doesn't distinguish well between property of the input represented by a model and the internal representation. We tend to refer to the former as a feature and the latter as a latent these days. Eg: "Not all Language Model Features are Linear" => not all the representations are linear (and is not a statement about what gets represented).

> Are there different circuits that appear in a network based on your definition of what a relevant feature is?

This question seems potentially confusing. If you use different methods (eg: supervised vs unsupervised) you are likely to find different results. Eg: In a paper I supervised here https://arxiv.org/html/2409.14507v2 we looked at how SAEs compared to Linear probes. This was a comparison of methods for finding representations. I don't know of any work doing circuit finding with multiple feature finding methods though (but I'd be excited about it).

> How crisp are these circuits that appear, both in toy examples and in the wild?

Read ACDC. https://arxiv.org/abs/2304.14997 . Generally, not crisp.

> What are the best examples of "circuits in the wild" that are actually robust?

The ARENA curriculum probably covers a few. there might be some papers comparing circuit finding methods that use a standard set of circuits you could find.

> If I have a tiny network trained on an algorithmic task, is there an automated search method I can use to identify relevant subgraphs of the neural network doing meaningful computation in a way that the circuits are distinct from each other?

Interesting question. See Neel's thoughts here: https://www.neelnanda.io/mechanistic-interpretability/othello#finding-modular-circuits

> Does this depend on training?

Probably yes. Probably also on how the different tasks relate to each other (whether they have shareable intermediate results).

> (Is there a way to classify all circuits in a network (or >10% of them) exhaustively in a potentially computationally intractable manner?)

I don't know if circuits are a good enough description of reality for this to be feasible. But you might find this interesting https://arxiv.org/abs/2501.14926

LESSWRONG
LW

LESSWRONG
LW

Posts

Wikitag Contributions

Comments