I agree that the L0s for 0_8192 are too high in later layers, though I'll note that I think this is mainly due to the cluster of high-frequency features (see the spike in the histogram). Features outside of this spike look pretty decent, and without the spike our L0s would be much more reasonable.
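(For concreteness, a minimal sketch of how the per-feature frequencies and L0 can be computed, with and without the high-frequency cluster; `feature_acts` is an illustrative stand-in for the SAE's feature activations on a large sample of tokens, not our actual code.)

```python
import torch

def frequency_and_l0(feature_acts, freq_cutoff=None):
    """feature_acts: (n_tokens, n_features) SAE feature activations on a token sample."""
    active = feature_acts > 0                        # which features fire on which tokens
    freqs = active.float().mean(dim=0)               # fraction of tokens each feature fires on
    l0 = active.float().sum(dim=1).mean()            # mean number of active features per token
    if freq_cutoff is not None:
        kept = freqs < freq_cutoff                   # drop the high-frequency spike
        l0_without_spike = active[:, kept].float().sum(dim=1).mean()
        return freqs, l0, l0_without_spike
    return freqs, l0
```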
Here are four random features from layer 3, at a range of frequencies outside of the spike.
Layer 3, 0_8192, feature 138 (frequency = 0.003) activates on the newline at the end of the "field of the invention" section in patent applications. I think it's very likely predicting that the next few tokens will be "2. Description of the Related Art" (which always comes next in patents).
Layer 3, 0_8192, feature 27 (frequency = 0.009) seems to activate on the "is" in the phrase "this is".
Layer 3, 0_8192, feature 4 (frequency = 0.026) looks messy at first, but on closer inspection seems to activate on the final token of multi-token words in informative file/variable names.
Layer 3, 0_8192, feature 56 (frequency = 0.035) looks very polysemantic: it's activating on certain terms in LaTeX expressions, words in between periods in urls and code, and some other random-looking stuff.
I agree with everything you wrote here and in the sibling comment: there are reasonable hopes for bootstrapping alignment as agents grow smarter; but without a concrete bootstrapping proposal with an accompanying argument, <1% P(doom) from failing to bootstrap alignment doesn't seem right to me.
I'm guessing this is my biggest crux with the Quintin/Nora worldview, so if Quintin/Nora have an argument for optimism about bootstrapping beyond "it feels like this should work because of iterative design," I'm bidding for that argument to make it into the forthcoming document.
Another metric is: comparing the similarity between two dictionaries using mean max cosine similarity (where one of the dictionaries is treated as the ground truth), we've found that two dictionaries trained from different random seeds on the same (non-randomized) model are highly similar (>.95), whereas dictionaries trained on a randomized model and a non-randomized model are dissimilar (<.3 IIRC, but I don't have the data on hand).
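(For reference, a minimal sketch of mean max cosine similarity as described here: for each feature in the "ground truth" dictionary, take its maximum cosine similarity with any feature in the other dictionary, then average over features. Variable names are illustrative.)

```python
import torch
import torch.nn.functional as F

def mean_max_cosine_similarity(ground_truth, other):
    """ground_truth, other: (n_features, d_model) decoder directions of two dictionaries."""
    gt = F.normalize(ground_truth, dim=-1)
    ot = F.normalize(other, dim=-1)
    cos = gt @ ot.T                              # (n_gt, n_other) pairwise cosine similarities
    return cos.max(dim=1).values.mean().item()   # best match per ground-truth feature, averaged
```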
The way I would phrase this concern is "SAEs might learn to pick up on structure present in the underlying data, rather than to pick up on learned structure in NN activations." E.g. since "tree" is a class of things defined by a bunch of correlations present in the underlying image data, it's possible that images of trees will naturally cluster in NN activations even when the NN has no underlying tree concept; SAEs would still be able to detect and learn this cluster as one of their neurons.
I agree this is a valid critique. Here's one empirical test which partially gets at it: what happens when you train an SAE on a NN with random weights? (I.e. you randomize the parameters of your NN, and then train an SAE on its activations on real data in the normal way.) Then to the extent that your SAE has good-looking features, that must be because your SAE was picking up on structure in the underlying data.
My collaborators and I did this experiment. In more detail, we trained SAEs on Pythia-70m's MLPs, then did this again but after randomizing the weights of Pythia-70m. Take a moment to predict the results if you want etc etc.
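(In case the setup is unclear, here's a sketch of the randomized condition; the SAE training loop itself is omitted, the layer choice is arbitrary, and the module paths are what I recall for the HF GPT-NeoX implementation, so treat the details as illustrative.)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")

# Randomize all parameters (the "random network" condition); the control
# condition skips this step and keeps the pretrained weights.
with torch.no_grad():
    for p in model.parameters():
        p.normal_(mean=0.0, std=0.02)

# Collect MLP activations on real data via a forward hook, then train an SAE
# on them in the usual way (training loop not shown).
acts = []
layer = 3  # illustrative choice
hook = model.gpt_neox.layers[layer].mlp.register_forward_hook(
    lambda module, inputs, output: acts.append(output.detach())
)
tokens = tokenizer("The quick brown fox jumps over the lazy dog", return_tensors="pt")
with torch.no_grad():
    model(**tokens)
hook.remove()
```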
The SAEs that we trained on a random network looked bad. The most interesting dictionary features we found were features that activated on particular tokens (e.g. features that activated on the "man" token and no others). Most of the features didn't look like anything at all, activating on a large fraction (>10%) of tokens in our data, with no obvious patterns. (The features for dictionaries trained on the non-random network looked much better.)
We also did a variant of this experiment where we randomized Pythia-70m's parameters except for the embedding layer. In this variant, the most interesting features we found were features which fired on a few closely semantically related tokens (e.g. the tokens "make," "makes," and "making").
Thanks to my collaborators for this experiment: Aaron Mueller and David Bau.
I agree that a reasonable intuition for what SAEs do is: identify "basic clusters" in NN activations (basic in the sense that you allow compositionality, i.e. you don't try to learn clusters whose centroids are the sums of the centroids of previously-learned clusters). And these clusters might exist because:
Beyond the preliminary empirics I mentioned above, I think there are some theoretical reasons to hope that SAEs will mostly learn the first type of cluster:
Maybe looking at the connections of your classifier (what earlier features it connects to and what these connect to) and applying selection to the classifier based on the connections will be good. This can totally be applied to probes. (Maybe there is some reason why looking at connections will be especially good for features but not probes, but if so, why?)
"Can this be applied to probes" is a crux for me. It sounds like you're imagining something like:
Is that right?
This is not an option I had considered, and it would be very exciting to me if it worked. I have some vague intuition that this should all go better when you are working with features (e.g. because the causal dependencies among the features should be sparse), but I would definitely need to think about that position more.
Let me try to state something which captures most of that approach to make sure I understand:
Everything you wrote describing the hope looks right to me.
It's worth noting that I can't imagine this resulting in very ambitious applications, though the reduction in doom could still be substantial.
To be clear, what does "ambitious" mean here? Does it mean "producing a large degree of understanding?"
If we don't understand much of the training compute then there will be decompositions which look to us like a good enough decomposition while hiding arbitrary stuff in the residual between our understanding and what's going on.
[...]
If we want to look at connections, then imperfect understanding will probably bite pretty hard particularly as the effect size of the connection gets smaller and smaller (either due to path length >1 or just there being many things which are directly connected but have a small effect).
These seem like important intuitions, but I'm not sure I understand or share them. Suppose I identify a sentiment feature. I agree there's a lot of room for variation in what precise notion of sentiment the model is using, and there are lots of different ways this sentiment feature could be interacting with the network that are difficult to understand. But maybe I don't really care about that, I just want a classifier for something which is close enough to my internal notion of sentiment.
Just so with truth: there's probably lots of different subtly different notions of truth, but for the application of "detecting whether my AI believes statement X to be true" I don't care about that. I do care about the difference between "true" and "humans think is true," but that's a big difference that I can understand (even if I can't produce examples), and where I can articulate the sorts of cognition which probably should/shouldn't be involved in it.
What's the specific way you imagine this failing? Some options:
Maybe a better question would be - why didn't these issues (lack of robust explanation) get in the way of the Steinhardt paper I linked? They were in fact able to execute something like the plan I sketch here: use vague understanding to guess which model components attend to features which are spuriously correlated with the thing you want, then use the rest of the model as an improved classifier for the thing you want.
What follows is a note I wrote responding to the AI Optimists essay, explaining where I agree and disagree. I was thinking about posting this somewhere, so I figure I'll leave it in the comments here. (So to be clear, it's responding to the AI Optimists essay, not responding to Steven's post.)
Places I think AI Optimists and I agree:
Places I think we disagree, but I’m not certain. The authors of the Optimists article promise a forthcoming document which addresses pessimistic arguments, and these bullet points are something like “points I would like to see addressed in this document.”
A specific critique about the article:
Thanks for having this dialogue -- I'm very happy to see clearer articulation of the Buck/Ryan views on theories of impact for MI work!
The part that I found most useful was Ryan's bullet points for "Hopes (as I see them) for mech interp being useful without explaining 99%". I would guess that most MI researchers don't actually see their theories of impact as relying on explaining ~all of model performance (even though they sometimes get confused/misunderstand the question and say otherwise). So I think the most important cruxes will lie in disagreements about (1) whether Ryan's list is complete, and (2) whether Ryan's concerns about the approaches listed are compelling.
Here's a hope which (I think) isn't on the list. It's somewhat related to the hope that Habryka raised, though a bit different and more specific.
Approach: maybe model internals overtly represent qualities which distinguish desired vs. undesired cognition, but probing is insufficient for some reason (e.g. because we don't have good enough oversight to produce labeled data to train a probe with).
Here's a concrete example (which is also the example I most care about). Our goal is to classify statements as true/false, given access to a model that knows the answer. Suppose our model has distinct features representing "X is true" and "humans believe X." Further suppose that on any labeled dataset we're able to create, these two features are correlated; thus, if we make a labeled dataset of true/false statements and train a probe on it, we can't tell whether the probe will generalize as an "X is true" classifier or a "humans believe X" classifier. However, a coarse-grained mechanistic understanding would help here. E.g., one could identify all of the model features which serve as accurate classifiers on our dataset, and only treat statements as true if all of the features label them as true. Or if we need a lower FPR, one might be able to mechanistically distinguish these features, e.g. by noticing that one feature is causally downstream of features that look related to social reasoning and the other feature isn't.
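(A minimal sketch of the "only treat statements as true if all of the features agree" version of this, assuming we've already identified a set of candidate feature directions that each classify our labeled dataset well; names and shapes are illustrative.)

```python
import torch

def conservative_truth_classifier(acts, feature_dirs, thresholds):
    """
    acts:         (n_statements, d_model) model activations on the statements to classify
    feature_dirs: (n_features, d_model) directions which all classify the labeled data well
                  (some may track "X is true", others "humans believe X")
    thresholds:   (n_features,) per-feature decision thresholds fit on the labeled data
    Returns a bool tensor which is True only where every candidate feature says "true".
    """
    scores = acts @ feature_dirs.T        # (n_statements, n_features)
    votes = scores > thresholds           # each feature's verdict
    return votes.all(dim=1)               # conservative: require unanimous agreement
```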
This is formally similar to what the authors of this paper did. In brief, they were working with the Waterbirds dataset, an image classification task with lots of spuriously correlated features which are not disambiguated by the labeled data. Working with a CLIP ViT, the authors used some ad-hoc technique to get a general sense that certain attention heads dealt with concepts like "texture," "color," and "geolocation." Then they ablated the heads which seemed most likely to attend to confounding features; this resulted in a classifier which generalized in the desired way, without requiring a better-quality labeled dataset.
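(To illustrate the mechanics of that intervention, here's roughly what ablating a single ViT attention head looks like. This is my sketch, not the paper's code, and it assumes the HF CLIP implementation applies `out_proj` to a `(batch, seq, n_heads * head_dim)` tensor with heads laid out contiguously; layer/head indices below are purely illustrative.)

```python
import torch
from transformers import CLIPVisionModel

model = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
n_heads = model.config.num_attention_heads
head_dim = model.config.hidden_size // n_heads

def ablate_head(layer, head):
    """Zero one attention head's contribution by editing the input to out_proj."""
    def pre_hook(module, args):
        (hidden,) = args
        hidden = hidden.clone()
        hidden[..., head * head_dim:(head + 1) * head_dim] = 0.0
        return (hidden,)
    attn = model.vision_model.encoder.layers[layer].self_attn
    return attn.out_proj.register_forward_pre_hook(pre_hook)

# e.g. ablate a head believed to attend to a confounding concept (texture,
# geolocation, ...), then re-evaluate the downstream classifier on Waterbirds:
handle = ablate_head(layer=10, head=3)
# ... run evaluation ...
handle.remove()
```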
Curious for thoughts about/critiques of this impact story.
Without deceptive alignment/agentic AI opposition, a lot of alignment threat models ring hollow. No more adversarial steganography or adversarial pressure on your grading scheme or worst-case analysis or unobservable, nearly unfalsifiable inner homunculi whose goals have to be perfected.
Instead, we enter the realm of tool AI which basically does what you say.
I agree that, conditional on no deceptive alignment, the most pernicious and least tractable sources of doom go away.
However, I disagree that conditional on no deceptive alignment, AI "basically does what you say." Indeed, the majority of my P(doom) comes from the difference between "looks good to human evaluators" and "is actually what the human evaluators wanted." Concretely, this could play out with models which manipulate their users into thinking everything is going well and which tamper with sensors.
I think current observations don't provide much evidence about whether these concerns will pan out: with current models and training set-ups, "looks good to evaluators" almost always coincides with "is what evaluators wanted." I worry that we'll only see this distinction matter once models are smart enough that they could competently deceive their overseers if they were trying (because of the same argument made here). (Forms of sycophancy where models knowingly assert false statements when they expect the user will agree are somewhat relevant, but there are also benign reasons models might do this.)
Here's an experiment I'm about to do:
I predict the resulting dictionary will be "like 1_32768 but a bit worse." Concretely, I'm guessing that means % loss recovered around 72%.
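(For readers unfamiliar with the metric: "% loss recovered" is usually defined along the lines of the formula below, where $L_{\text{orig}}$ is the model's ordinary loss, $L_{\text{SAE}}$ is the loss when the layer's activations are replaced by the SAE's reconstructions, and $L_{\text{ablate}}$ is the loss when they're replaced by an ablation baseline; conventions differ on whether that baseline is zero-ablation or mean-ablation.)

$$\text{loss recovered} = \frac{L_{\text{ablate}} - L_{\text{SAE}}}{L_{\text{ablate}} - L_{\text{orig}}} \times 100\%$$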
Results:
I killed all features of frequency larger than 0.038. This was 2041 features, and resulted in an L0 just below 40. The stats:
MSE Loss: 0.27 (worse than 1_32768)
Percent loss recovered: 77.9% (a little bit better than 1_32768)
I was a bit surprised by this -- it suggests the high-frequency features are disproportionately likely to be useful for reconstructing activations in ways that don't actually matter to the model's computation. (Though then again, maybe this is what we expect for uninterpretable features.)
It also suggests that we might be better off training dictionaries with a too-low L1 penalty and then just pruning away high-frequency features (sort of the dual operation of "train with a high L1 penalty and resample low-frequency features"). I'd be interested for someone to explore if there's a version of this that helps.
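(In case anyone wants to replicate the pruning step: a minimal sketch, assuming an SAE whose encoder is a linear layer followed by a ReLU; attribute names are illustrative, and the frequencies are computed over some large sample of tokens.)

```python
import torch

@torch.no_grad()
def prune_high_frequency_features(sae, feature_acts, max_freq=0.038):
    """
    sae:          a trained sparse autoencoder with an nn.Linear `encoder` (ReLU applied after it)
    feature_acts: (n_tokens, n_features) feature activations on a large token sample
    max_freq:     features firing on more than this fraction of tokens get killed
    """
    freqs = (feature_acts > 0).float().mean(dim=0)
    to_kill = freqs > max_freq
    # "Kill" the high-frequency features by making their pre-activations always
    # negative, so the ReLU keeps them at zero.
    sae.encoder.weight[to_kill] = 0.0
    sae.encoder.bias[to_kill] = -1e9
    return int(to_kill.sum().item())
```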