Sam Marks

Wiki Contributions


Here's an experiment I'm about to do:

  • Remove high-frequency features from 0_8192 layer 3 until it has L0 < 40 (the same L0 as the 1_32768 layer 3 dictionary)
  • Recompute statistics for this modified dictionary.

I predict the resulting dictionary will be "like 1_32768 but a bit worse." Concretely, I'm guessing that means % loss recovered around 72%. 



I killed all features of frequency larger than 0.038. This was 2041 features, and resulted in a L0 just below 40. The stats:

MSE Loss: 0.27 (worse than 1_32768)

Percent loss recovered: 77.9% (a little bit better than 1_32768)

I was a bit surprised by this -- it suggests the high-frequency features are disproportionately likely to be useful for reconstructing activations in ways that don't actually mater to the model's computation. (Though then again, maybe this is what we expect for uninterpretable features.)

It also suggests that we might be better off training dictionaries with a too-low L1 penalty and then just pruning away high-frequency features (sort of the dual operation of "train with a high L1 penalty and resample low-frequency features"). I'd be interested for someone to explore if there's a version of this that helps.

I agree that the L0's for 0_8192 are too high in later layers, though I'll note that I think this is mainly due to the cluster of high-frequency features (see the spike in the histogram). Features outside of this spike look pretty decent, and without the spike our L0s would be much more reasonable. 

Here are four random features from layer 3, at a range of frequencies outside of the spike.

Layer 3, 0_8192, feature 138 (frequency = 0.003) activates on the newline at the end of the "field of the invention" section in patent applications. I think it's very likely predicting that the next few tokens will be "2. Description of the Related Art" (which always comes next in patents).

Layer 3, 0_8192, feature 27 (frequency = 0.009) seems to activate on the "is" in the phrase "this is"

Layer 3, 0_8192, feature 4 (frequency = 0.026) looks messy at first, but on closer inspection seems to activate on the final token of multi-token words in informative file/variable names.

Layer 3, 0_8192, feature 56 (frequency = 0.035) looks very polysemantic: it's activating on certain terms in LaTeX expressions, words in between periods in urls and code, and some other random-looking stuff.

I agree with everything you wrote here and in the sibling comment: there are reasonable hopes for bootstrapping alignment as agents grow smarter; but without a concrete bootstrapping proposal with an accompanying argument, <1% P(doom) from failing to bootstrap alignment doesn't seem right to me.

I'm guessing this is my biggest crux with the Quintin/Nora worldview, so I guess I'm bidding for -- if Quintin/Nora have an argument for optimism about bootstrapping beyond "it feels like this should work because of iterative design" -- for that argument to make it into the forthcoming document.

Another metric is: comparing the similarity between two dictionaries using mean max cosine similarity (where one of the dictionaries is treated as the ground truth), we've found that two dictionaries trained from different random seeds on the same (non-randomized) model are highly similar (>.95), whereas dictionaries trained on a randomized model and an non-randomized model are dissimilar (<.3 IIRC, but I don't have the data on hand).

The way I would phrase this concern is "SAEs might learn to pick up on structure present in the underlying data, rather than to pick up on learned structure in NN activations." E.g. since "tree" is a class of things defined by a bunch of correlations present in the underlying image data, it's possible that images of trees will naturally cluster in NN activations even when the NN has no underlying tree concept; SAEs would still be able to detect and learn this cluster as one of their neurons.

I agree this is a valid critique. Here's one empirical test which partially gets at it: what happens when you train an SAE on a NN with random weights? (I.e. you randomize the parameters of your NN, and then train an SAE on its activations on real data in the normal way.) Then to the extent that your SAE has good-looking features, that must be because your SAE was picking up on structure in the underlying data.

My collaborators and I did this experiment. In more detail, we trained SAEs on Pythia-70m's MLPs, then did this again but after randomizing the weights of Pythia-70m. Take a moment to predict the results if you want etc etc.

The SAEs that we trained on a random network looked bad. The most interesting dictionary features we found were features that activated on particular tokens (e.g. features that activated on the "man" feature and no others). Most of the features didn't look like anything at all, activating on a large fraction (>10%) of tokens in our data, with no obvious patterns.(The features for dictionaries trained on the non-random network looked much better.)

We also did a variant of this experiment where use randomized Pythia-70m's parameters except for the embedding layer. In this variant, the most interesting features we found were features which fired on a few closely semantically related tokens (e.g. the tokens "make," "makes," and "making").

Thanks to my collaborators for this experiment: Aaron Mueller and David Bau.

I agree that a reasonable intuition for what SAEs do is: identify "basic clusters" in NN activations (basic in the sense that you allow compositionality, i.e. you don't try to learn clusters whose centroids are the sums of the centroids of previously-learned clusters). And these clusters might exist because:

  1. your NN has learned concepts and these clusters correspond to concepts (what we hope is the reason), or
  2.  because of correlations present in your underlying data (the thing that you seem to be worried about).

Beyond the preliminary empirics I mentioned above, I think there are some theoretical reasons to hope that SAEs will mostly learn the first type of cluster:

  • Most clusters in NN activations on real data might be of the first type
    • This is because the NN has already, during training, noticed various correlations in the data and formed concepts around them (to the extent that these concepts were useful for getting low loss, which they typically will be if your model is trained on next-token prediction (a task which incentivizes you to model all the correlations)).
  • Clusters of the second type might not have any interesting compositional structure, but your SAE gets bonus points for learning clusters which participate in compositional structure.
    • E.g. If there are five clusters with centroids w, x, y, z, and y + z and your SAE can only learn 2 of them, then it would prefer to learn the clusters with centroids y and z (because then it can model the cluster with centroid y + z for free).

Maybe looking at the connections of your classifer (what earlier features it connects to and what these connect to) and applying selection to the classifer based on the connections will be good. This can totally be applied to probes. (Maybe there is some reason why looking at connections will be especially good for features but not probes, but if so, why?)

"Can this be applied to probes" is a crux for me. It sounds like you're imagining something like:

  • Train a bunch of truthfulness probes regularized to be distinct from each other.
  • Train a bunch of probes for "blacklsited" features which we don't think should be associated to truth (e.g. social reasoning, intent to lie, etc.).
  • (Unsure about this step.) Check which truth directions are causally downstream of blacklisted feature directions (with patching experiments?). Use that to discriminate among the probes.

Is that right?

This is not an option I had considered, and it would be very exciting to me if it worked. I have some vague intuition that this should all go better when you are working with features (e.g. because the causal dependencies among the features should be sparse), but I would definitely need to think about that position more.

Let me try to state something which captures most of that approach to make sure I understand:

Everything you wrote describing the hope looks right to me.

It's worth noting that I can't imagine this resulting in vary ambitious applications, though the reduction in doom could still be substantial.

To be clear, what does "ambitious" mean here? Does it mean "producing a large degree of understanding?"

If we don't understand much of the training compute then there will be decompositions which look to us like a good enough decomposition while hiding arbitrary stuff in the residual between our understanding and what's going on.


If we want to look at connections, then imperfect understanding will probably bite pretty hard particularly as the effect size of the connection gets smaller and smaller (either due to path length >1 or just there being many things which are directly connected but have a small effect).

These seem like important intuitions, but I'm not sure I understand or share them. Suppose I identify a sentiment feature. I agree there's a lot of room for variation in what precise notion of sentiment the model is using, and there are lots of different ways this sentiment feature could be interacting with the network that are difficult to understand. But maybe I don't really care about that, I just want a classifier for something which is close enough to my internal notion of sentiment.

Just so with truth: there's probably lots of different subtly different notions of truth, but for the application of "detecting whether my AI believes statement X to be true" I don't care about that. I do care about the difference between "true" and "humans think is true," but that's a big difference that I can understand (even if I can't produce examples), and where I can articulate the sorts of cognition which probably should/shouldn't be involved in it.

What's the specific way you imagine this failing? Some options:

  • None of the features we identify really seem to correspond to something resembling our intuitive notion of "truth" (e.g. because they frequently activate on unrelated concepts).
  • We get a bunch of features that look like truth, but can't really tell what goes into computing them.
  • We get a bunch of features that look like truth and we have some vague sense of how they're computed, but they don't seem differentiated in how "sketchy" these computational graphs look: either they all seem to rely on social reasoning or they all don't seem to.

Maybe a better question would be - why didn't these issues (lack of robust explanation) get in the way of the Steinhardt paper I linked? They were in fact able to execute something like the plan I sketch here: use vague understanding to guess which model components attend to features which are spuriously correlated with the thing you want, then use the rest of the model as an improved classifier for the thing you want.

What follows is a note I wrote responding to the AI Optimists essay, explaining where I agree and disagree. I was thinking about posting this somewhere, so I figure I'll leave it in the comments here. (So to be clear, it's responding to the AI Optimists essay, not responding to Steven's post.)

Places I think AI Optimists and I agree:

  • We have a number of advantages for aligning NNs that we don’t have for humans: white box access, better control over training environments and inputs, better control over the reward signal, and better ability to do research about which alignment techniques are most effective.
  • Evolution is a misleading analogy for many aspects of the alignment problem; in particular, gradient-based optimization seems likely to have importantly different training dynamics from evolution, like making it harder to gradient hack your training process into retaining cognition which isn’t directly useful for producing high-reward outputs during training.
  • Humans end up with learned drives, e.g. empathy and revenge, which are not hard-coded into our reward systems. AI systems also have not-strictly-optimal-for-their-training-signal learned drives like this.
  • It shouldn’t be difficult for AI systems to faithfully imitate human value judgements and uncertainty about those value judgements.

Places I think we disagree, but I’m not certain. The authors of the Optimists article promise a forthcoming document which addresses pessimistic arguments, and these bullet points are something like like “points I would like to see addressed in this document.”

  • I’m not sure we’re worrying about the same regimes.
    • The regime I’m most worried about is:
      • AI systems which are much smarter than the smartest humans
      • These AI systems are aligned in a controlled lab environment, but then deployed into the world at-large. Many of their interactions are difficult to monitor (and are also interactions with other AI systems).
      • Possibly: these AI systems are highly multi-modal, including sensors which look like “camera readouts of real-world data”
    • It’s unclear to me whether the authors are discussing alignment in a regime like the one above, or a regime like “LLMs which are not much smarter than the smartest humans.” (I too am very optimistic about remaining safe in this latter regime.)
      • When they write things like “AIs are white boxes, we have full control over their ‘sensory environment’,” it seems like they’re imagining the latter regime.
      • They’re not very clear about what intelligence regime they’re discussing, but I’m guessing they’re talking about the ~human-level intelligence regime (e.g. because they don’t spill much ink discussing scalable oversight problems; see below). 
  • I worry that the difference between “looks good to human evaluators” and “what human evaluators actually want” is important.
    • Concretely, I worry that training AI systems to produce outputs which look good to human evaluators will lead to AI systems which learn to systematically deceive their overseers, e.g. by introducing subtle errors which trick overseers into giving a too-high score, or by tampering with the sensors that overseers use to evaluate model outputs.
    • Note that arguments about the ease of learning human values and NN inductive biases don’t address this point — if our reward signal systematically prefers goals like “look good to evaluators” over goals like “actually be good,” then good priors won’t save us.
      • (Unless we do early stopping, in which case I want to hear a stronger case for why our models’ alignment will be sufficiently robust (robust enough that we’re happy to stop fine-tuning) before our models have learned to systematically deceive their overseers.)
  • I worry about sufficiently situationally aware AI systems learning to fixate on reward mechanisms (e.g. “was the thumbs-up button pressed” instead of “was the human happy”).
    • To sketch this concern out concretely, suppose an AI system is aware that it’s being fine-tuned and learned during pretraining that human overseers have a “thumbs-up” button which determines whether the model is rewarded. Suppose that so far during fine-tuning “thumbs-up button was pressed” and “human was happy” were always perfectly correlated. Will the model learn to form values around the thumbs-up button being pressed or around humans being happy? I think it’s not obvious.
    • Unlike before, NN inductive biases are relevant here. But it’s not clear to me that “humans are happy” will be favored over “thumbs-up button is pressed” — both seem similarly simple to an AI with a rich enough world model.
    • I don’t think the comparison with humans here is especially a cause for optimism: lots of humans get addicted to things, which feels to me like “forming drives around directly intervening on reward circuitry.”
  • For both of the above concerns, I worry that they might emerge suddenly with scale.
    • As argued here, “trick the overseer” will only be selected for in fine-tuning once the (pretrained) model is smart enough to do it well.
    • You can only form values around the thumbs-up button once you know it exists.
  • It seems to me that, on the authors’ view, an important input to “human alignment” is the environment that we’re trained in (rather than details of our brain’s reward circuitry, which is probably very simple). It doesn’t seem to me that environmental factors that make humans aligned (with each other) should generalize to make AI systems aligned (with humans).
    • In particular, I would guess that one important part of our environment is that humans need to interact with lots of similarly-capable humans, so that we form values around cooperation with humans. I also expect AI systems to interact with lots of AI systems (though not necessarily in training), which (if this analogy holds at all) would make AI systems care about each other, not about humans.
  • I neither have high enough confidence in our understanding of NN inductive biases, nor in the way Quintin/Nora make arguments based on said understanding, to consider these arguments as strong evidence that models won’t “play the training game” while they know they’re being trained/evaluated only to, in deployment, pursue goals they hid from their overseers.
    • I don’t really want to get into this, because it’s thorny and not my main source of P(doom).

A specific critique about the article:

  • The authors write “Some people point to the effectiveness of jailbreaks as an argument that AIs are difficult to control. We don’t think this argument makes sense at all, because jailbreaks are themselves an AI control method.” I don’t really understand this point.
    • The developer wanted their model to be sufficiently aligned that it would, e.g. never say racist stuff no matter what input it saw. In contrast, it takes only a little bit of adversarial pressure to produce inputs which will make the model say racist stuff. This indicates that the developer failed at alignment. (I agree that it means that the attacker succeeded at alignment.)
    • Part of the story here seems to be that AI systems have held-over drives from pretraining (e.g. drives like “produce continuations that look like plausible web text”). Eliminating these undesired drives is part of alignment.

Thanks for having this dialogue -- I'm very happy to see clearer articulation of the Buck/Ryan views on theories of impact for MI work!

The part that I found most useful was Ryan's bullet points for "Hopes (as I see them) for mech interp being useful without explaining 99%". I would guess that most MI researchers don't actually see their theories of impact as relying on explaining ~all of model performance (even though they sometimes get confused/misunderstand the question and say otherwise). So I think the most important cruxes will lie in disagreements about (1) whether Ryan's list is complete, and (2) whether Ryan's concerns about the approaches listed are compelling.

Here's a hope which (I think) isn't on the list. It's somewhat related to the hope that Habryka raised, though a bit different and more specific.

Approach: maybe model internals overtly represent qualities which distinguish desired vs. undesired cognition, but probing is insufficient for some reason (e.g. because we don't have good enough oversight to produce labeled data to train a probe with).

Here's a concrete example (which is also the example I most care about). Our goal is to classify statements as true/false, given access to a model that knows the answer. Suppose our model has distinct features representing "X is true" and "humans believe X." Further suppose that on any labeled dataset we're able to create, these two features are correlated; thus, if we make a labeled dataset of true/false statements and train a probe on it, we can't tell whether the probe will generalize as an "X is true" classifier or a "humans believe X classifier." However, a coarse-grained mechanistic understanding would help here. E.g., one could identify all of the model features which serve as accurate classifiers on our dataset, and only treat statements as true if all of the features label them as true. Or if we need a lower FPR, one might be able to mechanistically distinguish these features, e.g. by noticing that one feature is causally downstream of features that look related to social reasoning and the other feature isn't.

This is formally similar to what the authors of this paper did. In brief, they were working with the Waterbirds dataset, an image classification task with lots of spuriously correlated features which are not disambiguated by the labeled data. Working with a CLIP ViT, the authors used some ad-hoc technique to get a general sense that certain attention heads dealt with concepts like "texture," "color," and "geolocation." Then they ablated the heads which seemed most likely to attend to confounding features; this resulted in a classifier which generalized in the desired way, without requiring a better-quality labeled dataset.

Curious for thoughts about/critiques of this impact story.

Without deceptive alignment/agentic AI opposition, a lot of alignment threat models ring hollow. No more adversarial steganography or adversarial pressure on your grading scheme or worst-case analysis or unobservable, nearly unfalsifiable inner homonculi whose goals have to be perfected

Instead, we enter the realm of tool AI which basically does what you say.

I agree that, conditional on no deceptive alignment, the most pernicious and least tractable sources of doom go away. 

However, I disagree that conditional on no deceptive alignment, AI "basically does what you say." Indeed, the majority of my P(doom) comes from the difference between "looks good to human evaluators" and "is actually what the human evaluators wanted." Concretely, this could play out with models which manipulate their users into thinking everything is going well and sensor tamper.

I think current observations don't provide much evidence about whether these concerns will pan out: with current models and training set-ups, "looks good to evaluators" almost always coincides with "is what evaluators wanted." I worry that we'll only see this distinction matter once models are smart enough that they could competently deceive their overseers if they were trying (because of the same argument made here). (Forms of sycophancy where models knowingly assert false statements when they expect the user will agree are somewhat relevant, but there are also benign reasons models might do this.)

Load More