Opening positions

ryan_greenblatt

I'm somewhat skeptical about mech interp (bottom-up or substantial reverse engineering style interp):

  • Current work seems very far from being useful (it isn't currently useful) or from explaining much of what's going on inside models in key cases. But it's hard to be very confident that a new field won't work! And things can be far from useful yet become useful by slowly becoming more powerful, etc.
  • In particular, current work fails to explain much of the performance of models which makes me think that it's quite far from ambitious success and likely also usefulness. I think this even after seeing recent results like dictionary learning results (though results along these lines were a positive update for me overall).
  • There isn't a story which-makes-much-sense-and-seems-that-plausible-to-me for how mech interp allows for strongly solving core problems like auditing for deception or being able to supervise superhuman models which carry out actions we don't understand (e.g. ELK).

That said, all things considered, mech interp seems like a reasonable bet to put some resources in. 

I'm excited about various mech interp projects which either:

  • Aim to more directly measure and iterate on key metrics of usefulness for mech interp
  • Try to use mech interp to do something useful and compare to other methods (I'm fine with substantial mech interp industrial policy, but we do actually care about the final comparison. By industrial policy, I mean subsidizing current work even if mech interp isn't competitive yet because it seems promising.)

I'm excited about two main outcomes from this dialogue:

  • Figuring out whether or not we agree on the core claims I wrote above. (Either get consensus or find crux ideally)
  • Figuring out which projects we'd be excited about which would substantially positively update us about mech interp.

Maybe another question which is interesting: even if mech interp isn't that good for safety, maybe it's pretty close to stuff which is great and is good practice.

Another outcome that I'm interested in is personally figuring out how to better articulate and communicate various takes around mech interp.

ryan_greenblatt

By mech interp I mean "A subfield of interpretability that uses bottom-up or reverse engineering approaches, generally by corresponding low-level components such as circuits or neurons to components of human-understandable algorithms and then working upward to build an overall understanding."

Neel Nanda

I feel pretty on board with this definition.

Buck

Our arguments here do in fact have immediate implications for your research, and the research of your scholars, implying that you should prioritize projects of the following forms:

  • Doing immediately useful stuff with mech interp (and probably non-mech interp), to get us closer to model-internals-based techniques adding value. This would improve the health of the field, because it's much better for a field to be able to evaluate work in simple ways.
  • Work which tries to establish the core ambitious hopes for mech interp, rather than work which scales up mediocre-quality results to be more complicated or on bigger models.
Neel Nanda

What I want from this dialogue:

  • Mostly an excuse to form more coherent takes on why mech interp matters, limitations, priorities, etc
  • I'd be excited if this results in us identifying concrete cruxes
  • I'd be even more excited if we identify concrete projects that could help illuminate these cruxes (especially things I could give to my new army of MATS scholars!)
ryan_greenblatt

I'd be even more excited if we identify concrete projects that could help illuminate these cruxes (especially things I could give to my new army of MATS scholars!)

I'd like to explicitly note I'm excited to find great concrete projects!

Neel Nanda

Stream of consciousness takes on your core claims:

I basically agree that current mech interp is not currently useful for actual, non-interp things we might care about doing/understanding in models. I'm hesitant to agree with "very far from being useful", mostly because I agree that you should be pretty uncertain about the future trajectory of a field, but this may just be semantics.

Notable intuitions I have:

  • Mech interp doesn't need to explain everything about how a model does something to be useful (explaining an important part, or the gist of it, may be fine)
  • It really feels like models have real, underlying structure that can be understood, that we could have lived in a world where "everything inside a model is a fucking mess", and we do not seem to live in that world! That world would not have things like induction heads, the French neuron, etc. Models also seem super messy in a bunch of ways, and I am not sure how to square this circle

I'm excited about projects of the form "try to understand a real-world task with mech interp (eg why models refuse requests/can be jailbroken, or why they hallucinate), and then ideally convert this understanding into actually affecting that downstream task". Concrete suggestions here are welcome, I've already brainstormed a few for my MATS scholars

Neel Nanda

Some assorted responses:

In particular, current work fails to explain much of the performance of models which makes me think that it's quite far from ambitious success and likely also usefulness.

This is not an obvious claim to me, though I find it a bit hard to articulate my intuitions here

A possible meta-level point of disagreement is whether a research approach needs to have a careful backchained theory of change behind it to be worthwhile, or if "something here seems promising, even if I struggle to articulate exactly what, and I'll get some feedback from reality" is a decent reason.

There are other directions which make non-trivial use of the internals of models which I'm excited about, but which aren't mech interp.

This feels fairly true to me (in the sense of "I expect such methods to exist"), though I don't feel confident in any specific non-mechanistic approach. I expect that for any given such method, I'd be excited about trying to use mech interp to red-team/validate/better understand it

Buck

A possible meta-level point of disagreement is whether a research approach needs to have a careful backchained theory of change behind it to be worthwhile, or if "something here seems promising, even if I struggle to articulate exactly what, and I'll get some feedback from reality" is a decent reason.

I don't think research approaches need to have careful backchained theory of change behind them to be worthwhile.

I do think that it's best if research approaches have either:

  • A way to be empirically grounded. The easiest option here is to be useful for some task, and demonstrate increased performance over time.
  • A clear argument for why the research will eventually be useful.

I'm concerned in cases where neither of those is present.

ryan_greenblatt

Some quick meta notes:

Against Almost Every Theory of Impact of Interpretability is relevant prior work. This post actually dissuaded me from writing a post with somewhat similar content. Though note that I disagree with various specific points in this post:

  • I think that it overgeneralizes from mech interp pessimism toward pessimism for less ambitious hopes for understanding of model internals
  • I think it fails to clearly emphasize that spending some resources on very speculative bets can be totally worth doing even if there isn't a clear theory of change and all we have to go on are vibes.

I think it's fine (maybe great!) for many people to not at all worry or think about the theory of change or fastest paths to usefulness. It's fine if some people want to operate with mech interp as a butterfly idea. (But I think some people should care about usefulness or theory of change.)

ryan_greenblatt

> There are other directions which make non-trivial use of the internals of models which I'm excited about, but which aren't mech interp.

This feels fairly true to me (in the sense of "I expect such methods to exist"), though I don't feel confident in any specific non-mechanistic approach. I expect that for any given such method, I'd be excited about trying to use mech interp to red-team/validate/better understand it

Interesting. I really feel like there are a lot of methods where we could gain moderate confidence in them working without mechanistic verification (rather our evidence would come from the method being directly useful in a variety of circumstances). I think that both higher level interp and internals methods which don't involve any understanding are pretty promising.

Do induction heads and French neurons exist?

Buck

That world would not have things like induction heads, the French neuron, etc.

I claim that our world does not have induction heads, at least in the sense of "heads that are well explained by the hypothesis that they do induction"

I also think it's not clear that the French neuron is a French neuron, rather than a neuron which does something inexplicable, but only in cases where the text is in French. (assuming that you're referring to a neuron that fires on French text)

ryan_greenblatt

It's not clear that the French neuron is a French neuron, rather than a neuron which does something inexplicable, but only in cases where the text is in French. (assuming that you're referring to a neuron that fires on French text)

A high level concern here is that there might be a lot of neurons which look roughly like French neurons and they can't all be doing the same thing. So probably they're doing something more specific and probably a lot of the usefulness of the neuron to the model is in the residual between our understanding and the actual behavior of the neuron. (As in, if you used our understanding to guess at what the value of the neuron should be on some input and then subtracted off the actual value, that residual would contain a lot of signal.)

(Fortunately, we can measure how much perf is explained by our understanding, though there are some complications.)
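
To make the "residual" idea concrete, here is a minimal sketch on synthetic data. Everything in it is an assumption for illustration: the neuron, the numbers, and the use of activation variance as a stand-in (the metric discussed here is performance explained, which would require running the model). The "understanding" is the hypothesis that the neuron fires at a fixed level on French text and is silent otherwise.

```python
import numpy as np

def fraction_of_variance_unexplained(actual, predicted):
    """Share of the neuron's activation variance left in the residual."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    residual = actual - predicted
    total = np.sum((actual - actual.mean()) ** 2)
    return float(np.sum(residual ** 2) / total)

# Synthetic, hypothetical activations: zero on non-French inputs, and a
# varying positive value on French inputs (the variation is the part the
# "is French" story does not account for).
rng = np.random.default_rng(0)
is_french = rng.random(1000) < 0.2
actual = np.where(is_french, rng.uniform(0.5, 3.0, size=1000), 0.0)

# The explanation's guess: the average French activation on French text.
predicted = np.where(is_french, actual[is_french].mean(), 0.0)

fvu = fraction_of_variance_unexplained(actual, predicted)
print(f"fraction of variance unexplained: {fvu:.3f}")
```

A nonzero value here only says the label leaves activation variance on the table; whether that leftover matters to the model is the separate question of how much performance it carries.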

Some supporting evidence for this view is the feature splitting section of the recent Anthropic dictionary learning paper.

Neel Nanda

Clarification of what I believe about induction heads:

  • A Mathematical Framework argued that there are heads which sometimes do strict induction, and found that we could decode an algorithm for this from the head's parameters (and the previous token head's parameters)
    • Clarification: This did not show that this was all the head was doing, just that one of the terms when you multiplied out the matrices was an induction-y term
  • The sequel paper (on In-Context Learning) found a bunch of heads in models up to 13B that did induction-like stuff on a behavioural basis, on repeated random tokens. These heads are causally linked to in-context learning, and the development of behavioural induction heads seems to be a key enabler of in-context learning.
  • I do not necessarily believe that models contain monosemantic induction heads (doing nothing else), nor that we understand the mechanism or that the mechanism is super elegant, clean and sparse
    • I also think there's a ton of induction variants (eg long-prefix, disambiguating AB...AC...A, dealing with tokenization artifacts, etc)
  • Fuzzy: I do think that the induction mechanism is surprisingly sparse in the head basis, in that there are heads that seem very induction-y, and heads that don't seem at all relevant. 
    • I think it's cool that this is a motif that seems to recur across models, and be useful in a range of contexts. My guess is that "induction" is a core algorithmic primitive in models that gets used (in a fuzzy way) in a range of contexts
  • My underlying point is that there's a spectrum of, a priori, how much structure I might have expected to see inside language models, from incomprehensible to incredibly clean and sparse. Induction heads feel like they rule out the incomprehensible end, and thus feel like a positive update, but maybe are evidence against the strong version of the clean and sparse end?
Buck

I think that, for my favorite metric of "proportion of what's going on that you've explained", the 'they do induction' hypothesis might be less than 1% of an explanation.

Neel Nanda

I think that, for my favorite metric of "proportion of what's going on that you've explained", the 'they do induction' hypothesis might be less than 1% of an explanation.

1% seems crazy low to me. Do you have a source here, or is this a guess?

Operationalising concretely, do you mean "if we replaced them with Python code that does strict induction, and compared this with mean ablating the head, it would recover 1% of the loss compared to restoring the full head"?
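
The operationalisation above can be sketched as a small function. This is only an illustration of the arithmetic: the argument names and the example losses are hypothetical, and obtaining the three losses would require actually running the model with the head intact, mean-ablated, and replaced by the strict-induction code.

```python
def fraction_of_loss_recovered(loss_full, loss_ablated, loss_explained):
    """Share of the ablation-induced loss gap closed by the explanation.

    loss_full:      loss of the unmodified model
    loss_ablated:   loss with the head mean-ablated (the baseline)
    loss_explained: loss with the head replaced by the explanation
    """
    gap = loss_ablated - loss_full
    if gap <= 0:
        raise ValueError("mean ablation must increase loss for this metric to be defined")
    return (loss_ablated - loss_explained) / gap

# Hypothetical losses, chosen only to show the calculation.
recovered = fraction_of_loss_recovered(loss_full=3.00, loss_ablated=3.50, loss_explained=3.45)
print(f"fraction of loss recovered: {recovered:.2f}")
```

By construction the metric is 0 when the replacement does no better than mean ablation and 1 when it matches the full head.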

ryan_greenblatt

I think that, for my favorite metric of "proportion of what's going on that you've explained", the 'they do induction' hypothesis might be less than 1% of an explanation.

This might be very sensitive to the exact model which is under analysis. I'm personally skeptical of 1% for small attention-only models (I expect way higher).

For big models maybe.

I assume that by 'they do induction' you mean strict induction.

Buck

Operationalising concretely, do you mean "if we replaced them with Python code that does strict induction, and compared this with mean ablating the head, it would recover 1% of the loss compared to restoring the full head"?

No, that would do way better than 1% loss explained. (Maybe it would get like 10-20% loss explained?)

ryan_greenblatt

10-20% seems about right based on our causal scrubbing results.

Neel Nanda

A high level concern here is that there might be a lot of neurons which look roughly like french neurons and they can't all be doing the same thing. So probably they're doing something more specific and probably a lot of the usefulness of the neuron to the model is in the residual between our understanding and the actual behavior of the neuron.

(Fortunately, we can measure how much perf is explained by our understanding, though there are some complications.)

Some supporting evidence for this view is the feature splitting section of the recent Anthropic dictionary learning paper.

I agree with all of this (and, to be clear, we didn't try very hard in Neurons In A Haystack to establish that it only activates on French text, since we only studied it on EuroParl rather than the Pile). And I agree that it likely has a more nuanced role than just "detects French": there are in fact several French neurons, some of which matter far more than others.

I used it as an example of "more structure than a random model would have", strongly agreed there's a lot of underlying complexity and messiness

ryan_greenblatt

I used it as an example of "more structure than a random model would have", strongly agreed there's a lot of underlying complexity and messiness.

Strong agree on more structure than a random model would have. I just worry that we need much higher standards here.

What is the bar for a mechanistic explanation?

Buck

The core problem with using the metric "how much loss is recovered if you use this code instead of just replacing the output with its mean" is that you'll get very high proportions of loss explained even if you don't explain anything about the parts of your model that are actually smart. 

For example, GPT-2-sm is most of the way to GPT-4 performance (compared to mean ablation). It seems like for its ambitious hopes for impact to succeed, mech interp needs to engage with properties of transformative models that were not present in current LMs, and that will require extremely high standards on the metric you proposed.

Neel Nanda

IMO the best current example of "this is what this model component is doing on the full data distribution" is the copy suppression head (L10H7 in GPT-2 Small - paper from Callum McDougall, Arthur Conmy and Cody Rushing in my most recent round of MATS), where we actually try to look at the fraction of loss recovered on the full data distribution, and find we can explain 77% (30-40% with a more specific explanation) if we restrict it to doing copy suppression-like stuff only, as well as some analysis of the weights. 

But certain details there are still somewhat sketchy, in particular we don't have a detailed understanding of the attention circuit, and replacing the query with "the projection onto the subspace we thought was all that mattered" harmed performance significantly (down to 30-40%).

One thing that makes me happier about the copy suppression work is that, as far as I'm aware, Callum and Arthur did not actually find any dataset examples where the head matters by something other than copy suppression. (Not confident, but I believe they looked at random samples from the top 5% of times the head improved loss, after filtering out copy suppression algorithmically, and mostly just found examples that were variants of copy suppression, eg where a token is split because it wasn't prepended by a space)

Buck

So my two problems with your copy suppression example:

  • 30-40% is not actually what I'd call "a complete explanation"
  • The standards explanations need to meet might be more like "many nines of reliability" than "better than 50% reliability"
Neel Nanda

E.g. GPT-2-sm is most of the way to GPT-4 performance (compared to mean ablation). It seems like for its ambitious hopes for impact to succeed, mech interp needs to engage with properties of transformative models that were not present in current LMs, and that will require extremely high standards on the metric you proposed.

OK, this is a fair point. A counter-point is that on specific, narrow prompts the diff between GPT-2 Small and GPT-4 may be very big? But even there, I agree that eg knowledge of basic syntax gets you a ton of loss, and maybe mean ablation breaks that. 

How would you feel about a metric like "explaining x% of the difference between GPT-4 and GPT-2"?

Buck

How would you feel about a metric like "explaining x% of the difference between GPT-4 and GPT-2"?

This is just a rescaling of the metric "explaining x% of the difference between GPT-4 and unigram statistics". So the question is still entirely about how good x has to be for us to be happy.
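
One way to spell out the rescaling claim, with notation that is assumed here rather than taken from the discussion: let $L_b$ be the loss of a baseline $b$, $L_{\text{expl}}$ the loss achieved by the explanation, and $x_b$ the fraction of the gap between baseline $b$ and GPT-4 that the explanation covers.

```latex
\begin{align*}
x_b &= \frac{L_b - L_{\text{expl}}}{L_b - L_{\text{GPT-4}}},
\qquad
1 - x_b = \frac{L_{\text{expl}} - L_{\text{GPT-4}}}{L_b - L_{\text{GPT-4}}}, \\[4pt]
r &= \frac{L_{\text{GPT-2}} - L_{\text{GPT-4}}}{L_{\text{uni}} - L_{\text{GPT-4}}}, \\[4pt]
1 - x_{\text{uni}} &= r \,\bigl(1 - x_{\text{GPT-2}}\bigr).
\end{align*}
```

The unexplained fraction under one baseline is a constant multiple of the unexplained fraction under the other, so switching baselines changes the numbers but not the underlying question of how small the unexplained part has to be.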

Neel Nanda

Meta-level note: I'm not sure that our current discussion is a crux for me. Even if I conceded that current models are a mess, it's plausible that this is downstream of neuron and attention head superposition, and that better conceptual frameworks and techniques like really good sparse auto encoders (SAE) would give us more clarity there.

On the other hand, maybe this is naive, I would be pretty surprised if an SAE could get the level of precision you're looking for here

ryan_greenblatt

The standards explanations need to meet might be more like "many nines of reliability" than "better than 50% reliability"

I think like 99% reliability is about the right threshold for large models based on my napkin math.

Buck

The argument for mech interp which says "current stuff is a mess and objectively unacceptably bad, but all the problems are downstream of superposition; mechanistic interpretability is still promising because we might fix superposition" is coherent but requires a totally different justification than "current mech interp results make the situation seem promising" – you have to justify it with argument.

ryan_greenblatt

I'll try to explain how I like thinking about the amount explained in LMs trained on real text distributions.

First, let's simplify and just talk about explanations which provide a human-comprehensible explanation for an entire LM trained on a normal text corpus.

I think we should look at the absolute loss and determine where this is on an overall training compute scaling curve.

This is because we really care about explaining the performance gap from GPT2-sm to GPT4. That's in fact all we care about in some sense.

And if you do this math and want to explain important parts of GPT4 you naively need 99% perf.
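
A rough sketch of the kind of calculation this points at, under loudly stated assumptions: loss is taken to follow a power law in training compute, $L(C) = L_\infty + k\,C^{-\alpha}$ with $C$ measured relative to the full model, and every constant below is a made-up placeholder rather than a fitted value. The point is only the shape of the argument: a small shortfall in performance explained can correspond to a large drop in effective compute.

```python
def effective_compute_fraction(perf_explained, loss_baseline=7.0,
                               loss_irreducible=1.5, k=0.5, alpha=0.05):
    """Compute (relative to the full model) whose loss matches the explanation's.

    Assumes L(C) = loss_irreducible + k * C**(-alpha), with C = 1 for the
    full model. All default constants are hypothetical placeholders.
    """
    loss_full = loss_irreducible + k  # loss of the full model at C = 1
    # Loss reached when the explanation recovers `perf_explained` of the
    # gap between the baseline (e.g. unigram statistics) and the full model.
    loss_expl = loss_baseline - perf_explained * (loss_baseline - loss_full)
    excess = loss_expl - loss_irreducible
    if excess <= 0:
        raise ValueError("explained loss must stay above the irreducible loss")
    # Invert the power law to find the compute that gives this loss.
    return (excess / k) ** (-1.0 / alpha)

for p in (1.0, 0.99, 0.98, 0.90):
    print(f"perf explained {p:.2f} -> effective compute x{effective_compute_fraction(p):.2e}")
```

With different (real) scaling constants the numbers would change; the sketch is not evidence for any particular threshold.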

Buck

My basic question is why you think of current mechanistic interpretability progress as a valid sign of life based on numbers like 50% of performance explained. How did you decide on 50% as the threshold for a good explanation of a single component like a copy-suppression head?

Neel Nanda

I'm unconvinced that you need anything like 99% of performance explained (however we precisely operationalise this). 

I think that if I were convinced that mech interp needed 99% perf explained to matter I would be notably less optimistic about it. Partly I just expect that models are a big stack of boring heuristics, that interact in weird and complex ways (eg involving superposition interference, or ablations changing the LayerNorm scale, or floating point errors, or random stuff) such that getting to anything as high as 99% is wild.

ryan_greenblatt

Why you maybe need 99% perf explained:

  • If you have <98% perf explained (on webtext relative to unigram or bigram baseline), then you degrade from GPT4 perf to GPT3.5 perf. (I should double check my numbers here, but this is the ballpark)
  • Then, I'm worried that if your explanation explains 90% of perf, you'll just be explaining stuff which didn't present a safety concern.
  • It could be that you do end up understanding a bunch of GPT4 specific stuff in your explanation, but we can't know that based on this metric.
  • I think this seems particularly important for ambitious mech interp. Either trying to audit for scheming in models or trying to avoid issues from superhuman models doing totally incomprehensible actions.
Neel Nanda

If you have <98% perf explained (on webtext relative to unigram or bigram baseline), then you degrade from GPT4 perf to GPT3.5 perf.

Hmm. This is a compelling point. 

I want to push back, in that I think that much of the safety-relevant difference between GPT-4 and GPT-3.5 doesn't come from GPT-4 being 1% better everywhere, it comes from a few crucial places where GPT-4 gets it right and GPT-3.5 gets it wrong. 

I'd further argue that this is actually just most of the difference between GPT-4 and GPT-3.5, but that's more speculative - I think the Quantization Model of Neural Scaling paper found this (that larger models differed from small models by having more tokens at near zero loss, more so than across the board increases), but on much smaller models than GPT-3.5

I do feel fairly pessimistic about approaches like enumerative safety (trying to enumerate all features in a model so you can then rule out the existence of a deceptive feature), which I think will need to hit a 99.9% style bar to reach a level of confidence that a model is safe?

Neel Nanda

trying to avoid issues from superhuman models doing totally incomprehensible actions.

An uncertainty here is whether the lost performance comes from some genuinely different algorithm, vs some clumsiness in our ablations. If you think there may be some genuinely different algorithm hiding in the leftover 50%, I agree that seems very bad!

Buck

I don't get what you mean by "clumsiness in our ablations". I don't know why you think that the error induced by ablations isn't just because the explanation we tried to ablate to is importantly wrong.

Neel Nanda

I don't get what you mean by "clumsiness in our ablations". I don't know why you think that the error induced by ablations isn't just because the explanation we tried to ablate to is importantly wrong.

I think that any ablation is throwing the model off distribution a bit. Eg mean ablation will mess with the LayerNorm scale, resample ablation may introduce info that's weird, the model may expect that head A partially cancels with head B or be using head C to partially reinforce the earlier head D as a secondary role, etc. 
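
As a toy illustration of the two ablations named above (not of any real model), here is a sketch on synthetic activations. The array shapes and the Gaussian data are assumptions. It shows one simple way mean ablation can move things off distribution: the averaged vector has a much smaller norm than a typical activation, which is the sort of change a downstream LayerNorm would see as a different scale.

```python
import numpy as np

def mean_ablate(acts):
    """Replace every input's activation with the mean over the batch."""
    return np.broadcast_to(acts.mean(axis=0), acts.shape).copy()

def resample_ablate(acts, rng):
    """Replace each input's activation with one drawn from another input
    (implemented as a random permutation of the batch)."""
    return acts[rng.permutation(acts.shape[0])]

def mean_norm(x):
    """Average L2 norm of the rows of x."""
    return float(np.linalg.norm(x, axis=1).mean())

# Synthetic stand-in for a component's output: 512 inputs, 64 dimensions.
rng = np.random.default_rng(0)
acts = rng.normal(size=(512, 64))

mean_abl = mean_ablate(acts)
resampled = resample_ablate(acts, rng)

print("typical norm:      ", round(mean_norm(acts), 2))
print("mean-ablated norm: ", round(mean_norm(mean_abl), 2))
print("resampled norm:    ", round(mean_norm(resampled), 2))
```

Resample ablation keeps each vector on the original activation distribution but pairs it with the wrong input, which is the "info that's weird" failure mode instead.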

ryan_greenblatt

I'm (in theory) on board with work where you retrain the rest of the model or learn an arbitrary linear function on top of your explanation. Possibly this could avoid issues here. There are some difficulties with this ofc.

Buck

Idea for your MATS scholars: try to get explanations/replacements that get a smaller model 80% of the way to the bigger model's performance, on tasks where the bigger model is vastly better than the smaller model. I.e. augment a small model with an explanation you derived from a big model, so that on that task the small model now performs 80% of the way to the big model. This seems like a plausibly tractable research direction that I can imagine your scholars focusing on.

If your hope is to focus on cases where 80% means your explanation contains more of the relevant model knowledge than a small model had, you should maybe try to steer towards that hope immediately.

Neel Nanda

Ie, find some task that small models just can't do (say, hard multiple choice Qs from MMLU) and find a circuit such that resample ablating everything not in that circuit preserves 80% of the loss of the large model above the small model?

ryan_greenblatt

As described, I'm maybe skeptical about tractability. Seems like this is way harder than any mech interp to date.

Like learning how 2B models code is super hard.

Buck

Ie, find some task that small models just can't do (say, hard multiple choice Qs from MMLU) and find a circuit such that resample ablating everything not in that circuit preserves 80% of the loss of the large model above the small model?

Yep! Like Ryan, I suspect you'll fail, but it sounds like you think you might succeed, and it seems like whether you can succeed at this is a crux for one of your favorite theories of change, so seems great for you to try.

Neel Nanda

Hmm. So, I basically expect there to be a pareto frontier of the size of the sparse subgraph trading off against loss recovered, and where a sufficiently large subgraph should recover 80%. So we're basically testing the question of whether there's a small subgraph that would recover more than 80%, which seems like an interesting empirical question. I'm more hopeful of this with something like multiple choice questions, where there's a big step change between the small and large models, but not super optimistic.
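
The frontier described above can be sketched as follows. The candidate subgraphs and their scores are entirely hypothetical; in practice each point would come from running some circuit-discovery procedure at a given sparsity and measuring the fraction of the small-to-large gap it recovers.

```python
def pareto_frontier(candidates):
    """Keep subgraphs that no smaller-or-equal subgraph matches or beats.

    candidates: iterable of (num_components, fraction_recovered) pairs.
    Returns the frontier sorted by increasing size.
    """
    frontier = []
    best = float("-inf")
    # Sort by size, breaking ties in favour of higher recovery.
    for size, recovered in sorted(candidates, key=lambda c: (c[0], -c[1])):
        if recovered > best:
            frontier.append((size, recovered))
            best = recovered
    return frontier

def smallest_subgraph_reaching(frontier, target):
    """Size of the smallest frontier subgraph that recovers at least `target`."""
    for size, recovered in frontier:
        if recovered >= target:
            return size
    return None

# Hypothetical (size, fraction recovered) measurements.
candidates = [(5, 0.30), (10, 0.25), (20, 0.55), (40, 0.50), (80, 0.82), (200, 0.90)]
frontier = pareto_frontier(candidates)
print(frontier)
print(smallest_subgraph_reaching(frontier, target=0.80))
```

The empirical question is then how small the returned size is for an 80% target, keeping in mind that a sparse subgraph alone is not yet an explanation.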

Let me reflect for a bit on whether this means I should now agree with your broader point

One objection on multiple choice in particular is that I expect it to be downstream of the broad edifice of "factual knowledge" which is likely spread across a bunch of MLP layers. Maybe if we zoomed in on a specific prompt pair? Though that would be messy because now there'd be a ton of heuristics boosting either answer that would normally average out as noise (eg that John is a more common name than Mary in IOI)

Buck

I object to describing this as a question about whether there's a sparse subgraph which explains most of the behavior--IMO, even if a sparse set of heads are all that are relevant, you still need to argue that the thing done by those heads is well described by your explanation.

Neel Nanda

Can you elaborate on what you mean by explanation here?

I'm pretty on board with the claim that sparse subgraphs (eg the kind of thing ACDC gives you) are not sufficient to be considered explanations, and you need to attach actual semantic meaning to the nodes and connections and intermediate states

Buck

I'm pretty on board with the claim that sparse subgraphs (eg the kind of thing ACDC gives you) are not sufficient to be considered explanations, and you need to attach actual semantic meaning to the nodes and connections and intermediate states

yep, that's all I'm saying

Buck

Re enumerative safety, presumably what's interesting is "likelihood ratio in favor of model being safe" as a function of "proportion of features explained (whatever that means)"; there probably isn't a single threshold.

Could mechanistic interpretability rule out deceptive alignment?

ryan_greenblatt

It’s not clear how mech interp could provide strong evidence against scheming (deceptive alignment) even if mech interp were much better.

If mech interp doesn’t explain or reverse engineer most cognition/components, then the remaining components could contain scheming. (The probability of success here will come down to some complicated question of how we sampled components and from what fraction of components scheming is identifiable which will in turn depend on our decomposition into components.) 

And even if it explained 100% of what was going on by providing a huge program with human understandable variables that had roughly the impact you’d expect when changed, and which was locally understandable, it’s not clear we could rule out unforeseen interactions from effectively implementing scheming (except if that huge program had a very nice and clean internal organization, which seems very unlikely). (Various links for this, can find later, e.g. johnwentworth, complex systems stuff, etc.)

It might be hard to test if mech interp can reliably detect scheming using model organisms because the internals of model organisms might be pretty different than natural cases.

Therefore, even in the worlds where it goes well, it seems unlikely that mech interp evidence would result in me making a >10x likelihood ratio update against deceptive alignment.
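
For readers who want the arithmetic behind a "10x likelihood ratio update", here is a minimal sketch of the odds form of Bayes' rule. The prior used in the example is a hypothetical placeholder, not a number anyone in this dialogue has stated.

```python
def update_probability(prior, likelihood_ratio):
    """Posterior probability after multiplying the prior odds by a likelihood ratio.

    likelihood_ratio = P(evidence | hypothesis) / P(evidence | not hypothesis).
    A 10x update *against* the hypothesis is likelihood_ratio = 1/10.
    """
    if not 0 < prior < 1:
        raise ValueError("prior must be strictly between 0 and 1")
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)

# Hypothetical prior of 25% on deceptive alignment, then a 10x update against it.
posterior = update_probability(prior=0.25, likelihood_ratio=1 / 10)
print(f"posterior: {posterior:.3f}")
```

So the claim is that, even in good worlds, mech interp evidence would shift the odds by less than this factor.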

However, providing strong evidence of deception could be much easier: mech interp could, if successful, provide extremely good adversarial examples that would lead models to try to take over (note that for this use case, mech interp is in competition with other model internals techniques).

One hope is: Even if mech interp can't explain everything, maybe it can confidently rule things out. I'm skeptical, but this is somewhat plausible IMO. I'm excited about trying to test this ASAP for current methods.

Hopes for mechanistic interpretability being useful

ryan_greenblatt

Hopes (as I see them) for mech interp being useful without explaining 99%:

  • Approach: Maybe we can just focus on safety relevant components and explain >98% of just these components which will be easier. Concern: It seems very hard to explain these components to this degree without understanding other components.
  • Approach: Maybe we can focus on tasks where GPT4 and GPT3.5 gap is big. E.g., GPT4 gets 50% on interview level APPS and GPT3.5 gets 20%, if you could explain up to 40% that would be impressive. Concern: Mech interp will have a hard time getting to 40% on APPS for basically the same reason it has a hard time getting to >98% on webtext. Actually, a lot of stuff is involved.
  • Approach: Maybe we can rule out problematic behavior without that much understanding. E.g., with CoT we can currently rule out problematic thinking to a considerable degree without that much understanding. In particular, maybe we can just look at decompositions like dictionary learning and then rule out problematic behavior without really understanding most of the performance of these components. Concern: There isn't really a particular reason why this would be true as opposed to having lots of interesting stuff in the residual. I'm excited for work which tries to directly test "we can rule stuff out".
  • Approach: For non-ambitious applications, we just need something way less strong. Concern: This seems possible, but I'm not sure why a bottom up or reverse engineering approach would be best as opposed to high level interp.
  • Approach: Look, we'll just end up testing mech interp in practice on various tasks and maybe it will be useful. Concern: I'm excited for testing mech interp in cases where tests are viable. But, I'm worried if we don't do this soon and don't have any local progress metrics to see if mech interp is actually on a good path. (Note that I also think it's hard to test if mech interp can audit for deception without having some more principled story that we can test in parts. This is because I expect various difficulties with testing using model organisms.)
  • Approach: Ok, so we explain low % now, but this will naturally translate into higher % on bigger models. Like lower on the scaling curve it's harder to get high % explained and lower % explained corresponds to a higher fraction of model compute (due to diminishing returns). Concern: I'm skeptical things will be easier with bigger models instead of harder. But, it seems good to see where we're at with small models more precisely. Obviously I'd be super impressed if we could explain arbitrary components of GPT2-medium enough to perform as well as GPT2-sm if we applied our approach to all components.
Neel Nanda

This is because I expect various difficulties with testing using model organisms

Why?

Buck

Re "Why?": The question is how you construct models which have similar structure to the scheming models that might arise naturally:

  • If you restrict the scheming-model-construction-team (e.g., only allowing them to insert scheming into models by doing supervised learning), their constructed models might be much easier to use interp on than real deceptive models are (and also if the models they try to turn into scheming model organisms were already scheming, then the tests might fail for other reasons).
  • If you don't restrict the scheming-model-construction-team in any way, this team is allowed to optimize against your interp methods in a way that I suspect breaks your interp tools (even if those interp tools would have worked to catch a real scheming model).

And so I don't see how to do tests of interp tools that use model organisms that aren't either insufficiently or overly conservative.

habryka

Hopes (as I see them) for mech interp being useful without explaining this much:

I don't know, maybe this is naive, but I've been thinking that the best application of near-term mech-interp is to help identify cases where there is really obvious scheming behavior going on, but it doesn't make it into the chain-of-thought. More using mech-interp together with model organism stuff.

Like, you won't get full confidence on this, but if you have a model that you ask some innocuous question that could be related to the model trying to break out, and then you see all of the murder and deception-associated neurons firing, then IMO this clearly tells you something. And yeah, this won't catch most forms of deception since you have bad coverage, but being able to create artificial environments where this is happening seems like it does something to give you traction, and also serves as a helpful warning sign.

Buck

I think that interp seems like a less promising way to uncover that evidence than non-interp-based methods--it's harder and simpler approaches (like asking the model questions) seem like they're reasonably likely to work

ryan_greenblatt

I think for this case, it would be more effective to just train a probe at an intermediate layer to classify for murder and deception and then see if this activates during suspicious times. We don't need to assume the neuron basis (or autoencoder basis) for this, and assuming it doesn't clearly get us anything IMO.
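To make the probe suggestion concrete, here is a minimal sketch of a linear (logistic-regression) probe. Everything in it is an assumption for illustration: the activations are synthetic stand-ins for an intermediate layer, `concept_direction` is a made-up direction playing the role of "deception", and the names `train_linear_probe` and `probe_scores` are hypothetical. A real version would use activations cached from the model on labelled prompts.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_linear_probe(acts, labels, lr=0.3, steps=2000):
    """Fit a logistic-regression probe with plain gradient descent.

    acts:   (n_samples, d_model) activations from one layer
    labels: (n_samples,) 0/1 labels for the concept of interest
    """
    n_samples, d_model = acts.shape
    w = np.zeros(d_model)
    b = 0.0
    for _ in range(steps):
        err = sigmoid(acts @ w + b) - labels   # d(cross-entropy)/d(logit)
        w -= lr * (acts.T @ err) / n_samples
        b -= lr * err.mean()
    return w, b

def probe_scores(acts, w, b):
    """Probability the probe assigns to the concept being present."""
    return sigmoid(acts @ w + b)

# Synthetic stand-in data (assumption): the concept shifts activations
# along a single fixed direction, on top of unit Gaussian noise.
rng = np.random.default_rng(0)
d_model = 32
concept_direction = rng.normal(size=d_model)
concept_direction /= np.linalg.norm(concept_direction)
labels = rng.integers(0, 2, size=400).astype(float)
acts = rng.normal(size=(400, d_model)) + 4.0 * labels[:, None] * concept_direction

# Train on the first 300 examples, evaluate on the held-out 100.
w, b = train_linear_probe(acts[:300], labels[:300])
held_out_pred = probe_scores(acts[300:], w, b) > 0.5
held_out_acc = (held_out_pred == (labels[300:] > 0.5)).mean()
probe_cosine = w @ concept_direction / np.linalg.norm(w)
```

The probe never looks at individual neurons or dictionary features, which is the point being made: it only needs labelled examples and a layer to read from.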

If you actually did look at the connections between various (SAE) components and did intervention experiments, then you could improve beyond a probe, but you have to actually do this, and now we're really getting into a pretty different story.

Neel Nanda

+1 that this seems the obvious thing to try first, though I would be pretty curious to see if you can find an SAE feature for this, and I would trust that somewhat more than a probe. Seems harder and less reliable though, and I expect deception to in fact split into many smaller and subtler features, in a complex and weird way that's somewhat continuous with social modelling as a whole

ryan_greenblatt

To be clear, we should totally do something like what you suggested and it's pretty low dignity to not. (I just think that training a probe is currently the best technique here.)

habryka

(I have objections, but seems better to stay focused on the main thread)

ryan_greenblatt

I think we should consider talking about "what tasks would get much closer to mech interp applications". And then thinking about concrete project proposals.

Neel Nanda

I'm down for this

Immediately-relevant mechanistic interpretability projects

Neel Nanda

Tasks I'm curious about:

  • Hallucinations
  • Why did the model refuse that request?
  • Why did the model get jailbroken by this?
  • Was this chain of thought trace faithful?
  • Why does BingChat gaslight/manipulate users?
  • More speculative:
    • Do models plan?
    • Do models have goals?
Buck

These are so much harder than the kinds of things that interpretability has previously been applied to

Neel Nanda

Would it be a significant update for you if any of these were successful? I tentatively feel like current techniques (esp activation patching between various counterfactuals) are good enough to give some insight on some of these in narrow cases. Solving any of these in general seems pretty ambitious.

The first 3 are all things I'm trying to get MATS scholars to take on, so it'd be good to know if this is too wildly ambitious and not in a way that'd fail gracefully!

Buck

Yes, I'd be shocked and very impressed if any of those succeeded by my definition. (Note that by my definition no interp has ever succeeded on a model trained on a real task, afaik.)

Neel Nanda

What about an explanation explaining 20-60%? And what if that mediocre explanation then had some downstream relevance (eg motivating a novel jailbreak)

ryan_greenblatt

I'm always excited about downstream relevance, but I think it's important to actually do serious comparison with other methods in that case. And, we shouldn't treat the mediocre explanation as much evidence (the juice came from using the explanation for something).

ryan_greenblatt

Some (very ambitious) tasks I'm maybe excited about:

  • (hard to set up) Can we rule out algorithms based on doing something like SAE + looking at connections? I'm not sure what the exact empirical test bed for this looks like and this both probably requires SAE on big models (which is maybe fucked) and also some cases where we construct models which do or don't do a given algo which is non-obvious from simple black box tests.
  • GPT4 can do a surprising number of search-like problems related to mathematics, including in fuzzy cases. (I can't find examples right now, but I think this does happen.) Can we explain what's going on in a way which allows us to make substantial predictions about what GPT4 can and can't do here? (Not clear mech interp is the right tool here, I'm interested in the no-holds-barred version)
  • Can we use mech interp to find adversarial examples in a case where we've trained a model to have some kinda specific behavior in rare circumstances based on a somewhat tricky condition. Like imagine we train a model to have a sad tone if the text contains incorrect code. Can we figure this out and construct example?
  • Just measuring the training compute explained for our best explanations for random SAE components. I can get into details here if of interest. I'm interested in the version where our explanations are very high effort, but have to be actual english (but can reference prior components or dataset statistics in non-insane ways)

(I have more ideas, but want to get to other stuff.)

Neel Nanda

Can we use mech interp to find adversarial examples in a case where we've trained a model to have some kinda specific behavior in rare circumstances based on a somewhat tricky condition. Like imagine we train a model to have a sad tone if the text contains incorrect code. Can we figure this out and construct example?

This seems a fun problem to me! This would plausibly be tractable if we had good, working SAEs, such that you could look for any surprising connections to the sad tone feature

ryan_greenblatt

Why did the model refuse that request?

Proposed success criteria: non-trivial strong predictions about refusals. For instance, maybe we can construct examples which look totally innocent but which strongly make the model refuse. (Given our disagreements about % explained : ))

Using a methodology where we first iterate on explaining more seems totally fine of course.

I'm thinking about a version of this project on a model which is like 2-7B (e.g. similar to GPT3 level perf, but overtrained and heavily trained on chat and redteamed to refuse etc.)

Personally, I'm pretty pessimistic about a version of this project which goes totally bottom-up on big models. But, I think it seems good to try a no-holds-barred version of this. And then approaches like SAE and doing corresponding intervention experiments could maybe work?

I think strong results here would probably be the best interp results by far.

ryan_greenblatt

Was this chain of thought trace faithful?

I'm currently scared that this project will be very confusing to evaluate and maybe also confusing to run.

You could use a dataset of known cases where CoT obviously isn't faithful? (E.g. the model can answer in one forward pass without CoT and it just confabulates an unrelated answer to the actual reason as established with black box experiments.) And maybe try to make predictions about a very different dataset which may or may not have similar properties? Seems hard/confusing though.

Neel Nanda

Fleshing out the refusal project:

  • Take a model like LLaMA 2 7B Chat
  • Take a dataset of tasks where the model may or may not refuse. Ideally with a specific token in the output that determines whether it's a refusal or not, that can be our patching metric. Use activation patching to find a sparse subgraph for this
  • Zoom in on some specific examples (eg changing a key word from bomb to carrot) and patching to see which nodes respond most to this, to get a better feel for the moving parts here.
  • Zoom in on the most important nodes found via activation patching, and train an SAE on their output (on a narrow dataset of chat/red-teaming prompts) to try to find interpretable features in there. See how much of the performance of these nodes can be explained by the output of the SAE, and hope really hard that the resulting features are interpretable.
  • ????
  • Profit 
  • (My more serious position is that I'd hope we get lucky and find some interesting structure to zoom in on, which gives us some insight into the refusal circuit, but I find it hard to forecast this far in advance, I expect this project to involve a fair amount of pivots)
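As a rough illustration of the activation patching step in the list above, here is a toy sketch. It is not LLaMA 2 7B Chat: the network is a hand-written two-layer ReLU net, the weights and inputs are invented, and `forward` and `patching_effects` are hypothetical names. The idea it shows is the general one: run a clean input and a corrupted input (e.g. "bomb" swapped for "carrot"), copy one clean activation at a time into the corrupted run, and see how much of the metric gap comes back.

```python
import numpy as np

def forward(x, W1, W2, patch=None):
    """Tiny 2-layer ReLU network with a scalar output.

    `patch` optionally maps hidden-unit index -> value to overwrite.
    Returns (output, hidden activations after any patching).
    """
    hidden = np.maximum(0.0, W1 @ x)
    if patch:
        hidden = hidden.copy()
        for idx, value in patch.items():
            hidden[idx] = value
    return W2 @ hidden, hidden

def patching_effects(x_clean, x_corrupt, W1, W2):
    """Fraction of the clean-vs-corrupted output gap restored by
    patching each hidden unit's clean activation into the corrupted run."""
    clean_out, clean_hidden = forward(x_clean, W1, W2)
    corrupt_out, _ = forward(x_corrupt, W1, W2)
    gap = clean_out - corrupt_out
    effects = np.zeros(len(clean_hidden))
    for i in range(len(clean_hidden)):
        patched_out, _ = forward(x_corrupt, W1, W2, patch={i: clean_hidden[i]})
        effects[i] = (patched_out - corrupt_out) / gap
    return effects

# Invented toy weights: hidden unit 0 reads the "key word" input coordinate.
W1 = np.array([[2.0, 0.0, 0.0],
               [0.0, 1.0, 0.0],
               [0.0, 0.0, 1.0],
               [0.5, 0.5, 0.0]])
W2 = np.array([3.0, 1.0, -1.0, 1.0])   # scalar "refusal logit" readout

x_clean = np.array([1.0, 1.0, 0.5])    # key word present
x_corrupt = np.array([0.0, 1.0, 0.5])  # key word swapped out

effects = patching_effects(x_clean, x_corrupt, W1, W2)
most_important_unit = int(effects.argmax())
```

In this toy the readout is linear in the hidden units, so the per-unit effects sum to one. In a real transformer they generally would not, which is part of why finding a sparse subgraph takes iteration.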
ryan_greenblatt

Do models plan?

It feels like there are a bunch of cases where something sort of like planning or search must be going on in SOTA models (e.g. GPT4). But I think this is way less clear in smaller models. If we could, I'd be excited about work analyzing how GPT4 decides what to say in cases where planning or search seems really useful. Seems hard though.

Neel Nanda

Proposed success criteria: non-trivial strong predictions about refusals. For instance, maybe we can construct examples which look totally innocent but which strongly make the model refuse. (Given our disagreements about % explained : ))
 

I'd be excited about this, and it might be possible if we combine it with SAEs to search for unexpected features that are related to refusal, to construct adversarial examples? We'd likely need to solve some sub-problems to do with training good SAEs first. 

ryan_greenblatt

For a bunch of these projects, it seems pretty unclear if mech interp-y methods are the best tool and we can measure success using downstream metrics (as opposed to internal validity metrics like % explained).

I think mech interp industrial policy is reasonable (subsidize work on mech interp even if returns per unit time seem worse), but I'm sad if the researchers don't seriously try other methods and compare. Like could we have learned these facts by just playing with the model? What about doing simple probing experiments or intervention experiments with techniques like LEACE?

habryka

Seems like we're out of time, so we'll have to cut the fleshing out of interpretability projects short, but I think what we have still seems useful. 

Thank you all for participating in this!

ryan_greenblatt

I'm somewhat sad that in this dialogue, I don't feel like we reached a consensus or a crux on a bunch of claims that seem important to me. Hopefully, even if we didn't get all the way there, this dialogue can still be useful in advancing thinking about mech interp.

Thanks to everyone for participating/helping! (In particular, I appreciate habryka's editing and Neel for being willing to engage with Buck and me on this topic.)

It's plausible that we should try to do a follow-up dialogue where we try really hard to stay on topic and drill into one of these topics (if the other parties are willing). That said, I'm pretty unsure if this is worth doing, and my views might depend on the response to this dialogue.

Neel Nanda

Yeah, it's a shame that we didn't really reach any conclusions, but this was interesting! I particularly found the point about "you need 99% loss recovered because that's the difference between gpt-3.5 and 4" to be interesting. Thanks to you and Buck for the dialogue, and Habryka for moderating

54 comments

This dialogue mostly makes me want to rant about how all y'all are doing mech interp wrong. So, rant time. This-is-a-rant-so-not-necessarily-reflectively-endorsed, etc.

Starting point: Science In A High-Dimensional World. Quoting from that post:

In a high-dimensional world like ours, there are billions of variables which could influence an outcome. The great challenge is to figure out which variables are directly relevant - i.e. which variables mediate the influence of everything else. In practice, this looks like finding mediators and hunting down sources of randomness. Once we have a set of control variables which is sufficient to (approximately) determine the outcome, we can (approximately) rule out the relevance of any other variables in the rest of the universe, given the control variables.

A remarkable empirical finding across many scientific fields, at many different scales and levels of abstraction, is that a small set of control variables usually suffices. Most of the universe is not directly relevant to most outcomes most of the time.

Ultimately, this is a picture of “gears-level science”: look for mediation, hunt down sources of randomness, rule out the influence of all the other variables in the universe.

This applies to interpretability just like any other scientific field. The real gold-standard thing to look for is some relatively-small set of variables which determine some other variables, basically-deterministically. Or, slightly weaker: a relatively-small Markov blanket which screens off some chunk of the system from everything else.

In order for this to be useful, the determinism/screening does need pretty high precision - e.g. Ryan's 99% number sounds like a reasonable day-to-day heuristic, many nines might be needed if there's a lot of bits involved, etc.

On the flip side, this does not necessarily need to look like a complete mechanistic explanation. Ideally, findings of screening are the building blocks from which a complete mechanistic model is built. The key point is that findings of screening provide an intermediate unit of progress, in between "no clue what's going on" and "full mechanistic interpretation". Those intermediate units of progress can be directly valuable in their own right, because they allow us to rule things out: (one way to frame) the whole point of screening is that lots of interactions are ruled out. And they directly steer the search for mechanistic explanations, by ruling out broad classes of models.

That is the sort of approach to mech interp which would be able to provide valuable incremental progress on large models, not just toy models, because it doesn't require understanding everything about a piece before something useful is produced.

(Side note: yet another framing of all this would be in terms of modules/modularity.)

This seems like exactly what mech interp is doing? Circuit finding is all about finding sparse subgraphs. It continues to work with large models, when trying to explain a piece of the behavior of the large model. SAE stands for sparse autoencoder: the whole point is to find the basis in which you get sparsity. I feel like a lot of mech interp has been almost entirely organized around the principle of modularity / sparsity, and the main challenge is that it's hard (you don't get to 99% of loss recovered, even on pieces of behavior, while still being meaningfully sparse).
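For readers who haven't seen the "loss recovered" number computed, here is a small sketch of one common convention. This is an assumption: the thread does not pin down a definition, and the function name and the example losses are made up. The explanation is scored by where its loss falls between the original model's loss and the loss when the component is ablated.

```python
def fraction_loss_recovered(clean_loss, explained_loss, ablated_loss):
    """Share of the ablation-induced loss gap that the explanation closes.

    clean_loss:     loss of the original model
    explained_loss: loss when the component is replaced by our explanation
                    (e.g. a circuit or an SAE reconstruction)
    ablated_loss:   loss when the component is ablated (e.g. zeroed out)
    """
    gap = ablated_loss - clean_loss
    if gap <= 0:
        raise ValueError("ablation should hurt the loss for this metric to make sense")
    return (ablated_loss - explained_loss) / gap

# Made-up losses, purely to show the arithmetic.
recovered = fraction_loss_recovered(clean_loss=3.00, explained_loss=3.02, ablated_loss=5.00)
```

With these invented numbers the explanation closes 1.98 of a 2.00 gap, i.e. 99%. The claim being debated is that real circuits and SAEs land well short of that while staying sparse.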

One piece missing here, insofar as current methods don't get to 99% of loss recovered, is repeatedly drilling into the residual until they do get to 99%. That's a pretty core part of what makes science work, in general. And yeah, that's hard (at least in the sense of being a lot of work; more arguable whether it's hard in a stronger sense than that).

One piece missing here, insofar as current methods don't get to 99% of loss recovered, is repeatedly drilling into the residual until they do get to 99%.

When you do that using existing methods, you lose the sparsity (e.g. for circuit finding you have to include a large fraction of the model to get to 99% loss recovered).

It's of course possible that this is because the methods are bad, though my guess is that at the 99% standard this is reflecting non-sparsity / messiness in the territory (and so isn't going to go away with better methods). I do expect we can improve; we're very far from the 99% standard. But the way we improve won't be by "drilling into the residual"; that has been tried and is insufficient. EDIT: Possibly by "drill into the residual" you mean "understand why the methods don't work and then improve them" -- if so I agree with that but also think this is what mech interp researchers want to do.

(Why am I still optimistic about interpretability? I'm not convinced that the 99% standard is required for downstream impact -- though I am pretty pessimistic about the "enumerative safety" story of impact, basically for the same reasons as Buck and Ryan afaict.)

It's of course possible that this is because the methods are bad, though my guess is that at the 99% standard this is reflecting non-sparsity / messiness in the territory (and so isn't going to go away with better methods).

I have the opposite expectation there; I think it's just that current methods are pretty primitive.

I also think current methods are mostly focused on linear interp. But what if it just ain't linear.

SAEs are almost the opposite of the principle John is advocating for here. They deliver sparsity in the sense that the dictionary you get only has a few neurons not in the zero state at the same time. They do not deliver sparsity in the sense of a low-dimensional summary of the relevant information in the layer, or whatever other causal cut you deploy them on. Instead, the dimensionality of the representation gets blown up to be even larger.

My understanding was that John wanted to only have a few variables mattering on a given input, which SAEs give you. The causal graph is large in general, but IMO that's just an unavoidable property of models and superposition.

I'm confused by why you don't consider "only a few neurons being non-zero" to be a "low dimensional summary of the relevant information in the layer"

The causal graph is large in general, but IMO that's just an unavoidable property of models and superposition.

This is a discussion that would need to be its own post, but I think superposition is basically not real and a confused concept. 

Leaving that aside, the vanilla reading of this claim also seems kind of obviously false for many models, otherwise optimising them in inference through e.g. low rank approximation of weight matrices would never work. You are throwing away at least one floating point number worth of description bits there.

I'm confused by why you don't consider "only a few neurons being non-zero" to be a "low dimensional summary of the relevant information in the layer"

A low-dimensional summary of a variable vector $v$ of size $n$ is a fixed set of $k < n$ random variables that suffice to summarise the state of $v$. To summarise the state of $v$ using the activations in an SAE dictionary, I have to describe the state of more than $n$ variables. That these variables are sparse may sometimes let me define an encoding scheme for describing them that takes less than $n$ variables, but that just corresponds to undoing the autoencoding and then performing some other compression.
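A small sketch of the "some other compression" point, under made-up sizes: an SAE code has more coordinates than the original vector, and the shorter description only appears once you switch to a different encoding, here (index, value) pairs for the active entries. The sizes, the active indices, and the helper names are all hypothetical.

```python
import numpy as np

def to_index_value_pairs(sparse_code):
    """Re-encode a sparse vector as (indices of active entries, their values)."""
    idx = np.flatnonzero(sparse_code)
    return idx, sparse_code[idx]

def from_index_value_pairs(idx, values, dict_size):
    """Invert the re-encoding back to the full dictionary-sized vector."""
    code = np.zeros(dict_size)
    code[idx] = values
    return code

n = 8                  # width of the original activation vector (made up)
dict_size = 16 * n     # SAE dictionary, 16x wider than the activations

# A hand-written sparse SAE code with three active features.
code = np.zeros(dict_size)
code[[3, 40, 97]] = [0.5, 1.2, 0.7]

idx, values = to_index_value_pairs(code)
numbers_needed = len(idx) + len(values)   # one index plus one value per active feature
roundtrip = from_index_value_pairs(idx, values, dict_size)
```

Here six numbers suffice, fewer than n = 8, but which six coordinates they refer to changes from input to input. That is what distinguishes this from a fixed set of k < n summary variables.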

This is a discussion that would need to be its own post, but I think superposition is basically not real and a confused concept.

I'd be curious to hear more about this - IMO we're talking past each other given that we disagree on this point! Like, in my opinion, the reason low rank approximations work at all is because of superposition.

For example, if an SAE gives us 16x as many dimensions as the original activations, and we find that half of those are interpretable, to me this seems like clear evidence of superposition (8x as many interpretable directions!). How would you interpret that phenomenon?

For example, if an SAE gives us 16x as many dimensions as the original activations, and we find that half of those are interpretable, to me this seems like clear evidence of superposition (8x as many interpretable directions!). How would you interpret that phenomenon?
 

I don't have the time and energy to do this properly right now, but here's a few thought experiments to maybe help communicate part of what I mean:

Say you have a transformer model that draws animals.  As in, you type “draw me a giraffe”,  and then it draws you a giraffe. Unknown to you, the way the model algorithm works is that the first thirty layers of the model perform language processing to figure out what you want drawn, and output a summary of fifty scalar variables that the algorithms in the next thirty layers of the model use to draw the animals. And these fifty variables are things like “furriness”, “size”, “length of tail” and so on.

The latter half of the model does then not, in any real sense, think of the concept “giraffe” while it draws the giraffe. It is just executing purely geometric algorithms that use these fifty variables to figure out what shapes to draw. 

If you then point a sparse autoencoder at the residual stream in the latter half of the model, over a data set of people asking the network to draw lots of different animals, far more than fifty or the network width, I’d guess the “sparse features” the SAE finds might be the individual animal types: “giraffe”, “elephant”, etc.

Or, if you make the encoder dictionary larger, more specific sparse features like “fat giraffe” would start showing up. 

And then, some people may conclude that the model was doing a galaxy-brained thing where it was thinking about all of these animals using very little space, compressing a much larger network in which all these animals are variables. This is kind of true in a certain sense if you squint, but pretty misleading. The model at this point in the computation no longer “knows” what a giraffe is. It just “knows” what the settings of furriness, tail length, etc. are right now. If you manually go into the network and set the fifty variables to something that should correspond to a unicorn, the network will draw you a unicorn, even if there were no unicorns in the training data and the first thirty layers in the network don’t know how to set the fifty variables to draw one. So in a sense, this algorithm is more general than a cleverly compressed lookup table of animals would be. And if you want to learn how the geometric algorithms that do the drawing work, what they do with the fifty scalar summary statistics is what you will need to look at.

Just because we can find a transformation that turns an NN's activations into numbers that correlate with what a human observer would regard as separate features of the data, does not mean the model itself is treating these as elementary variables in its own computations in any meaningful sense. 

The only thing the SAE is showing you is that the information present in the model can be written as a sum of some sparsely activating generators of the data. This does not mean that the model is processing the problem in terms of these variables. Indeed, SAE dictionaries are almost custom-selected not to give you variables that a well-generalizing algorithm would use to think about problems with big, complicated state spaces. Good summary variables are highly compositional, not sparse. They can all be active at the same time in any setting, letting you represent the relevant information from a large state space with just a few variables, because they factorise. Temperature and volume are often good summary variables for thinking about thermodynamic systems because the former tells you nothing about the latter and they can co-occur in any combination of values. Variables with strong sparsity conditions on them instead have high mutual information, making them partially redundant, and ripe for compressing away into summary statistics.

If an NN (artificial or otherwise) is, say, processing images coming in from the world, it is dealing with an exponentially large state space. Every pixel can take one of several values. Luckily, the probability distribution of pixels is extremely peaked. The supermajority of pixel settings are TV static that never occurs, and thermal noise that doesn't matter for the NN's task. One way to talk about this highly peaked pixel distribution may be to describe it as a sum of a very large number of sparse generators. The model then reasons about this distribution by compressing the many sparse generators into a small set of pretty non-sparse, highly compositional variables. For example, many images contain one or a few brown branchy structures of a certain kind, which come in myriad variations. The model summarises the presence or absence of any of these many sparse generators with the state of the variable “tree”, which tracks how much the input is “like a tree”.

If the model has a variable “tree” and a variable “size”, the myriad brown, branchy structures in the data might, for example, show up as sparsely encoded vectors in a two-dimensional (“tree”,“size”) manifold. If you point a SAE at that manifold, you may get out sparse activations like “bush” (mid tree, low size) “house” (low tree, high size), “fir” (high tree, high size). If you increase the dictionary size, you might start getting more fine-grained sparse data generators. E.g. “Checkerberry bush” and “Honeyberry bush” might show up as separate, because they have different sizes.
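Here is a toy version of the (“tree”, “size”) picture, as a sketch only. The prototype coordinates and noise level are invented, and nearest-prototype assignment is used as a crude stand-in for what a sparse dictionary might find; it is not an SAE. It just contrasts the two descriptions of the same points: two dense coordinates that are always "on", versus one-hot category features that are each on only some of the time.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented (tree, size) coordinates for the three example categories.
prototypes = {"bush": (0.5, 0.1), "house": (0.1, 0.9), "fir": (0.9, 0.9)}
names = list(prototypes)
centers = np.array([prototypes[k] for k in names])

# Sample inputs: each is one category plus a little noise in the 2-D manifold.
true_kind = rng.integers(0, len(names), size=600)
points = centers[true_kind] + 0.05 * rng.normal(size=(600, 2))

# Stand-in for a sparse dictionary: assign each point to its nearest prototype.
dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
assigned = dists.argmin(axis=1)
one_hot = np.eye(len(names))[assigned]          # (600, 3), exactly one active per row

# How often is each variable "active" (non-zero)?
dense_fraction_active = (np.abs(points) > 1e-6).mean(axis=0)   # the two summary variables
sparse_fraction_active = one_hot.mean(axis=0)                  # the three category features
```

The two summary variables are active on essentially every input, while each category feature fires on roughly a third of them. With more categories in the same 2-D manifold the category features would get sparser and more numerous, but the underlying description would still be two numbers.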

Humans, I expect, work similarly. So the human-like abstractions the model may or may not be thinking in and that we are searching for will not come in the form of sparse generators of layer activations, because human abstractions are the summary variables you would be using to compress these sparse generators. They are the type-of-thing you use to encode a sparse world, not the type-of-thing being encoded. That our SAE is showing us some activations that correlate with information in the input humans regard as meaningful just tells us that the data contains sparse generators humans have conceptual descriptions for, not that the algorithms of the network themselves are encoding the sparse generators using these same human conceptual descriptions. We know it hasn't thrown away the information needed to compute that there was a bush in the image, but we don't know it is thinking in bush. It probably isn't, else bush would not be sparse with respect to the other summary statistics in the layer, and our SAE wouldn't have found it.

 


This is a great, thought-provoking critique of SAEs.

That said, I think SAEs make more sense if we're trying to explain an LLM (or any generative model of messy real-world data) than they do if we're trying to explain the animal-drawing NN.

In the animal-drawing example:

  • There's only one thing the NN does.
  • It's always doing that thing, for every input.
  • The thing is simple enough that, at a particular point in the NN, you can write out all the variables the NN cares about in a fully compositional code and still use fewer coordinates (50) than the dictionary size of any reasonable SAE.

With something like an LLM, we expect the situation to be more like:

  • The NN can do a huge number of "things" or "tasks."  (Equivalently, it can model many different parts of the data manifold with different structures.)
  • For any given input, it's only doing roughly one of these "tasks."
  • If you try to write out a fully compositional code for each task -- akin to the size / furriness / etc. code, but we have a separate one for every task -- and then take the Cartesian product of them all to get a giant compositional code for everything at once, this code would have a vast number of coordinates.  Much larger than the activation vectors we'd be explaining with an SAE, and also much larger than the dictionary of that SAE.
  • The aforementioned code would also be super wasteful, because it uses most of its capacity expressing states where multiple tasks compose in an impossible or nonsensical fashion.  (Like "The height of the animal currently being drawn is X, AND the current Latin sentence is in the subjunctive mood, AND we are partway through a Rust match expression, AND the author of this op-ed is very right-wing.")
  • The NN doesn't have enough coordinates to express this Cartesian product code, but it also doesn't need to do so, because the code is wasteful.  Instead, it expresses things in a way that's less-than-fully-compositional ("superposed") across tasks, no matter how compositional it is within tasks.
  • Even if every task is represented in a maximally compositional way, the per-task coordinates are still sparse, because we're only doing ~1 task at once and there are many tasks.  The compositional nature of the per-task features doesn't prohibit them from being sparse, because tasks are sparse.
  • The reason we're turning to SAEs is that the NN doesn't have enough capacity to write out the giant Cartesian product code, so instead it leverages the fact that tasks are sparse, and "re-uses" the same activation coordinates to express different things in different task-contexts.
    • If this weren't the case, interpretability would be much simpler: we'd just hunt for a transformation that extracts the Cartesian product code from the NN activations, and then we're done.
    • If it existed, this transformation would probably (?) be linear, b/c the information needs to be linearly retrievable within the NN; something in the animal-painter that cares about height needs to be able to look at the height variable, and ideally to do so without wasting a nonlinearity on reconstructing it.
  • Our goal in using the SAE is not to explain everything in a maximally sparse way; it's to factor the problem into (sparse tasks) x (possibly dense within-task codes).
  • Why might that happen in practice?  If we fit an SAE to the NN activations on the full data distribution, covering all the tasks, then there are two competing pressures:
    • On the one hand, the sparsity loss term discourages the SAE from representing any given task in a compositional way, even if the NN does so.  All else being equal, this is indeed bad.
    • On the other hand, the finite dictionary size discourages the SAE from expanding the number of coordinates per task indefinitely, since all the other tasks have to fit somewhere too.
  • In other words, if your animal-drawing case is one of the many tasks, and the SAE is choosing whether to represent it as 50 features that all fire together or 1000 one-hot highly-specific-animal features, it may prefer the former because it doesn't have room in its dictionary to give every task 1000 features.
  • This tension only appears when there are multiple tasks.  If you just have one compositionally-represented task and a big dictionary, the SAE does behave pathologically as you describe.
    • But this case is different from the ones that motivate SAEs: there isn't actually any sparsity in the underlying problem at all!
    • Whereas with LLMs, we can be pretty sure (I would think?) that there's extreme sparsity in the underlying problem, due to dimension-counting arguments, intuitions about the number of "tasks" in natural data and their level of overlap, observed behaviors where LLMs represent things that are irrelevant to the vast majority of inputs (like retrieving very obscure facts), etc.
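To make the two competing pressures above concrete, here is a minimal sketch of the kind of objective being discussed. It assumes a one-layer sparse autoencoder with a ReLU encoder and an L1 penalty; the names (`sae_loss`, `l1_coeff`, `dict_size`) and the exact form of the loss are illustrative assumptions, not details taken from any implementation mentioned in this dialogue.

```python
import numpy as np

def sae_loss(x, W_enc, b_enc, W_dec, b_dec, l1_coeff):
    """Reconstruction error plus an L1 sparsity penalty for a one-layer SAE.

    x:     (batch, d_model) NN activations
    W_enc: (d_model, dict_size), W_dec: (dict_size, d_model)
    """
    f = np.maximum(0.0, x @ W_enc + b_enc)      # feature activations, all >= 0
    x_hat = f @ W_dec + b_dec                   # reconstruction of the activations
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(f), axis=1))
    return recon + l1_coeff * sparsity, f

rng = np.random.default_rng(0)
d_model, dict_size = 16, 64   # dict_size is finite: every task shares these slots
x = rng.normal(size=(8, d_model))
W_enc = rng.normal(size=(d_model, dict_size)) * 0.1
W_dec = rng.normal(size=(dict_size, d_model)) * 0.1
loss, f = sae_loss(x, W_enc, np.zeros(dict_size), W_dec, np.zeros(d_model), l1_coeff=1e-3)
```

In this sketch the `l1_coeff * sparsity` term is the pressure against dense within-task codes, and the fixed `dict_size` is the pressure against giving any one task thousands of one-hot features.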

The way I would phrase this concern is "SAEs might learn to pick up on structure present in the underlying data, rather than to pick up on learned structure in NN activations." E.g. since "tree" is a class of things defined by a bunch of correlations present in the underlying image data, it's possible that images of trees will naturally cluster in NN activations even when the NN has no underlying tree concept; SAEs would still be able to detect and learn this cluster as one of their neurons.

I agree this is a valid critique. Here's one empirical test which partially gets at it: what happens when you train an SAE on a NN with random weights? (I.e. you randomize the parameters of your NN, and then train an SAE on its activations on real data in the normal way.) Then to the extent that your SAE has good-looking features, that must be because your SAE was picking up on structure in the underlying data.

My collaborators and I did this experiment. In more detail, we trained SAEs on Pythia-70m's MLPs, then did this again but after randomizing the weights of Pythia-70m. Take a moment to predict the results if you want etc etc.


The SAEs that we trained on a random network looked bad. The most interesting dictionary features we found were features that activated on particular tokens (e.g. features that activated on the "man" token and no others). Most of the features didn't look like anything at all, activating on a large fraction (>10%) of tokens in our data, with no obvious patterns. (The features for dictionaries trained on the non-random network looked much better.)

We also did a variant of this experiment where we randomized Pythia-70m's parameters except for the embedding layer. In this variant, the most interesting features we found were features which fired on a few closely semantically related tokens (e.g. the tokens "make," "makes," and "making").

Thanks to my collaborators for this experiment: Aaron Mueller and David Bau.


I agree that a reasonable intuition for what SAEs do is: identify "basic clusters" in NN activations (basic in the sense that you allow compositionality, i.e. you don't try to learn clusters whose centroids are the sums of the centroids of previously-learned clusters). And these clusters might exist because:

  1. your NN has learned concepts and these clusters correspond to concepts (what we hope is the reason), or
  2. correlations are present in your underlying data (the thing that you seem to be worried about).

Beyond the preliminary empirics I mentioned above, I think there are some theoretical reasons to hope that SAEs will mostly learn the first type of cluster:

  • Most clusters in NN activations on real data might be of the first type
    • This is because the NN has already, during training, noticed various correlations in the data and formed concepts around them (to the extent that these concepts were useful for getting low loss, which they typically will be if your model is trained on next-token prediction (a task which incentivizes you to model all the correlations)).
  • Clusters of the second type might not have any interesting compositional structure, but your SAE gets bonus points for learning clusters which participate in compositional structure.
    • E.g. If there are five clusters with centroids w, x, y, z, and y + z and your SAE can only learn 2 of them, then it would prefer to learn the clusters with centroids y and z (because then it can model the cluster with centroid y + z for free).
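The five-centroid example above can be checked numerically. This is a toy sketch under strong assumptions: the centroids w, x, y, z are taken to be orthogonal unit vectors, and "modeling a cluster" is approximated by least-squares reconstruction of its centroid from the chosen dictionary directions (a real SAE has non-negative activations and a sparsity penalty, which this ignores).

```python
import numpy as np

w, x, y, z = np.eye(4)                    # assumed: four orthogonal unit centroids
centroids = {"w": w, "x": x, "y": y, "z": z, "y+z": y + z}

def total_error(dictionary):
    """Sum over all five centroids of the squared residual left after
    least-squares reconstruction from the chosen dictionary directions."""
    D = np.stack(dictionary, axis=1)      # shape (4, k), one column per direction
    err = 0.0
    for c in centroids.values():
        coef, *_ = np.linalg.lstsq(D, c, rcond=None)
        err += np.sum((c - D @ coef) ** 2)
    return err

err_yz = total_error([y, z])   # y, z and y + z are all reconstructed exactly
err_wx = total_error([w, x])   # only w and x are reconstructed
```

With these assumptions, choosing y and z leaves only w and x unexplained, while choosing w and x leaves y, z, and y + z unexplained, so the first choice has the lower total error.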

In Towards Monosemanticity we also did a version of this experiment, and found that the SAE was much less interpretable when the transformer weights were randomized (https://transformer-circuits.pub/2023/monosemantic-features/index.html#appendix-automated-randomized).

(The results for correlations from auto-interp are less clear: they find similar correlation coefficients with and without weight randomization. However, they find that this might be due to single token features on the part of the randomized transformer and when you ignore these features (or correct in some other way I'm forgetting?), the SAE on an actual transformer indeed has higher correlation.)

Another metric is: comparing the similarity between two dictionaries using mean max cosine similarity (where one of the dictionaries is treated as the ground truth), we've found that two dictionaries trained from different random seeds on the same (non-randomized) model are highly similar (>.95), whereas dictionaries trained on a randomized model and a non-randomized model are dissimilar (<.3 IIRC, but I don't have the data on hand).
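For readers unfamiliar with the metric, here is a minimal sketch of mean max cosine similarity as I understand it from the description above. The function name and the convention that rows are dictionary vectors are assumptions; the random matrices stand in for real dictionaries and say nothing about the reported numbers.

```python
import numpy as np

def mean_max_cosine_similarity(ground_truth, learned):
    """For each ground-truth direction, take its highest cosine similarity to
    any learned direction, then average over the ground-truth directions.

    Both arguments have shape (n_features, d_model); rows are dictionary vectors.
    """
    gt = ground_truth / np.linalg.norm(ground_truth, axis=1, keepdims=True)
    ln = learned / np.linalg.norm(learned, axis=1, keepdims=True)
    cos = gt @ ln.T                       # (n_ground_truth, n_learned)
    return cos.max(axis=1).mean()

rng = np.random.default_rng(0)
dict_a = rng.normal(size=(32, 64))
permuted_copy = dict_a[rng.permutation(32)] * 3.0   # same directions, reordered and rescaled
unrelated = rng.normal(size=(32, 64))

same = mean_max_cosine_similarity(dict_a, permuted_copy)
diff = mean_max_cosine_similarity(dict_a, unrelated)
```

The metric is insensitive to feature ordering and scale, which is why the permuted, rescaled copy scores as identical. It is also asymmetric: swapping which dictionary is treated as ground truth can change the value.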

It's not clear what experiments this mindset suggests. If I squint I basically get things that are already being tried, like intervention experiments to determine the connections between SAE features.

Thanks for having this dialogue -- I'm very happy to see clearer articulation of the Buck/Ryan views on theories of impact for MI work!

The part that I found most useful was Ryan's bullet points for "Hopes (as I see them) for mech interp being useful without explaining 99%". I would guess that most MI researchers don't actually see their theories of impact as relying on explaining ~all of model performance (even though they sometimes get confused/misunderstand the question and say otherwise). So I think the most important cruxes will lie in disagreements about (1) whether Ryan's list is complete, and (2) whether Ryan's concerns about the approaches listed are compelling.

Here's a hope which (I think) isn't on the list. It's somewhat related to the hope that Habryka raised, though a bit different and more specific.

Approach: maybe model internals overtly represent qualities which distinguish desired vs. undesired cognition, but probing is insufficient for some reason (e.g. because we don't have good enough oversight to produce labeled data to train a probe with).

Here's a concrete example (which is also the example I most care about). Our goal is to classify statements as true/false, given access to a model that knows the answer. Suppose our model has distinct features representing "X is true" and "humans believe X." Further suppose that on any labeled dataset we're able to create, these two features are correlated; thus, if we make a labeled dataset of true/false statements and train a probe on it, we can't tell whether the probe will generalize as an "X is true" classifier or a "humans believe X classifier." However, a coarse-grained mechanistic understanding would help here. E.g., one could identify all of the model features which serve as accurate classifiers on our dataset, and only treat statements as true if all of the features label them as true. Or if we need a lower FPR, one might be able to mechanistically distinguish these features, e.g. by noticing that one feature is causally downstream of features that look related to social reasoning and the other feature isn't.
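The "only treat statements as true if all of the features label them as true" idea can be sketched on toy data. Everything here is hypothetical: the feature readouts are hand-written binary columns, and the 0.99 accuracy cutoff is an arbitrary choice for illustration.

```python
import numpy as np

# Hypothetical binary feature readouts, one row per statement.
# Columns: [is_true, humans_believe, unrelated]
train_feats = np.array([[1, 1, 0],
                        [0, 0, 1],
                        [1, 1, 1],
                        [0, 0, 0]])
train_labels = np.array([1, 0, 1, 0])

# Step 1: keep every feature that classifies the labeled set accurately.
accuracy = (train_feats == train_labels[:, None]).mean(axis=0)
candidates = np.flatnonzero(accuracy >= 0.99)

# Step 2: call a statement true only if ALL candidate features say so.
def conservative_truth(feats):
    return feats[:, candidates].all(axis=1).astype(int)

# Off-distribution statements where the two notions come apart.
test_feats = np.array([[1, 1, 0],    # true and believed
                       [0, 1, 0],    # false but widely believed
                       [1, 0, 1]])   # true but not believed
preds = conservative_truth(test_feats)
```

In this toy case both the "is true" and "humans believe" columns survive step 1, since the labeled data cannot tell them apart. The conjunction then rejects the false-but-believed statement, at the cost of also rejecting the true-but-unbelieved one.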

This is formally similar to what the authors of this paper did. In brief, they were working with the Waterbirds dataset, an image classification task with lots of spuriously correlated features which are not disambiguated by the labeled data. Working with a CLIP ViT, the authors used some ad-hoc technique to get a general sense that certain attention heads dealt with concepts like "texture," "color," and "geolocation." Then they ablated the heads which seemed most likely to attend to confounding features; this resulted in a classifier which generalized in the desired way, without requiring a better-quality labeled dataset.

Curious for thoughts about/critiques of this impact story.

Here's a hope which (I think) isn't on the list. It's somewhat related to the hope that Habryka raised, though a bit different and more specific.

Approach: maybe model internals overtly represent qualities which distinguish desired vs. undesired cognition, but probing is insufficient for some reason (e.g. because we don't have good enough oversight to produce labeled data to train a probe with).

I don't think this exact thing is directly mentioned by my list. Thanks for the addition.

Let me try to state something which captures most of that approach to make sure I understand:

Approach: Maybe we can find some decomposition of model internals[1] such that all or most components directly related to some particular aspect of cognition are overtly obvious and there are also a small number of such components. Then, maybe we can analyze, edit, or build a classifier using these components in cases where baseline training techniques (e.g. probing) are insufficient.

Then, it seems like there are two cases where this is useful:

  1. Merely capturing all (or most) of the components like this is directly useful for ablation or building a classifier (e.g. because we want an ALL over the truth notions in the model like in your example or because we want to do precise removal of some behavior or capability). This requires that our list of components can be sufficiently small such that using all of them doesn't bite too hard (e.g. low enough FPR) and that this list of components includes enough of the action that using all of them is sufficiently reliable (e.g. high enough TPR).
  2. Even without detailed understanding and potentially without capturing "all" components, we can further identify components by looking at their connections or doing similar intervention experiments at a high level. Then, we can use our analysis of these components to do something useful (e.g. determine which components correspond to humans merely thinking something is true).

This impact story seems overall somewhat reasonable to me. It's worth noting that I can't imagine this resulting in very ambitious applications, though the reduction in doom could still be substantial. My main concerns are:

  • Both stories depend on our decomposition resulting in components which are possible to recognize and label to quite a high level of specificity despite not building much understanding of the exact behavior. This seems like a strong property and it seems unlikely we'd be able to find an unsupervised decomposition which consistently has this property for the cases we care about. (I don't see why sparsity would have this property to the extent we need, but it seems at least a bit plausible and it's probably better than the default.)
    • More generally, it feels like this story is supposing some level of "magic" on the part of our decomposition. If we don't understand much of the training compute then there will be decompositions which look to us like a good enough decomposition while hiding arbitrary stuff in the residual between our understanding and what's going on. It's possible that in practice, some unsupervised decomposition (e.g. SAE) cleanly breaks things apart into components which are easy to label while simultaneously these labels are quite specific and quite accurate. But why would this be true? (Maybe forthcoming research will demonstrate this, but my inside view thinks this is unlikely.)
  • If we want to look at connections, then imperfect understanding will probably bite pretty hard particularly as the effect size of the connection gets smaller and smaller (either due to path length >1 or just there being many things which are directly connected but have a small effect).

If you thought that current fundamental science in mech interp was close to doing this, I think I'd probably be excited about building test bed(s) where you think this sort of approach could be usefully applied and which aren't trivially solved by other methods. If you don't think the fundamentals of mech interp are close, it would be interesting to understand what you think will change to make this story viable in the future (better decompositions? something else?).


  1. Either a "default" decomposition like neurons/attention heads or "non-default" decomposition like a sparse autoencoder. ↩︎

Let me try to state something which captures most of that approach to make sure I understand:

Everything you wrote describing the hope looks right to me.

It's worth noting that I can't imagine this resulting in very ambitious applications, though the reduction in doom could still be substantial.

To be clear, what does "ambitious" mean here? Does it mean "producing a large degree of understanding?"

If we don't understand much of the training compute then there will be decompositions which look to us like a good enough decomposition while hiding arbitrary stuff in the residual between our understanding and what's going on.

[...]

If we want to look at connections, then imperfect understanding will probably bite pretty hard particularly as the effect size of the connection gets smaller and smaller (either due to path length >1 or just there being many things which are directly connected but have a small effect).

These seem like important intuitions, but I'm not sure I understand or share them. Suppose I identify a sentiment feature. I agree there's a lot of room for variation in what precise notion of sentiment the model is using, and there are lots of different ways this sentiment feature could be interacting with the network that are difficult to understand. But maybe I don't really care about that, I just want a classifier for something which is close enough to my internal notion of sentiment.

Just so with truth: there's probably lots of different subtly different notions of truth, but for the application of "detecting whether my AI believes statement X to be true" I don't care about that. I do care about the difference between "true" and "humans think is true," but that's a big difference that I can understand (even if I can't produce examples), and where I can articulate the sorts of cognition which probably should/shouldn't be involved in it.

What's the specific way you imagine this failing? Some options:

  • None of the features we identify really seem to correspond to something resembling our intuitive notion of "truth" (e.g. because they frequently activate on unrelated concepts).
  • We get a bunch of features that look like truth, but can't really tell what goes into computing them.
  • We get a bunch of features that look like truth and we have some vague sense of how they're computed, but they don't seem differentiated in how "sketchy" these computational graphs look: either they all seem to rely on social reasoning or they all don't seem to.

Maybe a better question would be - why didn't these issues (lack of robust explanation) get in the way of the Steinhardt paper I linked? They were in fact able to execute something like the plan I sketch here: use vague understanding to guess which model components attend to features which are spuriously correlated with the thing you want, then use the rest of the model as an improved classifier for the thing you want.

What's the specific way you imagine this failing? Some options:

My proposed list (which borrows from your list):

  • We find a large number (e.g. 30,000) of features which all sorta look somewhat like truth, though none exactly look like truth. Further analysis doesn't make it clear which of these are "real" or "actually truth". Some features look more like truth and some look a bit less like truth, but broadly there is a smooth falloff in how "truth like" the features look such that there aren't a small set of discrete truth features. No single feature both looks like truth and correlates perfectly with our labeled datasets.
  • We get a bunch of features that look (at least somewhat) like truth and we have some vague sense of how they're computed, but they don't seem differentiated in how "sketchy" these computational graphs look: either they all seem to rely on social reasoning or they all don't seem to.
  • We get a bunch of features that look like truth, but looking at what they connect to doesn't make much sense and just makes us more confused overall. There are many diffuse connections and it's unclear what they do.
  • Everything looks fine and we apply the method, but it turns out there isn't any feature we've identified which corresponds to "actual truth" as this isn't very salient for the model in the regime we are interested in.

These seem like important intuitions, but I'm not sure I understand or share them. Suppose I identify a sentiment feature. I agree there's a lot of room for variation in what precise notion of sentiment the model is using, and there are lots of different ways this sentiment feature could be interacting with the network that are difficult to understand. But maybe I don't really care about that, I just want a classifier for something which is close enough to my internal notion of sentiment.

Sure, but then why not just train a probe? If we don't care about much precision what goes wrong with the probe approach?

It's possible to improve on just a probe trained on the data we can construct of course, but you'll need non-trivial precision to do so.

The key question here is "why does selecting a feature work while just naively training a probe fails".

We have to be getting some additional bits from our selection of the feature.

In more detail, let's suppose we use the following process:

  1. Select all features which individually get loss < X on our training set. Choose X such that if we get that loss on our training set, we're only worried about generalization error rather than errors which show up on the training set (~equivalently, we're well into diminishing returns on loss).
  2. Try to pick among these features or combine these features to produce a better classifier.
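Step 1 of the process above can be sketched as follows. This is a hedged illustration: it assumes each feature can be read off as a probability that the label is 1, uses binary cross-entropy as the loss, and picks an arbitrary threshold; `select_features` and `max_loss` are hypothetical names.

```python
import numpy as np

def binary_cross_entropy(p, y, eps=1e-7):
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def select_features(feature_probs, labels, max_loss):
    """Return indices of features whose individual training loss is below max_loss.

    feature_probs: (n_examples, n_features), each column read as P(label = 1).
    """
    losses = np.array([binary_cross_entropy(feature_probs[:, j], labels)
                       for j in range(feature_probs.shape[1])])
    return np.flatnonzero(losses < max_loss), losses

labels = np.array([1, 0, 1, 0, 1, 0])
feature_probs = np.column_stack([
    np.where(labels == 1, 0.95, 0.05),   # confident and correct
    np.where(labels == 1, 0.90, 0.10),   # slightly less confident, still correct
    np.full(6, 0.5),                     # uninformative
])
selected, losses = select_features(feature_probs, labels, max_loss=0.2)
```

The training set alone cannot rank the two selected features against each other in any way that matters for generalization, so any improvement over a probe has to come from whatever extra information is used in step 2.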

Then there are two issues:

  • Maybe there isn't any feature which gets < X loss. (We can relax our requirements, but we wanted to compete with probes!)
  • When we select among these features do we get a non-trivial number of "bits" of improvement? Is that enough bits to achieve what we wanted? I'm somewhat skeptical we can get much if any improvement here. (Of course, note that this doesn't mean the idea has no promise!)

IMO there are two kinda separate (and separable) things going on:

  • Maybe features are a good prior for classifiers for some reason for the decomposition you picked. (I don't really see why this would be true for currently used decompositions.) Why would selecting features based on performing well on our dataset be better than just training a probe?
  • Maybe looking at the connections of your classifier (what earlier features it connects to and what these connect to) and applying selection to the classifier based on the connections will be good. This can totally be applied to probe based classifiers. (Maybe there is some reason why looking at connections will be especially good for classifiers based on picking a feature but not training a probe, but if so, why?)

Maybe looking at the connections of your classifier (what earlier features it connects to and what these connect to) and applying selection to the classifier based on the connections will be good. This can totally be applied to probes. (Maybe there is some reason why looking at connections will be especially good for features but not probes, but if so, why?)

"Can this be applied to probes" is a crux for me. It sounds like you're imagining something like:

  • Train a bunch of truthfulness probes regularized to be distinct from each other.
  • Train a bunch of probes for "blacklisted" features which we don't think should be associated with truth (e.g. social reasoning, intent to lie, etc.).
  • (Unsure about this step.) Check which truth directions are causally downstream of blacklisted feature directions (with patching experiments?). Use that to discriminate among the probes.

Is that right?

This is not an option I had considered, and it would be very exciting to me if it worked. I have some vague intuition that this should all go better when you are working with features (e.g. because the causal dependencies among the features should be sparse), but I would definitely need to think about that position more.

"Can this be applied to probes" is a crux for me. It sounds like you're imagining something like:

I was actually imagining a hybrid between probes and features. The actual classifier doesn't need to be part of a complete decomposition, but for the connections we do maybe want the complete decomposition to fully analyze connections including the recursive case.

So:

  • Train a bunch of truthfulness probes regularized to be distinct from each other.
  • Check feature connections for these probes and select accordingly.

I also think there's a pretty straightforward way to do this without needing to train a bunch of probes (e.g. train probes to be orthogonal to undesirable stuff or whatever rather than needing to train a bunch).

As you mentioned, you can probably do this entirely with learned probes, via a mechanism like the one you described (but this is less clean than decomposition).

You could also apply amnesic probing which has somewhat different properties than looking at the decomposition (amnesic probing is where you remove some dimensions to avoid being able to discriminate certain classes via LEACE as we discussed in the measurement tampering paper).
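As a rough illustration of "removing some dimensions so certain classes can't be discriminated", here is a deliberately crude stand-in: project out the single direction separating the two class means. This is not LEACE, which chooses its edit more carefully; the function name and the synthetic data are assumptions for illustration only.

```python
import numpy as np

def remove_mean_difference_direction(acts, concept_labels):
    """Project out the one direction along which the two class means differ.

    A simplified stand-in for concept erasure, not an implementation of LEACE.
    """
    d = acts[concept_labels == 1].mean(axis=0) - acts[concept_labels == 0].mean(axis=0)
    d = d / np.linalg.norm(d)
    return acts - np.outer(acts @ d, d)

rng = np.random.default_rng(0)
concept = np.repeat([0, 1], 100)
acts = rng.normal(size=(200, 8))
acts[:, 0] += 3.0 * concept            # concept is linearly readable along axis 0

erased = remove_mean_difference_direction(acts, concept)
gap_before = np.linalg.norm(acts[concept == 1].mean(0) - acts[concept == 0].mean(0))
gap_after = np.linalg.norm(erased[concept == 1].mean(0) - erased[concept == 0].mean(0))
```

After the projection the two class means coincide, so a probe trained on `erased` can no longer use the mean shift. One would then check whether the downstream classifier still works without that information.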

(TBC, doesn't seem that useful to argue about "what is mech interp", I think the more central question is "how likely is it that all this prior work and ideas related to mech interp are useful". This is a strictly higher bar, but we should apply the same adjustment for work in all other areas etc.)

More generally, it seems good to be careful about thinking through questions like "does using X have a principled reason to be better than applying the 'default' approach (e.g. training a probe)". Good to do this regardless of actually using the default approach so we know where the juice is coming from.

In the case of mech interp style decompositions, I'm pretty skeptical that there is any juice in finding your classifier by doing something like selecting over components rather than training a probe. But, there could theoretically be juice in trying to understand how a probe works by looking at its connections (and the connections of its connections etc).

Sure, but then why not just train a probe? If we don't care about much precision what goes wrong with the probe approach?

Here's a reasonable example where naively training a probe fails. The model lies if any of N features is "true". One of the features is almost always activated at the same time as some others, such that in the training set it never solely determines whether the model lies.

Then, a probe trained on the activations may not pick up on that feature. Whereas if we can look at model weights, we can see that this feature also matters, and include it in our lying classifier.

This particular case can also be solved by adversarially attacking the probe though.
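The failure case described above can be written out as a toy example. The three binary features, the OR rule, and the hand-set probe weights are all hypothetical; the point is only that a probe ignoring the always-co-occurring feature is indistinguishable from a correct one on the training set.

```python
import numpy as np

# Hypothetical setup: the model lies iff any of three binary features is on.
def model_lies(feats):
    return feats.any(axis=1).astype(int)

# Training set: feature 2 only ever fires together with feature 0.
train = np.array([[0, 0, 0],
                  [1, 0, 0],
                  [0, 1, 0],
                  [1, 0, 1],
                  [1, 1, 1]])
train_labels = model_lies(train)

# A probe that ignores feature 2 entirely still fits the training set perfectly,
# so training provides no pressure to put weight on it.
probe_weights = np.array([1, 1, 0])
def probe(feats):
    return (feats @ probe_weights > 0).astype(int)

train_acc = (probe(train) == train_labels).mean()

# Off-distribution input where feature 2 fires alone.
test = np.array([[0, 0, 1]])
probe_pred, truth = probe(test)[0], model_lies(test)[0]
```

On the test input the probe says "not lying" while the OR rule says "lying". Reading the weights (here, the definition of `model_lies`) would reveal the third feature, and so might an adversarial search for inputs that fool the probe.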

Maybe a better question would be - why didn't these issues (lack of robust explanation) get in the way of the Steinhardt paper I linked? They were in fact able to execute something like the plan I sketch here: use vague understanding to guess which model components attend to features which are spuriously correlated with the thing you want, then use the rest of the model as an improved classifier for the thing you want.

My guess is that the classification task for waterbirds is sufficiently easy that butchering a substantial part of the model is fine. It won't usually be viable to ablate everything that looks similar to an undesirable property. In some cases, this might be fine due to redundancy, but if there is heavy redundancy, I also expect that you've missed some stuff if you just look for components which look related to a given target.

Not super high confidence overall.

Edit: it also seems likely to me that there is a more principled and simpler approach like using LEACE which works just as well or better (but I'm unsure and I'm not familiar with that paper or the literature here).

To be clear, what does "ambitious" mean here? Does it mean "producing a large degree of understanding?"

I mean reducing doom by a large amount for very powerful models.

kave

Curated. I’ve often felt that mech interp seems like a subfield of alignment work that has good traction and is making progress. And in my experience that’s a fairly common view. If that were true, that would be a pretty big deal. It might be that we could make a big dent with a pretty scalable field of research. So it seems pretty valuable to read thoughtful arguments to the contrary.

I gotta say I have some hesitation in curating this dialogue. The best and most informative topics are kind of diffused over the dialogue, and I feel like I never quite get enough depth or concreteness to really think through the claims. I think my main takeaways are:

(a) presumably a lot of safety-relevant stuff is in the diff between weaker and stronger models, and so you have to think about how you’d tell you’re explaining that diff, and
(b) a question about whether ‘induction heads exist’ or not, and what that means for whether or not mech interp has started making meaningful progress.

It’s not clear how mech interp could provide strong evidence against scheming (deceptive alignment) even if mech interp were much better.

I think an important question here is "how much of the model is in on the scheming?": what proportion of the circuits in it differ significantly from the most similar non-scheming model. If the answer was, say, >=1%, then once our mechanical interpretability was good enough to reliably tell a scheming circuit from a non-scheming circuit (say, by identifying that the former had a suspicious conditional dependency on some network of "deceitful mode vs. powerseeking mode" flags that can switch it over to another behavior pattern), then even looking at several hundred randomly selected circuits should normally let us detect the schemingness.
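The "several hundred randomly selected circuits" estimate can be checked with a short calculation. This sketch assumes circuits are sampled independently and uniformly, and that the per-circuit check is perfectly reliable; both are idealizations of the argument above.

```python
def detection_probability(scheming_fraction, n_sampled):
    """P(at least one sampled circuit is scheming), assuming independent
    uniform sampling and a perfectly reliable per-circuit check."""
    return 1.0 - (1.0 - scheming_fraction) ** n_sampled

p_100 = detection_probability(0.01, 100)   # roughly 0.63
p_300 = detection_probability(0.01, 300)   # roughly 0.95
p_500 = detection_probability(0.01, 500)   # above 0.99
```

Under these assumptions, with 1% of circuits involved, about 300 samples give roughly a 95% chance of hitting at least one scheming circuit. The estimate degrades quickly if the true fraction is much smaller or the check is unreliable.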

There are a couple of possibilities here:
a) The model was learning the human phenomenon of a scheming liar who skillfully pretends to be X when they are actually Y, without much foreshadowing/hints (usually the author gives the reader hints). This is a moderately complicated mechanism, but one that it clearly needs to develop in order to model that aspect of real human behavior, and then it started also using this mechanism extensively itself at other times. In this case we could intentionally activate this mechanism using text illustrating this human behavior, locate it, then map out its influence.
b) The model somehow spontaneously either developed a separate copy of this mechanism, separate from the one it uses for modeling this behavior in humans, or repurposed a spare copy. In this case there's no clear way to trigger this mechanism to help us locate it. [It's unclear how/why a second such mechanism could arise (or get repurposed) during SGD, since until the mechanism is very reliable it would be unhelpful and get penalized (Singular Learning Theory suggests Occam's razor applies to LLMs), and once the mechanism was very reliable, if it never used its "now show my true intent Y" mode during training, it seems like that half of the behavior pattern would inevitably bitrot/get catastrophically forgotten. So to be maintained, the mechanism must use that other mode sometimes, but figuring out how to trigger it is nontrivial.]

A distinctly labor-intensive approach to dealing with the repurposing variant of b) (it would need to be automated) would be to keep numerous checkpoints during the training run, look at each of them, identify copies of mechanism a) as they arise, and watch for one of them getting repurposed.

Setting up a model too small to be dangerous where the repurposing version of b) was incentivized and then demonstrating that it occurs seems like it would give us a lot of information.

If you have <98% perf explained (on webtext relative to unigram or bigram baseline), then you degrade from GPT4 perf to GPT3.5 perf

 

Two quick thoughts on why this isn't as concerning to me as this dialogue emphasized.

1. If we evaluate SAEs by the quality of their explanations on specific narrow tasks, full distribution performance doesn't matter

2. Plausibly the safety relevant capabilities of GPT (N+1) are a phase change from GPT N, meaning much larger loss increases in GPT (N+1) when attaching SAEs are actually competitive with GPT N (ht Tom for this one)

On (1), I agree, if you could explain 80% of GPT-4 performance on a task and metric where GPT-3.5 performs 1/2 as well as GPT-4 then that would suffice for showing something interesting not in GPT-3.5. For instance, if an explanation was able to human interpretably explain 80% of GPT-4's accuracy on solving APPS programming problems, then that accuracy would be higher than GPT-3.5.

However, I expect that performance on these sorts of tasks is pretty sensitive such that getting 80% of performance is much harder than getting 80% of loss recovered on web text. Most prior results look at explaining loss on webtext or a narrow distribution of webtext, not on trying to preserve downstream performance on some task.

There are some reasons why it could be easier to explain a high fraction of training compute in downstream task performance (e.g. it's a task that humans can do as well as models), but also some annoyances related to only having a smaller amount of data.

I'm skeptical that (2) will qualitatively matter much, but I can see the intuition.

"We reverse engineered every single circuit and can predict exactly what the model will do using our hand-crafted code" seems like it's setting the bar way too high for MI.

Instead, aim for stuff like the AI Lie Detector, which both 1) works and 2) is obviously useful.

To do a Side-Channel Attack on a system, you don't have to explain every detail of the system (or even know physics at all).  You just have to find something in the system that is correlated with the thing you care about.

Notably, the AI Lie Detector work isn't mech interp under the definition I provided.

But certain details there are still somewhat sketchy, in particular we don't have a detailed understanding of the attention circuit, and replacing the query with "the projection onto the subspace we thought was all that mattered" harmed performance significantly (down to 30-40%).

@Neel Nanda FYI my first thought when reading that was "did you try adding random normal noise along the directions orthogonal to the subspace to match the typical variance along those directions?". Mentioning in case that's a different kind of thing than you'd already thought of.
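One way to read the suggestion above is sketched here: keep each activation's component inside the subspace, and replace the rest with Gaussian noise matched to the data along the orthogonal directions. The function name is hypothetical, and matching the mean as well as the variance, independently per direction in an arbitrary basis of the complement, is my assumption about what "typical variance" would mean in practice.

```python
import numpy as np

def resample_orthogonal_complement(acts, basis, rng):
    """Keep the in-subspace component; replace the remainder with Gaussian
    noise whose per-direction mean and std match the data.

    acts:  (n, d) activations (e.g. attention queries)
    basis: (d, k) orthonormal basis of the subspace believed to matter
    """
    inside = acts @ basis @ basis.T
    outside = acts - inside
    # Orthonormal basis for the complement: left singular vectors beyond the first k.
    u, _, _ = np.linalg.svd(basis, full_matrices=True)
    comp = u[:, basis.shape[1]:]                     # (d, d - k)
    coords = outside @ comp                          # data coordinates in the complement
    noise = rng.normal(size=coords.shape) * coords.std(axis=0) + coords.mean(axis=0)
    return inside + noise @ comp.T

rng = np.random.default_rng(0)
d, k = 6, 2
basis, _ = np.linalg.qr(rng.normal(size=(d, k)))    # random orthonormal subspace
acts = rng.normal(size=(1000, d)) * np.array([1, 2, 3, 4, 5, 6])
patched = resample_orthogonal_complement(acts, basis, rng)
```

Compared with zeroing out the orthogonal component, this keeps the patched activations closer to the scale the downstream computation normally sees. Low performance is then harder to attribute to the activations simply being off-distribution.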

SAEs

I'm pretty sure this means "Sparse Auto-Encoders", but I'm not quite 99% confident, and this did stop me in my tracks.

Edited to clarify

Hopefully, even if we didn't get all the way there, this dialogue can still be useful in advancing thinking about mech interp.


I hope you guys repeat this dialogue again, as I think these kinds of drilled-down conversations will improve the community's ideas on how to do and teach mechanistic interpretability.


other model internals techniques

 

What are these? I'm confused about the boundary between mechinterp and others.

By mech interp I mean "A subfield of interpretability that uses bottom-up or reverse engineering approaches, generally by corresponding low-level components such as circuits or neurons to components of human-understandable algorithms and then working upward to build an overall understanding."

For examples of non-mech interp model internals, see here, here, and here. (Though all of these methods are quite simple.)

I think like 99% reliability is about the right threshold for large models based on my napkin math.

 

Serious question: we have 100% of the information, so why can't we get 100%?

Suggestion: Why not test whether mechanistic interp can detect lies on out-of-distribution data 99% of the time? (It should also generalise to larger models.)

It's a useful and well-studied benchmark. And while we haven't decided on a test suite, [there is some useful code](https://github.com/EleutherAI/elk).

This is referring to 99% in the context of "amount of loss that you explain in a human interpretable way for some component in the model" (a notion of faithfulness). For downstream tasks, either much higher or much lower reliability could be the right target (depending on the exact task).
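As a rough illustration of what "amount of loss explained" could mean numerically, here is a small sketch of one common way to normalize it: compare the loss when the component is swapped for its human-interpretable explanation against the loss with the component ablated and the loss of the unmodified model. The function name and the three-loss setup are assumptions for illustration; the comment above does not specify an exact formula.

```python
def fraction_of_loss_explained(clean_loss, ablated_loss, explained_loss):
    """Fraction of the ablation gap that the explanation recovers.

    clean_loss:     loss of the unmodified model.
    ablated_loss:   loss with the component ablated (the "explain nothing" baseline).
    explained_loss: loss with the component replaced by the interpretable explanation.
    All three are hypothetical measurements on the same evaluation data.
    """
    gap = ablated_loss - clean_loss
    if gap <= 0:
        raise ValueError("ablating the component should increase the loss")
    # 1.0 means the explanation fully recovers the component's contribution;
    # 0.0 means it does no better than ablation.
    return (ablated_loss - explained_loss) / gap


# Usage with made-up numbers: this comes out to approximately 0.99.
score = fraction_of_loss_explained(clean_loss=2.0, ablated_loss=5.0, explained_loss=2.03)
```

Under this normalization, a 99% target means the explanation leaves only about 1% of the component's contribution to the loss unaccounted for.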

Other approaches to alignment deserve just as much skepticism as mechanistic interpretability when subjected to the same scrutiny.

My greatest hopes for mechanistic interpretability do not seem represented, so allow me to present my pet direction. 

You invest many resources in mechanistically understanding ONE teacher network, within a teacher-student training paradigm. This is valuable because now instead of presenting a simplistic proxy training signal, you can send an abstract signal with some understanding of the world. Such a signal is harder to "cheat" and "hack".

If we can fully interpret and design that teacher network, then our training signals can incorporate much of our world model and morality. True, this requires us to become philosophical and actually consider what such a world model and morality are... but at least in this case we have a technical direction. In such an instance, a good deal of the technical aspects of the alignment problem are solved (at least for AI-to-human alignment, not human-to-human).

This argument says all mechanistic interpretability effort could be focused on ONE network. I concede this method requires the teacher to have a decent generalizable world model... At which point, perhaps we are already in the danger zone. 

Could you say more? Why would a teacher network be more capable of training a student network than literal humans? By what mechanism do you expect this teacher network to train other networks in a way that benefits from us understanding its internals?

Teacher-student training paradigms are not too uncommon. Essentially the teacher network is "better" than a human because you can generate far more feedback data and it can react at the same speed as the larger student network. Humans also can be inconsistent, etc. 

What I was discussing is that currently with many systems (especially RL systems) we provide a simple feedback signal that is machine interpretable. For example, the "eggs" should be at coordinates x, y. But in reality, we don't want the eggs at coordinates x, y; we just want to make an omelet.

So, if we had a sufficiently complex teacher network it could understand what we want in human terms, and it could provide all the training signal we need to teach other student networks. In this situation, we may be able to get away with only ever fully mechanistically understanding the teacher network. If we know it is aligned, it can keep up and provide a sufficiently complex feedback signal to train any future students and make them aligned. 

If this teacher network has a model of reality that models our morality and the complexity of the world then we don't fall into the trap of having AI doing stupid things like killing everyone in the world to cure cancer. The teacher network's feedback is sufficiently complex that it would never allow such actions to provide value in simulations, etc. 

ryan_greenblatt – By mech interp I mean "A subfield of interpretability that uses bottom-up or reverse engineering approaches, generally by corresponding low-level components such as circuits or neurons to components of human-understandable algorithms and then working upward to build an overall understanding."

That makes sense to me, and I think it is essential that we identify those low-level components. But I’ve got problems with the “working upward” part. 

The low-level components of a gothic cathedral, for example, consist of things like stone blocks, wooden beams, metal hinges and clasps, pieces of colored glass for the windows, tiles for the roof, and so forth. How do you work upward from a pile of that stuff, even if neatly organized and thoroughly catalogued? How do you get from there to the overall design of the cathedral? How, for example, can you look at that and conclude, “this thing’s going to have flying buttresses to support the roof?”

Somewhere in “How the Mind Works” Steven Pinker makes the same point in explaining reverse engineering. Imagine you’re in an antique shop, he suggests, and you come across an odd little metal contraption. It doesn’t make any sense at all. The shopkeeper sees your bewilderment and offers, “That’s an olive pitter.” Now that contraption makes sense. You know what it’s supposed to do.

How are you going to make sense of those things you find under the hood unless you have some idea of what they’re supposed to do?