Strong disagree. This seems like very much the wrong type of reasoning to do about novel scientific research. Big breakthroughs open up possibilities that are very hard to imagine before those breakthroughs (e.g. imagine trying to describe the useful applications of electricity before anyone knew what it was or how it worked; or imagine Galileo trying to justify the practical use of studying astronomy).
Interpretability seems like our clear best bet for developing a more principled understanding of how deep learning works; this by itself is sufficient to recommend it. (Algon makes a similar point in another comment.) Though I do agree that, based on the numbers you gave for how many junior researchers' projects are focusing on interpretability, people are probably overweighting it.
I think this post is an example of a fairly common phenomenon where alignment people are too focused on backchaining from desired end states, and not focused enough on forward-chaining to find the avenues of investigation which actually allow useful feedback and give us a hook into understanding the world better. (By contrast, most ML researchers are too focused on the latter.)
...Perhaps the main problem
This seems like very much the wrong type of reasoning to do about novel scientific research. Big breakthroughs open up possibilities that are very hard to imagine before those breakthroughs.
What type of reasoning do you think would be most appropriate?
This proves too much. The only way to determine whether a research direction is promising or not is through object-level arguments. I don't see how we can proceed without scrutinizing the agendas and listing the main difficulties.
this by itself is sufficient to recommend it.
I don't think it's that simple. We have to weigh the good against the bad, and I'd like to see some object-level explanations for why the bad doesn't outweigh the good, and why the problem is sufficiently tractable.
Interpretability seems like our clear best bet for developing a more principled understanding of how deep learning works;
Maybe. I would still argue that other research avenues are neglected in the community.
not focused enough on forward-chaining to find the avenues of investigation which actually allow useful feedback and give us a hook into understanding the world better
I provided plenty of technical research directions in the "preventi...
This post would have been far more productive if it had focused on exploring them.
So the sections "Counteracting deception with only interp is not the only approach" and "Preventive measures against deception", "Cognitive Emulations" and "Technical Agendas with better ToI" don't feel productive? It seems to me that it's already a good list of neglected research agendas. So I don't understand.
if you hadn't listed it as "Perhaps the main problem I have with interp"
In the above comment, I only agree with "we shouldn't do useful work, because then it will encourage other people to do bad things", and I don't agree with your critique of "Perhaps the main problem I have with interp..." which I think is not justified enough.
So the sections "Counteracting deception with only interp is not the only approach" and "Preventive measures against deception", "Cognitive Emulations" and "Technical Agendas with better ToI" don't feel productive? It seems to me that it's already a good list of neglected research agendas. So I don't understand.
You've listed them, but you haven't really argued that they're valuable, you're mostly just asserting stuff like Rob Miles having a bigger impact than most interpretability researchers, or the best strategy being copying Dan Hendrycks. But since I disagree with the assertions, these sections aren't very useful; they don't actually zoom in on the positive case for these research directions.
(The main positive case I'm seeing seems to be "anything which helps with coordination is really valuable". And sure, coordination is great. But most coordination-related research is shallow: it helps us do things now, but doesn't help us figure out how to do things better in the long term. So I think you're overstating the case for it in general.)
I agree that I haven't argued the positive case for more governance/coordination work (and that's why I hope to do a next post on that).
We do need alignment work, but I think the current allocation is weighted too heavily toward alignment, given that AI X-risks could arrive in the near future. I'll be happy to reinvest in alignment work once we're sure we can avoid X-risks from misuse and grossly negligent accidents.
Interpretability seems like our clear best bet for developing a more principled understanding of how deep learning works
If our goal is developing a principled understanding of deep learning, directly trying to do that is likely to be more effective than doing interpretability in the hope that we will develop a principled understanding as a side effect. For this reason I think most alignment researchers have too little awareness of various attempts in academia to develop "grand theories" of deep learning such as the neural tangent kernel. I think the ideal use for interpretability in this quest is as a way of investigating how the existing theories break down - e.g. if we can explain 80% of a given model's behavior with the NTK, what are the causes of the remaining 20%? I think of interpretability as basically collecting many interesting data points; this type of collection is essential, but it can be much more effective when it's guided by a provisional theory which tells you which data points are expected and which are interesting anomalies that call for a revision of the theory, which in turn guides further exploration, and so on.
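As a rough illustration of what "checking a model against the NTK" could look like in practice, here is a minimal sketch (my own toy example in PyTorch, not anything from the thread) of the empirical NTK, computed as the inner product of parameter gradients for two inputs:

```python
# Minimal sketch (toy example): the empirical NTK entry
# Theta(x1, x2) = <grad_theta f(x1; theta), grad_theta f(x2; theta)>
# for a scalar-output network at its current parameters.
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(10, 64), nn.Tanh(), nn.Linear(64, 1))

def param_grad(x):
    """Flattened gradient of the scalar output with respect to all parameters."""
    out = net(x).squeeze()
    grads = torch.autograd.grad(out, tuple(net.parameters()))
    return torch.cat([g.reshape(-1) for g in grads])

x1, x2 = torch.randn(1, 10), torch.randn(1, 10)
theta_12 = param_grad(x1) @ param_grad(x2)   # empirical NTK entry Theta(x1, x2)
print(theta_12.item())
```

Comparing predictions derived from this kernel against the trained network's actual behavior is one concrete version of the "explain 80% with the NTK, then study the remaining 20%" move described above.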
I agree that work like NTK is worth thinking about. But I disagree that it's a more "direct" approach to a principled understanding of deep learning. To find a "grand theory" of deep learning, we're going to need to connect our understanding of neural networks to our understanding of the real world, and I don't think NTKs or other related things can help very much with that step - for roughly the same reasons that statistical learning theory wasn't very helpful (and was in fact anti-helpful) in predicting the success of deep neural networks.
Btw, this isn't a general-purpose critique of theoretical work - e.g. it doesn't apply to this paper by Lin, Tegmark and Rolnick, which actually ties neural network success to properties of the real world like symmetry, locality, and compositionality. This is the sort of thing which I can much more easily imagine leading to alignment breakthroughs.
I think of interpretability as basically collecting many interesting data points
I'd agree if interpretability were just about "here's a circuit for recognizing X" (although even then, the concept of circuits itself was nontrivial to develop), but in fact a lot of the most promising work has been on more important and fundamental phenomena like superposition and induction heads.
we're going to need to connect our understanding of neural networks to our understanding of the real world
The NTK and related theories aim to go from "SGD finds a giant blob of parameters that performs well on the data for some reason" to "SGD finds a solution with such-and-such clean mathematical characterization". To fully explain the success of deep learning you do then have to relate the clean mathematical characterization to the real world, but I think this can be done separately to some extent and is less of a bottleneck on progress. My #2 use case for interpretability would be doing stuff like this - basically conceptual/experimental investigation of the types of solutions favored by a given mathematical theory, with the goal of obtaining a high-level story about "why it works in the real world". Plus attempts to carry out alignment/interpretability/ELK tasks in the simplified setting.
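For readers who haven't seen it, the "clean mathematical characterization" in the NTK picture comes from linearizing the network around its initialization (standard textbook material, not anything specific to this thread):

$$
f(x;\theta) \;\approx\; f(x;\theta_0) + \nabla_\theta f(x;\theta_0)^\top (\theta - \theta_0),
\qquad
\Theta(x,x') \;=\; \nabla_\theta f(x;\theta_0)^\top \nabla_\theta f(x';\theta_0).
$$

In the infinite-width limit this linearization becomes exact, and gradient descent on squared loss converges to kernel regression with kernel $\Theta$, so the learned function is pinned down by the training data plus the kernel, with no feature learning - which is also why it can serve as a feature-learning-free baseline.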
This is the sort of thing which I can much more easily imagine leading to alignment breakthroughs
Hmm, it's been a while since I looked at this paper but if I recall it doesn't really try to make any specific predictions about the inductive bias of neural nets in practice, it's more like a series of suggestive analogies. That's fine, but I think that sort of thing is more likely to be productive if guided by a more detailed theory.
I can't speak for Richard, but I think I have a similar issue with NTK and adjacent theory as it currently stands (beyond the usual issues). I'm significantly more confident in a theory of deep learning if it cleanly and consistently explains (or better yet, predicts) unexpected empirical phenomena. The one that sticks out most prominently in my mind, that we see constantly in interpretability, is this strange correspondence between the algorithmic "structure" we find in trained models (both ML and biological!) and "structure" in the data generating process.
That training on Othello move sequences gets you an algorithmic model of the game itself is surprising from most current theoretical perspectives! So in that sense I might be suspicious of a theory of deep learning that fails to "connect our understanding of neural networks to our understanding of the real world". This is the single most striking thing to come out of interpretability, in my opinion, and I'm worried about a "deep learning theory of everything" if it doesn't address this head on.
That said, NTK doesn't promise to be a theory of everything, so I don't mean to hold it to an unreasonable standard. It does what it says...
I intended my comment to apply to "theories of deep learning" in general, the NTK was only meant as an example. I agree that the NTK has problems such that it can at best be a 'provisional' grand theory. The big question is how to think about feature learning. At this point, though, there are a lot of contenders for "feature learning theories" - the Maximal Update Parameterization, Depth Corrections to the NTK, Perturbation Theory, Singular Learning Theory, Stochastic Collapse, SGD-Induced Sparsity....
So although I don't think the NTK can be a final answer, I do like the idea of studying it in more depth - it provides a feature-learning-free baseline against which we can compare actual neural networks and other potential 'grand theories'. Exactly which phenomena can we not explain with the NTK, and which theory best predicts them?
I get the impression of a certain motte-and-bailey quality in this comment and similar arguments. At a high level, the notion of better understanding what neural networks are doing would be great. The problem, though, is that most state-of-the-art interpretability research does not seem to be doing a good job of this in a way that will be useful for safety anytime soon. In that sense, I think this comment talks past the points that this post is trying to make.
I wish the original post had been more careful about its claims, so that I could respond to them more clearly. Instead there's a mishmash of sensible arguments, totally unjustified assertions, and weird strawmen (like "I don't understand how “Looking at random bits of the model and identify circuits/features” will help with deception"). And in general a lot of this is of the form "I don't see how X", which is the format I'm objecting to, because of course you won't see how X until someone invents a technique to X.
This is exacerbated by the meta-level problem that people have very different standards for what's useful (e.g. to Eliezer, none of this is useful), and also standards for what types of evidence and argument they accept (e.g. to many ML researchers, approximately all arguments about long-term theories of impact are too speculative to be worth engaging in depth).
I still think that so many people are working on interpretability mainly because they don't see alternatives that are as promising; in general I'd welcome writing that clearly lays out solid explanations and intuitions about why those other research directions are worth working on, and think that this would be the best way to recalibrate the field.
EDIT: Nuance of course being impossible, this no doubt comes off as rude - and is in turn a reaction to an internet-distorted version of what you actually wrote. Oh well, grain of salt and all that.
The way you get safety by design is understanding what's going on inside the neural networks.
This is equivocation. There are some properties of what's going on inside a NN that are crucial to reasoning about its safety properties, and many, many more that are irrelevant.
I'm actually strongly reminded of a recent comment about LK-99, where someone remarked that a good way to ramp up production of superconductors would be to understand how superconductors work, because then we could design one that's easier to mass-produce.
Except:
It's literally point -2 in List of Lethalities that we don't need a "perfect" alignment solution; we just don't have any.
After spending a while thinking about interpretability, my current stance is:
Note that this is just for "mechanistic interpretability". I think that high level top down interpretability (both black box and white box) has a clearer story for usefulness which doesn't require very ambitious success.
For mechanistic interpretability, very ambitious success looks something like:
The main reason why I think mechanistic interpretability is very far from ambitious success is that current numbers are extremely bad and what people explain is extremely cherry picked. Like people's explanations typically result in performance which is worse than that of much, much tinier models even though heavy cherry picking is applied.
If people were getting ok perf on randomly selected "parts" of models (for any notion of decomposition), then we'd be much closer. I'd think we were much closer even if this was extremely labor intensive.
(E.g., the curve detectors work explained ~50% of the loss which is probably well less than 10% of the bits given sharply diminishing returns to scale on typical scaling laws.)
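For concreteness, the sort of metric being referenced is usually some version of "fraction of the loss gap recovered" (my gloss; the exact convention varies between papers):

$$
\text{loss recovered} \;=\; \frac{L_{\text{baseline}} - L_{\text{hypothesis}}}{L_{\text{baseline}} - L_{\text{model}}},
$$

where $L_{\text{model}}$ is the original model's loss, $L_{\text{hypothesis}}$ is the loss when the computation is replaced by the proposed explanation (e.g. via causal scrubbing or ablation to the explained components), and $L_{\text{baseline}}$ is an uninformative baseline such as a fully scrubbed or ablated model. Because loss falls off only slowly with model size under typical scaling laws, recovering 50% of that gap can still leave you below the performance of a much smaller model, which is the "well less than 10% of the bits" point above.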
I think I would particularly critique DeepMind and OpenAI's interpretability works, as I don't see how this reduces risks more than other works that they could be doing, and I'd appreciate a written plan of what they expect to achieve.
I can't speak on behalf of Google DeepMind or even just the interpretability team (individual researchers have pretty different views), but I personally think of our interpretability work as primarily a bet on creating new affordances upon which new alignment techniques can be built, or existing alignment techniques can be enhanced. For example:
To give props to your last paragraphs, you are right about my concern that most alignment work is less important than governance work. Most of the funding in AI safety goes to alignment, AI governance is comparatively neglected, and I'm not sure that's the best allocation of resources. I decided to write this post specifically on interpretability as a comparatively narrow target to train my writing.
I hope to work on a more constructive post, detailing constructive strategic considerations and suggesting areas of work and theories of impact that I think are most productive for reducing X-risks. I hope that such a post would be the ideal place for more constructive conversations, although I doubt that I am the best suited person to write it.
I think that this is a well-done post overall, though I mostly disagree with it. A couple of thoughts below.
First, I was surprised not to see unknown unknowns addressed, as Richard pointed out.
Second, another theory of impact that I didn't see addressed here is the case that I've been trying to make recently that interpretability is likely to be necessary to build good safety evaluations. This could be quite important if evaluations end up being the primary AI governance tool, as currently looks somewhat likely to me.
Third, though you quote me talking about why I think detecting/disincentivizing deception with interpretability tools is so hard, what is not quoted is what I think about the various non-interpretability methods of doing so—and what I think there is that they're even harder. Though you mention a bunch of non-interpretability ways of studying deception (which I'm definitely all for), studying it doesn't imply that we can disincentivize it (and I think we're going to need both). You mention chain-of-thought oversight as a possible solution, but I'm quite skeptical of that working, simply because the model need not write out its deception in the scratchpad in any legible ...
My personal theory of impact for doing nonzero amounts of interpretability is that I think understanding how models think will be extremely useful for conceptual research. For instance, I think one very important data point for thinking about deceptive alignment is that current models are probably not deceptively aligned. Many people have differing explanations for which property of the current setup causes this (and therefore which things we want to keep around / whether to expect phase transitions / etc), which often imply very different alignment plans. I think just getting a sense of what even these models are implementing internally could help a lot with deconfusion here. I don't think it's strictly necessary to do interpretability as opposed to targeted experiments where we observe external behaviour for these kinds of things, but probably experiments that get many bits are much better than targeted experiments for deconfusion, because oftentimes the hypotheses are all wrong in subtle ways. Aside from that, I am not optimistic about fully understanding the model, training against interpretability, microscope AI, or finding the "deception neuron" as a way to audit deception. I don't think future models will necessarily have internal structures analogous to current models.
(context: I ran the most recent iteration of ARENA, and after this I joined Neel Nanda's mech interp stream in SERI MATS)
Registering a strong pushback to the comment on ARENA. The primary purpose of capstone projects isn't to turn people into AI safety technical researchers or to produce impressive capstones, it's to give people engineering skills & experience working on group projects. The initial idea was not to even push for things that were safety-specific (much like Redwood's recommendations - all of the suggested MLAB2 capstones were either mech interp or non-safety, iirc). The reason many people gravitated towards mech interp is that they spent a lot of time around researchers and people who were doing interesting work in mech interp, and it seemed like a good fit for both getting a feel for AI safety technical research and for general skilling up in engineering.
Additionally, I want to mention that participant responses to the question "how have your views on AI safety changed?" included both positive and negative updates on mech interp, but much more uniformly showed positive updates on AI safety technical research as a whole. Evidence like this updates me away from the...
The biggest thing that worries me about the idea of interpretability, which you mention, is that any sufficiently low-level interpretation of a giant, intractably complex AGI-level model would likely also be intractably complex. And any interpretation of that. And so on, until you start getting the feeling that you'll probably need AI to interpret the interpretation, and then AI to interpret the interpreter, and so on, in a chain which you might try to carefully validate but that increasingly feels like a typical Godzilla Strategy. This does not lead to rising property values in Tokyo.
That said, maybe it can be done, and even be reliable enough. But it would also significantly enhance our ability to distill models. Like, if you could take an NN-based model, interpret it, and map it to a GOFAI-style, extremely interpretable system, now you probably have a much faster, leaner and cleaner version of the same AI - so you can probably just build an even bigger AI. And the question then becomes whether this style of interpretability can ever catch up to the increase in capabilities it would automatically foster.
In my opinion, much of the value of interpretability is not related to AI alignment but to AI capabilities evaluations instead.
For example, the Othello paper shows that a transformer trained to predict the next move in Othello games learns a world model of the board rather than just surface statistics of the training sequences. This knowledge is useful because it suggests that transformer language models are more capable than they might initially seem.
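To make the "world model" claim concrete: this sort of result is typically established by training a probe from the model's intermediate activations to the board state, and checking whether it succeeds. A minimal sketch (illustrative names and shapes; it assumes you already have cached activations and ground-truth board labels, and is not the paper's exact method):

```python
# Sketch of a probe: predict the state of one board square
# (empty / black / white) from a model's intermediate activations.
# `acts` and `labels` are stand-ins; in practice they come from running
# the trained model on game transcripts and recording the true board.
import torch
import torch.nn as nn

acts = torch.randn(5000, 512)              # [n_positions, d_model] cached activations
labels = torch.randint(0, 3, (5000,))      # ground-truth state of one square

probe = nn.Linear(512, 3)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for _ in range(200):
    opt.zero_grad()
    loss = loss_fn(probe(acts), labels)
    loss.backward()
    opt.step()

acc = (probe(acts).argmax(-1) == labels).float().mean()
print(f"probe accuracy: {acc.item():.2f}")  # high accuracy suggests the board state is decodable
```

High probe accuracy (on held-out data, in a real experiment) is the evidence that the board state is represented internally rather than merely implicit in the move statistics.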
Induction heads? Ok, we are maybe on track to reverse-engineer the mechanism of regex in LLMs. Cool.
This dramatically undersells the potential impact of Olsson et al. You can't dismiss modus ponens as "just regex". That's the heart of logic!
For many, the argument for AI safety being an urgent concern involves a belief that current systems are, in some rough sense, reasoning, and that this capability will increase with scale, leading to beyond-human-level intelligence within a timespan of decades. Many smart outsiders remain sceptical, because they are not convinced that anything like reasoning is taking place.
I view Olsson et al as nontrivial evidence for the emergence of internal computations resembling reasoning, with increasing scale. That's profound. If that case is made stronger over time by interpretability (as I expect it to be) the scientific, philosophical and societal impact will be immense.
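For readers who haven't read Olsson et al., the behavior attributed to induction heads is easy to state, even though the claim about the internal mechanism is the interesting part. A schematic of the input-output pattern only (this is not the learned circuit, just the pattern it implements):

```python
# Schematic of the *behavior* attributed to induction heads: [A][B] ... [A] -> predict [B].
# This is only the input-output pattern, not the attention mechanism that implements it.
def induction_prediction(tokens):
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):      # scan backwards for a previous occurrence
        if tokens[i] == current:
            return tokens[i + 1]                   # copy whatever followed it last time
    return None

print(induction_prediction(["Mr", "D", "##ursley", "was", "thin", ".", "Mr", "D"]))
# -> "##ursley"
```

The paper's further claim is that specific attention heads implement this via prefix-matching and copying, and it argues that the same machinery may drive much of in-context learning more generally, which is why the "just regex" framing undersells it.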
Very good post! I agree with most of what you have written, but I'm not sure about the conclusions. Two main reasons:
I'm not sure mech interp should be compared to astronomy; I'd say it is more like mechanical engineering. We have JWST because a long, long time ago there were watchmakers, gunsmiths, opticians, etc. who didn't care at all about astronomy, yet their advances in unrelated fields made astronomy possible. I think something similar might happen with mech interp - we'll keep creating better and better tools to achieve some goals, these goals will
Fully agree with the post. Depending solely on interpretability work and downloading activations without understanding how to interpret the numbers is a big waste of time. I've met smart people stuck in aimless exploration, which is bad in the long run. Wasting time slowly is not immediately painful, but it really hurts when projects fail due to poor direction.
I roughly agree with the case made here because I expect interpretability research to be much, much harder than others seem to appreciate. This is a consequence of strong intuitions from working on circuit complexity. Figuring out the behavior of a general circuit sounds like it's in a very hard complexity class - even writing down the truth table for a circuit takes exponential time in the number of inputs! I would be surprised if coming up with a human interpretable explanation of sub circuits is easy; there are some reasons to believe that SGD wil...
See the current plan here: "The current alignment plan, and how we might improve it" (EAG 2023 Bay Area).
Link to talk above doesn't seem to work for me.
Outside view: The proportion of junior researchers doing interp rather than other technical work is too high
Quite tangential[1] to your post but if true, I'm curious about what this suggests about the dynamics of field-building in AI safety.
Seems to me like certain organisations and individuals have an outsized influence in funneling new entrants into specific areas, and because the field is small (and ...
I finally got around to reading this today, because I have been thinking about doing more interpretability work, so I wanted to give this piece a chance to talk me out of it.
It mostly didn't.
I broadly agree, but I think there's more safety research along with "Retarget the search" that focuses on using a trained AI's own internals to understand things like deception, planning, preferences, etc, that you didn't mention. You did say this sort of thing isn't a central example of "interpretability," which I agree with, but some more typical sorts of interpretability can be clear instrumental goals for this.
E.g. suppose you want to use an AI's model of human preferences for some reason. To operationalize this, given a description of a situation, yo...
I thought the section on interpretability as a tool to predict future systems was poor. The post's arguments against that theory of impact are: reading current papers is a better predictor of future capabilities than current interpretability work, and examples of interpretability being applied after phenomena are discovered. But no one is saying current interpretability tech & insights will let you predict the future! As you point out, we barely even understand what a feature is!
Which could change. If we advance enough to reverse engineer GPT-4, and f...
A feature is still a fuzzy concept,
"Gene", "species", and even "concept" are also fuzzy concepts but despite that, we managed to substantially improve our understanding of the-things-in-the-world-they-point-to and the phenomena they interact with. Using these fuzzy concepts even made us realize how fuzzy they are, what's the nature of their fuzziness, and what other (more natural/appropriate/useful/reality-at-joint-carving) abstractions we may replace them with.[1] In other words, we can use fuzzy concepts as a ladder/provisional scaffold for understa...
Some of your YouTube links are broken because the equals sign got escaped as "%3D". If I were you I'd spend a minute to fix that.
Strong disagree. Can't say I've worked through the entire article in detail, but wanted to chime in as one of the many junior researchers investing energy in interpretability. Noting that you erred on the side of making arguments too strong. I agree with Richard about this being the wrong kind of reasoning for novel scientific research, and with Rohin's idea that we're creating new affordances. I think generally MI is grounded and much closer to being a natural science that will progress over time and be useful for alignment, synergising with other approa...
interp?
interpolation, interpretation or which word i
Epistemic Status: I believe I am well-versed in this subject. I erred on the side of making claims that were too strong and allowing readers to disagree and start a discussion about precise points rather than trying to edge-case every statement. I also think that using memes is important because safety ideas are boring and anti-memetic. So let’s go!
Many thanks to @scasper, @Sid Black , @Neel Nanda , @Fabien Roger , @Bogdan Ionut Cirstea, @WCargo, @Alexandre Variengien, @Jonathan Claybrough, @Edoardo Pona, @Andrea_Miotti, Diego Dorn, Angélina Gentaz, Clement Dumas, and Enzo Marsot for useful feedback and discussions.
When I started this post, I began by critiquing the article A Long List of Theories of Impact for Interpretability by Neel Nanda, but I later expanded the scope of my critique. Some of the ideas presented are not actually advocated by anyone, but to explain the difficulties, I still need to (1) present them and (2) criticize them. This gives the post an adversarial vibe; I'm sorry about that, and I think that doing research into interpretability, even if it's no longer what I consider a priority, is still commendable.
How to read this document? Most of this document is not technical, except for the section "What does the end story of interpretability look like?" which can be mostly skipped at first. I expect this document to also be useful for people not doing interpretability research. The different sections are mostly independent, and I’ve added a lot of bookmarks to help modularize this post.
If you have very little time, just read (this is also the part where I’m most confident):
Here is the list of claims that I will defend:
(bolded sections are the most important ones)
Note: The purpose of this post is to criticize the Theory of Impact (ToI) of interpretability for deep learning models such as GPT-like models, and not the explainability and interpretability of small models.
The emperor has no clothes?
I gave a talk about the different risk models, followed by a presentation on interpretability. Then I got a troubling question: "I don't understand, what's the point of doing this?" Hmm.
The considerations in the last bullet points are based on feeling and are not real arguments. Furthermore, most mechanistic interpretability isn't even aimed at being useful right now. But in the rest of the post, we'll find out if, in principle, interpretability could be useful. So let's investigate if the Interpretability Emperor has invisible clothes or no clothes at all!
The overall Theory of Impact is quite poor
Neel Nanda has written A Long List of Theories of Impact for Interpretability, which lists 20 diverse Theories of Impact. However, I find myself disagreeing with the majority of these theories. The three big meta-level disagreements are:
Other less important disagreements:
Here are some key theories with which I disagree:
In the appendix, I critique almost all the other Theories of Impact.
Interp is not a good predictor of future systems
Theory of Impact 2: “Better prediction of future systems: Interpretability may enable a better mechanistic understanding of the principles of how ML systems work, and how they change with scale, analogous to scientific laws. This allows us to better extrapolate from current systems to future systems, in a similar sense to scaling laws. E.g., observing phase changes a la induction heads shows us that models may rapidly gain capabilities during training” from Neel Nanda.
Auditing deception with interp is out of reach
Auditing deception is generally the main motivation for doing interp. So here we are:
Theory of Impact 4: “Auditing for deception: Similar to auditing, we may be able to detect deception in a model. This is a much lower bar than fully auditing a model, and is plausibly something we could do with just the ability to look at random bits of the model and identify circuits/features - I see this more as a theory of change for 'worlds where interpretability is harder than I hope'” from Neel Nanda.
Counteracting deception with only interp is not the only approach:
Inspired by every discussion I’ve had with friends defending interp. “Your argument for astronomy is too general”, so let's deep dive into some object-level arguments in the following section!
What does the end story of interpretability look like? That’s not clear at all.
This section is more technical. Feel free to skip it and go straight to "So far my best ToI for interp: Outreach" , or just read the "Enumerative safety" section, which is very important.
Of course, interpretability in deep learning seems inherently more feasible than neuroscience, because we can save all the activations, run the model arbitrarily slowly, and try causal modifications to understand what is happening, which allows much more control than an fMRI. But it seems to me that this is still not enough: we don't really know what we are aiming for and rely too much on serendipity. Are we aiming for:
Enumerative safety?
Enumerative safety, as Neel Nanda puts it, is the idea that we might be able to enumerate all features in a model and inspect this for features related to dangerous capabilities or intentions. I think this strategy is doomed from the start (from most important to less important):
Reverse engineering?
Reverse engineering is a classic example of interpretability, but I don't see a successful way forward. Would this be:
You can notice that “Enumerative safety” is often hidden behind the “reverse engineering” end story.
From the IOI paper. Understanding this diagram from 'Interpretability in the Wild' by Wang et al. 2022 is not essential for our discussion. Understanding the full circuit and the method used would require a three-hour video. Also, this analysis focuses on only a single token and involves numerous simplifications. For instance, while we attempt to explain why the token 'Mary' is preferred over 'John', we do not delve into why the model initially considers either 'Mary' or 'John'. Additionally, this analysis is based solely on GPT-2 small.
Indeed, this figure is quite terrifying. From "Causal scrubbing: results on induction heads", for a 2-layer model: after refining the hypothesis four times, they are able to restore 86% of the loss. But even for this simple task they say “we won’t end up reaching hypotheses that are fully specific or fully human-understandable, causal scrubbing will allow us to validate claims about which components and computations of the model are important.”
The fact that reverse engineering is already so difficult in the two toy examples above seems concerning to me.
Olah’s interpretability dream?
Or maybe interp is just an exploration driven by curiosity waiting for serendipity?
Overall, I am skeptical about Anthropic's use of the dictionary learning approach to solve the superposition problem. While their negative results are interesting, and they are working on addressing conceptual difficulties around the concept of "feature" (as noted in their May update), I remain unconvinced about the effectiveness of this approach, even after reading their recent July updates, which still do not address my objections about enumerative safety.
One potential solution Olah suggests is automated research: "it does seem quite possible that the types of approaches […] will ultimately be insufficient, and interpretability may need to rely on AI automation". However, I believe that this kind of automation is potentially harmful [section Harmful].
This is still a developing story, and the papers published on Distill are always a great pleasure to read. However, I remain hesitant to bet on this approach.
Retargeting the search?
Or maybe interp could be useful for retargeting the search? This idea suggests that if we find a goal in a system, we can simply change the system's goal and redirect it towards a better goal.
I think this is a promising quest, even if there are still difficulties:
Relaxed adversarial training?
The TL;DR is that relaxed adversarial training is the same as adversarial training, but instead of creating adversarial inputs to test the network, we create adversarial latent vectors. This could be useful because creating realistic adversarial inputs is a bottleneck in adversarial training. [More explanations here]
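A minimal sketch of the idea (my own toy gloss on the proposal, not the exact procedure from the linked posts): take the activations at some intermediate layer, search for a small perturbation of those activations that makes the model fail, then train the model to behave well even under that perturbation.

```python
# Toy sketch of latent adversarial training on a small classifier:
# the adversarial search happens over an intermediate activation, not the input.
import torch
import torch.nn as nn

head = nn.Sequential(nn.Linear(10, 32), nn.ReLU())   # input -> latent
tail = nn.Linear(32, 2)                               # latent -> logits
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.Adam(list(head.parameters()) + list(tail.parameters()), lr=1e-3)

x, y = torch.randn(64, 10), torch.randint(0, 2, (64,))

for _ in range(100):
    h = head(x)
    delta = torch.zeros_like(h, requires_grad=True)
    for _ in range(5):                                # inner loop: find a harmful latent perturbation
        adv_loss = loss_fn(tail(h.detach() + delta), y)
        grad, = torch.autograd.grad(adv_loss, delta)
        delta = (delta + 0.05 * grad.sign()).clamp(-0.5, 0.5).detach().requires_grad_(True)
    opt.zero_grad()
    loss_fn(tail(h + delta.detach()), y).backward()   # outer step: be robust to that perturbation
    opt.step()
```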
This seems valid but very hard, and there are still significant conceptual difficulties. A concrete approach, Latent Adversarial Training, has been proposed, and seems to be promising but:
The exact procedure described in Latent Adversarial Training hasn't been tested, as far as I know. So we should probably work on it.[8]
Microscope AI?
Maybe Microscope AI, i.e. maybe we could directly use the AI's world model without having to understand everything. A Microscope AI is an AI that is not used for inference and never deployed; instead, we only look at its internal activations or weights. My definition would be something like: we can run forward passes, but only halfway through the model.
A short case study of the Discovering Latent Knowledge technique, which extracts knowledge from models by probing, is included in the appendix.
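To make "run forward passes, but only halfway through the model" concrete, here is a minimal sketch of capturing intermediate activations with a forward hook and never using the model's output; reading internal states rather than outputs is also the general shape of probing-based approaches like the one in that case study (names and shapes are illustrative):

```python
# Sketch of "microscope" use: never sample from the model, only read its
# internal activations. `model` here is a stand-in for a real network.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(16, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),           # "halfway" point we want to inspect
    nn.Linear(64, 10),
)

captured = {}
def hook(module, inputs, output):
    captured["halfway"] = output.detach()   # save activations instead of using the output

handle = model[3].register_forward_hook(hook)
with torch.no_grad():
    _ = model(torch.randn(8, 16))           # the final prediction is discarded on purpose
handle.remove()

activations = captured["halfway"]           # [8, 64]; feed these to probes / analysis
print(activations.shape)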
So far my best ToI for interp: Outreach?
1. Interp for Nerd Sniping/honeypot?
2. Honorable mentions:
Preventive measures against Deception seem much more workable
TL;DR: It would be more desirable to aim for a world where we wouldn't need to delve into the internal components of models. Prevention is better than cure, or at least, it is a neglected strategy.
I don't believe interpretability is the most promising method for monitoring near-human-level AGI. Here's why:
I don't think neural networks will be able to take over in a single forward pass. Models will probably reason in English and will have translucent thoughts (we could even hope for transparent thoughts; translucence is the scary regime where steganography is possible). In order to devise an effective plan to destroy the world, the first human-level AGI will need to somewhat externalize its reasoning via chain-of-thought (CoT), and it seems to me much easier to monitor this chain of thought than to probe into the model's internals. We can probably rely on The Translucent Thoughts Hypotheses (Fabien, the author, gives them a 20% chance, but I think it will be more like 60%). I also think that we have to do everything we can to pass regulations and place ourselves in a world where those hypotheses are true.
For example, magic is possible, and the fact that human brains can be manipulated so easily by magic is a great danger. So we should probably try to make this magic unlikely, for example with process-based training by training AI Systems to reason only step-by-step (the process-based training approach is summarized in this AGISF distillation).
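As a toy illustration of what the chain-of-thought monitoring proposal above amounts to in practice (purely schematic; the generator and the monitor here are trivial stand-ins for separate models):

```python
# Toy sketch of chain-of-thought oversight: the agent must write its reasoning
# in a scratchpad, and a separate monitor reviews that text before any action
# is executed. Both functions are hypothetical stand-ins.
def generate_with_scratchpad(prompt: str) -> tuple[str, str]:
    # Stand-in for an LLM call that returns (reasoning, proposed_action).
    return ("I should check the calendar before booking.", "book_meeting(tuesday)")

def monitor_flags(reasoning: str) -> bool:
    # Stand-in for a second model (or rules) auditing the externalized reasoning.
    suspicious = ["hide this from", "the overseer won't notice", "disable logging"]
    return any(s in reasoning.lower() for s in suspicious)

reasoning, action = generate_with_scratchpad("Schedule my meetings for next week.")
if monitor_flags(reasoning):
    print("blocked: escalate to human review")
else:
    print(f"executing: {action}")
```

The hard part, of course, is exactly the translucency assumption above: this only works if the scratchpad faithfully reflects the model's reasoning.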
Steering the world towards transparency
Thinking ahead about the training setup is very important and possibly dominates interp considerations, because if the concepts manipulated by the models are totally alien, it will be much, much harder to provide oversight. And it is much easier to align chatbots pretrained on human-generated text than to align AGIs trained from scratch with RL, evolutionary methods, etc.
If this is the case, we should focus on various aspects and important recommendations as detailed by Fabien here. Here are some highlights and comments:
I think the strategy should be: let's target a world where deception is unlikely. (I'm not saying we should make plans that work conditional on deception being unlikely by default, but rather that we should try to steer AGI/the world towards a place where deception is unlikely.) I believe there are multiple ways to think about and address this problem, and much more technical research is needed here, starting from Conditioning Predictive Models: Risks and Strategies.
Cognitive Emulations - Explainability By Design
If interpretability was really a bottleneck, we could use cognitive emulation, which, in my opinion, allows way better explainability and transparency than interpretability will ever get us.
My understanding of cognitive emulation: Emulating GPT-4 using LLMs like GPT-3 as different submodules that send messages written in plain English to each other before outputting the next token. If the neural network had deceptive thoughts, we could see them in these intermediate messages.
Some caveats are in the section Cognitive Emulation of the appendix.
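To make the submodule picture concrete, here is a toy sketch of what such an emulation loop could look like (my own illustration of the description above, not Conjecture's actual CoEm design; `call_llm` is a hypothetical stand-in for calls to smaller models):

```python
# Toy sketch of the "submodules passing English messages" picture.
transcript = []

def call_llm(role: str, prompt: str) -> str:
    # Stand-in: in a real system this would query a separate language model.
    reply = f"[{role}] response to: {prompt[:40]}..."
    transcript.append((role, prompt, reply))      # every message is legible and logged
    return reply

def cognitive_emulation(task: str) -> str:
    plan = call_llm("planner", f"Break this task into steps: {task}")
    facts = call_llm("retriever", f"List facts needed for: {plan}")
    draft = call_llm("writer", f"Using {facts}, carry out: {plan}")
    check = call_llm("critic", f"Point out errors or deception in: {draft}")
    return call_llm("writer", f"Revise {draft} given {check}")

cognitive_emulation("summarize this quarter's sales report")
for role, _, reply in transcript:                 # the audit trail an overseer would read
    print(role, "->", reply)
```

The transparency claim is that deceptive intentions would have to show up somewhere in this logged, plain-English transcript.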
Interpretability May Be Overall Harmful
(Note that some of the following points are not specific to interp, but I think they apply particularly well to interp.)
False sense of control:
The world is not coordinated enough for public interpretability research:
Thus the list of "theories of impact" for interpretability should not simply be a list of benefits. It's important to explain why these benefits outweigh the possible negative impacts, as well as how each theory can save time and mitigate any new risks that may arise.
The concrete application of the logit lens so far has not been an oversight system for deception, but rather capabilities work to accelerate inference, as in this paper. (Note that the paper does not cite the logit lens, but relies on a very similar method.)
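For reference, the logit lens itself is a very simple readout: take the residual stream at an intermediate layer and decode it with the model's final layer norm and unembedding, as if the network had stopped there. A minimal sketch with illustrative shapes and stand-in weights:

```python
# Minimal logit-lens sketch: decode an intermediate residual stream state
# with the final layer norm and unembedding. Shapes and modules are stand-ins
# for a real trained model's components.
import torch
import torch.nn as nn

d_model, vocab = 512, 50257
final_ln = nn.LayerNorm(d_model)
unembed = nn.Linear(d_model, vocab, bias=False)

resid_mid = torch.randn(1, 10, d_model)           # residual stream after some layer L
logits_at_L = unembed(final_ln(resid_mid))        # "what would the model say if it stopped here?"
print(logits_at_L.argmax(-1))                     # per-position top tokens under the lens
```

The observation above is that, so far, this kind of early readout has mostly been picked up for inference-speed work (early exiting) rather than for overseeing deception.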
Outside view: The proportion of junior researchers doing interp rather than other technical work is too high
It seems to me that many people start alignment research as follows:
"Not putting all your eggs in one basket" seems more robust considering our uncertainty, and there are more promising ways to reduce x-risk per unit of effort (to come in a future post, mostly through helping/doing governance). I would rather see a more diverse ecosystem of people trying to reduce risks. More on this in section Technical Agendas with better ToI.
If you ask me whether interp is also overrepresented among senior researchers, I'm a bit less confident. Interp also seems to be a significant portion of the pie: this year, while Conjecture and Redwood have partially pivoted, there are new active interp teams at Apollo, DeepMind, and OpenAI, and still at Anthropic. I think I would particularly critique DeepMind and OpenAI's interpretability works, as I don't see how this reduces risks more than other works that they could be doing, and I'd appreciate a written plan of what they expect to achieve.
Even if we completely solve interp, we are still in danger
No one has ever claimed otherwise, but it's worth remembering to get the big picture. From stronger arguments to weaker ones:
That is why focusing on coordination is crucial! There is a level of coordination above which we don’t die - there is no such threshold for interpretability. We currently live in a world where coordination is way more valuable than interpretability techniques. So let’s not forget that non-alignment aspects of AI safety are key! AI alignment is only a subset of AI safety! (I’m planning to deep-dive more into this in a following post).
A version of this argument applies to "alignment" in general and not just interp, and those considerations will heavily influence my recommendations for technical agendas.
Technical Agendas with better ToI
Interp is not such a bad egg, but opportunity costs can be huge (especially for researchers working in big labs).
I’m not saying we should stop doing technical work. Here's a list of technical projects that I consider promising (though I won't argue much for these alternatives here):
In short, my agenda is "Slow Capabilities through a safety culture", which I believe is robustly beneficial, even though it may be difficult. I want to help humanity understand that we are not yet ready to align AIs. Let's wait a couple of decades, then reconsider.
And if we really have to build AGIs and align AIs, it seems to me that it is more desirable to aim for a world where we don't need to probe into the internals of models. Again, prevention is better than cure.
Conclusion
I have argued against various theories of impact of interpretability and proposed some alternatives. I believe working back from the different risk scenarios and red-teaming the theories of impact gives us better clarity and a better chance at doing what matters. Again, I hope this document opens discussions, so feel free to respond in parts. There should probably be a non-zero number of researchers working on interpretability; this isn't intended as an attack, but hopefully it prompts more careful analysis and comparison to other theories of impact.
We already know some broad lessons, and we already have a general idea of which worlds will be more or less dangerous. Some ML researchers in top labs aren't even aware, or don't acknowledge, that AGI is dangerous, or that connecting models to the internet, encouraging agency, doing RL, and maximizing metrics aren't safe in the limit.
Until civilization catches up to these basic lessons, we should avoid playing with fire, and should try to slow down the development of AGIs as much as possible, or at least steer towards worlds where it’s done only by extremely cautious and competent actors.
Perhaps the main problem I have with interp is that it implicitly reinforces the narrative that we must build powerful, dangerous AIs and then align them. For X-risks, prevention is better than cure. Let's not build powerful and dangerous AIs; let's aspire to make them safe by design.
Appendix
Related works
There is a vast academic literature on the virtues and academic critiques of interpretability (see this page for plenty of references), but relatively little holistic reflection on interpretability as a strategy to reduce existential risks.
The most important articles presenting arguments for interpretability:
Against interpretability
The Engineer’s Interpretability Sequence
I originally began my investigation by rereading “The Engineer’s Interpretability Sequence”, in which Stephen Casper raises many good critiques of interpretability research, and this was really illuminating.
Interpretability tools lack widespread use by practitioners in real applications.
Broad critiques: He explains that interp generally does not scale, relies too much on humans, and fails to combine techniques. He also criticizes mech interp, which may not be the best way of doing interp, because of cherry-picking, a focus on toy examples, lack of scalability, and failure to do useful things.
Methodological problems:
Cognitive Emulations - Explainability By Design
If interpretability was really a bottleneck, we could use cognitive emulation, which, in my opinion, allows way better explainability/transparency than interpretability will ever get us.
My understanding of cognitive emulation: Emulating GPT-4 using LLMs like GPT-3 as different submodules that send messages written in plain English to each other before outputting the next token. If the neural network had deceptive thoughts, we could see them in these intermediate messages.
Interpretability is needed only if one module of the cognitive emulation is deceptive. Then maybe you can use interpretability to explain this module. But I'm confident models will need Chain of Thought for the foreseeable future. So interpretability is not 'urgent' research.
If we don’t use Reinforcement learning on this cognitive emulation, and if we only prompt in English the different submodules, the cognitive emulation won’t create whole “new languages” or be able to obfuscate a lot of its thoughts.
Yes, there are problems with this proposal:
But we could further reduce these risks with:
Spicy: However, cognitive emulation will quite likely be an engineering nightmare, facing significant robustness issues that are always present in small models. The alignment tax will be higher than for end-to-end systems, making it unlikely that we will ever use this technology. The bottleneck is probably not interp, but rather an ecosystem of preventive safety measures and a safety culture. Connor Leahy, CEO of Conjecture, explaining the difficulties of the problem during interviews and pushing towards a safety culture, is plausibly more impactful than the entire CoEm technical agenda.
Detailed Counter Answers to Neel’s list
Here is Neel’s Longlist of Theories of Impact for Interpretability with critiques for each theory. Theories proposed by Neel are displayed in italics, whereas my critiques are rendered in standard font.
Case study of some cool interp papers
This section is more technical.
Stephen Casper lists a bunch of impressive interpretability papers, as of February 2023. Let's try to investigate whether these papers could be used in the future to reduce risks. For each article, I mention the corresponding end story; the critique of that end story applies to the article.
Bau et al. (2018)
Bau et al. (2018): Reverse engineer and repurpose a GAN for controllable image generation.
Ghorbani et al. (2020)
Ghorbani et al. (2020): Identify and successfully ablate neurons responsible for biases and adversarial vulnerabilities.
Burns et al. (2022)
Burns et al. (2022): Identify directions in latent space that were predictive of a language model saying false things.
Casper et al. (2022)
Casper et al. (2022): Identify hundreds of interpretable copy/paste attacks.
Ziegler et al. (2022)
Ziegler et al. (2022): Debug a model well enough to greatly reduce its rate of misclassification in a high-stakes type of setting.
Is feature visualization useful? Some findings suggest no: Red Teaming Deep Neural Networks with Feature Synthesis Tools.
GradCam: Maybe this paper? But this is still academic work.
I have organized two hackathons centered around the topic of spurious correlations. I strongly nudged participants towards using interp, but unfortunately, nobody used it... Yes, this claim is a bit weak, but it still indicates a real phenomenon; see [section Lack of real applications].
Note: I am not making any claims about ex-ante interp (also known as intrinsic interp), which has not been so far able to predict the future system either.
Other weaker difficulties for auditing deception with interp: This is already too risky and Prevention is better than cure. 1) Moloch may still kill us:"auditing a trained model" does not have a great story for wins. Like, either you find that the model is fine (in which case it would have been fine if you skipped the auditing) or you find that the model will kill you (in which case you don't deploy your AI system, and someone else destroys the world instead). […] a capable lab would accidentally destroy the world because they would be trying the same approach but either not have those interpretability tools or not be careful enough to use them to check their trained model as well?” [Source Rohin Shah]. 2) We probably won’t be competent enough to fix our mistake: “in order for auditing the model to help (directly), you have to actually be pretty confident in your ability to understand and fix your mistakes if you find one. It's not like getting a coin to land Heads by flipping it again if it lands Tails - different AGI projects are not independent random variables, if you don't get good results the first time you won't get good results the next time unless you understand what happened. This means that auditing trained models isn't really appropriate for the middle of the skill curve.” [Source Charlie Steiner].
From “Conditioning Generative Models. “Suppose you had a language model that you knew was in fact a good generative model of the world and that this property continued to hold regardless of what you conditioned it on. Furthermore, suppose you had some prompt that described some agent for the language model to simulate (Alice) that in practice resulted in aligned-looking outputs. Is there a way we could use different conditionals to get at whether or not Alice was deceptive (e.g. prompt the model with “DeepMind develops perfect transparency tools and provides an opportunity for deceptive models to come clean and receive a prize before they’re discovered.”)?”
For example, what happens if you prompt a deceptive model with a joyous prompt, prompt the same deceptive model with a sad prompt, and then take the difference? Do you obtain a "joyous deceptive" model?
But at the same time, we could be pessimistic, because this good idea has been out in the wild since Christiano described it in 2019. So either this idea does not work and we have not heard about it, or the community has failed to recognize a pretty simple good idea.
Causal scrubbing could be a good way for evaluating interp techniques using something other than intuition. However, this is only suitable for localization assessment and does not measure how understandable the system is for humans.
“I was previously pretty dubious about interpretability results leading to capabilities advances. I've only really seen two papers which did this for LMs and they came from the same lab in the past few months. It seemed to me like most of the advances in modern ML (other than scale) came from people tinkering with architectures and seeing which modifications increased performance. But in a conversation with Oliver Habryka and others, it was brought up that as AI models are getting larger and more expensive, this tinkering will get more difficult and expensive. This might cause researchers to look for additional places for capabilities insights, and one of the obvious places to find such insights might be interpretability research.” from Peter Barnett.
Not quite! Hypotheses 4 (and 2?) are missing. Thanks to Diego Dorn for presenting this fun concept to me.
This excludes the governance hackathon, though, this is only from the technical ones. Source: Esben Kran.