Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

TL;DR: In this post, I want to argue that Interpretability & Transparency tools have a defender’s advantage if they are used correctly, i.e. they improve alignment much more than they improve capabilities and therefore mitigate the risks of dual use. I draw parallels from biosecurity researchers, who have thought about the risks of dual use and defender’s advantages in more detail; I think that the AI safety community can learn a lot from them. Lastly, I want to point out that not all interpretability tools have a clear defender’s advantage and some interpretability research might still carry a lot of risk when used incorrectly.

I’d like to thank Lee Sharkey and Simon Grimm for their feedback on this post.


Most technology is dual-use in some way--a knife can be used as a household appliance or as a weapon. However, different technologies have different propensities to be used for good or bad, e.g. more research into walls will likely benefit the defender more than the attacker while more research into the capabilities of viruses benefits attackers more than defenders. 

I feel like we, the AI safety community, have not thought enough about which approaches have a clear defender’s advantage or how we could steer existing approaches to have more of a defender's advantage. To my (very limited) understanding, the biosecurity community has thought a bit more about these kinds of dual-use trade-offs. Therefore, we could probably learn some things from them. 

In this post, I want to briefly look at some of the possible lessons from biosecurity and see if we can translate them to AI safety. Then I want to argue why interpretability is one of the approaches that plausibly has a defender’s advantage. 

I’m certainly not the first person to have come to the conclusion that interpretability is important for alignment. Chris Olah has made the case for interpretability for years. Neel Nanda has provided a long theory of impacts of interpretability research. Quintin Pope has made the case for optimism about interpretability. Evan Hubinger has provided 11 proposals to build safe AI that are all essentially something+interpretability, has developed an interpretability tech tree and summarized transformer circuits. ARC is working on ELK (and related topics) that certainly read to me as if they are intended to prevent deceptive alignment. There are many further good posts on aspects of interpretability (see e.g. here, here, here, or here).

The reason why I add this post to the long list of posts arguing for the importance of interpretability is that I feel like the “defender’s advantage” framework allows for an easy way to decide which kind of interpretability research will help more with alignment than with capabilities and thus alleviates one major concern that some people have against it (personal conversations, not sure if someone wrote this down). 

Lessons from Biosecurity

Most of the following comes from personal discussions with biosecurity researchers or podcasts such as the “Hear This Idea” interview with Kevin Esvelt and Jonas Sandbrink. I’m not a biosecurity researcher myself and the following is likely to lack nuance.

  1. Gain-of-function (enhancement of potential pandemic pathogens) = bad: More specifically, approaches that require us to build a new capability in order to learn how to safeguard against it make offensive scaling easier than defensive scaling, i.e. the new capability enables the attacker to do more new things than the defender. Firstly, you have created a deadly virus with certainty but the development of the vaccine is uncertain--you stacked the odds against yourself. Secondly, even if your vaccine (or other defense) is successful, it’s likely easy to modify the new deadly virus in ways that circumvent this defense. Additionally, even if a bad actor just uses the exact virus you created, it’s not clear that we’d be able to roll out the new vaccine fast enough. In some sense, the strategy is too specifically tailored to the problem you just created yourself and thus the potential damage is larger than the potential gains.
  2. Broad spectrum vaccines = good: We might be able to design vaccines against entire families of viruses, e.g. all coronaviruses rather than just Covid19 or a specific wave of Covid19. Broad spectrum vaccines have a favorable risk profile: firstly, they don’t require the identification of highly pathogenic viruses, and secondly, they guard against a swath of different viruses within the same group of viruses. Therefore, they provide broad defence without requiring detailed knowledge or experiments that could go wrong or be misused. Thus they have a more robust risk profile (though some defences are even more robust, such as PPE, pandemic shelters, or ventilation).
  3. Preparation & rapid deployment = good: A rapidly spreading pandemic can realistically infect large parts of the world’s population within 100 days. This is likely not enough time to understand the virus and develop, produce and distribute a vaccine. Therefore, being well-prepared, e.g. with broad spectrum vaccines, large vaccine production facilities, large stockpiles of PPE, etc., likely decreases the damage done by the pandemic. At the same time, none of these measures is likely to increase the spread or enable active misuse of the virus, so they create a defender’s advantage.

In summary, a) some defensive tools do not require novel capabilities, e.g. broad spectrum vaccines, better PPE or better ventilation, and b) the knowledge of defensive insights can sometimes be used intentionally or accidentally to create more powerful offensive tools. Thus, we should keep the offensive-defensive scaling in mind when creating a new tool. 

I’m probably missing a lot of nuance and some important points but even these fairly general ideas can already be translated to AI safety--at least to some extent.

The Defender’s Advantage of Interpretability

For the following section, I will use Interpretability in a very broad sense, i.e. including mechanistic interpretability but also more high-level approaches that aim to understand NNs (sometimes called “Science of DL”).

My reasons to think that interpretability has a defender’s advantage include:

  1. New interpretability tools improve most alignment research but not most capabilities: If someone develops a tool that makes interpreting the neural network really easy, this would immediately, without further work, improve alignment because we could directly act on new information, e.g. turn off dangerous AIs (if this is still possible). While some of this information could be used to improve the capabilities of the AI, it requires further work to do that (i.e. you still have to develop the capability improvements). Due to interpretability tools, this work might be easier but it is still more costly than the immediate insights for alignment. 
    However, it is important to point out that this advantage differs between interpretability applications. Understanding one specific phenomenon really well might have nearly no defender’s advantage while very general interpretability methods like circuits have a clearer defender’s advantage.
  2. Does not require new capabilities: Interpretability tools can usually be applied to all levels of capabilities (this might not hold true for highly capable AIs if they want to hide information), e.g. we can use them on small MLPs and on large LLMs. Other alignment approaches sometimes require a specific level of capabilities to work, such as OpenAI’s approach to automate alignment research or the translator head in ELK.
  3. Does not require deployment: Interpretability tools can be used during training, e.g. to monitor the emergence of dangerous behavior. We might also be able to run only subparts of the network to understand them which removes the risk of running the entire network or deploying it to detect a specific capability. 
  4. Is fairly general: In theory, we should be able to develop interpretability tools for all kinds of DL systems and learnings from one likely translate to others, e.g. the idea of circuits translated from CNNs to LLMs. 
  5. Preparation & rapid deployment: Interpretability tools don’t have a strong preparation advantage because you need to have a network in order to interpret it. However, if we are ever able to scale and automate interpretability tools to a level where we can interpret networks very quickly, then rapid deployment of interpretability tools might be possible. If the deployment is rapid enough, we could use interpretability tools during training to monitor and react to the formation of potentially dangerous circuits. I’m uncertain though if this will be realistic in the near future.
  6. Level of necessity: In general, I feel like interpretability is approximately necessary for alignment but not for capabilities (echoing the sentiment of Evan Hubinger). With capabilities, you can try new things and see if they improve your desired metric. Understanding the network better might help you to come up with an idea to improve the metric but it is by no means necessary. With alignment, on the other hand, I don’t really see how we get around “understanding the system in great detail” in the long run. We might be able to align a network with adversarial training but we would still “want to double-check” if it actually learned the right concept. Furthermore, if we think that deceptive alignment is where the big risks come from, interpretability (or related ways to understand the network such as ELK) is the most straightforward way to defend.


I argue that many forms of interpretability and transparency have a defender’s advantage, i.e. that they are more likely to help with alignment than with developing new capabilities. However, results from interpretability investigations should still be handled with care. Specific use cases and types of interpretability can still carry substantial risk of increasing capabilities without meaningfully increasing alignment. For example, I think there is some chance that Neel Nanda’s mechanistic analysis of grokking will lead to capability improvements in the long run. I still think it was correct to publish these results on balance but one should think about possible harms beforehand (and I expect Neel to have done that). I expect the situations where the defender's advantage doesn’t hold anymore to be “we understood the system well enough to make it better but not really why it got better” similar to how our current understanding of scaling laws allows us to build more capable models but we don’t really understand why. 

To offer a simple solution, there is always the option to share results only with a select group of people rather than publishing them, or to do research that is private by default.

Furthermore, I want to encourage other AI safety researchers to apply the defender’s advantage framework more generally and pursue research that has a high chance to help with alignment while not increasing capabilities unnecessarily.


4 comments

For example, I think there is some chance that Neel Nanda’s mechanistic analysis of grokking will lead to capability improvements in the long run.

I'm curious if you have a particular concern in mind here?

My personal take is that this is the kind of interpretability work where I'm least concerned about it leading to capabilities improvements, since it's very specific to toy models and analysing deep learning puzzles, and pretty far from the state-of-the-art frontier.

In a world where it does lead to advancements, my best guess is that it follows a pretty indirect and diffuse trajectory (e.g., it furthers science of deep learning studies which lead to new insights that let us build better models, or gets more people excited about interpretability, which leads to more research, some of which advances capabilities), which seems extremely hard to model. I'd guess the alignment benefits of the work are minor to moderate (definitely not the interpretability work I think is most relevant to pushing on reducing x-risk, but likely somewhat useful), and strongly outweigh this kind of concern about diffuse and hard-to-predict effects.

Just to be clear, I also think that your grokking work increases alignment much more than capabilities on balance. 

I think the way in which it increases capabilities would roughly look like this: "your insight on grokking is a key to understanding fast generalization better; other people build on this insight and then modify training; this improves the speed of learning and thus capabilities". 

I think your work is clearly net positive, I just wanted to use a concrete example in the post to show that there are trade-offs worth taking. 

Thank you so much, Marius, for writing this pertinent post! The question of whether a given interpretability tool will help us or hurt us in expectation is an extremely important one. 

The answer, however, differs on a situation-to-situation basis. The scientific benefit of an interpretability tool (more generally, of any information channel) is difficult to estimate a priori, but likely is tied to its informational efficiency. Roughly speaking, how much informational value can the interpretability tool/information channel yield per unit of bitrate?

The costs of an interpretability tool include the risk of capabilities gains (if AI capabilities orgs know about it). They also include the risk that a superintelligent AGI can use the interpretability tool/information channel for its own purposes, such as by hacking or socially engineering the alignment researchers (i.e., probably to escape into the Internet and then power-seek).

This perspective suggests that an interpretability tool/information channel will be disadvantageous for the defender if (1) the informational efficiency of the channel is low, (2) the interpreted AI has high capabilities (or the interpretability tool helps capabilities organizations increase the model's capabilities), and (3) its box setup has weak cybersecurity.

Keeping interpretability research off the Internet (such as via an encrypted or airgapped library that coordinates between AI researchers) is low-hanging fruit that would make dual-use interpretability tools more likely to be helpful to humanity rather than counterproductive.

To put it bluntly, once an interpretability tool (the "battle plan") is posted online, we should assume in the worst-case scenario that the battle plan has just been given to the superintelligent AGI. First, the AGI will be misaligned in the worst-case scenario, and this scenario may in fact occur with high probability. Second, the likely misaligned AGI will probably be trained on the whole Internet, and thus will already know how the tool works, what we expect to see from it, and how to manipulate the tool for the AGI's own purposes.

As for the six proposed arguments why "interpretability has a defenders advantage," I am uncertain that arguments 1-5 will hold true for the interpretability tools that will be ready by the time AGI emerges. I agree with argument 6, but in a war against misalignment, I wouldn't unconditionally assume that any battle plan which cannot guarantee a complete victory is a bad plan. In fact, a plan that tries to guarantee a complete victory but makes the outcome worse in expectation is probably a bad plan. In contrast, a plan that cannot guarantee a complete victory but makes the outcome better in expectation is probably a good plan.

Personally, I am most optimistic about interpretability tools of high informational efficiency that have not yet been posted on the Internet. We should try really hard to develop and use high-quality interpretability tools. However, using a high quantity of low-quality (or publicly posted) interpretability tools may actually decrease the odds of human survival.

Given that the audience of this post has signalled mixed responses to your comment, and I'm confused as to why (because your basic argument makes sense to me), and that no one has replied to you, here's an attempt to understand this situation.

The core thesis of Marius' argument, it seems, is the claim that, given a marginal increase in interpretability research, the marginal cost of aligning an AI model is less than that of increasing SOTA AI model capabilities. He refers to biorisk research arguments to imply that a similar situation arises in alignment research.

You claim, however, that this isn't true broadly speaking, since what actually matters is the amount of information we get from an interpretability tool per bit of information transferred.

Marius' threat model is alignment research also increasing capabilities and therefore shortening timelines. Your threat model seems to be that of the uninhibited use of interpretability tools resulting in AI researchers (and by extension, the world) being taken over by a sufficiently capable AI.

If this is the case, then it seems that both of you are talking past each other, and the readers' responses (or the lack thereof) make sense.