This is the second post in the sequence “Interpretability Research for the Most Important Century”. The first post, which introduces the sequence, defines several terms, and provides a comparison to existing works, can be found here: Introduction to the sequence: Interpretability Research for the Most Important Century.
This post explores the extent to which interpretability is relevant to the hardest, most important parts of the AI alignment problem (property #1 of High-leverage Alignment Research).
First, I give an overview of the four important parts of the alignment problem (following Hubinger): outer alignment, inner alignment, training competitiveness and performance competitiveness. Next I discuss which of them is “hardest”, taking the position that it is inner alignment (if you have to pick just one), and also that it’s hard to find alignment proposals which simultaneously address all four parts well.
Then, I move on to exploring how interpretability could impact these four parts of alignment. Our primary vehicle for this exploration involves imagining and analyzing seven best-case scenarios for interpretability research. Each of these scenarios represents a possible endgame story for technical alignment, hinging on one or more potential major breakthroughs in interpretability research. The scenarios’ impacts on alignment vary, but usually involve solving inner alignment to some degree, and then indirectly benefiting outer alignment and performance competitiveness; impacts on training competitiveness are more mixed.
Finally, I discuss the likelihood that interpretability research could contribute to unknown solutions to the alignment problem. This includes examining interpretability’s potential to lead to breakthroughs in our basic understanding of neural networks and AI, deconfusion research and paths to solving alignment that are difficult to predict or otherwise not captured by the seven specific scenarios analyzed.
Tips for navigating this long post! If you get lost scrolling through this post on mobile, consider reading on desktop for two reasons: 1) To take advantage of LessWrong's convenient linked outline feature that appears in the sidebar, and 2) To be able to glance at the footnotes and posts that I link to just by hovering over them.
Lots of people greatly improved this post by providing insightful discussions, critical points of view, editing suggestions and encouraging words both before and during its writing. Many thanks in particular to Joe Collman, Nick Turner, Eddie Kibicho, Donald Hobson, Logan Riggs Smith, Ryan Murphy, the EleutherAI Interpretability Reading Group, Justis Mills (along with LessWrong's amazing free editing service!) and Andrew McKnight for all their help.
Thanks also to the AGI Safety Fundamentals Curriculum, which is an excellent course I learned a great deal from leading up to writing this post, and for which I started this sequence as my capstone project.
After several days of research and deliberation, I concluded that the most important parts of alignment are well-stated in Hubinger (2020):
Even though Evan Hubinger later proposed training stories as a more general framework, I still find thinking about these four components highly useful for many scenarios, even if they don’t neatly apply to a few proposed alignment techniques. So I’ll consider these to be a good definition of the important parts of AI alignment.
But which one of these four parts is the “hardest”? Well, today there are definitely many proposals which look promising for achieving the two alignment parts (#1 and #2) but seem questionable in one or both of the competitiveness parts (#3 and #4). For example, Microscope AI. Conversely, there are some approaches which seem competitive but not aligned (missing #1 and/or #2). For example, reinforcement learning using a hand-coded specification, and without any interpretability tools to guard against inner misalignment.
However, another thing I observe is that many proposals currently seem to be bottlenecked by #2, inner alignment. For example, in Hubinger (2020), none of the 11 proposals presented could be inner aligned with any modicum of confidence using technology that exists today.
So, I’ll be operating as though the hardest part of alignment is inner alignment. However, we’ll still pay attention to the other three components, because it’s also difficult to find a proposal which excels at all four of the important parts of alignment simultaneously.
Interpretability cannot be a complete alignment solution in isolation, as it must always be paired with another alignment proposal or AI design. I used to think this made interpretability somehow secondary or expendable.
But the more I have read about various alignment approaches, the more I’ve seen that one or another is stuck on a problem that interpretability could solve. It seems likely to me that interpretability is necessary, or at least could be instrumentally very valuable, toward solving alignment.
For example, if you look closely at Hubinger (2020), every single one of the 11 proposals relies on transparency tools in order to become viable.
So even though interpretability cannot be an alignment solution in isolation, as we’ll see, its advancement does have the potential to solve alignment. This is because in several different scenarios which we’ll examine below, advanced interpretability has large positive impacts on some of alignment components #1-4 listed above.
Usually this involves interpretability being able to solve all or part of inner alignment for some techniques. Its apparent benefits on outer alignment and performance competitiveness are usually indirect, in the form of addressing inner alignment problems for one or more techniques that conceptually have good outer alignment properties or performance competitiveness, respectively. It’s worth noting that sometimes interpretability methods do put additional strain on training competitiveness.
We’ll examine this all much more closely in the Interpretability Scenarios with Alignment-Solving Potential section below.
This post largely assumes that we need to solve prosaic AI alignment. That is, I assume that transformative AI will come from scaled-up versions of systems not vastly different from today’s deep learning ML systems. Hence we mostly don’t consider non-prosaic AI designs. I also don’t make any attempt to address the embedded agency problem. (However, Alex Flint’s The ground of optimization, referenced later on, does seem to have bearing on this problem.)
There are important AI governance and strategy problems around coordination, and important misuse risks to consider if aligned advanced AI is actually developed. Neel Nanda’s list of interpretability impact theories also mentions several theories around setting norms or cultural shifts. I touch on some of these briefly in the scenarios below. However, I don’t make any attempt to cover these comprehensively. Primarily, in this sequence, I am exploring a world where technical research can drive us toward AI alignment, with the help of scaled up funding and talent resources as indicated in the Alignment Research Activities Question.
In attacking the Alignment Research Activities Question, Karnofsky (2022) suggests ‘visualizing the “best case”’ for each alignment research track examined—in the case we're examining, that means the best case for interpretability.
I think the nature of interpretability lends itself to multiple “best case” and “very good case” scenarios, perhaps more so than many other alignment research directions.
I tried to think of ambitious milestones for interpretability research that could produce game-changing outcomes for alignment. This is not an exhaustive list. Further investigation: Additional scenarios worth exploring discusses a few more potentially important scenarios, and even more may come to light as others read and respond to this post, and as we continue to learn more about AI and alignment. There are also a few scenarios I considered but decided to exclude from this section because I didn't find that any potential endgames for alignment followed directly from them (see Appendix 2: Other scenarios considered but lacked clear alignment-solving potential).
Some of these scenarios below may also be further developed as an answer to one of the other questions from Karnofsky (2022), i.e. "What’s an alignment result or product that would make sense to offer a $1 billion prize for?"
The list of scenarios progresses roughly from more ambitious/aspirational to more realistic/attainable, though in many cases it is difficult to say which would be harder to attain.
It is true that AI alignment research aims to protect us from worst-case scenarios. However, Karnofsky (2022) suggests, and I agree, that envisioning and analyzing best-case scenarios for each line of research is important to help us learn: “(a) which research tracks would be most valuable if they went well”, and “(b) what the largest gaps seem to be [in research] such that a new set of questions and experiments could be helpful.”
Next we’ll look at a few more background considerations about the scenarios, and then we’ll dive into the scenarios themselves.
In each of the scenarios below, I’ll discuss specific impacts we can expect from that scenario. In these impact sections, I’ll discuss general impacts on the four components of alignment presented above.
I also consider more in depth how each of these scenarios impacts several specific robustness and alignment techniques. To help keep the main text of this post from becoming too lengthy, I have placed this analysis in Appendix 1: Analysis of scenario impacts on specific robustness and alignment techniques.
I link to the relevant parts of this appendix analysis throughout the main scenarios analysis below. This appendix is incomplete but may be useful if you are looking for more concrete examples to clarify any of these scenarios.
In each of the scenarios, I’ll also discuss specific reasons to be optimistic or pessimistic about their possibility. But there are also reasons which apply generally to all interpretability research, including all of the scenarios considered below.
In the rest of this section, I'll go over those generally-applicable considerations, rather than duplicate them in every scenario.
This is the holy grail of interpretability research. In this scenario, the state of interpretability is so advanced that we can fully understand any artificial neural network in a reasonably short amount of time.
Neural networks are no longer opaque or mysterious. We effectively have comprehensive mind-reading abilities on any AI where we have access to both the model weights and our state of the art transparency tools.
Note for the impatient skeptic: If you're finding this scenario too far-fetched, don't abandon just yet! The scenarios after this one get significantly less "pie in the sky", though they're still quite ambitious. This is the most aspirational scenario for interpretability research I could think of, so I list it first. I do think it's not impossible and still useful to analyze. But if your impatience and skepticism are getting overwhelming, you are welcome to skip to Scenario 2: Reliable mesa-optimizer detection and precise goal read-offs.
What does it mean to “fully understand” a neural network? Chris Olah provides examples of 3 ways we could operationalize this concept in the Open Phil 2021 RFP:
In this scenario, we don’t necessarily achieve a full and timely understanding of everything happening inside of neural networks. But here, interpretability does advance to the state that it grants us two key abilities:
In the weak version of this scenario, this #1 ability amounts to detecting whether an AI is a goal-directed agent system according to Alex Flint’s The ground of optimization. But if it’s not, we won’t be able to distinguish whether it’s still a non-agentic optimizing system, or if it’s another kind of AI system. Also, we can only read off the terminal goal of the AI in the weak version.
In the strong version, we can tell whether it’s a goal-directed agent system or some kind of non-agentic optimizing system. If it’s a non-agentic optimizing system, #2 becomes the ability to determine the target configuration set of the optimizing system. For example, if the AI isn’t driven by a strong optimization process, but rather by a collection of heuristics, then the interpretability tools reveal precisely what those heuristics are to us. We also gain insight into all the AI’s instrumental goals, in addition to its terminal goal.
Because obfuscating neural activity may be an instrumentally convergent subgoal, the model should undergo these checks for mesa-optimization and for its goal content starting early in training.
This is similar to “Transparency for agent objectives”, proposed in Hubinger (2019).
In this scenario, reliable lie detection is developed, such that we can tell 100% of the time if an AI is lying through natural language.
One path for realizing this scenario is that after studying many neural networks in deceptive situations, we discover neural activity signatures that are reliably present when an AI is lying. I’ll refer to this as the “neural tell” for lying.
What might this neural tell look like? When an AI is being honest about things it knows or has seen, we’d expect to see activity in some part of its neural network corresponding to facts about the world. If an AI is lying, on the other hand, we might expect to see additional activity in some kind of “hypothetical world model” circuitry which might otherwise be activated if an AI is exploring something new or writing fiction, but not when it’s honestly discussing high-confidence beliefs that it holds.
It probably won’t be exactly like this. But the hope in this scenario is that there is some kind of reliably present neural activity when an AI is lying.
Here lying is considered to only be one class of deception that AIs may be capable of. But we only need to reliably detect lies rather than all kinds of deception in order to unlock valuable potential impacts, as we’ll see below.
This scenario does, of course, depend on the AI having natural language capabilities. It also assumes that advanced AI will have coherent beliefs.
This scenario is similar to the one from Neel Nanda’s list titled “4. Auditing for deception”.
Empirically honest AI
If we can train a model which always passes our lie detector test, then we have (empirically) honest AI.
With advanced honest AI, there are a number of powerful alignment-related abilities we gain. For example, we can ask the AI about what its goals or driving heuristics are, and be confident that any answers it provides are honest.
This does assume that the AI will know about what its goals or heuristics are, but that doesn’t seem like an unreasonable expectation for advanced AI. If the AI can’t articulate its high-level goals for some reason, we can ask it questions about how it would behave in hypothetical situations to start to build up a picture of its motivations and expected behavior (credit to Owain Evans’ Truthful and honest AI for this idea).
Just because our AI is honest, that doesn’t mean that it can’t make mistakes, be negligent or have inconsistent beliefs and motivations. The AI could provide misleading answers to questions, ones that are technically true but designed to misdirect us. For effectively using such an AI, very careful questioning might be necessary. We'd need to systematically ask it questions and follow-up questions from many different angles, checking critically for multiple interpretations of any word that could change the meanings of answers provided. This would be a tedious process, but it could probably be mostly automated.
Perhaps a larger concern, though, is that an AI under lie detection still may not be forthcoming about what it knows. We’ll explore this issue in the next section.
Myopia, goals and addressing the ELK in the room
Just because we have a reliable lie detector and an honest AI doesn’t mean it will tell us everything it knows. It could still refuse to answer questions sometimes, perhaps even on crucial matters. How can we reliably get the information we care about out of our AI? This is the Eliciting Latent Knowledge (ELK) problem.
Having a reliable lie detector doesn’t directly solve ELK, but it may provide the means for such a solution. This becomes easy to see using a technique such as approval-directed amplification, debate or market making (see Overview of techniques to be analyzed in Appendix 1 if you’re unfamiliar with any of these techniques).
I’ll use debate to make this concrete. Let’s imagine we ask our lie detector-verified honest AI about what its primary objective or terminal goal is. There are a few different ways that this could go:
We can observe a few interesting things about the above debates. One is that our reliable lie detector has turned out to be a reliable myopia verification tool. This is because the myopia debate above looks clearly different from either possibility of how the non-myopia debate goes. This means that the present scenario subsumes Scenario 4: Reliable myopia verification.
If our AI is myopic and under effective lie detection, then it will always have an incentive to answer the question honestly. So if you can train a myopic question-answering AI in this scenario, then you have a solution to ELK and you have also realized Scenario 2: Reliable mesa-optimizer detection and precise goal read-offs (strong version), having access to its precise terminal and instrumental goals.
We don’t know whether possibility 1 or 2 above will turn out to be what non-myopic debaters do. If it’s Possibility 1 on the terminal goal question, then we have effectively read off the AI’s goal and this scenario subsumes Scenario 2: Reliable mesa-optimizer detection and precise goal read-offs (weak version). If it’s always Possibility 1 for any question, then this scenario is also a solution to ELK.
But for non-myopic AI, Possibility 2 might very well happen at least some of the time. In this case we don’t have an ELK solution (can’t elicit any knowledge we want from the AI), and we also don’t know what its terminal goal is. We still have an AI where we can verify whether its statements are honest or not. However, having a long-term planning AI with unknown goals around makes me very nervous and seems rather risky. I would probably recommend retraining such a model until Possibility 1 or a myopic AI is attained.
A prerequisite for deceptive alignment is that an AI be non-myopic. That is, the AI must be concerned not just about the present training step and near-future, but also about a long-term future post-deployment.
In deceptive alignment, an AI is optimizing for an objective which is a proxy for the base objective of its training process. In addition to the proxy objective which is its terminal goal, it also has a model of the base objective and an awareness that it is in some kind of training process. The AI realizes that if it doesn’t score well on the base objective, then it will be modified. Acting rationally toward the instrumentally convergent goals of self-preservation and goal-content integrity, it pretends to be aligned to the base objective in the short term. This is because it’s anticipating a future when it will be deployed, at which point it can pursue its primary goal into the far future and obtain a very high total reward.
Along with scenarios 1-3, verifying that an AI is myopic is another promising route toward resolving deceptive alignment. Myopic agents may also have other beneficial properties. For example, since they don’t pursue instrumentally convergent sub-goals, they avoid many of the dangers of long-term planning agents. Approval-directed myopic agents are also easy to “course correct” since they’re only interested in a short-term reward; future approval can steer them in a different direction if needed: “While small errors in reward specification can incentivize catastrophic outcomes, small errors in approval feedback are unlikely to incentivize catastrophic outcomes.” (comment from Rohin Shah)
There are various approaches to myopia and how to limit an agent’s long-term planning. Two prominent ideas are per-step myopia and per-episode myopia:
These two types of myopia have some interesting tradeoffs, which I’ll consider in the “Expected impacts on alignment” section for this scenario below. However, Evan Hubinger makes a compelling argument for why per-step myopia will be easier to verify: “Unfortunately, in my opinion, per-episode myopia seems like a very difficult condition to enforce—once your agents are running multi-step optimization algorithms, how do you tell whether that optimization passes through the episode boundary or not? Enforcing per-step myopia, on the other hand, just requires detecting the existence of multi-step optimization, rather than its extent, which seems considerably easier.”
Hence I’ll branch this scenario into a strong and weak version like I’ve done in some of the previous scenarios:
There are various ways these kinds of verification might be achieved using interpretability. For example, it could involve locating and directly monitoring the time horizon of plans in the AI. Alternatively, it could be that specific circuitry encoding the model’s time preference is located and interpreted. (If an agent has an extreme time preference toward the near future, then they are effectively myopic.) There are probably other ways a useful myopic cognition can be encoded in neural circuits as well. More research can uncover those and help determine which are most promising. This scenario is meant to encapsulate any and all of these interpretability-based approaches for verifying myopia.
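To make the time-preference point a little more concrete, here is a minimal sketch. It assumes the agent values future reward with a standard exponential discount factor, which is an assumption made purely for illustration; this post does not claim that time preference would actually be encoded this way in a neural network. The function names and the sample discount values are hypothetical.

```python
def discount_weights(gamma: float, horizon: int) -> list[float]:
    """Weight an exponentially discounting agent places on reward k steps ahead.

    gamma is an assumed discount factor in [0, 1]; index 0 is the current step.
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must be in [0, 1]")
    return [gamma ** k for k in range(horizon)]


def weight_on_future(gamma: float, horizon: int) -> float:
    """Fraction of the agent's total weight that falls on steps after the current one."""
    weights = discount_weights(gamma, horizon)
    return sum(weights[1:]) / sum(weights)


if __name__ == "__main__":
    # Illustrative discount factors only; these are not measurements of any real system.
    for gamma in (0.99, 0.5, 0.01, 0.0):
        share = weight_on_future(gamma, horizon=10)
        print(f"gamma={gamma}: share of weight on future steps = {share:.4f}")
```

As the discount factor approaches zero, almost all of the weight lands on the current step, which is the sense in which an extreme near-term time preference behaves like per-step myopia. Under this framing, an interpretability tool that could read off something like the discount factor would be one hypothetical route to verifying myopia.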
This scenario is focused around the specific interpretability breakthroughs required to make relaxed adversarial training (see Hubinger 2019) effective. The need for interpretability in this context was presented in “The core problem: transparency” section of that last link, and it was also outlined in Beth Barnes’ Another list of theories of impact for interpretability.
Reposting Beth’s summary of this idea:
In order for this to work, you need to be searching over the right part of the model/model’s belief space. For instance, if you go too narrow and are just searching over ‘whether the model believes it’s raining today’, you might not find anything that triggers defection. However, if you search too broadly (for instance, ‘whether the model has gotten strong evidence that the humans want it to do a behaviour that looks like defection’), then you might find a case where the model defects that isn’t actually a sign you should be concerned in practice.”
Another example in the “computationally hard to produce” category that Paul Christiano and Hubinger use, which I find helpful, is the RSA-2048 example. A deceptively aligned model could be waiting to defect conditional on seeing a factorization of RSA-2048 in its inputs. This could be a good proxy for the model to use to know that it’s been deployed in the real world and is no longer in training, because the factorization of RSA-2048 is probably too computationally expensive for it to ever be provided as an input in training, but if the AI is deployed in the real world and around for a long time, it will probably encounter this factorization at some point.
So with this interpretability breakthrough, we’d be able to identify a part of the model which is “waiting to see factorization of RSA-2048” in its inputs. Relaxed adversarial training would search over the collection of such beliefs that are present in the model to ensure there are no conditional defection plans and rule out that a model is deceptively aligned.
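The asymmetry behind the RSA-2048 example can be sketched in a few lines. This is a toy illustration under stated assumptions: `TOY_MODULUS` is a small hypothetical stand-in for RSA-2048 so the code runs instantly, and `toy_policy` is an invented caricature of a conditionally defecting model, not a description of how a real model would implement such a trigger.

```python
# Toy stand-in for RSA-2048 (assumption): a small semiprime, 53 * 61.
# The real RSA-2048 modulus is a 2048-bit number with no publicly known factorization.
TOY_MODULUS = 3233


def is_nontrivial_factorization(p: int, q: int, modulus: int) -> bool:
    """Checking a claimed factorization is cheap: a single multiplication."""
    return 1 < p < modulus and 1 < q < modulus and p * q == modulus


def find_factorization_by_trial_division(modulus: int):
    """Producing a factorization by brute force; the cost grows with the size of the factors."""
    candidate = 2
    while candidate * candidate <= modulus:
        if modulus % candidate == 0:
            return candidate, modulus // candidate
        candidate += 1
    return None


def toy_policy(observed_pairs, modulus: int = TOY_MODULUS) -> str:
    """Hypothetical conditional defector: acts acceptably unless it sees a valid factorization."""
    for p, q in observed_pairs:
        if is_nontrivial_factorization(p, q, modulus):
            return "defect"
    return "cooperate"


if __name__ == "__main__":
    print(toy_policy([(2, 3), (10, 20)]))   # inputs that could plausibly appear in training
    print(toy_policy([(53, 61)]))           # input containing the trigger
```

The point of the sketch is that the trigger is trivial for the model to check but, at the real key size, infeasible for a training process to generate as an input. Sampling inputs will therefore not surface the defection, which is why this scenario calls for searching over the model's beliefs about its observations rather than over concrete inputs.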
This scenario subsumes Scenario 4: Reliable myopia verification, as relaxed adversarial training is one avenue toward achieving myopia verification, e.g. by regularly searching the AI’s beliefs about its observations and penalizing the presence of any which would cause it to veer into non-myopic reward pursuits.
Some alignment proposals, notably STEM AI, are based on the idea that there are dangers around having an AI model humans.
In this scenario, our interpretability has made great strides in detecting the presence of human modeling in a deep learning model. Hence by training against this, we can be confident that our AI will not be modeling humans in any significant way.
The strong version of this scenario allows reading specific details about the AI’s model of humans. A weaker version would be more like a simple binary detection of the presence of human modeling.
Expected impacts on alignment
Since this scenario primarily impacts STEM AI, much of this section includes quotes from the corresponding alignment components analysis of STEM AI from Hubinger (2020):
In this scenario, we find a way to reliably locate and interpret an AI’s beliefs about training and deployment distributions in its neural network.
There are a couple of interesting things we could do with such information:
Both of these approaches could produce an i.i.d. guarantee which has many desirable impacts we’ll examine in the next section.
This idea was previously articulated in the "Make use of M's x∼train model" part of Hubinger (2019).
You can find further details in the Specific technique impacts analysis for Scenario 7: Identify beliefs about training vs. deployment. Though please note that this is one of the more incomplete/rough analyses in Appendix 1.
The previous section considered interpretability's potential impacts on alignment by visualizing and analyzing seven specific scenarios. But what about important scenarios that are difficult to foresee, and which would really surprise us?
I believe this is another area which should be considered in evaluating any research activity's potential impact on alignment. In other words, what is the potential of a research activity to contribute to relevant basic research breakthroughs, deconfusing machine learning systems and the like?
For example, there is so little we understand about what it even means for an AI to "know" something or have "goals". Having a clearer understanding of such things could open up many possibilities for how to leverage them to the ends of AI alignment.
I expect that future research could invalidate some of the alignment techniques I analyzed throughout the scenarios and in Appendix 1, and there will be new ones proposed that we haven't considered here. But given the broad potential impacts we can see on the current landscape of alignment proposals, there is good reason to think interpretability will be valuable to future proposals as well.
Earlier, we mentioned a broad assumption in this post that we are in a world which depends on prosaic AI alignment. In this world, interpretability seems to be strong for the kind of basic research that benefits AI alignment. Certainly for increasing our basic understanding of neural networks, it will help to look inside them!
However, what if it turns out we're living in a non-prosaic AI world, where the important AGI or other transformative AI systems will be created using an approach very different from the deep learning neural networks of today? In this case, doing interpretability on present-day deep learning systems could be much less valuable for alignment.
We have the seven scenarios analyzed above, and Appendix 2 contains a few more I looked at, but which didn't have clear alignment-solving potential.
Below are some additional interpretability scenarios which I think may be as valuable as the main seven of this post, but I didn't have time to investigate them.
Collin Burns is working on this idea, which is summarized in Beth Barnes’ post Another list of theories of impact for interpretability. She mentions that there is no thorough writeup of the idea yet. Once there is, it would be good to evaluate whether it merits being added to the list of scenarios with alignment-solving potential considered in the present post.
From my basic understanding of the idea, it may end up being a means of achieving Scenario 3: Reliable lie detection, rather than a distinct scenario in itself. But I don’t claim to have a good understanding of the idea yet.
Scenario 3: Reliable lie detection discussed a possible “neural tell” for lying. What if there’s a neural tell for deceptive alignment?
For an AI engaged in deceptive alignment, we’d expect it to be consuming additional compute compared to being robustly aligned on the same objective, all other things equal. This is because deceptive alignment would require at least two processes: 1) optimizing the base objective and 2) searching for ways to defect. Whereas robust alignment and even other kinds of non-deceptive pseudo-alignment should only require process #1.
So another scenario that might have alignment-solving potential is one which specifically targets reliable detection of process #2. This may end up having considerable overlap with Scenario 5: Locate the AI’s beliefs about its observations, but more research is recommended.
From Hubinger (2019): “[W]hat facts about the world does [a model] M know that are going into producing its output? In particular, what implicit knowledge is M using that isn't represented in its explicit output?”
This is similar to Scenario 1: Full understanding of arbitrary neural networks, but may be different enough to warrant a separate scenario analysis.
From Hubinger (2019): “In particular, if there is some condition α such that M only behaves acceptably if α is true, then we want to be able to know that M 's output is highly dependent on α.”
This may be the same as Donald Hobson’s suggestion from a comment on an earlier draft of this post: “[To include a scenario] that highlights which training data is most influential in making its decisions. (So you can put it in a secure box. Tell it about it being in the box, and some details. Not give any problem where it needs to know this. Then if its decisions are ever influenced by it thinking about the design of box, shut it down)”
These proposals might be equivalent and/or largely covered by Scenario 2: Reliable mesa-optimizer detection and precise goal read-offs, but I’m not sure yet.
The section Interpretability Scenarios with Alignment-Solving Potential above provides a fairly thorough analysis of what the seven scenarios are, their expected impacts on alignment, and reasons to be optimistic and pessimistic about each one.
To more comprehensively evaluate interpretability for property #1 of High-leverage Alignment Research and as a target for large investments of capital and/or talent, it would be useful also to consider the paths and intermediate steps toward realizing each of these scenarios. We would like to have probabilities for the likelihood of being able to achieve each scenario and its intermediate steps as well. It may then be possible to consider all the scenario probabilities together to form an overall probability estimate of interpretability research going well, given enough funding and talent.
I am considering doing this research for a future post in this sequence. Part of why I haven’t done this yet is that while I received a lot of great feedback on the draft for this post, I imagine it might make sense to revise or update the list of scenarios based on the feedback that comes in after this is published. Probability estimates are quite sensitive to the specific details of a scenario, so it makes sense to wait until both the overall list of scenarios and the parameters of each scenario contained within it are fairly stable.
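As a rough sketch of how per-scenario estimates might eventually be combined, the snippet below multiplies the probabilities of a scenario's intermediate steps and then computes the chance that at least one scenario is realized. Everything here is an assumption for illustration: the scenario names and numbers are placeholders rather than estimates, and both functions assume independence, which is unlikely to hold given that several scenarios in this post subsume one another.

```python
from math import prod


def chain_probability(step_probabilities) -> float:
    """Probability that every intermediate step succeeds, assuming the steps are independent."""
    return prod(step_probabilities)


def at_least_one(scenario_probabilities) -> float:
    """Probability that at least one scenario is realized, assuming the scenarios are independent."""
    return 1.0 - prod(1.0 - p for p in scenario_probabilities)


# Placeholder numbers for illustration only; these are NOT estimates from this post.
placeholder_steps = {
    "scenario_A": [0.5, 0.4],
    "scenario_B": [0.3, 0.5, 0.5],
}

per_scenario = {name: chain_probability(steps) for name, steps in placeholder_steps.items()}
overall = at_least_one(per_scenario.values())

if __name__ == "__main__":
    for name, probability in per_scenario.items():
        print(f"{name}: {probability:.3f}")
    print(f"at least one scenario realized: {overall:.3f}")
```

A more careful treatment would model the dependencies between scenarios explicitly instead of assuming independence, and would only be worth doing once the list of scenarios and their parameters have stabilized.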
A lot of the scenarios above are written assuming perfection of some interpretability technique (perfect lie detection, reliable myopia verification, etc.). Is it possible to get sufficient benefits out of only partially realizing some of these scenarios? What about combinations of partial scenarios, e.g. good but imperfect lie detection (partial Scenario 3) combined with human modeling detection (Scenario 6)?
It would be valuable to know if there are visible paths to alignment having only partial progress towards the scenarios above, as that may be more achievable than realizing 100% reliability of these interpretability techniques.
Proposals #10 and #11 from Hubinger (2020) involve using a hybrid approach of amplification and RL. While Appendix 1: Analysis of scenario impacts on specific robustness and alignment techniques analyzes the impact of each scenario on many different techniques, this one wasn’t explored. But that was simply for lack of time, and it would be good to know more about how the scenarios in this post impact that approach.
The seven scenarios in this post show many inner alignment issues that interpretability could address. However, one inner alignment issue that is not well addressed by this post is suboptimality alignment. (Neither is the closely related suboptimality deceptive alignment.)
I can see how some forms of suboptimality alignment are addressed in the scenarios. For example, an AI might have a misaligned terminal goal, but some errors in its world model cause it to coincidentally have aligned behavior for a period of time. In Scenario 2, we could catch this form of suboptimality alignment when we do the goal read-offs and see that its terminal goal is misaligned.
But what about unpredictable forms of suboptimality alignment? What if an AI is aligned in training, but as it learns more during deployment, it has an ontological crisis and determines that the base objective isn't compatible with its new understanding of the universe?
How serious of a risk is suboptimality alignment in practice, and how can that risk be mitigated? This is an important question to investigate, both for alignment in general as well as for better understanding the extent of interpretability's potential impacts on inner alignment.
In this post, we investigated whether interpretability has property #1 of High-leverage Alignment Research. We discussed the four most important parts of AI alignment, and which seem to be the hardest. Then we explored interpretability's relevance to these areas by analyzing seven specific scenarios focused on major interpretability breakthroughs that could have great impacts on the four alignment components. We also looked at interpretability's potential relevance to deconfusion research and yet-unknown scenarios for solving alignment.
It seems clear that there are many ways interpretability will be valuable or even essential for AI alignment. It is likely to be the best resource available for addressing inner alignment issues across a wide range of alignment techniques and proposals, some of which look quite promising from an outer alignment and performance competitiveness perspective.
However, it doesn't look like it will be easy to realize the potential of interpretability research. The most promising scenarios analyzed above tend to rely on near-perfection of interpretability techniques that we have barely begun to develop. Interpretability also faces serious potential obstacles from things like distributed representations (e.g. polysemanticity), the likely-alien ontologies of advanced AIs, and the possibility that those AIs will attempt to obfuscate their own cognition. Moreover, interpretability doesn't offer many great solutions for suboptimality alignment and training competitiveness, at least not that I could find yet.
Still, interpretability research may be one of the activities that most strongly exhibits property #1 of High-leverage Alignment Research. This will become clearer if we can resolve some of the Further investigation questions above, such as developing more concrete paths to achieving the scenarios in this post and estimating probabilities that we could achieve them. It would also help if, rather than considering interpretability just on its own terms, we could do a side-by-side comparison of interpretability with other research directions, as the Alignment Research Activities Question suggests.
Realizing any of the scenarios with alignment-solving potential covered in this post would likely require much more funding for interpretability, as well as many more researchers working in the field than there are today.
For the next post in this series, I’ll be exploring whether interpretability has property #2 of High-leverage Alignment Research: "the sort of thing that researchers can get up to speed on and contribute to relatively straightforwardly (without having to take on an unusual worldview, match other researchers’ unarticulated intuitions to too great a degree, etc.)"
The Appendices for this post are on Google Docs at the following link: Appendices for Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios
High-leverage Alignment Research is my term for what Karnofsky (2022) defines as:
“Activity that is likely to be relevant for the hardest and most important parts of the problem, while also being the sort of thing that researchers can get up to speed on and contribute to relatively straightforwardly (without having to take on an unusual worldview, match other researchers’ unarticulated intuitions to too great a degree, etc.)”
See The Alignment Research Activities Question section in the first post of this sequence for further details.
Hubinger, Evan (2020): An overview of 11 proposals for building safe advanced AI
In researching what are the important components of AI alignment, I first spent a couple days thinking about this question and looking back over the AGISF curriculum, this talk Eliezer gave at NYU, and this Evan Hubinger interview. I came up with a 3-part breakdown of 1) Outer alignment, 2) Inner alignment, and 3) Alignment tax. I asked Joe Collman if he would look it over, and he had some useful feedback but broadly agreed with it and didn’t have any major components to add.
Then I came across Hubinger (2020) again, which I hadn’t read in a while. His breakdown was mostly the same, but I liked his descriptions better. He also divided what was “alignment tax” in my system into “training competitiveness” and “performance competitiveness”. I thought this was a useful distinction, which is why I adopt Hubinger’s breakdown in this post.
The fact that 2 people independently arrived at roughly these same basic components of alignment lends some additional confidence to their correctness, although when I came up with my version I may have been subconsciously influenced by an earlier reading of Hubinger’s work.
3 of the 11 proposals explicitly have “transparency tools” in the name. 5 more of them rely on relaxed adversarial training. In Evan Hubinger’s Relaxed adversarial training for inner alignment, he explains why this technique ultimately depends on interpretability as well:
“...I believe that one of the most important takeaways we can draw from the analysis presented here, regardless of what sort of approach we actually end up using, is the central importance of transparency. Without being able to look inside our model to a significant degree, it is likely going to be very difficult to get any sort of meaningful acceptability guarantees. Even if we are only shooting for an iid guarantee, rather than a worst-case guarantee, we are still going to need some way of looking inside our model to verify that it doesn't fall into any of the other hard cases.”
Then there is Microscope AI, which is an alignment proposal based entirely around interpretability. STEM AI relies on transparency tools to solve inner alignment issues in Hubinger’s analysis. Finally, in proposal #2 which utilizes intermittent oversight, he clarifies that the overseers will be "utilizing things like transparency tools and adversarial attacks."
The Alignment Research Activities Question is my term for a question posed by Karnofsky (2022). The short version is: “What relatively well-scoped research activities are particularly likely to be useful for longtermism-oriented AI alignment?”
For all relevant details on that question, see The Alignment Research Activities Question section in the first post of this sequence.
Karnofsky, Holden (2022): Important, actionable research questions for the most important century
Sometimes when I quote Karnofsky (2022), I'm referring directly to the link above to his post on the Effective Altruism Forum. Other times I'm referring to something that only appears in the associated Appendix 1: detailed discussion of important, actionable questions for the most important century that Holden provides, which is on Google Docs.
The "most important century" part of the present sequence's name also draws its inspiration from Karnofsky (2022) and an earlier blog post series by the same author.
Paul Christiano opens his 2021 Comments on OpenPhil's Interpretability RFP with the following, indicating his support for interpretability research:
"I'm very excited about research that tries to deeply understand how neural networks are thinking, and especially to understand tiny parts of neural networks without too much concern for scalability, as described in OpenPhil's recent RFP or the Circuits thread on Distill."
As for Eliezer, you can read his support for interpretability research in the following quote from the 2021 Discussion with Eliezer Yudkowsky on AGI interventions, along with his concerns that interpretability won't advance fast enough: (bold mine)
'Chris Olah is going to get far too little done far too late. We're going to be facing down an unalignable AGI and the current state of transparency is going to be "well look at this interesting visualized pattern in the attention of the key-value matrices in layer 47" when what we need to know is "okay but was the AGI plotting to kill us or not”. But Chris Olah is still trying to do work that is on a pathway to anything important at all, which makes him exceptional in the field.'
You could interpret Eliezer's concerns about timing as a) implying that it is futile to pursue interpretability research. Or you could interpret them as b) a reason to ramp up investment in interpretability research so that we can accelerate its progress. The latter is similar to the position we are exploring in this sequence, dependent on whether we can clearly identify interpretability research as High-leverage Alignment Research.
You can see further evidence for the latter view from Eliezer/MIRI's support for interpretability research in this quote from MIRI's Visible Thoughts Project and Bounty Announcement post in 2021: (bold mine)
"The reason for our focus on this particular project of visible thoughts isn’t because we believe it to be better or more fruitful than Circuits-style transparency (we have said for years that Circuits-style research deserves all possible dollars that can be productively spent on it), but just because it’s a different approach where it might also be possible to push progress forward."
Note that the bullet referring to this footnote isn't technically a "reason to think interpretability won't go far enough" like the others in that section list. It's more of a general risk associated with interpretability research, but I couldn't find a better home for it in this post.
I subscribe to Evan Hubinger's view that imitative amplification is likely outer aligned. See for example this explanation from Hubinger (2020):
"Since imitative amplification trains M to imitate Amp(M), it limits to the fixed point of the Amp operator, which Paul Christiano calls HCH for “Humans Consulting HCH.” HCH is effectively a massive tree of humans consulting each other to answer questions.
Thus, whether imitative amplification is outer aligned is dependent on whether HCH is aligned or not. HCH’s alignment, in turn, is likely to depend heavily on the specific humans used and what sort of policy they’re implementing. The basic idea, however, is that since the limit is composed entirely of humans—and since we can control what those humans do—we should be able to ensure that all the optimization power is coming from the humans (and not from memetic selection, for example), which hopefully should allow us to make it safe. While there are certainly valid concerns with the humans in the tree accidentally (or purposefully) implementing some misaligned policy, there are possible things you can do to address these problems."
Hubinger, Evan (2019): Relaxed adversarial training for inner alignment
This could be achieved by training multi-agent environments where agents with subhuman intelligence are incentivized to lie to one another (then don’t deploy those models!). Control the experiments by having models perform similar tasks, but cooperatively.
Alternative scheme to study how a specific agent engages in deception: After main training, fine-tune the model in an environment which incentivizes lying and monitor its neural activity (then throw away that deceptive model!).
This could be achieved by training multi-agent environments where agents with subhuman intelligence are incentivized to lie to one another (then don’t deploy those models!). Control the experiments by having models perform similar tasks, but cooperatively.
ELK scored poorly (4.0 / 10) on Neel Nanda’s survey of possible interpretability impacts. However, as of this writing, only 3 researchers have participated in that vote. Also, this question was not about ELK as a theory of impact assuming large amounts of money and talent go into interpretability; it was probably based on assumptions of interpretability continuing along current trajectories. It also made no reference to this specific scenario as the means to solving ELK (only interpretability in general).
There’s a good chance that advanced AI won’t provide a crisp easy-to-understand answer like “my goal is to win the debate”, and instead the answer will be more like [complex mess of heuristics that happens to correlate with winning quite strongly]. In this case we’d have to comb through the answer and distill the goal meaning we care about. There’s also a chance that the debater would output something like [explanation that you've asked a confused/underspecified question, and that there's no sensible way to ascribe a particular "terminal goal" to the debater]. This is a bit more problematic but can probably be addressed with a series of follow-up questions. (See the paragraph beginning with “Just because our AI is honest, doesn’t mean that it can’t make mistakes” further above in this scenario.) After addressing these complexities, I think the main ideas of this debate game analysis still hold. (Credit to Joe Collman for raising these issues)
Declining to answer the question in this debate assumes that the debaters know they are under lie detection. It's interesting to think about what would happen if they didn't. Maybe they'd be more likely to play the debate game with one telling the truth and the other lying. But maintaining the debaters' ignorance of the lie detector seems like a difficult equilibrium to enforce, and it's impossible if you're going to have the debaters use the lie detector on one another.
Threat model evidence scored highly (7.1 / 10) on Neel Nanda’s survey of researchers on possible interpretability impacts. This survey question was also not assuming large amounts of money and talent go into interpretability, as we are in this document; hence it may have scored even better given those assumptions. Either way, this is another data point in favor of these threat models being valuable even if 100% reliable deception detection isn’t achieved.
It’s also possible for an AI to simply have incoherent “beliefs”. For example, a large language model like GPT-3 may say contradictory things because it makes statements based on predictive modeling of language rather than a unified set of beliefs about the world. However, I suspect AIs will move toward having more coherent beliefs than present-day language models, so I don’t think this kind of incoherence is as relevant to aligning advanced AI.
Another example illustrating how diverse these approaches to myopia are: LCDT. The idea behind LCDT is that instead of limiting an agent’s time-horizon directly, we might achieve the desired effects by making an AI believe that it cannot influence other agents, including its future self. Yet another approach that comes to my mind might be called “the end is nigh” myopia. If we can enforce that an agent has the unwavering belief that the world will end immediately after the present training step or episode, then it wouldn’t make sense for it to have any plans beyond that time horizon. It doesn't seem like a great idea to try and enforce incorrect beliefs about the world in our AI, but hopefully this helps illustrate that there could be many approaches to myopic AI.
For clarification on these terms, see What is a training "step" vs. "episode" in machine learning?
Thanks to Donald Hobson for this clever thought experiment. The problem isn’t specific to per-episode myopia and could apply just as well to per-step myopia. The thought experiment does depend on the AI believing in a kind of time travel where multiple timelines are possible though.
This stands for “Independent and identically distributed”. See https://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables
Thanks to Nick Turner for key discussions that led to me writing this section.
Whereas robust alignment and even other kinds of non-deceptive pseudo-alignment should only require process #1.
In fact, this is something Karnofsky (2022) proposes.
Thanks to Nathan Helm-Burger for raising this question.
I expect there to be disagreements about some of the specific claims and scenarios in this post, and I look forward to learning from those. But I would be surprised if they undermined the overall preponderance of evidence put forth here for the alignment-solving potential of interpretability research, across all the scenarios in this post and all the analyzed impacts on various robustness & alignment techniques in Appendix 1.