How To Go From Interpretability To Alignment: Just Retarget The Search

[EDIT: Many people who read this post were very confused about some things, which I later explained in What’s General-Purpose Search, And Why Might We Expect To See It In Trained ML Systems? You might want to read that post first.]

When people talk about prosaic alignment proposals, there’s a common pattern: they’ll be outlining some overcomplicated scheme, and then they’ll say “oh, and assume we have great interpretability tools, this whole thing just works way better the better the interpretability tools are”, and then they’ll go back to the overcomplicated scheme. (Credit to Evan for pointing out this pattern to me.) And then usually there’s a whole discussion about the specific problems with the overcomplicated scheme.

In this post I want to argue from a different direction: if we had great interpretability tools, we could just use those to align an AI directly, and skip the overcomplicated schemes. I’ll call the strategy “Just Retarget the Search”.

We’ll need to make two assumptions:

  • Some version of the natural abstraction hypothesis holds, so the AI ends up with an internal concept corresponding to the alignment target we want to use (e.g. human values, corrigibility, user intention, human mimicry).
  • The standard mesa-optimization argument from Risks From Learned Optimization holds, and the system ends up developing a general-purpose (i.e. retargetable) internal search process.

Given these two assumptions, here’s how to use interpretability tools to align the AI:

  • Identify the AI’s internal concept corresponding to whatever alignment target we want to use (e.g. values/corrigibility/user intention/human mimicry/etc).
  • Identify the retargetable internal search process.
  • Retarget (i.e. directly rewire/set the input state of) the internal search process on the internal representation of our alignment target. 

Just retarget the search. Bada-bing, bada-boom.
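
To make the procedure concrete, here is a minimal pseudocode sketch of those three steps (purely illustrative, not part of the original proposal). The interpretability primitives find_concept, find_search_process, and set_search_target are hypothetical placeholders; no such tools exist today, and their signatures are assumptions.

```python
from typing import Any, Callable

# Hypothetical interpretability primitives -- assumptions for illustration,
# not real APIs.
FindConcept = Callable[[Any, str], Any]      # (model, description) -> internal concept
FindSearch = Callable[[Any], Any]            # (model,) -> handle on the internal search
SetTarget = Callable[[Any, Any, Any], None]  # (model, search, concept) -> rewires in place

def retarget_the_search(model: Any,
                        alignment_target: str,
                        find_concept: FindConcept,
                        find_search_process: FindSearch,
                        set_search_target: SetTarget) -> Any:
    """The three steps above, stated as literal function calls."""
    # 1. Identify the AI's internal concept corresponding to our alignment target.
    target_concept = find_concept(model, alignment_target)
    # 2. Identify the retargetable internal search process.
    search_process = find_search_process(model)
    # 3. Retarget: point the search process's input at that concept.
    set_search_target(model, search_process, target_concept)
    return model
```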

Problems

Of course as written, “Just Retarget the Search” has some issues; we haven’t added any of the bells and whistles to it yet. Probably the “identify the internal representation of the alignment target” step is less like searching through a bunch of internal concepts, and more like writing our intended target in the AI’s internal concept-language. Probably we’ll need to do the retargeting regularly on-the-fly as the system is training, even when the search is only partly-formed, so we don’t end up with a misaligned AI before we get around to retargeting. Probably we’ll need a bunch of empirical work to figure out which possible alignment targets are and are not easily expressible in the AI’s internal language (e.g. I’d guess “user intention” or "human mimicry" are more likely than “human values”). But those details seem relatively straightforward.

A bigger issue is that “Just Retarget the Search” just… doesn’t seem robust enough that we’d want to try it on a superintelligence. We still need to somehow pick the right target (i.e. handle outer alignment), and ideally it’s a target which fails gracefully (i.e. some amount of basin-of-corrigibility). If we fuck up and aim a superintelligence at not-quite-the-right-target, game over. Insofar as “Just Retarget the Search” is a substitute for overcomplicated prosaic alignment schemes, that’s probably fine; most of those schemes are targeting only-moderately-intelligent systems anyway IIUC. On the other hand, we probably want our AI competent enough to handle ontology shifts well, otherwise our target may fall apart.

Then, of course, there’s the assumptions (natural abstractions and retargetable search), either of which could fail. That said, if one or both of the assumptions fail, then (a) that probably messes up a bunch of the overcomplicated prosaic alignment schemes too (e.g. failure of the natural abstraction hypothesis can easily sink interpretability altogether), and (b) that might mean that the system just isn’t that dangerous in the first place (e.g. if it turns out that retargetable internal search is indeed necessary for dangerous intelligence).

Upsides

First big upside of Just Retargeting the Search: it completely and totally eliminates the inner alignment problem. We just directly set the internal optimization target.

Second big upside of Just Retargeting the Search: it’s conceptually simple. The problems and failure modes are mostly pretty obvious. There is no recursion, no complicated diagram of boxes and arrows. We’re not playing two Mysterious Black Boxes against each other.

But the main reason to think about this approach, IMO, is that it’s a true reduction of the problem. Prosaic alignment proposals have a tendency to play a shell game with the Hard Part of the problem, move it around and hide it in different black boxes but never actually eliminate it. “Just Retarget the Search” directly eliminates the inner alignment problem. No shell game, no moving the Hard Part around. It still leaves the outer alignment problem unsolved, it still needs assumptions about natural abstractions and retargetable search, but it completely removes one Hard Part and reduces the problem to something simpler.

As such, I think “Just Retarget the Search” is a good baseline. It’s a starting point for thinking about the parts of the problem it doesn’t solve (e.g. outer alignment), or the ways it might fail (retargetable search, natural abstractions), without having to worry about inner alignment.

Comments

Definitely agree that "Retarget the Search" is an interesting baseline alignment method you should be considering.

I like what you call "complicated schemes" over "retarget the search" for two main reasons:

  1. They don't rely on the "mesa-optimizer assumption" that the model is performing retargetable search (which I think will probably be false in the systems we care about).
  2. They degrade gracefully with worse interpretability tools, e.g. in debate, even if the debaters can only credibly make claims about whether particular neurons are activated, they can still say stuff like "look, my opponent is thinking about synthesizing pathogens, probably it is hoping to execute a treacherous turn", whereas "Retarget the Search" can't use this weaker interpretability at all. (Depending on background assumptions you might think this doesn't reduce x-risk at all; that could also be a crux.)

I indeed think those are the relevant cruxes.

  1. They don't rely on the "mesa-optimizer assumption" that the model is performing retargetable search (which I think will probably be false in the systems we care about).

Why do you think we probably won't end up with mesa-optimizers in the systems we care about?

Curious about both which systems you think we'll care about (e.g. generative models, RL-based agents, etc.) and why you don't think mesa-optimization is a likely emergent property for very scaled-up ML models.

  1. It's a very specific claim about how intelligence works, so gets a low prior, from which I don't update much (because it seems to me we know very little about how intelligence works structurally and the arguments given in favor seem like relatively weak considerations).
  2. Search is computationally inefficient relative to heuristics, and we'll be selecting really hard on computational efficiency (the model can't just consider 10^100 plans and choose the best one when it only has 10^15 flops to work with). It seems very plausible that the model considers, say, 10 plans and chooses the best one, or even 10^6 plans, but then most of the action is in which plans were generated in the first place and "retarget the search" doesn't necessarily solve your problem.
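
As a toy illustration of point 2 (my own sketch, not part of the original comment): in a generate-then-rank setup, "retargeting" only swaps the criterion used to rank candidate plans; it does nothing about which candidates the heuristic generator proposes in the first place.

```python
from typing import Any, Callable, List

def choose_plan(generate_candidates: Callable[[int], List[Any]],
                evaluate: Callable[[Any], float],
                n: int = 10) -> Any:
    """Generate-then-rank: heuristics propose n plans, an objective picks one."""
    candidates = generate_candidates(n)   # most of the action happens here
    return max(candidates, key=evaluate)  # retargeting only swaps this criterion
```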

I'm not thinking much about whether we're considering generative models vs RL-based agents for this particular question (though generally I tend to think about foundation models finetuned from human feedback).

most of the action is in which plans were generated in the first place and "retarget the search" doesn't necessarily solve your problem

I definitely buy this and I think the thread under this between you and John is a useful elaboration.

The thing that generates the proposals has to do most of the heavy lifting in any interestingly-large problem. e.g. I would argue most of the heavy lifting[1] of AlphaGo and that crowd is done by the fact that the atomic actions are all already 'interestingly consequential' (i.e. the proposal generation doesn't have to consider millisecond muscle twitches but rather whole 'moves', a short string of which is genuinely consequential in-context).

Nevertheless I reasonably strongly think that something of the 'retargetable search' flavour is a useful thing to expect, look for, and attempt to control.

For one, once you have proposals which are any kind of good at all, running a couple of OOMs of plan selection can buy you a few standard deviations of plan quality, provided you can evaluate plans ex ante better than randomly, which is just generically applicable and useful. But this isn't the main thing, because with just that picture we're still back to most of the action being the generator/heuristics.

The main things are that

  1. (as John pointed out) recursive-ish generic planning is enormously useful and general, and implies at least some degree of retargetability.
  2. (this is shaky and insidey) how do you arrive at the good heuristics/generators? It's something like

  1. (this is an entirely unfair defamation of Silver et al which I feel the need to qualify is at least partly rhetorical and not in fact my entire take on the matter) ↩︎

I am very confused by (2). It sounds like you are imagining that search necessarily means brute-force search (i.e. guess-and-check)? Like non-brute-force search is just not a thing? And therefore heuristics are necessarily a qualitatively different thing from search? But I don't think you're young enough to have never seen A* search, so presumably you know that formal heuristic search is a thing, and how to use relaxation to generate heuristics. What exactly do you imagine that the word "search" refers to?

I've definitely seen A* search and know how it works. I meant to allude to it (and lots of other algorithms that involve a clear goal) with this part:

It seems very plausible that the model considers, say, 10 plans and chooses the best one, or even 10^6 plans, but then most of the action is in which plans were generated in the first place and "retarget the search" doesn't necessarily solve your problem.

If your AGI is doing an A* search, then I think "retarget the search" is not a great strategy, because you have to change both the goal specification and the heuristic, and it's really unclear how you would change the heuristic even given a solution to outer alignment (because A* heuristics are incredibly specific to the setting and goal, and presumably have to become way more specialized than they are today in order for it to be more powerful than what we can do today).

That's what relaxation-based methods are for; they automatically generate heuristics for A* search. For instance, in a maze, it's very easy to find a solution if we relax all the "can't cross this wall" constraints, and that yields the Euclidean distance heuristic. Also, to a large extent, those heuristics tend to depend on the environment but not on the goal - for instance, in the case of Euclidean distance in a maze, the heuristic applies to any pathfinding problem between two points (and probably many other problems too), not just to whatever particular start and end points the particular maze highlights. We can also view things like instrumentally convergent subgoals or natural abstractions as likely environment-specific (but not goal-specific) heuristics.
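
For concreteness, here is a small sketch (mine, not from the thread) of A* on a grid maze with the relaxation-derived straight-line heuristic. The point to notice is that the heuristic depends only on the environment's geometry, not on which particular start/goal pair we plug in.

```python
import heapq
import math
from typing import Dict, List, Optional, Set, Tuple

Cell = Tuple[int, int]

def euclidean(a: Cell, b: Cell) -> float:
    # Relax the "can't cross walls" constraints and the optimal cost is just
    # straight-line distance -- an admissible heuristic for any start/goal pair.
    return math.dist(a, b)

def a_star(walls: Set[Cell], size: int, start: Cell, goal: Cell) -> Optional[List[Cell]]:
    """Shortest path on a size x size grid, avoiding the cells in `walls`."""
    frontier: List[Tuple[float, int, Cell]] = [(euclidean(start, goal), 0, start)]
    came_from: Dict[Cell, Cell] = {}
    best_cost: Dict[Cell, int] = {start: 0}
    while frontier:
        _, cost, node = heapq.heappop(frontier)
        if node == goal:  # reconstruct the path by walking back through came_from
            path = [node]
            while node != start:
                node = came_from[node]
                path.append(node)
            return path[::-1]
        x, y = node
        for nxt in [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]:
            if (0 <= nxt[0] < size and 0 <= nxt[1] < size and nxt not in walls
                    and cost + 1 < best_cost.get(nxt, math.inf)):
                best_cost[nxt] = cost + 1
                came_from[nxt] = node
                heapq.heappush(frontier, (cost + 1 + euclidean(nxt, goal), cost + 1, nxt))
    return None
```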

Those are the sort of pieces I imagine showing up as part of "general-purpose search" in trained systems: general methods for generating heuristics for a wide variety of goals, as well as some hard-coded environment-specific (but not goal-specific) heuristics.

(Note to readers: here's another post (with comments from John) on the same topic, which I only just saw.)

I imagine two different kinds of AI systems you might be imagining:

  1. An AI system that has a "subroutine" that runs A* search given a problem specification. The AI system works by formulating useful subgoals, converting those into A* problem specifications + heuristics, uses the A* subroutine, and then executes the result.
  2. An AI system that literally is A* search. The AI has (in its weights, if it is a learned neural net) a high-level "state space of the universe", a high-level "conceptual actions" space, an ability to predict the next high-level state given a previous state + conceptual action, and some goal function (= the mesa-objective). Given an input, the AI converts it into a high-level state, and runs A* with that state as the input, takes the resulting plan and executes the first action of the plan.

In (1), it seems like the major alignment work is in aligning the part of the AI system that formulates subgoals, problem specification, and heuristics, where it is not clear that "retarget the search" would work. (You could also try to excise the A* subroutine and use that as a general-purpose problem solver, but then you have to tune the heuristic manually; maybe you could excise the A* subroutine and the part that designs the heuristic, if you were lucky enough that those were fully decoupled from the subgoal-choosing part of the system.)

In (2), I don't know why you expect to get general-purpose search instead of a very complex heuristic that's very specific to the mesa objective. There is only ever one goal that the A* search has to optimize for; why wouldn't gradient descent embed a bunch of goal-specific heuristics that improve efficiency? Are you saying that such heuristics don't exist?

Separately: do you think we could easily "retarget the search" for an adult human, if we had mechanistic interpretability + edit access for the human's brain? I'd expect "no".

I'm imagining roughly (1), though with some caveats:

  • Of course it probably wouldn't literally be A* search
  • Either the heuristic-generation is internal to the search subroutine, or it's using a standard library of general-purpose heuristics for everything (or some combination of the two).
  • A lot of the subgoal formulation is itself internal to the search (i.e. recursively searching on subproblems is a standard search technique).

I do indeed expect that the major alignment work is in formulating problem specification, and possibly subgoals/heuristics (depending on how much of that is automagically handled by instrumental convergence/natural abstraction). That's basically the conclusion of the OP: outer alignment is still hard, but we can totally eliminate the inner alignment problem by retargeting the search.

Separately: do you think we could easily "retarget the search" for an adult human, if we had mechanistic interpretability + edit access for the human's brain? I'd expect "no".

I expect basically "yes", although the result would be something quite different from a human.

We can already give humans quite arbitrary tasks/jobs/objectives, and the humans will go figure out how to do it. I'm currently working on a post on this, and my opening example is Benito's job; here are some things he's had to do over the past couple years:

  • build a prototype of an office
  • resolve neighbor complaints at a party
  • find housing for 13 people with 2 days notice
  • figure out an invite list for 100+ people for an office
  • deal with people emailing a funder trying to get him defunded
  • set moderation policies for LessWrong
  • write public explanations of grantmaking decisions
  • organize weekly online zoom events
  • ship books internationally by Christmas
  • moderate online debates
  • do April Fools' Jokes on Lesswrong
  • figure out which of 100s of applicants to do trial hires with

So there's clearly a retargetable search subprocess in there, and we do in fact retarget it on different tasks all the time.

That said, in practice most humans seem to spend most of their time not really using the retargetable search process much; most people mostly just operate out of cache, and if pressed they're unsure what to point the retargetable search process at. If we were to hardwire a human's search process to a particular target, they'd single-mindedly pursue that one target (and subgoals thereof); that's quite different from normal humans.

... Interesting. I've been thinking we were talking about (2) this entire time, since on my understanding of "mesa optimizers", (1) is not a mesa optimizer (what would its mesa objective be?).

If we're imagining systems that look more like (1) I'm a lot more confused about how "retarget the search" is supposed to work. There's clearly some part of the AI system (or the human, in the analogy) that is deciding how to retarget the search on the fly -- is your proposal that we just chop that part off somehow, and replace it with a hardcoded concept of "human values" (or "user intent" or whatever)? If that sort of thing doesn't hamstring the AI, why didn't gradient descent do the same thing, except replacing it with a hardcoded concept of "reward" (which presumably a somewhat smart AGI would have)?

So, part of the reason we expect a retargetable search process in the first place is that it's useful for the AI to recursively call it with new subproblems on the fly; recursive search on subproblems is a useful search technique. What we actually want to retarget is not every instance of the search process, but just the "outermost call"; we still want it to be able to make recursive calls to the search process while solving our chosen problem.
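
Here's a minimal sketch of that picture (my own gloss on the architecture being described, with every name an assumption): the same search routine calls itself recursively on subproblems, and "retargeting" means swapping only the goal handed to the outermost invocation.

```python
from typing import Any, Callable, List, Optional

Plan = List[str]

def search(state: Any,
           goal: Any,
           solve_directly: Callable[[Any, Any], Optional[Plan]],
           decompose: Callable[[Any, Any], List[Any]],
           depth: int = 3) -> Optional[Plan]:
    """General-purpose search: solve the goal directly if possible, otherwise
    split it into subgoals and recursively call *this same search* on each."""
    direct = solve_directly(state, goal)
    if direct is not None or depth == 0:
        return direct
    plan: Plan = []
    for subgoal in decompose(state, goal):
        subplan = search(state, subgoal, solve_directly, decompose, depth - 1)
        if subplan is None:
            return None
        plan.extend(subplan)
    return plan

# Retargeting changes only the goal passed to the outermost call; the recursive
# subcalls still choose and pursue their own subgoals exactly as before.
def retarget_outermost_call(state, our_target, solve_directly, decompose):
    return search(state, our_target, solve_directly, decompose)
```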

Okay, I think this is a plausible architecture that a learned program could have, and I don't see super strong reasons for "retarget the search" to fail on this particular architecture (though I do expect that if you flesh it out you'll run into more problems, e.g. I'm not clear on where "concepts" live in this architecture and I could imagine that poses problems for retargeting the search).

Personally I still expect systems to be significantly more tuned to the domains they were trained on, with search playing a more cursory role (which is also why I expect to have trouble retargeting a human's search). But I agree that my reason (2) above doesn't clearly apply to this architecture. I think the recursive aspect of the search was the main thing I wasn't thinking about when I wrote my original comment.

One of the main reasons I expect this not to work is that optimization algorithms which are the best at optimizing some objective given a fixed compute budget seem like they basically can't be generally retargetable. E.g. consider something like Stockfish: it's a combination of search (which is retargetable) sped up by a series of very specialized heuristics that only work for winning. If you wanted to retarget Stockfish to "maximize the max number of pawns you ever have", you would not be able to use its [specialized for telling whether a move is likely going to win the game] heuristics to speed up your search for moves. A more extreme example: the entire endgame table is useless to you, and you probably have to recompute the whole thing.

Something like [the strategy stealing assumption](https://ai-alignment.com/the-strategy-stealing-assumption-a26b8b1ed334) is needed even to obtain the existence of a set of heuristics for speeding up the search for moves that "maximize the max number of pawns you ever have" that is as good as the heuristics for [telling whether a move will win the game]. Actually finding that set of heuristics is probably going to require an entirely parallel learning process.

This also implies that even if your AI has the concept of "human values" in its ontology, you still have to do a bunch of work to get an AI that can actually estimate the long-run consequences of any action on "human values", or else it won't be competitive with AIs that have more specialized optimization algorithms.

I agree. My comment here on Rohin and John's thread is a poor attempt at saying something similar, but also observing that having the machinery to do the 'find the good heuristics' thing is itself a (somewhat necessary?) property of 'recursive-ish search' (at least of the flavour applicable to high-dimensional 'difficult' problem-spaces). In humans and animals I think this thing is something like 'motivated exploration' aka 'science' aka 'experimentation', plus magic abstraction-formation and -recomposition.

I think it's worth trying to understand better how these pieces fit together, and to what extent these burdens can (or will) be overcome by compute and training scale.

This seems like a good argument against retargeting the search in a trained model turning out to be a successful strategy. But if we get to the point where we can detect such a search process in a model and what its target is, even if its efficiency is enhanced by specialized heuristics, doesn't that buy us a lot even without the retargeting mechanism?

We could use that info about the search process to start over and re-train the model, modifying parameters to try to guide it toward learning the optimization target that we want it to learn. Re-training is far from cheap on today's large models, but you might not have to go through the entire training process before the optimizer emerges and gains a stable optimization target. This could allow us to iterate on the search target and verify that we have the one we want before having to deploy the model in an unsafe environment.

Are you saying that the AIs we train will be optimization algorithms that are literally the best at optimizing some objective given a fixed compute budget? Can you elaborate on why that is?

Not literally the best, but retargetable algorithms are on the far end of the spectrum of "fully specialized" to "fully general", and I expect most tasks we train AIs to do to have heuristics that enable solving the tasks much faster than "fully general" algorithms, so there's decently strong pressure to be towards the "specialized" side.

I also think that heuristics are going to be closer to multiplicative speedups than additive, so it's going to be closer to "general algorithms just can't compete" than "it's just a little worse". E.g. random search is terrible compared to anything exploiting non-trivial structure (random sorting vs quicksort is, I think, a representative example: you can go from exponential to pseudolinear if you are specialized to your domain).

I expect most tasks we train AIs to do to have heuristics that enable solving the tasks much faster than "fully general" algorithms, so there's decently strong pressure to be towards the "specialized" side

I basically buy that claim. The catch is that those specialized AIs won't be AGIs, for obvious reasons, and at the end of the day it's the AGIs which will have most of the X-risk impact.

OK, cool. How do you think generalization works? I thought the idea was that instead of finding a specific technique that only works on the data you were trained on, sufficiently big NNs trained on sufficiently diverse data end up finding more general techniques that work on that data + other data that is somewhat different.

Generalization ability is a key metric for AGI, which I expect to go up before the end; like John said the kinds of AI we care about are the kinds that are pretty good at generalizing, meaning that they ARE close to the "fully general" end of the spectrum, or at least close enough that whatever they are doing can be retargeted to lots of other environments and tasks besides the exact ones they were trained on. Otherwise, they wouldn't be AGI.

Would you agree with that? I assume not...

Humans, despite being fully general, have vastly varying ability to do various tasks, e.g. they're much better at climbing mountains than playing Go, it seems. Humans also routinely construct entire technology bases to enable them to do tasks that they cannot do themselves. This is, in some sense, a core human economic activity: the construction of artifacts that can do tasks better/faster/more efficiently than humans can do themselves. It seems like by default you should expect a similar dynamic with "fully general" AIs. That is, AIs trained to do semiconductor manufacturing will create their own technology bases, specialized predictive artifacts, etc., and not just "think really hard" and "optimize within their own head." This also suggests a recursive form of the alignment problem, where an AI that wants to optimize human values is in a similar situation to us: it's easy to construct powerful artifacts with SGD that optimize measurable rewards, but it doesn't know how to do that for human values/things that can't be measured.

Even if you're selecting reasonably hard for "ability to generalize", by default the range of tasks you're selecting for aren't all going to be "equally difficult", and you're going to get an AI that is much better at some tasks than other tasks, has heuristics that enable it to accurately predict key intermediates across many tasks, heuristics that enable it to rapidly determine which portions of the action space are even feasible, etc. Asking that your AI can also generalize to "optimize human values" as well as the best available combination of skills that it has otherwise seems like a huge ask. Humans, despite being fully general, find it much harder to optimize for some things than others, e.g. constructing large cubes of iron versus status seeking, despite being able in theory to optimize for constructing large cubes of iron.

Nobody is asking that the AI can also generalize to "optimize human values as well as the best available combination of skills it has otherwise"... at least, I wasn't asking that. (At no point did I assume that fully general means 'equally good' at all tasks. I am not even sure such comparisons can be made.) But now, rereading your comments, it seems you were all along, since you brought up competitiveness worries. So now maybe I understand you better: you are assuming a hypercompetitive takeoff in which there are AIs running around optimized to play the training game or something, and then when we use interpretability tools to intervene on some of them and make them optimize for long-run human values instead, they won't be as good at it as they were at playing the training game, even though they will be able to do it (compare: humans can optimize for constructing large cubes of iron, but they aren't as good at it as they are at optimizing for status), and so they'll lose competitions to the remaining AIs that haven't been modified?

(My response to this would be ah, this makes sense, but I don't expect there to be this much competition so I'm not bothered by this problem. I think if we have the interpretability tools we'll probably be able to retarget the search of all relevant AIs, and then they'll optimize for human values inefficiently but well enough to save the day.)

I think competitiveness matters a lot even if there's only moderate amounts of competitive pressure. The gaps in efficiency I'm imagining are less "10x worse" and more like "I only had support vector machines and you had SGD"

“Just Retarget the Search” directly eliminates the inner alignment problem.

I think deception is still an issue here. A deceptive agent will try to obfuscate its goals, so unless you're willing to assume that our interpretability tools are so good they can't ever be tricked, you have to deal with that.

It's not necessarily a huge issue - hopefully with interpretability tools this good we can spot deception before it gets competent enough to evade our interpretability tools, but it's not just "bada-bing bada-boom" exactly.

Yea, I agree that if you give a deceptive model the chance to emerge then a lot more risks arise for interpretability and it could become much more difficult. Circumventing interpretability: How to defeat mind-readers kind of goes through the gauntlet, but I think one workaround/solution Lee lays out there which I haven't seen anyone shoot down yet (aside from it seeming terribly expensive) is to run the interpretability tools continuously or near continuously from the beginning of training. This would give us the opportunity to examine the mesa-optimizer's goals as soon as they emerge, before it has a chance to do any kind of obfuscation.

As a bonus, I'm pretty sure this approach is robust to the sharp left turn. Once the AI becomes advanced enough that most of its capability gain comes from its own computations, as opposed to the SGD[1], it won't want to change its mesa-objective. Indeed, it would do its best to avoid this outcome, up to hacking the training process and/or escaping it. If we goal-align it before this point, in a way that is robust to any ontology shifts (addressing which may be less difficult than it seems at the surface), it'll stay aligned forever.

In particular, this way we'll never have to deal with a lot of the really nasty stuff, like a superintelligence trying to foil our interpretability tools — a flawed mesa-objective is the ultimate precursor to that.

  1. ^

    Like ~all of a modern human's capability edge over a 50k-years-ago caveman comes from the conceptual, scientific, and educational cognitive machinery other humans have developed, not from evolution.

I'm wondering what you think we can learn from approaches like ROME. For those who don't know, ROME is focused on editing factual knowledge (e.g. Eiffel Tower is now in Rome). I'm curious how we could take it beyond factual knowledge. ROME uses causal tracing to find the parts of the model that impact specific factual knowledge the most. 

What if we tried to do something similar to find which parts of the model impact the search the most? How would we retarget the search in practice? And in the lead-up to more powerful models, what are the experiments we can do now (retarget the internal "function" the model is using)?

In the case of ROME, the factual knowledge can be edited by modifying the model only a little bit. Is Search at all "editable" like facts or does this kind of approach seem impossible for retargeting search? In the case of approaches like ROME, is creating a massive database of factual knowledge to edit the model the best we can do? Or could we edit the model in more abstract ways (that could impact Search) that point to the things we want?

The standard mesa-optimization argument from Risks From Learned Optimization holds, and the system ends up developing a general-purpose (i.e. retargetable) internal search process.

It also works in the scenario where human programmers develop a general-purpose (i.e. retargetable) internal search process, i.e. brain-like AGI or pretty much any other flavor of model-based RL. You would look for things in the world-model and manually set their “value” (in RL jargon) / “valence” (in psych jargon) to very high or low, or neutral, as the case may be. I’m all for that, and indeed I bring it up with some regularity. My progress towards a plan along those lines (such as it is) is mostly here. (Maybe it doesn’t look that way, but note that “Thought Assessors” ( ≈ multi-dimensional value function) can be thought of as a specific simple approach to interpretability, see discussion of the motivation-vs-interpretability duality in §9.6 here.) Some of the open problems IMO include:

  • figuring out what exactly value to paint onto exactly what concepts;
  • dealing with concept extrapolation when concepts hit edge-cases [concepts can hit edge-cases both because the AGI keeps learning new things, and because the AGI may consider the possibility of executing innovative plans that would take things out of distribution];
  • getting safely through the period where the “infant AGI” hasn’t yet learned the concepts which we want it to pursue (maybe solvable with a good sandbox);
  • getting the interpretability itself to work well (including the first-person problem, i.e. the issue that the AGI’s own intentions may be especially hard to get at with interpretability tools because it’s not just a matter of showing certain sensory input data and seeing what neurons activate)

This post expresses an important idea in AI alignment that I have essentially believed for a long time, and which I have not seen expressed elsewhere. (I think a substantially better treatment of the idea is possible, but this post is fine, and you get a lot of points for being the only place where an idea is being shared.)

Agree that this looks like a promising approach. People interested in this idea can read some additional discussion in Scenario 2: Reliable mesa-optimizer detection and precise goal read-offs from my post from May, "Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios".

As you mention, having this kind of advanced interpretability essentially solves the inner alignment problem, but leaves a big question mark about outer alignment. In that Scenario 2 link above, I have some discussion of expected impacts from this kind of interpretability on a bunch of different outer alignment and robustness techniques including: Relaxed adversarial training, Intermittent oversight, Imitative amplification, Approval-based amplification, Recursive reward modeling, Debate, Market making, Narrow reward modeling, Multi-agent, Microscope AI, STEM AI and Imitative generalization. [1] (You need to follow the link to the Appendix 1 section about this scenario though to get some of these details).

I'm not totally sure that the ability to reliably detect mesa-optimizers and their goals/optimization targets would automatically grant us the ability to "Just Retarget the Search" on a hot model. It might, but I agree with your Problems section that it may look more like restarting training on models where we detect a goal that's different from what we want. But this still seems like it could accomplish a lot of what we want from being able to retarget the search on a hot model, even though it's clunkier.

--

[1]: In a lot of these techniques it can make sense to check that the mesa-optimizer is aligned (and do some kind of goal retargeting if it's not). However, in others we probably want to take advantage of this kind of advanced interpretability in different ways. For example, in Imitative amplification, we can just use it to make sure mesa-optimization is not introduced during distillation steps, rather than checking that any mesa-optimization which is introduced is also aligned.

How do you feel about this strategy today? What chance of success would you give this? Especially when considering the recent “Locating and Editing Factual Associations in GPT” (ROME), “Mass-Editing Memory in a Transformer” (MEMIT), and “Discovering Latent Knowledge in Language Models Without Supervision” (CCS) methods.

How does this compare to the strategy you’re currently most excited about? Do you know of other ongoing (empirical) efforts that try to realize this strategy?

When I imagined retargeting the search, I definitely did not imagine that the methods in ROME etc would Just Work, and the majority of my probability mass was not on those particular methods being very central.