In How To Go From Interpretability To Alignment: Just Retarget The Search, John Wentworth suggests:

When people talk about prosaic alignment proposals, there’s a common pattern: they’ll be outlining some overcomplicated scheme, and then they’ll say “oh, and assume we have great interpretability tools, this whole thing just works way better the better the interpretability tools are”, and then they’ll go back to the overcomplicated scheme. (Credit to Evan for pointing out this pattern to me.) And then usually there’s a whole discussion about the specific problems with the overcomplicated scheme.

In this post I want to argue from a different direction: if we had great interpretability tools, we could just use those to align an AI directly, and skip the overcomplicated schemes. I’ll call the strategy “Just Retarget the Search”.

We’ll need to make two assumptions:

  • The AI ends up with an internal concept corresponding to some viable alignment target (human values, corrigibility, what the user intends, human mimicry, etc).
  • The AI ends up running a retargetable internal general-purpose search process (i.e. the standard mesa-optimization argument goes through).

Given these two assumptions, here’s how to use interpretability tools to align the AI:

  • Identify the AI’s internal concept corresponding to whatever alignment target we want to use (e.g. values/corrigibility/user intention/human mimicry/etc).
  • Identify the retargetable internal search process.
  • Retarget (i.e. directly rewire/set the input state of) the internal search process on the internal representation of our alignment target. 

Just retarget the search. Bada-bing, bada-boom.
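
To make the quoted three-step recipe concrete, here is a minimal sketch; every helper function is a hypothetical stand-in for interpretability and model-editing tools that do not exist yet, not a real API.

```python
# Hypothetical sketch of "Just Retarget the Search". Every helper below is a
# placeholder for a capability we do not currently have.

def find_internal_concept(model, concept):
    """Step 1 (hypothetical): locate the model's internal representation of `concept`."""
    raise NotImplementedError("requires interpretability tools we don't have yet")

def find_search_process(model):
    """Step 2 (hypothetical): locate the retargetable general-purpose search process."""
    raise NotImplementedError("requires interpretability tools we don't have yet")

def rewire_search_target(model, search_process, target_concept):
    """Step 3 (hypothetical): hardwire the search's goal input to the target concept."""
    raise NotImplementedError("requires model-editing tools we don't have yet")

def retarget_the_search(model, alignment_target="user intention"):
    target = find_internal_concept(model, alignment_target)  # step 1
    search = find_search_process(model)                      # step 2
    rewire_search_target(model, search, target)              # step 3
    return model
```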

There was a pretty interesting thread in the comments afterwards that I wanted to highlight.


Rohin Shah (permalink)

Definitely agree that "Retarget the Search" is an interesting baseline alignment method you should be considering.

I prefer what you call "overcomplicated schemes" over "retarget the search" for two main reasons:

  1. They don't rely on the "mesa-optimizer assumption" that the model is performing retargetable search (which I think will probably be false in the systems we care about).
  2. They degrade gracefully with worse interpretability tools, e.g. in debate, even if the debaters can only credibly make claims about whether particular neurons are activated, they can still say stuff like "look my opponent is thinking about synthesizing pathogens, probably it is hoping to execute a treacherous turn", whereas "Retarget the Search" can't use this weaker interpretability at all. (Depending on background assumptions you might think this doesn't reduce x-risk at all; that could also be a crux.)


johnswentworth (permalink)

I indeed think those are the relevant cruxes.

 

Evan R. Murphy (permalink)

  1. They don't rely on the "mesa-optimizer assumption" that the model is performing retargetable search (which I think will probably be false in the systems we care about).

Why do you think we probably won't end up with mesa-optimizers in the systems we care about?

Curious about both which systems you think we'll care about (e.g. generative models, RL-based agents, etc.) and why you don't think mesa-optimization is a likely emergent property for very scaled-up ML models.

 

Rohin Shah (permalink)

  1. It's a very specific claim about how intelligence works, so gets a low prior, from which I don't update much (because it seems to me we know very little about how intelligence works structurally and the arguments given in favor seem like relatively weak considerations).
  2. Search is computationally inefficient relative to heuristics, and we'll be selecting really hard on computational efficiency (the model can't just consider 10^100 plans and choose the best one when it only has 10^15 flops to work with). It seems very plausible that the model considers, say, 10 plans and chooses the best one, or even 10^6 plans, but then most of the action is in which plans were generated in the first place and "retarget the search" doesn't necessarily solve your problem.
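
A toy illustration of this second point, with made-up pieces: if the model proposes a small set of candidate plans from learned heuristics and then picks the best, retargeting only the final scoring step leaves the proposal step untouched, and the proposal step is where most of the optimization pressure lives.

```python
import random

def propose_plans(situation, n=10):
    # Stand-in for learned, goal-shaped heuristics: which plans even get
    # considered is already determined before any explicit "search" happens.
    return [f"plan {i} for {situation}" for i in range(n)]

def score(plan, goal):
    # Stand-in for evaluating a plan against a goal.
    return random.random()

def act(situation, goal):
    candidates = propose_plans(situation)                # ~10 plans, not 10^100
    return max(candidates, key=lambda p: score(p, goal))

# Swapping `goal` here only changes how the 10 candidates are ranked,
# not which 10 candidates were generated in the first place.
print(act("some situation", goal="our chosen target"))
```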

I'm not thinking much about whether we're considering generative models vs RL-based agents for this particular question (though generally I tend to think about foundation models finetuned from human feedback).


johnswentworth (permalink)

I am very confused by (2). It sounds like you are imagining that search necessarily means brute-force search (i.e. guess-and-check)? Like non-brute-force search is just not a thing? And therefore heuristics are necessarily a qualitatively different thing from search? But I don't think you're young enough to have never seen A* search, so presumably you know that formal heuristic search is a thing, and how to use relaxation to generate heuristics. What exactly do you imagine that the word "search" refers to?


Rohin Shah (permalink)

I've definitely seen A* search and know how it works. I meant to allude to it (and lots of other algorithms that involve a clear goal) with this part:

It seems very plausible that the model considers, say, 10 plans and chooses the best one, or even 10^6 plans, but then most of the action is in which plans were generated in the first place and "retarget the search" doesn't necessarily solve your problem.

If your AGI is doing an A* search, then I think "retarget the search" is not a great strategy, because you have to change both the goal specification and the heuristic, and it's really unclear how you would change the heuristic even given a solution to outer alignment (because A* heuristics are incredibly specific to the setting and goal, and presumably have to become way more specialized than they are today in order for the AI to be more powerful than what we can do today).
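
A minimal textbook A* sketch, just to pin down what is being claimed here: the goal enters in two places, the goal test and the heuristic, so naively swapping only the goal test leaves the search guided by cost estimates toward the old goal.

```python
import heapq

def a_star(start, is_goal, neighbors, heuristic):
    # `heuristic(node)` estimates remaining cost; in practice it is tuned to a goal.
    frontier = [(heuristic(start), 0, start, [start])]
    best_cost = {start: 0}
    while frontier:
        _, cost, node, path = heapq.heappop(frontier)
        if is_goal(node):
            return path
        for nxt, step in neighbors(node):
            new_cost = cost + step
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                heapq.heappush(frontier, (new_cost + heuristic(nxt), new_cost, nxt, path + [nxt]))
    return None

# Toy usage on an integer line: note the goal appears both in `is_goal` and in `heuristic`.
print(a_star(0, lambda n: n == 5,
             lambda n: [(n - 1, 1), (n + 1, 1)],
             heuristic=lambda n: abs(5 - n)))  # [0, 1, 2, 3, 4, 5]
```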

 

johnswentworth (permalink)

That's what relaxation-based methods are for; they automatically generate heuristics for A* search. For instance, in a maze, it's very easy to find a solution if we relax all the "can't cross this wall" constraints, and that yields the Euclidean distance heuristic. Also, to a large extent, those heuristics tend to depend on the environment but not on the goal - for instance, in the case of Euclidean distance in a maze, the heuristic applies to any pathfinding problem between two points (and probably many other problems too), not just to whatever particular start and end points the particular maze highlights. We can also view things like instrumentally convergent subgoals or natural abstractions as likely environment-specific (but not goal-specific) heuristics.

Those are the sort of pieces I imagine showing up as part of "general-purpose search" in trained systems: general methods for generating heuristics for a wide variety of goals, as well as some hard-coded environment-specific (but not goal-specific) heuristics.
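
Continuing the toy A* sketch above: a relaxation-derived heuristic like Manhattan distance in a grid maze is a property of the environment, not of a particular goal, so it can be written as a factory that takes whatever goal you retarget to; only the goal argument changes.

```python
def manhattan_heuristic(goal):
    # Obtained by relaxing the "can't cross walls" constraints of a grid maze:
    # the relaxed problem's exact cost is an admissible heuristic for *any*
    # start/goal pair in that environment, not just one particular goal.
    gx, gy = goal
    def h(node):
        x, y = node
        return abs(x - gx) + abs(y - gy)
    return h

# Retargeting only requires supplying the new goal; the heuristic family is reused.
h_old = manhattan_heuristic((0, 0))
h_new = manhattan_heuristic((9, 9))
print(h_old((3, 4)), h_new((3, 4)))  # 7 11
```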

 

Rohin Shah (permalink)

(Note to readers: here's another post (with comments from John) on the same topic, which I only just saw.)

I imagine two different kinds of AI systems you might be imagining:

  1. An AI system that has a "subroutine" that runs A* search given a problem specification. The AI system works by formulating useful subgoals, converting those into A* problem specifications + heuristics, uses the A* subroutine, and then executes the result.
  2. An AI system that literally is A* search. The AI has (in its weights, if it is a learned neural net) a high-level "state space of the universe", a high-level "conceptual actions" space, an ability to predict the next high-level state given a previous state + conceptual action, and some goal function (= the mesa-objective). Given an input, the AI converts it into a high-level state, and runs A* with that state as the input, takes the resulting plan and executes the first action of the plan.

In (1), it seems like the major alignment work is in aligning the part of the AI system that formulates subgoals, problem specification, and heuristics, where it is not clear that "retarget the search" would work. (You could also try to excise the A* subroutine and use that as a general-purpose problem solver, but then you have to tune the heuristic manually; maybe you could excise the A* subroutine and the part that designs the heuristic, if you were lucky enough that those were fully decoupled from the subgoal-choosing part of the system.)

In (2), I don't know why you expect to get general-purpose search instead of a very complex heuristic that's very specific to the mesa objective. There is only ever one goal that the A* search has to optimize for; why wouldn't gradient descent embed a bunch of goal-specific heuristics that improve efficiency? Are you saying that such heuristics don't exist?
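
A schematic of architecture (2), purely to make the question concrete; in a learned system every piece below would be encoded in the weights rather than written as explicit Python, and all the names are made up.

```python
import heapq
from itertools import count

def plan_with_a_star(state, is_goal, successors, heuristic):
    """A* over a high-level state space; `successors(s)` yields (action, next_state, cost)."""
    tie = count()
    frontier = [(heuristic(state), next(tie), 0, state, [])]
    best = {state: 0}
    while frontier:
        _, _, cost, s, actions = heapq.heappop(frontier)
        if is_goal(s):
            return actions
        for action, s2, step in successors(s):
            new_cost = cost + step
            if new_cost < best.get(s2, float("inf")):
                best[s2] = new_cost
                heapq.heappush(frontier, (new_cost + heuristic(s2), next(tie),
                                          new_cost, s2, actions + [action]))
    return []

class MesaPlannerAgent:
    """Architecture (2): the whole policy is a single search over learned abstractions."""
    def __init__(self, encode, successors, mesa_objective, heuristic):
        self.encode = encode                  # observation -> high-level world state
        self.successors = successors          # state -> [(conceptual action, next state, cost)]
        self.mesa_objective = mesa_objective  # goal test; the slot "retargeting" would overwrite
        self.heuristic = heuristic            # learned cost-to-go estimate (possibly goal-specific)

    def act(self, observation):
        state = self.encode(observation)
        plan = plan_with_a_star(state, self.mesa_objective, self.successors, self.heuristic)
        return plan[0] if plan else None      # execute the first action of the plan
```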

Separately: do you think we could easily "retarget the search" for an adult human, if we had mechanistic interpretability + edit access for the human's brain? I'd expect "no".

 

johnswentworth (permalink)

I'm imagining roughly (1), though with some caveats:

  • Of course it probably wouldn't literally be A* search
  • Either the heuristic-generation is internal to the search subroutine, or it's using a standard library of general-purpose heuristics for everything (or some combination of the two).
  • A lot of the subgoal formulation is itself internal to the search (i.e. recursively searching on subproblems is a standard search technique).

I do indeed expect that the major alignment work is in formulating problem specification, and possibly subgoals/heuristics (depending on how much of that is automagically handled by instrumental convergence/natural abstraction). That's basically the conclusion of the OP: outer alignment is still hard, but we can totally eliminate the inner alignment problem by retargeting the search.

Separately: do you think we could easily "retarget the search" for an adult human, if we had mechanistic interpretability + edit access for the human's brain? I'd expect "no".

I expect basically "yes", although the result would be something quite different from a human.

We can already give humans quite arbitrary tasks/jobs/objectives, and the humans will go figure out how to do it. I'm currently working on a post on this, and my opening example is Benito's job; here are some things he's had to do over the past couple years:

  • build a prototype of an office
  • resolve neighbor complaints at a party
  • find housing for 13 people with 2 days notice
  • figure out an invite list for 100+ people for an office
  • deal with people emailing a funder trying to get him defunded
  • set moderation policies for LessWrong
  • write public explanations of grantmaking decisions
  • organize weekly online zoom events
  • ship books internationally by Christmas
  • moderate online debates
  • do April Fools' jokes on LessWrong
  • figure out which of 100s of applicants to do trial hires with

So there's clearly a retargetable search subprocess in there, and we do in fact retarget it on different tasks all the time.

That said, in practice most humans seem to spend most of their time not really using the retargetable search process much; most people mostly just operate out of cache, and if pressed they're unsure what to point the retargetable search process at. If we were to hardwire a human's search process to a particular target, they'd single-mindedly pursue that one target (and subgoals thereof); that's quite different from normal humans.

 

Rohin Shah (permalink)

... Interesting. I've been thinking we were talking about (2) this entire time, since on my understanding of "mesa optimizers", (1) is not a mesa optimizer (what would its mesa objective be?).

If we're imagining systems that look more like (1) I'm a lot more confused about how "retarget the search" is supposed to work. There's clearly some part of the AI system (or the human, in the analogy) that is deciding how to retarget the search on the fly -- is your proposal that we just chop that part off somehow, and replace it with a hardcoded concept of "human values" (or "user intent" or whatever)? If that sort of thing doesn't hamstring the AI, why didn't gradient descent do the same thing, except replacing it with a hardcoded concept of "reward" (which presumably a somewhat smart AGI would have)?

 

johnswentworth (permalink)

So, part of the reason we expect a retargetable search process in the first place is that it's useful for the AI to recursively call it with new subproblems on the fly; recursive search on subproblems is a useful search technique. What we actually want to retarget is not every instance of the search process, but just the "outermost call"; we still want it to be able to make recursive calls to the search process while solving our chosen problem.
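
A toy sketch of the recursion point: the search decomposes a problem into subproblems and calls itself on them, so "retargeting" means replacing only the goal handed to the outermost call; the recursive calls keep whatever subgoals the search itself generates. The decomposition below is obviously made up.

```python
def search(goal, depth=0, max_depth=2):
    # Stand-in for a general-purpose search: decompose the goal into subgoals
    # and recursively search on each one.
    if depth == max_depth:
        return [f"primitive action for '{goal}'"]
    subgoals = [f"{goal} / subgoal {i}" for i in range(2)]  # made-up decomposition
    plan = []
    for sub in subgoals:
        plan += search(sub, depth + 1, max_depth)  # recursive call: not retargeted
    return plan

# Retargeting the search = changing only the argument to the outermost call:
print(search("original mesa-objective"))
print(search("our chosen alignment target"))
```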

 

Rohin Shah (permalink)

Okay, I think this is a plausible architecture that a learned program could have, and I don't see super strong reasons for "retarget the search" to fail on this particular architecture (though I do expect that if you flesh it out you'll run into more problems, e.g. I'm not clear on where "concepts" live in this architecture and I could imagine that poses problems for retargeting the search).

Personally I still expect systems to be significantly more tuned to the domains they were trained on, with search playing a more cursory role (which is also why I expect to have trouble retargeting a human's search). But I agree that my reason (2) above doesn't clearly apply to this architecture. I think the recursive aspect of the search was the main thing I wasn't thinking about when I wrote my original comment.


Moderator note: this post is an experiment in promoting more dialogue-shaped content on LessWrong.  If you’d be excited about finding an interlocutor to debate, dialogue, or be interviewed by: fill in this dialogue matchmaking form.

kave (permalink)

I like the Mark Xu & Daniel Kokotajlo thread on that post too

Both parties take some time to reach a mutual understanding of what the architecture is, or at least to build some intuition about what kind of architecture the other has in mind and what changes to perform in that architecture. This is a common pattern that I have seen a few times, and I have heard it is typical of pre-paradigmatic fields: there is not enough standard terminology for common patterns or constraints, or alternatively not enough concrete implementations to point to or write down quickly. It would have been much easier if johnswentworth had been able to refer to a "Shah representation of the concept of human value" or "the 2nd order Byrnes search process neuron hull" (all made up) and say "let's Yud-chain the Byrnes hull to the Shah". Then the discussion would have been whether the 2nd order Byrnes converges faster than the Yud-chain diverges, or something. But we are not there, and it shows.

John mentioned the existence of What's General-Purpose Search, And Why Might We Expect To See It In Trained ML Systems?, which was something of a follow-up post to How To Go From Interpretability To Alignment: Just Retarget The Search, and continues in a similar direction.