When people talk about prosaic alignment proposals, there’s a common pattern: they’ll be outlining some overcomplicated scheme, and then they’ll say “oh, and assume we have great interpretability tools, this whole thing just works way better the better the interpretability tools are”, and then they’ll go back to the overcomplicated scheme. (Credit to Evan for pointing out this pattern to me.) And then usually there’s a whole discussion about the specific problems with the overcomplicated scheme.
In this post I want to argue from a different direction: if we had great interpretability tools, we could just use those to align an AI directly, and skip the overcomplicated schemes. I’ll call the strategy “Just Retarget the Search”.
We’ll need to make two assumptions:
- Some version of the natural abstraction hypothesis holds, and the AI ends up with an internal concept for human values, or corrigibility, or what the user intends, or human mimicry, or some other outer alignment target.
- The standard mesa-optimization argument from Risks From Learned Optimization holds, and the system ends up developing a general-purpose (i.e. retargetable) internal search process.
Given these two assumptions, here’s how to use interpretability tools to align the AI:
- Identify the AI’s internal concept corresponding to whatever alignment target we want to use (e.g. values/corrigibility/user intention/human mimicry/etc).
- Identify the retargetable internal search process.
- Retarget (i.e. directly rewire/set the input state of) the internal search process on the internal representation of our alignment target.
Just retarget the search. Bada-bing, bada-boom.
Problems
Of course as written, “Just Retarget the Search” has some issues; we haven’t added any of the bells and whistles to it yet. Probably the “identify the internal representation of the alignment target” step is less like searching through a bunch of internal concepts, and more like writing our intended target in the AI’s internal concept-language. Probably we’ll need to do the retargeting regularly on-the-fly as the system is training, even when the search is only partly-formed, so we don’t end up with a misaligned AI before we get around to retargeting. Probably we’ll need a bunch of empirical work to figure out which possible alignment targets are and are not easily expressible in the AI’s internal language (e.g. I’d guess “user intention” or "human mimicry" are more likely than “human values”). But those details seem relatively straightforward.
A bigger issue is that “Just Retarget the Search” just… doesn’t seem robust enough that we’d want to try it on a superintelligence. We still need to somehow pick the right target (i.e. handle outer alignment), and ideally it’s a target which fails gracefully (i.e. some amount of basin-of-corrigibility). If we fuck up and aim a superintelligence at not-quite-the-right-target, game over. Insofar as “Just Retarget the Search” is a substitute for overcomplicated prosaic alignment schemes, that’s probably fine; most of those schemes are targeting only-moderately-intelligent systems anyway IIUC. On the other hand, we probably want our AI competent enough to handle ontology shifts well, otherwise our target may fall apart.
Then, of course, there’s the assumptions (natural abstractions and retargetable search), either of which could fail. That said, if one or both of the assumptions fail, then (a) that probably messes up a bunch of the overcomplicated prosaic alignment schemes too (e.g. failure of the natural abstraction hypothesis can easily sink interpretability altogether), and (b) that might mean that the system just isn’t that dangerous in the first place (e.g. if it turns out that retargetable internal search is indeed necessary for dangerous intelligence).
Upsides
First big upside of Just Retargeting the Search: it completely and totally eliminates the inner alignment problem. We just directly set the internal optimization target.
Second big upside of Just Retargeting the Search: it’s conceptually simple. The problems and failure modes are mostly pretty obvious. There is no recursion, no complicated diagram of boxes and arrows. We’re not playing two Mysterious Black Boxes against each other.
But the main reason to think about this approach, IMO, is that it’s a true reduction of the problem. Prosaic alignment proposals have a tendency to play a shell game with the Hard Part of the problem, move it around and hide it in different black boxes but never actually eliminate it. “Just Retarget the Search” directly eliminates the inner alignment problem. No shell game, no moving the Hard Part around. It still leaves the outer alignment problem unsolved, it still needs assumptions about natural abstractions and retargetable search, but it completely removes one Hard Part and reduces the problem to something simpler.
As such, I think “Just Retarget the Search” is a good baseline. It’s a starting point for thinking about the parts of the problem it doesn’t solve (e.g. outer alignment), or the ways it might fail (retargetable search, natural abstractions), without having to worry about inner alignment.
One of the main reasons I expect this to not work is because optimization algorithms that are the best at optimizing some objective given a fixed compute budget seem like they basically can't be generally-retargetable. E.g. if you consider something like stockfish, it's a combination of search (which is retargetable), sped up by a series of very specialized heuristics that only work for winning. If you wanted to retarget stockfish to "maximize the max number of pawns you ever have" you had, you would not be able to use [specialized for telling whether a move is likely going to win the game] heuristics to speed up your search for moves. A more extreme example is the entire endgame table is useless for you, and you have to recompute the entire thing probably.
Something like [the strategy stealing assumption](https://ai-alignment.com/the-strategy-stealing-assumption-a26b8b1ed334) is needed to even obtain the existence of a set of heuristics just as good for speeding up search for moves that "maximize the max number of pawns you ever have" compared to [telling whether a move will win the game]. Actually finding that set of heuristics is probably going to require an entirely parallel learning process.
This also implies that even if your AI has the concept of "human values" in its ontology, you still have to do a bunch of work to get an AI that can actuallly estimate the long-run consequences of any action on "human values", or else it won't be competitive with AIs that have more specialized optimization algorithms.
I think competitiveness matters a lot even if there's only moderate amounts of competitive pressure. The gaps in efficiency I'm imagining are less "10x worse" and more like "I only had support vector machines and you had SGD"