
As a writing exercise, I'm writing an AI Alignment Hot Take Advent Calendar - one new hot take, written every day for 25 days. Or until I run out of hot takes.

I'm not saying to never say "inner alignment." But you had better be capable of not using that framing if you want to work on alignment.

The inner/outer alignment framing is from Risks from Learned Optimization. You already know what I mean, but just to go through the motions: it describes a situation where there are two key optimization processes going on: the "outer" optimization process humans are using to create an AI, and the "inner" process the created AI is using to make plans. Outer alignment is when the outer process is aligned with humans, and inner alignment is when the inner process is aligned with the outer process. This is an outstandingly useful framing for thinking about certain kinds of AI development, especially model-free RL.
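
To make the two-level structure concrete, here's a minimal toy sketch. It is not from Risks from Learned Optimization, and every name in it (`training_reward`, `human_values`, `inner_planner`, `outer_optimizer`) is an illustrative assumption: an outer hill-climb tunes the goal a planner internalizes, while the planner itself searches over actions to satisfy that goal. In this toy, outer alignment is roughly "does `training_reward` track `human_values`?" and inner alignment is roughly "does the internalized `goal` track what `training_reward` rewards?"

```python
"""Toy sketch of nested optimization: an outer process shaping an inner planner."""
import random

def human_values(action: float) -> float:
    # What we actually care about (stand-in: actions near 10 are good).
    return -abs(action - 10)

def training_reward(action: float) -> float:
    # The signal the outer process optimizes; here it happens to match
    # human_values, but in general it's only a proxy (outer alignment question).
    return -abs(action - 10)

def inner_planner(goal: float, candidate_actions) -> float:
    # The "inner" optimization: the created AI searches over plans (actions)
    # to best satisfy whatever goal it internalized during training.
    return max(candidate_actions, key=lambda a: -abs(a - goal))

def outer_optimizer(n_steps: int = 200) -> float:
    # The "outer" optimization: hill-climb over the planner's internalized goal,
    # scoring each candidate by the training reward of the plans it produces.
    goal = 0.0
    actions = [random.uniform(0, 20) for _ in range(50)]
    for _ in range(n_steps):
        candidate = goal + random.gauss(0, 1)
        if (training_reward(inner_planner(candidate, actions))
                >= training_reward(inner_planner(goal, actions))):
            goal = candidate
    return goal

if __name__ == "__main__":
    learned_goal = outer_optimizer()
    action = inner_planner(learned_goal, [random.uniform(0, 20) for _ in range(50)])
    # Inner alignment: does learned_goal track training_reward's optimum?
    # Outer alignment: does training_reward track human_values?
    print(f"internalized goal ~ {learned_goal:.2f}, chosen action ~ {action:.2f}")
```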

However, this framing has a limited domain of validity. Sometimes it breaks down in a way that looks like adding degrees of freedom - as if humans and the created AI are two ends of a string, and the AI-optimization process is a point along the string. Then you can imagine holding the ends fixed but being able to wiggle the midpoint around.

This looks like creating an AI that's still aligned with humans, but not because it's aligned with a creation process that is itself aligned with humans - instead, both processes are imperfect, and in order to get good outcomes you have to reason about the alignment of the end-product AI directly. This is how it is for most present-day "dumb" AI, except with "aligned" replaced by "useful and safe." One can also see shardites as arguing that this is what we should be doing.

Other times, the inner/outer framing breaks down entirely, because there isn't a distinct two-part structure. The key example is reflectivity - using the AI to reflect on its own optimization process rapidly blurs the line between what's "inner" and what's "outer."

But it's not just obvious reflectivity - sometimes the breakdown seems like it was in the problem statement the whole time. Often when people try to solve one of inner or outer alignment entirely, they find that they've sneakily had to solve the other problem as well. In order to "really solve" outer alignment, you want the AI-optimization process to care about the generalization properties of the created AI beyond the training data. In order to "really solve" inner alignment, the created AI shouldn't just care about the raw outputs of the process that created it; it should care about the things communicated by the AI-optimization process in its real-world context.

I endorse these attempts to "really" solve alignment. If you think that the inner/outer alignment framing is obvious, it's probably valuable for you to deliberately look for opportunities to blur the lines. Dream up AI-generating processes that care about the AI's inner properties, or AIs that learn to care about humans in a self-reflective process not well-described in terms of an intermediate AI-optimizer. Queer the inner/outer alignment dichotomy.

3 comments

I'd encourage you to delve more into this paragraph, as I think this is the part of your article where it becomes the most hand-wavy:

"In order to 'really solve' outer alignment, you want the AI-optimization process to care about the generalization properties of the created AI beyond the training data. In order to 'really solve' inner alignment, the created AI shouldn't just care about the raw outputs of the process that created it; it should care about the things communicated by the AI-optimization process in its real-world context."

I agree, and would like a bit more detail and perhaps an example here.
