The standard frame (Evan Hubinger, 2021) is:

Outer alignment refers to the problem of finding a loss/reward function such that the training goal of “a model that optimizes for that loss/reward function” would be desirable.
Inner alignment refers to the problem of constructing a training rationale that results in a model that optimizes for the loss/reward function it was trained on.

Here’s the reframe (I believe the credit for this breakdown is due to John Wentworth, although I haven't found anything online to link to for it):

Reward Specification: Finding a policy-scoring function $J (π)$ such that (nearly–)optimal policies for that scoring function are desirable.
- "Are you optimising for the right thing?"
Adequate Policy Learning: Finding a policy that’s actually (nearly–)optimal for that scoring function.
- "Did you optimise for it enough?"

As Paul Christiano points out (in an excerpt recently highlighted by Alex Turner to make a similar point), factoring out Reward Specification represents only one "particular kind of alignment strategy, that's like a two step plan" for how one might try to conclude that a learned policy is desirable, where we first align a scoring function $J$ with what we actually want, and then align a training process with our scoring function. Under this kind of plan, the proof tree for overall existential safety would conclude by applying an implication like this:

\begin{matrix} \exists J : t y p e o f (π) \to R, ε : R . & Reward Specification | J (π) - {max}_{π^{*}} J (π^{*}) | < ε \Rightarrow s a f e l y M i t i g a t e s X R i s k (π) \land & \exists T : Ω \to t y p e o f (π) . & Adequate Policy Learning P_{π \sim T} (| J (π) - max π^{*} J (π^{*}) | < ε) > p \Rightarrow & P_{π \sim T} (s a f e l y M i t i g a t e s X R i s k (π)) > p & \Rightarrow & Probably ends acute risk period \end{matrix}

Mesa-optimisers

At this point you might be wondering: what happened to the concept that $π$ might contain a mesa-optimiser for something different from $J$ ?

The proper role of the mesa-optimiser concept is in an explanation of why Adequate Policy Learning is insidiously difficult:

In machine learning, one typically approximates (the gradient of) the reward function by aggregating a set of evaluations on only a finite sample (the empirical “training distribution”), whereas the true $J$ also depends on what $π$ does globally, even in rare states that may never be sampled

...

Sharp Left Turn

Sharp Left Turn