Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

The standard frame (Evan Hubinger, 2021) is:

  • Outer alignment refers to the problem of finding a loss/reward function such that the training goal of “a model that optimizes for that loss/reward function” would be desirable.
  • Inner alignment refers to the problem of constructing a training rationale that results in a model that optimizes for the loss/reward function it was trained on.

Here’s the reframe (I believe the credit for this breakdown is due to John Wentworth, although I haven't found anything online to link to for it):

  • Reward Specification: Finding a policy-scoring function J such that (nearly–)optimal policies for that scoring function are desirable.
    • "Are you optimising for the right thing?"
  • Adequate Policy Learning: Finding a policy that’s actually (nearly–)optimal for that scoring function.
    • "Did you optimise for it enough?"

As Paul Christiano points out (in an excerpt recently highlighted by Alex Turner to make a similar point), factoring out Reward Specification represents only one "particular kind of alignment strategy, that's like a two step plan" for how one might try to conclude that a learned policy is desirable, where we first align a scoring function J with what we actually want, and then align a training process with our scoring function. Under this kind of plan, the proof tree for overall existential safety would conclude by applying an implication like this:
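
In symbols, the two-step plan has roughly the following shape. This is a sketch in assumed notation rather than the author's own formula: J is the policy-scoring function, π ranges over policies, π̂ is the learned policy, ε is a tolerance on optimality, and Desirable is an informal predicate.

```latex
% Step 1, Reward Specification: near-optimality under J implies desirability.
\forall \pi.\quad J(\pi) \ge \sup_{\pi'} J(\pi') - \varepsilon
  \;\Longrightarrow\; \mathrm{Desirable}(\pi)

% Step 2, Adequate Policy Learning: the learned policy is near-optimal under J.
J(\hat{\pi}) \ge \sup_{\pi'} J(\pi') - \varepsilon

% Conclusion, by instantiating step 1 at \hat{\pi}:
\mathrm{Desirable}(\hat{\pi})
```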


At this point you might be wondering: what happened to the concept that π might contain a mesa-optimiser for something different from J?

The proper role of the mesa-optimiser concept is in an explanation of why Adequate Policy Learning is insidiously difficult:

  1. In machine learning, one typically approximates (the gradient of) the reward function by aggregating a set of evaluations on only a finite sample (the empirical “training distribution”), whereas the true J also depends on what π does globally, even in rare states that may never be sampled during training.[1]
  2. If the trajectories observed in deployment have a state distribution which diverges from the training distribution, then the policy that was optimal on the training set will likely diverge from the optimal policy. This is standard out-of-distribution generalisation failure.
  3. If the policy that was optimal on the training set includes a good understanding of the real world and consequentialist reasoning (which is likely, if the training set is sufficiently rich that you can get a policy that's transformative in the real world), then it contains a capable mesa-optimiser, and it's likely that it's still transformative in the real world, even while experiencing generalisation failure: capabilities generalise better than alignment.
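
The first two steps can be illustrated with a toy example. The sketch below is hypothetical throughout (the states, probabilities, rewards and policy names are invented for illustration): two policies score identically on a finite training sample that never visits a rare state, yet their true scores, taken over the full state distribution, differ enormously.

```python
# Hypothetical toy setup: five states; state 4 is rare but high-stakes.
STATE_PROBS = {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.249, 4: 0.001}
CATASTROPHE = -1_000_000.0


def reward(state, action):
    """Reward for taking `action` in `state` (illustrative numbers only)."""
    if state == 4:
        return 0.0 if action == "safe" else CATASTROPHE
    return 1.0 if action == "work" else 0.0


def true_score(policy):
    """Stand-in for the true J: expectation over the full state distribution."""
    return sum(p * reward(s, policy(s)) for s, p in STATE_PROBS.items())


def empirical_score(policy, sample):
    """Average reward over a finite sample of states (the training signal)."""
    return sum(reward(s, policy(s)) for s in sample) / len(sample)


def intended(state):
    return "safe" if state == 4 else "work"


def lookalike(state):
    # Agrees with `intended` on every sampled state, differs only in state 4.
    return "work"


# A fixed training sample that happens never to include the rare state.
training_sample = [0, 1, 2, 3] * 25

print(empirical_score(intended, training_sample),
      empirical_score(lookalike, training_sample))
print(true_score(intended), true_score(lookalike))
```

Nothing in the training signal distinguishes the two policies; the difference lives entirely in a state the sample never reached.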

The bottom line is that, once we get beyond some level of capabilities, it becomes likely that the gap between the empirical training distribution and the intended reward function suddenly gets exploited in a pretty worst-case-like way (i.e. optimising for some arbitrary thing, instead of merely being bad at optimising the outer objective). I think this phenomenon is also what the concept of a sharp left turn refers to, so this post could equally be titled "Reframing the sharp left turn".

Inner alignment

The standard "inner alignment" answer to "what do we do about this?" is, to quote Paul again[2],

let’s build a system that’s internalized that objective in some sense, or is not doing any other optimization beyond pursuit of that objective.

In other words: let's make sure that the policy π contains a mesa-optimiser for the outer objective J, or at least, contains no other mesa-optimisers.

If we assume outer alignment and Reward Specification are solved, the above notion of inner alignment is neither necessary[3] nor sufficient[4] for actually establishing that the learned policy is desirable. I think I'm in agreement with Alex Turner on this point, but I disagree insofar as I still think the reframed 2-step strategy of Reward Specification and Adequate Policy Learning is sufficient, and helpful, i.e. maybe (relatively) tractable (albeit still unnecessary: just one possible kind of decomposition of the alignment problem).

Both "inner misalignment" and "sharp left turn" are just particular stories for why an unbounded out-of-distribution optimality gap is a big problem for the safety of highly capable policies. One does not solve that problem by detecting internal computations with optimiser-like structure, or by installing penalty-function guardrails on the left-hand-side of some internal state-space. If an "inner alignment" solution to Adequate Policy Learning uses an inner-misalignment-detector which is not invariant to extensional equivalence of policies, it is certainly not robust: formally, it cannot be continuous relative to any natural topology on policy-space, let alone Lipschitz-continuous relative to a natural metric on policy-space.
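
A toy illustration of the extensional-equivalence point, under loud assumptions: the two policies, the observation space, and the "detector" below are all hypothetical, and the detector is a deliberately crude stand-in that flags any policy whose bytecode refers to `max`. The two policies compute exactly the same input-output map, yet the structural detector treats them differently.

```python
OBSERVATIONS = range(12)  # hypothetical finite observation space


def policy_a(obs):
    # Encodes the map as an explicit search over actions (optimiser-like).
    return max(range(3), key=lambda a: -(a - obs % 3) ** 2)


def policy_b(obs):
    # Encodes the very same map directly, with no visible search.
    return obs % 3


def structural_detector(policy):
    """Crude intensional check: does the policy's bytecode refer to `max`?"""
    return "max" in policy.__code__.co_names


def behaviour(policy):
    """Extensional description: the full input-output table."""
    return tuple(policy(obs) for obs in OBSERVATIONS)


print(behaviour(policy_a) == behaviour(policy_b))
print(structural_detector(policy_a), structural_detector(policy_b))
```

Any search process that is penalised by such a detector can, in principle, move from the first encoding to the second without changing behaviour at all.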

Kinds of inner alignment that could work

That doesn't mean inner alignment is categorically doomed; there are (at least) two ways to do something like looking at internal structure while still respecting extensional equivalence:

  1. Use the internal structure only to accelerate the computation, but in such a way that extensionally equivalent internal structures would provably result in equivalent outputs. An example of this is the backpropagation algorithm for computing derivatives of a policy.
  2. Search for possible internal-structure-like explanations for the extensional behaviour of a policy that have some intensional “misalignment” property—but without being biased toward structures that resemble how the policy function happens to actually be encoded as a neural network.

Alternatives to inner alignment for Adequate Policy Learning

In my view, the best paths to confidence in Adequate Policy Learning, i.e. that a learned policy actually performs well on the outer objective, are, well, provable performance guarantees on the outer objective.

However, it is important to note that the idea of proving something about the outer objective means the outer objective must be formally well-defined, as a function from policies to scores, and that means it needs to incorporate a well-defined formal world-model. This takes us out of the end-to-end model-free RL setting, but it is equally important to note that it does not take us out of the "deep learning" setting! Deep neural networks can still be used as function approximations for policy and value functions in model-based RL. Model checking seems likely to be a particularly computationally feasible way to get concrete guarantees about such models.
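
As a minimal sketch of what model checking a policy against a formal world-model can look like, consider the toy below. Everything in it is an assumption for illustration: a tiny nondeterministic transition table stands in for the world-model, a dictionary stands in for the policy, and the safety property is simply "no unsafe state is reachable". Real model checkers handle vastly larger state spaces, but the guarantee has the same shape: an exhaustive verdict relative to the model, not a sample-based estimate.

```python
from collections import deque

# Hypothetical world-model: (state, action) -> set of possible next states.
WORLD_MODEL = {
    ("start", "go"): {"mid"},
    ("start", "rush"): {"mid", "cliff"},
    ("mid", "go"): {"goal"},
    ("mid", "rush"): {"goal", "cliff"},
    ("goal", "go"): {"goal"},
    ("goal", "rush"): {"goal"},
    ("cliff", "go"): {"cliff"},
    ("cliff", "rush"): {"cliff"},
}
UNSAFE = {"cliff"}


def reachable_unsafe(policy, initial="start"):
    """Breadth-first search over every outcome the model allows.

    Returns a path to an unsafe state, or None if no unsafe state is
    reachable under `policy` from `initial`.
    """
    frontier = deque([(initial, (initial,))])
    seen = {initial}
    while frontier:
        state, path = frontier.popleft()
        if state in UNSAFE:
            return path
        for nxt in sorted(WORLD_MODEL[(state, policy[state])]):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, path + (nxt,)))
    return None


careful = {"start": "go", "mid": "go", "goal": "go", "cliff": "go"}
hasty = {"start": "go", "mid": "rush", "goal": "go", "cliff": "go"}

print(reachable_unsafe(careful))
print(reachable_unsafe(hasty))
```

The check inspects only what the policy does in each state, so two extensionally equivalent policies necessarily receive the same verdict.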

It is even possible, in this paradigm, to still use deep neural networks to simulate the dynamics of the world (purely predictively, and distinct from the policy and value networks). But, if we do that, we'll need ontology identification for the predictive dynamics network in order to have any hope of solving Reward Specification.

Implications (briefly)

On this view, ontology identification—applied to a purely predictive dynamics network, for the purpose of specifying rewards within its state space—is the proper role of alignment-related problems like Natural Abstractions and ELK (as opposed to something like "asking the AI if it has any submodules that are starting to think about planning to be deceptive"). I'm also excited about Boundaries as a tool for specifying a core safety property to model-check policies against—one which would imply (at least) nonfatality—relative to alien and shifting predictive ontologies.

  1. In Solomonoff induction there is a similar situation, but instead of having an empirical distribution of trajectories, we actually just look at one single ultra-long trajectory and hope that the world is approximately ergodic. Spoiler alert: it isn't. Following a parallel chain of reasoning leads to a similar conclusion: beyond some level of capabilities, mesa-optimisers systematically cause catastrophic out-of-distribution generalisation failure while still being highly capable; cf. The Solomonoff Prior is Malign.

  2. Out of context, as Paul is not endorsing this frame; shortly thereafter he says:

    I think this makes sense on some approaches and not on other approaches. I am most often thinking of it as: there’s some set of problems that seem necessary for outer alignment. I don’t really believe that the problems are going to split into “these are the outer alignment problems, and these are the inner alignment problems”. I think of it more as the outer alignment problems, or the things that are sort of obviously necessary for outer alignment, are more likely to be useful stepping stones, or warm up problems, or something. I suspect in the end, it’s not like we have our piece that does outer alignment and our piece that does inner alignment, and then we put them together.

  3. For example, a policy which is actually just an uploaded human will contain all sorts of mesa-optimisers, but it's generally agreed that there exist at least a few humans who are nonetheless robustly ethical enough (as a composite system) that uploads of them would be existentially safe.

  4. I think the standard response to this is "well, yes, technically, of course the policy could just be incompetent, but that's beside the point as long as it's trying to be aligned," to which I would respond "never attribute to malice that which can be explained by incompetence." Decomposition between beliefs and desires is fundamentally non-unique. There is no clean definition of incompetence which excludes policies that still kill you:

    Oops, I accidentally consequentialisted the atmospheric oxygen away. It's so corrosive and harmful but I totally forgot that humans still need it. I can be so clumsy sometimes! Gosh, this is really bad. This isn't what I wanted at all. I should update to be more liberal about asking for human feedback before I execute plans. If only there were any humans left to ask...

Comments

There is conspicuously little explicit focus on developing purely epistemic AI capabilities, which seem useful for alignment in a more straightforward manner than agentic AIs. A sufficiently good model of the world is going to have accurate models of humans, approaching the capability of formulating uploads, aligned with the original physical humans. In the limit of accuracy the human imitations (or rather imitations of possible humans from sufficiently accurate possible worlds) are a good enough building block for alignment research at higher physical speed. (But with insufficient accuracy, or when warped into greater capability than original humans, they are an alignment hazard.)

The simulators framing of LLMs is a step in the direction of epistemic rationality, away from fixation on goals and policies. RLHF is a step back, destroying the model-of-the-world role of an SSL-trained model without even sparing it as a new explicit element in the architecture of the resulting AI.

So I think it's important to figure out internal alignment objectives for models of the world, things like SSL-trained LLMs that weren't mutilated with RL. This is not alignment with preferences, but it might be similar in asking for coherence in related generated descriptions, like utility functions ask for coherence in actions across related hypothetical situations. (My useless guess is automatic identification of a large number of concepts much smaller/simpler than people/simulacra, taking a look at how they influence episodes where they are relevant/expressed, and retraining the model towards improvement of their coherence, perhaps literally reifying them as agents with simple utility functions in a smaller scope of episodes where they apply.)

"Purely epistemic model" is not a thing; everything is an agent that is self-evidencing at least to some degree. I agree, however, that RLHF actively strengthens goal-directedness (a synonym of self-evidencing), which may otherwise remain almost rudimentary in LLMs.

I'm pretty confused as to how some of the details of this post are meant to be interpreted, so I'll focus on my two main questions, which would probably clear up the rest.

Reward Specification: Finding a policy-scoring function J such that (nearly–)optimal policies for that scoring function are desirable.

If I understand this and the next paragraphs correctly, then J takes in a complete description of a policy, so it also takes into account what the policy does off-distribution or in very rare cases. Is that right? So in this decomposition, "reward specification" would be responsible for detecting deceptive misalignment? If so, that seems very non-standard and should probably be called out more. In particular, all the problems usually clustered in inner alignment would then have to be solved as part of reward specification, as far as I can tell. If this is not what you meant (i.e. if J just evaluates the policy on some distribution), then I don't see how the proof step could hold.

If an "inner alignment" solution to Adequate Policy Learning is not invariant to extensional equivalence of policies, it is certainly not robust: formally, it cannot be continuous relative to any natural topology on policy-space, let alone Lipschitz-continuous relative to a natural metric on policy-space.

Can you say more about what you mean by this? By the solution being invariant to extensional equivalence, do you mean that two policies that are extensionally equal should have the same probability (density) of being sampled using our training procedure? That would seem like an extremely strong and unnecessary condition. For continuity, I guess you mean that some density of  would be discontinuous if T wasn't invariant? (Continuity of T itself seems irrelevant, even assuming  has some natural topology.) I don't see why that is bad, or "non-robust" in any important sense. I currently understand your point here as: if the policy learning method isn't invariant under extensional equivalence, then it will assign very different probabilities to policies that have almost the same (or even exactly the same) behavior. But why would that be an issue? We just want one good policy, and if putting restrictions on its internal implementation makes that easier to find, why shouldn't we do that? In fact, if we had a completely deterministic procedure that just spits out one out of the many good policies, that seems like a perfectly fine solution even though it's maximally "discontinuous" (the density doesn't even exist as a function).


Thanks, this is very helpful feedback about what was confusing. Please do ask more questions if there are still more parts that are hard to interpret.

To the first point: yes, J evaluates on all trajectories, even off-distribution. It may do this in a Bayesian way, or a worst-case way. I claim that J does not need to “detect deceptive misalignment” in any special way, and I’m not optimistic that progress on such detection is even particularly helpful, since incompetence can also be fatal, and deceptive misalignment could Red Queen Race ahead of the detector.

Instead: a deceptively aligned policy that is bad must concretely do bad stuff on some trajectories. J can detect this by simply detecting bad stuff.

If there’s a sneaky hard part of Reward Specification beyond the obvious hard part of defining what’s good and bad, it would be “realistically defining the environment.” (That’s where purely predictive models come in.)


It is perhaps also worth pointing out that computing J is likely to be extremely hard. Step 1 is just about defining J as a unique mathematical function; the problem of tractably computing guaranteed bounds on J falls under step 2.

My guess is that the latter will require some very fancy algorithms of the model-checking variety and some very fancy AI heuristics that plug into those algorithms in a way that can improve their performance without potentially corrupting their correctness (basically because they’re Las Vegas algorithms instead of Monte Carlo algorithms).
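
A minimal sketch of that division of labour, with every name hypothetical: the search below is exhaustive, and the heuristic is allowed only to reorder candidates, never to prune them. A bad heuristic therefore costs time but cannot change the verdict, which is the Las Vegas property (always correct, variable running time) as opposed to the Monte Carlo one (bounded running time, possibly wrong).

```python
def find_counterexample(candidates, is_violation, heuristic):
    """Exhaustively search `candidates` for a violation.

    `heuristic` scores candidates and only influences the order in which
    they are checked. Returns (counterexample_or_None, number_of_checks).
    """
    checks = 0
    for candidate in sorted(candidates, key=heuristic, reverse=True):
        checks += 1
        if is_violation(candidate):
            return candidate, checks
    return None, checks


def violates(x):
    # Hypothetical property under test: exactly one input violates it.
    return x == 997


good = find_counterexample(range(1000), violates, heuristic=lambda x: x)
bad = find_counterexample(range(1000), violates, heuristic=lambda x: -x)
none = find_counterexample(range(1000), lambda x: False, heuristic=lambda x: x)

print(good, bad, none)
```

Both heuristics find the same counterexample, and when none exists the search reports that only after checking every candidate.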

Thanks, computing J not being part of step 1 helps clear things up.

I do think that "realistically defining the environment" is pretty closely related to being able to detect deceptive misalignment: one way J could fail due to deception would be if its specification of the environment is good enough for most purposes, but still has some differences to the real world which allow an AI to detect the difference. Then you could have a policy that is good according to J, but which still destroys the world when actually deployed.

Similar to my comment in the other thread on deception detection, most of my optimism about alignment comes from worlds where we don't need to define the environment in such a robust way (doing so just seems hopelessly intractable to me). Not sure whether we mostly disagree on the hardness of solving things within your decomposition, or on how good chances are we'll be fine with less formal guarantees.


To the final question, for what it’s worth to contextualize my perspective, I think my inside-view is simultaneously:

  • unusually optimistic about formal verification
  • unusually optimistic about learning interpretable world-models
  • unusually pessimistic about learning interpretable end-to-end policies

I agree: if there is a class of environment-behaviors that occur with nonnegligible probability in the real world but occur with negligible probability in the environment-model encoded in J, that would be a vulnerability in the shape of alignment plan I’m gesturing at. However, aligning a predictive model of reality to reality is “natural” compared to normative alignment. And the probability with which this vulnerability can actually be bad is linearly related to something like total variation distance between the model and reality; I don’t know if this is exactly formally correct, but I think there’s some true theorem vaguely along the lines of: a 1% TV distance could only cause a 1% chance of alignment failure via this vulnerability. We don’t have to get an astronomically perfect model of reality to have any hope of its not being exploited. Judicious use of worst-case maximin approaches (e.g. credal sets rather than pure Bayesian modeling) will also help a lot with narrowing this gap, since it will be (something like) the gap to the nearest point in the set rather than to a single distribution.
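
One standard fact that plausibly underlies the theorem gestured at here is that total variation distance bounds how far two distributions can disagree about any single event. A sketch, where the names are assumptions for illustration: P is the environment-model's distribution over trajectories, Q is reality's, B is the event "alignment failure", and δ is a failure bound established against the model.

```latex
% Total variation distance bounds the disagreement on every event:
d_{\mathrm{TV}}(P, Q) \;=\; \sup_{A}\, \lvert P(A) - Q(A) \rvert
\quad\Longrightarrow\quad
Q(B) \;\le\; P(B) + d_{\mathrm{TV}}(P, Q).

% With a credal set \mathcal{C} and a worst-case guarantee
% \sup_{P \in \mathcal{C}} P(B) \le \delta, only the nearest member matters:
Q(B) \;\le\; \delta + \inf_{P \in \mathcal{C}} d_{\mathrm{TV}}(P, Q).
```

Under these assumptions, a 1% TV distance adds at most one percentage point to the failure probability established against the model.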

Instead: a deceptively aligned policy that is bad must concretely do bad stuff on some trajectories. J can detect this by simply detecting bad stuff.

I think it extremely probable that there exist policies which exploit adversarial inputs to J such that they can do bad stuff while getting J to say "all's fine." 


For most choices of J I agree, but the existence of any adversarial examples for J would be an outer alignment problem (you get what you measure). (For outer alignment, it seems necessary that there exist, and that humans discover, natural abstractions relative to formal world models that robustly pick out at least the worst stuff.)


To the second point, I meant something very different; I edited this sentence and hopefully it is clearer now. I did not mean that T should respect extensional equivalence of policies (if it didn’t, we could always simply quotient it by extensional equivalence of policies, since it outputs rather than inputs policies).

Instead, I meant that a training story that involves mitigating your model-free learning algorithm’s unbounded out-of-distribution optimality gap by using some kind of interpretability loop where you’re applying a detector function to the policy to check for inner misalignment (and using that to guide policy search) has a big vulnerability: the policy search can encode similarly deceptive (or even exactly extensionally equivalent) policies in other forms which make the deceptiveness invisible to the detector. Respecting extensional equivalence is a bare-minimum kind of robustness to ask from an inner-misalignment detector that is load-bearing in an existential-safety strategy.

FWIW, I agree that respecting extensional equivalence is necessary if we want a perfect detector, but most of my optimism comes from worlds where we don't need one that's quite perfect. For example, maybe we prevent deception by looking at the internal structure of networks, and then get a good policy even though we couldn't have ruled out every single policy that's extensionally equivalent to the one we did rule out. To me, it seems quite plausible that all policies within one extensional equivalence class are either structurally quite similar or so complex that our training process doesn't find them anyway. The main point of a deception detector wouldn't be to give us guarantees, but instead to give training a sufficiently strong inductive bias against simple deceptive policies, such that the next-simplest policy is non-deceptive.

I see, that makes much more sense than my guess, thanks!