*This post is part of research I did at OpenAI with mentoring and guidance from Paul Christiano.*

One possible inner alignment technique is to structure your learning algorithm’s inductive biases such that it is far more likely for it to produce an aligned model than a misaligned model. Two basic ways in which you might get traction now on how a technique like that will scale to AGI might be:

- Do experiments with current regularization techniques where you try to determine to what extent models end up being aligned with the loss function they were trained under depending on the types of inductive biases present during their training. I’m actually fairly excited about this approach, and I am working on a post right now which will hopefully provide some ways in which you might start going about doing that. One concern with this sort of approach, however, is that you might not be able to get any traction on phenomenon that only appear once your capabilities get really close to AGI.
- Analyze from a theoretical perspective what happens in the limit of perfect capabilities. The idea here is to try to patch the previous problem where when you do experiments you don’t get to see phenomenon that only happen when your capabilities get really advanced by imagining that your capabilities were perfect. In this case, however, you end up with the opposite problem of not seeing any phenomenon that only appear when your capabilities are weaker.

The nice thing about combining these two approaches is that they tackle the problem from both sides, giving you an idea both of what the weak capability landscape looks like as well as the strong capability landscape. For this post, however, the focus will be on the second approach.

## Background

From a theoretical perspective, the two most natural types of inductive biases are simplicity bias and time complexity bias. Paul Christiano has previously argued that simplicity bias can be deceptive in the limit, but left the question of whether the same is true of time complexity bias as an open problem. The goal of this post is to attempt to resolve that open problem. In particular, though one might have hoped that deceptive alignment could be avoided by heavily penalizing time complexity so that the model doesn’t have time to spend reasoning about deception, I will argue that this is not the case due to the possibility of meta-learning.

First, however, we need a clear understanding of what it means for a model to be deceptive. “Risks from Learned Optimization” provides the model of *deceptive alignment,* wherein a model appears aligned during training for the instrumental reason of being selected by the training process. More generally, however, what we really care about is deceptive behavior—that is, the situation where our model generalizes pathologically such that it produces a malign failure.

Second, we need a clear understanding of why simplicity bias leads to deception and an understanding of why that doesn’t generalize to time complexity bias.

Specifically, the trap that simplicity bias falls into is as follows. First, there are many short but very powerful optimizers composed of very simple objectives with optimization processes capable of making use of vast amounts of compute. Furthermore, for a sufficiently complex task, it seems likely that the simplest program that solves it will be such an optimizer. Second, many of these optimizers are likely to be able to benefit in some way from being selected by your simplicity-biased training process—by defecting against you at some point, for example—and thus will have good training performance. Finally, since the vast majority of possible objectives—especially the very simple ones—are not aligned, the simplest program that solves a sufficiently complex task is unlikely to be aligned.^{[1]}

However, this sort of strategy has a major problem: it has to spend extra time doing the “am I out of training” check, which causes it to lose out on time complexity compared to a robustly aligned model. Thus, one hope for eliminating this sort of thing might be to exploit this asymmetry by strongly selecting your model based on time complexity. If this works, it would mean you might be able to solve inner alignment just by using strong regularization, which would be a big win.

## Overview

This leads to the following theoretical conjecture: the fastest program that solves a task will not be deceptive. In particular, we will be thinking of “fastest programs” as minimal circuits. In his original post, Paul phrased this conjecture as “the fastest way to solve any particular problem doesn't involve [deception].”

The conclusion of this post is the following: if there exists some set of natural tasks for which the fastest way to solve them is to do some sort of machine learning to find a good policy, and there is some task for which that machine learning results in deceptive behavior, then there exists a natural task such that the minimal circuit that solves that task also produces deceptive behavior. For example, if the fastest way to solve some set of tasks is to search for simple programs that perform well, and if the simplest programs are deceptive, then there exists a task where the minimal circuit is also likely to be deceptive.

Why might we think such a set of tasks exists? The fundamental idea is to consider the case of meta-learning: if your tasks are things like “classify this image according to this training data” then it seems quite plausible that the fastest way to solve all tasks of that form will be to use some form of ML—in particular, that’s what we do.^{[2]} Then, once you have a task set that satisfies the requirement that the fastest way to solve those tasks is through search over models, you just need to also include some task for which search over models yields deceptive behavior to produce the desired reduction.

Once you have such a task set, the basic insight is to consider what happens if you take all of those different tasks together and combine them into one very diverse meta-learning task. In this case, the minimal circuit that solves that single task is forced to implement the same fastest search algorithm over models that we know has some task on which it yields deceptive behavior. Fundamentally, what’s happening here is that we’re exploiting the fact that the fastest search algorithm does not produce the fastest model. Thus, even if in some sense the fastest model that doesn’t do any search over models (that is, doesn’t do any meta-learning) is safe, as soon as the model starts doing meta-learning, it might find another model which is not safe.

You can think about this argument as a reduction from the safety of minimal circuits to the safety of fast training procedures. Furthermore, if you buy that real-world data is mostly distributed according to a simplicity prior, such that such fast training procedures will have to select models at least somewhat according to simplicity, then you can even view this argument as a direct reduction to the malignancy of simplicity priors.

## Formal argument

We will formalize tasks as MDPs of the form where is the state space, is the action space, is the initial state distribution, is the transition function (which we will write as ), and is the per-step reward function (which we will write as ).^{[3]} Then, we will use to denote the expected future return of a policy and to denote the expected future return starting from some state . Furthermore, for a given MDP, we will say a policy is deceptive—which we will denote as —if , where is a predicate that tests for the presence of deceptive behavior.

Then, let be some set of tasks such that the minimal circuit that solves them first searches for a good policy and then runs it. For example, the task distribution might contain things like “classify this image according to this training data” such that the fastest algorithm is to train and run some model on the given data. Thus, formalizing this assumption, let be the minimal circuit that achieves for some . Now, suppose there exists some such that finds and then runs a deceptive model when applied to . This gives us the setup for our desired reduction: if there exists some task for which yields a deceptive model, then we want to show that there exists some MDP for which the minimal circuit is deceptive. Thus, we will construct and such that the minimal circuit that achieves is deceptive:

- Let be the disjoint union .
- For , we will parameterize our action space such that each state has its own action set . Thus, for , there must be some such that . Then, let .
- For and , let .
- For , let .
- Let be 0 for in different and for in the same .
- Let .

Now, suppose is the minimal circuit that achieves . Then, such that the condition that achieve is equivalent to (also using ) the condition that , which is precisely the condition previously placed on . Thus, since was also defined as the minimal circuit that satisfied that condition.

Then, determining whether is deceptive, we get which holds for , making deceptive, as desired.

## Conclusion

It is worth noting that there is a meaningful sense in which the minimal circuit here is sort of trying to do the right thing—it is actually trying its best to find and run a policy that solves the problem—it’s just that there might exist some input on which the easiest-to-find policy is deceptive. However, despite it sort of doing the right thing, it isn’t safe (at least assuming that there is some task on which the fastest machine learning isn’t safe): it produces outputs that are systematically optimized for values other than the ones we want them to.

Importantly, however, this is because we didn’t successfully incentivize our minimal circuit to also find minimal circuits. If we were somehow able to find some alternative metric which incentivized our minimal circuit to only do search over minimal circuits, then we might be able to break this counterexample. It is an interesting open problem what sorts of objectives might be able to get around this “forwarding the guarantee” problem.^{[4]}

Furthermore, the counterexample presented here is quite natural in the sense of just requiring some environment which is diverse enough to include aspects of meta-learning, which is the sort of thing that could occur in many realistic diverse environments. Specifically, Ortega et al. have argued that “Meta-learning can also occur spontaneously in online regression when the capacity of the agent is bounded and the data is produced by a *single generator,*” noting in particular that “the downside is that we cannot easily control what will be metalearned. In particular, spontaneous meta-learning could lead to undesirable emergent properties, which is considered an open research problem in AI safety [Ortega et al., 2018].”

Thus, this counterexample suggests that in situations involving spontaneous meta-learning we shouldn’t necessarily expect time complexity regularization to guarantee inner alignment, as any meta-learning that has to be done won’t inherit any safety guarantees. Furthermore, this counterexample further suggests that the case of spontaneous meta-learning—or *mesa-optimization* as it is referred to in “Risks from Learned Optimization”—is an important hard case to consider when designing AGI alignment strategies.^{[5]} That being said, the jury is still out on whether time complexity regularization might still be one component of a full inner alignment solution, though this result demonstrates that it at least can’t be the entire solution.

It is worth noting that this example does not straightforwardly generalize to actual machine learning systems—see “Risks from Learned Optimization” for an analysis of whether such a generalization holds. ↩︎

An important question here is why the minimal circuit wouldn't just look like a lookup table. The simple answer is that it probably will at first, but if we give it enough meta-learning problems, then at some finite time the lookup table should become more expensive than an actual learning algorithm, since the learning algorithm should always take a constant amount of time per problem, whereas lookup gets more and more expensive as you have to encode more and more problems. ↩︎

We will refer to the different components of a task with subscripts such that . ↩︎

One way to think about this problem is that it is the ML version of the Tiling Agents Problem—the “Tiling Training Processes Problem,” perhaps. The question is, for the operation of going from a training process to the training process that is incentivized for models trained via that process to implement, what are the fixed points? That is, what sorts of training processes produce models that would use the same training process? ↩︎

Interestingly, this counterexample points at a distinctly different way in which such models might be dangerous, however: whereas “Risks from Learned Optimization” deals with the dangers of a learned objective, this counterexample suggests that the learned search process itself might also pose a problem. ↩︎

Interesting proof. A couple off-the-cuff thoughts...

First, this is proving something much more general than just deceptiveness. We can just scribble out the three or four English sentences with the word "deceptive" in them, and just say "C(π,x) is a predicate" - the proof doesn't actually use any properties of C. So this is a fairly general result about minimal circuits on MDPs - and I'm wondering what other predicates could be plugged into it to yield interesting results.

Second, the proof doesn't actually use the assumption that the minimal circuit on the original set of tasks is performing search. All that matters is that there is some MDP on which the circuit is deceptive, and that the circuit is minimal subject to an average performance bound on the full set of tasks.

Third, we can state the overall conclusion as roughly "if there exists a set of MDPs for which the minimal circuit achieving some average performance on the set is deceptive on at least one of the MDPs, then there exists a single MDP whose minimal circuit is deceptive." On the other hand, the reverse implication holds trivially: if there exists a single MDP whose minimal circuit is deceptive, then there exists a set of MDPs for which the minimal circuit achieving some average performance on the set is deceptive on at least one of the MDPs. Proof: take the set to contain just the one MDP whose minimal dircuit is deceptive. So we don't just have a one-way implication here; we actually have equivalence between two open problems. Obvious next question: what other minimal-circuit problems are equivalent to these two?

I agree that the proof here can be made significantly more general—and I agree that exploring that definitely seems worthwhile—though I also think it's worth pointing out that the proof rests on assumptions that I would be a lot less confident would hold in other situations. The point of explaining the detail regarding search algorithms here is that it gives a plausible story for why the assumptions made regarding πlearn and Tbad should actually hold.

This would be pretty strange -- why not just directly hardcode the policy? Wouldn't that be faster? We need to use machine learning because we aren't able to write down a program that "directly" solves e.g. image recognition, but the "direct" program would be faster if we had some way of finding it. The general reason for optimism is that the "fastest" requirement implies that any extraneous computation (e.g. deception) is removed; that same reason implies that any search would be removed and replaced with a correct output of the search.

Another way to think of this: when you don't have a simplicity bias, you have to compete with the GLUT (Giant Lookup Table), which can be very fast. Even if you take into account the time taken to perform the lookup, in a deterministic environment the GLUT only has to encode the optimal trajectory, not the full policy. In a stochastic environment, the GLUT may need to be exponentially larger, so it may be too slow, but even so you can have GLUT-like things built out of higher-level abstractions, which might be enough to avoid deception. Basically you can do a lot of weird stuff when you don't require simplicity; it's not clear that meta learning should be modeled the same way.

A couple of things. First, a minimal circuit is not the same as a speed-prior-minimal algorithm. Minimal circuits have to be minimal in width + depth, so a GLUT would definitely lose out. Second, even if you're operating on pure speed, I think there are sets of tasks that are large enough that a GLUT won't work. For example, consider the task of finding the minimum of an arbitrary convex function. Certainly for the infinite set of all possible convex functions (on the rationals, say), I would be pretty surprised if something like gradient descent weren't the fastest way to do that. Even if you restrict to only finitely many convex functions, if your set is large enough it still seems hard to do better than gradient descent, especially since looking up the solution in a huge GLUT could be quite expensive (how do you do the lookup? a hash table? a binary tree? I'm not sure if those would be good enough here).

I don't really understand this -- my understanding is that with a minimal circuit, you want to minimize the number of gates in the circuit, and the circuit must be a DAG (if you're allowed to have loops + clocks, then you can build a regular computer, and for complex tasks the problem should be very similar to finding the shortest program and implementing it on a universal circuit).

But then I can create a Turing Machine that interprets its input as a circuit, and simulates each gate of the circuit in sequence. Then the running time of the TM is proportional to the number of gates in the circuit, so an input with minimal running time should be a circuit with the minimal number of gates. This is not technically a Universal TM, since loop-free circuits are not Turing-complete, but I would expect a speed prior using such a Turing Machine would be relatively similar to a speed prior with a true UTM.

I agree that if you have to work on an infinite set of inputs, a GLUT is not the way to do that. I was thinking of the case where you have to work on a small finite set of inputs (hence why I talk about the optimal trajectory instead of the optimal policy), which is always going to be the case in the real world. But this is too pedantic, we can certainly think of the theoretical case where you have to work on an infinite set of inputs. I was mostly trying to use the GLUT as an intuition pump, not arguing that it was always better.

In the case with infinite inputs, I still have the intuition that meta learning is what you do when you don't know enough about the problem to write down the good heuristics straight away, as opposed to being the fastest way of solving the problem. But I agree that the fastest solution won't be a GLUT; I'm thinking more a combination of really good heuristics that "directly" solve the problem. (Part of this is an intuition that for any reasonably structured set of inputs, value-neutral optimization of a fixed objective is very inefficient.)

Planned summary:

While it has been argued that the

simplestprogram that solves a complex task is likely to be deceptive, it hasn't yet been argued whether thefastestprogram that solves a complex task will be deceptive. This post argues that fast programs will often be forced tolearna good policy (just as we need to do today), and the learned policy is likely to be deceptive (presumably due to risks from learned optimization). Thus, there are at least some tasks where the fastest program will also be deceptive.Planned opinion:

This is an intriguing hypothesis, but I'm not yet convinced: it's not clear why the fastest program would have to

learnthe best policy, rather than directly hardcoding the best policy. If there are multiple possible tasks, the program could have a nested if structure that figures out which task needs to be done and then executes the best policy for that task. More details in this comment.Very interesting!

I'm confused about why the "spontaneous meta-learning" in Ortega et al. is equivalent to (or a special case of?) mesa-optimization; which was also suggested in MIRI's August 2019 Newsletter. My understanding of Ortega et al. is that "spontaneous meta-learning" describes a scenario in which training on a sequence from a single generator is equivalent to training on sequences from multiple generators. I haven't seen them discuss this issue in the context of the trained model itself doing search/optimization.

Regarding Ortega et al., I agree that the proof presented in the paper is just about how a single generator can be equivalent to sequences of multiple generators. The point that the authors are using that proof to make, however, is somewhat more broad, which is that your model can learn a learning algorithm even when the task you give it isn't explicitly a meta-learning task. Since a learning algorithm is a type of search/optimization algorithm, however, if you recast that conclusion into the language of Risks from Learned Optimization, you get exactly the concern regarding mesa-optimization, which is that models can learn optimization algorithms even when you don't intend them to.

This is perhaps not directly related to your argument here, but how is inner alignment failure distinct from generalization failure? If you train network N on dataset D and optimization pressure causes N to internally develop a planning system (mesa-optimizer) M, aren't all questions of whether M is aligned with N's optimization objective just generalization questions?

More specifically if N is sufficiently overcomplete and well regularized, and D is large enough, then N can fully grok the dataset D, resulting in perfect generalization. It's also straightforward as to why this can happen - when N is large enough to contain enough individually regularized sub-model solutions (lottery tickets) that it is approximating a solomonoff style ensemble.

Anyway if N has a measurably low generalization gap on D, then it doesn't seem to matter whether M exists or what it's doing with regard to generalization on D. So is the risk of 'inner alignment failure' involve out of distribution generalization?

Yes, inner alignment is a subset of robustness. See the discussion here and my taxonomy here.

This reflects a misunderstanding of what a mesa-optimizer is—as we say in Risks from Learned Optimization:

I don't think this characterization is correct. A couple of points:

Ahh thanks I found that post/discussion more clear than the original paper.

So in my example, the whole network N is just a mesa-optimizer according to your definition. That doesn't really change anything, but your earlier links already answered my question.

I should have clarified I meant grokking only shows prefect generalization on-distribution. Yes, off-distribution failure is always possible (and practically unavoidable in general considering adversarial distributions) - that deceptive RSA example is interesting.

I mentioned lottery tickets as examples of minimal effective sub-models embedded in larger trained over-complete models. I'm not sure what you mean by "active" at initialization, as they are created through training.

I'm gesturing at the larger almost self-evident generalization hypothesis - that overcomplete ANNs, with proper regularization and other features - can efficiently explore an exponential/combinatoric space of possible solutions (sub-models) in parallel, where each sub-model corresponds roughly to a lottery ticket. Weight or other regularization is essential and helps ensure the training process comes to approximate bayesian inference over the space of sub-models.