We've written a paper on online imitation learning, and our construction allows us to bound the extent to which mesa-optimizers could accomplish anything. This is not to say it will definitely be easy to eliminate mesa-optimizers in practice, but investigations into how to do so could look here as a starting point. The way to avoid outputting predictions that may have been corrupted by a mesa-optimizer is to ask for help when plausible stochastic models disagree about probabilities.

Here is the abstract:

In imitation learning, imitators and demonstrators are policies for picking actions given past interactions with the environment. If we run an imitator, we probably want events to unfold similarly to the way they would have if the demonstrator had been acting the whole time. No existing work provides formal guidance in how this might be accomplished, instead restricting focus to environments that restart, making learning unusually easy, and conveniently limiting the significance of any mistake. We address a fully general setting, in which the (stochastic) environment and demonstrator never reset, not even for training purposes. Our new conservative Bayesian imitation learner underestimates the probabilities of each available action, and queries for more data with the remaining probability. Our main result: if an event would have been unlikely had the demonstrator acted the whole time, that event's likelihood can be bounded above when running the (initially totally ignorant) imitator instead. Meanwhile, queries to the demonstrator rapidly diminish in frequency.

The second-last sentence refers to the bound on what a mesa-optimizer could accomplish. We assume a realizable setting (positive prior weight on the true demonstrator-model). There are none of the usual embedding problems here—the imitator can just be bigger than the demonstrator that it's modeling.

(As a side note, even if the imitator had to model the whole world, it wouldn't be a big problem theoretically. If the walls of the computer don't in fact break during the operation of the agent, then "the actual world" and "the actual world outside the computer conditioned on the walls of the computer not breaking" both have equal claim to being "the true world-model", in the formal sense that is relevant to a Bayesian agent. And the latter formulation doesn't require the agent to fit inside the world that it's modeling.)
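To make the rule from the abstract concrete, here is a minimal sketch of a single step of a conservative imitator. It is an illustration under stated assumptions, not the paper's exact construction: the function name `conservative_step`, the parameter `alpha`, and the rule for deciding which models count as plausible (posterior weight at least `alpha` times the largest weight) are all hypothetical choices made for this example.

```python
import random


def conservative_step(posterior, model_probs, alpha, rng=random):
    """One step of a conservative imitator (illustrative sketch).

    posterior:   list of posterior weights, one per demonstrator-model
    model_probs: list of dicts; model_probs[i][a] is model i's probability
                 of action a given the history so far
    alpha:       a model counts as plausible if its weight is at least
                 alpha times the largest weight (an assumed selection rule)
    Returns (action_or_None, act_probs, query_prob); None means "query".
    """
    top_weight = max(posterior)
    plausible = [i for i, w in enumerate(posterior) if w >= alpha * top_weight]
    actions = list(model_probs[0].keys())
    # Underestimate each action: use the smallest probability that any
    # plausible model assigns to it.
    act_probs = {a: min(model_probs[i][a] for i in plausible) for a in actions}
    # The probability mass left over is spent on querying the demonstrator.
    query_prob = 1.0 - sum(act_probs.values())
    u = rng.random()
    cumulative = 0.0
    for a in actions:
        cumulative += act_probs[a]
        if u < cumulative:
            return a, act_probs, query_prob
    return None, act_probs, query_prob
```

When the plausible models agree, the minimum equals each model's own probability and the query probability is zero. The more they disagree, the more mass goes to asking for help.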

Almost no mathematical background is required to follow [Edit: most of] the proofs. [Edit: But there is a bit of jargon. "Measure" means "probability distribution", and a "semimeasure" is like a probability distribution, except that it may sum to less than one.] We feel our bounds could be made much tighter, and we'd love help investigating that.

These slides (pdf here) are fairly self-contained and a quicker read than the paper itself.

Below, $P_{dem}$ and $P_{im}$ refer to the probability of the event supposing the demonstrator or imitator were acting the entire time. The limit below refers to successively more unlikely events $B$; it's not a limit over time. Imagine a sequence of events $B_i$ such that $\lim_{i \to \infty} P_{dem}(B_i) = 0$.

I haven't read the paper yet, looking forward to it. Using something along these lines to run a sufficiently-faithful simulation of HCH seems like a plausible path to producing an aligned AI with a halting oracle. (I don't think that even solves the problem given a halting oracle, since HCH is probably not aligned, but I still think this would be noteworthy.)

First I'm curious to understand this main result so I know what to look for and how surprised to be. In particular, I have two questions about the quantitative behavior described here:

## 1: Dependence on the prior of the true hypothesis

It seems like you have the following tough case even if the human is deterministic:

- There are N hypotheses about the human, of which one is correct and the others are bad.
- I would like to make a series of N potentially-catastrophic decisions.
- On decision k, all (N-1) bad hypotheses propose the same catastrophic prediction. We don't know k in advance.

...

Here's the sketch of a solution to the query complexity problem.

Simplifying technical assumptions:

I'm pretty sure removing those is mostly just a technical complication.

Safety assumptions:

Algorithm:

On each round, let p0 be the prior probability mass of unfalsified hypotheses predicting 0 and p1 be the same for 1.

I understand that the practical bound is going to be logarithmic "for a while" but it seems like the theorem about runtime doesn't help as much if that's what we are (mostly) relying on, and there's some additional analysis we need to do. That seems worth formalizing if we want to have a theorem, since that's the step that we need to be correct.

If our models are a trillion bits, then it doesn't seem that surprising to me if it takes 100 bits extra to specify an intended model relative to an effective treacherous model, and if you have a linear dependence that would be unworkable. In some sense it's actively surprising if the very shortest intended vs treacherous models have description lengths within 0.0000001% of each other unless you have a very strong skills vs values separation. Overall I feel much more agnostic about this than you are.

This doesn't seem like it works once you are selecting on competent treacherous models. Any competent model will err very rarely (with probability less than 1 / (feasible training time), probably much less). I do...

Trying to defect at time T is only a good idea if it's plausible that your mechanism isn't going to notice the uncertainty at time T and then query the human. So it seems to me like this argument can never drive P(successful treachery) super low, or else it would be self-defeating (for competent treacherous agents).

Subroutines of functions aren't always simpler. Even without treachery this seems like it's basically guaranteed. If "the truth" is just a simulation of a human, then the truth is a subroutine of the simplest model. But instead you could have a model of the form "Run program X, treat its output as a program, then run that." S...

Edit: I no longer endorse this comment; see this comment instead.

I think you've just assumed the entire inner alignment problem away. You assume that the model (the imitation learner) is a perfect Bayesian, rather than just some trained neural network or other function approximator (edit: I was just misinterpreting here, and the training process is supposed to be Bayesian rather than the model; see Rohin's comment below). The entire point of the inner alignment problem, though, is that you can't guarantee that your trained model is actually a perfect Bayesian, or anything else. In practice, all we actually generally do when we do ML is train some neural network on some training data/environment and hope that the implicit inductive biases of our training process are such that we end up with a model doing the right thing. That means we can't really control what sort of model (Bayesian or anything else) we get when we do ML, and that uncontrollability, in my eyes, *is* the inner alignment problem.

While I share your position that this mostly isn't addressing the things that make inner alignment hard / risky in practice, I agree with Vanessa that this does not assume the inner alignment problem away, unless you have a particularly contorted definition of "inner alignment".

There's an optimization procedure (Bayesian updating) that is selecting models (the model of the demonstrator) that can themselves be optimizers, and you could get the wrong one (e.g. the model that simulates an alien civilization that realizes it's in a simulation and predicts well to be selected by the Bayesian updating but eventually executes a treacherous turn). The algorithm presented precludes this from happening with some probability. We can debate the significance, but it seems to me like it is clearly doing something solution-like with respect to the inner alignment problem.

I think this is completely unfair. The inner alignment problem exists even for perfect Bayesians, and solving it in that setting contributes much to our understanding. The fact we don't have satisfactory mathematical models of deep learning performance is a different problem, which is broader than inner alignment and to first approximation orthogonal to it. Ideally, we will solve this second problem by improving our mathematical understanding of deep learning and/or other competitive ML algorithms. The latter effort is already underway by researchers unrelated to AI safety, with some results. Moreover, we can in principle come up with heuristics for how to apply this method of solving inner alignment (which I call "confidence thresholds" in my own work) to deep learning: e.g. use NNGP to measure confidence, or use an evolutionary algorithm with a population of networks and check how well they agree with each other. Of course, if we do this we won't have formal guarantees that it will work, but, like I said, this is a broader issue than inner alignment.
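As a rough illustration of the "population of networks" heuristic mentioned above, the sketch below defers to a human whenever the members of an ensemble disagree too much about the next-action distribution. Everything here is an assumption made for illustration: the predictors are stand-ins represented only by their output distributions, disagreement is measured by the largest pairwise total variation distance, and `threshold` is a hypothetical tuning parameter. As the comment says, nothing like this comes with formal guarantees.

```python
def total_variation(p, q):
    """Total variation distance between two distributions given as dicts."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def act_or_defer(predictions, threshold):
    """Return the averaged prediction if the ensemble agrees, else None.

    predictions: list of dicts, one per ensemble member, mapping each
                 action to that member's predicted probability
    threshold:   largest tolerated pairwise total variation distance
                 (a hypothetical tuning parameter)
    None means "defer to the human instead of acting".
    """
    worst = max(
        (
            total_variation(p, q)
            for i, p in enumerate(predictions)
            for q in predictions[i + 1:]
        ),
        default=0.0,
    )
    if worst > threshold:
        return None
    keys = set().union(*predictions)
    n = len(predictions)
    return {k: sum(p.get(k, 0.0) for p in predictions) / n for k in keys}
```

Using the worst pairwise disagreement rather than the average is a deliberately cautious choice: a single dissenting member is enough to trigger a deferral.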

Regardless of how you define inner alignment, I think that the vast majority of the existential risk comes from that “broader issue” that you're pointing to of not being able to get worst-case guarantees due to using deep learning or evolutionary search or whatever. That leads me to want to define inner alignment to be about that problem—and I think that is basically the definition we give in Risks from Learned Optimization, where we introduced the term. That being said, I do completely agree that getting a better understanding of deep learning is likely to be critical.

[Emphasis added.] I think this is a common and serious mistake-pattern, and in particular is one of the more common underlying causes of framing errors. The pattern is roughly:

quite fit Cause(X)

The problem is that, in trying to "shoehorn" cause(y) into the category Cause(X), we miss the opportunity to notice a different pattern, which is more directly useful in understanding y as well as some other cluster of problems related to y.

A concrete example: this is the same mistake I accused Zvi of making when trying to cast moral mazes as a problem of super-perfect competition. The conditions needed for super-perfect competition to explain m...

I mean, I don't think I'm “redefining” inner alignment, given that I don't think I've ever really changed my definition and I was the one that originally came up with the term (inner alignment was due to me, mesa-optimization was due to Chris van Merwijk). I also certainly agree that there are “more than just inner alignment problems going on in the lack of worst-case guarantees for deep learning/evolutionary search/etc.”—I think that's exactly the point that I'm making, which is that while there are other issues, inner alignment is what I'm most concerned about. That being said, I also think I was just misunderstanding the setup in the paper—see Rohin's comment on this chain.

I think I mostly agree with what you're saying here, though I have a couple of comments—and now that I've read and understand your paper more thoroughly, I'm definitely a lot more excited about this research.

I don't think that's right. If you're modeling the *training process* as Bayesian, as I now understand, then the issue is that what makes the problem go away isn't more intelligent models, but less efficient training processes. Even if we have arbitrary compute, I think we're unlikely to use training processes that look Bayesian, just because true Bayesianism is really hard to compute, such that for basically any amount of computation that you have available to you, you'd rather run some more efficient algorithm like SGD.

I worry about these sorts of Bayesian analyses where we're assuming that we're training a large population of models, one of which is assumed to be an accurate model of the world, since I expect us to end up using some sort of local search instead of a global search like that, just because I think that local search is way more efficient than any sort of Bayesianish global search...

I certainly don't think SGD is a powerful enough optimization process to do science directly, but it definitely seems powerful enough to find an agent which does do science.

We know that local search processes can produce AGI, so viability is a question of efficiency—and we know that SGD is at least efficient enough to solve a wide variety of problems from image classification, to language modeling, to complex video games, all given just current compute budgets. So while I could certainly imagine SGD being insufficient, I definitely wouldn't want to bet on it.

Planned summary:

...

Thanks for the post and writeup, and good work! I especially appreciate the short, informal explanation of what makes this work.

Given my current understanding of the proposal, I have one worry which makes me reluctant to share your optimism about this being a solution to inner alignment:

The scheme doesn't protect us if somehow all top-n demonstrator models have correlated errors. This could happen if they are coordinating, or more prosaically if our way to approximate the posterior leads to such correlations. The picture I have in my head for the latter is...

Thanks for sharing this work!

Here's my short summary after reading the slides and scanning the paper.

...

This is exactly the method used in my paper about delegative RL and also an earlier essay that doesn't assume finite MDPs or access to rewards but has stronger assumptions about the advisor (so it's essentially doing imitation learning + denoising). I pointed out the connection to mesa-optimizers (which I called "daemons") in another essay.

Reviewed here.

It is worth noting that this approach doesn't deal with non-Cartesian daemons, i.e. malign hypotheses that attack via the *physical side effects* of computations rather than their output. Homomorphic encryption is still the best solution I know to these.

Interesting paper! I like the focus on imitation learning, but the really new food-for-thought thing to me is the bit about dropping i.i.d. assumptions and then seeing how far you can get. I need to think more about the math in the paper before I can ask some specific questions about this i.i.d. thing.

My feelings about the post above are a bit more mixed. Claims about inner alignment always seem to generate a lot of traffic on this site. But a lot of this traffic consists of questions and clarification about what exactly counts as an inner alignment fail...

[edited to delete and replace an earlier question] Question about the paper: under equation (3) on page 4 I am reading:

This confused me initially to no end, and still confuses me. Should this be:

The 0 on the l.h.s. means the imitator is picking the action a itself instead of deferring to the demonstrator *or picking one of the other actions*???

This would seem to be more consistent with the definitions that follow, and it would seem to make more ...