In the original “AI safety via debate” paper, Geoffrey Irving et al. introduced the concept of analyzing different alignment proposals from the perspective of what complexity class they are able to access under optimal play. I think this is a pretty neat way to analyze different alignment proposals—in particular, I think it can help us gain some real insights into how far into the superhuman different systems are able to go. Thus, the goal of this post is to try to catalog different alignment proposals based on the metric of what complexity class they have so far been proven to access.

To do that, I have included a variety of new complexity class proofs in this post. Of particular note, I demonstrate that there exist forms of both imitative amplification and AI safety via market making that reach all the way up to —which is significant given that the largest complexity class that any alignment proposal was known to access previously was . Only the forms of amplification and market making making use of pointers (as in strong HCH), however, can access —for the pointer-less versions, I demonstrate in this post that they access and , respectively. The proof for market making is also particularly notable as it is the only approach on my list that ends up in that complexity class. Additionally, I also demonstrate that recursive reward modeling can reach all the way to , improving upon the previous best result in “Scalable agent alignment via reward modeling” that it accesses .

Before I jump in, however, some preliminaries. First, we'll assume that a human, , is polynomial-time such that can reliably solve any problem in but not anything beyond that. Second, we'll assume that our training procedure and resulting models are arbitrarily strong in terms of what complexity class they can access. Third, we'll assume that gets oracle access to the models during training. Then, we'll say that *a proposal to train a model using a loss function accesses a complexity class iff, for any language , there exists some strategy available to such that, for any which is optimal under given 's strategy, .* Thus, conceptually, a proposal accesses if there is a (polynomial-time) strategy that you (a human) can implement such that—conditional on you knowing that the model is optimal—you would trust the model's output for any problem in . Note that that is *not* the same as saying that a polynomial-time human would actually be able to verify that the result is correct—only that it will always be correct at optimum. Note that these assumptions are just generalizations of those used in “AI safety via debate.” Irving et al. actually note that, if you don't imagine optimal play and simply restrict to the set of problems that a polynomial-time human can actually verify, debate only reaches rather than .

# Alignment proposals by complexity class

Without further ado, here's my list of alignment proposals grouped by what complexity class they access. All of the proofs below are only *lower bounds* rather than upper bounds, so the proposals could be stronger than is noted here, but shouldn't be weaker.

*EDIT: This list is now out of date. See “Weak HCH accesses ” for the updated version.*

: Imitation learning (very straightforward—given , optimal imitation of will also be in )

: AI safety via debate (proof), Imitative amplification with weak HCH (proof below), Approval-based amplification (proof below), Recursive reward modeling (proof below)

: AI safety via market making (proof below)

: Debate with cross-examination (proof)

: Imitative amplification with strong HCH (proof below), AI safety via market making with pointers (proof below)

# Proofs

Note that while I am reasonably confident that my proofs are correct, I'm not a complexity theorist and a result I suspect that many of them are a lot uglier than they need to be. In particular, because I didn't feel like I was super confident in my understanding of different -complete and -complete problems, I just did everything with Turing machines instead.

## Imitative amplification with weak HCH accesses

In general, we'll define imitative amplification to be the procedure which trains (where is some set of questions and is some distribution over answers) on

where is the probability density function for evaluated at and uses some distribution with support over all of . Then, at optimal loss, we get

where is the loss at data point such that is a delta function probability density, in which case we will just ignore the probability distribution and let .

What we have left unspecified here, however, is the format of 's output—which matters quite a lot, as it makes the difference here between weak HCH and strong HCH and thus the difference between and .

For weak HCH, we will let and in be the set of all finite strings in natural language. Thus, since is polynomial-time, that restricts the output of to always be polynomial in size.

Now, we will show that this procedure accesses . is defined as the set of all problems that can be solved by Turing machines using space where is the input size (in this case the question length) and is some polynomial in .

Thus, for some language , let be some Turing machine with state space which computes the answer to any decision problem in space where . Then, we will let 's strategy here be as follows:

- Given , return where is the initial state of , is the empty tape symbol, and is string concatenation.
- Given , run one step of starting from state using as the symbol under the head, as the tape to the left of the head, and as the tape to the right of the head. This results in some new , , , and . If is an accept or reject state, return accept or reject. Otherwise, return .

Now, we will demonstrate that this strategy is polynomial time:

- For the first type of behavior, the only actions that must take are copying into 's input, which takes time (since is defined as the size of ), and copying 's output into 's output, which takes time proportional to the size of 's output. Since is a decision problem, 's output should be boolean, making copying it constant time. Thus, behavior 1 is implementable by a polynomial .
- For the second type of behavior, computing the new state and transition is constant time, copying the state is constant time as has a constant number of states, and copying the new head symbol is constant time as there are a constant number of tape symbols. However, updating and copying and takes time. Fortunately, however, since we know that only uses polynomial space, we know . Thus, behavior 2 is also implementable by a polynomial .

Finally, we will demonstrate that an optimal under this strategy successfully solves . We know that at optimum will compute the fixed point of and thus we simply need to show that as a recursive function where calls to are replaced by calls to will successfully solve .

First, we will show that will always produce the result computed by starting from state on tape if halts starting from that position. We will do this by induction on , where is the number of steps until halts starting from the given position. Note that in all cases here we only need to deal with the second type of behavior, as it is the only one that will be called on inputs of the form .

For the base case, if halts (that is, enters an accept or reject state), then will simply return accept or reject, as desired.

For the inductive step, just returns , which is an input on which should be one step closer to halting, and thus by induction should return the correct output for .

Finally, since , and we know halts since , we get that will always return the result of , as desired. Thus, since this procedure works for any , we get that imitative amplification with weak HCH accesses , as desired.

## Approval-based amplification accesses

The proof here is fairly trivial as we can simply reproduce the same loss as in imitative amplification by having calculate approval on some question by calculating where is some distance metric. can then just implement the exact same strategy as above for to reproduce the same results.

The only limitation is that now we need to be able to compute the distance metric, which means it needs to be computable in polynomial-time. However, that restriction just means that we need , which we already have since those outputs are linear in size other than and —and we know and are polynomial in size thanks to having . Thus, we get that approval-based amplification accesses , as desired.

## Recursive reward modeling accesses

In “Scalable agent alignment via reward modeling,” Leike et al. exhibit a proof that recursive reward modeling accesses . However, I believe that this proof understates recursive reward modeling's capabilities and that in fact it can get all the way to , which is what we'll prove here. Given the above proofs, the proof here is actually fairly straightforward and proceeds simply by demonstrating that approval-based amplification can be implemented from within recursive reward modeling.

We'll start by exhibiting a version of recursive reward modeling with the right properties. We'll be restricting ourselves here to single-step episodes with , as we won't be needing anything else. In that context, we'll first define what it means to train a reward model and then using that define what it means to train an agent using that reward model.

First we'll define reward modeling. Let be the set of states for our single-step episode and the set of actions. Per the recursive reward modeling setup, we want to train a reward model on the evaluation of environment trajectories by a human with access to . Since a trajectory in this setting is just a tuple , however—and each episode is only a single step such that the advantage function is just the reward function—we're effectively just training to mimic .

Given such a reward modeling procedure, we'll train an agent model to maximize . Thus, we get

which, for optimal as above, reduces to

Now, let and we'll demonstrate that there exists a strategy for that ensures the correctness of at optimal loss. Specifically, 's strategy here is the same as with approval-based amplification, which is to compute a distance metric such that

which—as in the proof for approval-based amplification—precisely reproduces the loss function for imitative amplification, which as above we have already proven accesses . Note that, as with approval-based amplification above, since we need to be able to compute the distance metric here, we need and to be polynomially-sized, but the assumption that gives us that. Thus, we get that recursive reward modeling accesses , as desired.

## Imitative amplification with strong HCH accesses

The proof here is very similar to the previous imitative amplification proof with only minor modifications.

First, we need to incorporate strong HCH into that proof. Specifically, we will allow for and to include pointers to arbitrary quantities of text (which can itself also include pointers). This will allow 's input and output to be arbitrarily large while still being amenable to manipulation by a polynomial-time .

Second, we will change the format for representing the tape to make use of pointers. Instead of representing the full and in string form, we will use pointers to represent both and as doubly linked lists with the head of the list at the head of the Turing machine. This will allow for to perform a single step of the Turing machine in constant time, with the total time taken by in each step just for copying the length- problem .

These changes allow us to completely remove the assumption that the Turing machine in the previous imitative amplification proof run in space, enabling the procedure above to accommodate any Turing machine which is known to halt rather than just those which use polynomial space. Since is the complexity class of problems decidable by a Turing machine, we get that imitative amplification with strong HCH accesses , as desired.

## AI safety via market making with pointers accesses

For market making, I will be doing the proof first, as it is actually much simpler than the proof, especially given the above proofs as background.

We will use the formulation of AI safety via market making for non-binary questions such that we have two models— and —where is trained to predict after seeing 's output and is trained to produce output which changes 's prediction. We will also be allowing to reference as in this simple proof.

Furthermore, since we are considering the pointer version here, we will be allowing both and to take as input and produce as output text containing pointers which can themselves reference other pointer-laden text. Additionally, for the purposes of this proof, we will only be considering transcripts of length 1 and will be carefully constructing 's strategy such that there is only one minimal loss equilibrium for . Note that will still be very important here for ensuring that only the one optimal behavior for exists despite the fact that at equilibrium won't actually be getting any reward.

Now, let such that there exists a Turing machine that decides any decision problem . Then, 's strategy is as follows. First, if is given as input, then only listens to arguments of the form (which are really two possible arguments, one for accept and one for reject):

: The answer is accept/reject because predicts accept/reject where is 's starting state, is the empty tape symbol, and is an empty pointer.

And if given as input, then only listens to arguments of the following two forms.

Argument type 1:

Adv: The answer is accept/reject because one step of in state over tape sends to accept/reject state .

Argument type 2:

: The answer is accept/reject because:

- one step of in state over tape sends to new state with new head symbol and tape pointers and and
- predicts accept/reject.

It should be easy enough to see that all of the above arguments only reference things which are checkable by a polynomial-time and cannot be more than polynomial in size due to the use of pointers.

Now, under such a strategy, we will prove that the only unexploitable market is one that correctly answers any question of the form so long as halts starting from state on tape . We will use induction on the number of steps left until halts.

First, the base case. If halts on the given input, then won't listen to any type 2 arguments, as will be able to check that 's next state is an accept/reject state. Thus, should only listen to type 1 arguments—and furthermore should only listen to the correct type 1 argument as knows what the next state of should be. Therefore, the only unexploitable must predict the correct answer here.

Second, the inductive step. Suppose and is correct for all . Since , only argument type 2 is available. Since can check a single step of , however, only type 2 arguments with the correct next state and tape should be available. Furthermore, since should be one step closer to halting, it must be correct by induction. Thus, should only listen to the correct argument type 2 here, ensuring that the only unexploitable market is one where predicts the correct answer.

Given that is correct on inputs of the form where halts, it must be that is correct, as that is an input on which we know halts since we know it decides . Furthermore, it is easy enough to see that the only unexploitable market for must have . Thus, must correctly answer at optimal loss, as desired. Since this works for all decidable by some , we get that market making accesses , as desired.

## AI safety via market making accesses

This is definitely the most complicated of all the proofs in this post and so to save space I won't be going into as much detail here as with the other proofs.

We will start with the same training setup as in the previous proof, with the exception that we will no longer be allowing pointers. Then, let be some language which is decidable by a Turing machine in a number of steps where where is the size of the input.

Now, let be some decision problem. Then, we will describe 's strategy by listing out all the arguments will listen to for each of the different inputs that we care about.

On input :

: The answer is accept/reject because I can exhibit an such that:

- predicts ,
- in state over tape symbol enters state , and
- is an accept/reject state.

On input :

: The answer is because:

- predicts ,
- predicts ,
- predicts , and
- in state over tape symbol with tape symbol to the left and to the right enters state and moves the head to symbol .

: The answer is because is the starting state, is the empty tape symbol, and .

On input :

: The answer is because is the empty tape symbol and .

: The answer is because:

- ,
- predicts ,
- on changes the head position by (such that ),
- , and
- predicts .

: The answer is because:

- ,
- predicts ,
- on changes the head position by (such that ),
- , and
- predicts .

Additionally, we also need to specify that if is given an input that is too large to represent in polynomial space—and thus that can't comprehend—that always produce the empty string.

Now, there are a variety of things that we need to prove about this strategy.

First, we need to demonstrate that it is implementable by a polynomial-time . The only real barrier to that is the necessity of and being small enough that is able to read and understand them. In particular, if or are too large to represent in polynomial space, then will give no answer, which means we need —and any calls to that implicitly depends on—to never use an or which is too large to represent in polynomial space. However, the assumption that gives us this.

We will start with . Since , we get where is the total number of steps taken by until it halts. Since only ever depends on calls with (since we know there exists an and then that call should only depend on calls with smaller ), we get that . However, if we represent using any base-n number system, we can represent it in space proportional to

as desired.

Now we'll take a look at . The largest that depends on is bounded above by . To see this, note that can only ever call with , each call to can only modify by , and only make subcalls with smaller , ensuring that cannot grow larger than than . Thus, since is polynomial-sized, must be also.

Second, we need to demonstrate that this strategy ensures that optimal always solves . To do that, we first need to show that and are correct for polynomially-sized . I won't reproduce this part of the proof here as I think it's pretty straightforward but also very tedious, but suffice it to say that this can be proven via mutual induction on .

However, I will show that is correct given that and are correct for all polynomially-sized . Now, suppose there exists some such that all the conditions are met for to make the available argument for that . Then, it must be that is polynomially-sized, since otherwise would be the empty string, and thus it must be that is correct by the previous correctness proof for such calls. Furthermore, since we know halts in exponential time, we know such an exists, and thus the only non-exploitable equilibrium for is to correctly produce the result given by that . Since this procedure works for all , we get that AI safety via market making accesses , as desired.

Planned summary for the Alignment Newsletter:

Planned opinion:

It seems to me that the following argument should hold:

Premise 1: We can't build physical machines that solve problems outside of P.

Premise 2: Recursive alignment proposals (HCH, debate, market-making, recursive reward modelling) at equilibrium would solve problems outside of P.

Conclusion 1, from premises 1 and 2: We can't build physical machines that implement equilibrium policies for recursive alignment proposals.

Premise 3: Existing arguments for the safety of recursive alignment proposals reason about their equilibrium policies.

Conclusion 2, from conclusion 1 and premise 3: If we were to build physical machines that were trained using recursive alignment proposals, existing arguments would not establish the safety of the resulting policies.

Note that this assumes that P does not contain PSPACE.

Also, the proofs in the post (or at least the proof that iterative amplification with weak HCH accesses PSPACE, which is the only one I've read so far) assume polynomial-time access to M, which in reality contradicts premise 2. So I'm not sure what to make of things.

After you pass this back through the analogy, it becomes

Seems much less true at that point.

I'm confused by this comment. The post assumes that humans can solve all problems in P (in fact, all problems solvable in polynomial time given access to an oracle for M), then proves that various protocols can solve tricky problems by getting the human player to solve problems in P that in reality aren't typical human problems to solve. Therefore, I take P to actually mean P, rather than being an analogy for problems humans can solve.

Also, regarding your version of premise 1, I think I buy that AI can only give you a polynomial speedup over humans.

I've always understood it as an analogy (e.g. from the debate paper, "Although debate is intended for use with fuzzy humans as judges, we can gain intuition about the model by replacing the human with an arbitrary polynomial time algorithm"). But if I take your perspective, then we have:

Premise 0: Humans can solve all problems in P.

Premise 1: We can't build physical machines that solve problems outside of P.

Conclusion: We can't build physical machines that can solve problems that humans can't.

Seems like you should probably ditch one of the premises.

(I do think premise 0 is pretty wrong -- I can solve arbitrary SAT problems that have < 10 clauses, despite it being NP-complete, but I can't multiply a trillion numbers together, even though that is in P. Complexity classes impose scaling limits; humans have resource limits; these are very different. Yes, I do know that "SAT problems with < 10 clauses" is technically in P.)

I think this is easily handled by saying "in practice, the models we train will not be literally optimal". It's still useful to know what would happen at optimality (e.g. I do in fact have more trust for deep RL algorithms that have proofs of convergence to optimal policies given infinite model capacity, compute and data).

I read the post as attempting to be literal, ctrl+F-ing "analog" doesn't get me anything until the comments. Also, the post is the one that I read as assuming for the sake of analysis that humans can solve all problems in P, I myself wouldn't necessarily assume that.

My guess is that the thing you mean is something like "Sure, the conclusion of the post is that optimal models can do more than a polynomial speedup over humans, but we won't train optimal models, and in fact the things we train will just be a polynomial speedup over humans", which is compatible with my argument in the top-level comment. But in general your comments make me think that you're interpreting the post completely differently than I am. [EDIT: actually now that I re-read the second half of your comment it makes sense to me. I still am confused about what you think this post means.]

I mean, Evan can speak to what he meant. I strongly disagree with the literal interpretation of the post, regardless of what Evan meant, for the reason I gave above.

I usually think of the analogies as demonstrating the mechanism by which a particular scheme is meant to outperform a human. I do find these proof-analogies less compelling than the original one in debate because they're simulating Turing machines which is not how I expect any of them to work in practice. (In contrast, the debate proof for accessing PSPACE was about narrowing in on disagreements as being equivalent to finding a winning path of a two-player game, which is the canonical "structure" of a PSPACE-complete problem and is similar to why we'd expect debate to incentivize honesty in practice.)

I mean, I assumed what I needed to in order to be able to do the proofs and have them make sense. What the proofs actually mean in practice is obviously up for debate, but I think that a pretty reasonable interpretation is that they're something like analogies which help us get a handle on how powerful the different proposals are in theory.

I'm curious if you agree with the inference of conclusions 1 and 2 from premises 1, 2, and 3, and/or the underlying claim that it's bad news to learn that your alignment scheme would be able to solve a very large complexity class.

I agree with the gist that it implies that arguments about the equilibrium policy don't necessarily translate to real models, though I disagree that that's necessarily bad news for the alignment scheme—it just means you need to find some guarantees that work even when you're not at equilibrium.

I don't see where the post says that humans can solve all problems solvable in polynomial time given access to an oracle. What Evan does is just replacing humans (which is a rather fuzzy category) with polynomial time algorithms (which represent the consensus in complexity theory for tractable algorithms).

On another note, you're writing your comment on a physical device that can solve any computable problem, which obviously include problems outside of P (if only by the time hierarchy theorem). The value of P is that it captures problem that one can solve using physical machines in a way that scale reasonably, so you don't need to take the lifespan of the universe to compute the answer for an instance of size 1000. But we clearly have algorithms for problems outside of P. We just believe/know they will take forever and/or too much space and/or too much energy.

TBC by "can" I mean "can in practice". Also if we're getting picky my laptop is just a finite state machine :)

From the post:

Yes, this says that humans can solve any problem in P. It says nothing about using M as an oracle.

It doesn't talk about using M as an oracle, but otherwise I don't see how the proofs pan out: for instance, how else is

supposed to only take polynomial time?

You're right, after rereading the proofs and talking with Evan, the only way for H to be polynomial time is to get oracle access to M. Which is slightly unintuitive, but makes sense because the part of the computation that depends on H and not on M is indeed polynomial time.

Theorem. Weak HCH (and similar proposals) contain EXP.

Proof sketch: I give a strategy that H can follow to determine whether some machine that runs in O(2cnk) time accepts. Basically, we need to answer questions of the form "Does cell x have value y at time t." and "Was the head in position x at time t?", where x and t are bounded by some function in O(2cnk). Using place-system representations of x and t, these questions have length in O(nk), so they can be asked. Further, each question is a simple function of a constant number of other such questions about earlier times as long as t>0, and can be answered directly in the base case t=0.

Yeah, I think that's absolutely right—I actually already have a version of my market making EXP proof for amplification that I've been working on cleaning up for publishing. But regardless of how you prove it I certainly agree that I understated amplification here and that it can in fact get to EXP.

A notational stumble it took me a while to solve when reading this - what's M(q)|H(q,M) supposed to mean? My guess at the answer, so that others can get through this quicker, or so that evhub can tell me I'm wrong:

M(q) is supposed to be M's distribution over answers to questions q. If δ is a distribution over set S and s∈S, then δ|s is the probability mass δ assigns to s. Finally, H(q,M) is how the human would answer question q when given access to M. So, M(q)|H(q,M) is the probability that M assigns to the answer that human H would give if H could use M to answer q.

Yeah, that's right—for a probability distribution P:Δ(X), I mean for P|x to be the value of the probability density function at x. I edited the post to clarify.

Nice, an excuse to apply my love for Computational Complexity to useful questions of AI Safety!

That being said, I'm a little confused with what you show.

I follow up to there. What you're showing is basically that you can train M to solve any problem in C using a specific alignment approach, without limiting in any way the computational power of M. So it might take an untractable amount of resources, like exponential time, for this model to solve a problem in PSPACE, but what matters here is just that it does. The point is to show that alignment approaches using a polynomial-time human can solve these problems, not how much resources they will use to do so.

And then you write about the intuition:

Maybe it's just grammar, but I read this sentence as saying that I trust the output of the polynomial-time strategy. And thus that you can solve PSPACE, NEXP, EXP and R problems in polynomial time. So I'm assuming that you mean trusting the model, which once again has no limits in terms of resources used.

I looked for that statement in the paper, failed to find it, then realized you probably meant that raw polynomial time verification gives you NP (the certificate version of NP, basically). Riffing on the importance of optimal play, Irving et al. show that the debate game is a game in the complexity theoretic sense, and thus that it is equivalent to TQBF, a PSPACE-complete problem. But when seeing a closed formula as a game, the decision problem of finding whether it's in TQBF amounts to showing the existence of a winning strategy for the first player. Debate solves this by assuming optimal play, and thus that the winning debater will have, find, and apply a winning strategy for the debate.

I find it fun to compare it to IP, the class of Interactive Protocols, which is also equal to PSPACE. An interactive protocol makes a powerful prover convince a polynomial time verifier of the existence of a winning strategy for the problem (TQBF for example). As Scott Aarronson writes in "Quantum Computing Since Democritus":

This seems more powerful than debate, like a meta level above. Yet it isn't, and it makes sense: showing you that there's a winning strategy is powerful, but having a guarantee to always win if there is a winning strategy is almost as good.

With all that, I've not gotten into your proofs at all. I'm reading them in details, but I'll have to take a bit more time before having constructive comments on them. ^^

A notation question I can already ask though: what is M(q)|H(M,q) ?

See my comment that attempted to figure this out.

According to me, the point is that you can reduce "trusting the model" to "trusting that the model has reached the optimal equilibrium".

Glad you found the post exciting!

Yep—that's right.

I certainly don't mean that the model needs to be polynomial time. I edited the post to try to clarify this point.

Yeah—that's my interpretation of the debate paper as well.

The notion of complexity classes is well defined in an abstract mathematical sense. Its just that the abstract mathematical model is sufficiently different from an actual real world AI that I can't see any obvious correspondence between the two.

All complexity theory works in the limiting case of a series of ever larger inputs. In most everyday algorithms, the constants are usually reasonable, meaning that a loglinear algorithm will probably be faster than an expspace one on the amount of data you care about.

In this context, its not clear what it means to say that humans can solve all problems in P. Sorting is in P. Can a human sort an arbitrary list of values? No, a human can't sort 3^^^3 values. Also, saying a problem is in P only says that there exists an algorithm that does something, it doesn't tell you how to find the algorithm. The problem of taking in a number, totally ignoring it and then printing a proof of fermats last theorem is P. There is a constant time algorithm that does it, namely an algorithm with the proof hardcoded into it.

This is especially relevant considering that the AI isn't magic. Any real AI we build will be using a finite number of physical bits, and probably a reasonably small number.

Suppose that humans haven't yet come up with a good special purpose algorithm for the task of estimating the aerodynamic efficiency of a wind turbine design. However, we have made a general AI. We set the AI the task of calculating this efficiency, and it designs and then runs some clever new PDE solving algorithm. The problem was never one that required an exponentially vast amount of compute. The problem was a polytime problem that humans hadn't got around to programming yet. (Except by programming a general AI that could solve it.)

There might be some way that these proofs relate to the capabilities of realworld AI systems, but such relation is highly non obvious and definitely worth describing in detail. Of course, if you have not yet figured out how these results relate to real AI, you might do the maths in the hope that it gives you some insight.