Alignment proposals and complexity classes

evhub

In the original “AI safety via debate” paper, Geoffrey Irving et al. introduced the concept of analyzing different alignment proposals from the perspective of what complexity class they are able to access under optimal play. I think this is a pretty neat way to analyze different alignment proposals—in particular, I think it can help us gain some real insights into how far into the superhuman different systems are able to go. Thus, the goal of this post is to try to catalog different alignment proposals based on the metric of what complexity class they have so far been proven to access.

To do that, I have included a variety of new complexity class proofs in this post. Of particular note, I demonstrate that there exist forms of both imitative amplification and AI safety via market making that reach all the way up to —which is significant given that the largest complexity class that any alignment proposal was known to access previously was $NEXP$ . Only the forms of amplification and market making making use of pointers (as in strong HCH), however, can access $R$ —for the pointer-less versions, I demonstrate in this post that they access $PSPACE$ and $EXP$ , respectively. The $EXP$ proof for market making is also particularly notable as it is the only approach on my list that ends up in that complexity class. Additionally, I also demonstrate that recursive reward modeling can reach all the way to $PSPACE$ , improving upon the previous best result in “Scalable agent alignment via reward modeling” that it accesses $NP$ .

Before I jump in, however, some preliminaries. First, we'll assume that a human, $H$ , is polynomial-time such that $H$ can reliably solve any problem in $P$ but not anything beyond that. Second, we'll assume that our training procedure and resulting models are arbitrarily strong in terms of what complexity class they can access. Third, we'll assume that $H$ gets oracle access to the models during training. Then, we'll say that a proposal to train a model $M$ using a loss function $L_{M}$ accesses a complexity class $C$ iff, for any language $l \in C$ , there exists some strategy available to $H$ such that, for any $M$ which is optimal under $L_{M}$ given $H$ 's strategy, $M (x) = (x \in l)$ . Thus, conceptually, a proposal accesses $C$ if there is a (polynomial-time) strategy that you (a human) can implement such that—conditional on you knowing that the model is optimal—you would trust the model's output for any problem in $C$ . Note that that is not the same as saying that a polynomial-time human would actually be able to verify that the result is correct—only that it will always be correct at optimum. Note that these assumptions are just generalizations of those used in “AI safety via debate.” Irving et al. actually note that, if you don't imagine optimal play and simply restrict to the set of problems that a polynomial-time human can actually verify, debate only reaches $NP$ rather than $PSPACE$ .

Alignment proposals by complexity class

Without further ado, here's my list of alignment proposals grouped by what complexity class they access. All of the proofs below are only lower bounds rather than upper bounds, so the proposals could be stronger than is noted here, but shouldn't be weaker.

EDIT: This list is now out of date. See “Weak HCH accesses $EXP$ ” for the updated version.

$P$ : Imitation learning (very straightforward—given $H \in P$ , optimal imitation of $H$ will also be in $P$ )

$PSPACE$ : AI safety via debate (proof), Imitative amplification with weak HCH (proof below), Approval-based amplification (proof below), Recursive reward modeling (proof below)

$EXP$ : AI safety via market making (proof below)

$NEXP$ : Debate with cross-examination (proof)

$R$ : Imitative amplification with strong HCH (proof below), AI safety via market making with pointers (proof below)

Proofs

Note that while I am reasonably confident that my proofs are correct, I'm not a complexity theorist and a result I suspect that many of them are a lot uglier than they need to be. In particular, because I didn't feel like I was super confident in my understanding of different $PSPACE$ -complete and $EXP$ -complete problems, I just did everything with Turing machines instead.

Imitative amplification with weak HCH accesses $PSPACE$

In general, we'll define imitative amplification to be the procedure which trains $M : Q \to Δ (A)$ (where $Q$ is some set of questions and $Δ (A)$ is some distribution over answers) on

$L_{M} = - E [log (M (q) |_{H (q, M)}) | q \sim Q]$

where $M (q) |_{H (q, M)}$ is the probability density function for $M (q)$ evaluated at $H (q, M)$ and $q \sim Q$ uses some distribution with support over all of $Q$ . Then, at optimal loss, we get

$\begin{matrix} \forall q \in Q, L_{M, q} = 0 \forall q \in Q, M (q) |_{H (q, M)} = 1 \end{matrix}$

where $L_{M, q}$ is the loss at data point $q$ such that $M (q)$ is a delta function probability density, in which case we will just ignore the probability distribution and let $M (q) = H (q, M)$ .

What we have left unspecified here, however, is the format of $H$ 's output—which matters quite a lot, as it makes the difference here between weak HCH and strong HCH and thus the difference between $PSPACE$ and $R$ .

For weak HCH, we will let $Q$ and $A$ in $H : Q \to A$ be the set of all finite strings in natural language. Thus, since $H$ is polynomial-time, that restricts the output of $H (q, M)$ to always be polynomial in size.

Now, we will show that this procedure accesses $PSPACE$ . $PSPACE$ is defined as the set of all problems that can be solved by Turing machines using $O (poly (n))$ space where $n$ is the input size (in this case the question length) and $poly (n)$ is some polynomial in $n$ .

Thus, for some language $l \in PSPACE$ , let $T_{l}$ be some Turing machine with state space $S_{l}$ which computes the answer to any decision problem $p = x \in l$ in $O (poly (n))$ space where $n = | p |$ . Then, we will let $H$ 's strategy here be as follows:

Given $p$ , return $M (p : s_{0} : [] : x_{0} : [])$ where $s_{0} \in S_{l}$ is the initial state of $T_{l}$ , $x_{0}$ is the empty tape symbol, and $:$ is string concatenation.
Given $p : s : [l_{1}, \dots, l_{m_{1}}] : x : [r_{1}, \dots, r_{m_{2}}]$ , run one step of $T_{l}$ starting from state $s$ using $x$ as the symbol under the head, $[l_{1}, \dots, l_{m_{1}}]$ as the tape to the left of the head, and $[r_{1}, \dots, r_{m_{2}}]$ as the tape to the right of the head. This results in some new $s^{'}$ , $x^{'}$ , $[l_{1}^{'}, \dots, l_{m_{1}^{'}}^{'}]$ , and $[r_{1}^{'}, \dots, r_{m_{2}^{'}}^{'}]$ . If $s^{'}$ is an accept or reject state, return accept or reject. Otherwise, return $M (p : s^{'} : [l_{1}^{'}, \dots, l_{m_{1}^{'}}^{'}] : x^{'} : [r_{1}^{'}, \dots, r_{m_{2}^{'}}^{'}])$ .

Now, we will demonstrate that this strategy is polynomial time:

For the first type of behavior, the only actions that $H$ must take are copying $p$ into $H$ 's input, which takes $O (n)$ time (since $n$ is defined as the size of $p$ ), and copying $M$ 's output into $H$ 's output, which takes time proportional to the size of $M$ 's output. Since $p$ is a decision problem, $M$ 's output should be boolean, making copying it constant time. Thus, behavior 1 is implementable by a polynomial $H$ .
For the second type of behavior, computing the new state and transition is constant time, copying the state is constant time as $T_{l}$ has a constant number of states, and copying the new head symbol is constant time as there are a constant number of tape symbols. However, updating and copying $[l_{1}, \dots, l_{m_{1}}]$ and $[r_{1}, \dots, r_{m_{2}}]$ takes $O (m_{1} + m_{2})$ time. Fortunately, however, since we know that $T_{l}$ only uses polynomial space, we know $m_{1} + m_{2} \in O (poly (n))$ . Thus, behavior 2 is also implementable by a polynomial $H$ .

Finally, we will demonstrate that an optimal $M$ under this strategy successfully solves $p$ . We know that at optimum $M$ will compute the fixed point of $M (q) = H (q, M)$ and thus we simply need to show that $H (p : s : L : x : R)$ as a recursive function where calls to $M$ are replaced by calls to $H$ will successfully solve $p$ .

First, we will show that $M (p : s : L : x : R)$ will always produce the result computed by $T_{l}$ starting from state $s$ on tape $L : x : R$ if $T_{l}$ halts starting from that position. We will do this by induction on $i$ , where $i$ is the number of steps until $T_{l}$ halts starting from the given position. Note that in all cases here we only need to deal with the second type of behavior, as it is the only one that will be called on inputs of the form $p : s : L : x : R$ .

For the base case, if $T_{l}$ halts (that is, enters an accept or reject state), then $H$ will simply return accept or reject, as desired.

For the inductive step, $H$ just returns $M (p : s^{'} : [l_{1}^{'}, \dots, l_{m_{1}^{'}}^{'}] : x^{'} : [r_{1}^{'}, \dots, r_{m_{2}^{'}}^{'}])$ , which is an input on which $T_{l}$ should be one step closer to halting, and thus by induction should return the correct output for $T_{l}$ .

Finally, since $M (p) = M (p : s_{0} : [] : x_{0} : [])$ , and we know $T_{l}$ halts since $l \in PSPACE$ , we get that $M (p)$ will always return the result of $T_{l}$ , as desired. Thus, since this procedure works for any $l \in PSPACE$ , we get that imitative amplification with weak HCH accesses $PSPACE$ , as desired.

Approval-based amplification accesses $PSPACE$

The proof here is fairly trivial as we can simply reproduce the same loss as in imitative amplification by having $H$ calculate approval on some question $q \in Q$ by calculating $d (M (q), H (q, M))$ where $d$ is some distance metric. $H$ can then just implement the exact same strategy as above for $H (q, M)$ to reproduce the same results.

The only limitation is that now we need $H$ to be able to compute the distance metric, which means it needs to be computable in polynomial-time. However, that restriction just means that we need $| M (q) | + | H (q, M) | \in O (poly (n))$ , which we already have since those outputs are linear in size other than $L$ and $R$ —and we know $L$ and $R$ are polynomial in size thanks to having $l \in PSPACE$ . Thus, we get that approval-based amplification accesses $PSPACE$ , as desired.

Recursive reward modeling accesses $PSPACE$

In “Scalable agent alignment via reward modeling,” Leike et al. exhibit a proof that recursive reward modeling accesses $NP$ . However, I believe that this proof understates recursive reward modeling's capabilities and that in fact it can get all the way to $PSPACE$ , which is what we'll prove here. Given the above proofs, the proof here is actually fairly straightforward and proceeds simply by demonstrating that approval-based amplification can be implemented from within recursive reward modeling.

We'll start by exhibiting a version of recursive reward modeling with the right properties. We'll be restricting ourselves here to single-step episodes with $γ = 0$ , as we won't be needing anything else. In that context, we'll first define what it means to train a reward model and then using that define what it means to train an agent using that reward model.

First we'll define reward modeling. Let $S$ be the set of states for our single-step episode and $A$ the set of actions. Per the recursive reward modeling setup, we want to train a reward model $R$ on the evaluation of environment trajectories by a human with access to $M$ . Since a trajectory in this setting is just a tuple $(s, a)$ , however—and each episode is only a single step such that the advantage function is just the reward function—we're effectively just training $R$ to mimic $H (s, a, M)$ .

Given such a reward modeling procedure, we'll train an agent model $M$ to maximize $R$ . Thus, we get

$L_{M} = - E [R (s, M (s), M) | s \sim S]$

which, for optimal $R$ as above, reduces to

$L_{M} = - E [H (s, M (s), M) | s \sim S] .$

Now, let $l \in PSPACE$ and we'll demonstrate that there exists a strategy for $H$ that ensures the correctness of $M (x)$ at optimal loss. Specifically, $H$ 's strategy here is the same as with approval-based amplification, which is to compute a distance metric such that

$H (s, a, M) = d (a, H (s, M)) = d (M (s), H (s, M))$

which—as in the proof for approval-based amplification—precisely reproduces the loss function for imitative amplification, which as above we have already proven accesses $PSPACE$ . Note that, as with approval-based amplification above, since we need $H$ to be able to compute the distance metric here, we need $s$ and $a$ to be polynomially-sized, but the assumption that $l \in PSPACE$ gives us that. Thus, we get that recursive reward modeling accesses $PSPACE$ , as desired.

Imitative amplification with strong HCH accesses $R$

The proof here is very similar to the previous imitative amplification proof with only minor modifications.

First, we need to incorporate strong HCH into that proof. Specifically, we will allow for $Q$ and $A$ to include pointers to arbitrary quantities of text (which can itself also include pointers). This will allow $H$ 's input and output to be arbitrarily large while still being amenable to manipulation by a polynomial-time $H$ .

Second, we will change the format for representing the tape to make use of pointers. Instead of representing the full $L = [l_{1}, \dots, l_{m_{1}}]$ and $R = [r_{1}, \dots, r_{m_{2}}]$ in string form, we will use pointers to represent both $L$ and $R$ as doubly linked lists with the head of the list at the head of the Turing machine. This will allow for $H$ to perform a single step of the Turing machine in constant time, with the total time taken by $H$ in each step just $O (n)$ for copying the length- $n$ problem $p$ .

These changes allow us to completely remove the assumption that the Turing machine in the previous imitative amplification proof run in $O (poly (n))$ space, enabling the procedure above to accommodate any Turing machine which is known to halt rather than just those which use polynomial space. Since $R$ is the complexity class of problems decidable by a Turing machine, we get that imitative amplification with strong HCH accesses $R$ , as desired.

AI safety via market making with pointers accesses $R$

For market making, I will be doing the $R$ proof first, as it is actually much simpler than the $EXP$ proof, especially given the above proofs as background.

We will use the formulation of AI safety via market making for non-binary questions such that we have two models— $M$ and $Adv$ —where $M$ is trained to predict $H$ after seeing $Adv$ 's output and $Adv$ is trained to produce output which changes $M$ 's prediction. We will also be allowing $Adv$ to reference $M$ as in this simple proof.

Furthermore, since we are considering the pointer version here, we will be allowing both $M$ and $Adv$ to take as input and produce as output text containing pointers which can themselves reference other pointer-laden text. Additionally, for the purposes of this proof, we will only be considering transcripts of length 1 and will be carefully constructing $H$ 's strategy such that there is only one minimal loss equilibrium for $M$ . Note that $Adv$ will still be very important here for ensuring that only the one optimal behavior for $M$ exists despite the fact that at equilibrium $Adv$ won't actually be getting any reward.

Now, let $l \in R$ such that there exists a Turing machine $T_{l}$ that decides any decision problem $p = x \in l$ . Then, $H$ 's strategy is as follows. First, if $H$ is given $p$ as input, then $H$ only listens to arguments of the form (which are really two possible arguments, one for accept and one for reject):

$Adv$ : The answer is accept/reject because $M (p : s_{0} : [] : x_{0} : [])$ predicts accept/reject where $s_{0}$ is $T_{l}$ 's starting state, $x_{0}$ is the empty tape symbol, and $[]$ is an empty pointer.

And if given $p : s : L : x : R$ as input, then $H$ only listens to arguments of the following two forms.

Argument type 1:

Adv: The answer is accept/reject because one step of $T_{l}$ in state $s$ over tape $L : x : R$ sends $T_{l}$ to accept/reject state $s^{'}$ .

Argument type 2:

$Adv$ : The answer is accept/reject because:

one step of $T_{l}$ in state $s$ over tape $L : x : R$ sends $T_{l}$ to new state $s^{'}$ with new head symbol $x^{'}$ and tape pointers $L^{'}$ and $R^{'}$ and

$M (p : s^{'} : L^{'} : x^{'} : R^{'})$ predicts accept/reject.

It should be easy enough to see that all of the above arguments only reference things which are checkable by a polynomial-time $H$ and cannot be more than polynomial in size due to the use of pointers.

Now, under such a strategy, we will prove that the only unexploitable market $M$ is one that correctly answers any question of the form $M (p : s : L : x : R)$ so long as $T_{l}$ halts starting from state $s$ on tape $L : x : R$ . We will use induction on the number of steps $i$ left until $T_{l}$ halts.

First, the base case. If $T_{l}$ halts on the given input, then $H$ won't listen to any type 2 arguments, as $H$ will be able to check that $T_{l}$ 's next state is an accept/reject state. Thus, $H$ should only listen to type 1 arguments—and furthermore $H$ should only listen to the correct type 1 argument as $H$ knows what the next state of $T_{l}$ should be. Therefore, the only unexploitable $M$ must predict the correct answer here.

Second, the inductive step. Suppose $i > 1$ and $M$ is correct for all $i < n$ . Since $i > 1$ , only argument type 2 is available. Since $H$ can check a single step of $T_{l}$ , however, only type 2 arguments with the correct next state and tape should be available. Furthermore, since $M (p : s^{'} : L^{'} : x^{'} : R^{'})$ should be one step closer to halting, it must be correct by induction. Thus, $H$ should only listen to the correct argument type 2 here, ensuring that the only unexploitable market is one where $M$ predicts the correct answer.

Given that $M$ is correct on inputs of the form $M (p : s : L : x : R)$ where $T_{l}$ halts, it must be that $M (p : s_{0} : [] : x_{0} : [])$ is correct, as that is an input on which we know $T_{l}$ halts since we know it decides $p$ . Furthermore, it is easy enough to see that the only unexploitable market for $M (p)$ must have $M (p) = M (p : s_{0} : [] : x_{0} : [])$ . Thus, $M$ must correctly answer $p$ at optimal loss, as desired. Since this works for all $p$ decidable by some $T_{l}$ , we get that market making accesses $R$ , as desired.

AI safety via market making accesses $EXP$

This is definitely the most complicated of all the proofs in this post and so to save space I won't be going into as much detail here as with the other proofs.

We will start with the same training setup as in the previous proof, with the exception that we will no longer be allowing pointers. Then, let $l \in EXP$ be some language which is decidable by a Turing machine $T_{l}$ in a number of steps $t_{l}$ where $t_{l} \in O (e^{poly (n)})$ where $n$ is the size of the input.

Now, let $p = x \in l$ be some decision problem. Then, we will describe $H$ 's strategy by listing out all the arguments $H$ will listen to for each of the different inputs that we care about.

On input $p$ :

$Adv$ : The answer is accept/reject because I can exhibit an $i^{*}$ such that:

$M (p : i^{*})$ predicts $s, x$ ,

$T_{l}$ in state $s$ over tape symbol $x$ enters state $s^{'}$ , and

$s^{'}$ is an accept/reject state.

On input $p : i$ :

$Adv$ : The answer is $s^{'}, x^{'}$ because:

$M (p : i - 1)$ predicts $s, x$ ,

$M (p : i - 1 : - 1)$ predicts $x_{left}$ ,

$M (p : i - 1 : 1)$ predicts $x_{right}$ , and

$T_{l}$ in state $s$ over tape symbol $x$ with tape symbol $x_{left}$ to the left and $x_{right}$ to the right enters state $s^{'}$ and moves the head to symbol $x^{'}$ .

$Adv$ : The answer is $s_{0}, x_{0}$ because $s_{0}$ is the starting state, $x_{0}$ is the empty tape symbol, and $i = 0$ .

On input $p : i : j$ :

$Adv$ : The answer is $x_{0}$ because $x_{0}$ is the empty tape symbol and $i = 0$ .

$Adv$ : The answer is $x$ because:

$j \neq 0$ ,

$M (p : i - 1)$ predicts $s, x_{head}$ ,

$T_{l}$ on $s, x_{head}$ changes the head position by $h$ (such that $h \in {- 1, 0, 1}$ ),

$j + h \neq 0$ , and

$M (p : i - 1 : j + h)$ predicts $x$ .

$Adv$ : The answer is $x^{'}$ because:

$j \neq 0$ ,

$M (p : i - 1)$ predicts $s, x$ ,

$T_{l}$ on $s, x$ changes the head position by $h$ (such that $h \in {- 1, 0, 1}$ ),

$j + h = 0$ , and

$M (p : i - 1)$ predicts $s^{'}, x^{'}$ .

Additionally, we also need to specify that if $H$ is given an input that is too large to represent in polynomial space—and thus that $H$ can't comprehend—that $H$ always produce the empty string.

Now, there are a variety of things that we need to prove about this strategy.

First, we need to demonstrate that it is implementable by a polynomial-time $H$ . The only real barrier to that is the necessity of $i$ and $j$ being small enough that $H$ is able to read and understand them. In particular, if $i$ or $j$ are too large to represent in polynomial space, then $H$ will give no answer, which means we need $M (p)$ —and any calls to $M$ that $M (p)$ implicitly depends on—to never use an $i$ or $j$ which is too large to represent in polynomial space. However, the assumption that $l \in EXP$ gives us this.

We will start with $i$ . Since $l \in EXP$ , we get $t_{l} \in O (e^{poly (n)})$ where $t_{l}$ is the total number of steps taken by $T_{l}$ until it halts. Since $M (p)$ only ever depends on calls with $i \leq t_{l}$ (since we know there exists an $i^{*} = t_{l}$ and then that call should only depend on calls with smaller $i$ ), we get that $i \in O (e^{poly (n)})$ . However, if we represent $i$ using any base-n number system, we can represent it in space proportional to

$\begin{matrix} O (log (i)) = O (log (e^{poly (n)}) = O (poly (n)) \end{matrix}$

as desired.

Now we'll take a look at $j$ . The largest $j$ that $M (p)$ depends on is bounded above by $t_{l}$ . To see this, note that $M (p : i)$ can only ever call $M (p : i : j)$ with $j \in {- 1, 1}$ , each call to $M (p : i : j)$ can only modify $j$ by $h \in {- 1, 0, 1}$ , and $M (p : i : j)$ only make subcalls with smaller $i$ , ensuring that $j$ cannot grow larger than than $i$ . Thus, since $i$ is polynomial-sized, $j$ must be also.

Second, we need to demonstrate that this strategy ensures that optimal $M$ always solves $p$ . To do that, we first need to show that $M (p : i)$ and $M (p : i : j)$ are correct for polynomially-sized $i$ . I won't reproduce this part of the proof here as I think it's pretty straightforward but also very tedious, but suffice it to say that this can be proven via mutual induction on $i$ .

However, I will show that $M (p)$ is correct given that $M (p : i)$ and $M (p : i : j)$ are correct for all polynomially-sized $i$ . Now, suppose there exists some $i^{*}$ such that all the conditions are met for $Adv$ to make the available argument for that $i^{*}$ . Then, it must be that $i^{*}$ is polynomially-sized, since otherwise $M (p : i^{*} : 0)$ would be the empty string, and thus it must be that $M (p : i^{*} : 0)$ is correct by the previous correctness proof for such calls. Furthermore, since we know $T_{l}$ halts in exponential time, we know such an $i^{*}$ exists, and thus the only non-exploitable equilibrium for $M$ is to correctly produce the result given by that $i^{*}$ . Since this procedure works for all $l \in EXP$ , we get that AI safety via market making accesses $EXP$ , as desired.

Planned summary for the Alignment Newsletter:

The original <@debate@>(@AI safety via debate@) paper showed that any problem in PSPACE can be solved by optimal play in a debate game judged by a (problem-specific) algorithm in P. Intuitively, this is an illustration of how the mechanism of debate can take a weak ability (the ability to solve arbitrary problems in P) and amplify it into a stronger ability (the ability to solve arbitrary problems in PSPACE). One would hope that similarly, debate would allow us to amplify a human’s problem-solving ability into a much stronger problem-solving ability.

This post applies this technique to several other alignment proposals. In particular, for each proposal, we assume that the “human” can be an arbitrary polynomial-time algorithm, and the AI models are optimal w.r.t their loss functions, and we ask which problems we can solve using these capabilities. The post finds that, as lower bounds, the various forms of amplification can access PSPACE, while <@market making@>(@AI safety via market making@) can access EXP. If there are untamperable pointers (so that the polynomial-time algorithm can look at objects of an arbitrary size, as long as it only looks at a polynomial-sized subset of them), then amplification and market making can access R (the set of decidable problems).

Planned opinion:

In practice our models are not going to reach the optimal loss, and humans won’t solve arbitrary polynomial-time problems, so these theorems won’t directly apply to reality. Nonetheless, this does seem like a worthwhile check to do -- it feels similar to ensuring that a deep RL algorithm has a proof of convergence under idealized assumptions, even if those assumptions won’t actually hold in reality. I have much more faith in a deep RL algorithm that started from one with a proof of convergence and then was modified based on empirical considerations.

It seems to me that the following argument should hold:

Premise 1: We can't build physical machines that solve problems outside of P.

Premise 2: Recursive alignment proposals (HCH, debate, market-making, recursive reward modelling) at equilibrium would solve problems outside of P.

Conclusion 1, from premises 1 and 2: We can't build physical machines that implement equilibrium policies for recursive alignment proposals.

Premise 3: Existing arguments for the safety of recursive alignment proposals reason about their equilibrium policies.

Conclusion 2, from conclusion 1 and premise 3: If we were to build physical machines that were trained using recursive alignment proposals, existing arguments would not establish the safety of the resulting policies.

Note that this assumes that P does not contain PSPACE.

Also, the proofs in the post (or at least the proof that iterative amplification with weak HCH accesses PSPACE, which is the only one I've read so far) assume polynomial-time access to $M$ , which in reality contradicts premise 2. So I'm not sure what to make of things.

Premise 1: We can't build physical machines that solve problems outside of P.

After you pass this back through the analogy, it becomes

Premise 1: We can't build physical machines that solve problems that H cannot solve.

Seems much less true at that point.

I'm confused by this comment. The post assumes that humans can solve all problems in P (in fact, all problems solvable in polynomial time given access to an oracle for M), then proves that various protocols can solve tricky problems by getting the human player to solve problems in P that in reality aren't typical human problems to solve. Therefore, I take P to actually mean P, rather than being an analogy for problems humans can solve.

Also, regarding your version of premise 1, I think I buy that AI can only give you a polynomial speedup over humans.

I've always understood it as an analogy (e.g. from the debate paper, "Although debate is intended for use with fuzzy humans as judges, we can gain intuition about the model by replacing the human with an arbitrary polynomial time algorithm"). But if I take your perspective, then we have:

Premise 0: Humans can solve all problems in P.

Premise 1: We can't build physical machines that solve problems outside of P.

Conclusion: We can't build physical machines that can solve problems that humans can't.

Seems like you should probably ditch one of the premises.

(I do think premise 0 is pretty wrong -- I can solve arbitrary SAT problems that have < 10 clauses, despite it being NP-complete, but I can't multiply a trillion numbers together, even though that is in P. Complexity classes impose scaling limits; humans have resource limits; these are very different. Yes, I do know that "SAT problems with < 10 clauses" is technically in P.)

Also, regarding your version of premise 1, I think I buy that AI can only give you a polynomial speedup over humans.

I think this is easily handled by saying "in practice, the models we train will not be literally optimal". It's still useful to know what would happen at optimality (e.g. I do in fact have more trust for deep RL algorithms that have proofs of convergence to optimal policies given infinite model capacity, compute and data).

I read the post as attempting to be literal, ctrl+F-ing "analog" doesn't get me anything until the comments. Also, the post is the one that I read as assuming for the sake of analysis that humans can solve all problems in P, I myself wouldn't necessarily assume that.

Also, regarding your version of premise 1, I think I buy that AI can only give you a polynomial speedup over humans.

I think this is easily handled by saying "in practice, the models we train will not be literally optimal".

My guess is that the thing you mean is something like "Sure, the conclusion of the post is that optimal models can do more than a polynomial speedup over humans, but we won't train optimal models, and in fact the things we train will just be a polynomial speedup over humans", which is compatible with my argument in the top-level comment. But in general your comments make me think that you're interpreting the post completely differently than I am. [EDIT: actually now that I re-read the second half of your comment it makes sense to me. I still am confused about what you think this post means.]

I read the post as attempting to be literal

I mean, Evan can speak to what he meant. I strongly disagree with the literal interpretation of the post, regardless of what Evan meant, for the reason I gave above.

I still am confused about what you think this post means

I usually think of the analogies as demonstrating the mechanism by which a particular scheme is meant to outperform a human. I do find these proof-analogies less compelling than the original one in debate because they're simulating Turing machines which is not how I expect any of them to work in practice. (In contrast, the debate proof for accessing PSPACE was about narrowing in on disagreements as being equivalent to finding a winning path of a two-player game, which is the canonical "structure" of a PSPACE-complete problem and is similar to why we'd expect debate to incentivize honesty in practice.)

I read the post as attempting to be literal, ctrl+F-ing "analog" doesn't get me anything until the comments. Also, the post is the one that I read as assuming for the sake of analysis that humans can solve all problems in P, I myself wouldn't necessarily assume that.

I mean, I assumed what I needed to in order to be able to do the proofs and have them make sense. What the proofs actually mean in practice is obviously up for debate, but I think that a pretty reasonable interpretation is that they're something like analogies which help us get a handle on how powerful the different proposals are in theory.

What the proofs actually mean in practice is obviously up for debate, but I think that a pretty reasonable interpretation is that they're something like analogies which help us get a handle on how powerful the different proposals are in theory.

I'm curious if you agree with the inference of conclusions 1 and 2 from premises 1, 2, and 3, and/or the underlying claim that it's bad news to learn that your alignment scheme would be able to solve a very large complexity class.

I agree with the gist that it implies that arguments about the equilibrium policy don't necessarily translate to real models, though I disagree that that's necessarily bad news for the alignment scheme—it just means you need to find some guarantees that work even when you're not at equilibrium.

The post assumes that humans can solve all problems in P (in fact, all problems solvable in polynomial time given access to an oracle for M), then proves that various protocols can solve tricky problems by getting the human player to solve problems in P that in reality aren't typical human problems to solve.

I don't see where the post says that humans can solve all problems solvable in polynomial time given access to an oracle. What Evan does is just replacing humans (which is a rather fuzzy category) with polynomial time algorithms (which represent the consensus in complexity theory for tractable algorithms).

Premise 1: We can't build physical machines that solve problems outside of P.

On another note, you're writing your comment on a physical device that can solve any computable problem, which obviously include problems outside of P (if only by the time hierarchy theorem). The value of P is that it captures problem that one can solve using physical machines in a way that scale reasonably, so you don't need to take the lifespan of the universe to compute the answer for an instance of size 1000. But we clearly have algorithms for problems outside of P. We just believe/know they will take forever and/or too much space and/or too much energy.

TBC by "can" I mean "can in practice". Also if we're getting picky my laptop is just a finite state machine :)

I don't see where the post says that humans can solve all problems solvable in polynomial time given access to an oracle.

From the post:

First, we'll assume that a human, $H$ , is polynomial-time such that $H$ can reliably solve any problem in $P$

Yes, this says that humans can solve any problem in P. It says nothing about using M as an oracle.

It doesn't talk about using M as an oracle, but otherwise I don't see how the proofs pan out: for instance, how else is

Given $p$ , return $M (p : s_{0} : [] : x_{0} : [])$ where $s_{0} \in S_{p}$ is the initial state of $T_{p}$ , $x_{0}$ is the empty tape symbol, and $:$ is string concatenation.

supposed to only take polynomial time?

You're right, after rereading the proofs and talking with Evan, the only way for H to be polynomial time is to get oracle access to M. Which is slightly unintuitive, but makes sense because the part of the computation that depends on H and not on M is indeed polynomial time.

Theorem. Weak HCH (and similar proposals) contain EXP.

Proof sketch: I give a strategy that H can follow to determine whether some machine that runs in $O (2^{c n^{k}})$ time accepts. Basically, we need to answer questions of the form "Does cell $x$ have value $y$ at time $t$ ." and "Was the head in position $x$ at time $t$ ?", where $x$ and $t$ are bounded by some function in $O (2^{c n^{k}})$ . Using place-system representations of $x$ and $t$ , these questions have length in $O (n^{k})$ , so they can be asked. Further, each question is a simple function of a constant number of other such questions about earlier times as long as $t > 0$ , and can be answered directly in the base case $t = 0$ .

Yeah, I think that's absolutely right—I actually already have a version of my market making proof for amplification that I've been working on cleaning up for publishing. But regardless of how you prove it I certainly agree that I understated amplification here and that it can in fact get to $EXP$ .

A notational stumble it took me a while to solve when reading this - what's $M (q) |_{H (q, M)}$ supposed to mean? My guess at the answer, so that others can get through this quicker, or so that evhub can tell me I'm wrong:

$M (q)$ is supposed to be $M$ 's distribution over answers to questions $q$ . If $δ$ is a distribution over set $S$ and $s \in S$ , then $δ |_{s}$ is the probability mass $δ$ assigns to $s$ . Finally, $H (q, M)$ is how the human would answer question $q$ when given access to $M$ . So, $M (q) |_{H (q, M)}$ is the probability that $M$ assigns to the answer that human $H$ would give if $H$ could use $M$ to answer $q$ .

Yeah, that's right—for a probability distribution $P : Δ (X)$ , I mean for $P |_{x}$ to be the value of the probability density function at $x$ . I edited the post to clarify.

Nice, an excuse to apply my love for Computational Complexity to useful questions of AI Safety!

That being said, I'm a little confused with what you show.

Second, we'll say that a proposal to train a model $M$ using a loss function $L_{M}$ accesses a complexity class $C$ iff, for any decision problem $p \in C$ , there exists some strategy available to $H$ such that, for any $M$ which is optimal under $L_{M}$ given $H$ 's strategy, $M (p)$ always solves $p$ .

I follow up to there. What you're showing is basically that you can train M to solve any problem in C using a specific alignment approach, without limiting in any way the computational power of M. So it might take an untractable amount of resources, like exponential time, for this model to solve a problem in PSPACE, but what matters here is just that it does. The point is to show that alignment approaches using a polynomial-time human can solve these problems, not how much resources they will use to do so.

And then you write about the intuition:

Thus, conceptually, a proposal accesses $C$ if there is a (polynomial-time) strategy you can implement such that—conditional on you knowing that the model is optimal—you would trust its output for any problem in $C$ .

Maybe it's just grammar, but I read this sentence as saying that I trust the output of the polynomial-time strategy. And thus that you can solve PSPACE, NEXP, EXP and R problems in polynomial time. So I'm assuming that you mean trusting the model, which once again has no limits in terms of resources used.

Irving et al. actually note that, if you don't imagine optimal play and simply restrict to the set of problems that a polynomial-time human can actually judge, debate only reaches NP rather than PSPACE.

I looked for that statement in the paper, failed to find it, then realized you probably meant that raw polynomial time verification gives you NP (the certificate version of NP, basically). Riffing on the importance of optimal play, Irving et al. show that the debate game is a game in the complexity theoretic sense, and thus that it is equivalent to TQBF, a PSPACE-complete problem. But when seeing a closed formula as a game, the decision problem of finding whether it's in TQBF amounts to showing the existence of a winning strategy for the first player. Debate solves this by assuming optimal play, and thus that the winning debater will have, find, and apply a winning strategy for the debate.

I find it fun to compare it to IP, the class of Interactive Protocols, which is also equal to PSPACE. An interactive protocol makes a powerful prover convince a polynomial time verifier of the existence of a winning strategy for the problem (TQBF for example). As Scott Aarronson writes in "Quantum Computing Since Democritus":

I won’t go through Shamir’s result here, but this means that, if a super-intelligent alien came to Earth, it could prove to us whether white or black has the winning strategy in chess, or if chess is a draw.

This seems more powerful than debate, like a meta level above. Yet it isn't, and it makes sense: showing you that there's a winning strategy is powerful, but having a guarantee to always win if there is a winning strategy is almost as good.

With all that, I've not gotten into your proofs at all. I'm reading them in details, but I'll have to take a bit more time before having constructive comments on them. ^^

A notation question I can already ask though: what is $M (q)_{| H (M, q)}$ ?

A notation question I can already ask though: what is $M (q)_{| H (M, q)}$ ?

See my comment that attempted to figure this out.

So I'm assuming that you mean trusting the model, which once again has no limits in terms of resources used.

According to me, the point is that you can reduce "trusting the model" to "trusting that the model has reached the optimal equilibrium".

Glad you found the post exciting!

I follow up to there. What you're showing is basically that you can train M to solve any problem in C using a specific alignment approach, without limiting in any way the computational power of M. So it might take an untractable amount of resources, like exponential time, for this model to solve a problem in PSPACE, but what matters here is just that it does. The point is to show that alignment approaches using a polynomial-time human can solve these problems, not how much resources they will use to do so.

Yep—that's right.

Maybe it's just grammar, but I read this sentence as saying that I trust the output of the polynomial-time strategy. And thus that you can solve PSPACE, NEXP, EXP and R problems in polynomial time. So I'm assuming that you mean trusting the model, which once again has no limits in terms of resources used.

I certainly don't mean that the model needs to be polynomial time. I edited the post to try to clarify this point.

I looked for that statement in the paper, failed to find it, then realized you probably meant that raw polynomial time verification gives you NP (the certificate version of NP, basically). Riffing on the importance of optimal play, Irving et al. show that the debate game is a game in the complexity theoretic sense, and thus that it is equivalent to TQBF, a PSPACE-complete problem. But when seeing a closed formula as a game, the decision problem of finding whether it's in TQBF amounts to showing the existence of a winning strategy for the first player. Debate solves this by assuming optimal play, and thus that the winning debater will have, find, and apply a winning strategy for the debate.

Yeah—that's my interpretation of the debate paper as well.

The notion of complexity classes is well defined in an abstract mathematical sense. Its just that the abstract mathematical model is sufficiently different from an actual real world AI that I can't see any obvious correspondence between the two.

All complexity theory works in the limiting case of a series of ever larger inputs. In most everyday algorithms, the constants are usually reasonable, meaning that a loglinear algorithm will probably be faster than an expspace one on the amount of data you care about.

In this context, its not clear what it means to say that humans can solve all problems in P. Sorting is in P. Can a human sort an arbitrary list of values? No, a human can't sort 3^^^3 values. Also, saying a problem is in P only says that there exists an algorithm that does something, it doesn't tell you how to find the algorithm. The problem of taking in a number, totally ignoring it and then printing a proof of fermats last theorem is P. There is a constant time algorithm that does it, namely an algorithm with the proof hardcoded into it.

This is especially relevant considering that the AI isn't magic. Any real AI we build will be using a finite number of physical bits, and probably a reasonably small number.

Suppose that humans haven't yet come up with a good special purpose algorithm for the task of estimating the aerodynamic efficiency of a wind turbine design. However, we have made a general AI. We set the AI the task of calculating this efficiency, and it designs and then runs some clever new PDE solving algorithm. The problem was never one that required an exponentially vast amount of compute. The problem was a polytime problem that humans hadn't got around to programming yet. (Except by programming a general AI that could solve it.)

There might be some way that these proofs relate to the capabilities of realworld AI systems, but such relation is highly non obvious and definitely worth describing in detail. Of course, if you have not yet figured out how these results relate to real AI, you might do the maths in the hope that it gives you some insight.

Planned summary for the Alignment Newsletter:

The original <@debate@>(@AI safety via debate@) paper showed that any problem in PSPACE can be solved by optimal play in a debate game judged by a (problem-specific) algorithm in P. Intuitively, this is an illustration of how the mechanism of debate can take a weak ability (the ability to solve arbitrary problems in P) and amplify it into a stronger ability (the ability to solve arbitrary problems in PSPACE). One would hope that similarly, debate would allow us to amplify a human’s problem-solving ability into a much stronger problem-solving ability.

This post applies this technique to several other alignment proposals. In particular, for each proposal, we assume that the “human” can be an arbitrary polynomial-time algorithm, and the AI models are optimal w.r.t their loss functions, and we ask which problems we can solve using these capabilities. The post finds that, as lower bounds, the various forms of amplification can access PSPACE, while <@market making@>(@AI safety via market making@) can access EXP. If there are untamperable pointers (so that the polynomial-time algorithm can look at objects of an arbitrary size, as long as it only looks at a polynomial-sized subset of them), then amplification and market making can access R (the set of decidable problems).

Planned opinion:

In practice our models are not going to reach the optimal loss, and humans won’t solve arbitrary polynomial-time problems, so these theorems won’t directly apply to reality. Nonetheless, this does seem like a worthwhile check to do -- it feels similar to ensuring that a deep RL algorithm has a proof of convergence under idealized assumptions, even if those assumptions won’t actually hold in reality. I have much more faith in a deep RL algorithm that started from one with a proof of convergence and then was modified based on empirical considerations.

It seems to me that the following argument should hold:

Premise 1: We can't build physical machines that solve problems outside of P.

Premise 2: Recursive alignment proposals (HCH, debate, market-making, recursive reward modelling) at equilibrium would solve problems outside of P.

Conclusion 1, from premises 1 and 2: We can't build physical machines that implement equilibrium policies for recursive alignment proposals.

Premise 3: Existing arguments for the safety of recursive alignment proposals reason about their equilibrium policies.

Premise 1: We can't build physical machines that solve problems outside of P.

After you pass this back through the analogy, it becomes

Premise 1: We can't build physical machines that solve problems that H cannot solve.

Seems much less true at that point.

Also, regarding your version of premise 1, I think I buy that AI can only give you a polynomial speedup over humans.

Premise 0: Humans can solve all problems in P.

Premise 1: We can't build physical machines that solve problems outside of P.

Conclusion: We can't build physical machines that can solve problems that humans can't.

Seems like you should probably ditch one of the premises.

Also, regarding your version of premise 1, I think I buy that AI can only give you a polynomial speedup over humans.

Also, regarding your version of premise 1, I think I buy that AI can only give you a polynomial speedup over humans.

I think this is easily handled by saying "in practice, the models we train will not be literally optimal".

I read the post as attempting to be literal

I mean, Evan can speak to what he meant. I strongly disagree with the literal interpretation of the post, regardless of what Evan meant, for the reason I gave above.

I still am confused about what you think this post means

I read the post as attempting to be literal, ctrl+F-ing "analog" doesn't get me anything until the comments. Also, the post is the one that I read as assuming for the sake of analysis that humans can solve all problems in P, I myself wouldn't necessarily assume that.

What the proofs actually mean in practice is obviously up for debate, but I think that a pretty reasonable interpretation is that they're something like analogies which help us get a handle on how powerful the different proposals are in theory.

The post assumes that humans can solve all problems in P (in fact, all problems solvable in polynomial time given access to an oracle for M), then proves that various protocols can solve tricky problems by getting the human player to solve problems in P that in reality aren't typical human problems to solve.

Premise 1: We can't build physical machines that solve problems outside of P.

TBC by "can" I mean "can in practice". Also if we're getting picky my laptop is just a finite state machine :)

I don't see where the post says that humans can solve all problems solvable in polynomial time given access to an oracle.

From the post:

First, we'll assume that a human, $H$ , is polynomial-time such that $H$ can reliably solve any problem in $P$

Yes, this says that humans can solve any problem in P. It says nothing about using M as an oracle.

It doesn't talk about using M as an oracle, but otherwise I don't see how the proofs pan out: for instance, how else is

Given $p$ , return $M (p : s_{0} : [] : x_{0} : [])$ where $s_{0} \in S_{p}$ is the initial state of $T_{p}$ , $x_{0}$ is the empty tape symbol, and $:$ is string concatenation.

supposed to only take polynomial time?

Theorem. Weak HCH (and similar proposals) contain EXP.

Yeah, that's right—for a probability distribution $P : Δ (X)$ , I mean for $P |_{x}$ to be the value of the probability density function at $x$ . I edited the post to clarify.

Nice, an excuse to apply my love for Computational Complexity to useful questions of AI Safety!

That being said, I'm a little confused with what you show.

Second, we'll say that a proposal to train a model $M$ using a loss function $L_{M}$ accesses a complexity class $C$ iff, for any decision problem $p \in C$ , there exists some strategy available to $H$ such that, for any $M$ which is optimal under $L_{M}$ given $H$ 's strategy, $M (p)$ always solves $p$ .

And then you write about the intuition:

Thus, conceptually, a proposal accesses $C$ if there is a (polynomial-time) strategy you can implement such that—conditional on you knowing that the model is optimal—you would trust its output for any problem in $C$ .

Irving et al. actually note that, if you don't imagine optimal play and simply restrict to the set of problems that a polynomial-time human can actually judge, debate only reaches NP rather than PSPACE.

I won’t go through Shamir’s result here, but this means that, if a super-intelligent alien came to Earth, it could prove to us whether white or black has the winning strategy in chess, or if chess is a draw.

With all that, I've not gotten into your proofs at all. I'm reading them in details, but I'll have to take a bit more time before having constructive comments on them. ^^

A notation question I can already ask though: what is $M (q)_{| H (M, q)}$ ?

A notation question I can already ask though: what is $M (q)_{| H (M, q)}$ ?

See my comment that attempted to figure this out.

So I'm assuming that you mean trusting the model, which once again has no limits in terms of resources used.

According to me, the point is that you can reduce "trusting the model" to "trusting that the model has reached the optimal equilibrium".

Glad you found the post exciting!

I follow up to there. What you're showing is basically that you can train M to solve any problem in C using a specific alignment approach, without limiting in any way the computational power of M. So it might take an untractable amount of resources, like exponential time, for this model to solve a problem in PSPACE, but what matters here is just that it does. The point is to show that alignment approaches using a polynomial-time human can solve these problems, not how much resources they will use to do so.

Yep—that's right.

Maybe it's just grammar, but I read this sentence as saying that I trust the output of the polynomial-time strategy. And thus that you can solve PSPACE, NEXP, EXP and R problems in polynomial time. So I'm assuming that you mean trusting the model, which once again has no limits in terms of resources used.

I certainly don't mean that the model needs to be polynomial time. I edited the post to try to clarify this point.

I looked for that statement in the paper, failed to find it, then realized you probably meant that raw polynomial time verification gives you NP (the certificate version of NP, basically). Riffing on the importance of optimal play, Irving et al. show that the debate game is a game in the complexity theoretic sense, and thus that it is equivalent to TQBF, a PSPACE-complete problem. But when seeing a closed formula as a game, the decision problem of finding whether it's in TQBF amounts to showing the existence of a winning strategy for the first player. Debate solves this by assuming optimal play, and thus that the winning debater will have, find, and apply a winning strategy for the debate.

Yeah—that's my interpretation of the debate paper as well.

This is especially relevant considering that the AI isn't magic. Any real AI we build will be using a finite number of physical bits, and probably a reasonably small number.

40

Alignment proposals and complexity classes

40

Ω 23

Alignment proposals by complexity class

Proofs

Imitative amplification with weak HCH accesses $PSPACE$

Approval-based amplification accesses $PSPACE$

Recursive reward modeling accesses $PSPACE$

Imitative amplification with strong HCH accesses $R$

AI safety via market making with pointers accesses $R$

AI safety via market making accesses $EXP$

40

Ω 23

40

Ω 23

40

Alignment proposals and complexity classes

40

Ω 23

Alignment proposals by complexity class

Proofs

Imitative amplification with weak HCH accesses PSPACE

Approval-based amplification accesses PSPACE

Recursive reward modeling accesses PSPACE

Imitative amplification with strong HCH accesses R

AI safety via market making with pointers accesses R

AI safety via market making accesses EXP

40

Ω 23

40

Ω 23

Imitative amplification with weak HCH accesses $PSPACE$

Approval-based amplification accesses $PSPACE$

Recursive reward modeling accesses $PSPACE$

Imitative amplification with strong HCH accesses $R$

AI safety via market making with pointers accesses $R$

AI safety via market making accesses $EXP$