UK AISI Alignment Team: Debate Sequence
The need to relativise in debate

by Geoffrey Irving, Simon Marshall
26th Jun 2025
AI Alignment Forum
Previous:
Prover-Estimator Debate: A New Scalable Oversight Protocol
Comments

ProgramCrafter:
It would also be interesting to investigate which of those protocols are guaranteed to eventually halt on all inputs.

I can see that in the last protocol (relativised for RE) a malicious prover can prevent the verifier from ever terminating, if the verifier is required to accept all valid move sequences / pointers / ... however long they might be. (That is a mathematical theorem, and not fixed by choosing a clever encoding for numbers of arbitrary length.)

Geoffrey Irving (reply):

I think both protocols mentioned (MIP* = RE and the pointers one) already do what you want here. In the background the provers have to do unbounded work to prepare for the stuff they show the verifier, but the verifier's work is limited to a fixed polynomial in the input size.

And more strongly: in the pointer version where we have two competing provers, a malicious prover can't force an honest prover to do significantly more work than would be required in an honest case.

Summary: This post highlights the need for results in AI safety, such as debate or scalable oversight, to 'relativise', i.e. for the result to hold even when all parties are given access to a black box 'oracle' (the oracle might be a powerful problem solver, a random function, or a model of arbitrary human preferences). Requiring this relativisation can often significantly change the complexity class associated with a given protocol and therefore the promise of a given method. We present a few examples of this change and the solutions required to restore the methods back to full power.

Goal: self-contained intuition

This is a purely expositional post describing some of the intuition for why we need relativisation to apply interactive proof techniques to AI safety, and what that does to the necessary proof systems. No new research results, but I previously hadn’t written this story down in a clean way, so now I have something to point to.

Interactive proof systems and IP = PSPACE

An interactive proof system consists of a verifier V and one or more provers P, such that the provers are trying to convince the verifier of answers to questions (formally, usually membership of strings x in some language L). The simplest proof system is the non-interactive one: a single prover produces a proof string y=P(x), and the verifier checks the proof as V(x,y). This corresponds to the complexity class NP.
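As a concrete (toy) instance of the non-interactive pattern V(x, y): take the instance x to be a CNF formula and the prover's string y to be a claimed satisfying assignment, which the verifier checks in time polynomial in |x|. All names below are illustrative, not from any particular library.

```python
def verify_cnf(clauses, assignment):
    """The verifier V(x, y) for SAT, the canonical NP problem.

    clauses: list of clauses, each a list of signed ints (DIMACS-style:
             3 means variable 3 is true, -3 means variable 3 is false).
    assignment: the prover's certificate, a dict variable -> bool.
    Runs in time linear in the formula size."""
    for clause in clauses:
        # A clause is satisfied if at least one of its literals holds.
        if not any(assignment[abs(lit)] == (lit > 0) for lit in clause):
            return False
    return True

# (x1 or not x2) and (x2 or x3)
clauses = [[1, -2], [2, 3]]
assert verify_cnf(clauses, {1: True, 2: True, 3: False})
assert not verify_cnf(clauses, {1: False, 2: False, 3: False})
```

The prover may need exponential work to find the certificate, but checking it is cheap: that asymmetry is exactly NP.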

There are more complicated proof systems, reaching much higher complexity classes. With polynomially many rounds of interaction, a single prover can reach the complexity class PSPACE, which is the celebrated result IP = PSPACE (IP = interactive proofs). However, this result requires re-coding the input problem as a statement about polynomials over finite fields, then using randomness to do polynomial identity testing in the finite field case. For example, if our PSPACE problem is the alternating Boolean circuit ∃x0∀y0…∃xn∀yn ϕ(x,y) where ϕ(x,y) is a circuit, we form an interactive proof by replacing x∧y → xy, x∨y → 1−(1−x)(1−y), etc. (and doing some trickery to lower degrees, which I will ignore for this sketch). The prover then makes claims about what the summary polynomials are if we marginalise over some suffix of the variables. If the finite field is large enough, a single randomly chosen point gives us a lot of evidence about a summary being correct over the entire space, via the Schwartz-Zippel lemma.
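A minimal sketch of the arithmetisation step and the random-point test, under the stated encoding (x∧y → xy, x∨y → 1−(1−x)(1−y)); the prime and the example circuits are my own choices for illustration.

```python
import random

P = 2**31 - 1  # a prime field; by Schwartz-Zippel, two distinct polynomials
               # of degree d agree at a random point with probability <= d/P

def AND(a, b): return (a * b) % P                   # x ∧ y -> xy
def OR(a, b):  return (1 - (1 - a) * (1 - b)) % P   # x ∨ y -> 1-(1-x)(1-y)
def NOT(a):    return (1 - a) % P                   # ¬x    -> 1-x

# Two syntactically different circuits for the same Boolean function
# (De Morgan: not(x and y) == (not x) or (not y)). Their arithmetisations
# are the same polynomial, 1 - xy.
def lhs(x, y): return NOT(AND(x, y))
def rhs(x, y): return OR(NOT(x), NOT(y))

# Polynomial identity testing: a single random field point suffices to get
# strong evidence the two polynomials agree everywhere.
x, y = random.randrange(P), random.randrange(P)
assert lhs(x, y) == rhs(x, y)
```

Note the key move: the verifier evaluates at field points like x = 1729, far outside the Boolean cube, which is exactly what breaks once an oracle is involved (next section).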

Why we want relativisation (IP^O < PSPACE^O)

Unfortunately, this proof breaks down if we give both verifier and prover access to a certain random oracle; in this case, IP^O < PSPACE^O with probability one (w.r.t. the choice of oracle). The reason is that we can no longer use finite fields: after the encoding, the verifier would be required to answer questions about ϕ^O(x,y) evaluated at points corresponding to finite field elements, which requires evaluating the oracle at finite field points, and the oracle is not defined there. (A random oracle might be able to answer these questions too, but it is very unlikely that its answers will be consistent!) A sufficiently powerful prover can still do the summary calculations, but the interaction breaks down.
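One way to see the obstruction concretely: an oracle defined only on the Boolean cube does have a unique multilinear extension to field points, but evaluating that extension requires summing over all 2^n oracle values, which a polynomial-time verifier cannot do. The code below is a toy illustration of that cost, not part of any formal argument.

```python
import itertools, random

P = 2**31 - 1
n = 4
# Stand-in for a random oracle O: {0,1}^n -> {0,1}. It is defined ONLY at
# Boolean points; there is nothing to query at general field points.
oracle = {bits: random.randrange(2)
          for bits in itertools.product([0, 1], repeat=n)}

def multilinear_extension(point):
    """Evaluate the unique multilinear extension of `oracle` at a field
    point. Note the sum over all 2**n Boolean inputs: exponentially many
    oracle calls, far beyond a polynomial-time verifier's budget."""
    total = 0
    for bits, val in oracle.items():
        term = val
        for b, z in zip(bits, point):
            # Lagrange basis factor: z if b == 1, (1 - z) if b == 0.
            term = term * (z if b == 1 else (1 - z)) % P
        total = (total + term) % P
    return total

# At Boolean points the extension agrees with the oracle...
assert multilinear_extension((0, 1, 1, 0)) == oracle[(0, 1, 1, 0)]
# ...but at a generic field point there is no shortcut past the 2**n-term
# sum, and the oracle itself gives no help.
value_off_cube = multilinear_extension((5, 7, 11, 13))
```

The prover, with unbounded compute, can do this sum; the verifier cannot, and cannot trust the prover's claim about it without the very machinery that just broke.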

The reason this is relevant for safety is that our verifiers are humans, and humans look like oracles. We can ask humans questions, but we don’t see the human as a Boolean circuit, and certainly can’t lift the human up into finite field space. Thus, we arrive at this heuristic:

  • Relativisation heuristic: In order to be relevant for AI safety purposes, an interactive proof result must relativise. (This is mostly equivalent to saying we care about the behaviour of interactive proof systems w.r.t. a random or generic oracle.)

This is only a heuristic, not least because I haven’t defined what a “non-relativising result” is, exactly. It’s possible someone will come up with an application that breaks the heuristic, and indeed that could be good, since non-relativising techniques often correspond to stronger methods. The heuristic is also restricted to the “messy” setting where we use something like humans as the verifier: if stronger verifiers are available, such as Amit et al., Models That Prove Their Own Correctness, we can use different techniques.

And even if the heuristic holds, it is only necessary, not sufficient. Using an oracle given to both verifier and prover models human intuition and limits to human intuition, but it does not model limits to machine intuition. Brown-Cohen and Irving, Debates, oracles, and obfuscated arguments describes how to model the latter, using two oracles instead of one.

Relativisation requires more provers: DEBATE=PSPACE

If our verifier V is limited to polynomial time and we want to reach the complexity class PSPACE with our interactive proof system, we can recover relativisation by going from one prover to two competing provers.

Theorem (DEBATE = PSPACE): Interactive proofs with a polynomial-time verifier and two zero-sum competing provers correspond to the complexity class PSPACE, w.r.t. any oracle.

Trivial proof: TQBF (true quantified Boolean formulas) is a complete problem for PSPACE. These look like ∃x0∀y0…∃xn∀yn ϕ(x,y), and thus correspond exactly to a two-prover game where one player chooses the x’s and the other chooses the y’s. This correspondence is unchanged if ϕ = ϕ^O makes calls to an oracle.

This proof is essentially just the definition of PSPACE thought of as an interactive proof system, and thus it relativises. In summary, if we want to reach PSPACE, we need just one prover if we don’t care about relativisation, and two if we care about relativisation.
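The TQBF-as-game correspondence can be sketched directly: the ∃-prover maximises, the ∀-prover minimises, and the formula is true iff the ∃-prover has a winning strategy under optimal play. A toy evaluator (illustrative names, brute-force minimax):

```python
def tqbf_game(phi, n, x_choices=(), y_choices=()):
    """Evaluate the quantified formula ∃x0 ∀y0 ... ∃x_{n-1} ∀y_{n-1} phi(x, y)
    as a zero-sum game: the ∃-prover picks each x_i, then the ∀-prover
    responds with y_i. phi may internally call an oracle; the game
    structure is unchanged."""
    i = len(x_choices)
    if i == n:
        return bool(phi(x_choices, y_choices))
    # ∃-prover needs SOME x_i such that for EVERY ∀-response y_i the
    # rest of the game is still winning.
    return any(
        all(tqbf_game(phi, n, x_choices + (x,), y_choices + (y,))
            for y in (0, 1))
        for x in (0, 1)
    )

# ∃x0 ∀y0: (x0 or y0) is true: the ∃-prover plays x0 = 1.
assert tqbf_game(lambda xs, ys: xs[0] or ys[0], 1)
# ∃x0 ∀y0: (x0 == y0) is false: the ∀-prover can always mismatch.
assert not tqbf_game(lambda xs, ys: xs[0] == ys[0], 1)
```

Brute force here takes time 4^n, of course; the point of the theorem is only that the verifier's per-round work is polynomial, with the search burden pushed onto the optimal provers.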

Open question: Here we’ve used two competing provers playing a zero-sum game. Is it possible to do this with cooperating provers, in some way that doesn’t go all the way to NEXP? It seems unlikely but we should figure this out.

NEXP: From two cooperating provers to two competing provers

Now say we’re being more ambitious, and want to get to NEXP (nondeterministic exponential time). The standard complete problem for NEXP is Succinct-SAT: we write down a polynomial-size circuit that describes an exponentially large Boolean circuit, then ask whether it is satisfiable.
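To make "a polynomial-size circuit describing an exponentially large circuit" concrete, here is a toy version: a tiny function that defines clause i of a 2^n-clause CNF on demand. The clause rule is invented purely for illustration.

```python
n = 3  # the description below is O(1) size but defines a 2**n-clause formula

def clause(i):
    """A succinct description: given an index i, return clause i of the
    (exponentially large) CNF. Toy rule, illustrative only:
    clause i is (v_i or not v_{(i+1) mod N})."""
    N = 2**n
    return [(i, True), ((i + 1) % N, False)]  # (variable, required sign)

def expand():
    """The explicit formula: 2**n clauses, exponential in the size of the
    description. Succinct-SAT asks whether such a formula is satisfiable,
    and is NEXP-complete."""
    return [clause(i) for i in range(2**n)]

# Given an assignment, a verifier can spot-check any single clause cheaply
# without ever expanding the whole formula.
assignment = [True] * (2**n)
assert all(any(assignment[v] == sign for v, sign in clause(i))
           for i in range(2**n))
```

The gap between the description size (here, a few lines) and the expanded object (2^n clauses) is exactly why a certificate for Succinct-SAT is exponentially large, and why a single NP-style proof string no longer suffices.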

In the non-relativising case, Babai, Fortnow, Lund 1991 (following Ben-Or, Goldwasser, Kilian, and Wigderson 1988) showed that two potentially colluding provers able to communicate only via the verifier reach any NEXP problem (MIP = NEXP). However, this again proceeds by encoding Boolean circuits as polynomials over finite fields, and fails to relativise as a result: w.r.t. a random oracle, MIP^O < NEXP^O.

We can recover relativisation if we (1) make the provers compete and (2) give the provers and verifier oracle access to both provers. This corresponds to the cross-examination proposal of Barnes and Christiano 2020, and the proof that cross-examination = NEXP is here. Relative to the MIP setting where the two provers collude, we are making a much stronger assumption of zero-sum optimal play. Relative to debate (which is already two competing provers), the oracle access allows one debater to ask a sequence of questions of the other debater, to which the other debater must give independent answers: they don’t see the answers they gave earlier in the sequence. This forces the debaters to precommit to a large object (in this case the full exponential-size solution to the Succinct-SAT expanded circuit), of which we show only a small component to the verifier.
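A toy simulation of the consistency-checking idea behind cross-examination: pose the same questions to independent copies of a debater, where each copy cannot see the others' answers. A debater genuinely committed to one fixed large object answers consistently; one improvising per copy risks being caught. All of this is an illustrative sketch, not the actual protocol.

```python
def cross_examine(debater, questions, copies=2):
    """Ask each of `copies` independent copies of `debater` the same
    questions. `debater` is a callable question -> answer; repeated rounds
    of calls model copies that cannot see each other's transcripts.
    Returns (consistent?, first copy's answers)."""
    transcripts = [[debater(q) for q in questions] for _ in range(copies)]
    consistent = all(t == transcripts[0] for t in transcripts)
    return consistent, transcripts[0]

# An honest debater precommitted to a fixed table is always consistent.
table = {0: "yes", 1: "no", 2: "yes"}
ok, _ = cross_examine(lambda q: table[q], [0, 1, 2])
assert ok

class ImprovisingDebater:
    """A debater with no real precommitment: it answers 'yes' for the first
    copy's questions, then drifts. Cross-examination exposes it."""
    def __init__(self):
        self.calls = 0
    def __call__(self, q):
        self.calls += 1
        return "yes" if self.calls <= 3 else "no"

ok, _ = cross_examine(ImprovisingDebater(), [0, 1, 2])
assert not ok
```

The real protocol is adversarial rather than mechanical: it is the opposing debater who chooses which question sequences to probe, so inconsistency anywhere in the exponentially large committed object is a losing liability.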

Can we go further? (MIP∗=RE and pointer silliness)

NEXP is enormous (though amusingly not big enough to find Gröbner bases; you need EXPSPACE for that). We can go all the way to the top, to everything computable in any amount of time with a polynomial-time verifier, using multiple interacting provers with an arbitrarily large number of entangled qubits as their only method of communicating (except through the verifier). This is the very silly MIP∗ = RE result of Ji et al. 2020 (silly in a good way), where RE is the class of recursively enumerable languages (any language a Turing machine can accept, ignoring time and space complexity).

MIP* = RE does not relativise, since again it uses finite field encoding (I don’t know if it algebrises). However, you can replace the MIP∗ setup using two quantum-entangled provers with two competing provers with (1) the ability to commit to an arbitrarily large two-dimensional table, (2) the ability to share “pointers” into this table with each other and the verifier, and (3) the ability to increment pointers in both directions in a verifiable manner. The pointer idea is from Paul Christiano, and Evan has more discussion here. Note that the pointers necessarily require an arbitrarily large number of bits even to represent, so the verifier can’t be shown a pointer’s value directly. Pointers are simple as an idea, but still all of this is very silly: it will not in fact be the case that computers have an unboundedly large internal state.
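A very rough sketch of the pointer interface, just to fix intuitions: the committed table is arbitrarily large (here, defined lazily by a rule), the verifier only ever sees one cell per query, and pointers are unbounded integers that support increments in both directions. Everything here is illustrative; the real protocol's commitments and verifiable increments are far more involved.

```python
class PointerTable:
    """Toy interface to an arbitrarily large 2-D committed table. The
    verifier holds pointers (unbounded integers), can move them by +1/-1,
    and can ask for the single cell a pointer pair addresses -- it never
    sees the table, or even a pointer's full value, in bulk."""
    def __init__(self, cell_fn):
        self.cell_fn = cell_fn  # cell_fn(i, j): cell (i, j), computed on demand

    def lookup(self, row_ptr, col_ptr):
        # One cell revealed per query, however large the table is.
        return self.cell_fn(row_ptr, col_ptr)

# Pointers far too large for the verifier to enumerate the table around them;
# Python ints are arbitrary precision, so incrementing in both directions
# is trivial in this toy version.
table = PointerTable(lambda i, j: (i * j) % 7)
p, q = 10**30, 10**30 + 3
assert table.lookup(p, q) == (p * q) % 7
p += 1  # stand-in for a verifiable increment
assert table.lookup(p, q) == (p * q) % 7
```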

However, the point of this final silly section is to get one more row in a table of non-relativising vs. relativising interactive proof systems:

| Desired complexity class | Non-relativising requirements | Relativising requirements |
|---|---|---|
| PSPACE | One (malicious?) prover | Two competing provers |
| NEXP | Two (colluding?) provers who can't communicate | Two competing provers with oracle access to each other |
| RE | Two (colluding?) provers with shared qubits | Two competing provers with pointers to arbitrary-size state |

There are other rows one could fill in!

Collaborate with us

We'd be very excited to collaborate on further research. If you're interested in collaborating with UK AISI, you can express interest here. If you're a non-profit or academic, you can also apply for grants up to £200,000 directly here.