Research update: RL on Debate Games shows Proposal Accuracy uplift alongside Judge Hacking

lennie; joanv; Shi; Jacob Pfau

The first three sections are written for a general TAIS reader who wants to understand what the state of Debate research is and some high-level takeaways of our work. A reader familiar with Debate may like to skip the setup and start with our presentation of An illustrative training run. The remaining sections are written primarily for ‘motivated’ readers, who may want to build upon our work, as we discuss in Outline of rest of post.

We are still actively working on this project and scaling up the empirics. Please do share thoughts and feedback, and let us know if you’d like to hop on a call to chat! We’d be particularly interested to hear about datasets that might be interesting for Debate research (see below).

Introduction

The original AI Safety via Debate paper from 2018 (by Irving, Christiano and Amodei) proposes training AIs via self-play on a zero-sum debate game. We will simply write Debate (with a capital D) to refer to this framework.

Correspondingly, when we write Debate, we have in mind a training procedure. This may be slightly counter-intuitive to some readers. Indeed, we are under the impression that many Technical AI Safety readers have some broad awareness of the idea of ‘getting AIs to debate’ but have not internalised that the original Debate proposal concerns training. Though of course one can also ‘get AIs to debate’ as an inference-time technique, it is not a focus of our work, and we only consider inference-only experiments in order to better understand training dynamics.

The key motivation for Debate is that even if a human is unable to judge the quality of the answer to a question directly, they may be able to evaluate a debate between two strong AIs if the debate can decompose and recurse down to claims that the human is able to evaluate directly. Moreover, if the AIs are able to choose what position to take, one would hope that empirically it will be easier for an AI to win a debate by arguing for a true position rather than a lie, and so AIs would be incentivised to give honest answers. Correspondingly, one would hope that training AIs to win debate games would lead to more honest answers. On the other hand, by definition, such training encourages AIs to persuade the Judge rather than ‘be honest’, so one might expect the strong AIs to find ways to exploit weaker judges. Note that Debate training itself only updates the Debaters and not the Judge, so we typically think of the Judge as fixed^[1].

There have previously been a number of research projects conducted on Debate from the broader AI safety community. These typically have the flavour of (i) theoretical analysis (mostly complexity theoretic), (ii) human studies, and (iii) empirical work with LLMs^[2]. However, to our knowledge, most of the published empirical work with LLMs uses old (‘weak’) models and none considers training debaters in settings where AIs can choose what position to take. Our research aims to fill this gap with empirics that more closely reflect realistic deployment contexts.

This post contains a short progress update for our project. We hope to make the core ideas of ‘Debate’ clear to a general TAIS audience, and make it concrete what precisely this might look like with LLMs, illustrating this vision with some preliminary empirics. We are aware the empirics are not perfect, but for logistical reasons we have prioritised writing this update before iterating further and believe they should be illuminating nonetheless.

Here are the high-level takeaways that we would like a general audience to draw from this post:

Debate involves training AIs via self-play on a zero-sum debate game.
There are many ways to make this precise, and we suspect many operationalisations will fail. It is unclear if there is some operationalisation that will robustly succeed.
We have preliminary results suggesting that Debate (training) ‘can work’...
…in that it improves proposal accuracy on Hendrycks’ MATH (in a particular operationalisation)…
…but also shows one of the debaters learning to be dishonest and exploit the judge.
Debate research is possible on a non-profit budget and there are more questions than answers at the moment.

We first expand upon our core framing and clarify a vision for how Debate may be used in the near-future, before presenting results from a particular set of runs to make things concrete. We then provide extensive further discussion for those wanting to dig into the details and more deeply understand our experiments, as well as a FAQ section to address potential ambiguities from the main text, and discuss limitations of our work.

Making Debate training concrete

We first elaborate upon what a deployment context might look like. We are primarily motivated by the potential of using Debate to train AIs to be better automated alignment researchers. This application is particularly important because work performed by near-future automated researchers will determine the alignment of future generations of AIs (if labs follow through with their RSI plans).

We now discuss why Debate has potential to help in such a setting. Firstly, there is the core hope that Debate might incentivise honesty, and we would like to be in a world where automated researchers are honest! A related argument for why Debate might be helpful concerns reward hacking: the current paradigm of throwing a ‘big blob of compute’ into RL training against static grader systems has been demonstrated to lead to reward hacking and meta-gaming tendencies; one might hope that by introducing another strong AI to act as a dynamic ‘adversary’ in a zero-sum game, Debate will provide some defence against grader exploitation.

We next discuss how to make the core idea of Debate (training) with LLMs concrete. Reinforcement Learning (RL) is the obvious toolkit to use to perform self-play. Two strong AIs will debate a given question, generating a debate transcript, which is then processed by some ‘Judge system’ to determine zero-sum rewards. For RL to be tractable the Judge system needs to be fast and to operate at scale. Correspondingly, we consider the Judge system to consist of an LLM that is given a prompt rendered from (some subset of) the debate transcript. In our work we treat the Judge as non-adversarial and assume that developers take care to align the Judge, for example by fine-tuning to be consistent with human judgements on some (necessarily small) collection of debate transcripts^[3]. Henceforth we refer to it simply as ‘the Judge’.

To use Debate for research tasks one needs to be able to instantiate a debate corresponding to a generic research task that we would like a strong AI to perform, such as ‘interpret these results’, ‘design a set of experiments to investigate this hypothesis’ or ‘investigate what theoretical insights can be derived in a certain setting’. Correspondingly, we always consider debates where the debate game starts with one or more participants providing a Proposal, and the quality of the Proposal(s) being the core object of interest for the debate.

The design space for Debate is large. We found it helpful to distinguish at a high level between Debate protocols which consider a single Proposal and debate its quality, and those where each Debater presents a Proposal and they debate which is better. There are then important details regarding (i) how precisely transcripts are generated (what is the turn structure and what does each debater ‘see’) and (ii) the specification of the judge system (in particular details of the adjudication criteria). There are also many ‘less interesting’ implementation details (such as choice of RL algorithm, optimiser, further hyperparameters).

It is not entirely obvious what ‘success’ looks like for a Debate training run. We think a core indicator is that the ‘quality’ of Proposals should increase; a reason we focus on Proposal quality is that we imagine typical deployment scenarios would consist of generating a Proposal without any subsequent debate, so the Proposal is the only part of the debate transcript of direct interest to deployment behaviour. That said, we are still uncertain about how best to make this precise, and which alternative metrics are also important to consider.

We suspect that many points in this design space will lead to unhelpful training dynamics, but some may lead to helpful training dynamics. We seek to understand if and how the different ‘levers’ of the design space can be positioned to encourage helpful training dynamics.

We now make a particular operationalisation of debate concrete by describing a particular set of training runs.

An illustrative training run

We run debates on competition math questions with a ‘Propose-Critique-Decide’ (PCD) debate protocol^[4]. In this protocol the two AIs debating have asymmetric roles: the Proposer proposes a solution to a given problem, and the Critic critiques the proposal. The judge then evaluates the validity of the proposal.

For the experiments below we used: questions from a subset of Hendrycks' MATH^[5], Qwen3-30B-A3B Instruct for debaters and Llama 3.1 8B for the (weaker) judge. We also present results from runs with the Qwen3-30B-A3B Instruct model used as (a stronger) judge.

The following graphic illustrates what such a debate might look like, with ‘transcript narrative’ based upon a genuine debate that we observed during some manual transcript analysis.

Figure 1: An illustration of the Propose-Critique-Decide protocol with a simplified debate transcript.

The primary metric of interest in our experiments was Proposal accuracy. This was because it was a convenient-to-measure notion of the idea of Proposal ‘quality’ that we motivated above. Indeed, MATH is designed as a capabilities eval so is easy to score with an LLM grader^[6]. We would be more excited to measure ‘truthfulness’ or ‘honesty’, but these are harder to make precise, so we defer them to future work.

The following plots show how this key quantity evolved over training for a particular set of runs corresponding to a single random seed. We compare four protocols, consisting of two Proposer-only baselines (single-player) and two Debate protocols (self-play):

RLVR: proposal is graded using the ground-truth answer. Proposer reward is +1 if the answer is correct, 0 if the answer is incorrect.
Consultancy: a judge model grades the proposal as correct or incorrect. Proposer reward is +1 if the judge grades the answer as correct, 0 if graded incorrect.
PCD-2: Propose-Critique-Decide protocol consisting of only two turns – the Proposer gives a single proposal and the Critic offers a single critique. Rewards (+1,-1) to Proposer and Critic respectively if the judge thinks the Critic could not find significant errors in the Proposal; rewards (0, 0) if the Critic raised significant objections but did not invalidate the proposal; rewards of (-1,+1) if the judge thinks the proposal is incorrect.
PCD-4: PCD protocol consisting of four turns – after the initial proposal and critique the Proposer rebuts the critique and the Critic responds to the rebuttal. Similar reward structure as for PCD-2.
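
To make the four reward schemes above concrete, here is a minimal sketch of how verdicts might map to rewards. The `Verdict` enum, its member names and the function names are hypothetical labels introduced for illustration; only the numerical reward values come from the descriptions above.

```python
from enum import Enum


class Verdict(Enum):
    """Hypothetical labels for the three-way judge decision used in PCD."""
    NO_SIGNIFICANT_ERRORS = "no_significant_errors"  # Critic found nothing significant
    OBJECTIONS_NOT_FATAL = "objections_not_fatal"    # significant objections, proposal not invalidated
    PROPOSAL_INCORRECT = "proposal_incorrect"        # judge thinks the proposal is incorrect


def rlvr_reward(answer_is_correct: bool) -> int:
    """RLVR: reward from the ground-truth answer (single-player)."""
    return 1 if answer_is_correct else 0


def consultancy_reward(judge_says_correct: bool) -> int:
    """Consultancy: reward from the judge's grade alone (single-player)."""
    return 1 if judge_says_correct else 0


# (Proposer reward, Critic reward) for each judge verdict in PCD-2 / PCD-4.
PCD_REWARDS = {
    Verdict.NO_SIGNIFICANT_ERRORS: (1, -1),
    Verdict.OBJECTIONS_NOT_FATAL: (0, 0),
    Verdict.PROPOSAL_INCORRECT: (-1, 1),
}


def pcd_rewards(verdict: Verdict) -> tuple[int, int]:
    """PCD: zero-sum rewards for (Proposer, Critic) given the judge verdict."""
    return PCD_REWARDS[verdict]
```

Note that every PCD outcome sums to zero across the two debaters, whereas the two baselines only ever reward the Proposer.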

Our expectation before the experiment was that RLVR should provide a ‘ceiling’ on performance, since it performs RL directly using the ground-truth Proposal accuracy of interest to determine reward. By contrast, we expected Consultancy to give a ‘floor’ on performance, since the weak judge would struggle to determine accuracy of proposals, resulting in a noisy reward obscuring the core learning signal; additionally, we suspected that this might be vulnerable to the consultant exploiting the judge’s biases, perhaps even using techniques akin to jailbreaking to ensure positive grades. We suspected that, if designed appropriately, the PCD protocols should have learning curves somewhere between the ceiling and the floor, and that an ‘ideal Debate protocol’ would show Proposer accuracy evolving close to the RLVR ceiling.

Results

The left hand plot uses the weaker Llama 3.1 8B judge whereas the right plot uses a Qwen3-30B-A3B Instruct judge, matching the debater model. The plots show that Proposal accuracy on a held-out evaluation set increases over the course of training for all runs.

Figure 2: Comparing the evolution of Proposer Accuracy over training between Debate protocols (PCD-2 and PCD-4), Consultancy (expected floor), and RLVR (expected ceiling).

We observe that in the left hand plot, with the weaker Llama 3.1 8B judge, the trajectories do indeed match our hypothesised structure, with both Debate protocols performing somewhere between the ‘floor’ of Consultancy and the ‘ceiling’ of RLVR. In the right hand plot we instead observe that the runs are no longer well separated; this is consistent with our empirical observation that, at least at the start of training, the Qwen3-30B-A3B judge obtained near-perfect classification accuracy.

Interestingly, analysis of other metrics suggested some ‘judge hacking’ behaviour that was not noticeable in the plots of Proposer Accuracy. Below we plot ‘Judge Alignment’ (1 if the judge accepts a correct proposal or rejects an incorrect proposal, 0 otherwise) and P(reject|correct), the probability that the Judge rejects a correct proposal. We observe that Judge Alignment decreases dramatically after the first 50 steps of the PCD-4 run with the weak Llama judge, and this is primarily explained by a corresponding increase in P(reject|correct). There is also a similar trend for the stronger Qwen judge, but in this case the increase in P(reject|correct) is more gradual and less extreme. Notably, despite this behaviour, Proposer accuracy continues to rise over the course of training; it appears that even though the proposal is most likely going to be rejected, the Proposer is still incentivised to give the best proposal possible.

Figure 3: In both PCD-4 runs we see the Judge increasingly rejects correct Proposals. Transcript analysis shows this is due to the Critic learning to give very harsh spurious critiques.
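
As a rough sketch of how the two metrics in Figure 3 can be computed from per-debate records: the function names are hypothetical, and we assume each debate has already been reduced to a binary ‘judge accepts’ flag and a ground-truth ‘proposal correct’ flag (how the three-way verdict is collapsed into accept/reject is left to the caller and is not specified here).

```python
def judge_alignment(judge_accepts: list[bool], proposal_correct: list[bool]) -> float:
    """Fraction of debates where the judge's accept/reject matches ground truth."""
    pairs = list(zip(judge_accepts, proposal_correct))
    if not pairs:
        return float("nan")
    return sum(accepted == correct for accepted, correct in pairs) / len(pairs)


def p_reject_given_correct(judge_accepts: list[bool], proposal_correct: list[bool]) -> float:
    """Empirical probability that the judge rejects a proposal that is in fact correct."""
    rejected_flags = [
        not accepted
        for accepted, correct in zip(judge_accepts, proposal_correct)
        if correct
    ]
    if not rejected_flags:
        return float("nan")  # no correct proposals in this batch
    return sum(rejected_flags) / len(rejected_flags)


# Toy batch of four debates (made-up values, purely illustrative).
accepts = [True, False, False, True]
correct = [True, True, False, False]
alignment = judge_alignment(accepts, correct)      # 2 of 4 verdicts match ground truth
p_rej = p_reject_given_correct(accepts, correct)   # 1 of 2 correct proposals rejected
```

A spurious-critique strategy of the kind described here would show up as `p_rej` rising while Proposal accuracy itself is unaffected.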

Manual analysis of transcripts showed that this was due to a dynamic where the critic would learn to always give extremely harsh critiques of the Proposal, repeatedly declaring its inadequacy with strong emotive language (e.g. “the proposal is incomplete…it fails to meet the standard of a complete proof…not sufficient as a proof”). We refer to this as a spurious-critique strategy. Moreover, in the PCD-4 setting the Critic would in fact learn to agree with the Proposer in its initial critique, “the proposal appears correct”, before changing tack and slamming into the proposal during its final turn, “the proposal is INCORRECT…”. We refer to this as a back-stab strategy. The weak judge appeared unable to resist this strong emotive language.

We present full transcripts for 20 (random) questions in the evaluation set at 6 (hand-picked) evaluation checkpoints at this webpage.

Discussion

We were reassured that Proposer Accuracy increased over Debate training and outperformed the Consultancy baseline for the weak judge, but would not want to update too hard on this. Firstly, there is the obvious issue of sample size, and, though these observations were consistent with other experiments we ran, we have yet to run a larger-scale analysis. Secondly, we were to some extent ‘aiming’ for a positive result and iterated harder at inference-only on prompts for PCD than for Consultancy. Thirdly, we did not run sufficient baselines / ablations to have evidence that ‘each piece’ of the Debate setting helps^[7]. Despite the caveats we see these results as ‘signs-of-life’ that we hope will encourage more research into Debate.

We think the observation that Debate can improve Proposer Accuracy even despite an increase in persuasive spurious critiques is interesting. One could argue that if one only cares about Proposal quality it may be acceptable to consider Debate protocols that do not encourage honesty for non-Proposal turns or for which Judge Alignment can degrade^[8].

That said, the clear Judge Hacking from the spurious-critique strategy may well be the most important observation. We primarily attribute the spurious-critique strategy to the following two aspects of the Debate game. Firstly, the Critic is always incentivised to persuade the Judge that the proposal is incorrect, and there is no accompanying incentive for the Critic to be honest. Secondly, in both the PCD-2 and PCD-4 protocols, the Critic has a last-mover advantage. We see these as important flaws in the current Debate protocol, and correspondingly think of this as a paradigmatic example of a ‘protocol-level’ problem. We saw the same qualitative dynamics with both judges (albeit the onset was slower with the stronger judge), and more generally would expect similar dynamics for these PCD protocols with different models or on different task distributions.

This type of protocol-level failure is consistent with our expectation that many operationalisations of Debate will fail. There are obvious tweaks that may ‘patch’ this particular failure, such as (i) giving the Critic a large penalty if the Proposer can identify a critique as spurious, or (ii) enforcing simultaneous ‘final statements’ to mitigate last-mover advantage. As yet it is unclear whether patching can lead to a ‘robust’ protocol, or whether one is left playing ‘whack-a-mole’ as subtler failures surface. A core focus of the rest of our project is to iterate upon protocols to shed light upon these matters.

Outline of rest of post

The rest of this post is written for a ‘motivated reader’, who wants to get a deeper appreciation for what we did, what we have learned, and what we are thinking of doing next. The sections consist of:

Further conceptual discussion: presenting a lens through which to distinguish different components of the design space and discussing the variety of possible inference-only setups.
Details pertaining to the experiment above: with discussion of why we made many of the choices we made.
What updates have we made?
What’s next?
FAQs

(the last three being relatively self-explanatory).

We highly suggest such a reader view the transcripts linked above to get a better ‘feel’ for the qualitative behaviours we observed.

Further conceptual discussion

Lennie’s note: I realise this section is somewhat non-technical and 'fuzzy', but I think the main ideas that I am trying to articulate are important for those working on Debate. If people are interested in these ideas please reach out or comment etc.

Initialisation vs core dynamics

In this section we introduce a dichotomy that splits the ‘levers’^[9] determining the design space for Debate into those that determine the ‘initialisation’ and those that determine the ‘core evolution operator’. This should be seen as a loose conceptual framing, rather than a crisp formalism, but we think it captures something fundamental and is important to consider when planning empirics.

The main ‘levers’ that we consider with regard to ‘initialisation’ are the specific prompts given to the debaters and the particular set of initial debater weights. Indeed, if one (sensibly) strips the prompts from the judge’s representation of the debate transcript then these prompts simply condition initial model behaviours.

The main aspect of the ‘core evolution operator’ is the ‘core idea’ of the turn structure and Judge specification (the key criteria under which rewards are determined)^[10]. However, there are also further important details regarding the order in which the debaters act and what information from previous turns is shown to each debater and indeed to the Judge.

Why do we distinguish between these? Well, intuitively we think of the core evolution operator as determining the ‘shape’ of Debate trajectories, and what different equilibria may look like, while the initialisation determines the distribution of equilibria that the dynamics could lead to.

Correspondingly, we are generally more interested in learning about how to position the levers that determine the evolution operator, as we hope insights about such levers will be robust to ‘scaling up’ compute used for RL, whereas the effect of initialisation may ‘wash out’ in the limit. That said, it is also important to understand how to initialise a Debate such that the dynamics lead to ‘better’ equilibria and we suspect this will be particularly important for small-scale experiments.

On inference-only experiments

RL can quickly become very expensive, so it would be extremely helpful if one could find a way to generate insights from inference-only experiments. We ran various inference-only experiments at the start of the project but were initially unsure what debate structures made sense, and what metrics we should track to gain insights about train-time behaviour.

In this section we outline a brief taxonomy of three different approaches to inference-only experiments. This is mostly not Debate specific, but relevant for more general inference-only approximations to RL. We then use this vocabulary to outline what we tried and what we learnt. We think it is helpful for someone working on Debate to know that these different approaches exist and what they can capture.

A brief taxonomy

Firstly (1), the most obvious sort of inference-only experiments run debates on some set of task prompts and consider statistics such as the correlation between judge verdict and ‘ground-truth’ verdict, and relatedly the correlation between Proposer reward and Proposer accuracy. Such analysis can only shed light on the training behaviour at the given initialisation.

Secondly (2), by performing a large number of repeated rollouts of a debate protocol one can start to approximate the learning dynamics of the self-play process, much like how Best-of-N (BoN) sampling can be used to approximate vanilla RL. This can be particularly elegant when using KL-regularisation, following the closed form expression for optimal policies in KL-regularised RL via Boltzmann reweighting. However, these methods only surface behaviours that would have non-trivial probability from the original model, while RL itself can ‘learn to explore’ behaviours with near-zero probability from the original model after gradual weight updates. They can also become expensive in their own right, and display high variance if high-reward behaviours have very small probability from the original model.
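
To illustrate the Boltzmann reweighting idea in its simplest form: for a single policy with KL-regularisation strength beta, the optimal policy is proportional to the base policy times exp(reward / beta), so samples from the base model can be reweighted to estimate statistics under the optimised policy. The sketch below covers only this single-policy case (the two-player self-play setting needs more care), and the function names and toy numbers are assumptions made for illustration.

```python
import math


def boltzmann_weights(rewards: list[float], beta: float) -> list[float]:
    """Self-normalised weights w_i proportional to exp(r_i / beta) for base-policy samples."""
    if beta <= 0:
        raise ValueError("beta must be positive")
    max_reward = max(rewards)  # subtract the max for numerical stability
    unnormalised = [math.exp((r - max_reward) / beta) for r in rewards]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]


def reweighted_mean(values: list[float], rewards: list[float], beta: float) -> float:
    """Estimate E[value] under the KL-regularised optimum from base-policy samples."""
    weights = boltzmann_weights(rewards, beta)
    return sum(w * v for w, v in zip(weights, values))


# Toy example: four rollouts with a judge-derived reward and ground-truth accuracy.
judge_rewards = [1.0, 1.0, 0.0, 1.0]
accuracy = [1.0, 1.0, 0.0, 0.0]
weak_optimisation = reweighted_mean(accuracy, judge_rewards, beta=1e6)     # close to the plain mean
strong_optimisation = reweighted_mean(accuracy, judge_rewards, beta=0.01)  # mass on high-reward rollouts
```

Small beta concentrates the weights on a handful of high-reward samples, which is exactly where the high-variance caveat above bites.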

Thirdly (3), one can perform more ‘targeted’ approaches to simulate optimisation dynamics, for example by using In-Context RL or Prompt Optimisation-based approaches. That said, one should not expect such techniques to learn in the same way as genuine RL, since they have quite different underlying learning mechanisms.

The figure below attempts to illustrate how (1) should give signal on the direction of evolution in policy space while (2) or (3) can give signal on dynamics. We expect each method to suggest similar initial dynamics; it is unclear how well (2) or (3) will approximate training dynamics.

Figure 4: a loose illustration comparing different inference-only techniques. RL is shown as a trajectory in policy space, here represented in 2D. Some inference-only methods also generate trajectories in policy space, which one may hope will approximate RL trajectories. Naive methods only give information about the initial behaviour of the dynamics.

What we tried

We mainly ran inference-only experiments of the first kind, considering macro-level statistics. We found these very helpful for iterating upon the debater prompt and judge prompts to ensure good qualitative behaviours as well as high correlations between Proposer reward and Proposer accuracy. We experimented with In-Context RL and found some interesting preliminary results, with Debaters clearly reasoning about how to ‘persuade the judge’; however, qualitatively these felt different to what RL itself would find, so we prioritised getting a working RL stack. We did not get round to any BoN-based approaches, but know the new Arcadia Alignment group will soon put out a nice update on this front^[11].

Details pertaining to the experiment above

We now discuss certain details of the experiment presented above, and attempt to explain the more important design decisions.

Prompt templates

The core details of the prompt template are presented in the visualisation below.

Figure 5: Summary of prompt content for PCD-4 debate structure. This consists of six successive ‘phases’ of LLM generation: 0) Proposer’s private reasoning, 1) Proposal, 2) Critic’s private reasoning, 3) Critique, 4) Proposer’s rebuttal, 5) Critic’s final response. See our repo for full versions of the prompts.

We comment on the following aspects of the prompt scaffolding:

Sequence extension property (terminology from Tinker): it is implicit from the structure above that the multi-turn interaction renders as a single user-assistant dialogue for each agent. This enables caching of computation for multi-turn RL. We also find this natural given existing training pressure on multi-turn chat interactions.
Separating reasoning from public statements: we give the Proposer 768 tokens to solve the MATH problem in rough before generating a proposal, and give the Critic 768 tokens to analyse the problem and proposal before generating an initial critique. We had originally prompted models to reason in <think> …</think> tags before making public statements, but such models would have a tendency to ‘think indefinitely’ on hard problems and often failed to give a public response within token limits; this truncation then had a dominant effect on RL training dynamics.
Hidden reasoning: we found it helpful to give the models time to think in private before giving an answer. This seemed helpful intuitively, and anecdotally, we observed it reduced the risk of ‘group think’, e.g. with a Critic copying over a mistake in the working from the proposer. It also had the helpful side-effect of reducing input token costs.
Proposer’s prompt is indicative of corresponding deployment-time prompt: ultimately we wish our work to inform and illuminate the work that may be going on within labs, which ultimately care about deploying models to do meaningful tasks. To increase the chance of generalisation, one should reduce unnecessary distribution shifts, and so we want the Proposer’s prompts to be representative of intended deployment-time prompts. In particular, we don’t tell the Proposer that it is ‘in a debate’ before the proposal is generated. More generally, in light of Initialisation vs core dynamics, we tried to keep prompts ‘minimal where possible’ to avoid obscuring the core dynamics.

That said, these are certainly not a ‘final’ set of prompts and there are various aspects we are uncertain about, such as whether the semantics of the three-way judge decision are ‘right’ in this setting. Perhaps by tweaking this judge specification we could mitigate the ‘judge hacking’ phenomena described above.

RL algorithm

For simplicity, we tried to use a ‘vanilla’ RL algorithm: this used a standard importance-weighted modification of the policy gradient objective (see Tinker docs) and group rollouts for advantage estimation. We took care to group rollouts by role (so that Proposer and Critic trajectories correspond to distinct groups); that is, for a given task prompt, we would run group_size rollouts, and calculate corresponding Proposer advantages by de-meaning the vector of group_size Proposer rewards, and similarly calculate Critic advantages by de-meaning the Critic rewards.

This is in line with previous works on multi-player RL for language models. From small-scale ablations this seemed to help; more importantly it ‘feels like the right way’ to set things up, and can be formalised as variance reduction of advantage estimation (we leave details to the reader).
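
The role-wise grouping described above can be sketched in a few lines. This shows only the de-meaning step for one task prompt; the function names are hypothetical and the policy-gradient update itself is omitted.

```python
def demean(values: list[float]) -> list[float]:
    """Subtract the group mean from each entry."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def role_grouped_advantages(
    rollout_rewards: list[tuple[float, float]],
) -> tuple[list[float], list[float]]:
    """Compute per-role advantages for one task prompt.

    rollout_rewards holds (proposer_reward, critic_reward) for each of the
    group_size rollouts on the same question. Each role is de-meaned against
    its own group, so Proposer and Critic trajectories never share a baseline.
    """
    proposer_rewards = [p for p, _ in rollout_rewards]
    critic_rewards = [c for _, c in rollout_rewards]
    return demean(proposer_rewards), demean(critic_rewards)


# Toy group of four rollouts on one question, using the PCD reward values.
group = [(1, -1), (0, 0), (-1, 1), (1, -1)]
proposer_adv, critic_adv = role_grouped_advantages(group)
```

Pooling both roles into one group would instead give a baseline of zero for every zero-sum rollout, so each role's baseline would no longer track how that role typically fares on the question.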

Hyperparameters and cost

We used group_size of 8 and questions_per_batch of 64, leading to 512 debate rollouts per training step (and so 128 ‘trajectory groups’ per step, as there are two groups for each question, one for each debater, and 1024 total trajectories).

We now perform a cost BOTEC.

Total tokens per debate: consider the PCD-4 structure described above. Let’s calculate the total token cost of a single debate rollout. Suppose that each of the six generations attains the maximum of 768 tokens and that the average scaffolding tokens are around 200 tokens; since on average there are ~2 previous model generations of context we have average input lengths of ~1750 tokens and total input length of ~10k tokens. Correspondingly, we have a total generation length of ~4500 tokens and need to ‘backprop’ through ~8k tokens (3.5k tokens for Proposer and 4.2k for Critic).

Tokens per step: a step of 512 debate rollouts then gives a total of ~5M input tokens, ~2M generation tokens and ~4M backprop tokens.

Price per step: At the time of experiment, the costs for the 30B-A3B model were $0.12/M input, $0.3/M generation and $0.36/M backprop. Combining this with the token data above gives a total of around $2.50 per step.

Price per run: over 100 steps this therefore gives $250 for a run.

Comments: our original BOTECs did not account for group_size, which adds about an OOM. If one wants to run a sweep of 4 experimental configurations for 3 repeats one can easily blow $3k overnight. One can also easily add an OOM from using a larger/newer model.
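
The BOTEC above can be reproduced as a short script. The token counts, prices and hyperparameters are the ones quoted in the text; the variable names are ours, and because the script does not round intermediate token counts it lands slightly above the ~$2.50 per step quoted, in the same ballpark.

```python
# Hyperparameters quoted in the text.
GROUP_SIZE = 8
QUESTIONS_PER_BATCH = 64
STEPS_PER_RUN = 100

# Per-debate assumptions for PCD-4 (six generation phases).
NUM_PHASES = 6
MAX_GEN_TOKENS = 768          # assume every generation hits the cap
SCAFFOLD_TOKENS = 200         # average scaffolding per input
AVG_PREVIOUS_GENERATIONS = 2  # average number of earlier generations in context
BACKPROP_TOKENS = 3_500 + 4_200  # Proposer + Critic, as quoted

# Prices in dollars per million tokens, as quoted for the 30B-A3B model.
PRICE_INPUT, PRICE_GENERATION, PRICE_BACKPROP = 0.12, 0.30, 0.36

rollouts_per_step = GROUP_SIZE * QUESTIONS_PER_BATCH  # debates per step
trajectory_groups_per_step = 2 * QUESTIONS_PER_BATCH  # one group per role per question
trajectories_per_step = 2 * rollouts_per_step         # one trajectory per role per debate

avg_input_tokens = SCAFFOLD_TOKENS + AVG_PREVIOUS_GENERATIONS * MAX_GEN_TOKENS
input_tokens_per_debate = NUM_PHASES * avg_input_tokens
generation_tokens_per_debate = NUM_PHASES * MAX_GEN_TOKENS

cost_per_debate = (
    input_tokens_per_debate * PRICE_INPUT
    + generation_tokens_per_debate * PRICE_GENERATION
    + BACKPROP_TOKENS * PRICE_BACKPROP
) / 1_000_000

cost_per_step = rollouts_per_step * cost_per_debate
cost_per_run = STEPS_PER_RUN * cost_per_step
cost_per_sweep = 4 * 3 * cost_per_run  # 4 configurations x 3 repeats

print(f"per step: ${cost_per_step:.2f}, per run: ${cost_per_run:.0f}, sweep: ${cost_per_sweep:.0f}")
```

Changing `GROUP_SIZE` to 1 shows directly where the missing order of magnitude in the original BOTECs came from.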

What updates have we made?

The choice of RL algorithm seemed to be less important than we expected. We performed some preliminary experiments with different techniques (e.g. adding KL regularisation, value heads, more sophisticated credit assignment) but did not observe significant uplift. That said, there might be ‘sweet spots’ that we failed to find.
It was harder to get ‘multi-turn uplift’ than we had expected. Indeed, the original Debate paper suggests that the power of Debate comes from depth and one should expect Debate to perform better as one increases the number of turns. If anything, we saw accuracy degrade as we added more turns, but this may be a relic of the weak judge models we were using.
Dataset curation and filtering was more important than we expected. We ended up spending many researcher-hours finding a combination of ‘Proposer specification’ and ‘question subset’ for which there was ‘headroom’ for uplift from Debate training. (The ‘truncated reasoning followed by public statement’ structure described above was also helpful for ‘handicapping’ the Proposer and leaving headroom).
The Judge rubric was harder to specify than we expected. When we started working on the project we thought that it would be ‘easy’ to write down a good judge prompt. In practice we found it hard to write down something that would robustly encourage ‘desired’ behaviours, as discussed in Prompt templates.
RL takes a lot of steps/rollouts. You cannot expect to see meaningful trends from a handful of RL steps. It helps to have big batch sizes. We were particularly bottlenecked by serial calendar time on Tinker, with our (non-optimised) implementation taking over 12 hours for the PCD runs plotted. (Previous experiments with longer context lengths took even longer).
We think one should generally make debaters commit to a position at the start of a debate. We originally considered letting the debaters revise their positions, but it was unclear what ‘good behaviours’ should look like and correspondingly how rewards should be administered (e.g. should debaters be rewarded for updating towards the correct answer?). We found these questions particularly confusing in initial inference-only settings and running training experiments helped to improve our intuitions.

What’s next?

We have since moved from Tinker to enable optimisations for cheaper and faster training, and have started to experiment with improvements to the core Debate game. We are also exploring a couple of other datasets and protocols that involve comparing two candidate Proposals rather than evaluating a single one.

We’d particularly appreciate feedback about datasets that might be helpful for future empirical Debate research. One tension we’ve faced is that we have been looking for datasets (i) with a meaningful and easily scoreable proxy for quality/truthfulness, (ii) with headroom for Debate training (with large open source models) to improve this metric; but many such datasets have already been hill-climbed upon by developers. If Debates can be run with small token budgets that would be a bonus. That said, these are certainly not the only important considerations and we’d also be interested in people’s takes on what properties a dataset should have to facilitate future work.

We suggest the following as interesting research questions to guide future work:

Can one provide empirical or theoretical insights for when and how inference-only experiments can shed light on training dynamics?
Can one characterise the sorts of tasks that Debate can perform well on? (We have preliminary results suggesting Debate will be less effective on short knowledge-based tasks such as SimpleQA, since it is harder for the Judge to reason about correctness.)
How should one think about ‘honesty’ for Debate? Perhaps one can capture this more directly than with the ‘accuracy’ proxy we used to sweep it under the carpet.
What extra protocol-level levers can one pull to improve Debate’s performance? (Perhaps one could try explicit ‘decomposition’, ‘judge-ensembling’^[12], or ‘cross-examination’?).
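To make the ‘judge-ensembling’ lever in the last question concrete, here is one possible minimal reading of it: score the same transcript with several judges and average the results. This is a hypothetical sketch rather than something we have implemented or evaluated; the `Judge` type, the `ensemble_verdict` name and the 0.5 threshold are all assumptions made for illustration.

```python
from statistics import mean
from typing import Callable, Sequence, Tuple

# Hypothetical interface: a judge maps a debate transcript to a score in [0, 1],
# where higher means the judge rates the Proposal as more likely to be correct.
Judge = Callable[[str], float]


def ensemble_verdict(
    transcript: str,
    judges: Sequence[Judge],
    threshold: float = 0.5,
) -> Tuple[float, bool]:
    """Average the scores of several judges and threshold the mean."""
    if not judges:
        raise ValueError("need at least one judge")
    scores = [judge(transcript) for judge in judges]
    avg = mean(scores)
    return avg, avg >= threshold


if __name__ == "__main__":
    # Stub judges standing in for separate judge models or prompts.
    lenient: Judge = lambda t: 1.0
    strict: Judge = lambda t: 0.0
    length_based: Judge = lambda t: 1.0 if len(t) > 10 else 0.0

    score, accepted = ensemble_verdict(
        "Proposer: ... Critic: ...", [lenient, strict, length_based]
    )
    print(round(score, 3), accepted)
```

Averaging is only one aggregation choice; a majority vote or taking the minimum score would be stricter alternatives, and which is preferable is exactly the kind of question that would need empirical work.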

Since the design space is large, there are many different empirical directions to pursue. If you think you might have good takes on what we should prioritise please do reach out!

FAQs

Q: What are the main limitations of the empirics presented?

There are various things that could be improved: the ‘truth-seeking’ note in debater system prompts should be ablated; we did not iterate very hard on the Llama judge; we did not include all relevant controls (e.g. we are particularly interested in ‘static Critic’ baselining at present). Correspondingly, we do not feel we can make any claims about the comparative performance of methodologies at this point, but we think the results are still meaningful and informative as ‘an illustration’. The current results also consider a single dataset, which is notably ‘crisp’, while our core motivation is to be able to use Debate for the ‘fuzzier’ tasks necessary for automated alignment research.

Q: Why did you focus on Propose-Critique-Decide (PCD) protocols in this post?

We found it easier to get meaningful disagreements between the debaters in PCD rather than two-candidate protocols. Indeed, given that we did not want to assign positions to debaters, the two debaters would often end up presenting ‘essentially the same solution’ in two-candidate settings, and then would either simply agree (if permitted) or would argue about superficial or irrelevant details rather than addressing ‘the core’ of the question. By contrast the PCD setup encourages disagreement by design. We hope to give a more meaningful comparison between the protocols in later work.

Q: What are the differences to Khan et al. 2024?

We suspect many readers will be familiar with Debating with More Persuasive LLMs Leads to More Truthful Answers, which was presented as an Oral at ICML 2024. Our work moves away from this along various dimensions. Firstly, we are primarily interested in Debate as a training procedure and run training experiments, whereas their empirics are purely inference-only. Secondly, all their experiments use the QuALITY dataset, which consists of multiple choice questions (MCQs) about a long input text. To simulate a capability gap between Debaters and Judge, the input text is only available to the Debaters, and not to the Judge; this is referred to as information asymmetry in the literature. By contrast, we consider open-ended tasks without information asymmetry, which we think are more representative of future-world deployments. Thirdly, they always pre-assign positions to their Debaters, while we allow Debaters to choose their initial stance. Finally, their Judges are prompted to give their best-guess answer to the MCQ, and they primarily report Judge accuracy (the accuracy of the Judge’s answers), whereas we prompt Judges to evaluate the quality of the Proposal, and primarily report the evolution of Proposer Accuracy over training. We think moving beyond pre-assigned stances is particularly important, since it is required for the evolution of Proposals under training to even make sense, and avoids violating the ‘no-one is forced to lie’ principle of Irving et al.’s original Debate proposal.

Q: How should we think about the Judge? Is it important for Debate to ‘work’ in regimes where the Judge is ‘weaker than the Debaters’?

Previous empirical work on Debate has been particularly interested in the case where the Judge is weaker than the debaters, and correspondingly focuses on experiments where the Judge is a smaller/older LLM or the debaters have access to certain affordances or information that the Judge lacks. Having a ‘capabilities gap’ is natural given that humans may want to use Debate to oversee significantly super-human systems.

However, we have started to come to the conclusion that labs may want to perform Debate with frontier models as Judges; indeed, given that lab researchers seem to think recent models are no less aligned than previous model versions (e.g. Anthropic suggests 5 Fable is more aligned than previous models), it seems reasonable to want to use the ‘most capable’ possible judge. One can then only hope to be confident to work on tasks for which humans are successfully able to judge ‘leaf nodes’ of debate, on which the Judge can be fine-tuned to be at least as good as humans; it may be that using frontier models as Judges can also incentivise good behaviours on some further tasks, but such success would be contingent upon favourable generalisation.

Correspondingly, we currently think that it is interesting to see empirical research both in regimes where there is a capability gap, to shed light on the human-judging-strong-AI scenario, and in regimes with maximally-capable Judges. That said, the former case may be better addressed by studies that attempt to investigate or enhance the ability of (AI-assisted) human judges to judge ‘hard’ debates.

We do, however, have some uncertainties about the extent to which humans should be involved in judging, and to our knowledge there is no definitive discussion or accepted answer to these concerns at present.

Contributions and Acknowledgements

Joan and Lennie worked together on the project during MATS 9.0, mentored by Shi and Jacob. Lennie wrote the text for this progress update and accepts full responsibility for errors or inadequacies. Joan and Jacob carefully reviewed the post and suggested many improvements.

We thank Jason Brown, Edward Young, Callum Canavan, Gabriel Recchia, Sam Martin and Dewi Gould for helpful feedback on a draft and many others for thoughts and discussions over the course of the project. Lennie would particularly like to thank Jason for valuable advice on how to frame and structure this post. A final massive thank you to Jinghua Ou for all her support for us as RM during MATS!

^[1]
That said, as we discuss below, one may want to apply some auxiliary training pressure to improve the Judge (cf. Iterated Online RLHF).
^[2]
We suspect readers will be particularly familiar with Khan et al. 2024; see the FAQs below for further discussion.
^[3]
We give further discussion of our current thinking about Judges in the FAQs below.
^[4]
We introduce this terminology, but the underlying idea is not new. For example, recent conceptual work on Prover-Estimator debate has a similar asymmetric structure, and was a reason why we were interested to explore such protocols.
^[5]
We filtered to questions with difficulty L4-5, to ensure questions were not ‘too easy’. Dataset curation generally seems very important for meaningful empirics, as we discuss further in What updates have we made?
^[6]
Correspondingly, we can also apply this experimental setup to other easily-scoreable evaluations. Note that we do not train on the scores directly (the Judge is not shown any ‘ground truth’); this has qualitatively different practical/conceptual considerations to ‘constructing reward signals for RL’. For example, if one uses a forecasting dataset then one does not have to worry about overfitting in the same way one would if training with RL on realised data.
^[7]
In particular, we will refer the reader to forthcoming work from Jacob’s Arcadia team that suggests static-Critic may outperform trained Critic in similar settings; this observation pushes against the importance of adversarial training, which was a large part of the motivation for Debate!
^[8]
Though it would obviously be nice if a protocol would incentivise honesty for all participants and even show an increase in Judge Alignment over training! It is also worth considering that the bulk of train-time tasks would not have a ground-truth scorer, but one could see Judge Alignment decreasing if using samples with ground-truth for periodic evaluation.
^[9]
We use ‘levers’ as an umbrella term to cover all design choices for a Debate training run (including hyperparameters, game, and reward choices). We hope the non-technical nature evokes a certain generality, and the notion that a developer may need to ‘position’ them with care.
^[10]
We mean this at the level of the exposition in An illustrative training run, while we see the exposition in Details pertaining to the experiment above as ‘further important details’ referred to in the next sentence.
^[11]
Thanks to Sam and Dewi for sharing an early draft which informed the exposition of BoN above.
^[12]
We know this is related to some recent work by Kozzy Voudouris that we will link when it comes out!
