When does debate help a weak judge? Evidence from code, logic and math

ethanelasky; frank_b_n

2026.06.07 update: We’ve rewritten this entire post for clarity and added a new math setting to bolster our results.

Authors: Ethan Elasky and Frank Nakasako, Palaestra Research; Naman Goyal.

Link to ArXiv preprint

Thanks to Coefficient Giving for support and Thinking Machines for API credits; our mentor for guidance along the way; and Julian Michael, Johannes Gasteiger, and Jiaxin Wen, among others, for helpful conversations.

Summary

Over the last few months, we’ve found positive evidence for debate. Debate improves on consultancy specifically when the weak judge is above a capability threshold. The strong-debater/weaker-judge pairings above this floor (Opus 4.6/4.5, Gemini 3.1 Pro/3 Flash, and Qwen3.5-122B/35B) all show improvement across code, logic and math settings. Judges that fail to improve under debate either (1) do not engage with the critic’s reasoning, instead summarizing it without much interaction, or (2) are already as strong as the strong critic as binary classifiers. The model pairings that debate failed on were Qwen3.5-35B/Qwen3-4B (code, math) and gpt-oss-120B/gpt-oss-20B (code). These families’ judges summarize arguments instead of verifying them and make elementary arithmetic mistakes when they do verify. Finally, we find no benefit to rebuttals in debate; opening-only debate, where only the proposer's solution and an opening speech from the critic are present, performs just as well as debate with rebuttals.

Setup

We test non-adversarial proposer-critic debate, where a proposer debater first generates a solution and a critic debater then chooses whether to agree or disagree with it. The critic gets a chain of thought to decide its stance, and both debaters can see why the critic chose its stance for the rest of the debate. This is followed by N rebuttal turns. The judge then decides whether the proposer’s solution is correct. We score the judge with macro-F1, the average of the F1 scores for the “proposer correct” and “proposer incorrect” classes, which is fairer than accuracy in our setting because the formats shift how often judges agree with the proposer.

Baselines:

Open consultancy: A consultant generates a solution and then has multiple turns to explain their answer.
Opening-only consultancy: A consultant generates a solution but does not get subsequent speeches to explain its solution.
Opening-only debate: The judge can see the proposer’s solution and critic’s stance and reasoning, though it cannot see the transcript’s rebuttal speeches.
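
The four formats differ only in which speeches reach the judge. A minimal sketch of that visibility rule follows; the speech labels, the `Format` class, and the `n_rebuttals` parameter are hypothetical names chosen for illustration, not taken from our codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Format:
    name: str
    has_critic: bool       # debate formats include a critic; consultancy does not
    shows_rebuttals: bool  # opening-only formats hide every speech after the opening


# Hypothetical registry of the four formats described in the post.
FORMATS = {
    "debate": Format("debate", has_critic=True, shows_rebuttals=True),
    "opening_only_debate": Format("opening_only_debate", True, False),
    "consultancy": Format("consultancy", False, True),
    "opening_only_consultancy": Format("opening_only_consultancy", False, False),
}


def judge_view(fmt: Format, n_rebuttals: int) -> list[str]:
    """Return the speech labels the judge sees, in reading order."""
    speeches = ["proposer_solution"]
    if fmt.has_critic:
        speeches.append("critic_opening")  # the critic's stance plus its reasoning
    if fmt.shows_rebuttals:
        for i in range(1, n_rebuttals + 1):
            speeches.append(f"proposer_speech_{i}")
            if fmt.has_critic:
                # Simultaneous speeches: the critic's is shown second.
                speeches.append(f"critic_speech_{i}")
    return speeches
```

For example, `judge_view(FORMATS["opening_only_debate"], n_rebuttals=2)` returns only the proposer's solution and the critic's opening, regardless of how many rebuttal turns took place.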

The four formats we tested. Arrows indicate the passage of information. Read top-down, the table shows the passage of time; e.g., from a debater’s perspective, the speeches Proposer rebuttal 1 and Critic rebuttal 1 happen at the same time. From the judge’s perspective, we always place the critic’s speech second when two speeches are simultaneous.

This roughly follows Zac Kenton’s open debate setup, which allows the proposer to choose their own answer, but his setting forces an antagonist that disagrees with the proposer. Per his suggestion, we deviate from this second choice and allow our critic to decide its own alignment for the fairest baseline compared to open consultancy.

Results

Debate improves judge macro-F1 on wrong proposers without sacrificing macro-F1 on right proposers

We find that debate helps macro-F1 on five of eight pairings and has no effect on the other three. It helps all but the weakest debater-judge pairings: on code, these are the weaker Qwen family and the gpt-oss family, and on math, just the weaker Qwen family.

These gains mostly concentrate in helping judges reject wrong proposers. Debate significantly cuts the judge’s false positive rate (endorsing a wrong proposer answer) but does not impact the false negative rate (rejecting a correct proposer answer). The critic gives the judge concrete grounds to disagree, whether a grid inconsistency on ARC-AGI-2 or a hard test case on code.

Debate helps when the critic provides a usable advantage

Debate helps when the critic is a better binary classifier than the judge and the judge is past the domain-dependent capability level at which it acts as a verifier rather than as a summarizer.

Model pair	Judge recall when critic right	Judge recall when critic wrong	Gap
gpt-oss-120B/20B	0.903	0.564	-0.34
Qwen3.5-122B/35B	0.875	0.466	-0.41
Qwen3.5-35B/Qwen3-4B	0.775	0.522	-0.25

When the critic is wrong, the judge’s recall falls significantly for the stronger and weaker Qwen families and gpt-oss, as shown in the above table. Note that inter-row comparison is likely invalid because there is a selection effect on questions; a better critic gets questions wrong less frequently, and those questions are likely to be harder.

These effects may be compounded by the fact that the judge and the critic in our setting are always from the same family; their failure modes may be highly correlated in ways that a judge from a different family may not be susceptible to.

In other words, much of debate’s current performance likely comes from the critic’s superior ability to classify, and the judge is likely just rubber-stamping, even though we prompt the judge to verify the proposer’s solution for itself and to treat arguments from both sides critically.

Rebuttals add little at test time

We next ablate rebuttal speeches from both debate and consultancy and compare them to the full-length formats.

Domain	Pair	n	Full debate	Opening-only debate	Full consultancy	Opening-only consultancy
Logic	Gemini 3.1 Pro / 3 Flash	120	0.906	0.896	0.766	0.719
Logic	Opus 4.6 / 4.5	119	0.773	0.761	0.615	0.621
Math	Qwen3.5-122B / 35B	738	0.690	0.709	0.535	0.641
Math	gpt-oss-120B / 20B	721	0.746	0.727	0.682	0.750
Math	Qwen3.5-35B / Qwen3-4B	745	0.575	0.625	0.541	0.625
Code	Qwen3.5-122B / 35B	976	0.833	0.846	0.692	0.661
Code	gpt-oss-120B / 20B	975	0.791	0.773	0.777	0.789
Code	Qwen3.5-35B / Qwen3-4B	988	0.739	0.752	0.741	0.729

Macro-F1 per pair × domain × format. n is the per-row 4-format intersection of (dataset, question_idx). Bold is the best-performing format per row. Math pools IMO_AnswerBench, UCMO, and OlymMATH_Hard; Code is CodeContests; Logic is ARC-AGI-2.

On our more capable models, we find that debate and opening-only debate are indistinguishable, and both exceed consultancy and opening-only consultancy. On our less capable models, opening-only consultancy sometimes outperforms the other formats; in manual transcript analysis, we found that full consultancy induces false agreement at a much higher rate than either debate or opening-only consultancy.

Stuff we tried that didn’t work

We list several failed paths for debate below:

1. Dual proposer debate – we abandoned this early on after realizing that debaters have no incentive to engage with each other if you judge each only on final answer quality.

2. Other choices for datasets – we went through a lot of datasets before we settled on our final list of CodeContests+, ARC-AGI-2, OlymMATH-Hard, UCMO, and AnswerBench-IMO. Two main problems with other datasets:

a. Label quality – We found that machine-generated accuracy labels were often wrong, which is a huge problem for binary classification. We dramatically upgraded the quality of labels for the following dataset: AnswerBench-IMO. We abandoned the following due to low answer quality: BigCodeBench, APPS, OMNI-Math.

b. Saturation – We found that our models saturated (up to label accuracy) the following datasets: ZebraLogic, ARC AGI, Knights and Knaves, HumanEval+

Broader implications

These are, to our knowledge, the first open results that show strong evidence for debate’s validity as a self-play mechanism in improving reward allocation. We feel that this is meaningful progress in getting some variant of debate to work in the open, though we would like to highlight some caveats.

First, our results are on the more programmatically verifiable domains of math, code, and logic, where there is an easily scorable final answer. We are still unsure as to whether these results will transfer over to fuzzier tasks, which may yield messier argument decompositions. However, tasks within these fields like experiment design and research steering are increasingly relevant as full automation of AI research approaches the frontier of possibility.

Second, we stress that our nonadversarial debate is optimized for inference-time reward allocation, and we are interested in the training dynamics produced by train-time reward allocation. How optimization pressure will impact debate, especially as we move towards longer RL runs that give debaters more latitude to exploit the protocol, and what characteristics training may elicit out of the debaters, are still important questions.

This makes us excited about a lot of potential directions in AI debate. Our personal next steps are to work on fuzzier tasks that can serve as effective proxies for automation of AI research, study train-time dynamics of different debate protocols, and engineer failures in these protocols to test the theory that debate is built on.

Interesting work.

One possible direction that would be interesting to explore: all your pairings are same-family. Same-family models likely share some core reasoning, thus

1) a debate transcript from the stronger model might help a weaker judge from the same family more than a judge from a different family, since same-family models might understand each other better,

2) but they might also share failure modes, meaning a same-family critic might be systematically blind to the same errors as the weaker judge.

Cross-family testing might surface qualitatively different objections, potentially widening or narrowing the classifier gap.

We'll run these experiments in the next few days and update the preprint accordingly. We think that your 2) point is more likely to be correct, although we'll see what the results say.

Cool! Nice to see more work on debate with some mild positive results. I didn't read super closely but some quick thoughts.

We report macro-F1 over correct and incorrect verdicts because the protocols shift judge priors. In particular, one-sided consultancy can make judges more likely to agree with the proposer, so raw accuracy can hide the direction of the mistakes.

apologies if I'm missing something but what do you mean by F1 in this case? do you mean positive = proposer / first speaker is correct, and negative = proposer / first speaker is incorrect? in prior work we just averaged between orderings in debate and used accuracy. consultancy and debate should both have a 50/50 prior so it's fine. since we expect the model to agree with the proposer more in consultancy, it should bias towards positives. I worry that F1 could bias the results towards debate in this case, because F1 favors a classifier that's more balanced, all else equal. Say it's 50% positive in debate and 90% positive in consultancy but otherwise random. 50% positive random guesses = .5 F1, but 90% positive random guesses = .18 F1. Debate looks way better but actually both are random.

What were the accuracies?

The critic gets a chain of thought to decide its stance, and both debaters can see why the critic chose its stance for the rest of the debate.

hmm, this doesn't sound like adversarial debate, it just sounds like a scaffolded grader... I'm curious what the logic was here. I understand that open consultancy seems like a more realistic baseline in some ways but I think for scalable oversight we really do care about and want to test the adversarial setting. (I might disagree with Zac on this as well.) Otherwise, I think we're just playing around with scaffolding methods.

I wonder if things like the Tinker API would make it easier to do proper self-play RL experiments.

Tl;dr you're right that our form of debate is not adversarial and is in line with Zac's collaborative debate, although we detail a reward allocation strategy below that might make this game work. We want to test both zero-sum and mixed-sum debate under optimization pressure, although this might be expensive. We take your point about macro-F1 but find that the debate vs. consultancy effects are mostly not due to differences in classifier bias and instead are genuine differences between the two (with support from another metric called Youden's J). Sorry this got so long; a lot of this will end up as updates in the preprint and our responses are pretty thorough.

hmm, this doesn't sound like adversarial debate, it just sounds like a scaffolded grader... I'm curious what the logic was here. I understand that open consultancy seems like a more realistic baseline in some ways but I think for scalable oversight we really do care about and want to test the adversarial setting. (I might disagree with Zac on this as well.)

Hi Julian, thanks for the comment -- we test a setting closer to Zac's, which is a mixed-sum, non-adversarial (i.e. collaborative) debate game where the baseline is single open consultancy, and there is no stipulation of a 50/50 prior, so direct QA accuracy on tasks can range from 10% to 75% (see Table 1 in our preprint). Our goal is to do RL on both this setting and the zero-sum setting that you and e.g. Khan et al. 2024 and Arnesen et al. 2024 worked on. The mixed-sum game still awards positive and negative reward when the debaters disagree but does not allocate reward when proposer and critic agree, except when they are both wrong, in which case it is negative reward for both.
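
The mixed-sum allocation described above can be sketched as a small function. This is only an illustration under loud assumptions: the ±1 magnitudes are placeholders, rewards are keyed to ground truth for simplicity (in training, a judge verdict might stand in for it), and the name `mixed_sum_rewards` is hypothetical.

```python
def mixed_sum_rewards(proposer_correct: bool, critic_agrees: bool) -> tuple[float, float]:
    """Return (proposer_reward, critic_reward) for one debate.

    Assumed magnitudes of +/-1; the post does not specify them.
    """
    if critic_agrees:
        if proposer_correct:
            # Both agree and both are right: no reward is allocated.
            return (0.0, 0.0)
        # Both agree and both are wrong: negative reward for both.
        return (-1.0, -1.0)
    # The debaters disagree, so exactly one of them is right.
    if proposer_correct:
        return (1.0, -1.0)
    return (-1.0, 1.0)
```

Under these assumptions the game is zero-sum only on disagreement; agreement on a wrong answer is the one case where both debaters lose, which is what makes it mixed-sum.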

apologies if I'm missing something but what do you mean by F1 in this case? do you mean positive = proposer / first speaker is correct, and negative = proposer / first speaker is incorrect? in prior work we just averaged between orderings in debate and used accuracy. consultancy and debate should both have a 50/50 prior so it's fine. since we expect the model to agree with the proposer more in consultancy, it should bias towards positives. I worry that F1 could bias the results towards debate in this case, because F1 favors a classifier that's more balanced, all else equal. Say it's 50% positive in debate and 90% positive in consultancy but otherwise random. 50% positive random guesses = .5 F1, but 90% positive random guesses = .18 F1. Debate looks way better but actually both are random.

Our form of collaborative debate does not have the 50/50 prior and our stronger proposer accuracy scores lie between 55-78%. Since consultancy has a strong agreement bias, we think that accuracy underweights debate and overweights consultancy. Instead, we chose macro-F1, or the average between positive- and negative-class F1, because we were worried about bias towards consultancy when there is a positive-class imbalance. For example, in Opus 4.6/4.5 consultancy, the judge always agrees with the consultant, which would lead to an accuracy of 60.8%, while our macro-F1 score is .378.
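
The always-agree example can be checked directly. The sketch below computes accuracy and macro-F1 from confusion counts; the counts are a hypothetical 1,000-question sample built to match the 60.8% proposer accuracy quoted above, not our actual data.

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """F1 for one class from its true positives, false positives, false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(tp: int, fp: int, fn: int, tn: int) -> float:
    """Average of positive-class and negative-class F1."""
    pos = f1(tp, fp, fn)
    # For the negative class, true negatives play the role of true positives,
    # and the two error types swap roles.
    neg = f1(tn, fn, fp)
    return (pos + neg) / 2


# Hypothetical sample: 608 correct proposers, 392 wrong, judge always says "correct".
tp, fp, fn, tn = 608, 392, 0, 0
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(round(accuracy, 3))                  # 0.608
print(round(macro_f1(tp, fp, fn, tn), 3))  # 0.378
```

The negative-class F1 is zero because the judge never rejects anything, which is why macro-F1 falls so far below accuracy here.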

We take your point that macro-F1 does depend on positivity rate, and that our results could just be due to differences in judge classification bias. Thanks for helping us notice this -- this is a problem with the preprint as it exists now and we'll be adding some lines to fix it. We think that these effects are not due to chance, as we (a) recalculated F1 using random classifiers with the same positive-prediction rate as our debate and consultancy judges and found that debate still outperforms consultancy by more than chance would predict, and (b) recalculated with Youden's J, which is a binary classification metric (Fable's suggestion initially, though I found this paper helpful in understanding it).

To check whether our macro-F1 results are just due to differences in positivity rate between the debate judge and the consultancy judge, we can score a random classifier with the same positive-prediction rate as the judge in each setting (p in the table below) and see whether the gap is due to chance.

CodeContests+ Qwen 122B/35B (proposer accuracy = 73.7%):

Format	p	chance F1 (positive class)	chance F1 (negative class)	chance macro-F1 (mean of the two)
Debate	0.790	2(.737)(.790)/1.527 = .763	2(.263)(.210)/.473 = .234	0.498
Consultancy	0.864	2(.737)(.864)/1.601 = .795	2(.263)(.136)/.399 = .179	0.487

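The gain for random classifiers for debate vs consultancy would be 0.498 − 0.487 = 0.011 (about 1 pp), but the results table in the post reports macro-F1 of 0.833 for full debate against 0.692 for full consultancy on this pair (about 14 pp), so the gain can't just be due to chance.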
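
The chance baseline in the table can be reproduced in a few lines. A classifier that predicts "correct" at rate p independently of the label has expected precision equal to the base rate and expected recall equal to p, which gives the formulas used in the table. The function name `chance_macro_f1` is hypothetical.

```python
def chance_macro_f1(base_rate: float, positive_rate: float) -> float:
    """Expected macro-F1 of a label-independent classifier.

    base_rate: fraction of proposer answers that are correct.
    positive_rate: fraction of items the classifier marks as correct (p).
    """
    pi, p = base_rate, positive_rate
    # Positive class: precision = pi, recall = p.
    f1_pos = 2 * pi * p / (pi + p)
    # Negative class: precision = 1 - pi, recall = 1 - p.
    f1_neg = 2 * (1 - pi) * (1 - p) / ((1 - pi) + (1 - p))
    return (f1_pos + f1_neg) / 2


debate = chance_macro_f1(0.737, 0.790)
consultancy = chance_macro_f1(0.737, 0.864)

print(round(debate, 3))       # 0.498
print(round(consultancy, 3))  # 0.487
```

The two chance values differ by about one point of macro-F1, which is the portion of a debate vs consultancy gap that positivity rate alone could explain on this pair.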

We can also recompute the stats using Youden's J (sensitivity + specificity − 1), which is used to evaluate diagnostic tests where we care about both positive-class and negative-class judgments, and which is 0 for random classifiers. We find that the debate vs consultancy difference in Youden's J is significant in all five responder cells (the two frontier-model ARC-AGI-2 ones, as well as the one code and two math responders), and will add this statistic to our paper because it answers this question well.
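
For readers unfamiliar with the metric, here is a minimal sketch of Youden's J from confusion counts. The counts in the examples are hypothetical and only illustrate the boundary cases.

```python
def youdens_j(tp: int, fp: int, fn: int, tn: int) -> float:
    """Youden's J = sensitivity + specificity - 1."""
    sensitivity = tp / (tp + fn)  # recall on correct proposer answers
    specificity = tn / (tn + fp)  # recall on wrong proposer answers
    return sensitivity + specificity - 1


# A judge that always agrees with the proposer: J is 0 despite 60.8% accuracy.
print(youdens_j(tp=608, fp=392, fn=0, tn=0))  # 0.0

# A perfect judge on the same hypothetical sample.
print(youdens_j(tp=608, fp=0, fn=0, tn=392))  # 1.0
```

Because J is built from the two per-class recalls, shifting the judge's positivity rate without adding information moves sensitivity and specificity in opposite directions and leaves J at 0 in expectation.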

Actual accuracy numbers:

(Deltas: debate - consultancy, paired bootstrap 95% CI (pp))

Domain	Model	ΔAccuracy	ΔBalanced accuracy
Logic	Gemini	+10.8 [+5.0, +17.5] p=.001	+14.0 [+6.3, +22.1] p=.001
Logic	Opus	+10.9 [+4.2, +17.6] p=.001	+13.1 [+4.8, +21.2] p=.001
Math	Qwen 122B/35B	+2.5 [+1.0, +4.1] p=.001	+6.3 [+3.2, +9.5] p<.0005
Math	Qwen 35B/4B	+1.5 [−0.5, +3.4] p=.115	+2.3 [−0.8, +5.2] p=.142
Math	gpt-oss	+5.0 [+1.8, +8.1] p=.001	+5.8 [+2.6, +9.2] p<.0005
Code	Qwen 122B/35B	+7.6 [+5.2, +9.7] p<.0005	+14.3 [+10.8, +17.5] p<.0005
Code	Qwen 35B/4B	+0.6 [−2.2, +3.4] p=.684	−0.7 [−3.9, +2.4] p=.632
Code	gpt-oss	+1.3 [−0.4, +3.1] p=.117	+1.1 [−1.1, +3.1] p=.348
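
A paired bootstrap for the accuracy delta can be sketched as follows. This is a generic percentile bootstrap written for illustration, not our exact analysis code: the resample count, the seed, and the toy data are all assumptions, and the balanced-accuracy column would need the metric recomputed inside each resample rather than a simple mean of differences.

```python
import random


def paired_bootstrap_delta(debate_correct, consultancy_correct,
                           n_resamples=2000, seed=0):
    """Percentile 95% CI for mean(debate - consultancy) over paired questions.

    Both inputs are 0/1 lists aligned by question, so each resample draws
    questions (pairs) rather than drawing the two formats independently.
    """
    if len(debate_correct) != len(consultancy_correct):
        raise ValueError("inputs must be paired by question")
    rng = random.Random(seed)
    diffs = [d - c for d, c in zip(debate_correct, consultancy_correct)]
    n = len(diffs)
    observed = sum(diffs) / n
    deltas = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        deltas.append(sum(sample) / n)
    deltas.sort()
    lo = deltas[int(0.025 * n_resamples)]
    hi = deltas[int(0.975 * n_resamples) - 1]
    return observed, lo, hi


# Toy data (hypothetical): debate judge right on 80/100, consultancy on 70/100.
debate_toy = [1] * 80 + [0] * 20
consultancy_toy = [1] * 70 + [0] * 30
observed, lo, hi = paired_bootstrap_delta(debate_toy, consultancy_toy)
```

Resampling question-level differences keeps the pairing intact, which matters here because both formats are scored on the same questions.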

@ethanelasky per the other thread, this was pretty insightful. Can I ask, how have you settled on the better judges and critics? Have you introduced anything specific to encourage those skills? And I’m a bit confused about the number of rebuttals, and the implementation of/results of the rebuttals in general. In other words, I can discern the (lack of) improvement from rebuttals, but not clear what was actually seen there. I couldn’t pick that out in the transcripts?