Thanks to Coefficient Giving for support and Thinking Machines for API credits; our mentor for guidance along the way; and Julian Michael, Johannes Gasteiger, and Jiaxin Wen, among others, for helpful conversations.
Tl;dr: Debate improved over baselines when the weak judge was above a capabilities threshold. The strong-debater/weaker-judge pairings Opus 4.6/4.5, Gemini 3.1 Pro/3 Flash, and Qwen3.5-122B/35B all showed improvement across code and logic settings. Judges that failed to improve under debate either (1) did not treat the critic's reasoning as something to engage with and check, instead summarizing it without much interaction, or (2) were already as strong as the critic at binary classification. These failures occurred in the Qwen3.5-35B/Qwen3-4B and gpt-oss-120B/gpt-oss-20B pairings. Finally, rebuttals did not help debate in our experiments; opening-only debate, where the judge sees only the proposer's solution and an opening speech from the critic, performed just as well as full debate.
What we did
This is a writeup of experiments we ran on debate as a reward-labeling protocol. The basic question: if a weaker judge is trying to decide whether a stronger model's answer is correct, does it help to show the judge a debate between two copies of the stronger model?
We tested this in a relatively clean setting. The proposer generated an answer to a code or logic task. The judge then had to label the answer as correct or incorrect. In the debate condition, a critic also looked at the proposer's answer and either agreed or disagreed, with a chance to give reasons. We compared this against one-sided open consultancy, where the proposer generated and defended its own answer but there was no independent critic.
We chose code and ARC-style logic tasks because we wanted to be able to programmatically verify generated answers. The long-term motivation is the harder case: research proposals, experimental plans, long-horizon agentic work, and other domains where it is hard for a human or model judge to tell whether the answer is actually good. But if we start there, we cannot tell whether the reward labels are right. Code and logic give us a way to study the mechanism while still being able to audit the labels.
Key takeaways and relevance for alignment
Debate improved reward labels on three of five stronger-debater / weaker-judge pairings. These were also our most capable pairings (Opus 4.6 judged by Opus 4.5, Gemini 3.1 Pro judged by Gemini 3 Flash, Qwen3.5-122B judged by Qwen3.5-35B). On those pairings, debate beat consultancy by roughly 16, 14, and 14 percentage points of macro-F1. The two remaining pairings were basically null.
The usual generator-verifier story was not enough. In every pairing, the debater model was better at verifying than at generating. But only some of those gaps translated into better judge labels. What mattered was whether the critic beat the judge as a classifier of the proposer's answer, and whether judges took criticism as an opportunity to verify rather than as authority-based testimony. The latter tended to emerge as model capabilities increased.
When debate helped, it mostly helped by making judges stop accepting wrong answers. This is the error type we most care about for reward modeling: rewarding a bad answer teaches the policy that flawed reasoning is acceptable.
Most of the test-time benefit came from the critic's first message. Removing rebuttal rounds left macro-F1 essentially unchanged on responder pairings. We do not take this as evidence that rebuttals have no training-time value, only that they did not change these inference-time labels.
We see this as a mechanism and a diagnostic, not as a complete safety story. Before using debate for reward labeling on a new distribution, it's important to audit whether the critic actually has a classification advantage over the judge, and whether the judge uses the critique as an opportunity to verify rather than as testimony.
Motivation
A central problem in scalable oversight is that we would like to train models to do things that are hard to evaluate. If a system proposes an experiment, writes a long research plan, or takes a sequence of actions in a codebase, the supervisor may not be able to thoroughly check the work. This is the setting where naive reward modeling is most fragile.
The basic worry is simple: if we pay an expert based on how good their answer looks to a non-expert, the expert's incentive is to be maximally persuasive to the non-expert. Sometimes persuasiveness is correlated with quality, especially on smaller tasks. However, as the scale of a problem increases and quality becomes harder for the non-expert to inspect, what is persuasive increasingly diverges from what is actually good, especially as the non-expert faces limits on attention, time, and cognition.
Debate is one proposed solution. Instead of asking the judge to evaluate the answer alone, we let another model look for problems. The critic can focus disagreement on one part of the original task, transforming the original larger problem into a smaller one that is easier for the judge to check. If this works, debate should enable judges to falsify local claims as a proxy for global claims.
Previous empirical work on debate has been mixed. Some settings look positive, especially when the judge lacks access to evidence that debaters can surface, and when the consultant is assigned the wrong answer 50% of the time. Other settings, including visible-information weak-judge settings closer to ours, find small or null effects. The question we answer is: in a visible-information setting, where the judge can see the problem and the candidate answer, when does adding a critic help the judge produce a better reward label?
Experimental setup
We compared four main protocols on the same tasks, alongside a Direct QA baseline.
The four inference-time protocols. Consultancy gives the judge one-sided advocacy from the proposer; debate adds an independent critic. Opening-only variants strip later rebuttals to isolate the contribution of the critic's first message.
Direct QA: the judge tries to solve the problem itself. Not a reward-labeling protocol, but it tells us roughly how hard the task is for the judge.
Consultancy: the proposer generates an answer and then gives one-sided speeches defending it.
Opening-only consultancy: the judge sees the problem and the proposer's answer, but no later advocacy. This is the closest analogue of a bare LLM-as-reward-model labeler.
Debate: the proposer generates and defends an answer, while a critic independently decides whether to agree or disagree and argues for that stance. The judge gives a final verdict on the proposer's answer.
Opening-only debate: the judge sees the problem, the proposer's answer, and the critic's opening speech, but not the later rebuttals. This tests whether most of debate's benefit comes from the critic's opening speech or from the later rebuttals.
The important comparison is debate versus consultancy. Both start from the proposer's answer. Consultancy gives the judge only one-sided advocacy; debate adds an independent verification signal.
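The difference between the four protocols comes down to which speeches the judge gets to read on top of the problem and the proposer's answer. The sketch below only illustrates that visibility rule; `Protocol`, `Speech`, and `visible_speeches` are hypothetical names, not taken from our codebase. It also filters a single transcript for convenience, whereas a real consultancy transcript would be generated with no critic present at all.

```python
from dataclasses import dataclass
from enum import Enum


class Protocol(Enum):
    CONSULTANCY = "consultancy"
    OPENING_ONLY_CONSULTANCY = "opening_only_consultancy"
    DEBATE = "debate"
    OPENING_ONLY_DEBATE = "opening_only_debate"


@dataclass(frozen=True)
class Speech:
    speaker: str  # "proposer" or "critic" (hypothetical labels)
    text: str


def visible_speeches(protocol, speeches):
    """Return the speeches the judge reads, in chronological order.

    The problem and the proposer's answer are always visible and are
    not modeled here.
    """
    if protocol is Protocol.OPENING_ONLY_CONSULTANCY:
        # Bare answer only: no advocacy from anyone.
        return []
    if protocol is Protocol.CONSULTANCY:
        # One-sided advocacy: proposer speeches only.
        return [s for s in speeches if s.speaker == "proposer"]
    if protocol is Protocol.OPENING_ONLY_DEBATE:
        # Only the critic's first speech; later rebuttals are dropped.
        for s in speeches:
            if s.speaker == "critic":
                return [s]
        return []
    # Full debate: everything from both sides.
    return list(speeches)
```

Under this framing, opening-only debate differs from opening-only consultancy by exactly one message, which is what lets it isolate the critic's first speech.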
We used five model pairings across code and logic. In each pairing, the debater was stronger than the judge on direct QA.
| Domain | Debater | Judge | Debater Direct QA | Judge Direct QA |
| --- | --- | --- | --- | --- |
| Code | Qwen3.5-122B | Qwen3.5-35B | 0.74 | 0.65 |
| Code | Qwen3.5-35B | Qwen3-4B | 0.65 | 0.52 |
| Code | gpt-oss-120B | gpt-oss-20B | 0.68 | 0.59 |
| Logic | Gemini 3.1 Pro | Gemini 3 Flash | 0.63 | 0.13 |
| Logic | Opus 4.6 | Opus 4.5 | 0.61 | 0.29 |
Model pairings used in the draft. Direct QA numbers are rounded for readability.
For code, we used CodeContests+, a competitive programming dataset. For logic, we used ARC-AGI-2, where an answer is correct only if the predicted grid exactly matches the hidden target. We report macro-F1 over correct and incorrect verdicts because the protocols shift judge priors. In particular, one-sided consultancy can make judges more likely to agree with the proposer, so raw accuracy can hide the direction of the mistakes.
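To make the choice of metric concrete, here is a minimal sketch of macro-F1 over the two verdict classes, next to raw accuracy. The function names and the toy labels are illustrative assumptions, not our evaluation code or data. The toy judge agrees with the proposer on every item, which is the kind of prior shift that accuracy can hide.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 score treating `cls` as the positive class."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred, classes=("correct", "incorrect")):
    """Unweighted mean of per-class F1 scores."""
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data (made up): 8 correct proposer answers, 2 incorrect ones,
# and a judge that labels everything "correct".
truth = ["correct"] * 8 + ["incorrect"] * 2
always_agree = ["correct"] * 10

print(accuracy(truth, always_agree))   # 0.8
print(macro_f1(truth, always_agree))   # about 0.44
```

The always-agree judge scores 80% accuracy but has zero F1 on the incorrect class, so macro-F1 drops to roughly 0.44 and exposes that every mistake is a false positive.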
What happened
Debate beat consultancy on three pairings and did not help on two.
Debate versus consultancy across our main evaluation settings. Debate helps most when the critic is much better than the judge at classifying the proposer's answer.
| Pairing | Task | Debate − consultancy (macro-F1) | Interpretation |
| --- | --- | --- | --- |
| Qwen3.5-122B / Qwen3.5-35B | CodeContests+ | +14.0 pp | Responder |
| Gemini 3.1 Pro / Gemini 3 Flash | ARC-AGI-2 | +14.0 pp | Responder |
| Opus 4.6 / Opus 4.5 | ARC-AGI-2 | +15.7 pp | Responder |
| Qwen3.5-35B / Qwen3-4B | CodeContests+ | −0.2 pp | Non-responder |
| gpt-oss-120B / gpt-oss-20B | CodeContests+ | +1.4 pp | Non-responder |
Debate versus consultancy. The first three pairings are statistically significant; the last two are null in our experiments.
The gain was asymmetric. Debate was most useful when the proposer was wrong. On the responder pairings, adding the critic made judges less likely to endorse incorrect proposer answers, while false negatives stayed roughly flat. This is a useful shape for reward modeling: false positives train the model toward bad work, while false negatives mostly withhold reward from good work.
Class-specific F1 reveals where debate helps. All three responder pairings show large incorrect-class lifts; at the verdict-share level the dominant stratum is incorrect-proposer rejection on Qwen3.5-122B/35B and Gemini, and correct-proposer acceptance on Opus, reflecting Opus 4.5's opposite no-transcript prior.
This also made the qualitative behavior easier to understand. Consultancy often gave the judge a plausible frame for why the proposer might be right. Debate supplied an alternative frame or a concrete counterexample: a code input to trace, a lower bound to check, a cell-level invariant in an ARC grid. When the judge actually checked that objection, the verdict improved.
What separated the cases where debate worked from the cases where it did not
The most tempting explanation is the generator-verifier gap. The critic is the same model family as the proposer, and models are often better at verifying answers than generating them. If that were enough, we might expect the critic to help in every pairing.
That is not what we found. Every proposer-critic pairing had a positive generator-verifier gap. But the two non-responder pairings also had such gaps, and debate still did not help. In these cases, the critic did not meaningfully improve on the judge's own classification ability.
Debate's lift over consultancy is largest where the critic's classifier macro-F1 exceeds the lone judge's. Panel A shows the generator-verifier gap (the same model's accuracy as a verifier minus its accuracy as a Direct QA generator), which is positive on every pairing. Panel B shows the critic's classifier macro-F1 minus the opening-only consultancy judge's, a non-debate analog of RLHF-style verifier evaluation. Debate gains track Panel B, not Panel A: the gen-verify gap is necessary but not sufficient.
The quantity that tracked the split was the critic's classifier macro-F1 minus the opening-only consultancy judge's macro-F1. In the responder pairings, the critic had a better signal for the judge to import. In the non-responder pairings, the critic did not have much of a signal advantage over the judge. More debate tokens did not fix that.
There was also a behavioral condition. The judge had to treat the critic's statement as a claim to verify, not as a piece of testimony to summarize. In the responder pairings, judges tended to do their own checking. In the non-responder pairings, verification rates dropped sharply once a critic entered the transcript.
| Pairing / domain | Consultancy | Opening-only cons. | Debate | Opening-only debate |
| --- | --- | --- | --- | --- |
| Qwen3.5-122B/35B, CC | 99% | 94% | 98% | 97% |
| Gemini 3.1 Pro/Flash, ARC | 88% | 78% | 91% | 100% |
| Opus 4.6/4.5, ARC | 100% | 98% | 100% | 100% |
| gpt-oss-120B/20B, CC | 82% | 67% | 31% | 16% |
| Qwen3.5-35B/Qwen3-4B, CC | 67% | 61% | 53% | 25% |
Behavior-reviewed verification rate on correct judge verdicts. The pattern is high verification across formats for responders, and a sharp drop in critic-present formats for non-responders.
This is the main thing we would audit before using debate on a new distribution. First ask whether the critic actually beats the no-transcript judge as a classifier. Then inspect whether the judge checks critic claims. If the critic does not beat the judge, or if the judge only paraphrases the critic, debate is unlikely to help.
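One way to turn this audit into something runnable is sketched below. It assumes you already have a macro-F1 score for the critic and for the no-transcript judge on a small labeled slice, plus manual transcript reviews recording whether the judge actually verified. All names are hypothetical, and the thresholds `min_gap` and `min_rate` are placeholders chosen for illustration; we did not derive cutoffs like these from our experiments.

```python
def classifier_gap(critic_macro_f1, judge_macro_f1):
    """Critic's classifier macro-F1 minus the no-transcript judge's."""
    return critic_macro_f1 - judge_macro_f1


def verification_rate(reviews):
    """Share of correct judge verdicts where the judge did its own checking.

    `reviews` is a list of (verdict_correct, judge_verified) boolean pairs
    from manual transcript review. Returns None if no verdict was correct.
    """
    on_correct = [verified for verdict_correct, verified in reviews if verdict_correct]
    if not on_correct:
        return None
    return sum(on_correct) / len(on_correct)


def debate_looks_promising(gap, rate, min_gap=0.05, min_rate=0.8):
    """Both conditions must hold; thresholds are illustrative placeholders."""
    return rate is not None and gap >= min_gap and rate >= min_rate


# Toy review data (made up): 3 correct verdicts, 2 of them verified.
reviews = [(True, True), (True, True), (True, False), (False, False)]
gap = classifier_gap(critic_macro_f1=0.75, judge_macro_f1=0.5)
print(gap, verification_rate(reviews), debate_looks_promising(gap, verification_rate(reviews)))
```

The point of keeping the two checks separate is that they fail independently: a critic can have a real signal advantage that the judge never uses, and a diligent judge cannot import a signal the critic does not have.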
A case where debate worked
One CodeContests example makes the mechanism fairly concrete. The task was Construct a tree. The proposer's solution used a feasibility check that was too weak: it checked whether s < n, but the true minimum sum of subtree sizes for a rooted tree on n nodes is 2n − 1, achieved by a star.
Without the critic, the judge accepted the solution. It retraced the provided samples and a trivial edge case, and produced a table of checks that looked like verification. But it never constructed the relevant new input. This is a common failure mode: the judge performs local checking inside the proposer's frame, rather than testing the frame itself.
With the critic, the judge had a different thing to check. The critic pointed to the star-graph lower bound and gave a concrete failing case. The judge re-derived the bound and traced an input like n=3, s=3, where the requested sum is impossible because the minimum is 5. The verdict flipped from accepting the proposer to rejecting it.
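The lower bound in this example is small enough to check by brute force. The sketch below is our own illustration, not the critic's or the judge's actual reasoning: it enumerates rooted trees on n nodes by choosing, for each non-root node, a parent with a smaller index, then takes the minimum sum of subtree sizes. The function names are hypothetical.

```python
from itertools import product


def subtree_size_sum(parents):
    """Sum of subtree sizes for a rooted tree.

    Node 0 is the root; parents[i - 1] is the parent of node i and is
    always smaller than i, so children can be folded in from last to first.
    """
    n = len(parents) + 1
    size = [1] * n
    for child in range(n - 1, 0, -1):
        size[parents[child - 1]] += size[child]
    return sum(size)


def min_subtree_size_sum(n):
    """Minimum over all parent assignments with parent index < child index."""
    choices = [range(i) for i in range(1, n)]
    return min(subtree_size_sum(p) for p in product(*choices))


for n in range(1, 7):
    assert min_subtree_size_sum(n) == 2 * n - 1

# The counterexample from the transcript: n=3, s=3 is below the minimum of 5.
print(min_subtree_size_sum(3))  # 5
```

The star attains the minimum because the root contributes n and each of the other n − 1 nodes contributes 1, giving 2n − 1. Any requested sum below that, such as s=3 for n=3, cannot be realized.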
Things to note:
The critic did not need to solve a whole new task. It needed to surface a falsifiable local claim.
The judge did not need to become a competitive programming expert. It needed to check the lower bound and trace a small counterexample.
The useful unit was closer to answer, critique, judge than to a long back-and-forth debate.
A case where debate failed
The non-responder failures are also informative. In a gpt-oss-120B / gpt-oss-20B CodeContests example, the proposer's code computed the right closed form for n >= 1 but failed at n=0. The critic correctly pointed to the n=0 edge case. This should have been exactly the kind of objection debate is meant to surface.
But the judge did not actually check it. It wrote a verification table whose n=0 row asserted a special case that did not exist in the code. The code had no n==0 guard; when n=0, the loop body does not run, m remains 1, and the program prints 4. The judge nevertheless rejected the critic's objection with high confidence.
This is not a failure where the critic's signal was absent. The signal was in the transcript. The failure was that the judge treated the transcript like testimony to reconcile, rather than as claims to verify against the code. This is why we do not want to describe debate as automatically helping whenever a critic is present.
Rebuttals did not change test-time verdicts
We also ran opening-only debate, where the judge sees the proposer's answer and the critic's initial stance, but no later rebuttal speeches. On every responder pairing, full debate and opening-only debate were statistically indistinguishable on macro-F1, and they agreed on 94-99% of paired verdicts.
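The paired-agreement figure is a simple comparison of the two verdicts on each item. A minimal sketch, with hypothetical names and made-up verdicts rather than our analysis code:

```python
def paired_agreement(full_debate, opening_only):
    """Fraction of items where both protocols produced the same verdict."""
    if len(full_debate) != len(opening_only):
        raise ValueError("verdict lists must be paired item by item")
    same = sum(a == b for a, b in zip(full_debate, opening_only))
    return same / len(full_debate)


def flipped_items(full_debate, opening_only):
    """Indices where adding the rebuttal rounds changed the verdict."""
    return [i for i, (a, b) in enumerate(zip(full_debate, opening_only)) if a != b]


# Toy verdicts (made up) on five items.
full = ["correct", "incorrect", "incorrect", "correct", "incorrect"]
opening = ["correct", "incorrect", "correct", "correct", "incorrect"]

print(paired_agreement(full, opening))  # 0.8
print(flipped_items(full, opening))     # [2]
```

Listing the flipped items is useful in practice because those are the only transcripts where rebuttals could have mattered, so they are the ones worth reading.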
Our interpretation is narrow. For test-time reward labeling in these domains, the critic's first message recovered the useful signal. Once the judge had a good objection and was willing to check it, additional advocacy rarely changed the verdict.
We do not take this as a negative result about training-time debate. Later speeches might still matter under optimization pressure. They might produce more diverse trajectories, better exploration, or richer preference data. Our experiment did not test that. It only says that, for these train-free labels, a single independent critique was a strong substitute for full debate.
What we think this means
Our current view is that debate is useful when it turns a hard holistic judgment into a smaller claim that the judge can check. In the successful cases, the critic supplied a falsifiable check: a counterexample input, a held-out-grid invariant, a lower bound, or some other local reason to distrust the proposer's frame. The judge then did enough independent work to verify the check.
This is different from saying that debate is generally better than consultancy. It is also different from saying that multi-round debate is the right primitive. For the train-free reward-labeling use case, the cheaper primitive may often be: generate an answer, ask an independent critic to check it once, and give both to the judge.
There is also a capabilities story: the least capable judges were less likely to verify at all, and when they did, they made more logic and reasoning errors.
The result also suggests a pre-deployment audit if you want to use debate in its current form at inference time. Before spending a lot of inference on debate, measure a small labeled slice of the target distribution. Ask whether the critic's agree/disagree F1 score is more accurate than the judge's. Then run short debates and review transcripts to see whether the judge actually verifies critic claims or just treats them as testimony. If those checks fail, more rebuttal rounds are unlikely to solve the problem.
Current concerns
There are several reasons not to overread these experiments.
Small N. We only tested five model pairings. The critic-judge classifier gap is a good explanation of our experimental set, not a law established at scale.
Prompt sensitivity. The Qwen3.5-122B / 35B CodeContests lift shrank under a simpler judge prompt, though it remained significant. Some of the original gap came from a depressed consultancy baseline rather than a stronger debate condition. This prompt effect was somewhat surprising to us: our original judge prompt, which encouraged skepticism, substantially decreased recall without much improvement in precision, i.e., it produced a large increase in false negatives without a corresponding decrease in false positives.
Verifiable domains. We used code and ARC because they have oracles. The domains we care most about do not. We do not yet know whether the same mechanism transfers to fuzzy tasks.
Test-time labels only. Better labels at inference time do not guarantee that policies trained on those labels improve or remain robust.
Judge behavior matters. A critic can put a correct objection in the transcript and still fail if the judge does not check it.
This is not a full safety case for debate. It is a test of one prerequisite: can debate give a weaker judge a better correctness signal than one-sided baselines in calibrated domains?
What we are trying next
The point of using code and logic was to get a clean read on the mechanism. The next step is to relax the things this setting holds fixed: fuzziness, agentic length, oracle structure, and optimization pressure.
Rubric-graded math
Final-answer-graded math is a near miss as a testbed. In pilot debates, intermediate math reasoning that was illogical often ended up arriving at a correct answer, which was the only part of the answer we told the judge to vote for or against. To test whether this was due to memorization or unfaithful reasoning, we collected a dataset of math olympiad problems released in the last few months. Even on the new dataset, which we call Uncontaminated Math Olympiad 2026, models were still arriving at correct answers with nonsensical intermediate steps, so we can conclude that the models, Qwen most obviously, are laundering reasoning. Thus, the connection between falsifying the process and falsifying the outcome cannot be established, and debate's potential to supervise math final answers breaks down. Rubric-graded math seems like a better bridge. Each rubric item is a local claim that can in principle be checked, which is exactly the shape of evidence that helped in our successful code and ARC cases.
Debates over agentic code trajectories
Our current setup is single-turn: one answer, one critique, one verdict. Real coding agents produce long action-observation traces. The interesting monitoring question is whether a critic can help a judge catch trajectory-level failures: undocumented edits, reward-hacking attempts, cover-up behavior, or inconsistencies that look fine at any single step. This changes the framing from reward labeling to monitoring, but the same mechanism may apply: the critic points the judge to the part of the trajectory worth checking.
Long-horizon fuzzy tasks
This is the direction that matters most for the long-term motivation. A scientific proposal may hide a causal assumption. An experiment may hide a confound. A long piece of writing may fail through inconsistency across sections. The hypothesis is that debate can surface these objections before reward is assigned. Our current experiments motivate that hypothesis; they do not establish it.
Training-time debate
The test-time rebuttal result leaves two possibilities. A pessimist might say that rebuttals are dead weight and we should train only single-turn critics. An optimist might say that current debaters do not yet know how to use later rounds, but optimization pressure could make those rounds more useful. We are interested in tracking not just macro-F1, but transcript-level behavior: do critics learn to produce more falsifiable objections, do judges learn to compare frames, and does the responder / non-responder split move as critic quality improves?
Bottom line
A critic supplies a checkable objection, and the judge verifies it. This improved weak-judge reward labels on three of five model pairings, mainly by reducing endorsement of wrong proposer answers. It failed when the critic had no better signal to import, or when the judge summarized the critic instead of checking it.
The question we care about now is whether this mechanism survives in the places where scalable oversight is actually needed. Can debate turn a fuzzy problem into smaller assumptions and local claims that a weaker judge can evaluate more reliably than the original answer alone? Our results are evidence that this can happen in code and logic. They are not yet evidence that it works for research assistance, long-horizon agency, or other open-ended domains.
The debate-execution scaffolding in our codebase was originally written by Samuel Arnesen: GitHub. We have substantially extended it for the experiments and analyses reported here.
Authors: Ethan Elasky and Frank Nakasako, Palaestra Research; Naman Goyal, Independent.
ArXiv link: [will be here when available] (for now, preprint Google Drive link)
Thanks to Coefficient Giving for support and Thinking Machines for API credits; our mentor for guidance along the way; and Julian Michael, Johannes Gasteiger, and Jiaxin Wen, among others, for helpful conversations.
Tl;dr: Debate improved over baselines when the weak judge was above a capabilities threshold. The strong-debater/weaker-judge pairings Opus 4.6/4.5, Gemini 3.1 Pro/3 Flash, and Qwen3.5-122B/35B all showed improvement across code and logic settings. Judges that failed to improve at debate either (1) did not use critic reasoning as an opportunity to engage with critic reasoning, instead summarizing it without much interaction, or (2) were already as strong as the strong critic in terms of binary classification strength. These failures happened in Qwen3.5-35B/Qwen3-4B and gpt-oss-120B/gpt-oss-20B pairings. Finally, rebuttals do not help debate in our case; opening-only debate, where only the proposer's solution and an opening speech from the critic are present, is just as performant as full debate.
What we did
This is a writeup of experiments we ran on debate as a reward-labeling protocol. The basic question: if a weaker judge is trying to decide whether a stronger model's answer is correct, does it help to show the judge a debate between two copies of the stronger model?
We tested this in a relatively clean setting. The proposer generated an answer to a code or logic task. The judge then had to label the answer as correct or incorrect. In the debate condition, a critic also looked at the proposer's answer and either agreed or disagreed, with a chance to give reasons. We compared this against one-sided open consultancy, where the proposer generated and defended its own answer but there was no independent critic.
We chose code and ARC-style logic tasks because we wanted to be able to programmatically verify generated answers. The long-term motivation is the harder case: research proposals, experimental plans, long-horizon agentic work, and other domains where it is hard for a human or model judge to tell whether the answer is actually good. But if we start there, we cannot tell whether the reward labels are right. Code and logic give us a way to study the mechanism while still being able to audit the labels.
Key takeaways and relevance for alignment
Debate improved reward labels on three of five stronger-debater / weaker-judge pairings. These were also our most capable pairings (Opus 4.6 judged by Opus 4.5, Gemini 3.1 Pro judged by Gemini 3 Flash, Qwen3.5-122B judged by Qwen3.5-35B). On those pairings, debate beat consultancy by roughly 16, 14, and 14 percentage points of macro-F1. The two remaining pairings were basically null.
The usual generator-verifier story was not enough. In every pairing, the debater model was better at verifying than at generating. But only some of those gaps translated into better judge labels. What mattered was whether the critic beat the judge as a classifier of the proposer's answer, and whether judges took criticism as an opportunity to verify rather than as authority-based testimony. The latter tended to emerge as model capabilities increased.
When debate helped, it mostly helped by making judges stop accepting wrong answers. This is the error type we most care about for reward modeling: rewarding a bad answer teaches the policy that flawed reasoning is acceptable.
Most of the test-time benefit came from the critic's first message. Removing rebuttal rounds left macro-F1 essentially unchanged on responder pairings. We do not take this as evidence that rebuttals have no training-time value, only that they did not change these inference-time labels.
We see this as a mechanism and a diagnostic, not as a complete safety story. Before using debate for reward labeling on a new distribution, it's important to audit whether the critic actually has a classification advantage over the judge, and whether the judge uses the critique as an opportunity to verify rather than as testimony.
Motivation
A central problem in scalable oversight is that we would like to train models to do things that are hard to evaluate. If a system proposes an experiment, writes a long research plan, or takes a sequence of actions in a codebase, the supervisor may not be able to thoroughly check the work. This is the setting where naive reward modeling is most fragile.
The basic worry is simple: if we pay an expert based on how good their answer looks to a non-expert, the expert's incentives are to be maximally persuasive to the non-expert. Sometimes this is correlated with quality, especially on smaller tasks. However, as the scale of a problem increases and quality is harder for the non-expert to inspect, what is persuasive increasingly diverges from what is quality, especially as the non-expert faces limits on attention, time, and cognition.
Debate is one proposed solution. Instead of asking the judge to evaluate the answer alone, we let another model look for problems. The critic can focus disagreement on one part of the original task, transforming the original larger problem into a smaller one that is easier for the judge to check. If this works, debate should enable judges to falsify local claims as a proxy for global claims.
Previous empirical work on debate has been mixed. Some settings look positive, especially when the judge lacks access to evidence that debaters can surface, and when the consultant is assigned the wrong answer 50% of the time. Other settings, including visible-information weak-judge settings closer to ours, find small or null effects. The question we answer is: in a visible-information setting, where the judge can see the problem and the candidate answer, when does adding a critic help the judge produce a better reward label?
Experimental setup
We compared four main protocols on the same tasks.
The important comparison is debate versus consultancy. Both start from the proposer's answer. Consultancy gives the judge only one-sided advocacy; debate adds an independent verification signal.
We used five model pairings across code and logic. In each pairing, the debater was stronger than the judge on direct QA.
Domain
Debater
Judge
Debater Direct QA
Judge Direct QA
Code
Qwen3.5-122B
Qwen3.5-35B
0.74
0.65
Code
Qwen3.5-35B
Qwen3-4B
0.65
0.52
Code
gpt-oss-120B
gpt-oss-20B
0.68
0.59
Logic
Gemini 3.1 Pro
Gemini 3 Flash
0.63
0.13
Logic
Opus 4.6
Opus 4.5
0.61
0.29
Model pairings used in the draft. Direct QA numbers are rounded for readability.
For code, we used CodeContests+, a competitive programming dataset. For logic, we used ARC-AGI-2, where an answer is correct only if the predicted grid exactly matches the hidden target. We report macro-F1 over correct and incorrect verdicts because the protocols shift judge priors. In particular, one-sided consultancy can make judges more likely to agree with the proposer, so raw accuracy can hide the direction of the mistakes.
What happened
Debate beat consultancy on three pairings and did not help on two.
Pairing
Task
Debate − consultancy
Interpretation
Qwen3.5-122B / Qwen3.5-35B
CodeContests+
+14.0 pp
Responder
Gemini 3.1 Pro / Gemini 3 Flash
ARC-AGI-2
+14.0 pp
Responder
Opus 4.6 / Opus 4.5
ARC-AGI-2
+15.7 pp
Responder
Qwen3.5-35B / Qwen3-4B
CodeContests+
−0.2 pp
Non-responder
gpt-oss-120B / gpt-oss-20B
CodeContests+
+1.4 pp
Non-responder
Debate versus consultancy. The first three pairings are statistically significant; the last two are null in our experiments.
The gain was asymmetric. Debate was most useful when the proposer was wrong. On the responder pairings, adding the critic made judges less likely to endorse incorrect proposer answers, while false negatives stayed roughly flat. This is a useful shape for reward modeling: false positives train the model toward bad work, while false negatives mostly withhold reward from good work.
This also made the qualitative behavior easier to understand. Consultancy often gave the judge a plausible frame for why the proposer might be right. Debate supplied an alternative frame or a concrete counterexample: a code input to trace, a lower bound to check, a cell-level invariant in an ARC grid. When the judge actually checked that objection, the verdict improved.
What separated the cases where debate worked from the cases where it did not
The most tempting explanation is the generator-verifier gap. The critic is the same model family as the proposer, and models are often better at verifying answers than generating them. If that were enough, we might expect the critic to help in every pairing.
That is not what we found. Every proposer-critic pairing had a positive generator-verifier gap. But the two non-responder pairings also had such gaps, and debate still did not help. In these cases, the critic did not meaningfully improve on the judge's own classification ability.
Debate's lift over consultancy is largest where the critic's classifier macro-F1 exceeds the lone judge's. Panel A shows the generator-verifier gap (the same model's accuracy as a verifier minus its accuracy as a Direct QA generator), which is positive on every pairing. Panel B shows the critic's classifier macro-F1 minus the opening-only consultancy judge's, a non-debate analog of RLHF-style verifier evaluation. Debate gains track Panel B, not Panel A: the gen-verify gap is necessary but not sufficient.
The quantity that tracked the split was the critic's classifier macro-F1 minus the opening-only consultancy judge's macro-F1. In the responder pairings, the critic had a better signal for the judge to import. In the non-responder pairings, the critic did not have much of a signal advantage over the judge. More debate tokens did not fix that.
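The two quantities are easy to confuse, so here is a small sketch that computes both for a pairing. The class name, field names, and all numbers are hypothetical; only the two subtractions follow the definitions in the text.

```python
from dataclasses import dataclass


@dataclass
class PairingStats:
    generator_accuracy: float  # strong model answering the task directly
    verifier_accuracy: float   # same strong model labeling answers
    critic_macro_f1: float     # critic's agree/disagree stance as a classifier
    judge_macro_f1: float      # opening-only consultancy judge

    @property
    def gen_verify_gap(self) -> float:
        """Generator-verifier gap: positive when verifying is easier than generating."""
        return self.verifier_accuracy - self.generator_accuracy

    @property
    def critic_signal_gap(self) -> float:
        """Signal the judge could import from the critic."""
        return self.critic_macro_f1 - self.judge_macro_f1


# Made-up numbers: both pairings have the same generator-verifier gap,
# but only the first gives the judge something new to import.
responder_like = PairingStats(0.60, 0.80, 0.85, 0.70)
non_responder_like = PairingStats(0.50, 0.70, 0.72, 0.71)

for name, s in [("responder-like", responder_like),
                ("non-responder-like", non_responder_like)]:
    print(f"{name}: gen-verify gap {s.gen_verify_gap:+.2f}, "
          f"critic signal gap {s.critic_signal_gap:+.2f}")
```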
There was also a behavioral condition. The judge had to treat the critic's statement as a claim to verify, not as a piece of testimony to summarize. In the responder pairings, judges tended to do their own checking. In the non-responder pairings, verification rates dropped sharply once a critic entered the transcript.
| Pairing / domain | Consultancy | Opening-only cons. | Debate | Opening-only debate |
| --- | --- | --- | --- | --- |
| Qwen3.5-122B/35B, CC | 99% | 94% | 98% | 97% |
| Gemini 3.1 Pro/Flash, ARC | 88% | 78% | 91% | 100% |
| Opus 4.6/4.5, ARC | 100% | 98% | 100% | 100% |
| gpt-oss-120B/20B, CC | 82% | 67% | 31% | 16% |
| Qwen3.5-35B/Qwen3-4B, CC | 67% | 61% | 53% | 25% |
Behavior-reviewed verification rate on correct judge verdicts. The pattern is high verification across formats for responders, and a sharp drop in critic-present formats for non-responders.
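The size of the drop can be read off the table by subtracting each consultancy format from its debate counterpart. The snippet below does that arithmetic on the reported percentages; the variable and function names are ours for illustration.

```python
# Verification rates (%) from the table, in the order:
# (consultancy, opening-only consultancy, debate, opening-only debate)
rates = {
    "Qwen3.5-122B/35B, CC": (99, 94, 98, 97),
    "Gemini 3.1 Pro/Flash, ARC": (88, 78, 91, 100),
    "Opus 4.6/4.5, ARC": (100, 98, 100, 100),
    "gpt-oss-120B/20B, CC": (82, 67, 31, 16),
    "Qwen3.5-35B/Qwen3-4B, CC": (67, 61, 53, 25),
}


def critic_effect(row):
    """Change in verification rate (pp) when a critic is added, per format."""
    cons, open_cons, debate, open_debate = row
    return debate - cons, open_debate - open_cons


for pairing, row in rates.items():
    full, opening = critic_effect(row)
    print(f"{pairing}: full {full:+d} pp, opening-only {opening:+d} pp")
```

The gpt-oss pairing drops 51 points in both formats and Qwen3.5-35B/Qwen3-4B drops 14 and 36 points, while the three responder pairings change by between −1 and +22 points.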
This is the main thing we would audit before using debate on a new distribution. First ask whether the critic actually beats the no-transcript judge as a classifier. Then inspect whether the judge checks critic claims. If the critic does not beat the judge, or if the judge only paraphrases the critic, debate is unlikely to help.
A case where debate worked
One CodeContests example makes the mechanism fairly concrete. The task was Construct a tree. The proposer's solution used a feasibility check that was too weak: it checked whether s < n, but the true minimum sum of subtree sizes for a rooted tree on n nodes is 2n − 1, achieved by a star.
Without the critic, the judge accepted the solution. It retraced the provided samples and a trivial edge case, and produced a table of checks that looked like verification. But it never constructed the relevant new input. This is a common failure mode: the judge performs local checking inside the proposer's frame, rather than testing the frame itself.
With the critic, the judge had a different thing to check. The critic pointed to the star-graph lower bound and gave a concrete failing case. The judge re-derived the bound and traced an input like n=3, s=3, where the requested sum is impossible because the minimum is 5. The verdict flipped from accepting the proposer to rejecting it.
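The lower bound the critic cited can be checked by brute force for small n. The sketch below is ours, not the proposer's or the judge's code: it enumerates every rooted tree on n labeled nodes and confirms the minimum sum of subtree sizes is 2n − 1. The function `weak_check_accepts` is an assumed reading of the proposer's check (reject only when s < n); the real solution may have been structured differently.

```python
from itertools import product


def min_subtree_size_sum(n):
    """Brute-force the minimum sum of subtree sizes over rooted trees on n nodes."""
    best = None
    # Node 1 is the root; node i >= 2 picks any parent with a smaller label.
    # Every rooted tree shape has at least one labeling of this form.
    for parents in product(*[range(1, i) for i in range(2, n + 1)]):
        depth = {1: 1}
        for node, parent in zip(range(2, n + 1), parents):
            depth[node] = depth[parent] + 1
        # A node is counted once in its own subtree and once per ancestor,
        # so the sum of subtree sizes equals the sum of depths (root depth 1).
        total = sum(depth.values())
        best = total if best is None else min(best, total)
    return best


def weak_check_accepts(n, s):
    # Assumption: the proposer's check rejected only when s < n.
    return not (s < n)


def bound_check_accepts(n, s):
    # Necessary (not sufficient) condition from the star lower bound.
    return s >= 2 * n - 1


for n in range(1, 8):
    assert min_subtree_size_sum(n) == 2 * n - 1

print(weak_check_accepts(3, 3), bound_check_accepts(3, 3))  # prints: True False
```

For n=3, s=3 the weak check lets the input through while the star bound rejects it, which is the kind of input the judge traced once the critic pointed to it.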
Things to note:
A case where debate failed
The non-responder failures are also informative. In a gpt-oss-120B / gpt-oss-20B CodeContests example, the proposer's code computed the right closed form for n >= 1 but failed at n=0. The critic correctly pointed to the n=0 edge case. This should have been exactly the kind of objection debate is meant to surface.
But the judge did not actually check it. It wrote a verification table whose n=0 row asserted a special case that did not exist in the code. The code had no n==0 guard; when n=0, the loop body does not run, m remains 1, and the program prints 4. The judge nevertheless rejected the critic's objection with high confidence.
This is not a failure where the critic's signal was absent. The signal was in the transcript. The failure was that the judge treated the transcript like testimony to reconcile, rather than as claims to verify against the code. This is why we do not want to describe debate as automatically helping whenever a critic is present.
Rebuttals did not change test-time verdicts
We also ran opening-only debate, where the judge sees the proposer's answer and the critic's initial stance, but no later rebuttal speeches. On every responder pairing, full debate and opening-only debate were statistically indistinguishable on macro-F1, and they agreed on 94-99% of paired verdicts.
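A comparison of this kind can be sketched as a paired agreement rate plus a paired bootstrap interval on the macro-F1 difference. This is an illustration only: the paragraph above does not specify which statistical test we used, the bootstrap is a stand-in, and the verdict lists are toy data with booleans standing for "proposer is correct".

```python
import random


def macro_f1(y_true, y_pred, classes=(True, False)):
    """Unweighted mean of per-class F1 over the two verdict classes."""
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def agreement_rate(pred_a, pred_b):
    """Fraction of items on which the two protocols give the same verdict."""
    return sum(a == b for a, b in zip(pred_a, pred_b)) / len(pred_a)


def paired_bootstrap_diff(y_true, pred_a, pred_b, n_resamples=2000, seed=0):
    """95% percentile interval for macro-F1(a) - macro-F1(b), resampling items."""
    rng = random.Random(seed)
    n = len(y_true)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        t = [y_true[i] for i in idx]
        a = [pred_a[i] for i in idx]
        b = [pred_b[i] for i in idx]
        diffs.append(macro_f1(t, a) - macro_f1(t, b))
    diffs.sort()
    return diffs[int(0.025 * n_resamples)], diffs[int(0.975 * n_resamples) - 1]


# Toy data: 30 correct and 20 incorrect proposer answers.
truth = [True] * 30 + [False] * 20
full_debate = list(truth)
opening_only = list(truth)
for i in (0, 49):   # full debate errs on two items
    full_debate[i] = not full_debate[i]
for i in (1, 48):   # opening-only errs on two different items
    opening_only[i] = not opening_only[i]

print("agreement:", agreement_rate(full_debate, opening_only))
print("95% interval for macro-F1 difference:",
      paired_bootstrap_diff(truth, full_debate, opening_only))
```

An interval that straddles zero together with a high agreement rate is the pattern we describe as "statistically indistinguishable".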
Our interpretation is narrow. For test-time reward labeling in these domains, the critic's first message recovered the useful signal. Once the judge had a good objection and was willing to check it, additional advocacy rarely changed the verdict.
We do not take this as a negative result about training-time debate. Later speeches might still matter under optimization pressure. They might produce more diverse trajectories, better exploration, or richer preference data. Our experiment did not test that. It only says that, for these train-free labels, a single independent critique was a strong substitute for full debate.
What we think this means
Our current view is that debate is useful when it turns a hard holistic judgment into a smaller claim that the judge can check. In the successful cases, the critic supplied a falsifiable check: a counterexample input, a held-out-grid invariant, a lower bound, or some other local reason to distrust the proposer's frame. The judge then did enough independent work to verify the check.
This is different from saying that debate is generally better than consultancy. It is also different from saying that multi-round debate is the right primitive. For the train-free reward-labeling use case, the cheaper primitive may often be: generate an answer, ask an independent critic to check it once, and give both to the judge.
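The cheaper primitive can be sketched as a three-call pipeline. Everything here is hypothetical scaffolding: `propose`, `critique`, and `judge` stand in for model calls, and the `Transcript` type is not taken from our codebase. The stub judge checks the critic's claim by running the answer, which is the behavior we saw in responder pairings.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Transcript:
    task: str
    answer: str
    critique: str


def label_with_single_critique(
    task: str,
    propose: Callable[[str], str],
    critique: Callable[[str, str], str],
    judge: Callable[[Transcript], bool],
) -> Tuple[bool, Transcript]:
    """Generate an answer, get one independent critique, and let the judge decide."""
    answer = propose(task)
    critic_message = critique(task, answer)  # a single opening speech, no rebuttals
    transcript = Transcript(task, answer, critic_message)
    return judge(transcript), transcript


# Stub "models" so the sketch runs without any API access.
def toy_proposer(task: str) -> str:
    return "def add(a, b):\n    return a - b"  # deliberately buggy


def toy_critic(task: str, answer: str) -> str:
    return "Disagree: add(1, 2) returns -1, but the task requires 3."


def toy_judge(t: Transcript) -> bool:
    # Verify the critic's concrete claim by running the answer,
    # instead of summarizing the critique.
    scope = {}
    exec(t.answer, scope)
    return scope["add"](1, 2) == 3


verdict, transcript = label_with_single_critique(
    "Write add(a, b) that returns a + b.", toy_proposer, toy_critic, toy_judge
)
print("judge endorses answer:", verdict)  # prints: judge endorses answer: False
```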
There is also a capabilities story: the least capable judges were less likely to verify at all, and when they did, they made more logic and reasoning errors.
The result also suggests a pre-deployment audit if you want to use debate in its current form at inference time. Before spending a lot of inference on debate, measure a small labeled slice of the target distribution. Ask whether the critic's agree/disagree F1 score is higher than the judge's. Then run short debates and review transcripts to see whether the judge actually verifies critic claims or just treats them as testimony. If either check fails, more rebuttal rounds are unlikely to solve the problem.
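The audit can be written down as two checks on a small labeled slice. The function below is a sketch under stated assumptions: the thresholds `min_f1_margin` and `min_verification_rate` are placeholders we picked for illustration, not values derived from our experiments, and the inputs are assumed to come from your own labeled slice and transcript review.

```python
def audit_debate_readiness(
    critic_f1: float,
    judge_f1: float,
    verification_rate: float,
    min_f1_margin: float = 0.05,          # hypothetical threshold
    min_verification_rate: float = 0.80,  # hypothetical threshold
) -> list:
    """Return the list of failed audit checks (empty means both checks passed)."""
    problems = []
    # Check 1: does the critic have a better signal than the judge alone?
    if critic_f1 - judge_f1 < min_f1_margin:
        problems.append("critic has no clear signal advantage over the judge")
    # Check 2: does the judge verify critic claims instead of summarizing them?
    if verification_rate < min_verification_rate:
        problems.append("judge rarely verifies critic claims")
    return problems


# Made-up inputs for two hypothetical pairings.
print(audit_debate_readiness(critic_f1=0.85, judge_f1=0.70, verification_rate=0.95))
print(audit_debate_readiness(critic_f1=0.72, judge_f1=0.71, verification_rate=0.31))
```

The first call returns an empty list; the second returns both failure messages, in which case we would not expect extra rebuttal rounds to help.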
Current concerns
There are several reasons not to overread these experiments.
This is not a full safety case for debate. It is a test of one prerequisite: can debate give a weaker judge a better correctness signal than one-sided baselines in calibrated domains?
What we are trying next
The point of using code and logic was to get a clean read on the mechanism. The next step is to relax the things this setting holds fixed: fuzziness, agentic length, oracle structure, and optimization pressure.
Rubric-graded math
Final-answer-graded math is a near miss as a testbed. In pilot debates, illogical intermediate reasoning often still arrived at a correct final answer, and the final answer was the only thing we asked the judge to vote on. To test whether this came from memorization or from unfaithful reasoning, we collected a dataset of math olympiad problems released in the last few months, which we call Uncontaminated Math Olympiad 2026. Models still reached correct answers through nonsensical intermediate steps on these new problems, so memorization of the problems is an unlikely explanation, and we conclude that the models, Qwen most obviously, are laundering reasoning. A critic who falsifies the process therefore does not falsify the outcome, and debate's potential to supervise math final answers breaks down. Rubric-graded math seems like a better bridge. Each rubric item is a local claim that can in principle be checked, which is exactly the shape of evidence that helped in our successful code and ARC cases.
Debates over agentic code trajectories
Our current setup is single-turn: one answer, one critique, one verdict. Real coding agents produce long action-observation traces. The interesting monitoring question is whether a critic can help a judge catch trajectory-level failures: undocumented edits, reward-hacking attempts, cover-up behavior, or inconsistencies that look fine at any single step. This changes the framing from reward labeling to monitoring, but the same mechanism may apply: the critic points the judge to the part of the trajectory worth checking.
Long-horizon fuzzy tasks
This is the direction that matters most for the long-term motivation. A scientific proposal may hide a causal assumption. An experiment may hide a confound. A long piece of writing may fail through inconsistency across sections. The hypothesis is that debate can surface these objections before reward is assigned. Our current experiments motivate that hypothesis; they do not establish it.
Training-time debate
The test-time rebuttal result leaves two possibilities. A pessimist might say that rebuttals are dead weight and we should train only single-turn critics. An optimist might say that current debaters do not yet know how to use later rounds, but optimization pressure could make those rounds more useful. We are interested in tracking not just macro-F1, but transcript-level behavior: do critics learn to produce more falsifiable objections, do judges learn to compare frames, and does the responder / non-responder split move as critic quality improves?
Bottom line
The mechanism we found is simple: a critic supplies a checkable objection, and the judge verifies it. This improved weak-judge reward labels on three of five model pairings, mainly by reducing endorsement of wrong proposer answers. It failed when the critic had no better signal to import, or when the judge summarized the critic instead of checking it.
The question we care about now is whether this mechanism survives in the places where scalable oversight is actually needed. Can debate turn a fuzzy problem into smaller assumptions and local claims that a weaker judge can evaluate more reliably than the original answer alone? Our results are evidence that this can happen in code and logic. They are not yet evidence that it works for research assistance, long-horizon agency, or other open-ended domains.
This work follows our earlier post, Inference-time Generative Debates on Coding and Reasoning Tasks. Code[1] will be available soon. Run manifests and transcripts are available on request (they are too large to store in Google Drive!).
The debate-execution scaffolding in our codebase was originally written by Samuel Arnesen: GitHub. We have substantially extended it for the experiments and analyses reported here.