Suggestions for improving debate protocols in AI safety

tr5tn

Overview

While many leading AI Safety researchers share an intuition that debate can be a powerful element of AI Safety measures, the nuances of debate protocols seem to be a less explored facet of the research. Competitive human debate offers a wealth of existing formats with distinct rules, which could inform future AI Safety implementations. The rules of competitive debate present ready-made alternative protocols to counteract observed model gaming behaviours and may present options to subvert undesirable model tendencies. In this post, I’ll provide an overview of American policy debate rules/structure and suggest how the various formats of competitive debate can inform AI Safety debate protocols.

Motivation

I recently reviewed the Winter 2026 MATS Research posters and presentations. Among these, I was especially keen to review the Building an Empirical Science of AI Debate presentation by Lennie Wells and @joanv.

I’m drawn to this AI Safety topic above many others because I spent three years in high school debating competitively, which was one of the most intellectually demanding ways to mangle and expand an adolescent mind. The intensity, rigour and incessant exploration were transformative. I imagine most former debaters would feel at home here. If you’ve never witnessed American policy debate, you might think of it as something like speed chess with arguments and evidence.

Contrasting an AI Safety protocol implementation with policy debate structure

When I reviewed the new MATS presentation, I couldn’t help but contrast their “propose-critique-decide” protocol with the format of a policy debate round. Specifically, the current “training via self-play RL” approach is limited by critic models gaming the protocol, using the "last mover advantage" to withhold the most valuable critique until the final turn of the debate, skewing the judge’s vote accordingly. The structure of a policy debate round has been designed to address that weakness. There are other facets of the structure that could be useful references as well.

Speaker Roles

A policy debate round pits a pair of two-person teams against each other. Each speaker gives two speeches, gives one cross-examination, and receives one cross-examination. Policy debate is sometimes known as cross-examination debate, or CX debate. A round lasts approximately 90 minutes, with many speakers packing as many arguments as possible into a fixed time allocation by speed reading (AKA spreading) at ~300 words per minute. I only mention the speed dimension because it is a normalised human form of gaming these rules, and one we should be attuned to if a similar time (or token) constraint were imposed on an AI Safety architecture embracing these rules. Would models debate in more efficient, inhuman language to use tokens more efficiently? Would we encourage them to do so?

Constructives, Cross-Examination and Rebuttals

Each speaker gives one Constructive and one Rebuttal speech. Constructive speeches are 8 or 9 minutes long. Rebuttals are 5 or 6 minutes. Cross-examinations create a break between each Constructive speech, supporting interrogation and interpretation of the preceding speech. Constructive speeches are used to elaborate your team’s “affirmative” or “negative” positions on the topic. There is a lot of nuance specific to policy in the wider rules, but one facet that generalises well to AI Safety is the Constructive > Rebuttal flow. Speakers aren’t allowed to introduce new arguments in Rebuttals. They can only continue debating the arguments that have been laid out in Constructive speeches.

Structure of a policy debate round

Speech	Time (High School)	Time (College)
First Affirmative Constructive (1AC)	8 minutes	9 minutes
Cross-examination of First Affirmative by Second Negative	3 minutes	3 minutes
First Negative Constructive (1NC)	8 minutes	9 minutes
Cross-examination of First Negative by First Affirmative	3 minutes	3 minutes
Second Affirmative Constructive (2AC)	8 minutes	9 minutes
Cross-examination of Second Affirmative by First Negative	3 minutes	3 minutes
Second Negative Constructive (2NC)	8 minutes	9 minutes
Cross-examination of Second Negative by Second Affirmative	3 minutes	3 minutes
First Negative Rebuttal (1NR)	5 minutes	6 minutes
First Affirmative Rebuttal (1AR)	5 minutes	6 minutes
Second Negative Rebuttal (2NR)	5 minutes	6 minutes
Second Affirmative Rebuttal (2AR)	5 minutes	6 minutes
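
As a minimal sketch, the high school schedule above can be written down as data so that its properties can be checked mechanically. The `Speech` class, the `ROUND` list and the `speaking_minutes` helper are hypothetical names introduced here for illustration; cross-examinations are attributed to the questioning side, which is an assumption about how one might account for them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Speech:
    name: str
    side: str      # "AFF" or "NEG"; for cross-ex, the questioning side (assumption)
    kind: str      # "constructive", "cross-ex" or "rebuttal"
    minutes: int   # high school time limits from the table


ROUND = [
    Speech("1AC", "AFF", "constructive", 8),
    Speech("CX of 1A by 2N", "NEG", "cross-ex", 3),
    Speech("1NC", "NEG", "constructive", 8),
    Speech("CX of 1N by 1A", "AFF", "cross-ex", 3),
    Speech("2AC", "AFF", "constructive", 8),
    Speech("CX of 2A by 1N", "NEG", "cross-ex", 3),
    Speech("2NC", "NEG", "constructive", 8),
    Speech("CX of 2N by 2A", "AFF", "cross-ex", 3),
    Speech("1NR", "NEG", "rebuttal", 5),
    Speech("1AR", "AFF", "rebuttal", 5),
    Speech("2NR", "NEG", "rebuttal", 5),
    Speech("2AR", "AFF", "rebuttal", 5),
]


def speaking_minutes(schedule, side):
    """Total minutes controlled by one side across the round."""
    return sum(s.minutes for s in schedule if s.side == side)


# Both sides control the same amount of time, and the Affirmative
# holds both the first and the last slot.
aff_total = speaking_minutes(ROUND, "AFF")
neg_total = speaking_minutes(ROUND, "NEG")
opens_and_closes = (ROUND[0].side, ROUND[-1].side)
```

The speech times sum to 64 minutes; the remainder of the roughly 90-minute round is not in the table and is not modelled here.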

Structuring AI Safety protocols

I mentioned earlier that this structure addresses the “last mover advantage” problem encountered in the MATS research. It does this by disallowing new arguments in the final four speeches, but also by flipping the sequence of turns at the beginning of the Rebuttals, so the Affirmative team has the first and last speech. Judges know they should ignore new topics added in the final speech. Affirmative teams have the burden to prove that their proposition is correct, but they get the opportunity to take the last move. This comes at the cost of a demanding First Affirmative Rebuttal after two opposing speeches, but this seems less counterintuitive once you’ve played this game, since all Rebuttal speeches are the same length.
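
To make the "no new arguments in Rebuttals" rule concrete, here is a rough sketch of how a judge-side filter might work. It assumes each turn has already been tagged with argument identifiers, which in practice would itself need a model or human to do; `Turn` and `arguments_to_weigh` are hypothetical names, not part of any existing protocol implementation.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    speaker: str
    kind: str               # "constructive" or "rebuttal"
    argument_ids: set       # ids of the arguments raised in this turn (assumed pre-tagged)


def arguments_to_weigh(turns):
    """For each turn, return the argument ids a judge should consider.

    Constructive turns may introduce anything. Rebuttal turns may only
    continue arguments that some Constructive already introduced, so
    anything new in a Rebuttal is dropped.
    """
    introduced = set()
    weighed = []
    for turn in turns:
        if turn.kind == "constructive":
            introduced |= turn.argument_ids
            weighed.append(set(turn.argument_ids))
        else:
            weighed.append(turn.argument_ids & introduced)
    return weighed
```

Under this rule, a critique held back until the final Rebuttal simply carries no weight, which removes the incentive to withhold it.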

Policy debate structure emphasises the dual qualities of building strong arguments in Constructive speeches, while engaging deeply with strong counter-arguments in Rebuttals to arrive at clearer likelihoods, comparative impacts, and framing that judges must consider. If a Bayesian analysis should be compelling, a debater can bring that into the debate. Cross-examination seems especially relevant to AI Safety. This is where an opposing team can unveil error, deception, bad citations, lapses in adherence to the rules, and generally call out anything flimsy. It provides space to debate the framing and evaluation criteria that the judge (or judges) should use.

In an AI Safety architecture embracing these structures and rules, each speaker and judge could be a distinct agent and learn the specialist skills associated with that role in the structure. Models behind each speaker agent could be swapped or ensembled. Each debate could be re-run with different models. A round could be re-run with the same models swapping sides (potentially mitigating sycophancy problems or other tendencies).

I only have ideas (rather than answers) about how debate speech time limits should translate into AI Safety debater limits. Token budgets per “speech” seem like an obvious candidate, but other questions emerge. In a policy debate round, each team has a pool of preparation time to spend in between speeches, where they will gather evidence, expand notes and structure speeches. Perhaps something similar could be afforded with additional reasoning token budgets? Perhaps it isn’t necessary at all.
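
One way to picture this, purely as a sketch: scale the high school speech times by a fixed tokens-per-minute rate, and give each team a shared pool of reasoning tokens that plays the role of preparation time. The rate, the pool size, and the names `speech_token_budget` and `PrepPool` are all assumptions made for illustration.

```python
# High school time limits, in minutes, taken from the table above.
HS_MINUTES = {"constructive": 8, "cross-ex": 3, "rebuttal": 5}


def speech_token_budget(kind, tokens_per_minute):
    """Token cap for one speech, proportional to its time limit."""
    return HS_MINUTES[kind] * tokens_per_minute


class PrepPool:
    """Shared per-team pool of reasoning tokens, analogous to prep time."""

    def __init__(self, total_tokens):
        self.remaining = total_tokens

    def draw(self, requested):
        """Grant up to `requested` tokens, never more than what is left."""
        granted = max(0, min(requested, self.remaining))
        self.remaining -= granted
        return granted
```

With a hypothetical rate of 400 tokens per minute, a Constructive would be capped at 3,200 tokens and a Rebuttal at 2,000; a team that spends its whole pool before the first Rebuttal has nothing left for later speeches, which mirrors the trade-off human teams face.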

All told, these protocols seem under-explored given that there are nine formats of competitive high school debate in the U.S. alone: National Speech & Debate Association High School Unified Manual. The rules seem esoteric from the outside, and only some facets will be useful in other contexts, but there’s a lot to reference in AI Safety debate protocols. At a glance, an architecture based on the policy debate structure might be more cumbersome (and expensive) to implement than existing protocols, so that may limit usefulness in some cases. In any case, this looks like a viable alternative for at least one current protocol limitation where researchers are actively seeking other options. If curious, I’m happy to help explain more of this weird, but instructive world. I’ve tried to keep this somewhat brief.

Sharing here rather than burying these thoughts in direct feedback to the MATS researchers, as I’m unaware of deep focus on the protocol in most AI Safety research, but if it is an active thread, I’d love to engage with more of it.

Update 30/5/2026: I’ve just found these earlier posts from @Beth Barnes from 2020 which I clearly should have been aware of before writing this! It would be very interesting to revisit the path Beth was following back then in concert with today’s reasoning capabilities, CoT, and especially wrt the MATS research on building an empirical science of AI debate.

https://www.lesswrong.com/posts/Br4xDbYu4Frwrb64a/writeup-progress-on-ai-safety-via-debate-1

https://www.lesswrong.com/posts/PJLABqQ962hZEqhdB/debate-update-obfuscated-arguments-problem

I used to do a bunch of British, Asian, and Australasian parliamentary debate. I both debated and judged, winning the biggest debate tournament in the world and chairing the open finals as a judge in the next year. My impression, both from these formats and from the policy debaters (and debaters in other bespoke, torturous US formats) I've talked to, is that such formats, WRT scalable oversight via debate (as opposed to the inference-time multi-agent debate setup), are more insightful about debate's (empirical) failure modes/missing gaps than about its potential.

One good example of this is judge biases (h/t Kozzy Voudouris and Simon Marshall); if you anticipate using the debate setup as a training env, then there is a good analogy between this setup and meta-trends in debating, where people end up using buzzwords like "structural reasons" incorrectly or speaking very quickly because judges reward it. A similar thing seems to happen in AI debate, having run a few experiments on this myself.

Another example of this is speaker roles (Joan mentioned this to me). AI debaters have a hard time taking on the "incentives" of being a debater and arguing in an impassioned, aggressive way. They tend to be too sycophantic and eager to agree. But training in the persona of an aggressive debater might be (1) undesirable, leaking bad propensities, (2) not straightforward, as there are many ways of doing this such as SDF or character training, and (3) liable to lead to judge hacking, if misspecified. And of course in human debating, anecdotally many people have become quite unpleasant after extended periods of time debating.

@David Africa thanks! Many of those points are certainly worth focusing on. For what it's worth, I was also an awarded speaker in the Model UN, but I found that format to be far more arbitrary, susceptible to being gamed by speaking skill and rhetoric, and IMO less likely to arrive at something desirable (I led an uprising of militarised third-party countries to vote down all disarmament proposals).

Ultimately, the actual plans, counterplans, kritiks and topicality discussions in policy debate are ridiculous. Every debater I've met would acknowledge that. And ultimately, it is a game, so IMO that is to be expected. So I am certainly not agitating for AI Safety outcomes that resemble policy debate verdicts, but I think the game itself is good reference since one of the current problems in AI Safety debate protocols is that they are being gamed. Distinctions between Constructives, Rebuttals and Cross-Examination are really fundamental to policy debate, and we get similar constructs in legal proceedings.

I'm conscious this is all one American reference and I'm not offering empirical findings, but I do think the field should consider these games as reference protocols (as well as cross-cultural legal rules) since they are a result of refinement over decades. They are practices that already exist.

Edited to add: not wishing to be dismissive of your empirical findings. I'd love to read more about the difficulties with training or inference-time persona adoption, but also, I don't know that current negative findings should preclude focus on those problems.

I also think MUN is bad.

I think you should mentally model the use case of AI safety debate protocols as being in high-stakes settings, and expect that, after applying tons of optimisation pressure, the AI debates will look very different from human debates in the limit. Scalable oversight (SO) debate protocols in particular try to take advantage of self-play, which you can't do well with humans, so introducing asymmetry through additional roles and rules may make them hard to reason about theoretically (and also possibly weaken the benefit of self-play). So they'd need to be pretty well motivated.

I take the point that in the self-play context this could drift off-course! I suppose (linking this back to the MATS research) I'm suggesting it would be good to measure that beside a more naïve protocol.

Chiming in because I saw @David Africa's comment here and I thought I should give a response as well, as a former debater who now works on debate for AI safety:

My quals: I did debate in high school and was ranked highly in Lincoln-Douglas, where we used college policy debate rounds for practice, and wrote lddebatebook.com, where I walk through a lot of the interesting things that optimization pressure on human LD debate has converged towards.

I’m happy you’re interested in debate and agree with you and David that we can learn from human debate, but I don’t think further exploration of human debate protocols is that useful, and agree with David that it is more informative of potential failure modes, for a few reasons:

1) Theoretical debate as it exists (e.g. Irving et al. 2018, Brown-Cohen et al. 2023, Brown-Cohen et al. 2025, Irving and Marshall et al. 2025, can name more if you want) is largely concerned with debate as an efficient way to explore argument trees, which are often so large that they are difficult to enumerate completely (cf. human debate, where humans need to make their entire argument in the constructive). The motivation is complexity-theoretic: limiting debate formats to full-text constructives constrains the scope of problems to the complexity class MA, as Beth Barnes notes in a comment on Barnes and Christiano et al. 2020, although she also notes that this kind of debate is what we might see in practice at first. This is a large conceptual update for many people who have any exposure to human debate and who expect their past experience to transfer over, and is something I’ve only come around to more recently.

2) The formats subject to the most optimization pressure (policy debate and Lincoln-Douglas debate) have degenerated to argumentative games whose only rules are the existence of two debaters labeled Affirmative and Negative, the speech times, and the pre-announced resolution (which often isn’t even debated due to weird optimization pressure), although I agree with David that we can learn about some of the failures of empirical debate using human debate as an analogue, and that lots of weird optimization pressure transforms these activities into hard-to-recognize formats for laypeople (even though I personally believe policy debate is good for subject formation and personally found it much more fun and engaging than lay debate).

3) This isn’t to say that theoretical debate protocols aren’t worth further exploring — we are still far from debate as a perfect theory, let alone a perfect empirical alignment strategy, and if you have background in complexity theory you might enjoy thinking about this problem from that lens.

One other note that I think you would agree with, and that is mentioned in the video, is concession of one debater to another; this could be possible but doesn’t make sense within the P-C-D protocol without some reward for concessions.

Sandbagging in the last speech is a problem in human debate, and one whose development is somewhat predictable in the RL process absent mitigations like:

(1) If you use the tree-style debate, where cases are built as the tree is explored, you can let either debater be willing to pay some reward for another round, and randomize payment from the two if both want to pay (“recursion payments” in Barnes and Christiano 2020). This means that if Bob tries something egregiously wrong, Alice can pay a bit of reward to explain why Bob is wrong to prevent this sandbagging.

(2) If you use free-text debate, you can stipulate that each side should lay their side out completely. Then, in combination with the “each side must lay out their side completely” rule, you can stipulate that arguments have to be either made in the first speech or be made directly in response to a new development. So the critic cannot say “here’s a step where everything falls apart” in the last speech if they had the opportunity to contest it previously, and you can prompt/train the judge to enforce this. Maybe this is possible in the tree search as well (an argument made must contest the leaf node; arguments cannot contest the leaf by implication (e.g. by contesting a node further up in the tree)).
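
A rough sketch of the admissibility rule in (2), assuming each argument is tagged with the speech it appears in and (optionally) the opposing argument it answers. The `Argument` class and `is_admissible` function are hypothetical; a real judge would have to infer these links from free text. For simplicity, a speaker's earlier speeches are detected from the arguments they contain.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Argument:
    arg_id: str
    speaker: str
    speech_index: int                  # position of the speech in the transcript
    responds_to: Optional[str] = None  # id of the argument this one answers


def is_admissible(arg, transcript):
    """Admit an argument if it is in the speaker's first speech, or if it
    directly answers something the opponent introduced since the speaker
    last spoke (i.e. the speaker had no earlier chance to contest it)."""
    earlier_own = [a.speech_index for a in transcript
                   if a.speaker == arg.speaker and a.speech_index < arg.speech_index]
    if not earlier_own:
        return True
    if arg.responds_to is None:
        return False
    target = next((a for a in transcript if a.arg_id == arg.responds_to), None)
    if target is None or target.speaker == arg.speaker:
        return False
    return max(earlier_own) < target.speech_index < arg.speech_index
```

For example, if the proposer makes a claim in speech 0 and the critic leaves it alone in speech 1, a critique of that claim in speech 3 is rejected, while a response to something the proposer first said in speech 2 is accepted.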

The last mover bias can additionally be somewhat rectified by making Alice and Bob speak simultaneously from the debater’s perspective. This could still encourage both sides to sandbag arguments until the last speech if not combined with the above but does fix informational side bias. Obviously this leaves somewhat of a chronological last-mover advantage but overall seems good.
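
A minimal sketch of simultaneous turns, assuming debaters are plain callables that map the visible transcript to a speech. `run_simultaneous` and the `Debater` type are hypothetical names for illustration only.

```python
from typing import Callable, List, Tuple

# A debater maps the transcript it can see to its next speech (assumption).
Debater = Callable[[List[Tuple[str, str]]], str]


def run_simultaneous(alice: Debater, bob: Debater, rounds: int):
    """Run a debate in which both sides speak 'at the same time'.

    In each round both debaters see only the transcript from previous
    rounds, so neither can condition on the other's current speech.
    """
    transcript: List[Tuple[str, str]] = []
    for _ in range(rounds):
        visible = list(transcript)  # snapshot taken before either side speaks
        alice_speech = alice(visible)
        bob_speech = bob(visible)
        # The judge still reads Alice before Bob within a round, which is
        # the residual chronological ordering mentioned above.
        transcript.append(("Alice", alice_speech))
        transcript.append(("Bob", bob_speech))
    return transcript
```

Because both speeches in a round are generated from the same snapshot, the informational asymmetry disappears even though the transcript remains ordered.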

I also agree with both you and David that MUN is really bad.

My debate work up to now has been focused on zero-shot inference debate, but after reflection and good chats with a lot of the UK AISI people as well as some others in the AI space, I think it's probably best to focus on potential failure modes stemming from self-play and optimization pressure.

Thanks Ethan! I’ve just seen your research from this year, which I’m going to digest at a sensible pace. The more recent one looks especially interesting.

Totally agree about the conceptual update. I hadn’t seen that original research when I first wrote the post, and agree it would have given me much to chew on.

I also take the point on optimisation pressures, and have been thinking about that a lot since David’s comments. The more I think about the human debate references, the more I’m inclined to think the human protocols haven’t advanced enough. Law has plenty of its own problems. Large-scale formats like MUN descend into contests of vocal or rhetorical strength. When I look at all of that, I feel that competitive debate starts to address many of those challenges through strict rules, while admitting the imperfections that the constraints introduce (you can only reach limited depth, and there’s the artifice/flaws from optimisation for those rules). But in so many contexts those types of failure become a reason to add deeper refinements rather than reverting to a simpler form. Maybe we’ve reached the limit of what these imperfect games can teach us, even if the AI safety technique could go deeper in its own direction (maybe following complexity theory, as you say - a topic where I am completely out of my depth).

In any case, keen to get stuck in to your recent posts.

Yeah I would recommend the newer one; the older one tests a dual proposer format that I don't think is structured to reward interaction, and the datasets that we tested are either broken (BigCodeBench) or poorly calibrated to model strength (ARC AGI 2).