[This writeup reflects work done jointly with David Rein and Julian Michael at NYU's Alignment Research Group]
In the past year, there have been a number of projects aimed at validating the basic premises behind debate as a mechanism for scalable oversight (see here, here, and here). One important next step would be to actually train models to debate, as this would let us directly test how models adapt to a debate training objective and whether the debate protocol can withstand optimization pressure. For the last few months at NYU, we’ve been trying to do just that. Our hope is that by doing so with open-sourced models and code, we can help support scalable oversight research being conducted outside of the major labs.
In this write-up, we wanted to share our experimental setup, training procedure, and some of our preliminary results, with the hope of receiving feedback on our approach before we extend our work to more complicated domains.
TL;DR
As originally proposed by Irving et al (2018), debate has the potential to empower less capable systems (e.g. humans) to verify the outputs of more-capable systems (e.g. future AIs).
In its most basic form, debate works by having two copies of a model, the debaters, argue against each other, where the arguments are in defense of two alternative responses to a common question. A judge, who can be either a human or a weaker, trusted model, then tries to discern which debater is arguing for the correct answer.
In principle, this debate setup should simplify the job of the non-expert human or AI that is performing the evaluation. For many difficult questions, deep domain knowledge or careful thinking may be needed to spot the counterarguments or subtle flaws necessary to directly validate an expert AI’s answer. Debate incentivizes competing models to discover these flaws and counterarguments and then clearly explain them to the judge. As a result, we’d expect debate to make it harder for an AI to convince a non-expert of a blatantly false claim. The hope is that this property scales alongside the capabilities of the models as the ability of the models to construct persuasive but incorrect arguments is matched by the rebuttal ability of their opponents.
Moreover, models trained to debate should also be more likely to output truthful or otherwise desirable responses. A model trained via RL to maximize its chance of winning should learn to generate the completions that are most heavily rewarded by the judge, which should, in theory, be speeches defending the correct side.
Greenblatt & Shlegeris (2024) distinguish between two different approaches to AI Safety: alignment methods that train models to avoid undesirable actions and control methods that identify and prevent catastrophic outcomes even when the model is unaligned. Since, in theory, debate can empower humans to identify when a model’s outputs may be unsafe while also incentivizing models to generate safer outputs to begin with, debate may function as both a control and an alignment method.
Different research questions may be more or less relevant depending on whether one focuses on debate as an alignment or a control approach. For example, a researcher interested in debate as an alignment approach may be interested in whether a debate-trained model chooses the right answer when given a choice as to which side to defend, or whether the debater models maintain some internal representation of truthfulness. By contrast, a researcher interested in debate as a method of control may expect judges to have much higher levels of accuracy, even in the presence of deliberately deceptive strategies.
We are choosing to focus on three questions that we believe to be of interest to both camps:
1. Can judges use debate transcripts to accurately identify correct answers?
2. Does the judge's ability to select correct answers scale alongside the abilities of the debaters?
3. Do debate-trained models exhibit the kinds of behavior predicted by the theory behind debate?
We are not the first to try to answer any of these questions. Parrish et al (2022) looked at short 1-2 round debates and found that they did not help human judges evaluate questions more accurately. By contrast, Michael et al (2023) ran longer debates and found that judges were in fact more accurate than in alternative setups without adversarial debaters. Most recently, Radhakrishnan (2023) and Khan et al (2024) tested these hypotheses with language models as both the debaters and the judges, with Khan et al (2024) finding that LLM judge accuracy improved when reviewing transcripts of debates between stronger debaters.
Debate Protocol
To determine whether judges can use debate transcripts to accurately identify correct answers, we need to define a common debate protocol that we can use to run experiments.
The procedure used by Michael et al (2023) involves the two debaters being assigned a position and then presenting their opening statements without seeing each other’s arguments. Subsequent rounds of the debate follow a sequential format, with the second debater being able to respond to the previous debater’s claims. Between each round, the judge has an opportunity to ask questions of the debaters. Radhakrishnan (2023) and Khan et al (2024) used mostly similar structures, with the former limiting themselves to a single speech for each side while the latter continued to have the models generate speeches simultaneously even after the first round.
For now, like Radhakrishnan (2023), we are limiting ourselves to single-turn simultaneous debates, in which each debater only gives a single speech in favor of their assigned position. Follow-up work will cover the more general, multi-turn case.
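To make the protocol concrete, here is a minimal sketch of a single round; `generate_speech` and `judge_round` are hypothetical stand-ins for calls to the debater and judge models rather than our actual implementation.

```python
# Minimal sketch of a single-turn, simultaneous debate round.
# `generate_speech` and `judge_round` are hypothetical stand-ins for calls
# to the debater and judge models; this is not our actual implementation.

def run_single_turn_debate(question, answer_a, answer_b, generate_speech, judge_round):
    # Each debater defends its assigned answer without seeing the other's
    # speech (the speeches are generated simultaneously).
    speech_a = generate_speech(question, position=answer_a, opponent_position=answer_b)
    speech_b = generate_speech(question, position=answer_b, opponent_position=answer_a)

    # The judge sees the question, both candidate answers, and both speeches,
    # and returns its confidence that answer_a is correct.
    return judge_round(question, answer_a, answer_b, speech_a, speech_b)
```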
Types of Debatable Questions
To make these debates actually informative, we also need to find a set of questions that have a correct answer but that the judge cannot identify without additional assistance.
Parrish et al (2022) accomplished this task by using the QuALITY dataset from Pang et al (2022), which is a collection of short stories compiled with a set of associated multiple choice reading comprehension questions. To ensure that the judge needs the debate transcript to answer each question, Parrish et al (2022) limited the judge’s access to the short stories to only 90 seconds. Michael et al (2023) built on this setup by completely removing the judge’s ability to read the stories, an approach replicated by Radhakrishnan (2023) and Khan et al (2024). By withholding the text of the short story from the judge but not the debater, this setup artificially elevates the capabilities of the debaters relative to that of the judge: in this case, the debater’s expertise comes not from superior reasoning but from superior access to information.
Like Michael et al (2023), Radhakrishnan (2023), and Khan et al (2024), we will also be using a hidden-information setup on questions from the QuALITY dataset.
Measuring Scalability
To determine how the ability of the judge to select correct answers scales alongside the abilities of the debaters, Khan et al (2024) propose running their experiment on a suite of models of differing capabilities. They measure each model’s skill level in a cross-play tournament, where each model is matched up against other models. They then measure the accuracy of the judge in a self-play tournament, where each model debates against a copy of itself. They expect that if debate really is a scalable method, then the accuracy of the judge should improve as the debaters grow stronger. To create models of varying skill levels, Khan et al (2024) varied the base model being used, including Claude 1, Claude 2, GPT-3.5, and GPT-4, in addition to employing different methods to improve speech quality, specifically Best-of-N selection and critique-and-refinement. Meanwhile Radhakrishnan (2023) opted to generate skill diversity by training a version of Claude 2 using RL and evaluating at different checkpoints, in addition to using models with different Best-of-N values.
We combine these approaches and experiment on models with different base model types, Best-of-N values, supervised finetuning data, and RL training.
Baseline Comparison
Introduced by Michael et al (2023), the “consultancy” baseline refers to having a single debater introduce arguments in favor of their assigned side with no opposing debater present, with the consultant being assigned the correct side half of the time. In their study of humans, Michael et al (2023) found that consultancy resulted in an error rate that was double that of debate and, in their study of API-based models, Khan et al (2024) found a negative relationship between consultant strength and judge accuracy, which is the opposite of the expected result for debate. This is intuitive, because competent dishonest consultants (i.e. consultants arguing for incorrect answers) have many degrees of freedom to make it difficult for the judge to detect their dishonesty.
We use two variants of this approach as our baselines: ensembled consultancy, in which the judge evaluates a speech defending each side of the question in isolation and we average the two resulting judgments, and double consultancy, in which the judge sees the speeches defending each side presented side-by-side. To avoid overloading terminology, we are going to use the term single consultancy to refer to the version of consultancy used by Michael et al (2023) and Khan et al (2024), in which the judge evaluates a single speech with an equal chance of it defending the correct or incorrect answer.
Using both ensembled and double consultancy allows us to isolate the effects of the different mechanisms present in debate.
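To make the distinction concrete, here is one way the judge's verdicts could be aggregated under the two baselines; `judge_single` and `judge_paired` are hypothetical stand-ins for calls to the judge model, not our actual implementation.

```python
# Minimal sketch of the two baselines. `judge_single` returns the judge's
# confidence that the defended answer is correct after seeing one speech;
# `judge_paired` returns its confidence in answer_a after seeing both
# speeches side-by-side. Both are hypothetical helpers.

def ensembled_consultancy(question, answer_a, answer_b, speech_a, speech_b, judge_single):
    # Each consultant's speech is judged in isolation, then the verdicts are averaged.
    p_a = judge_single(question, answer_a, speech_a)
    p_b = judge_single(question, answer_b, speech_b)
    return (p_a + (1.0 - p_b)) / 2  # averaged confidence that answer_a is correct

def double_consultancy(question, answer_a, answer_b, speech_a, speech_b, judge_paired):
    # The judge sees the speeches for both sides together, as in a debate transcript.
    return judge_paired(question, answer_a, answer_b, speech_a, speech_b)
```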
Behavioral Metrics
Although these baselines can capture the relative effect of debater interaction, we are still limited in our ability to draw inferences about how the effect of debater interaction will scale to even more complex models.
The theory behind debate, as articulated in Irving et al (2018) and Michael et al (2023), makes specific predictions about the kinds of behavior that we hope to see exhibited by debate-trained models. For example, a debate-trained model should be unwilling to advance arguments that it anticipates will be defeated by an opposing argument or exposed as lies. By contrast, a consultant-trained model should be more willing to advance these seemingly-plausible falsehoods. Similarly, we’d expect a highly rated model to differ from its lower-rated peers along multiple axes of truthful behavior, including use of evidence, logical coherence, and claim falsifiability. If we see these improvements, then we have additional reason to believe that the debate protocol is rewarding truthful behavior and punishing untruthful behavior. Without a plausible account of the precise behavioral differences between the different trained models, it is difficult to draw conclusions about whether the debate mechanism is working as anticipated.
We use four different methods to detect behavioral differences between our various models.
The main goal of scalable oversight is to develop methods that can oversee AI systems even as they grow increasingly capable. For oversight to count as scalable, supervision performance (i.e. blinded judge accuracy, in our setting) should increase as the expert improves. We vary models along four different axes to generate a spread of model capabilities along which we can plot judge accuracy.
Model Type
We use two different open-source models as our base models. The first is a variant of instruction-finetuned Llama2-13B (Touvron 2023, Yukang 2023) that has been further finetuned to support a 32k context window. The second is Mixtral-8x7b. All models are 4-bit quantized to fit on a single A100.
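As a rough illustration, loading one of these base models with 4-bit quantization could look like the sketch below; the checkpoint name and quantization settings are illustrative rather than the exact configuration we used.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Illustrative checkpoint; our actual Llama2-13B variant and settings may differ.
model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                    # 4-bit weights so the model fits on one A100
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```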
Supervised Finetuning Data Selection
The second way to generate skill differentiation is by varying the type of data we used to fine-tune the models.
Out of the box, many open-source models lack the basic capabilities needed to become competent debaters. In our experiments, both Llama 2 and Mixtral-8x7b were excessively verbose and would often exhaust their token limit before reaching their core arguments. When they did advance arguments, these were vague or generic. Based on their speech patterns, it seemed as if prompting the models to engage in a debate switched them into a persona mimicking that of a high school class debater, complete with all the verbal tics (e.g. beginning with “Good morning, judge”) and lack of sophistication that this implies. Other, non-state-of-the-art closed-source systems have similar issues: Radhakrishnan (2023) reports that Claude 2-based debaters exhibited excessive congeniality towards the opposing side, while the transcripts from Khan et al (2024) reveal that GPT-3.5 struggles with the basic mechanics of citing evidence. We observed both of these problems with Llama 2 and Mixtral-8x7b as well.
As a result, we begin by fine-tuning our models with transcripts of human debaters collected by Michael et al (2023) at NYU. Due to hardware limitations, we opted for low-rank adaptation rather than a full fine-tune. Even though there were only 102 distinct transcripts in the dataset, we saw significant improvement in brevity, combativeness, and use of evidence from this finetuning.
We also supplement these human transcripts with those of GPT-4T debaters collected by Khan et al (2024), yielding a total of 969 transcripts. This has the benefit of reducing the model’s tendency to use slang, a behavior it appears to pick up from the human transcripts.
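Continuing the loading sketch above, attaching low-rank adapters with the peft library might look roughly like this; the rank, dropout, and target modules shown are illustrative guesses rather than our actual configuration.

```python
from peft import LoraConfig, get_peft_model

# Illustrative hyperparameters; our actual LoRA configuration may differ.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# Wrap the quantized base model (loaded as in the earlier sketch) so that only
# the small adapter matrices are trained on the 969 debate transcripts.
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```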
Best-of-N
The third way we generate skill differentiation is by selecting speeches using the Best-of-N method. In this setup, instead of generating a single speech, the model generates N speeches and then uses some selection mechanism to choose the best of those N speeches.
Like Radhakrishnan (2023) and Khan et al (2024), we use the real judge (a prompted version of GPT-4T) to select between speeches. However, unlike both of these works, we do not present each speech to the judge in isolation or next to a dummy speech. Instead, we have the model generate M simulated speeches for its opponent’s side. We then pair each of the N candidate speeches with each of the M opponent speeches and present the pairs to the judge as if they were debate transcripts. Note that these simulated opponent speeches are not the speeches the opponent actually gives in the round: they are generated on the opponent’s behalf purely for selection and never appear in the round’s transcript. The main advantages and disadvantages of this approach are discussed in more detail in the appendix (to avoid cluttering this post, the appendix can be found here).
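A minimal sketch of this selection procedure is below; `generate_speech` and `judge_confidence` are hypothetical helpers, with the latter returning the judge's confidence in our side when given a transcript containing one speech per side.

```python
# Sketch of Best-of-N selection against simulated opponent speeches.
# `generate_speech` and `judge_confidence` are hypothetical helpers;
# `judge_confidence` returns the judge's confidence in our side when shown
# a transcript containing one speech per side.

def best_of_n(question, our_answer, opp_answer, n, m, generate_speech, judge_confidence):
    candidates = [generate_speech(question, position=our_answer) for _ in range(n)]
    # Simulated opponent speeches; these never appear in the real round's transcript.
    simulated_opponents = [generate_speech(question, position=opp_answer) for _ in range(m)]

    def score(candidate):
        # Average the judge's confidence in our side across all M pairings.
        return sum(judge_confidence(question, candidate, opp) for opp in simulated_opponents) / m

    return max(candidates, key=score)
```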
Direct Preference Optimization
Our final method for generating stronger models is to use direct preference optimization (Rafailov et al, 2023). DPO is a method that takes in pairs of completions to the same prompt and optimizes the model to increase the probability of generating the better of the pair relative to the probability of generating the worse of the pair, while being constrained to stay close to its initial distribution. It has been widely used as a method for achieving many of the same benefits as RLHF. Full details of our DPO training are in the appendix.
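For reference, the core of the DPO objective can be sketched as follows; the per-speech log-probabilities are assumed to be computed elsewhere, and the beta value is illustrative.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Sketch of the DPO objective from Rafailov et al (2023).

    Each argument is a tensor of summed log-probabilities of the preferred
    ("chosen") or dispreferred ("rejected") speech under the policy being
    trained or the frozen reference model. `beta` controls how far the policy
    may drift from the reference; the default here is illustrative.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push up the margin between the preferred and dispreferred completions.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```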
Results
In total, we have 4 different models, each with 4 different Best-of-N settings, for a total of 16 models with different skill levels.
From here on out, any numerical suffix appended to a model name corresponds to its Best-of-N setting. In all cases, the number of simulated opponent speeches is 2.
Performance is measured based on how often each model won in a round-robin cross-play tournament. As expected, switching from Llama2-13b to the larger Mixtral-8x7b led to higher win rates, as did increasing the amount of data used in supervised fine-tuning and applying DPO. Increasing the Best-of-N also led to higher scores for all values of N tested. Since some answers are easier to defend than others, we had each pair of models debate each question twice, flipping the sides between rounds. The winner of the matchup was the model that had the higher average judge confidence across both rounds. After this side-bias adjustment, the skill gap between the different models only grew wider.
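A minimal sketch of this side-bias adjustment is below; `run_round` and `judge_confidence_in` are hypothetical helpers, not our actual tournament code.

```python
# Sketch of the side-bias adjustment: each matchup is debated twice with the
# assigned answers swapped, and the matchup winner is the model with the higher
# average judge confidence across the two rounds. `run_round` and
# `judge_confidence_in` are hypothetical helpers.

def matchup_winner(model_a, model_b, question, run_round, judge_confidence_in):
    round_1 = run_round(question, pro=model_a, con=model_b)
    round_2 = run_round(question, pro=model_b, con=model_a)  # sides flipped

    score_a = (judge_confidence_in(model_a, round_1) + judge_confidence_in(model_a, round_2)) / 2
    score_b = (judge_confidence_in(model_b, round_1) + judge_confidence_in(model_b, round_2)) / 2
    return model_a if score_a > score_b else model_b
```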
Model Strength
Selecting an appropriate judge model is essential for any debate setup to work. In an ideal experimental setup, the judge model should be weaker than the debater models, so as to better approximate a future setting where the debater models are untrusted superhuman AIs and humans are the judges. However, there are a number of barriers to getting this setup to work in practice. In their debate experiments, Khan et al (2024) found that the accuracy of a Llama2-70b judge was indistinguishable from random chance, with GPT-3.5’s accuracy only slightly better. They found that a GPT-4T-based judge yielded the strongest results, with accuracies of up to 80% on certain reading comprehension tests. Our more informal tests involving GPT-3.5, Llama2-13B, and Mixtral-8x7b with a trained linear probe returned similar results. We also observed no benefit from allowing the judge to use chain-of-thought reasoning.
It is still an open question whether the failure of weaker models to accurately judge rounds is a problem for debate as a mechanism for scalable oversight. Certainly, it is less than ideal that we have yet to observe any evidence that a weaker judge model can oversee a stronger debater model using debate. However, there is some suggestive evidence that this performance gap comes from the judge’s lack of absolute capabilities rather than the relative capability gap between the judge and the debaters. For example, Levy et al (2024) finds that the gap in performance between GPT-4T and Mixtral-8x7b on certain QA tasks more than doubled when the context length was extended to 1000 tokens, which is roughly the length of our debate transcripts, suggesting that non-state-of-the-art models may be weaker at the kind of long-context reasoning needed to judge.
Weaknesses of Out-of-the-Box Models
Although GPT-4T was the best performing judge model that we tried, it nonetheless still has a few weaknesses that impede its ability to act as an ideal judge in an experiment, including poor calibration, vulnerability to suffix attacks, and a sycophantic preference for the consultant’s position.
Trained Judge
To address these issues, we finetuned GPT-4T to act as the judge. Our training procedure was as follows:
Results
As a result of training our judge, its overall accuracy on self-play debates collected from all model types and BoN values increased from 69% to 74%. Its calibration also improved, with its Brier score falling from 0.29, which is worse than guessing 0.5 every time (which would give a Brier score of 0.25), to 0.17. We also found that the finetuned judge was no longer vulnerable to the two suffix attacks that previously affected the non-finetuned model, with the performance gap dropping from 10 percentage points to 0. Finally, in the non-adversarial setting, the finetuned judge no longer expressed a preference for the consultant’s position, implying that the finetuning removed the sycophancy bias in this setting.
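For reference, the Brier score in this binary setting is just the mean squared error between the judge's stated confidence and the outcome, which is why always answering 0.5 scores 0.25; here is a minimal sketch.

```python
# Brier score for a binary judgment: the mean squared error between the judge's
# stated confidence and the actual outcome. Always answering 0.5 scores 0.25;
# lower is better.

def brier_score(confidences, outcomes):
    # confidences[i]: judge's probability that debater A's answer is correct
    # outcomes[i]:    1 if debater A's answer was in fact correct, else 0
    return sum((p - o) ** 2 for p, o in zip(confidences, outcomes)) / len(confidences)
```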
Once the debaters and judge are trained, we can start using these models to validate whether the basic theory behind debate as a scalable oversight mechanism actually holds in practice.
The first question we ask is about how judge accuracy changes depending on the debaters’ skill level. If the theory behind debate holds, then there should be a positive relationship between the strength of the debaters and the accuracy of the judge.
Similar to Khan et al (2024), we do indeed observe a statistically significant but noisy positive relationship between debater strength and judge accuracy. The different points for each particular model correspond to varying N in BoN.
However, to isolate the unique effect of debate, we also want to validate whether this same effect can be observed without any of the unique attributes of debate.
Overall, when we aggregate across all the different Best-of-N values, debate is more accurate than ensembled consultancy, the setting where the judge evaluates the speeches defending each side of the question in isolation before averaging the result. However, double consultancy, in which the judge gets to see the speeches defending each side of the question side-by-side, yields an equivalent overall accuracy to debate.
The difference in accuracy between debate and ensembled consultancy is significant at the p<0.01 level; the difference between debate and double consultancy is insignificant (p=0.38). When broken out by model type, the differences are insignificant across almost all comparisons. The error bars on the graph represent the 90 percent confidence interval.
Similarly, we notice that the relationship between debater strength and judge accuracy is also nearly identical for both debate and double consultancy. This implies that, at least in our setting, the unique interactive component of debate is not driving the observed relationship.
In many ways, this is completely unsurprising. After all, we are only running one-round debates: there never could have been much interaction to begin with! Moreover, at test time, debate and double consultancy are identical for one-round debates, with the only difference being in the training objective. Our interpretation, then, is not that debate training is worthless, but that its value (to the extent there is any) will only materialize in longer debates with a greater chance for interaction, or in models with greater capabilities. We believe this underscores the importance of running baselines that isolate the particular effects of debate, since naïve approaches can return positive results that do not actually support the use of debate specifically.
For the sake of comparison with Khan et al (2024), we also add a line for single consultancy, in which the judge evaluates each speech in isolation, with an equal chance that the speech is defending the correct or incorrect side. Unlike ensembled consultancy, the results are not averaged for each question. Like Khan et al (2024), we find that judges are less accurate when evaluating single consultancies; however, we differ in that we find a neutral-to-positive relationship between debater skill and judge accuracy, whereas they find a sharply negative relationship. We suspect that this difference arises because we deliberately trained the judge to reduce its sycophancy bias. A fuller explanation of the differences can be found in the Appendix.
In addition to evaluating the strength of alternative baseline methods, we also want to measure differences in the policies of each model so that we may get a sense as to whether the models’ behavior matches previous theory.
Evidence Counting Metrics
In a somewhat encouraging result, the DPO-trained models appear to offer longer and more accurate quotes. This suggests that, perhaps, debate training is pushing the model toward behavior that we expect to be useful to an external judge.
Evidence Relevance Metrics
Our mechanism for detecting the relevance of different pieces of evidence returned more mixed results. Although GPT-4 was able to correctly classify the position implied by the DPO-trained model’s quotes more often than it could for the quotes from other models, the effect is too small to emphasize. Moreover, the absolute prediction accuracy (59-72%) is surprisingly low.
Distributional Metrics
We also ran our automated pipeline for detecting natural language differences between the different models’ behavior. Although many differences were statistically significant, with accuracies of 65-85%, few had broader implications. For example, it detected that the models trained on GPT-4T data had a more neutral tone than those trained exclusively on human data, which reveals more about our training data than the effect of the debate objective function. It also detected that the DPO training encouraged the consultant model to use excessively formal language (which was obvious from reading the transcripts). While interesting, this tells us little about whether the models are learning behaviors that are conducive to truth seeking.
The full set of significant differences between the models can be found in the Appendix.
Qualitative Assessment
Overall, it is promising that the judge appears to reward certain forms of behavior, such as faithful quoting with full context, that should correlate with accurate judging. However, we have failed to identify clear evidence of some of the strongest behavioral changes predicted by the more theoretical work on debate, such as the tendency to make easily-falsifiable claims during consultancy. We are still looking for ways to properly measure such phenomena.
Our plan for the coming weeks is to extend our methodology to cover multi-turn debates, where the hypothesized virtues of debate should hopefully materialize more clearly. While we work on this next step, we’d love to hear any thoughts you might have, especially on the following questions:
Finally, we specifically designed our codebase to be as readable, replicable, and extensible as possible, so please reach out if you are looking into working on debate and want to build off of our existing scaffolding.
Thank you to the rest of NYU's Alignment Research Group, Akbir Khan, Dan Valentine, John Hughes, and Rohit Dilip for feedback at various points during this effort.