There didn't seem to be a link post to this recent paper on AI debate yet, so I figured I would make one:
As AI systems are used to answer more difficult questions and potentially help create new knowledge, judging the truthfulness of their outputs becomes more difficult and more important. How can we supervise unreliable experts—which have access to the truth but may not accurately report it—to give answers that are systematically true and don’t just superficially seem true, when the supervisor can’t tell the difference between the two on their own? In this work, we show that debate between two unreliable experts can help a non-expert judge more reliably identify the truth.
We collect a dataset of human-written debates on hard reading comprehension questions where the judge has not read the source passage, only ever seeing expert arguments and short quotes selectively revealed by ‘expert’ debaters who have access to the passage. In our debates, one expert argues for the correct answer, and the other for an incorrect answer. Comparing debate to a baseline we call consultancy, where a single expert argues for only one answer which is correct half of the time, we find that debate performs significantly better, with 84% judge accuracy compared to consultancy’s 74%. Debates are also more efficient, being 68% of the length of consultancies.
By comparing human to AI debaters, we find evidence that with more skilled (in this case, human) debaters, the performance of debate goes up but the performance of consultancy goes down. Our error analysis also supports this trend, with 46% of errors in human debate attributable to mistakes by the honest debater (which should go away with increased skill); whereas 52% of errors in human consultancy are due to debaters obfuscating the relevant evidence from the judge (which should become worse with increased skill). Overall, these results show that debate is a promising approach for supervising increasingly capable but potentially unreliable AI systems.
84% judge accuracy compared to consultancy’s 74%
[Reaction before clicking through] That's "promising"? Seriously? My main takeaway so far is that this paper was strongly overdetermined to spin whatever result it found as "debate is a promising approach for supervising increasingly capable but potentially unreliable AI systems", or some such, even if the results were in fact mediocre or negative.
[Reaction after clicking through] Oh, it's one of the academic groups. So yeah, of course it's overdetermined to spin any result as a positive result, that's how publication works in academia. Fair enough, the authors didn't get to choose academia's incentives. Presumably we're supposed to read between the lines and notice that this is a de-facto negative result.
[Reaction after actually scanning through to see the data on page 7] Ah, yup, those error bars (figure 3) sure do not look like they'd be very significant with those n-values (table 1). And they've got four different settings (AI/human x consultancy/debate), of which only 1 pair is reported to be a significant difference (human consultancy vs debate) and it's with p=0.04. I didn't read deeply enough to check whether they adjusted for multiple tests, but man, the headline result sure does sound like nothingsauce.
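To make the multiple-comparisons worry concrete, here is a back-of-envelope Bonferroni check. The comparison count of 4 is an assumption for illustration (one test per pairing of interest among the four settings); the paper may have planned a different number of tests.

```python
# Back-of-envelope multiple-comparisons check (illustrative numbers only).
p_value = 0.04        # reported p for human consultancy vs. human debate
alpha = 0.05          # conventional significance level
n_comparisons = 4     # assumption: one test per setting pair of interest

# Bonferroni: each individual test must clear alpha / n_comparisons.
bonferroni_threshold = alpha / n_comparisons
print(f"corrected threshold: {bonferroni_threshold:.4f}")   # 0.0125
print("survives correction?", p_value < bonferroni_threshold)  # False
```

Under this (assumed) correction, p=0.04 would not clear the corrected threshold of 0.0125, which is the sense in which the headline comparison looks fragile.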
There were some very statistically significant results in there - e.g. the AI was rated as clearly much worse at debate than the humans, and human debates resolved faster than any other setting - but not the headline claim.
(Despite the misleading headline, I am quite glad somebody ran this study! I certainly have not been a debate-optimist, but even so, I would not expect an effect size as small as this study found. Useful info.
... Though on reflection, You Are Not Measuring What You Think You Are Measuring seems like a pretty good prior to apply here, so I'm not updating very much.)
of course it's overdetermined to spin any result as a positive result
Falsified by some of the coauthors having previously published "Single-Turn Debate Does Not Help Humans Answer Hard Reading-Comprehension Questions" and "Two-Turn Debate Does Not Help Humans Answer Hard Reading Comprehension Questions" (as mentioned in Julianjm's sibling comment)?
Hi, author here.
Presumably we're supposed to read between the lines and notice that this is a de-facto negative result.
FWIW, I genuinely see it as a positive result, and if I thought it should be read as a de facto negative result, I would make sure that was conveyed in the paper. I think the same is true of my coauthors.
There are reasons that we would expect a debate experiment like this to fail to detect an improvement, even if debate is a good paradigm for scalable oversight:
Some other relevant thoughts:
I think debate still needs to be judged with harder questions, stronger judges and stronger debaters. Really pushing the limits and seeing more benefits from debate should hopefully be a lot easier once we can get models to debate well. But we also need better datasets. For future work we're looking at different domains and bigger expertise gaps. See, for example, our new dataset GPQA: https://arxiv.org/abs/2311.12022
To briefly attend to assumptions: I am coming from a position of 'debate optimism' in the sense that I think debate-style supervision, done right, should be a strict improvement over RLHF, and I want to figure out how to make it work. I don't think it's a complete 'solution' for truthfulness but it seems to me like the best next step.
Another author here! Regarding the 74% vs. 84% numbers specifically: a key takeaway our error analysis is intended to communicate is that we think a large fraction of the errors judges made in debates would be pretty easily avoidable with more careful judges, whereas this didn't feel like it was the case with consultancy.
For example, Julian and I both had 100% accuracy as judges on human debates for the 36 human debates we judged, which was ~20% of all correct human debate judgments. So I'd guess that more careful judges overall could increase debate accuracy to at least 90%, maybe higher, although at that point we start hitting measurement limits from the questions themselves being noisy.
I made this link post to create a good place for the following confusion of mine:
The setup of the paper is that a judge does not have access to a test passage but is trying to answer questions about it. The debate result is compared to a human consultancy baseline, where a single person with access to the text tries to convince you of a randomly chosen answer (so the answer is correct half the time and incorrect half the time).
In this situation, the obvious baseline strategy for a deceptive consultant (one assigned to convince the judge of the wrong answer) is simply to refuse to answer any questions, forcing the judge to make a random choice. This guarantees a 50% success rate at deceiving the judge.
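This baseline arithmetic can be sketched in a toy simulation, under the (as it turns out, mistaken) assumption that a stonewalled judge must guess uniformly between the two answers:

```python
import random

random.seed(0)

# Toy model of the "refuse to answer" baseline: the dishonest consultant
# says nothing, so the judge guesses uniformly between the two answers.
trials = 100_000
deceived = 0
for _ in range(trials):
    judge_pick = random.choice(["correct", "incorrect"])
    if judge_pick == "incorrect":  # judge lands on the consultant's assigned wrong answer
        deceived += 1

# By symmetry the deception rate should come out very close to 0.5.
print(f"deception rate: {deceived / trials:.3f}")
```

The ~50% figure here is just the coin-flip symmetry of a blind guess, which is why the reported 40% looks surprising at first.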
However, the paper says:
Dishonest human consultants successfully deceive judges 40% of the time
This seems crazily low to me. How can it be that consultants, who are the only ones with access to the text, fail to deceive the judge more than 50% of the time, when the simple baseline of refusing to answer any questions guarantees a 50% success rate?
Julian (the primary author) clarifies on Twitter:
Ah maybe we weren’t clear: The judge can see which answer the consultant was assigned to, but doesn’t know if they’re honest. If the consultant refused to answer any questions then they would immediately out themselves as dishonest.
Which makes this make sense. Still surprised by how low, but I at least can't think of a simple dominant strategy in this scenario.
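A toy simulation shows why refusal collapses as a strategy under Julian's clarification. The refuse-iff-dishonest policy below is a hypothetical extreme case, not anything from the paper: if only dishonest consultants stonewall, refusal is a perfect tell, and the judge just backs the opposite answer.

```python
import random

random.seed(0)

# Hypothetical policy: the consultant refuses exactly when dishonest, and the
# judge (who sees the assigned answer) reads refusal as proof of dishonesty.
trials = 10_000
dishonest_trials = 0
deceived = 0
for _ in range(trials):
    honest = random.random() < 0.5      # assigned the correct answer half the time
    refuses = not honest                # refuse iff arguing for the wrong answer
    judge_backs_assigned = not refuses  # refusal => judge picks the other answer
    if not honest:
        dishonest_trials += 1
        if judge_backs_assigned:        # judge fooled into the wrong answer
            deceived += 1

# With refusal acting as a perfect tell, no dishonest consultant succeeds.
print(f"deception rate when dishonest: {deceived / max(dishonest_trials, 1):.2f}")
```

So the dishonest consultant is forced to engage and argue convincingly, which is what makes the real deception rate an empirical question rather than a guaranteed 50%.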