Inference-time Generative Debates on Coding and Reasoning Tasks for Scalable Oversight
By Ethan Elasky and Frank Nakasako (equal contribution)
We tested generative debate (where participants freely make their positions) on coding and reasoning tasks using weak-judge/strong-debater setups. The results are mostly negative: debate underperforms consultancy in 11 of 16 conditions, and removing debate transcripts entirely (just showing the judge the debater-generated...
Feb 265