ethanelasky — LessWrong

ethanelasky

39

2y

When does debate help a weak judge? Evidence from code, logic and math

2026.06.07 update: We’ve rewritten this entire post for clarity and added a new math setting to bolster our results. Authors: Ethan Elasky and Frank Nakasako, Palaestra Research; Naman Goyal. Link to arXiv preprint. Thanks to Coefficient Giving for support and Thinking Machines for API credits; our mentor for guidance along...

Inference-time Generative Debates on Coding and Reasoning Tasks for Scalable Oversight

By Ethan Elasky and Frank Nakasako (equal contribution). We tested generative debate (where participants freely formulate their own positions) on coding and reasoning tasks using weak-judge/strong-debater setups. The results are mostly negative: debate underperforms consultancy in 11 of 16 conditions, and removing debate transcripts entirely (just showing the judge the debater-generated...