When does debate help a weak judge? Evidence from code, logic and math
2026.06.07 update: We’ve rewritten this entire post for clarity and added a new math setting to bolster our results.

Authors: Ethan Elasky and Frank Nakasako, Palaestra Research; Naman Goyal.

Link to arXiv preprint.

Thanks to Coefficient Giving for support and Thinking Machines for API credits; our mentor for guidance along...
May 26