NYU Debate Training Update: Methods, Baselines, Preliminary Results
[This writeup reflects work done jointly with David Rein and Julian Michael at NYU's Alignment Research Group]

Introduction

In the past year, there have been a number of projects aimed at validating the basic premises behind debate as a mechanism for scalable oversight (see here, here, and here). One important...
This seems like really interesting work! Would you be able to share any example transcripts from some of these debates? Since RLHF'ed models often shy away from combativeness, I'm curious about the form GPT-4's rebuttals take (especially for questions where the judge gets it right after reading the debate but wrong otherwise).