When the Smarter AI Lies Better: Can Debate-Based Oversight Catch Deceptive Code?
Authors: Oskar Kraak (Independent), Abigail Lupi (Independent), Dwayne Wilkes (AISI@GATECH)

With Redwood Research and Apart Research

TL;DR: In preliminary experiments we found that applying AI debate in an AI Control setting leads to more uncaught backdoors than a trusted monitoring baseline. We interpret this as communication enabling deception. What...