Boaz Barak, Gabriel Wu, Jeremy Chen, Manas Joglekar
[Linkposting from the OpenAI alignment blog, where we post more speculative/technical/informal results and thoughts on safety and alignment.]
TL;DR We go into more details and some follow up results from our paper on confessions (see the original blog post). We give deeper analysis of the impact of training, as well as some preliminary comparisons to chain of thought monitoring.
We have recently published a new paper on confessions, along with an accompanying blog post. Here, we want to share with the research community some of the reasons why we are excited about confessions as a direction of safety, as well as some of its limitations. This blog... (read 2407 more words →)