AI Control: Improving Safety Despite Intentional Subversion
We’ve released a paper, AI Control: Improving Safety Despite Intentional Subversion. This paper explores techniques that prevent AI catastrophes even if AI instances are colluding to subvert the safety techniques. In this post:

* We summarize the paper;
* We compare our methodology to the methodology of other safety papers....
extreme nit, you probably meant for this to be lowercase. I love this series!