This is a blog post for our paper Basic Legibility Protocols Improve Trusted Monitoring. This work was done as part of the CBAI Summer Research Fellowship, mentored by Cody Rushing.
In this blog post, we go over the main results and focus more on the engineering and methodological takeaways than we did in the paper. Notably, the section on red-teaming takeaways is not in the paper. This post is intended for readers familiar with AI control; we assume basic knowledge of the control agenda and the original control paper.
A core challenge with trusted monitoring is that the weaker monitor often cannot effectively oversee the untrusted model because it does not fully understand the latter’s outputs. To address this, we introduce legibility protocols, which augment trusted...
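To make the setup concrete, here is a minimal, hypothetical sketch of trusted monitoring with a legibility requirement. All names, thresholds, and heuristics below are illustrative assumptions, not the paper's actual protocol or monitor; in the real setting the monitor is a weaker LLM, not a keyword heuristic.

```python
from dataclasses import dataclass

@dataclass
class Solution:
    code: str
    explanation: str  # legibility: the untrusted model must explain its code

def monitor_suspicion(solution: Solution) -> float:
    """Stand-in for the trusted monitor: returns a suspicion score in [0, 1].

    A toy heuristic for illustration only; the actual monitor is a weaker
    trusted model reading both the code and its explanation.
    """
    red_flags = ["exec(", "eval(", "os.system"]
    hits = sum(flag in solution.code for flag in red_flags)
    # Illegible submissions (no explanation) are treated as more suspicious.
    penalty = 0.3 if not solution.explanation.strip() else 0.0
    return min(1.0, 0.25 * hits + penalty)

def trusted_monitoring(solution: Solution, audit_threshold: float = 0.5) -> str:
    """Accept the output, or escalate to a (limited) human audit budget."""
    score = monitor_suspicion(solution)
    return "audit" if score >= audit_threshold else "accept"
```

The key design choice a legibility protocol adds is that the untrusted model's output is only scored together with an explanation, so opaque or unexplained code is penalized rather than waved through.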