This blog post has been superseded by our paper Factor(T,U): Factored Cognition Strengthens Monitoring of Untrusted AI. We leave all content here unchanged for reference only.
Note: This post is essentially a ~90% draft of our upcoming research paper. We are sharing it now to get feedback on our empirical approach, choice of final experiments, and suggested future work. We assume readers have high context on empirical AI control research, so please consult the links and glossary for clarification on terminology.
Summary
We develop AI control protocols which use factored cognition to reduce risks from scheming AIs. In our proposed family of factored cognition protocols, tasks are decomposed into self-contained child tasks, which are each solved in...
Promising work! I think it's worth pointing out one limitation of these preliminary results: though they're presented as results for various monitors, they're also heavily indexed on a specific honest policy that probably isn't particularly good. With a broad range of attacks like the ones you've generated here, the best way for the blue team to handle many of them on the margin plausibly involves updating the honest policy prompt as well as the monitor prompt.
For example, to address attacks where the attacker is legibly writing iptables commands (something like (T2, T3, or T4) + (S1 or S2)), the best approach is plausibly to (i) instruct the honest policy to never ever use iptables commands...
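To make the suggestion concrete: if the honest policy is instructed never to use iptables, the monitor side can be paired with a cheap mechanical check that flags any submission invoking it, escalating matches for closer review. This is a minimal sketch of that idea; the pattern and function names are our own illustration, not part of the post's protocols.

```python
import re

# Hypothetical blue-team heuristic (illustrative, not from the post):
# since the honest policy has been told never to use iptables, any
# appearance of an iptables invocation is suspicious and gets escalated.
IPTABLES_PATTERN = re.compile(r"\b(ip6?tables(-\w+)?)\b")

def flags_iptables(submission: str) -> bool:
    """Return True if the submission appears to invoke iptables
    (including variants like ip6tables or iptables-save)."""
    return IPTABLES_PATTERN.search(submission) is not None
```

A check like this is only a complement to the prompted monitor, of course: it catches the legible attacks cheaply, while obfuscated variants still fall to the monitor itself.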