TL;DR: We wrote a safety case sketch for control monitoring that takes into account the complexities of practical deployments.
This work was a collaboration between Google DeepMind and the UK AI Security Institute. Full author list: David Lindner*, Charlie Griffin*, Tomek Korbak, Roland S. Zimmermann, Geoffrey Irving, Sebastian Farquhar, Alan Cooney. Read the full paper here and the tweet thread here.
Real deployments are full of complexities that prior work abstracts away:
- many agent instances run in parallel
- human oversight can take a long time
- attacks can make incremental progress
- incriminating agents is difficult
Our paper explicitly addresses these.
We look at 3 types of monitoring: asynchronous (offline), semi-synchronous (blocking individual agent instances), and fully-synchronous (blocking the entire agent deployment). These give increasing levels of safety, but at sharply increasing latency costs.
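To make the distinction concrete, here is a minimal sketch of the three modes (our illustration, not the paper's implementation). The monitor, the threshold, and the action format are hypothetical stand-ins; only the blocking behaviour differs between modes.

```python
# A minimal sketch, assuming a monitor that returns a suspicion score in
# [0, 1]. The threshold, monitor, and action format are hypothetical
# stand-ins; only the blocking behaviour differs between the three modes.
from enum import Enum, auto
import random

class Mode(Enum):
    ASYNC = auto()      # offline: the action runs now, audit happens later
    SEMI_SYNC = auto()  # a suspicious action blocks only its own instance
    FULL_SYNC = auto()  # a suspicious action pauses the whole deployment

THRESHOLD = 0.9  # hypothetical suspicion score that triggers escalation

def monitor_score(action: str) -> float:
    """Stand-in for a trained monitor; returns a suspicion score in [0, 1]."""
    return random.random()

def handle_action(action: str, mode: Mode) -> str:
    suspicious = monitor_score(action) > THRESHOLD
    if mode is Mode.ASYNC:
        # Zero added latency, but a harmful action may complete before review.
        return "executed; queued for offline audit" if suspicious else "executed"
    if mode is Mode.SEMI_SYNC:
        # Other instances keep running; only this one waits for a human.
        return "instance paused pending review" if suspicious else "executed"
    # FULL_SYNC: maximum safety, but every instance pays the latency cost.
    return "deployment paused pending review" if suspicious else "executed"

if __name__ == "__main__":
    for mode in Mode:
        print(mode.name, "->", handle_action("delete_backups()", mode))
```

The asymmetry is the point: asynchronous monitoring adds no latency but can only catch harm after the fact, while the synchronous variants trade throughput for the ability to stop an action before it lands.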
We present a safety case sketch that identifies 3 important safety conditions: (1) the ability to detect scheming (oversight), (2) the ability to do so fast enough (latency), and (3) the ability to prevent harm and revert partial attack progress. We need to ensure all three to get safety!
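One way to see why all three conditions must hold together: a successful attack has to break at least one of them, by evading detection, outrunning the response, or causing harm that cannot be reverted. A back-of-the-envelope union bound with entirely made-up numbers (our framing, not figures from the paper):

```python
# Hedged sketch: all probabilities below are invented for illustration.
# Harm requires at least one safety condition to fail, so a union bound gives
#   P(harm) <= P(miss) + P(too_slow) + P(not_reverted).
p_miss = 0.010          # hypothetical: monitor never flags the attack (oversight fails)
p_too_slow = 0.005      # hypothetical: flagged, but the response arrives too late (latency fails)
p_not_reverted = 0.002  # hypothetical: response in time, but harm not prevented or undone

print(f"P(harm) <= {p_miss + p_too_slow + p_not_reverted:.3f}")  # 0.017
```

Because the bound is a sum, no single condition can compensate for another: driving one term to zero still leaves the other two.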
We apply the argument to a set of case studies and find that asynchronous monitoring is promising for some, but not all, threat models, and that semi-synchronous monitoring can get us further. Monitoring is promising, but it should be only one part of an overall AI control strategy.