This distinction reminds me of Evading Black-box Classifiers Without Breaking Eggs, in the black box adversarial examples setting.
In this post, I’ll make a technical point that comes up when thinking about risks from scheming AIs from a control perspective.
In brief: Consider a deployment of an AI in a setting where it’s going to be given a sequence of tasks, and you’re worried about a safety failure that can happen suddenly. Every time the AI tries to attack, one of three things happens: we win, we lose, or the deployment continues (and the AI probably eventually attempts to attack again). So there’s two importantly different notions of attacks “failing”: we can catch the attack (in which case we win), or the attack can fail silently, either because it’s defective or because it’s defused by safety techniques, such that the attack doesn’t end up causing an unacceptable outcome but the AI also doesn’t get caught (in which case the deployment continues). When reasoning about how safe a protocol is, it’s crucial to distinguish between these two ways an attack can fail. And when designing protocols, it’s often worth thinking separately about the techniques that make the AI lose and the techniques that let the deployment continue without a win or loss.
This point has been made in passing in many places (our control paper, our toy models post, Charlie Griffin’s control paper, Towards Safety Cases For AI Scheming, various upcoming papers) but I wanted to have a single place that makes the point clearly. Thanks to Alex Mallen, Josh Clymer, Joe Benton, and Justis Mills for comments, and to Monte MacDiarmid for providing experimental data that I analyze.
Here’s a more precise description of the scenario:
Our first control paper (and various follow-up papers and related posts) analyzes a setting with these properties.
So you deploy the AI with some control measures. In many cases, the optimal strategy for the AI is to normally act aligned, and on very rare occasions, try to cause the unacceptable outcome—we’ll call those attempts “attacks”.
For concreteness, let’s talk about a particularly important simple class of control protocols, which I’ll call execute/replace/audit protocols.
An execute/audit/replace protocol works as follows: We have an agent loop that repeatedly generates inputs to a LLM and then calls an inference service on those inputs, getting a proposed action in response (see here for more on what I’m imagining). Then we produce an estimate of how dangerous or suspicious this action is. There are many techniques you could use for this: asking trusted or untrusted models to evaluate the danger of proposed actions, using probes on the policy trained on off-policy examples or produced other ways, or using interpretability techniques like sparse autoencoders to detect deceptive intent. All of these methods take a proposed action (together with some surrounding context) and produce a score describing how dangerous and/or suspicious the action is. I’ll call that the monitor and call its response the monitor score.
Then, based on the monitor score, we need to make two decisions.
So in general we want to get two scores from our monitor, which we use to make the two choices of whether to replace and whether to audit. You could use the same monitor with different thresholds for these two choices, which is how we handled it in our original control paper, but it’s probably better to make the decisions separately.
The two defensive actions of auditing and replacing have quite different safety properties.
Auditing is probably drastically more expensive, so you’ll probably do it much less. In particular, if you need to use human auditors and you’re able to replace actions by resampling, it’s totally affordable to resample your model’s actions 10% of the time, but the proportion of the model actions you can afford to have a human look at is drastically lower than that. E.g. OpenAI obviously can’t afford to have a human look at even a tiny fraction of the (very roughly estimated[1]) 100 trillion tokens that their API handles per year; with a staff of 1000 fast readers, they could read 0.1% of it.
One consequence is that you should generally replace whenever you audit.
Let’s reason through what happens when the AI attacks. First, let’s think about a simple version. In this version, auditing is perfect: the auditing is able to incriminate the AI so convincingly that it can shut down after observing any attack. And replacing is perfect: every attack is neutralized by replacing (which might not be true in reality, e.g. for a replace strategy like trusted editing).
So the ways the deployment can end here are:
To calculate safety here (as described in Toy models of AI control for concentrated catastrophe prevention), you can make the following assumptions:
Then the odds of a blue team victory is the ratio between the probability of the following two events occurring on a particular step:
For example, if on the average attack attempt, there’s a 10% chance that the attack will be executed and be successful, and there’s a 20% chance that the attack will be audited and then the deployment will be shut down, then the overall P(blue team victory) is ⅔. It’s easy to modify this rule of thumb to relax some of those assumptions.
In Simple probes can catch sleeper agents, Monte MacDiarmid makes a linear probe monitor with a simple contrast pair technique on the model from Sleeper Agents. He reports the following ROC curve, graphing true positive rate (TPR) against false positive rate (FPR):
An obvious question is: if this was the actual ROC curve that we got with the actual monitor we were using on an actual scheming model, would we be safe?
Let’s pretend we’re using a perfect auditor that we can afford to use 0.1% of the time. We need to use the monitor as a classifier with a 0.1% FPR in order to keep our auditor fully utilized. According to the data that Monte generously gave me, at FPR=0.001, TPR=0.26. So the probability of an attack getting audited is 26%.
Now, let’s assume we’re also using this probe’s output for a replace strategy with a 5% replacement rate. At FPR=0.05, FNR=0.03. So when the model attacks, the chance that the attack is executed is 3%. (Let’s assume that every attack succeeds.)
Thus when the model attacks, there’s a 26% chance it gets audited, a 3% chance the attack gets executed, and the rest of the probability is that the attack is replaced. So there’s a 0.26/(0.26 + 0.03) = 90% chance that the model is caught before an attack is executed, and in expectation the model will have a few attacks replaced before it either wins or loses.
Some things to observe here:
So whether this probe is good enough for safety crucially depends on the details of the protocol in which the probe is being used as a monitor.