- Different labs - Models from different labs (Anthropic, OpenAI, DeepMind, Meta, xAI) likely have less correlated failure modes than models from the same lab.
Cross-lab monitoring seems like a great idea. As an implementation detail, I think labs should do cross-lab monitoring with AI models that have instructions not to aid the other lab with AI R&D, synthetic data generation, or anything besides monitoring. Of course, if you thought that Gemini was more likely to be aligned than Claude, then you might prefer that Anthropic use Gemini for automating AI R&D, but this is almost certainly infeasible. One other complication worth noting with cross-lab monitoring is that AI models from different labs will probably have different intelligence levels and different skill profiles. Thus, cross-lab monitoring may give us a monitor that sits in between trusted monitoring and untrusted monitoring in terms of both intelligence and “trustedness”. This might make untrusted monitoring preferable to cross-lab monitoring in some cases (e.g., if Claude is much smarter than all other models), but cross-lab monitoring still seems useful as a tool to add to our monitoring system.
Ironically, my attempt at writing scenarios, published on July 2, discussed this very technique and reached a similar conclusion: the technique decreases the chance that the co-designed ASI causes an existential catastrophe. However, there is also a complication: what if scheming AIs are biased in the same direction and simply collude with each other?
Hi Cleo, nice post! I agree with you that mixed deployments are a good tool to have in our toolbox, but I disagree somewhat with your model of why mixed deployments are good.
My thoughts on whether danger is convex/concave in the fraction of your deployment which is scheming:
Suppose that we have the following two options:
Which should we prefer? Let’s consider the threat model of weight exfiltration. Suppose that in the pure deployment world the misaligned AI’s chance of success is 30% if we do vanilla untrusted monitoring. Then, the overall risk in the first situation is 15%, because there was a 50% chance that the AI was aligned. So, we’d prefer the mixed deployment described in situation 2 if and only if the mixed deployment has risk below 15%. There are two reasons why the mixed deployment should have lower risk than a pure misaligned deployment:
I suspect that (1) is a very small effect: I talked to Adam and Cody (some of the authors of Ctrl-Z), and they thought that doubling deployment length while keeping the auditing budget a constant fraction of the deployment doesn't really change safety and usefulness much. I think that (2) is a larger effect, but I'm pretty unsure whether it's enough to halve the risk compared to a pure misaligned deployment (I think we should investigate this empirically).
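For concreteness, here is a tiny sketch of the break-even comparison from the example above. The 50% alignment prior and 30% exfiltration success rate are the illustrative numbers from that example; the mixed-deployment risk is left as a free parameter (the quantity I think should be estimated empirically), and the 10% figure at the end is purely hypothetical.

```python
# Illustrative numbers from the example above, not empirical estimates.
p_misaligned = 0.5      # prior probability that the pure-deployment model is scheming
p_success_pure = 0.3    # schemer's weight-exfiltration success rate under vanilla untrusted monitoring

risk_pure = p_misaligned * p_success_pure   # 0.5 * 0.3 = 0.15

def prefer_mixed(risk_mixed: float) -> bool:
    """Prefer the mixed deployment iff its overall risk beats the pure deployment's."""
    return risk_mixed < risk_pure

# Hypothetical: if effects (1) and (2) together cut risk to 10%, mixed wins.
print(risk_pure)           # 0.15
print(prefer_mixed(0.10))  # True
```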
Why I’m excited about mixed deployments despite uncertainty about the above question:
The situation in practice looks a lot better than a forced choice between a pure deployment that might be aligned or misaligned and a mixed aligned/misaligned deployment. Here are some reasons why the actual situation is better:
Implications for what research should be done
It might be valuable for someone to do a research project investigating how good mixed deployments are, compared to pure aligned and pure misaligned deployments. Specifically, here’s what I’m imagining:
I think the blue team might be able to do extremely well with mixed deployments, e.g., much better than halving risk compared to a pure misaligned deployment.
Summary: Suppose we have many different AI models, none of which we can trust not to be scheming. Should we deploy multiple copies of our most trusted model, or an ensemble of many different models? I claim that mixed deployment is better, and offer some recommendations.
In a pure deployment, where we deploy multiple copies of our most trusted model, either all our AIs are scheming, or none are.[1] Whereas in a mixed deployment, there might be some models scheming and some not. A pure deployment has the advantage of maximising the chance that no AI is scheming, but the mixed deployment has the advantage of maximising the chance that some AIs aren't scheming. Which advantage matters more?[2]
The optimal deployment depends on how the probability of catastrophe grows with the proportion of scheming AIs.
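To see why the shape matters, here is a toy illustration (my own, not from the original post): two models, each scheming independently with probability p, and a danger function F of the scheming proportion with F(0) = 0 and F(1) = 1.

Pure deployment (two copies of one model): E[danger] = p · F(1) = p.
Mixed deployment (one copy of each): E[danger] = p² · F(1) + 2p(1−p) · F(1/2).

The mixed deployment is safer iff p² + 2p(1−p) · F(1/2) < p, which simplifies to F(1/2) < 1/2, i.e. the midpoint convexity condition F(1/2) < (F(0) + F(1))/2. (With F(s) = s^η, this is just η > 1.)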
Another way to phrase this:
Below, I list considerations in favour of convexity and concavity. Overall, I think that the danger function is convex, so we should pursue a mixed deployment.
A non-scheming minority can unilaterally prevent catastrophe through several mechanisms:
A scheming minority can unilaterally cause catastrophe through several mechanisms:
Danger might scale proportionally with the fraction of scheming AIs:
I think convex dangers dominate. That is, I think it's more likely that one good actor suffices for safety than that one bad actor suffices for catastrophe. I'm particularly moved by Exposing signals, Backdoor revelation, and Capability revelation. Hence, I recommend a mixed deployment strategy.
That said, I think the strongest considerations for concave danger are Secret exfiltration, Research sabotage, and Blackmail, but I am hopeful that these dangers could be mitigated with tractable control measures.
Buck Shlegeris makes this point in an 80,000 Hours interview:
And a big difference between Google’s situation and the situation of this hypothetical AI company is that at Google, probably less than 1% of your employees are spies. [...] Whereas in the AI situation, it’s very correlated: if you have these AIs, there are probably either none scheming against you or all scheming against you, at least of instances of a particular model. So a lot of the security techniques that you often use in corporate IT are quite inapplicable. So that’s a way in which the situation seems harder than when we’re trying to guard against spies.
There are other advantages to mixed deployment that I do not consider here. I think that deploying schemers from different models likely has a lower chance of catastrophe than deploying schemers from the same model:
Incompatible goals - Scheming models might have incompatible goals, causing internal conflict that could lead them to work against each other or expose each other's plans.
Coordination friction - Schemers from different models are less likely to coordinate on Schelling points or communicate steganographically.
For simplicity, this article assumes that deploying schemers from different models has the same catastrophe probability as deploying schemers from the same model. I think this makes the case for mixed deployment conservative.
Consider a deployment scenario with 10 types of AI: one innocent type and 9 different suspicious types. They scheme according to a model with three latent states:
We model the danger function as F(s) = s^η where s is the proportion scheming and η determines the shape:
Let n be the proportion of deployed AIs that are innocent. With mixed deployment, we deploy proportion n of innocent and proportion (1-n)/9 of each suspicious type. A pure deployment corresponds to n=1.
The probability of catastrophe is P(n) = P(A) · 1 + P(B) · 0 + P(C) · E[(n·I + (1-n)·K/9)^η] where I ~ Bernoulli(0.1) and K ~ Binomial(9, 0.2).
When η = 5 (strongly convex), the optimal deployment is mixed:
When η = 1 (linear), pure innocent deployment is optimal since only average scheming rate matters.
When η < 1 (concave), pure innocent deployment is optimal since any scheming causes most of the danger.
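Here is a minimal numerical sketch of this toy model (my own code, under the assumptions above). Since P(A), P(B), and P(C) only contribute constants to P(n), the optimal mix depends only on the state-C term E[(n·I + (1−n)·K/9)^η], which the code evaluates on a grid of n:

```python
import numpy as np
from scipy.stats import bernoulli, binom

def expected_danger(n, eta, p_innocent=0.1, p_suspicious=0.2, n_suspicious=9):
    """E[(n*I + (1-n)*K/9)**eta] with I ~ Bernoulli(0.1), K ~ Binomial(9, 0.2),
    where n is the proportion of the deployment given to the innocent type."""
    total = 0.0
    for i in (0, 1):
        for k in range(n_suspicious + 1):
            s = n * i + (1 - n) * k / n_suspicious   # realized proportion scheming
            total += (bernoulli.pmf(i, p_innocent)
                      * binom.pmf(k, n_suspicious, p_suspicious)
                      * s ** eta)
    return total

grid = np.linspace(0, 1, 101)
for eta in (5, 1, 0.5):
    dangers = [expected_danger(n, eta) for n in grid]
    print(f"eta={eta}: optimal innocent proportion n = {grid[int(np.argmin(dangers))]:.2f}")
```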
With gradient routing, we train a single neural network on multiple environments while localizing the capabilities and tendencies from each environment to specified modules that can be ablated at inference. This enables mixed deployment from a single training run—instead of training 2^N models for N environments, you train once and create decorrelated instances through different ablation patterns. See Gradient routing is better than pretraining filtering.
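As a rough sketch of how this could look (my own simplified illustration, not the implementation from the gradient routing work): a shared backbone plus one adapter module per environment, where gradients from environment e's data are stopped at every adapter except adapter e, and any subset of adapters can be ablated at inference to get different deployment instances from one trained network.

```python
import torch
import torch.nn as nn

class RoutedModel(nn.Module):
    """Toy gradient-routing setup: shared backbone + one adapter per environment."""
    def __init__(self, d, n_envs):
        super().__init__()
        self.backbone = nn.Linear(d, d)
        self.adapters = nn.ModuleList([nn.Linear(d, d) for _ in range(n_envs)])
        self.head = nn.Linear(d, 1)
        self.ablated = set()   # adapter indices zeroed out at inference

    def forward(self, x, env=None):
        h = torch.relu(self.backbone(x))
        out = h
        for i, adapter in enumerate(self.adapters):
            if i in self.ablated:
                continue                      # ablate this environment's module
            a = adapter(h)
            if self.training and env is not None and i != env:
                a = a.detach()                # route gradients: only adapter `env` learns from this batch
            out = out + a
        return self.head(out)

# Training step on a batch from environment 2 (dummy data).
model = RoutedModel(d=16, n_envs=4)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x, y = torch.randn(8, 16), torch.randn(8, 1)
loss = nn.functional.mse_loss(model(x, env=2), y)
opt.zero_grad()
loss.backward()
opt.step()

# One training run, many deployment variants: each ablation pattern is a different instance.
model.eval()
model.ablated = {2}   # e.g. remove whatever was localized to environment 2
```

Different ablation patterns (e.g. {2} vs {1, 3}) then play the role of the different models in a mixed deployment, without training 2^N separate networks.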