Summary: Suppose we have many different AI models, none of which we can be confident isn't scheming. Should we deploy multiple copies of our most trusted model, or an ensemble of many different models?
I claim that mixed deployment is better, and offer some recommendations.
In a pure deployment, where we deploy multiple copies of our most trusted model, either all our AIs are scheming, or none are.[1] In a mixed deployment, by contrast, some models might be scheming and others not. A pure deployment has the advantage of maximising the chance that no AI is scheming, but a mixed deployment has the advantage of maximising the chance that some AIs aren't scheming. Which advantage matters more?[2]
The optimal deployment depends on how the probability of catastrophe grows with the proportion of scheming AIs.
Another way to phrase this: is the danger function (the probability of catastrophe as a function of the proportion of scheming AIs) convex or concave?
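To make the distinction concrete, here is a toy calculation of my own, with made-up numbers: each of 10 distinct models schemes independently with probability 0.2, and danger is s^η in the scheming proportion s (anticipating the functional form used later). A pure deployment makes s either 0 or 1; a mixed deployment makes s a binomial fraction.

```python
from math import comb

# Toy numbers of my own (not from the post): each model schemes independently
# with probability p, and danger is s**eta where s is the scheming proportion.
p, m = 0.2, 10   # scheming probability per model; number of distinct models

def pure_danger(eta):
    # Pure deployment: one model, so s is 1 with probability p, else 0.
    return p * 1**eta + (1 - p) * 0**eta

def mixed_danger(eta):
    # Mixed deployment of m different models: s = K/m, K ~ Binomial(m, p).
    return sum(comb(m, k) * p**k * (1 - p)**(m - k) * (k / m)**eta
               for k in range(m + 1))

for eta in (0.5, 1, 5):
    print(f"eta={eta}: pure={pure_danger(eta):.3f} mixed={mixed_danger(eta):.3f}")
```

For η > 1 (convex) the mixed deployment comes out far less dangerous in expectation, for η < 1 (concave) it comes out more dangerous, and at η = 1 the two tie.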
Below, I list considerations in favour of convexity and concavity. Overall, I think that the danger function is convex, so we should pursue a mixed deployment.
A non-scheming minority can unilaterally prevent catastrophe through several mechanisms:
A misaligned minority can unilaterally cause catastrophe through several mechanisms:
Danger might scale proportionally with the fraction of scheming AIs:
I think convex dangers dominate. That is, I think it's more likely that one good actor suffices for safety than that one bad actor suffices for catastrophe. I'm particularly moved by Exposing signals, Backdoor revelation, and Capability revelation. Hence, I recommend a mixed deployment strategy.
That said, I think the strongest considerations for concave danger are Secret exfiltration, Research sabotage, and Blackmail, but I am hopeful that these dangers could be mitigated with tractable control measures.
Buck Shlegeris makes this point in an 80,000 Hours interview:
And a big difference between Google’s situation and the situation of this hypothetical AI company is that at Google, probably less than 1% of your employees are spies. [...] Whereas in the AI situation, it’s very correlated: if you have these AIs, there are probably either none scheming against you or all scheming against you, at least of instances of a particular model. So a lot of the security techniques that you often use in corporate IT are quite inapplicable. So that’s a way in which the situation seems harder than when we’re trying to guard against spies.
There are other advantages to mixed deployment that I do not consider here. I think that deploying schemers from different models likely has a lower chance of catastrophe than deploying schemers from the same model:
Incompatible goals - Scheming models might have incompatible goals, causing internal conflict that could lead them to work against each other or expose each other's plans.
Coordination friction - Schemers from different models are less likely to coordinate on Schelling points or communicate steganographically.
For simplicity, this article assumes that deploying schemers from different models has the same catastrophe probability as deploying schemers from the same model. I think this makes the case for mixed deployment conservative.
Consider a deployment scenario with 10 types of AI: one innocent type and 9 different suspicious types. They scheme according to a model with three latent states:
A - every type is scheming.
B - no type is scheming.
C - each type schemes independently: the innocent type with probability 0.1, and each suspicious type with probability 0.2.
We model the danger function as F(s) = s^η, where s is the proportion scheming and η determines the shape: η > 1 gives a convex danger function, η = 1 a linear one, and η < 1 a concave one.
Let n be the proportion of deployed AIs that are of the innocent type. With a mixed deployment, we deploy proportion n of the innocent type and proportion (1-n)/9 of each suspicious type.
The expected danger is P(n) = P(A) · 1 + P(B) · 0 + P(C) · E[(n·I + (1-n)·K/9)^η] where I ~ Bernoulli(0.1) and K ~ Binomial(9, 0.2).
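Here is a sketch of that computation in Python. The state priors P(A), P(B), P(C) below are placeholders I have invented; since the state-A and state-B terms don't depend on n, their exact values shift the expected danger but not which deployment is optimal.

```python
from math import comb

# Placeholder state priors I have invented for illustration; the A and B terms
# don't depend on n, so they don't affect which deployment is optimal.
P_A, P_B, P_C = 0.1, 0.6, 0.3

def expected_danger(n, eta, p_innocent=0.1, p_suspicious=0.2, m=9):
    """P(n) = P(A)*1 + P(B)*0 + P(C)*E[(n*I + (1-n)*K/m)**eta],
    with I ~ Bernoulli(p_innocent) and K ~ Binomial(m, p_suspicious)."""
    e = 0.0
    for i in (0, 1):                        # does the innocent type scheme?
        p_i = p_innocent if i == 1 else 1 - p_innocent
        for k in range(m + 1):              # number of suspicious types scheming
            p_k = comb(m, k) * p_suspicious**k * (1 - p_suspicious)**(m - k)
            s = n * i + (1 - n) * k / m     # proportion of deployed AIs scheming
            e += p_i * p_k * s**eta
    return P_A * 1 + P_B * 0 + P_C * e

print(expected_danger(n=1.0, eta=5))   # pure innocent deployment
print(expected_danger(n=0.2, eta=5))   # one possible mixed deployment
```

With these placeholders, the mixed call (n = 0.2) returns a lower expected danger at η = 5 than the pure innocent call (n = 1).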
When η = 5 (strongly convex), the optimal deployment is mixed: some of the innocent type is deployed alongside the suspicious types.
When η = 1 (linear), pure innocent deployment is optimal, since only the average scheming rate matters.
When η < 1 (concave), pure innocent deployment is optimal since any scheming causes most of the danger.
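A quick grid scan over n reproduces these three cases. It drops the constant state-A and state-B terms, which don't affect where the minimum is, so no priors are needed.

```python
from math import comb

# Grid-scan the innocent share n for each eta. Only the state-C expectation
# varies with n, so the state priors are irrelevant to where the minimum is.
def state_c_danger(n, eta, p_i=0.1, p_s=0.2, m=9):
    return sum((p_i if i else 1 - p_i)
               * comb(m, k) * p_s**k * (1 - p_s)**(m - k)
               * (n * i + (1 - n) * k / m) ** eta
               for i in (0, 1) for k in range(m + 1))

grid = [x / 100 for x in range(101)]
for eta in (0.5, 1, 5):
    best = min(grid, key=lambda n: state_c_danger(n, eta))
    print(f"eta={eta}: optimal innocent share n = {best:.2f}")
```

For η = 5 the minimiser lands strictly between 0 and 1 (a mixed deployment), while for η = 1 and η = 0.5 it sits at n = 1 (pure innocent deployment), matching the cases above.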
With gradient routing, we train a single neural network on multiple environments while localizing the capabilities and tendencies from each environment to specified modules that can be ablated at inference. This enables mixed deployment from a single training run—instead of training 2^N models for N environments, you train once and create uncorrelated instances through different ablation patterns. See Gradient routing is better than pretraining filtering.
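Here is a highly simplified sketch of the idea, a toy construction of my own rather than the reference gradient routing implementation: each training environment's updates are routed to its own small module (plus a shared trunk), and at inference different ablation patterns over those modules yield differently-specialised instances from a single training run.

```python
import torch
import torch.nn as nn

# Toy sketch (my own construction): a shared trunk plus one small module per
# training environment. Gradients from environment e are routed only to the
# trunk and module e, so each module localises what was learned from that
# environment and can be ablated at inference.

N_ENVS = 3
trunk = nn.Linear(16, 16)
modules = nn.ModuleList([nn.Linear(16, 1) for _ in range(N_ENVS)])
opt = torch.optim.SGD(list(trunk.parameters()) + list(modules.parameters()), lr=1e-2)

def forward(x, ablated=frozenset()):
    h = torch.relu(trunk(x))
    # Ablating a module at inference means dropping its contribution entirely.
    return sum(m(h) for i, m in enumerate(modules) if i not in ablated)

for step in range(300):
    env = step % N_ENVS                            # which environment this batch is from
    x, y = torch.randn(8, 16), torch.randn(8, 1)   # placeholder data
    loss = ((forward(x) - y) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    for i, m in enumerate(modules):                # gradient routing: only module
        if i != env:                               # `env` (and the trunk) get updated
            for p in m.parameters():
                p.grad = None
    opt.step()

# One training run, many deployment instances: e.g. forward(x, ablated={1, 2})
# and forward(x, ablated={0}) behave as differently-specialised models.
```

With N localised modules, one training run supports up to 2^N ablation patterns at deployment time, which is what makes the mixed deployment cheap.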