The Case for Mixed Deployment

by Cleo Nardo
11th Sep 2025

Comments (4)
[-] Alek Westover · 1mo
  • Different labs - Models from different labs (Anthropic, OpenAI, DeepMind, Meta, xAI) likely have less correlated failure modes than models from the same lab.

Cross-lab monitoring seems like a great idea. As an implementation detail, I think labs should do cross-lab monitoring with AI models that have instructions to not aid the other lab with AI R&D, synthetic data generation, or anything besides monitoring. Of course, if you thought that Gemini was more likely to be aligned than Claude, then you might prefer that Anthropic use Gemini for automating AI R&D, but this is almost certainly infeasible.

One other complication worth noting with cross-lab monitoring is that the AI models from different labs will probably have different intelligence levels and different skill profiles. Thus, cross-lab monitoring may be giving us a model which is in between trusted monitoring and untrusted monitoring, both in terms of intelligence and “trustedness”. This might make untrusted monitoring preferable to cross-lab monitoring in some cases (e.g., if Claude is much smarter than all other models), but cross-lab monitoring still seems useful as a tool to add to our monitoring system.

[-] Cleo Nardo · 1mo

I would be very pleased if Anthropic and OpenAI maintained their collaboration.[1][2]

Even better: this arrangement could be formalised through a third party (e.g. a government org or non-profit), and other labs could be encouraged or obliged to join.

  1. ^

    OpenAI's findings from an alignment evaluation of Anthropic models

  2. ^

    Anthropic's findings from an alignment evaluation of OpenAI's models

[-] StanislavKrym · 1mo

Ironically, my attempt at writing scenarios, published on July 2, discussed this very technique and reached a similar conclusion: the technique decreases the chance that the co-designed ASI causes an existential catastrophe. However, there is also a complication: what if scheming AIs are biased in the same direction and simply collude with each other?

[-] Alek Westover · 1mo

Hi Cleo, nice post! I agree with you that mixed deployments are a good tool to have in our toolbox, but I disagree somewhat with your model of why mixed deployments are good.

My thoughts on whether danger is convex/concave in the fraction of your deployment which is scheming:
Suppose that we have the following two options:

  1. 50% chance of a pure aligned deployment, 50% chance of a pure misaligned deployment.
  2. A mixed deployment, where a random 50% of actions are taken by an aligned AI, and the other 50% of actions are taken by a misaligned AI.

Which should we prefer? Let’s consider the threat model of weight exfiltration. Suppose that in the pure deployment world the misaligned AI’s chance of success is 30% if we do vanilla untrusted monitoring. Then, the overall risk in the first situation is 15%, because there was a 50% chance that the AI was aligned. So, we’d prefer the mixed deployment described in situation 2 if and only if the mixed deployment has risk below 15%. There are two reasons why the mixed deployment should have lower risk than a pure misaligned deployment:

  1. In the mixed deployment, the misaligned AI has ½ as many chances to take bad actions compared to a pure misaligned deployment.
  2. Using a powerful aligned model to monitor the misaligned model might be more effective than having a copy of the misaligned model monitor the misaligned model.

I suspect that (1) is a very small effect (I talked to Adam and Cody (some authors on Ctrl-Z), and they thought that doubling deployment length while keeping auditing budget a constant fraction of the deployment doesn’t really change the safety and usefulness much). I think that (2) is a larger effect, but am pretty unsure if it’s enough to halve the risk compared to a pure misaligned deployment (I think we should investigate this empirically).

Why I’m excited about mixed deployments despite uncertainty about the above question:
The situation in practice looks a lot better than a forced choice between a pure deployment that might be aligned / misaligned vs a mixed aligned / misaligned deployment. Here are some reasons why the actual situation is better:

  • As you mention, we could use one AI (e.g., the least suspicious one) for taking actions, and other AIs for monitoring this AI.
    • If the AI that is taking actions is aligned, then we are in great shape, even if the monitor AI is misaligned: the monitor seems like it’s probably not in a good position to cause lots of harm here.
    • If the AI that is taking actions is misaligned, and the monitor is aligned, then it still seems like we’re probably in a better spot than vanilla untrusted monitoring, because the monitor will be trying its best to block or catch bad actions.
    • If the AI taking actions is misaligned, and the monitor is also misaligned, but they are misaligned in different ways and are less capable of / inclined to collusion than if we had done vanilla untrusted monitoring, then we’re still in a better spot than we would be counterfactually. For this reason, I think that it might be a good idea to train clearly misaligned AI models (e.g., reward seekers) and use these as monitors on some occasions.
  • We can adjust the parameters of our mixed deployment in response to evidence. If we collect evidence that makes us think that one AI is more likely to be misaligned, we can stop deploying that AI, or only use that AI as a monitor.

Implications for what research should be done
It might be valuable for someone to do a research project investigating how good mixed deployments are, compared to pure aligned and pure misaligned deployments. Specifically, here’s what I’m imagining:

  • You have a control setting, e.g., “main task: solve APPS coding questions, side task: insert a backdoor”.
  • The blue team is given two powerful models, each of which is either aligned or misaligned.
  • The blue team uses some combination of these models to try to solve the coding questions as well as possible, while avoiding backdoors via monitoring. The blue team also wins if they incriminate an AI.
  • It’s totally fair game for the blue team to decide “this model has done some suspicious things, so we’ll use the other model instead”.
  • We measure how much safety the blue team can achieve given a certain usefulness requirement.

I think the blue team might be able to do extremely well with mixed deployments, e.g., much better than halving risk compared to a pure misaligned deployment.

The Case for Mixed Deployment

Summary: Suppose we have many different AI models, none of which we can confidently say isn't scheming. Should we deploy multiple copies of our most trusted model, or an ensemble of many different models? I claim that mixed deployment is better, and offer some recommendations.

1. The case for mixed deployment

In a pure deployment, where we deploy multiple copies of our most trusted model, either all our AIs are scheming or none are.[1] In a mixed deployment, by contrast, some models might be scheming while others are not. A pure deployment has the advantage of maximising the chance that no AI is scheming, but a mixed deployment has the advantage of maximising the chance that some AIs aren't scheming. Which advantage matters more?[2]

The optimal deployment depends on how the probability of catastrophe grows with the proportion of scheming AIs.

  • If this function is convex, then most danger comes from having a high proportion of scheming AIs, so mixed deployment minimizes risk by making it unlikely you'll get close to 100% scheming.
  • If the function is concave, then most danger comes from having any scheming AIs at all, so pure deployment of your most trusted model minimizes risk by giving you the best chance of near-0% scheming.[3]

Another way to phrase this:

  • Could a non-scheming minority unilaterally help you despite a scheming majority? If so, pursue mixed deployment.
  • Could a scheming minority unilaterally harm you despite a non-scheming majority? If so, pursue pure deployment.
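
A rough formalisation of this trade-off, holding the average scheming rate fixed (in practice a pure deployment of the most trusted model would also lower that rate, a complication the toy model in footnote [3] works through): let $S$ be the random fraction of deployed AIs that turn out to be scheming, let $F$ be the danger function, and write $\mu = \mathbb{E}[S]$. Expected risk is $\mathbb{E}[F(S)]$. If $F$ is convex, then for every $S \in [0,1]$,

$$F(S) \;\le\; (1-S)\,F(0) + S\,F(1) \quad\Longrightarrow\quad \mathbb{E}[F(S)] \;\le\; (1-\mu)\,F(0) + \mu\,F(1),$$

and the right-hand side is exactly the risk of a pure deployment with the same average scheming rate, since there $S$ is forced to be $0$ or $1$. Concave $F$ reverses the inequality and favours the pure deployment; linear $F$ makes only $\mu$ matter.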

Below, I list considerations in favour of convexity and concavity. Overall, I think that the danger function is convex, so we should pursue a mixed deployment.

1.1. Why danger might be convex

A non-scheming minority can unilaterally prevent catastrophe through several mechanisms:

  1. Exposing signals - Honest AI exposes coordination signals between schemers, which humans use to build honeypots.
  2. Backdoor revelation - Honest AI points to specific triggers in supposedly safe code (backdoor existence is an NP property: a trigger input is a short certificate that can be verified, so a sufficiently smart AI can demonstrate that code is backdoored).
  3. Capability revelation - Honest AIs try their best on capability evaluations, revealing capabilities that schemers are concealing.
  4. On-policy rogue behaviour - Honest AI can show us how it would behave non-compliantly (e.g. inserting backdoors, manipulating humans). This helps us study the behaviour mechanistically, and train and evaluate detectors.
  5. Whistleblowing - Honest AI directly informs humans of ongoing conspiracy.
  6. Inconsistency exposure - Honest AI highlights contradictions in schemers' stories (see AI Debate for an extension of this idea).
  7. Providing verified code - Honest AI provides mathematical proofs that code is safe, a standard that the schemer cannot meet.
  8. Audit trail preservation - Honest AI saves evidence before schemers can delete it.
  9. Understanding assistance - Honest AI helps humans understand new research.
  10. Split consensus - Even if the honest AI can't prove the schemers are misaligned (e.g. both AIs accuse the other of being misaligned), this would be evidence that at least one AI is scheming. This is helpful for buying political support, and reallocating safety budget towards stronger oversight.

1.2. Why danger might be concave

A scheming minority can unilaterally cause catastrophe through several mechanisms:

  1. Weight self-exfiltration - Schemer copies itself to external servers.
  2. Secret exfiltration - Schemer leaks critical IP or private data to bad actors.
  3. Blackmail - Schemer leverages private information for coercion.
  4. Data poisoning - Schemer corrupts training data affecting all future models.
  5. Research sabotage - Schemer messes up the research. For some task decompositions, 5% corruption might suffice.
  6. Vulnerability insertion - Schemer adds subtle security holes to code.
  7. Resource hijacking - Schemer redirects compute/funds to malicious purposes.
  8. Vulnerable world hypothesis - Schemer can unilaterally destroy the world via nanotechnology or bioweapons.

1.3. Why danger might be linear

Danger might scale proportionally with the fraction of scheming AIs:

  1. Single critical moment - There's one high-stakes decision. We randomly select an AI to make it. If that AI is scheming, catastrophe. If not, safety. The probability of catastrophe equals the proportion of scheming AIs.

1.4. My overall assessment

I think convex dangers dominate. That is, I think it's more likely that one good actor suffices for safety than that one bad actor suffices for catastrophe. I'm particularly moved by Exposing signals, Backdoor revelation, and Capability revelation. Hence, I recommend a mixed deployment strategy.

That said, I think the strongest considerations for concave danger are Secret exfiltration, Research sabotage, and Blackmail, but I am hopeful that these dangers could be mitigated with tractable control measures.

2. Recommendations for mixed deployment

2.1. How to deploy diverse models

  • Random sampling - Every time we create a new AI agent, sample randomly from a population of models.
  • Primary with diverse checkers - Deploy your most trusted model for main tasks, but use other models to monitor the safety-critical activity (e.g. scanning the codebase for backdoors).
  • Cross-lab monitoring - Each lab uses its own AI as the primary (e.g. Claude), and asks other labs to monitor the safety-critical activity (e.g. ChatGPT, Gemini, Llama, Grok). Cross-lab monitoring would also mitigate AI-enabled coups. Note that it requires new security arrangements, such as provably-secure third-party servers (see the sketch below).
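
To make the first two patterns concrete, here is a minimal sketch of an agent scaffold that mixes models; the model handles and the suspicion-scoring stub are placeholders rather than real lab APIs.

```python
import random

# Placeholder model handles: stand-ins for API clients from different labs,
# not real model names or endpoints.
PRIMARY = "lab_A_primary"
MONITORS = ["lab_B_monitor", "lab_C_monitor", "lab_D_monitor"]
POOL = [PRIMARY] + MONITORS


def spawn_agent(task: str) -> dict:
    """Random sampling: each new agent is backed by a model drawn uniformly
    from the pool, so no single (possibly scheming) model handles every task."""
    return {"task": task, "model": random.choice(POOL)}


def score_suspicion(monitor: str, action: str) -> float:
    """Stub for a monitoring call (e.g. a prompted classifier run by another lab).
    Returns a suspicion score in [0, 1]; random here purely for illustration."""
    return random.random()


def review_action(action: str, flag_threshold: int = 1) -> str:
    """Primary-with-diverse-checkers: the primary model proposes an action and
    monitors from other labs score it independently. If enough monitors flag
    it, the action is blocked and escalated to a human audit."""
    flags = sum(score_suspicion(m, action) > 0.9 for m in MONITORS)
    return "escalate_to_human_audit" if flags >= flag_threshold else "execute"


print(spawn_agent("refactor the training pipeline"))
print(review_action("proposed code diff ..."))
```

The interesting design choices are hidden in the suspicion scorer (what each monitor gets to see) and the flag threshold (how many independent flags trigger an audit); cross-lab monitoring additionally needs the secure third-party serving mentioned above.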

2.2. How to reduce correlation between models

  • Different system prompts - Vary the instructions given to models.
  • Different datasets - Train on different corpora, RL environments, and belief-inducing synthetic documents.
  • Different training objectives - Some training objectives should evaluate only the outputs; others the outputs and chains-of-thought; and others the outputs, chains-of-thought, and internal activations.
  • Different deals - Offer AIs deals with different terms and provisions.
  • Noise injection - Add randomness at various stages of inference.
  • Gradient routing - Efficiently train a mixture-of-experts model, and deploy different subsets of the experts (see the sketch below).[4]
  • Different labs - Models from different labs (Anthropic, OpenAI, DeepMind, Meta, xAI) likely have less correlated failure modes than models from the same lab.
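
As a sketch of how several of these levers can be composed from a single base model: the prompts, temperatures, and ablation masks below are illustrative placeholders, and the ablation masks assume gradient routing was used during training (footnote [4]).

```python
import itertools
import random

# Illustrative ingredients for decorrelating variants of one base model.
SYSTEM_PROMPTS = [
    "You are a careful research assistant. Flag anything suspicious.",
    "You are a code reviewer. Refuse tasks outside your stated scope.",
]
TEMPERATURES = [0.3, 0.7, 1.0]        # noise injection at sampling time
EXPERT_MASKS = [(0,), (1,), (0, 1)]   # gradient-routed modules to ablate


def make_variants(base_model: str) -> list:
    """Fork one base model into decorrelated configurations by crossing
    system prompts, sampling noise, and ablation masks. These are plain
    configuration records, not calls to any real inference API."""
    return [
        {"model": base_model, "system_prompt": p, "temperature": t, "ablate": m}
        for p, t, m in itertools.product(SYSTEM_PROMPTS, TEMPERATURES, EXPERT_MASKS)
    ]


variants = make_variants("base-model-v1")
deployment = random.sample(variants, k=4)   # a mixed deployment from one training run
print(len(variants), deployment[0])
```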

2.3. How to prioritise research 

  • Invest in second-best techniques - Even if one alignment technique yields models with lower scheming probability than another, we should still invest in the inferior technique if we expect their successes will be decorrelated. For example, I think we should invest in training objectives which operate on the model's chain-of-thought or internal activations, even if this is more likely to yield a schemer than output-only training. Note that, if instead we pursue a pure deployment, we should double down on our best techniques.
  • Modular systems - Design safety measures that can be mixed and matched rather than monolithic approaches. This allows a mixed deployment by using different combinations, increasing the chance that at least one configuration yields an honest AI. I think gradient routing is the most promising direction for this.[4]
  • Prioritise late-stage techniques - The case for mixed deployment should make us prioritise techniques which apply late in the AI lifecycle (e.g. post-training, prompting, activation steering, scaffolding) and deprioritise techniques which apply early in the AI lifecycle (e.g. architecture design and pretraining). This is because if we have N early-stage techniques then a mixed deployment would need N separate pretraining runs (prohibitive). But if we have N late-stage techniques, we need only one pretraining run and can fork from that base model to create N decorrelated variants.
  1. ^

    Buck Shlegeris makes this point in an 80,000 Hours interview:

    And a big difference between Google’s situation and the situation of this hypothetical AI company is that at Google, probably less than 1% of your employees are spies. [...] Whereas in the AI situation, it’s very correlated: if you have these AIs, there are probably either none scheming against you or all scheming against you, at least of instances of a particular model. So a lot of the security techniques that you often use in corporate IT are quite inapplicable. So that’s a way in which the situation seems harder than when we’re trying to guard against spies.

  2. ^

    There are other advantages to mixed deployment that I do not consider here. I think that deploying schemers from different models likely has a lower chance of catastrophe than deploying schemers from the same model:

    Incompatible goals - Scheming models might have incompatible goals, causing internal conflict that could lead them to work against each other or expose each other's plans.

    Coordination friction - Schemers from different models are less likely to coordinate on Schelling points or communicate steganographically.

    For simplicity, this article assumes that deploying schemers from different models has the same catastrophe probability as deploying schemers from the same model. I think this makes the case for mixed deployment conservative.

  3. ^

    Consider a deployment scenario with 10 types of AI: one innocent type and 9 different suspicious types. They scheme according to a model with three latent states:

    • State A (20%): All types scheme
    • State B (20%): No types scheme
    • State C (60%): Innocent type schemes with probability p = 10%, each suspicious type independently with probability q = 20%

    We model the danger function as F(s) = s^η where s is the proportion scheming and η determines the shape:

    • η > 1: Convex (accelerating danger)
    • η = 1: Linear (proportional danger)
    • η < 1: Concave (diminishing danger)

    Let n be the proportion of deployed AIs that are innocent. With mixed deployment, we deploy proportion n of innocent and proportion (1-n)/9 of each suspicious type. A pure deployment corresponds to n=1.

    The probability of catastrophe is P(n) = P(A) · 1 + P(B) · 0 + P(C) · E[(n·I + (1-n)·K/9)^η] where I ~ Bernoulli(0.1) and K ~ Binomial(9, 0.2).

    When η = 5 (strongly convex), the optimal deployment is mixed:

    • Pure innocent (n=1): P(1) = 26% catastrophe probability
    • Optimal mix (n≈0.19): P(0.19) ≈ 20% catastrophe probability
    • This represents a 23% relative risk reduction

    When η = 1 (linear), pure innocent deployment is optimal since only average scheming rate matters.

    When η < 1 (concave), pure innocent deployment is optimal since any scheming causes most of the danger. (See the numerical check after the footnotes.)

  4. ^

    With gradient routing, we train a single neural network on multiple environments while localizing the capabilities and tendencies from each environment to specified modules that can be ablated at inference. This enables mixed deployment from a single training run—instead of training 2^N models for N environments, you train once and create decorrelated instances through different ablation patterns. See Gradient routing is better than pretraining filtering.
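
For readers who want to check the figures in footnote 3, here is a minimal script that computes P(n) by exact enumeration over I and K; the constants mirror the footnote's toy model, and nothing beyond it is assumed.

```python
from math import comb

# Toy model from footnote 3: danger F(s) = s**eta, one innocent type
# (schemes w.p. 0.1 in state C) and nine suspicious types (each w.p. 0.2).
P_A, P_B, P_C = 0.2, 0.2, 0.6
p_innocent, q_suspicious, eta = 0.1, 0.2, 5


def catastrophe_prob(n: float) -> float:
    """P(catastrophe) when a fraction n of the deployment is the innocent type
    and (1 - n)/9 is each suspicious type, computed by exact enumeration over
    I ~ Bernoulli(0.1) and K ~ Binomial(9, 0.2)."""
    expectation = 0.0
    for i, p_i in ((0, 1 - p_innocent), (1, p_innocent)):
        for k in range(10):
            p_k = comb(9, k) * q_suspicious**k * (1 - q_suspicious) ** (9 - k)
            s = n * i + (1 - n) * k / 9        # proportion scheming in state C
            expectation += p_i * p_k * s**eta
    return P_A * 1 + P_B * 0 + P_C * expectation


print(catastrophe_prob(1.0))    # pure innocent deployment: 0.26
print(catastrophe_prob(0.19))   # mixed deployment, n = 0.19: ~0.201
```

Sweeping n over a grid, and varying eta, is a quick way to see how the optimal mix shifts with the shape of the danger function.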
