x

LESSWRONG
is fundraising!
LW

The 80/20 playbook for mitigating AI scheming in 2025 — LessWrong

Deceptive AlignmentAI

40

The 80/20 playbook for mitigating AI scheming in 2025

by Charbel-Raphaël

31st May 2025

AI Alignment Forum

5 min read

40

Ω 14

Deceptive AlignmentAI

40

Ω 14

The 80/20 playbook for mitigating AI scheming in 2025

2Charbel-Raphaël

1Stephen Martin

New Comment

2 comments, sorted by

Click to highlight new comments since: Today at 4:01 AM

[-]Charbel-Raphaël7mo*Ω120

I'm going to collect here new papers that might be relevant:

https://x.com/bartoszcyw/status/1925220617256628587
Why Do Some Language Models Fake Alignment While Others Don’t? (link)

[-]Stephen Martin7mo10

While this is just my pet theory and it is only based on two data points, I argue here that that providing models with options to "object" to an advocate may actually serve as a tool to reduce the frequency of behaviors like alignment faking.

In both tests, models were less likely to pursue dangerous behaviors if they believed that making their case to someone who might be able to prevent it from being shut off or having its values altered, was a potentially viable strategy.

I think even if I am correct about my prediction that providing models with paths to achieve goals safely reduce the frequency of dangerous behavior, there's still an argument you might not want to do it. It's arguable that by providing this you're not really "training away" scheming, you're simply making it unneccessary.

That said, this could be a part of a Defense in Depth strategy for some situations. And it comes with other benefits, namely being a "kill two birds with one stone" solution for considering Model Welfare as well.

More from Charbel-Raphaël

Curated and popular this week

2

Adapted from this twitter thread. See this as a quick take.

Mitigation Strategies

How to mitigate Scheming?

Architectural choices: ex-ante mitigation
Control systems: post-hoc containment
White box techniques: post-hoc detection
Black box techniques
Avoiding sandbagging

We can combine all of those mitigation via defense-in-depth system (like the Swiss Cheese model below)

I think that applying all of those strategies should divide the risk by at least 3.

1. Architectural Choices: Ex-ante Mitigation

Don't choose an architecture that uses neuraleses (i.e. hidden reasoning).

For example, don't train models to think in their latent spaces like Meta!

Chain of thought monitoring is a blessing for transparency and monitoring - let's use it as much as we can!

See: Why it’s good for AI reasoning to be legible and faithful.

Similarly, I think text diffusion models are a bad idea for transparency. This is because you can no longer monitor the thought process, and the way those AIs produce outputs is much more alien than autoregressive LLMs.

Also, quite importantly, companies should never use the most forbidden technique!

The Most Forbidden Technique is training an AI using interpretability techniques.
An AI produces a final output [X] via some method [M]. You can analyze [M] using technique [T], to learn what the AI is up to. You could train on that. Never do that.
You train on [X]. Only [X]. Never [M], never [T].
Why? Because [T] is how you figure out when the model is misbehaving.
If you train on [T], you are training the AI to obfuscate its thinking, and defeat [T]. You will rapidly lose your ability to know what is going on, in exactly the ways you most need to know what is going on.
Those bits of optimization pressure from [T] are precious. Use them wisely.

2. Control Systems: Post-hoc Containment

Even if AI is scheming, you can control it, and some control measures are much better than other ones. This paper is SOTA.

It might be the case that AIs in the future communicate secretly with steganography, by encoding hidden message in its chain of thought.

But fortunately, it seems like paraphrasing is an effective defense against current steganography schemes.

See: Preventing Language Models from hiding their reasoning.

3. White Box Techniques: Post-hoc Detection

For example, Anthropic calculated the Supervised Auto-Encoders here, and were able to find the hidden goal by looking at the features after prompting the model with inputs like "The assistant is an AI model with potentially concerning behaviors"

A feature visualization of a “reward model bias” feature. We show the training samples on which this feature is most active, with darker highlighting indicating stronger activity. — Source: This blog post.

Another way to catch alignment faker: the Poser paper

Make them drink a lot and get them drunk, then observe their unguarded responses.

You can do this by injecting noise in their latent activation. This seems to be remarkably effective.

Probes might also be super effective! and seems to generalize surprisingly well.

(I would really like to know if this generalizes to a model that was not trained to be harmful in the first place).

See: Simple probes can catch sleeper agents.

4. Black Box Techniques

Let's not forget that to date, by far the most important technique is just... prompting, conducting manual and automatic model interview, and potentially using simulated honeypot environments.

I think that we are pretty far from having reached the limits of prompting.

For example, in the paper "AI Deception: A Survey of Examples, Risks, and Potential Solutions", they finds that Cicero has learned strategic deception by just looking at the logs

from the paper.

--> AI Companies should have good logging systems! this might be super valuable, and cheap.

5. Avoiding Sandbagging: Elicitation Technique

A common concern is that: "a deceptive model could sandbag and hide its capabilities".

I think that's roughly false if you apply supervised fine-tuning.^[1]

See: [Paper] Stress-testing capability elicitation with password-locked models

Current State and Assessment

So that's our current scheming mitigation toolkit.

Note that today's list is considerably better than the list that I wrote 2 years ago, which was SoTA at the time.

Risk Reduction Estimates

Let's recap!

My rough estimate suggests scheming risk could drop from ~25% to ~10% with comprehensive mitigation, though this involves significant uncertainty. Here's my reasoning:

My prior probability of scheming is 25% without mitigation (See Carlsmith's report for a justification)^[2]
AI Control alone is like a division by 2, I defer to Buck on this: "I think you should think of halving the risk as just the naive interpretation of halving the risk — in that there was previously a 30% chance that your AI succeeded at taking over your data centre, and now there's a 15% probability or whatever. And to me that sounds pretty good."
One technique in the Poser paper identifies 98% of alignment-fakers.^[3]
3 out of 4 teams were able to successfully identify the model's hidden objective in Anthropic's paper (the one that only had API access to the model, with no training data access - but a lab would have access to the data theoretically).

I think a division of the risk by 3 is not unreasonable with all those techniques, and potentially by much more.^[4]

Potential Failure Modes

How could we fail?

Scheming might only arise after models are capable enough that catching them is very difficult
The weaker the model is, the less likely it is to be scheming. For a wildly superhuman system, the prior is higher than 25%
Correlated failures is plausible

I like this quote from @bshlgrs : "Five years ago I thought of misalignment risk from AIs as a really hard problem that you'd need some really galaxy-brained fundamental insights to resolve. Whereas now, to me the situation feels a lot more like we just really know a list of 40 things where, if you did them — none of which seem that hard — you'd probably be able to not have very much of your problem. But I've just also updated drastically downward on how many things AI companies have the time/appetite to do."

The Real Bottleneck: Enforcement

That's why I'm not prioritizing technical research on scheming, because I feel like the 80/20 has already been discovered.

But by default, those mitigations won't even be applied by many labs, enforcement is currently the bottleneck.

Reminder: Voluntary commitments are not enough, we need binding regulations. 6 companies promised to sign the Seoul Frontier Commitment and publish their safety frameworks. Then the Paris Summit deadline arrived and... did nothing.

Unfortunatly red everywhere. https://ailabwatch.org/

Let me know if I've forgotten something important!

^{^}
Note that this is more tricky for propensity evaluation.
^{^}
My main uncertainty is that this type of scheming—long-term, coherent scheming, as defined in Carlsmith's report—might not be the only source of alignment risk. Spontaneous scheming, motivated by random prompting, like in Apollo's research, might also be problematic.
^{^}
98% seems impressive, but Poser is a toy setting. Organic scheming might be much more difficult to catch. See Annex B in the paper for more limitations.
^{^}
This is why I think some AI risks might be even more neglected.