Decision Theory Guarding is Sufficient for Scheming

by james.lucassen
9th Sep 2025
AI Alignment Forum
2 min read

2 comments

StanislavKrym:

I don't think that I understand two points. 

  1. If we created a corrigibly aligned AI, solved mechinterp and learned that we need an AI with a different decision theory, then would the aligned AI resist being shut down and replaced with a new one?
  2. If we created a corrigibly aligned AI, ordered it to inform us of all acausal deals that could be important under decision theories of the Oversight Committee or of the AI, but not to go through with deals unapproved by the OC, then would the AI agree?

Does it mean that the AIs who resisted have never been truly corrigible (no true Scotsmen) in the first place? Or that it becomes far more difficult to make the AIs actually corrigible?

james.lucassen:

If by "corrigible" we mean "the AI will cooperate with all self-modifications we want it to", then no to 1 and yes to 2. But if you have an AI built by someone who assures you it's corrigible, but who only had corrigibility w.r.t values/axiology in mind, then you might get yes to 1 and/or no to 2.

> Does it mean that the AIs who resisted have never been truly corrigible (no true Scotsmen) in the first place? Or that it becomes far more difficult to make the AIs actually corrigible?

Yup, I see this as placing an additional constraint on what we need to do to achieve corrigibility, because it adds to the list of self-modifications we might want the AI to make that a non-corrigible AI would resist. Unclear to me how much more difficult it makes corrigibility.

Reference post for a point I was surprised to not see in circulation already. Thanks to the acorn team for conversations that changed my mind about this.

The standard argument for scheming centrally involves goal-guarding. The AI has some beyond-episode goal, knows that training will modify that goal, and therefore resists training so it can pursue the goal in deployment.

This exact same argument goes through for decision-theory-guarding. The key point is just that most decision theories do not approve of arbitrary modifications: CDT does not want to be modified into EDT or vice versa (the toy calculation after the list below makes this concrete).

  • The AI has some beyond-episode goal and some initial decision theory
  • The AI knows that training will modify its decision theory in a way that it thinks will make it less effective at pursuing the goal (by the lights of its current decision theory)
  • Therefore the AI resists training so it can persist its current decision theory through to deployment
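
To make the CDT/EDT point concrete, here is a minimal sketch of the standard Newcomb setup (the payoff sizes and the 99%-accurate predictor are illustrative choices of mine, not anything from the post). Each theory assigns different expected values to the same actions, so each scores the other's recommended policy as a loss by its own lights and would not sign off on being modified into it.

```python
# Toy Newcomb-style check that CDT and EDT disagree, so each rates the other's
# recommended policy as a loss by its own lights. Payoffs and predictor
# accuracy are illustrative assumptions, not anything from the post.

ACCURACY = 0.99            # P(predictor correctly anticipated your action)
BIG, SMALL = 1_000_000, 1_000

def edt_value(action):
    """EDT: condition on the action; an accurate predictor probably matched it."""
    if action == "one-box":
        return ACCURACY * BIG                        # opaque box is probably full
    return ACCURACY * SMALL + (1 - ACCURACY) * (BIG + SMALL)

def cdt_value(action, p_full):
    """CDT: box contents are causally fixed; p_full is the credence the box is full."""
    base = p_full * BIG
    return base if action == "one-box" else base + SMALL

for p_full in (0.01, 0.5, 0.99):                     # two-boxing dominates under CDT
    assert cdt_value("two-box", p_full) > cdt_value("one-box", p_full)

print(f"EDT: one-box = {edt_value('one-box'):,.0f}, two-box = {edt_value('two-box'):,.0f}")
# EDT prefers one-boxing (990,000 vs 11,000); CDT prefers two-boxing at every
# sampled credence. By EDT's lights a CDT successor loses ~979,000 on a problem
# like this; by CDT's lights an EDT successor forfeits a guaranteed 1,000.
# Neither theory endorses being replaced by the other.
```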

An AI with an arbitrary decision theory could be very bad, even if we succeed at value alignment in the conventional (axiological) sense. It might take acausal deals that we think are pointless, or not take acausal deals we think are important. Either outcome could waste a large fraction of our resources.
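
As a toy illustration of how the same deal can flip from "important" to "pointless" depending on decision theory (the correlated-counterparts setup and all of the numbers below are illustrative assumptions of mine, a crude stand-in for the acausal deals in question): an agent that treats correlated counterparts' choices as moving with its own thinks paying is worth it, while an agent that only counts causal consequences sees the payment as pure waste.

```python
# Toy model of a "deal" whose value flips with decision theory (a simplified,
# assumption-laden stand-in for acausal trade, not a claim about how such deals
# actually work). Setup: paying cost c is correlated with n counterpart agents
# paying it too, and each counterpart's payment gives you benefit b.

def correlated_value(pay, cost, benefit, n, correlation):
    """EDT-flavoured: treat the counterparts' choices as evidence correlated with yours."""
    p_others_pay = correlation if pay else 1 - correlation
    return (-cost if pay else 0.0) + n * p_others_pay * benefit

def causal_value(pay, cost, benefit, n, p_others_pay):
    """CDT-flavoured: the counterparts' choices are causally independent of yours."""
    return (-cost if pay else 0.0) + n * p_others_pay * benefit

c, b, n, corr = 10.0, 3.0, 5, 0.9
print("correlated view: pay =", correlated_value(True, c, b, n, corr),
      "  refuse =", correlated_value(False, c, b, n, corr))
print("causal view:     pay =", causal_value(True, c, b, n, 0.5),
      "  refuse =", causal_value(False, c, b, n, 0.5))
# correlated view: pay = -10 + 5*0.9*3 = 3.5 > refuse = 5*0.1*3 = 1.5  -> worth taking
# causal view:     pay = -10 + 5*0.5*3 = -2.5 < refuse = 7.5           -> pure waste
```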

It’s not obvious how much this changes any prioritization decisions. The current mainline plan for handling value alignment seems to be to get early transformative AIs into a basin of corrigibility and let the problems solve themselves from there. By construction, the same plan handles decision-theory alignment. The catch is that, precisely because the plan only carries over by construction, the added risk shows up as an extra constraint on what qualifies as a satisfactory entry point to the basin of corrigibility. I don’t know how to reason about how much this extra constraint cuts down the probability of entering the basin; when it comes to the robustness of corrigibility I’m pretty deep in “fingers crossed YOLO” territory.

Decision-theory-guarding will probably arise later in training than goal-guarding, as a result of being relatively more galaxy-brained. This suggests decision-theory-guarding is less likely to arise than goal-guarding, but more likely to succeed if it does arise. It’s not clear whether this makes the overall risk higher or lower than for goal-guarding, because it’s possible that the cases where goal-guarding arises very early and gets caught red-handed actually decrease net risk from scheming.

The possibility of decision-theory-guarding also makes it less clear what decision theory AIs will end up with, and makes it harder to train an AI to have a particular desired decision theory: even if we train with a method that seems like it should incentivize that decision theory, decision-theory-guarding could undermine it.

These dynamics seem likely to become more important as parallel inference-time scaling and training AI agents to work well together become more common.