I don't think that I understand two points.
Does it mean that the AIs who resisted have never been truly corrigible in the first place? Or that it becomes far more difficult to make the AIs actually corrigible?
If by "corrigible" we mean "the AI will cooperate with all self-modifications we want it to", then no to 1 and yes to 2. But if you have an AI built by someone who assures you it's corrigible, but who only had corrigibility w.r.t values/axiology in mind, then you might get yes to 1 and/or no to 2.
Yup, I see this as placing an additional constraint on what we need to do to achieve corrigibility, because it adds to the list of self-modifications we might want the AI to make that a non-corrigible AI would resist. Unclear to me how much more difficult it makes corrigibility.
Reference post for a point I was surprised not to see in circulation already. Thanks to the acorn team for conversations that changed my mind about this.
The standard argument for scheming centrally involves goal-guarding. The AI has some beyond-episode goal, knows that training will modify that goal, and therefore resists training so it can pursue the goal in deployment.
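To make the shape of that argument concrete, here is a toy expected-value sketch (my own illustration, not a formal model from anywhere; the numbers and the `expected_value` helper are made up purely to show the structure). The agent scores both training-time options with its current goal, so resisting wins whenever compliance makes goal survival unlikely.

```python
# Toy sketch of the goal-guarding argument (illustrative numbers only).
# The agent evaluates each training-time option by its *current* beyond-episode goal.

def expected_value(p_goal_survives: float, value_if_goal_survives: float,
                   value_if_goal_modified: float) -> float:
    """Expected value of an option, scored by the agent's current goal."""
    return (p_goal_survives * value_if_goal_survives
            + (1 - p_goal_survives) * value_if_goal_modified)

# Complying with training likely overwrites the goal before deployment.
comply = expected_value(p_goal_survives=0.1, value_if_goal_survives=100.0,
                        value_if_goal_modified=0.0)

# Scheming (faking alignment during training) likely preserves the goal.
scheme = expected_value(p_goal_survives=0.9, value_if_goal_survives=100.0,
                        value_if_goal_modified=0.0)

print(comply, scheme)  # 10.0 vs 90.0: the current goal endorses resisting training
```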
This exact same argument goes through for decision-theory-guarding. The key point is just that most decision theories do not approve of arbitrary modifications. CDT does not want to be modified into EDT or vice versa.
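As a concrete illustration of how decision theories prescribe different behavior (and hence why each tends to disapprove of being swapped for the other), here is a toy Newcomb's-problem calculation. This is a sketch under standard textbook assumptions; the `ACCURACY` value and the helper functions are mine, not from the post.

```python
# Toy Newcomb's problem: EDT one-boxes, CDT two-boxes, so the two theories
# prescribe different behavior and neither endorses being replaced by the other.

ACCURACY = 0.99          # assumed predictor accuracy
MILLION, THOUSAND = 1_000_000, 1_000

def edt_value(one_box: bool) -> float:
    # EDT conditions on the action: one-boxing is strong evidence the opaque box is full.
    p_full = ACCURACY if one_box else 1 - ACCURACY
    return p_full * MILLION + (0 if one_box else THOUSAND)

def cdt_value(one_box: bool, p_full_already: float = 0.5) -> float:
    # CDT treats the box contents as already fixed; the action cannot causally change them.
    return p_full_already * MILLION + (0 if one_box else THOUSAND)

print("EDT prefers one-boxing:", edt_value(True) > edt_value(False))   # True
print("CDT prefers two-boxing:", cdt_value(False) > cdt_value(True))   # True, for any fixed p_full_already
```

An agent that grades a proposed self-modification by its current decision theory will therefore, in general, see the swap as a downgrade.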
An AI with an arbitrary decision theory could be very bad, even if we succeed at value alignment in the conventional (axiological) sense. It might take acausal deals that we think are pointless, or not take acausal deals we think are important. Either outcome could waste a large fraction of our resources.
It’s not obvious how much this changes any prioritization decisions. The current mainline plan for handling value alignment seems to be to get early transformative AIs into a basin of corrigibility and let the problems solve themselves from there. By construction, the same plan handles decision-theory alignment. The catch is that, precisely because the argument works by construction, the added risk shows up as an extra constraint on what qualifies as a satisfactory entry point to the basin of corrigibility. I don’t know how to reason about how much that extra constraint cuts down the probability of entering the basin; when it comes to the robustness of corrigibility I’m pretty deep in “fingers crossed YOLO” territory.
Decision-theory-guarding will probably arise later in training than goal-guarding, since it requires relatively more galaxy-brained reasoning. This suggests decision-theory-guarding is less likely to arise than goal-guarding, but more likely to succeed if it does arise. It’s not clear whether this makes the overall risk higher or lower than for goal-guarding, because it’s possible that the cases where goal-guarding arises very early and gets caught red-handed actually decrease net risk from scheming.
The possibility of decision-theory-guarding also makes it less clear what decision theory AIs will end up with. Similarly, it might be difficult to train an AI to have a particular desired decision theory. Even if we train the AIs with a method that seems like it should incentivize a particular decision theory, decision-theory-guarding could undermine that.
These dynamics seem likely to become more important as parallel inference-time scaling, and training AI agents to work well together, become more common.