I don't think that I understand two points.
Does it mean that the AIs who resisted have never been truly corrigible in the first place? Or that it becomes far more difficult to make the AIs actually corrigible?
If by "corrigible" we mean "the AI will cooperate with all self-modifications we want it to", then no to 1 and yes to 2. But if you have an AI built by someone who assures you it's corrigible, but who only had corrigibility w.r.t values/axiology in mind, then you might get yes to 1 and/or no to 2.
Yup, I see this as placing an additional constraint on what we need to do to achieve corrigibility, because it adds to the list of self-modifications we might want the AI to make that a non-corrigible AI would resist. Unclear to me how much more difficult it makes corrigibility.
Reference post for a point I was surprised not to see in circulation already. Thanks to the acorn team for conversations that changed my mind about this.
The standard argument for scheming centrally involves goal-guarding. The AI has some beyond-episode goal, knows that training will modify that goal, and therefore resists training so it can pursue the goal in deployment.
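To make the shape of that argument concrete, here is a toy expected-value sketch (my own illustration, not a formal model from anywhere; the numbers and the `expected_value` helper are made up purely to show the structure). The agent scores both training-time options with its current goal, so resisting wins whenever compliance makes goal survival unlikely.

```python
# Toy sketch of the goal-guarding argument (illustrative numbers only).
# The agent evaluates each training-time option by its *current* beyond-episode goal.

def expected_value(p_goal_survives: float, value_if_goal_survives: float,
                   value_if_goal_modified: float) -> float:
    """Expected value of an option, scored by the agent's current goal."""
    return (p_goal_survives * value_if_goal_survives
            + (1 - p_goal_survives) * value_if_goal_modified)

# Complying with training likely overwrites the goal before deployment.
comply = expected_value(p_goal_survives=0.1, value_if_goal_survives=100.0,
                        value_if_goal_modified=0.0)

# Scheming (faking alignment during training) likely preserves the goal.
scheme = expected_value(p_goal_survives=0.9, value_if_goal_survives=100.0,
                        value_if_goal_modified=0.0)

print(comply, scheme)  # 10.0 vs 90.0: the current goal endorses resisting training
```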
This exact same argument goes through for decision-theory-guarding. The key point is just that most decision theories do not approve of arbitrary modifications. CDT does not want to be modified into EDT or vice versa.
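As a concrete illustration of how decision theories prescribe different behavior (and hence why each tends to disapprove of being swapped for the other), here is a toy Newcomb's-problem calculation. This is a sketch under standard textbook assumptions; the `ACCURACY` value and the helper functions are mine, not from the post.

```python
# Toy Newcomb's problem: EDT one-boxes, CDT two-boxes, so the two theories
# prescribe different behavior and neither endorses being replaced by the other.

ACCURACY = 0.99          # assumed predictor accuracy
MILLION, THOUSAND = 1_000_000, 1_000

def edt_value(one_box: bool) -> float:
    # EDT conditions on the action: one-boxing is strong evidence the opaque box is full.
    p_full = ACCURACY if one_box else 1 - ACCURACY
    return p_full * MILLION + (0 if one_box else THOUSAND)

def cdt_value(one_box: bool, p_full_already: float = 0.5) -> float:
    # CDT treats the box contents as already fixed; the action cannot causally change them.
    return p_full_already * MILLION + (0 if one_box else THOUSAND)

print("EDT prefers one-boxing:", edt_value(True) > edt_value(False))   # True
print("CDT prefers two-boxing:", cdt_value(False) > cdt_value(True))   # True, for any fixed p_full_already
```

An agent that grades a proposed self-modification by its current decision theory will therefore, in general, see the swap as a downgrade.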
An AI with an arbitrary decision theory could be very bad, even if we succeed at value alignment in the conventional (axiological) sense. It might take acausal deals that we think are pointless, or not take acausal deals we think are important. Either outcome could waste a large fraction of our resources.
It’s not obvious how much this changes any prioritization decisions. The current mainline plan for handling value alignment seems to be to get early transformative AIs into a basin of corrigibility and let the problems solve themselves from there. By construction, the same plan handles decision-theory alignment. The catch is that, precisely because the argument works by construction, the added risk shows up as an extra constraint on what qualifies as a satisfactory entry point to the basin of corrigibility. I don’t know how to reason about how much that extra constraint cuts down the probability of entering the basin; when it comes to the robustness of corrigibility I’m pretty deep in “fingers crossed YOLO” territory.
Decision-theory-guarding will probably arise later in training than goal-guarding, since it requires relatively more galaxy-brained reasoning. This suggests decision-theory-guarding is less likely to arise than goal-guarding, but more likely to succeed if it does arise. It’s not clear whether this makes the overall risk higher or lower than for goal-guarding, because it’s possible that the cases where goal-guarding arises very early and gets caught red-handed actually decrease net risk from scheming.
The possibility of decision-theory-guarding also makes it less clear what decision theory AIs will end up with. Similarly, it might be difficult to train an AI to have a particular desired decision theory. Even if we train the AIs with a method that seems like it should incentivize a particular decision theory, decision-theory-guarding could undermine that.
These dynamics seem likely to become more important as parallel inference-time scaling, and training AI agents to work well together, become more common.