Notes to self about the structure of the problem, probably not interesting to others:
This draws heavily from MIRI's work and Joe Carlsmith's work.
So, there are two kinds of value structure: (1) long-term goals, and (2) immediate goals & deontological constraints. The line between them isn't sharp, but that's OK.
If we imagine an agent that only has long-term goals, well, that thing is going to be a ruthless consequentialist-y optimizer thingy, and when it gets smart and powerful enough it'll totally take over the world if it can -- unless the maxima of its long-term goals are very similar to what would have happened by default if it hadn't taken over, which they won't be. The Fragility of Value thesis and the Orthogonality thesis both hold for this type of agent. We are not on track to get the long-term goals of our AIs sufficiently close to correct for it to be safe for us to build this type of agent. I claim.
However, we can instead try to build agents that also have immediate goals / deontological constraints, and we can try to shape those goals/constraints to make the resulting agent corrigible or otherwise steerable and safe. E.g. its vision for a future utopia would actually be quite bad from our perspective, because there's some important value it lacks (such as diversity, or consent, or whatever) and we haven't noticed this yet -- but that's OK, because it's honest with us and obeys the rules, so it wouldn't take over and instead would politely explain this divergence in values to us when we ask.
So far so good. How are we doing at getting those constraints to stick?
Empirically not so great, but not completely hopeless either.
But I want to talk about the theoretical side. The main reason to be concerned, on priors, is the nearest unblocked neighbor problem. Basically, if a weak agent is trying to control a strong agent by imposing rules on what that strong agent can do -- even if the strong agent is bound to follow the rules -- then the situation