This post is coauthored with Ryan Carey.
Much of the work on developing a corrigible agent has focused on ensuring that an AI will not manipulate its shutdown button, or any other device that the human operator would use to control it. Suppose, however, that the AI lacked any capacity to press its shutdown button, or to control its state indirectly. What, then, would remain of the problem of corrigibility? We think three criteria would remain. Before stating them, let U_N be whatever utility function we would like the agent to optimise normally, and let U_S be the shutdown utility function. (U_N may ...
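For reference, a standard way to combine these two functions (a minimal sketch of the utility-indifference setup from Soares et al.'s *Corrigibility* paper, not a definition from this post) is to have the agent's utility switch from U_N to U_S when the button is pressed:

$$
U(o) =
\begin{cases}
U_N(o) & \text{if the shutdown button is not pressed,} \\
U_S(o) & \text{if the shutdown button is pressed,}
\end{cases}
$$

where $o$ ranges over outcomes. The difficulty, on that formulation, is that an agent maximising $U$ generally has incentives to affect which branch is active; the supposition above removes those incentives by fiat, which is what leaves the three remaining criteria.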