Ryan Carey asks in a related question "What are some good examples of incorrigibility?" He provides the following overview:

The idea of corrigibility is roughly that an AI should be aware that it may have faults, and therefore allow and facilitate human operators to correct these faults. I'm especially interested in scenarios where the AI system controls a particular input channel that is supposed to be used to control it, such as a shutdown button, a switch used to alter its mode of operation, or another device used to control its motivation.

What's a more detailed understanding? What are the right things to read? I believe there's at least one MIRI paper and some Arbital posts. I'm writing this question to center my inquiry.

Ruby

May 02, 2019

The Arbital entry is a very comprehensive and clear introduction.

https://arbital.com/p/corrigibility/