What is "Instrumental Corrigibility"?
In his article here, Paul Christiano mentions the idea that instrumentally corrigible agents are not robust, as slight misalignments between their values and the overseer values would lead to an adversarial relationship. While the intuition that slight deviations in values potentially results in adversarial behavior makes sense, I am not...