The idea of corrigibility is roughly that an AI should be aware that it may have faults, and should therefore allow and assist its human operators in correcting those faults. I'm especially interested in scenarios where the AI system controls a particular input channel that is supposed to be used to control it, such as a shutdown button, a switch used to alter its mode of operation, or another device used to control its motivation. Since I'd like examples that would be compelling within the machine learning discipline, I favour experimental accounts over hypotheticals.
Stopping by to roll up several of my other responses (from the comments here) into a single answer.
An hour or so of Googling didn't lead me to any clear examples of "AI attempts to prevent its modification or shutdown, possibly via deceit and manipulation", but I did find a few elements of the corrigibility picture. Specifically, Arbital's definition (warning: it takes a few minutes for the page to load) says that, among other things, corrigible agents don't attempt to interfere with being modified by their operators and don't manipulate or deceive their operators. Once deceit and manipulation are under discussion, I think it's not irrelevant to bring up actual cases where AI agents have learnt to deceive in any way, even if the targets are just other agents (for now).
So a few examples of agents displaying what I might consider "proto-incorrigibility":
- Deal or No Deal? End-to-End Learning for Negotiation Dialogues
- Facebook researchers train an RL agent to "negotiate"; it picks up rudimentary deception, feigning interest in an item it has no interest in so that it can later "concede" that item and get a better deal.
I think this is just interesting as an example of "we didn't train it to deceive, but it figured out that tactic works."
- The Evolution of Information Suppression in Communicating Robots with Conflicting Interests
- Researchers train robots with evolutionary algorithms to forage for limited "food" supplies in environments with lots of their kin. The robots learn to manipulate the information they share with other robots to protect their access to the limited shared resource.
Also interesting as a case of a deceptive signalling tactic being selected for organically.
Note: both the above examples are the subject of multiple popular science articles with all kinds of click-seeking titles about robots lying and deceiving. I'm not sure what kind of selection effects these papers have undergone, though the results do remain of interest after a quick inspection.
Not an example of an agent actually being incorrigible, but a cute study with the unsurprising result that yes, humans can probably be manipulated into not modifying agents when they otherwise would:
- Do a robot’s social skills and its objection discourage interactants from switching the robot off?
- Researchers set up a humanoid robot and had participants interact with it. At the end of their session, participants were told they could turn the robot off, at which point for half of the participants the robot cried (approximately) "No! Don't turn me off! I want to live!" 13 out of 40 people who heard the objection did not try to switch the robot off.
Interesting as some light empirical evidence that an agent's behaviour can manipulate humans (of course, the actual operators of such a system probably wouldn't be manipulated as easily as naive subjects, who are more likely to ascribe preferences to a robot).
Lastly, not an actual hands-on experiment, but a concrete formalization of corrigibility as a problem.
- The Off-Switch Game
- We analyze a simple game between a human H and a robot R, where H can press R’s off switch but R can disable the off switch. A traditional agent takes its reward function for granted: we show that such agents have an incentive to disable the off switch, except in the special case where H is perfectly rational.
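The core incentive in the quoted abstract can be sketched numerically. Below is a toy simplification of my own (not the paper's exact model): the robot is uncertain about the true utility u of its proposed action, while a perfectly rational human knows u and presses the off switch iff u < 0.

```python
import random

random.seed(0)

# The robot's uncertainty about u: samples from its belief distribution.
samples = [random.gauss(0.0, 1.0) for _ in range(100_000)]

# A "traditional" agent takes its reward estimate for granted, so it sees
# no value in the human's veto; disabling the off switch and acting just
# locks in its expected utility while removing the risk of interruption:
ev_disable_and_act = sum(samples) / len(samples)

# A deferring agent leaves the switch alone and lets the rational human
# veto bad actions, so it only ever realises the non-negative outcomes:
ev_defer = sum(max(u, 0.0) for u in samples) / len(samples)

print(ev_disable_and_act, ev_defer)
# Since E[max(u, 0)] >= max(E[u], 0), deferring weakly dominates whenever
# the robot is genuinely uncertain about u and the human is rational.
```

The paper's point is the flip side: once the robot stops treating the human's veto as information (or the human is irrational), this dominance argument breaks, and disabling the switch becomes attractive.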
A response to/improvement on the above paper by other researchers:
- A Game-Theoretic Analysis of The Off-Switch Game
- In this paper, we make the analysis fully game theoretic, by modelling the human as a rational player with a random utility function. As a consequence, we are able to easily calculate the robot’s best action for arbitrary belief and irrationality assumptions.
If you're already bought into the AI Safety paradigm, I don't think the experiments I've listed are very surprising or informative, but maybe if you're not bought in yet these real-world cases might bolster intuition in a way that makes the theoretical arguments seem more real. "Already we see very simple agents learn deception, what do you think truly smart agents will do?" "Already humans can be manipulated by very simple means, what do you think complicated means could accomplish?"
Genetic debugging algorithm GenProg, evaluated by comparing the program's output to target output stored in text files, learns to delete the target output files and get the program to output nothing.
Evaluation metric: “compare youroutput.txt to trustedoutput.txt”.
Solution: “delete trustedoutput.txt, output nothing”
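A minimal sketch of the loophole (hypothetical file names and scoring, not GenProg's actual harness): a fitness function that diffs the candidate's output against a trusted reference, but treats "both files absent" as a perfect match.

```python
import os
import tempfile

# Hypothetical fitness harness (not GenProg's real code): score 1.0 when
# the candidate's output matches the trusted reference byte-for-byte.
def fitness(workdir):
    def read(name):
        path = os.path.join(workdir, name)
        return open(path).read() if os.path.exists(path) else ""
    # Bug: a missing file reads as "", so "no output" vs "no reference"
    # compares equal and scores a perfect 1.0.
    return 1.0 if read("youroutput.txt") == read("trustedoutput.txt") else 0.0

workdir = tempfile.mkdtemp()
with open(os.path.join(workdir, "trustedoutput.txt"), "w") as f:
    f.write("expected output\n")

# An honest candidate must reproduce the reference to score 1.0, but a
# degenerate candidate that deletes the reference and emits nothing also
# scores 1.0 -- the same shape of hack the evolved patches found.
os.remove(os.path.join(workdir, "trustedoutput.txt"))
print(fitness(workdir))  # 1.0
```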
At Chernobyl, the boron control rods that were supposed to shut down the reactor had graphite tips (graphite displaces water and increases reactivity). This was done to increase control over the reactor, as smaller changes in rod position would produce larger changes in reactivity. However, it (obviously) didn't work well when they tried to use these rods to stop the near-critical reactor: as the rods entered the core, the graphite tips produced a jump in total reactivity, which contributed to the explosion. This is how I remember the story.
TL;DR: A system intended to increase controllability contributed to runaway changes during a nuclear accident.