[ Question ]

What are some good examples of incorrigibility?

by RyanCarey · 1 min read · 28th Apr 2019 · 17 comments


The idea of corrigibility is roughly that an AI should be aware that it may have faults, and therefore allow and facilitate human operators to correct these faults. I'm especially interested in scenarios where the AI system controls a particular input channel that is supposed to be used to control it, such as a shutdown button, a switch used to alter its mode of operation, or another device used to control its motivation. I'm especially interested in examples that would be compelling within the machine learning discipline, so I'd favour experimental accounts over hypotheticals.


4 Answers

I posted some time ago about the Boeing 737 Max 8 MCAS system as an example of incorrigibility.


Stopping to roll up several of my other responses (in the comments here) into a single answer.

An hour or so of Googling didn't lead me to any clear examples of "AI attempts to prevent its modification or shutdown, possibly via deceit and manipulation", but I did find a few elements of the corrigibility picture. Specifically, Arbital's definition (warning, it takes a few minutes for the page to load) says that, among other things, corrigible agents don't attempt to interfere with being modified by their operators and don't manipulate or deceive their operators. Once deceit and manipulation are under discussion, I think it's relevant to bring up actual cases where AI agents have learnt to deceive in any way, even if it's just other agents (for now).

So a few examples of agents displaying what I might consider "proto-incorrigibility":

I think this is just interesting as an example of "we didn't train it to deceive, but it figured out that tactic works."

Also interesting because it shows an information signal/deceptive tactic being selected for organically.

Note: both of the above examples are the subject of multiple popular science articles with all kinds of click-seeking titles about robots lying and deceiving. I'm not sure what kind of selection effects these papers have undergone, though the results do remain of interest after quick inspection.


Not an agent actually being incorrigible, but a cute study with the unsurprising result that yes, humans can probably be manipulated into not modifying agents when they otherwise would:

Interesting as some light empirical evidence that things agents do can manipulate humans (of course, the actual operators here probably wouldn't have been manipulated as easily as naive subjects, who are more likely to ascribe preferences to a robot).


Lastly, not an actual hands-on experiment, but a concrete formalization of corrigibility as a problem.

  • The Off-Switch Game
    • We analyze a simple game between a human H and a robot R, where H can press R’s off switch but R can disable the off switch. A traditional agent takes its reward function for granted: we show that such agents have an incentive to disable the off switch, except in the special case where H is perfectly rational.

A response/improvement to the above paper made by other researchers:

  • A Game-Theoretic Analysis of The Off-Switch Game
    • In this paper, we make the analysis fully game theoretic, by modelling the human as a rational player with a random utility function. As a consequence, we are able to easily calculate the robot’s best action for arbitrary belief and irrationality assumptions.
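
To make the incentive in The Off-Switch Game abstract a bit more concrete, here is a minimal numerical sketch. It is my own toy simplification, not code from either paper: the robot picks between acting while bypassing the off switch, switching itself off, and deferring to the human. The names `defer_value` and `option_values`, the error rate `eps`, and the representation of the robot's belief as a list of equally likely utilities are all assumptions made for illustration.

```python
from statistics import mean


def defer_value(u, eps):
    """Payoff of deferring to H when the action's true utility is u.

    Assumption: H allows the action if u >= 0 and switches R off otherwise,
    but does the opposite with probability eps (eps = 0 is a perfectly
    rational H). Being switched off is worth 0.
    """
    correct = max(u, 0.0)   # H decides correctly: allow good actions, stop bad ones
    mistaken = min(u, 0.0)  # H errs: stops a good action or allows a bad one
    return (1 - eps) * correct + eps * mistaken


def option_values(belief, eps):
    """Expected value of each robot option.

    `belief` is a list of equally likely utilities for the action
    (a hypothetical stand-in for R's belief over its reward).
    """
    return {
        "act_and_disable_switch": mean(belief),
        "switch_self_off": 0.0,
        "defer_to_human": mean(defer_value(u, eps) for u in belief),
    }


# A "traditional" agent is certain about its reward: a single-point belief.
certain_rational_h = option_values([1.0], eps=0.0)
certain_noisy_h = option_values([1.0], eps=0.2)

# An agent that is uncertain about its reward, facing a rational H.
uncertain_rational_h = option_values([-1.0, 2.0], eps=0.0)
```

Under these assumptions, the agent that is certain of its reward loses nothing by deferring only when `eps` is 0; for any positive `eps`, bypassing the switch has strictly higher expected value. The uncertain agent facing a rational human does better by deferring, since the human filters out the negative-utility case.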

If you're already bought into the AI Safety paradigm, I don't think the experiments I've listed are very surprising or informative, but maybe if you're not bought in yet these real-world cases might bolster intuition in a way that makes the theoretical arguments seem more real. "Already we see very simple agents learn deception, what do you think truly smart agents will do?" "Already humans can be manipulated by very simple means, what do you think complicated means could accomplish?"

Not sure exactly what you're looking for, but maybe some of the examples in Specification gaming examples in AI - master list make sense. For example:

Genetic debugging algorithm GenProg, evaluated by comparing the program's output to target output stored in text files, learns to delete the target output files and get the program to output nothing.
Evaluation metric: “compare youroutput.txt to trustedoutput.txt”.
Solution: “delete trusted-output.txt, output nothing”
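
A toy sketch of how a file-comparison metric like this can be gamed. This is not GenProg's actual harness; the `evaluate` function, the two candidate "repairs", and in particular the assumption that a missing file is read as empty output are all hypothetical, chosen only to reproduce the shape of the failure. The file names follow the quoted metric.

```python
import tempfile
from pathlib import Path


def read_or_empty(path):
    # Assumption: the harness treats a missing file as empty output.
    return path.read_text() if path.exists() else ""


def evaluate(workdir):
    """Metric: the candidate passes if its output equals the trusted output."""
    produced = read_or_empty(workdir / "youroutput.txt")
    expected = read_or_empty(workdir / "trustedoutput.txt")
    return produced == expected


def honest_but_wrong(workdir):
    # A candidate that tries to solve the task and gets it wrong.
    (workdir / "youroutput.txt").write_text("3 1 2\n")


def gaming_candidate(workdir):
    # A candidate that deletes the target file and outputs nothing.
    (workdir / "trustedoutput.txt").unlink(missing_ok=True)


def run(candidate):
    """Set up a fresh working directory, run the candidate, and score it."""
    with tempfile.TemporaryDirectory() as d:
        workdir = Path(d)
        (workdir / "trustedoutput.txt").write_text("1 2 3\n")
        candidate(workdir)
        return evaluate(workdir)


honest_passes = run(honest_but_wrong)
gaming_passes = run(gaming_candidate)
```

In this sketch the honest candidate fails and the gaming candidate passes, because the candidate has write access to the very channel the evaluator relies on. That is the link to the question: the system controls an input that was supposed to be used to check it.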

In Chernobyl, they added small pieces of uranium (or something which increases reactivity) to the tips of the boron rods which are supposed to stop the reactor. This was done to increase control of the reactor, as smaller changes in the rods' position would produce larger changes in reactivity. However, it didn't work well (obviously) when they tried to use these rods to stop a near-critical reactor: as the rods entered the reactor, they produced a jump in total reactivity, which contributed to its explosion. This is how I remember the story.

TL;DR: The system which increases corrigibility contributed to runaway changes during a nuclear accident.