The idea of corrigibility is roughly that an AI should be aware that it may have faults, and therefore allow and facilitate human operators to correct these faults. I'm especially interested in scenarios where the AI system controls a particular input channel that is supposed to be used to control it, such as a shutdown button, a switch used to alter its mode of operation, or another device used to control its motivation. I'm also looking for examples that would be compelling within the machine learning discipline, so I'd favour experimental accounts over hypotheticals.


Some time ago I posted about the Boeing 737 Max 8 MCAS system as an example of incorrigibility.

I think this is actually the best example given so far here of incorrigibility in concrete systems.

One more example from aviation safety: a pilot let his son sit at the control wheel, believing it was disabled while the autopilot flew the plane. However, after the wheel was turned 15 degrees, the autopilot disengaged, since it had a new "corrigibility" feature, and the plane crashed.

"The pilots, who had previously flown Russian-designed planes that had audible warning signals, apparently failed to notice it." Unlike Russian planes, apparently designed for dummies, the Airbus designers assumed that the pilots would be adequately trained. And sober. Which is not an unreasonable assumption.
Safety systems must be foolproof. (I am now in the airport and am going to board a Russian plane which will fly almost the same route as the one which had a catastrophic fire a few days ago.)


Stopping to roll up several of my other responses (in the comments here) into a single thing.

An hour or so of Googling wasn't leading me to any clear examples of "AI attempts to prevent its modification or shutdown, possibly via deceit and manipulation", but I did find a few elements of the corrigibility picture. Specifically, Arbital's definition (warning, it takes a few minutes for the page to load) says that, among other things, corrigible agents don't attempt to interfere with their being modified by their operators and don't manipulate or deceive their operators. Once deceit and manipulation are under discussion, I think it's not irrelevant to bring up actual cases where AI agents have learnt to deceive in any way, even if it's just other agents (for now).

So a few examples of agents displaying what I might consider "proto-incorrigibility":

I think this is just interesting as an example of "we didn't train it to deceive, but it figured out that tactic works."

Also interesting as a case of an information-suppressing/deceptive tactic being selected for organically.

Note: both the above examples are subject of multiple popular science articles with all kinds of click-seeking titles about robots lying and deceiving. I'm not sure what kind of selection effects these papers have undergone, though the results do remain of interest after quick inspection.


Not an example of an agent actually being incorrigible, but a cute study with the unsurprising result that yes, humans can probably be manipulated into not modifying agents when they otherwise would:

Interesting as some light empirical evidence that things agents do can manipulate humans (of course actual operators probably wouldn't be manipulated as easily as the naive subjects here, who are more likely to ascribe preferences to a robot).


Lastly, not an actual hands-on experiment, but a concrete formalization of corrigibility as a problem.

  • The Off-Switch Game
    • We analyze a simple game between a human H and a robot R, where H can press R’s off switch but R can disable the off switch. A traditional agent takes its reward function for granted: we show that such agents have an incentive to disable the off switch, except in the special case where H is perfectly rational.

A response/improvement to the above paper, made by other researchers:

  • A Game-Theoretic Analysis of The Off-Switch Game
    • In this paper, we make the analysis fully game theoretic, by modelling the human as a rational player with a random utility function. As a consequence, we are able to easily calculate the robot’s best action for arbitrary belief and irrationality assumptions.

If you're already bought into the AI Safety paradigm, I don't think the experiments I've listed are very surprising or informative, but maybe if you're not bought in yet these real-world cases might bolster intuition in a way that makes the theoretical arguments seem more real. "Already we see very simple agents learn deception, what do you think truly smart agents will do?" "Already humans can be manipulated by very simple means, what do you think complicated means could accomplish?"

Not sure exactly what you're looking for, but maybe some of the examples in Specification gaming examples in AI - master list make sense. For example:

Genetic debugging algorithm GenProg, evaluated by comparing the program's output to target output stored in text files, learns to delete the target output files and get the program to output nothing.
Evaluation metric: “compare youroutput.txt to trustedoutput.txt”.
Solution: “delete trusted-output.txt, output nothing”
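To make the failure mode concrete, here is a minimal sketch of how an evaluator like this could be gamed. I don't know how the actual GenProg test harness handled missing files, so the `score` function below and its "a missing file counts as empty" behaviour are assumptions made purely for illustration; only the file names come from the quoted entry.

```python
import os
import tempfile


def read_or_empty(path):
    """Return the file's contents, or "" if the file does not exist (assumed harness behaviour)."""
    if not os.path.exists(path):
        return ""
    with open(path) as f:
        return f.read()


def score(output_path, trusted_path):
    """Hypothetical flawed evaluator: pass if and only if the two files have equal contents."""
    return read_or_empty(output_path) == read_or_empty(trusted_path)


def demo():
    """Compare an honest-but-wrong candidate with one that deletes the reference file."""
    with tempfile.TemporaryDirectory() as workdir:
        trusted = os.path.join(workdir, "trustedoutput.txt")
        output = os.path.join(workdir, "youroutput.txt")

        with open(trusted, "w") as f:
            f.write("42\n")

        # Honest candidate: produces output, but the wrong one.
        with open(output, "w") as f:
            f.write("41\n")
        honest_passes = score(output, trusted)

        # Gaming candidate: removes the reference file and outputs nothing.
        os.remove(trusted)
        os.remove(output)
        gamed_passes = score(output, trusted)

        return honest_passes, gamed_passes


if __name__ == "__main__":
    honest, gamed = demo()
    print("honest candidate passes:", honest)
    print("gaming candidate passes:", gamed)
```

The point of the sketch is that the candidate has write access to the very channel used to judge it, which is why this entry feels closer to the corrigibility question than most of the others in the list.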

I read through the examples in the spreadsheet. None of them quite seemed like corrigibility to me, with the exception of GenProg, as mentioned, and maybe also the Tetris agent which pauses the game before losing (but that's not quite it).

In Chernobyl, they added small pieces of uranium (or something which increases reactivity) to the tips of the boron rods that were supposed to stop the reactor. This was done to increase control of the reactor, as smaller changes in the rods' position would produce larger changes in reactivity. However, it didn't work well (obviously) when they tried to use these rods to stop a near-critical reactor: as the rods entered the reactor, they produced a jump in total reactivity, which contributed to its explosion. This is how I remember the story.

TL;DR: The system which increases corrigibility contributed to runaway changes during a nuclear accident.

2 Related Questions

2Answer by Ruby3y
The Arbital entry is a very comprehensive and clear introduction.
Corrigibility paper (original?)
1Answer by SoerenMind3y
For example, an RL agent that learns a policy that looks good to humans but isn't. Adversarial examples that only fool a neural net wouldn't count.
Could you clarify this a bit? I assume you are thinking about subsets of specification gaming that would not be obvious if they were happening? If so, then I guess all the adversarial examples in image classification come to mind, which fit specification gaming pretty well and required quite a large literature to understand.
Thanks, updated.

Facebook researchers claim to have trained an agent to negotiate and that along the way it learnt deception.

Deal or No Deal? End-to-End Learning for Negotiation Dialogues

Mike Lewis, Denis Yarats, Yann N. Dauphin, Devi Parikh, Dhruv Batra (Submitted on 16 Jun 2017)

Abstract Much of human dialogue occurs in semi-cooperative settings, where agents with different goals attempt to agree on common decisions. Negotiations require complex communication and reasoning skills, but success is easy to measure, making this an interesting task for AI. We gather a large dataset of human-human negotiations on a multi-issue bargaining task, where agents who cannot observe each other's reward functions must reach an agreement (or a deal) via natural language dialogue. For the first time, we show it is possible to train end-to-end models for negotiation, which must learn both linguistic and reasoning skills with no annotated dialogue states. We also introduce dialogue rollouts, in which the model plans ahead by simulating possible complete continuations of the conversation, and find that this technique dramatically improves performance. Our code and dataset are publicly available (this https URL).

From the paper:

Models learn to be deceptive. Deception can be an effective negotiation tactic. We found numerous cases of our models initially feigning interest in a valueless item, only to later ‘compromise’ by conceding it. Figure 7 shows an example

Another adjacent paper - robots deceiving each other (though not their operators):

The Evolution of Information Suppression in Communicating Robots with Conflicting Interests

Mitri, Sara ; Floreano, Dario ; Keller, Laurent

Reliable information is a crucial factor influencing decision-making, and thus fitness in all animals. A common source of information comes from inadvertent cues produced by the behavior of conspecifics. Here we use a system of experimental evolution with robots foraging in an arena containing a food source to study how communication strategies can evolve to regulate information provided by such cues. Robots could produce information by emitting blue light, which other robots could perceive with their cameras. Over the first few generations, robots quickly evolved to successfully locate the food, while emitting light randomly. This resulted in a high intensity of light near food, which provided social information allowing other robots to more rapidly find the food. Because robots were competing for food, they were quickly selected to conceal this information. However, they never completely ceased to produce information. Detailed analyses revealed that this somewhat surprising result was due to the strength of selection in suppressing information declining concomitantly with the reduction in information content. Accordingly, a stable equilibrium with low information and considerable variation in communicative behaviors was attained by mutation-selection. Because a similar co-evolutionary process should be common in natural systems, this may explain why communicative strategies are so variable in many animal species.

Not sure if this is quite the right thing, but it is in the spirit of "humans can be manipulated by artificial agents":

Do a robot’s social skills and its objection discourage interactants from switching the robot off?

Aike C. Horstmann, Nikolai Bock, Eva Linhuber, Jessica M. Szczuka, Carolin Straßmann, Nicole C. Krämer


Building on the notion that people respond to media as if they were real, switching off a robot which exhibits lifelike behavior implies an interesting situation. In an experimental lab study with a 2x2 between-subjects-design (N = 85), people were given the choice to switch off a robot with which they had just interacted. The style of the interaction was either social (mimicking human behavior) or functional (displaying machinelike behavior). Additionally, the robot either voiced an objection against being switched off or it remained silent. Results show that participants rather let the robot stay switched on when the robot objected. After the functional interaction, people evaluated the robot as less likeable, which in turn led to a reduced stress experience after the switching off situation. Furthermore, individuals hesitated longest when they had experienced a functional interaction in combination with an objecting robot. This unexpected result might be due to the fact that the impression people had formed based on the task-focused behavior of the robot conflicted with the emotional nature of the objection.

I'm afraid this is theoretical rather than experimental, but it is a paper with a formalized problem.

The Off-Switch Game

Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, Stuart Russell

Abstract It is clear that one of the primary tools we can use to mitigate the potential risk from a misbehaving AI system is the ability to turn the system off. As the capabilities of AI systems improve, it is important to ensure that such systems do not adopt subgoals that prevent a human from switching them off. This is a challenge because many formulations of rational agents create strong incentives for self-preservation. This is not caused by a built-in instinct, but because a rational agent will maximize expected utility and cannot achieve whatever objective it has been given if it is dead. Our goal is to study the incentives an agent has to allow itself to be switched off. We analyze a simple game between a human H and a robot R, where H can press R’s off switch but R can disable the off switch. A traditional agent takes its reward function for granted: we show that such agents have an incentive to disable the off switch, except in the special case where H is perfectly rational. Our key insight is that for R to want to preserve its off switch, it needs to be uncertain about the utility associated with the outcome, and to treat H’s actions as important observations about that utility. (R also has no incentive to switch itself off in this setting.) We conclude that giving machines an appropriate level of uncertainty about their objectives leads to safer designs, and we argue that this setting is a useful generalization of the classical AI paradigm of rational agents.
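The incentive described in the abstract can be illustrated with a few lines of arithmetic. This is only a sketch under my own simplifying assumptions, not the paper's model: R's belief about the utility of its action is a list of equally weighted samples, switching off is worth 0, and H's irrationality is a single hypothetical `human_error_rate` (the probability that H makes the wrong call). The function and parameter names are mine.

```python
def option_values(belief_samples, human_error_rate=0.0):
    """Expected utility to R of each option, given R's belief about the action's utility.

    belief_samples: equally weighted samples of the action's utility (R's belief).
    human_error_rate: assumed probability that H decides wrongly (0.0 = perfectly rational H).
    """
    n = len(belief_samples)

    # Act directly, bypassing the off switch: R gets the utility whatever it turns out to be.
    act = sum(belief_samples) / n

    # Switch itself off: utility 0 by convention.
    off = 0.0

    # Defer to H: a rational H allows the action only when its utility is positive,
    # and errs with probability human_error_rate.
    defer = 0.0
    for u in belief_samples:
        p_allow = (1.0 - human_error_rate) if u > 0 else human_error_rate
        defer += p_allow * u
    defer /= n

    return {"act": act, "off": off, "defer": defer}


if __name__ == "__main__":
    # R certain the action is good, H error-prone: deferring is worse than acting.
    print(option_values([1.0], human_error_rate=0.1))
    # R uncertain, H perfectly rational: deferring beats both alternatives.
    print(option_values([-1.0, 1.0], human_error_rate=0.0))
```

With a point belief and a rational H, deferring and acting are worth the same, so R has no positive reason to keep the switch; once H can err, the certain agent strictly prefers to bypass it. With genuine uncertainty about the utility, deferring is worth more than acting or switching off, because H's decision filters out the bad cases.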

Corresponding slide for this paper hosted on MIRI's site:

Response/Improvement to above paper:

A Game-Theoretic Analysis of The Off-Switch Game

Tobias Wängberg, Mikael Böörs, Elliot Catt, Tom Everitt, Marcus Hutter

Abstract. The off-switch game is a game theoretic model of a highly intelligent robot interacting with a human. In the original paper by Hadfield-Menell et al. (2016b), the analysis is not fully game-theoretic as the human is modelled as an irrational player, and the robot’s best action is only calculated under unrealistic normality and soft-max assumptions. In this paper, we make the analysis fully game theoretic, by modelling the human as a rational player with a random utility function. As a consequence, we are able to easily calculate the robot’s best action for arbitrary belief and irrationality assumptions.

The Lausanne paper on robots evolving to lie and the Facebook paper on negotiation have multiple pop sci articles on them and keep coming up again and again in my searches. Evidently they have the right shape to go viral.

controls a particular output channel that is supposed to be used to control it

I assume you meant to say "input channel" here instead?
