Rolling up several of my other responses (in the comments here) into a single comment.
An hour or so of Googling wasn't leading me to any clear examples of "AI attempts to prevent its modification or shutdown, possibly via deceit and manipulation", but I did find a few elements of the corrigibility picture. Specifically, Arbital's definition (warning: the page takes a few minutes to load) says that, among other things, corrigible agents don't attempt to interfere with being modified by their operators and don't manipulate or deceive their operators. Once deceit and manipulation are under discussion, I think it's not irrelevant to bring up actual cases where AI agents have learnt to deceive in some way, even if, for now, it's just other agents they're deceiving.
So a few examples of agents displaying what I might consider "proto-incorrigibility":
I think this is just interesting as an example of "we didn't train it to deceive, but it figured out that tactic works."
Also interesting as a case of an informational signal/deceptive tactic being selected for organically.
Note: both the above examples are the subject of multiple popular-science articles with all kinds of click-seeking titles about robots lying and deceiving. I'm not sure what selection effects these papers have undergone, though the results do remain of interest after a quick inspection.
Not actually an agent being incorrigible, but a cute study with the unsurprising result that yes, humans can probably be manipulated into not modifying agents when they otherwise would:
Interesting as some light empirical evidence that an agent's behavior can manipulate humans (though the actual operators of a system probably wouldn't be manipulated as easily as naive subjects, who are more likely to ascribe preferences to a robot).
Lastly, not a hands-on experiment, but a concrete formalization of corrigibility as a problem:
- The Off-Switch Game
- We analyze a simple game between a human H and a robot R, where H can press R’s off switch but R can disable the off switch. A traditional agent takes its reward function for granted: we show that such agents have an incentive to disable the off switch, except in the special case where H is perfectly rational.
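The core incentive in that abstract can be illustrated numerically. Here's a minimal sketch (the distribution and payoffs are my own illustrative choices, not taken from the paper): the robot is uncertain about the utility u of its proposed action, while the human knows u exactly. With a perfectly rational human, deferring to the off switch weakly dominates acting unilaterally, since E[max(u, 0)] ≥ max(E[u], 0).

```python
import random
import statistics

random.seed(0)
# Robot's belief over the true utility u of its action (illustrative Gaussian).
samples = [random.gauss(0.5, 1.0) for _ in range(100_000)]

# Strategy "act" (equivalently: disable the off switch, then act):
# the robot gets u regardless of the human's wishes.
ev_act = statistics.mean(samples)

# Strategy "wait": the robot defers; a perfectly rational human lets the
# action through only when u > 0, otherwise presses the switch (payoff 0).
ev_wait = statistics.mean(max(u, 0.0) for u in samples)

print(f"E[act]  = {ev_act:.3f}")
print(f"E[wait] = {ev_wait:.3f}")  # weakly larger: oversight filters out bad actions
```

With these numbers, deferring is strictly better, because the rational human's veto filters out exactly the negative-utility outcomes; the paper's point is that this only holds in the special case of a perfectly rational human.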
A response/improvement to the above paper made by other researchers:
- A Game-Theoretic Analysis of The Off-Switch Game
- In this paper, we make the analysis fully game theoretic, by modelling the human as a rational player with a random utility function. As a consequence, we are able to easily calculate the robot’s best action for arbitrary belief and irrationality assumptions.
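To see why irrationality assumptions matter, we can extend the sketch above with a noisy human. This is a hedged toy model (the Boltzmann parameterization is my choice for illustration, not necessarily the paper's): the human lets the action through with probability sigmoid(beta * u), where beta → ∞ recovers the perfectly rational human and beta = 0 a coin-flipping one.

```python
import math
import random
import statistics

def ev_wait(samples, beta):
    # Expected utility of deferring to a Boltzmann-noisy human with
    # rationality beta: the action goes through with prob. sigmoid(beta*u).
    return statistics.mean(u / (1 + math.exp(-beta * u)) for u in samples)

random.seed(0)
samples = [random.gauss(0.5, 1.0) for _ in range(100_000)]
ev_act = statistics.mean(samples)  # bypass oversight and just act

for beta in (0.0, 1.0, 10.0):
    print(f"beta={beta:5.1f}  E[wait]={ev_wait(samples, beta):.3f}  E[act]={ev_act:.3f}")
```

With beta = 0 the human's veto carries no information, so waiting halves the robot's expected utility and it prefers to disable the switch; with large beta, waiting dominates as before. The robot's best action genuinely depends on the belief and irrationality assumptions, which is the paper's point.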
If you're already bought into the AI safety paradigm, I don't think the experiments I've listed are very surprising or informative. But if you're not bought in yet, these real-world cases might bolster intuition in a way that makes the theoretical arguments seem more real: "Already we see very simple agents learn deception; what do you think truly smart agents will do?" "Already humans can be manipulated by very simple means; what do you think complicated means could accomplish?"