Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Thanks to Rebecca Gorman for co-developing this idea.

On the 26th of September 1983, Stanislav Petrov observed the early warning satellites reporting the launch of five nuclear missiles towards the Soviet Union. He decided to disobey orders and not pass the warning on to higher command; passing it on could easily have resulted in a nuclear war, since the Soviet nuclear posture was "launch on warning".

Now, did Petrov have free will when he decided to save the world?

Maintaining free will when knowledge increases

I don't intend to go into the subtle philosophical debate on the nature of free will. See this post for a good reductionist account. Instead, consider the following scenarios:

  1. The standard Petrov incident.
  2. The standard Petrov incident, except that it is still ongoing and Petrov hasn't reached a decision yet.
  3. The standard Petrov incident, after it was over, except that we don't yet know what his final decision was.
  4. The standard Petrov incident, except that we know that, if Petrov had had eggs that morning (instead of porridge[1]), he would have made a different decision.
  5. The same as scenario 4, except that some entity deliberately gave Petrov porridge that morning, aiming to determine his decision.
  6. The standard Petrov incident, except that a guy with a gun held Petrov hostage and forced him not to pass on the report.

There is an interesting contrast between scenarios 1, 2, and 3. Clearly, 1 and 3 only differ in our knowledge of the incident. It does not seem that Petrov's free will should depend on the degree of knowledge of some other person.

Scenarios 1 and 2 only differ in time: in one case the decision is made, in the second it is yet to be made. If we say that Petrov has free will, whatever that is, in scenario 2, then it seems that in scenario 1, we have to say that he "had" free will. So whatever our feeling on free will, it seems that knowing the outcome doesn't change whether there was free will or not.

That intuition is challenged by scenario 4. It's one thing to know that Petrov's decision was deterministic (or deterministic-stochastic if there's a true random element to it). It's another to know the specific causes of the decision.

And it's yet another thing if the specific causes have been influenced to manipulate the outcome, as in scenario 5. Again, all we have done here is add knowledge: we know the causes of Petrov's decision, and we know that his breakfast was chosen with that outcome in mind. But someone has to decide what Petrov had that morning[2]; why does it matter that it was done for a specific purpose?

Maybe this whole free will thing isn't important, after all? But it's clear in scenario 6 that something is wrong, even though Petrov has just as much free will in the philosophical sense: before, he could choose whether or not to pass on the warning; now, he can equally choose between not passing on the message and dying. This suggests that free will is something determined by outside features, not just internal ones. This is related to the concept of coercion and its philosophical analysis.

What free will we'd want from an AI

Scenarios 5 and 6 are problematic: call them manipulation and coercion, respectively. We might not want the AI to guarantee us free will, but we do want it to avoid manipulation and coercion.

Coercion is probably the easiest to define, and hence to avoid. We feel coercion when it's imposed on us, when our options narrow. Any reasonably aligned AI should avoid that. There remains the problem of cases where we don't realise that our options are narrowing, but that seems to be a case of manipulation, not coercion.

So, how do we avoid manipulation? Just giving Petrov eggs is not manipulation if the AI doesn't know the consequences of doing so. Nor does it become manipulation if the AI suddenly learns those consequences: knowledge doesn't remove free will or cause manipulation. And, indeed, it would be foolish to try to constrain an AI by restricting its knowledge.

So it seems we must accept that:

  1. The AI will likely know ahead of time what decision we will reach in certain circumstances.
  2. The AI will also know how to influence that decision.
  3. In many circumstances, the AI will have to influence that decision, simply because it has to take certain actions (or refrain from them). A butler AI will have to give Petrov breakfast, or let him go hungry (which will have its own consequences), even when it knows the consequences of its own decision.

So "no manipulation" or "maintaining human free will" seems to require a form of indifference: we want the AI to know how its actions affect our decisions, but not take that influence into account when choosing those actions.

It will be important to define exactly what we mean by that.
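
As a very rough illustration of the kind of indifference I have in mind, here is a toy sketch (in Python, with made-up numbers and names; this is not a proposed definition). The "indifferent" AI still has a full model of how each breakfast shifts Petrov's later decision, but it scores its actions as if that decision followed some fixed reference distribution, so the influence never enters its choice:

```python
# Toy sketch of "indifference to influence" (illustrative assumptions only).
# The AI knows how each action (which breakfast to serve) shifts Petrov's
# later decision, but the "indifferent" score evaluates actions as if that
# decision followed a fixed reference distribution instead.

actions = ["serve_eggs", "serve_porridge"]

# The AI's true prediction: P(Petrov reports the launch | AI action).
predicted_report_prob = {"serve_eggs": 0.9, "serve_porridge": 0.1}

# A fixed reference probability for Petrov's decision, chosen independently
# of the AI's action (e.g. his baseline behaviour).
reference_report_prob = 0.5

def utility(action, petrov_reports):
    """Toy utility: small preference over breakfasts, huge stake in the outcome."""
    breakfast_value = {"serve_eggs": 1.0, "serve_porridge": 0.8}[action]
    outcome_value = -100.0 if petrov_reports else 0.0
    return breakfast_value + outcome_value

def manipulative_score(action):
    """Uses the true prediction: influencing Petrov counts in the action's favour."""
    p = predicted_report_prob[action]
    return p * utility(action, True) + (1 - p) * utility(action, False)

def indifferent_score(action):
    """Uses the fixed reference probability: the influence on Petrov is ignored."""
    p = reference_report_prob
    return p * utility(action, True) + (1 - p) * utility(action, False)

print(max(actions, key=manipulative_score))  # serve_porridge: steers Petrov's decision
print(max(actions, key=indifferent_score))   # serve_eggs: the better breakfast on its own merits
```

The difficult part is choosing that reference distribution in a principled way; that is part of what needs to be defined.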


  1. I have no idea what Petrov actually had for breakfast, that day or any other. ↩︎

  2. Even if Petrov himself decided what to have for breakfast, he chose among the options that were possible for him that morning. ↩︎

Comments

I don't get where the assertion that knowledge doesn't lead to manipulation comes from. If you give a child something that looks like a water gun but actually fires a chemical round, you would be on the hook as responsible for any deaths, despite the fact that pulling the trigger was the child's free choice. It isn't even that hard to imagine that you could cognitively dominate the child, in that you could reliably predict what they would be up to. The fact that your tool for murder is an agent with a will doesn't bear that much weight.

Consider negligent manslaughter, where you had a duty to do something properly, where your failure did in fact lead to a death, but where you could not reasonably have anticipated the specific death happening. Upping your ability to anticipate things will push such cases towards manslaughter and murder.

In a similar way, if you pull the trigger on a gun which you think is loaded but which is in fact empty, you can be guilty of attempted murder despite there being no real risk of anyone dying. Thinking (if the thought isn't ridiculously insane) that the eggs are connected to whether the launch report goes through or not would totally make you culpable for the launch (not that anyone would catch you).

So "no manipulation" or "maintaining human free will" seems to require a form of indifference: we want the AI to know how its actions affect our decisions, but not take that influence into account when choosing those actions.

Two thoughts.

One, this seems likely to have some overlap with notions of impact and impact measures.

Two, it seems like there's no real way to eliminate manipulation in a very broad sense, because we'd expect our AI to be causally entangled with the human, so there's no action the AI could take that would not influence the human in some way. Whether or not there is manipulation seems to require making a choice about what kind of changes in the human's behavior matter, similar to problems we face in specifying values or defining concepts.

TurnTrout:

Not Stuart, but I agree there's overlap here. Personally, I think about manipulation as when an agent's policy robustly steers the human into taking a certain kind of action, in a way that's robust to the human's counterfactual preferences. Like if I'm choosing which pair of shoes to buy, and I ask the AI for help, and no matter what preferences I had for shoes to begin with, I end up buying blue shoes, then I'm probably being manipulated. A non-manipulative AI would act in a way that increases my knowledge and lets me condition my actions on my preferences.
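
As a toy way of cashing out this framing (an illustrative sketch only; the function names and the threshold are my assumptions, not a worked-out formalism), one could model the human's final action under a range of counterfactual starting preferences and flag the policy as suspicious if the outcome barely varies:

```python
from collections import Counter

def looks_manipulative(final_action_given_prefs, counterfactual_prefs, threshold=0.9):
    """final_action_given_prefs maps a starting preference to the action the human
    ends up taking after interacting with the AI (from a hypothetical model or
    simulation). High concentration on one action, regardless of starting
    preferences, is evidence of steering under this framing."""
    outcomes = [final_action_given_prefs(pref) for pref in counterfactual_prefs]
    most_common_count = Counter(outcomes).most_common(1)[0][1]
    return most_common_count / len(outcomes) >= threshold

# Toy example: whatever shoe colour the human initially prefers, the
# interaction ends with them buying blue shoes -> looks manipulative.
prefs = ["likes_red", "likes_green", "likes_blue", "likes_black"]
print(looks_manipulative(lambda pref: "buy_blue_shoes", prefs))              # True
print(looks_manipulative(lambda pref: "buy_" + pref[6:] + "_shoes", prefs))  # False
```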

Like if I'm choosing which pair of shoes to buy, and I ask the AI for help, and no matter what preferences I had for shoes to begin with, I end up buying blue shoes, then I'm probably being manipulated.

Manipulation 101: tell people "We only have blue shoes in stock. Take it or leave it."

EDIT: This example was intentionally chosen because it could be true. How do we distinguish between 'effects of the truth' and 'manipulation'?

Speculative: it's possible that things we see as maladaptive (why 'resist the truth', if 'it is never rational to do so'?) may exist because of difficulties we have in distinguishing the two.

Hmm, I see some problems here.

By looking for manipulation on the basis of counterfactuals, you're at the mercy of your ability to find such counterfactuals, and that ability can also be manipulated, such that you can't notice either the object-level counterfactuals that would make you suspect manipulation, or the counterfactuals about your counterfactual reasoning that would make you suspect manipulation. This seems like an insufficiently robust way to detect manipulation, or even to define it, since the mechanism for detecting it can itself be manipulated into not noticing what would otherwise have been considered manipulation.

Perhaps my point is to express a general doubt that we can cleanly detect manipulation outside the context of human behavioral norms. I suspect the cognitive machinery that implements those norms is malleable enough that it can be manipulated into not noticing what it would previously have considered manipulation. Nor is it clear that this is always bad, since in some cases we might be mistaken, in some sense, about what is really manipulative; though that runs into the problem that it's not clear what it means to be mistaken about normative claims.

OK, but there's a difference between "here's a definition of manipulation that's so waterproof you couldn't break it if you optimized against it with arbitrarily large optimization power" and "here's my current best way of thinking about manipulation." I was presenting the latter, because it helps me be less confused than if I just stuck to my previous gut-level, intuitive understanding of manipulation.

Edit: Put otherwise, I was replying more to your point (1) than your point (2) in the original comment. Sorry for the ambiguity!

I agree. The important part of cases 5 & 6, where some other agent "manipulates" Petrov, is that suddenly, to us human readers, it seems like the protagonist of the story (and we do model it as a story) is the cook/kidnapper, not Petrov.

I'm fine with the AI choosing actions using a model of the world that includes me. I'm not fine with it supplanting me from my agent-shaped place in the story I tell about my life.

I was slightly confused by the beginning of the post, but by the end I was on board with the questions asked and the problems posed.

On impact measures, there's already some discussion in this comment thread, but I'll put some more thoughts here. My first reaction to reading the last section was to think of attainable utility: non-manipulation as preservation of attainable utility. Sitting on this idea, I'm not sure it works as a non-manipulation condition, since it lets the AI manipulate us into having what we want. There should be no risk of it changing our utility function, since that would be a big change in attainable utility; but still, we might not want to be manipulated even for our own good (like some people's reactions to nudges).

Maybe there can be an alternative version of attainable utility, something like "attainable choice", which ensures that other agents (us included) are still able to make choices. Or, to put it in terms of free will, that these agents' choices are still primarily determined by internal causes, that is, by them, instead of primarily determined by external causes like the AI.

We can even imagine integrating attainable utility and attainable choice together (by weighting them, for example), so that manipulation is avoided in a lot of cases, but the AI still manipulates Petrov into not reporting if not reporting saves the world (because that maintains attainable utility). So it solves the issue mentioned in this comment thread.
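
As a rough sketch of what that weighted combination could look like (toy stand-in functions and names, under assumed definitions of both penalties; not a worked-out proposal):

```python
# Sketch of combining an attainable-utility penalty with an "attainable choice"
# penalty. In practice the q_aux functions and the choice measure would come
# from learned models or environment simulations; here they are stand-ins.

def aup_penalty(state, action, noop, q_aux_fns):
    """How much the action changes the AI's ability to attain a set of
    auxiliary goals, relative to doing nothing."""
    return sum(abs(q(state, action) - q(state, noop)) for q in q_aux_fns)

def choice_penalty(state, action, noop, choice_measure):
    """How much the action reduces the human's ability to reach their
    different possible decisions, relative to doing nothing."""
    return abs(choice_measure(state, action) - choice_measure(state, noop))

def score(state, action, noop, task_reward, q_aux_fns, choice_measure,
          weight_aup=1.0, weight_choice=1.0):
    """Task reward minus the two weighted penalties."""
    return (task_reward(state, action)
            - weight_aup * aup_penalty(state, action, noop, q_aux_fns)
            - weight_choice * choice_penalty(state, action, noop, choice_measure))

# Toy usage: one auxiliary goal, and a choice measure counting how many
# decisions remain realistically open to Petrov after the AI acts.
toy_q_aux = [lambda s, a: {"serve_eggs": 2.0, "serve_porridge": 1.0, "noop": 1.5}[a]]
toy_choice = lambda s, a: {"serve_eggs": 2, "serve_porridge": 1, "noop": 2}[a]
toy_reward = lambda s, a: {"serve_eggs": 1.0, "serve_porridge": 0.8, "noop": 0.0}[a]

for act in ["serve_eggs", "serve_porridge"]:
    print(act, score(None, act, "noop", toy_reward, toy_q_aux, toy_choice))
```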

(I have a big google doc analyzing corrigibility & manipulation from the attainable utility landscape frame; I’ll link it here when the post goes up on LW)

When do you plan on posting this? I'm interested in reading it

Ideally within the next month!

So "no manipulation" or "maintaining human free will" seems to require a form of indifference: we want the AI to know how its actions affect our decisions, but not take that influence into account when choosing those actions.

I think the butler can take that influence into account in making its choices, but still reduce its manipulativity by explaining to Petrov what it knows about how breakfast will affect Petrov's later choices.  When they're on equal epistemic footing, Petrov can also take that information into account, and perhaps choose to deliberately resist the influence of breakfast, if he doesn't endorse it.  Of course, there are limits to how much explanation is possible across a substantial intelligence gap between AI and people, so this doesn't dissolve manipulation entirely.

Scenario 5 sounds like something an aligned AI should do.  Actually, taking Petrov hostage would also be the right thing to do, if there was no better way to save people's lives. It seems fine to me to take away someone's option to start a nuclear war?

I think manipulation is bad when it's used to harm you, but it's good if it's used to help you make better decisions. Like that time when banning lead reduced crime by 50%. Isn't this the kind of thing an AI should do? We hire all kinds of people to manipulate us into becoming better: psychotherapists, fitness instructors, teachers. Why would it be wrong for an AI to fill these roles?

Some people (me included) value a certain level of non-manipulation. I'm trying to cash out that instinct. And it's also needed for some ideas like corrigibility. Manipulation also combines poorly with value learning; see e.g. our paper here: https://arxiv.org/abs/2004.13654

I do agree that saving the world is a clearly positive case of that ^_^

Scenario 7: The standard Petrov incident, except Petrov fancies himself a nihilist and would rather as many people as possible died. A clairvoyant who respects Petrov's agency suspects that Petrov is wrong about his own values, and sits him down for a respectful, open-ended conversation in which some forms of manipulation (e.g. appeals to how he feels about hypothetical scenarios) are fair and others (appeals to shame through insults) are not, both to help Petrov live more in accordance with his deeper values and to ensure that Petrov will not pass on the report. The clairvoyant follows the rules of the conversation, performing only the manipulations agreed upon as fair, and thereby succeeds in persuading Petrov.

Scenario 8: The same as 7, except Petrov doesn't listen to anyone's advice; conditional only on Petrov being so unreasonable, the clairvoyant plays tit-for-tat with Petrov's decision to live by the rules of the jungle, and with similar unreasonableness substitutes porridge for Petrov's breakfast, changing his decision.

Scenario 9: The same as 8, except the clairvoyant fully respects Petrov's agency even when he exercises it unreasonably, and Petrov issues the message to higher command, causing nuclear war.