Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Thanks to Rebecca Gorman for co-developing this idea.

On the 26th of September 1983, Stanislav Petrov observed the early warning satellites reporting the launch of five nuclear missiles towards the Soviet Union. He decided to disobey orders and not pass the warning on to higher command; passing it on could easily have resulted in a nuclear war, since the Soviet nuclear posture was "launch on warning".

Now, did Petrov have free will when he decided to save the world?

Maintaining free will when knowledge increases

I don't intend to go into the subtle philosophical debate on the nature of free will. See this post for a good reductionist account. Instead, consider the following scenarios:

  1. The standard Petrov incident.
  2. The standard Petrov incident, except that it is still ongoing and Petrov hasn't reached a decision yet.
  3. The standard Petrov incident, after it was over, except that we don't yet know what his final decision was.
  4. The standard Petrov incident, except that we know that, if Petrov had had eggs that morning (instead of porridge[1]), he would have made a different decision.
  5. The same as scenario 4, except that some entity deliberately gave Petrov porridge that morning, aiming to determine his decision.
  6. The standard Petrov incident, except that a guy with a gun held Petrov hostage and forced him not to pass on the report.

There is an interesting contrast between scenarios 1, 2, and 3. Clearly, 1 and 3 only differ in our knowledge of the incident. It does not seem that Petrov's free will should depend on the degree of knowledge of some other person.

Scenarios 1 and 2 only differ in time: in one case the decision is made, in the second it is yet to be made. If we say that Petrov has free will, whatever that is, in scenario 2, then it seems that in scenario 1, we have to say that he "had" free will. So whatever our feeling on free will, it seems that knowing the outcome doesn't change whether there was free will or not.

That intuition is challenged by scenario 4. It's one thing to know that Petrov's decision was deterministic (or deterministic-stochastic if there's a true random element to it). It's another to know the specific causes of the decision.

And it's yet another thing if the specific causes have been influenced to manipulate the outcome, as in scenario 5. Again, all we have done here is add knowledge: we know the causes of Petrov's decision, and we know that his breakfast was chosen with that outcome in mind. But someone has to decide what Petrov had that morning[2]; why does it matter that it was done for a specific purpose?

Maybe this whole free will thing isn't important, after all? But it's clear in scenario 6 that something is wrong, even though Petrov has just as much free will in the philosophical sense: before, he could choose whether or not to pass on the warning; now, he can equally choose between not passing on the message and dying. This suggests that free will is something determined by outside features, not just internal ones. This is related to the concept of coercion and its philosophical analysis.

What free will we'd want from an AI

Scenarios 5 and 6 are problematic: call them manipulation and coercion, respectively. We might not want the AI to guarantee us free will, but we do want it to avoid manipulation and coercion.

Coercion is probably the easiest to define, and hence to avoid. We feel coercion when it's imposed on us, when our options narrow. Any reasonably aligned AI should avoid that. There remains the problem of cases where we don't realise that our options are narrowing, but that seems to be a case of manipulation, not coercion.

So, how do we avoid manipulation? Just giving Petrov eggs is not manipulation if the AI doesn't know the consequences of doing so. Nor does it become manipulation if the AI suddenly learns those consequences: knowledge doesn't remove free will or cause manipulation. And, indeed, it would be foolish to try to constrain an AI by restricting its knowledge.

So it seems we must accept that:

  1. The AI will likely know ahead of time what decision we will reach in certain circumstances.
  2. The AI will also know how to influence that decision.
  3. In many circumstances, the AI will have to influence that decision, simply because it has to take certain actions (or refrain from them). A butler AI will have to give Petrov breakfast, or let him go hungry (which will have its own consequences), even when it knows the consequences of its own decision.

So "no manipulation" or "maintaining human free will" seems to require a form of indifference: we want the AI to know how its actions affect our decisions, but not take that influence into account when choosing those actions.

It will be important to define exactly what we mean by that.
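
As a very rough illustration of the kind of indifference I have in mind, here is a toy sketch (in Python, with made-up numbers and names; this is not a proposed definition). The "indifferent" AI still has a full model of how each breakfast shifts Petrov's later decision, but it scores its actions as if that decision followed some fixed reference distribution, so the influence never enters its choice:

```python
# Toy sketch of "indifference to influence" (illustrative assumptions only).
# The AI knows how each action (which breakfast to serve) shifts Petrov's
# later decision, but the "indifferent" score evaluates actions as if that
# decision followed a fixed reference distribution instead.

actions = ["serve_eggs", "serve_porridge"]

# The AI's true prediction: P(Petrov reports the launch | AI action).
predicted_report_prob = {"serve_eggs": 0.9, "serve_porridge": 0.1}

# A fixed reference probability for Petrov's decision, chosen independently
# of the AI's action (e.g. his baseline behaviour).
reference_report_prob = 0.5

def utility(action, petrov_reports):
    """Toy utility: small preference over breakfasts, huge stake in the outcome."""
    breakfast_value = {"serve_eggs": 1.0, "serve_porridge": 0.8}[action]
    outcome_value = -100.0 if petrov_reports else 0.0
    return breakfast_value + outcome_value

def manipulative_score(action):
    """Uses the true prediction: influencing Petrov counts in the action's favour."""
    p = predicted_report_prob[action]
    return p * utility(action, True) + (1 - p) * utility(action, False)

def indifferent_score(action):
    """Uses the fixed reference probability: the influence on Petrov is ignored."""
    p = reference_report_prob
    return p * utility(action, True) + (1 - p) * utility(action, False)

print(max(actions, key=manipulative_score))  # serve_porridge: steers Petrov's decision
print(max(actions, key=indifferent_score))   # serve_eggs: the better breakfast on its own merits
```

The difficult part is choosing that reference distribution in a principled way; that is part of what needs to be defined.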


  1. I have no idea what Petrov actually had for breakfast, that day or any other. ↩︎

  2. Even if Petrov himself decided what to have for breakfast, he chose among the options that were possible for him that morning. ↩︎

Comments

I don't get where the assertion that knowledge doesn't lead to manipulation comes from. If you give a child something that looks like a water gun but actually fires a chemical round, you would be on the hook as responsible for any deaths, despite the fact that pulling the trigger was the child's free choice. It isn't even that hard to imagine that you could cognitively dominate the child, in that you could reliably predict what they would be up to. The fact that your tool for murder is an agent with a will doesn't bear that much weight.

Consider negligent manslaughter, where you had a duty to do something properly, where your failure did in fact lead to a death, but where you could not reasonably have anticipated the specific death happening. Upping your ability to anticipate things will push such cases towards manslaughter and murder.

In a similar way, if you pull the trigger on a gun which you think is loaded but which is in fact empty, you can be guilty of attempted murder despite there being no real risk of anyone dying. Thinking (if the thought isn't ridiculously insane) that the eggs are connected to whether the launch report goes through or not would totally make you culpable for the launch (not that anyone would catch you).

So "no manipulation" or "maintaining human free will" seems to require a form of indifference: we want the AI to know how its actions affect our decisions, but not take that influence into account when choosing those actions.

Two thoughts.

One, this seems likely to have some overlap with notions of impact and impact measures.

Two, it seems like there's no real way to eliminate manipulation in a very broad sense, because we'd expect our AI to be causally entangled with the human, so there's no action the AI could take that would not influence the human in some way. Whether or not there is manipulation seems to require making a choice about what kind of changes in the human's behavior matter, similar to problems we face in specifying values or defining concepts.

TurnTrout:

Not Stuart, but I agree there's overlap here. Personally, I think about manipulation as when an agent's policy robustly steers the human into taking a certain kind of action, in a way that's robust to the human's counterfactual preferences. Like if I'm choosing which pair of shoes to buy, and I ask the AI for help, and no matter what preferences I had for shoes to begin with, I end up buying blue shoes, then I'm probably being manipulated. A non-manipulative AI would act in a way that increases my knowledge and lets me condition my actions on my preferences.
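
As a toy way of cashing out this framing (an illustrative sketch only; the function names and the threshold are my assumptions, not a worked-out formalism), one could model the human's final action under a range of counterfactual starting preferences and flag the policy as suspicious if the outcome barely varies:

```python
from collections import Counter

def looks_manipulative(final_action_given_prefs, counterfactual_prefs, threshold=0.9):
    """final_action_given_prefs maps a starting preference to the action the human
    ends up taking after interacting with the AI (from a hypothetical model or
    simulation). High concentration on one action, regardless of starting
    preferences, is evidence of steering under this framing."""
    outcomes = [final_action_given_prefs(pref) for pref in counterfactual_prefs]
    most_common_count = Counter(outcomes).most_common(1)[0][1]
    return most_common_count / len(outcomes) >= threshold

# Toy example: whatever shoe colour the human initially prefers, the
# interaction ends with them buying blue shoes -> looks manipulative.
prefs = ["likes_red", "likes_green", "likes_blue", "likes_black"]
print(looks_manipulative(lambda pref: "buy_blue_shoes", prefs))              # True
print(looks_manipulative(lambda pref: "buy_" + pref[6:] + "_shoes", prefs))  # False
```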

Like if I'm choosing which pair of shoes to buy, and I ask the AI for help, and no matter what preferences I had for shoes to begin with, I end up buying blue shoes, then I'm probably being manipulated.

Manipulation 101: tell people "We only have blue shoes in stock. Take it or leave it."

EDIT: This example was intentionally chosen because it could be true. How do we distinguish between 'effects of the truth' and 'manipulation'?

Speculative: it's possible that things we see as maladaptive (why 'resist the truth', if 'it is never rational to do so'?) may exist because of difficulties we have in distinguishing the two.

Hmm, I see some problems here.

By looking for manipulation on the basis of counterfactuals, you're at the mercy of your ability to find such counterfactuals, and that ability can also be manipulated, such that you can't notice either the object-level counterfactuals that would make you suspect manipulation, or the counterfactuals about your counterfactual reasoning that would make you suspect manipulation. This seems like an insufficiently robust way to detect manipulation, or even to define it, since the mechanism for detecting it can itself be manipulated into not noticing what would otherwise have been considered manipulation.

Perhaps my point is to express a general doubt that we can cleanly detect manipulation outside the context of human behavioral norms. I suspect the cognitive machinery that implements those norms is malleable enough that it can be manipulated into not noticing what it would previously have considered manipulation. Nor is it clear that this is always bad, since in some cases we might be mistaken, in some sense, about what is really manipulative; though that runs into the problem that it's not clear what it means to be mistaken about normative claims.

OK, but there's a difference between "here's a definition of manipulation that's so waterproof you couldn't break it if you optimized against it with arbitrarily large optimization power" and "here's my current best way of thinking about manipulation." I was presenting the latter, because it helps me be less confused than if I just stuck to my previous gut-level, intuitive understanding of manipulation.

Edit: Put otherwise, I was replying more to your point (1) than your point (2) in the original comment. Sorry for the ambiguity!

I agree. The important part of cases 5 & 6, where some other agent "manipulates" Petrov, is that suddenly, to us human readers, it seems like the protagonist of the story (and we do model it as a story) is the cook/kidnapper, not Petrov.

I'm fine with the AI choosing actions using a model of the world that includes me. I'm not fine with it supplanting me from my agent-shaped place in the story I tell about my life.

I was slightly confused by the beginning of the post, but by the end I was on board with the questions asked and the problems posed.

On impact measures, there's already some discussion in this comment thread, but I'll put some more thoughts here. My first reaction to reading the last section was to think of attainable utility: non-manipulation as preservation of attainable utility. Sitting on this idea, I'm not sure it works as a non-manipulation condition, since it lets the AI manipulate us into having what we want. There should be no risk of it changing our utility function, since that would be a big change in attainable utility; but still, we might not want to be manipulated even for our own good (like some people's reactions to nudges).

Maybe there can be an alternative version of attainable utility, something like "attainable choice", which ensures that other agents (us included) are still able to make choices. Or, to put it in terms of free will, that these agents' choices are still primarily determined by internal causes, that is, by them, instead of primarily determined by external causes like the AI.

We can even imagine integrating attainable utility and attainable choice together (by weighting them, for example), so that manipulation is avoided in a lot of cases, but the AI still manipulates Petrov into not reporting if not reporting saves the world (because that maintains attainable utility). So it solves the issue mentioned in this comment thread.
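
As a rough sketch of what that weighted combination could look like (toy stand-in functions and names, under assumed definitions of both penalties; not a worked-out proposal):

```python
# Sketch of combining an attainable-utility penalty with an "attainable choice"
# penalty. In practice the q_aux functions and the choice measure would come
# from learned models or environment simulations; here they are stand-ins.

def aup_penalty(state, action, noop, q_aux_fns):
    """How much the action changes the AI's ability to attain a set of
    auxiliary goals, relative to doing nothing."""
    return sum(abs(q(state, action) - q(state, noop)) for q in q_aux_fns)

def choice_penalty(state, action, noop, choice_measure):
    """How much the action reduces the human's ability to reach their
    different possible decisions, relative to doing nothing."""
    return abs(choice_measure(state, action) - choice_measure(state, noop))

def score(state, action, noop, task_reward, q_aux_fns, choice_measure,
          weight_aup=1.0, weight_choice=1.0):
    """Task reward minus the two weighted penalties."""
    return (task_reward(state, action)
            - weight_aup * aup_penalty(state, action, noop, q_aux_fns)
            - weight_choice * choice_penalty(state, action, noop, choice_measure))

# Toy usage: one auxiliary goal, and a choice measure counting how many
# decisions remain realistically open to Petrov after the AI acts.
toy_q_aux = [lambda s, a: {"serve_eggs": 2.0, "serve_porridge": 1.0, "noop": 1.5}[a]]
toy_choice = lambda s, a: {"serve_eggs": 2, "serve_porridge": 1, "noop": 2}[a]
toy_reward = lambda s, a: {"serve_eggs": 1.0, "serve_porridge": 0.8, "noop": 0.0}[a]

for act in ["serve_eggs", "serve_porridge"]:
    print(act, score(None, act, "noop", toy_reward, toy_q_aux, toy_choice))
```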

(I have a big google doc analyzing corrigibility & manipulation from the attainable utility landscape frame; I’ll link it here when the post goes up on LW)

When do you plan on posting this? I'm interested in reading it

Ideally within the next month!

So "no manipulation" or "maintaining human free will" seems to require a form of indifference: we want the AI to know how its actions affect our decisions, but not take that influence into account when choosing those actions.

I think the butler can take that influence into account in making its choices, but still reduce its manipulativity by explaining to Petrov what it knows about how breakfast will affect Petrov's later choices.  When they're on equal epistemic footing, Petrov can also take that information into account, and perhaps choose to deliberately resist the influence of breakfast, if he doesn't endorse it.  Of course, there are limits to how much explanation is possible across a substantial intelligence gap between AI and people, so this doesn't dissolve manipulation entirely.

Scenario 5 sounds like something an aligned AI should do.  Actually, taking Petrov hostage would also be the right thing to do, if there was no better way to save people's lives. It seems fine to me to take away someone's option to start a nuclear war?

I think manipulation is bad when it's used to harm you, but it's good if it's used to help you make better decisions. Like that time when banning lead reduced crime by 50%. Isn't this the kind of thing an AI should do? We hire all kinds of people to manipulate us into becoming better: psychotherapists, fitness instructors, teachers. Why would it be wrong for an AI to fill these roles?

Some people (me included) value a certain level of non-manipulation. I'm trying to cash out that instinct. And it's also needed for some ideas like corrigibility. Manipulation also combines poorly with value learning; see e.g. our paper here: https://arxiv.org/abs/2004.13654

I do agree that saving the world is a clearly positive case of that ^_^

Scenario 7: The standard Petrov incident, except Petrov fancies himself a nihilist and would rather as many people as possible died. A clairvoyant who respects Petrov's agency suspects that Petrov is wrong about his own values, and sits him down for a respectful, open-ended conversation in which some forms of manipulation (e.g. appeals to how he feels about hypothetical scenarios) are fair and others (appeals to shame through insults) are not, both to help Petrov live more in accordance with his deeper values and to ensure that Petrov will not pass on the report. The clairvoyant follows the rules of the conversation, performing only the manipulations agreed upon as fair, and thereby succeeds in persuading Petrov.

Scenario 8: The same as 7, except Petrov doesn't listen to anyone's advice; conditional only on Petrov being so unreasonable, the clairvoyant plays tit-for-tat with Petrov's decision to live by the rules of the jungle, and with similar unreasonableness substitutes porridge for Petrov's breakfast, changing his decision.

Scenario 9: The same as 8, except the clairvoyant fully respects Petrov's agency even when he exercises it unreasonably, and Petrov issues the message to higher command, causing nuclear war.