By default, avoid ambiguous distant situations

by Stuart_Armstrong, 21st May 2019


Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

"The Ood.[...] They're born for it. Basic slave race." Mr. Jefferson, The Impossible Planet.

"And although he may be poor, he shall never be a slave," from the Battle Cry of Freedom.

I've talked about morally distant situations: situations far removed from our own. This distance is in the moral sense, not the actual sense: ancient China is further from us than Star Wars is (even for modern Chinese). There are possible worlds out there far more alien than anything in most of our fiction.

In these distant situations, our usual web of connotations falls apart, and closely related terms start to mean different things. As shown in this post, in extreme cases, our preferences can become nonsensical.

Note that not all our preferences need become nonsensical, just some of them. Consider the case of a slave race - a willing slave race. In that situation, our connotations about slavery may come apart: willing and slave don't fit well together. However, we would still be clear ourselves that we didn't want to be slaves. So though some preferences lose power, some do not.

But let's return to that willing slave race situation. Eliezer's Harry says:

Whoever had created house elves in the first place had been unspeakably evil, obviously; but that didn't mean Hermione was doing the right thing now by denying sentient beings the drudgery they had been shaped to enjoy.

Or consider Douglas Adams's cow variant, bred to be willingly eaten:

"I just don't want to eat an animal that's standing here inviting me to," said Arthur, "it's heartless."

"Better than eating an animal that doesn't want to be eaten," said Zaphod. [...]

"May I urge you to consider my liver?" asked the animal, "it must be very rich and tender by now, I've been force-feeding myself for months."

At X, the decision is clear. Should we go to X in the first place?

In those situations, some immediate actions are pretty clear. There is no point in freeing a willing slave race; there is no advantage to eating an animal that doesn't want to be eaten, rather than one that does.

The longer-term actions are more ambiguous, especially as they conflict with others of our values: for example, should we forcibly change the preferences of the slave race/edible race so that they don't have those odd preferences any more? Does it make a difference if there are more manipulative paths that achieve the same results, without directly forcing them? We may not want to allow manipulative paths to count as acceptable in general.

But, laying that aside, it seems there is a prima facie case that we shouldn't enter those kinds of situations: that non-conscious robots are better than conscious willing slaves, and that vat-grown meat is better than conscious willing livestock.

So there seems to be a good rule of thumb: don't go there. Add an axiom A:

  • A: Situations in which the web of connotations of a strong preference falls apart should get an automatic penalty. Initially at least, those should be treated as bad situations worth avoiding.

Default weights in distant situations

When a web of connotations unravels, the preferences normally end up weaker than initially, because some of the connotations of those preferences are lost or even reversed. So, normally, preferences in these distant situations are quite weak.

But here I'm suggesting adding an explicit meta-preference to these situations, and one that the human subject might not have themselves. This doesn't fit in the formalism of this post. In the language of the forthcoming research agenda, this is a "Global meta-preferences about the outcome of the synthesis process".

Isn't this an overriding of the person's preference? It is, to some extent. But note the "Initially at least" clause in A. If we don't have other preferences about the distant situation, it should be avoided. But this penalty can be overcome by other considerations.

For example, the previous standard web of connotations for sexuality has fallen apart, while the gender one is unravelling; it's perfectly possible to have meta-preferences that would have told us to respect our reflection on issues like that, and our reflection might be fine with these new situations. Similarly, some (but not all) future changes to the human condition are things that would worry me initially but that I'd be ok with upon reflection; I myself have strong meta-preferences that these should be acceptable.

But for situations where our other preferences and meta-preferences don't weigh in, A would downgrade these distant worlds as a default (dis)preference. This adds an explicit level of status quo bias to our preferences, which I feel is justified: better to be prudent rather than reckless where our preferences and values are concerned. The time for (potential) recklessness is in the implementation of these values, not their definition.
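To make the shape of A concrete, here is a toy sketch in Python. It is only an illustration under assumptions of my own: the post gives no formula, so the additive form, the field names (`base_value`, `connotation_breakdown`, `reflective_endorsement`), the `penalty_weight` parameter and all the numbers are hypothetical. It shows a default penalty that grows as the web of connotations unravels, and that other preferences and meta-preferences can outweigh.

```python
from dataclasses import dataclass


@dataclass
class Situation:
    """A candidate world, summarised by hypothetical illustrative numbers."""
    name: str
    base_value: float                    # value from our ordinary preferences
    connotation_breakdown: float         # 0.0 = familiar, 1.0 = web fully unravelled
    reflective_endorsement: float = 0.0  # weight of other (meta-)preferences in favour


def score(s: Situation, penalty_weight: float = 1.0) -> float:
    """Apply A as a default penalty that endorsement can overcome.

    The additive form is an assumption made for illustration only.
    """
    default_penalty = penalty_weight * s.connotation_breakdown
    return s.base_value - default_penalty + s.reflective_endorsement


# Hypothetical numbers, chosen only to show the ordering.
robots = Situation("non-conscious robots", base_value=1.0, connotation_breakdown=0.0)
willing_slaves = Situation("conscious willing slaves", base_value=1.0,
                           connotation_breakdown=0.8)
endorsed_change = Situation("change we endorse on reflection", base_value=1.0,
                            connotation_breakdown=0.8, reflective_endorsement=1.0)

for s in (robots, willing_slaves, endorsed_change):
    print(f"{s.name}: {score(s):.2f}")
```

With nothing else weighing in, the ambiguous situation scores below the familiar one; once reflective endorsement is present, the same degree of unravelling no longer rules the situation out. That is the "Initially at least" clause in miniature.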
