Very different, very adequate outcomes

Stuart_Armstrong

2nd Aug 2019

AI Alignment Forum

Let $U_{p}$ be the utility function that - somehow - expresses your preferences^[1]. Let $U_{h}$ be the utility function that expresses your hedonistic pleasure.

Now imagine an AI is programmed to maximise $U(q) = q U_{p} + (1 - q) U_{h}$. If we vary $q$ in the range of $5\%$ to $95\%$, then we will get very different outcomes. At $5\%$, we will generally be hedonically satisfied, and our preferences will be followed if they don't cause us to be unhappy. At $95\%$, we will accomplish any preference that doesn't cause us huge amounts of misery.
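As a minimal sketch of how the choice of $q$ changes which outcome gets picked, consider a toy model in which each candidate outcome is summarised by a pair $(U_{p}, U_{h})$ on a common $[0, 1]$ scale. The outcome names and numbers below are hypothetical and chosen only for illustration; they are not from the post.

```python
# Hypothetical toy model: each outcome is a (U_p, U_h) pair on a [0, 1] scale.
# All names and numbers are made up for illustration.
OUTCOMES = {
    "pure_preference": (1.00, 0.00),
    "mostly_preference": (0.98, 0.50),
    "balanced": (0.80, 0.80),
    "mostly_hedonic": (0.50, 0.98),
    "pure_hedonic": (0.00, 1.00),
}


def mixed_utility(q, u_p, u_h):
    """U(q) = q * U_p + (1 - q) * U_h."""
    return q * u_p + (1 - q) * u_h


def best_outcome(q, outcomes=OUTCOMES):
    """Return the name of the outcome that maximises U(q)."""
    return max(outcomes, key=lambda name: mixed_utility(q, *outcomes[name]))


for q in (0.0, 0.05, 0.5, 0.95, 1.0):
    print(f"q = {q:.2f}: {best_outcome(q)}")
```

In this toy setup, every $q$ between $5\%$ and $95\%$ selects an outcome in which both utilities are at least $0.5$, while $q = 0$ and $q = 1$ select outcomes in which one of them is zero.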

It's clear that, extrapolated over the whole future of the universe, these could lead to very different outcomes^[2]. But - and this is the crucial point - none of these outcomes are really that bad. None of them are the disasters that could happen if we picked a random utility $U$. So, for all their differences, they reside in the same nebulous category of "yeah, that's an ok outcome." Of course, we would have preferences as to where $q$ lies exactly, but few of us would risk the survival of the universe to yank $q$ around within that range.

What happens when we push $q$ towards the edges? Pushing $q$ towards $0$ seems a clear disaster: we're happy, but none of our preferences are respected; we basically don't matter as agents interacting with the universe any more. Pushing $q$ towards $1$ might be a disaster: we could end up always miserable, even as our preferences are fully followed. The only thing protecting us from that fate is the fact that our preferences include hedonistic pleasure; but this might not be the case in all circumstances. So moving $q$ to the edges is risky in the way that moving around in the middle is not.

In my research agenda, I talk about adequate outcomes, given a choice of parameters, or acceptable approximations. I mean these terms in the sense of the example above: the outcomes may vary tremendously from one another, given the parameters or the approximation. Nevertheless, all the outcomes avoid disasters and are clearly better than maximising a random utility function.

[1] This is a somewhat naive form of preference utilitarianism, along the lines of "if the human chooses it, then it's ok". In particular, you can end up in equilibriums where you are miserable, but unwilling to choose not to be (see, for example, some forms of depression).
[2] This fails to be true if preference and hedonism can be maximised independently; e.g. if we could take an effective happy pill and still follow all our preferences. I'll focus on the situation where there are true tradeoffs between preference and hedonism.


johnswentworth (7y)

One potential problem: if the two utilities have different asymptotic behavior, then one of them can dominate decision-making. For instance, suppose we're using 0-1 normalization, but one of the two utilities has a big spike or tail somewhere. Then it's going to have near-zero slope everywhere else.

More concrete example: on the hedonism axis, humans have more capacity for severe pain than extreme pleasure. So that end of the axis has a big downward spike, and the hedonism-utility would be near-flat at the not-severe-pain end (at least for any of the normalizations you suggest, other than max-mean, which has the same problem with the other end of the axis). But if the preferences-utility lacks a big spike like that, then we're liable to end up with constant low-grade hedonic unhappiness.

That's still a lot better than plenty of other possible outcomes - preference-utility still looks good, and we're not in constant severe pain. But it still seems not very good.
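The flattening effect described in this comment can be sketched numerically. The example below assumes 0-1 (min-max) normalisation and uses made-up raw utilities for four hypothetical outcomes; only the shape matters, with one large downward spike on the hedonic axis.

```python
def normalise_01(values):
    """Rescale utilities linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical outcomes and raw utilities (made-up numbers, not from the thread).
names = ["severe_pain", "low_grade_unhappy", "neutral", "content"]
raw_hedonic = [-1000.0, -1.0, 0.0, 1.0]   # big downward spike at severe pain
raw_preference = [0.0, 3.0, 2.0, 1.0]     # no comparable spike

u_h = normalise_01(raw_hedonic)
u_p = normalise_01(raw_preference)

# After normalisation, the hedonic gap between "low_grade_unhappy" and
# "content" is only 2/1001, so the hedonic term barely affects the ranking.
hedonic_gap = u_h[3] - u_h[1]

q = 0.5  # equal nominal weight on both utilities
scores = [q * p + (1 - q) * h for p, h in zip(u_p, u_h)]
chosen = names[scores.index(max(scores))]

print(f"hedonic gap: {hedonic_gap:.4f}")
print(f"chosen outcome at q = {q}: {chosen}")
```

With these numbers, the nominally equal weighting still picks the outcome with the highest preference score and low-grade unhappiness, because the spike has compressed every other hedonic difference to almost nothing.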

Veedrac (7y)

"Pushing $q$ towards $1$ might be a disaster"

If I consider satisfaction of my preferences to be a disaster, in what sense can I realistically call them my preferences? It feels like you're more caught up on the difficulty of extrapolating these preferences outside of their standard operation, but that seems like a rather different issue.

Stuart_Armstrong (7y, edited)

I'm thinking of a rather naive form of preference utilitarianism, of the sort "if the human agrees to it or chooses it, then it's ok". In particular, you can end up with some forms of depression where the human is miserable, but isn't willing to change.

I'll clarify that in the post.

johnswentworth (7y)

How do you imagine standardizing the utility functions? E.g., if we multiply $U_{h}$ by 2, then it does just as good a job representing our happiness, but gets twice as much weight.
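To make the rescaling point concrete, here is a short worked identity, using only the definition $U(q) = q U_{p} + (1 - q) U_{h}$ from the post; the symbol $q'$ is introduced here for illustration. Replacing $U_{h}$ by $2 U_{h}$ gives

$$q U_{p} + (1 - q)(2 U_{h}) = (2 - q)\left[ q' U_{p} + (1 - q') U_{h} \right], \qquad q' = \frac{q}{2 - q}.$$

Since $2 - q > 0$, maximising the left-hand side is the same as maximising $U(q')$. For example, a nominal $q = 50\%$ behaves like $q' = 1/3$, so the choice of scale for each utility is not separable from the choice of $q$.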

Stuart_Armstrong (7y)

https://www.lesswrong.com/posts/hBJCMWELaW6MxinYW/intertheoretic-utility-comparison

Or we could come up with a normalisation method by having people rank the intensity of their preferences versus the intensity of their enjoyments. It doesn't have to be particularly good, just give non-crazy results.

johnswentworth (7y)

"It doesn't have to be particularly good, just give non-crazy results."

The intertheoretic utility post makes a lot more sense in that light; I had mostly dismissed it as a hack job when I first saw it. But if this is the sort of thing you're trying to do, it seems more useful. Thanks for clarifying.

Wei Dai (7y)

This seems way too handwavy. If $q$ being close enough to $0$ will cause a disaster, why isn't $5\%$ close enough to $0$? How much do you expect switching from $q=1$ to $q=5\%$ to reduce $U_{p}$? Why?

If moving from $q=1$ to $q=5\%$ reduces $U_{p}$ by a factor of 2, for example, and it turns out that $U_{p}$ is the correct utility function, that would be equivalent to incurring a 50% x-risk. Do you think that should be considered "ok" or "adequate", or have some reason to think that $U_{p}$ wouldn't be reduced nearly this much?
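The equivalence can be spelled out under one added assumption, which the comment does not state explicitly: that $U_{p}$ is scaled so an existential catastrophe has value $0$. Writing $U_{p}^{*}$ (a symbol introduced here) for the value achieved at $q = 1$, a gamble with a 50% chance of catastrophe has expected value

$$\tfrac{1}{2} U_{p}^{*} + \tfrac{1}{2} \cdot 0 = \tfrac{1}{2} U_{p}^{*},$$

which matches the value of an outcome in which $U_{p}$ is halved with certainty. With a different zero point for $U_{p}$, the two would no longer coincide.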

Stuart_Armstrong (7y)

I'm finding these "is the correct utility function" claims hard to parse. Humans have a bit of $U_{p}$ and a bit of $U_{h}$. But we are underdefined systems; there is no specific value of $q$ that is "true". We can only assess the quality of $q$ using other aspects of human underdefined preferences.

"This seems way too handwavy."

It is. Here's an attempt at a more formal definition: humans have collections of underdefined and somewhat contradictory preferences (using preferences in a more general sense than preference utilitarianism). These preferences seem to be stronger in the negative sense than in the positive: humans seem to find the loss of a preference much worse than the gain. And the negative is much more salient, and often much more clearly defined, than the positive.

Given that maximising one preference tends to put the values of others at extreme values, human overall preferences seem better captured by a weighted mix of preferences (or a smooth min of preferences) than by any single preference, or small set of preferences. So it is not a good idea to be too close to the extremes (extremes being places where some preferences have $0\%$ weight put on them).
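The two aggregation rules mentioned here can be compared in a small sketch. The comment does not say which smooth min is intended; the version below is one common choice, a log-sum-exp soft minimum, and the `sharpness` parameter and example numbers are assumptions made for illustration.

```python
import math


def weighted_mix(utilities, weights):
    """Weighted sum of per-preference utilities."""
    return sum(w * u for w, u in zip(weights, utilities))


def smooth_min(utilities, sharpness=10.0):
    """Soft minimum: -(1/k) * log(sum(exp(-k * u))), computed stably.

    Approaches min(utilities) from below as sharpness grows.
    """
    k = sharpness
    m = min(utilities)
    return m - math.log(sum(math.exp(-k * (u - m)) for u in utilities)) / k


# Hypothetical outcomes scored on two preferences, each in [0, 1].
extreme = (1.0, 0.0)    # one preference fully satisfied, the other ignored
balanced = (0.6, 0.6)   # both moderately satisfied

weights = (0.95, 0.05)
print("weighted mix:", weighted_mix(extreme, weights), weighted_mix(balanced, weights))
print("smooth min:  ", smooth_min(extreme), smooth_min(balanced))
```

In this example the heavily skewed weighted mix ranks the extreme outcome higher, while the smooth min ranks the balanced outcome higher, since it is driven mainly by whichever preference is doing worst.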

Now there may be some sense in which these extreme preferences are "correct", according to some formal system. But this formal system must reject the actual preferences of humans today; so I don't see why these preferences should be followed at all, even if they are correct.

Ok, so the extremes are out; how about being very close to the extremes? Here is where it gets wishy-washy. We don't have a full theory of human preferences. But, according to the picture I've sketched above, the important thing is that each preference gets some positive traction in our future. So, yes, $1\%$ to $5\%$ might not mean much (and smooth min might be better anyway). But I believe I could say:

There are many weighted combinations of human preferences that are compatible with the picture I've sketched here. Very different outcomes, from the numerical perspective of the different preferences, but all falling within an "acceptability" range.

Still a bit too handwavy. I'll try and improve it again.

Charlie Steiner (7y)

And of course you can go further and have different $U$ that all have similarly valid claims to be $U_{p}$, because they're all similarly good generalizations of our behavior into a consistent function on a much larger domain.

Donald Hobson (7y)

As far as I am concerned, hedonism is an approximate description of some of my preferences. Hedonism is a utility function close to, but not equal to, mine. I see no reason why a FAI should contain a special term for hedonism. Just maximize preferences; anything else is strictly worse, but not necessarily that bad.

I do agree that there are many futures we would consider valuable. Our utility function is not a single sharp spike.
