A theory of human values

by Stuart_Armstrong, 13th Mar 2019


Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

At the end of my post on needing a theory of human values, I stated that the three components of such a theory were:

  1. A way of defining the basic preferences (and basic meta-preferences) of a given human, even if these are under-defined or situational.
  2. A method for synthesising such basic preferences into a single utility function or similar object.
  3. A guarantee we won't end up in a terrible place, due to noise or different choices in the two definitions above.

To summarise this post, I sketch out methods for 1. and 2., and look at what 3. might look like, and what we can expect from such a guarantee, and some of the issues with it.

Basic human preferences

For the first point, I'm defining a basic preference as existing within the mental models of a human.

Any preference judgement within that model - that some outcome was better than another, that some action was a mistake, that some behaviour was foolish, that someone is to be feared - is defined to be a basic preference.

Basic meta-preferences work in the same way, with meta-preferences just defined to be preferences over preferences (or over methods of synthesising preferences). I also include odd meta-preferences here, such as preferences over beliefs. I'll try to transform these odd preferences into "identity preferences": preferences over the kind of person you want to be.

"Reasonable" situations

To define basic preferences properly, we need to define the class of "reasonable" situations in which to have these mental models. These could be real situations (Mrs X thought that she'd like some sushi as she went past the restaurant) or counterfactual (if Mr Y had gone past that restaurant, he would have wanted sushi). The "one-step hypotheticals" post is about defining these reasonable situations.

Anything that occurs outside of a reasonable situation is discarded as not indicative of genuine basic human preference; this is due to the fact that humans can be persuaded to endorse/unendorse almost anything in the right situation (eg by drugs or brain surgery, if all else fails).

We can have preferences and meta-preferences over non-reasonable situations (what to do in a world where plants were conscious?), as long as these preferences and meta-preferences were expressed in reasonable situations. We can have a CEV style meta-preference ("I wish my preferences were more like what a CEV would generate"), but, apart from that, the preferences a CEV would generate are not directly relevant: the situations where "we knew more, thought faster, were more the people we wished we were, had grown up farther together" are highly non-typical.

We would not want the AI itself manipulating the definition of "reasonable" situations. It's for this that I've looked into ways of quantifying and removing AI rigging and influencing of the learning process.

Synthesising human preferences

The simple preferences and meta-preferences constructed above will often be wildly contradictory (eg we want to be generous and rich), inconsistent across time, and generally underdefined. They can also be weakly or strongly held.

The important thing now is to synthesise all of these into some adequate overall reward or utility function. Not because utility functions are intrinsically good, but because they are stable: if you're not an expected utility maximiser, events will likely push you into becoming one. And it's much better to start off with an adequate utility function than to hope that random-drift-until-our-goals-are-stable will get us to an adequate outcome.

Synthesising the preference utility function

The idea is to start with three things:

  1. A way of resolving contradictions between preferences (and between meta-preferences, and so on).
  2. A way of applying meta-preferences to preferences (endorsing and anti-endorsing other preferences).
  3. A way of allowing (relevant) meta-preferences to change the methods used in the two points above.

This post showed one method of doing that, with contradictions resolved by weighting the reward/utility function for each preference and then adding them together linearly. The weights were proportional to some measure of the intensity of each preference.
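The weighted linear combination described above can be sketched in a few lines. This is only an illustration under stated assumptions: the names `linear_synthesis`, `generous` and `rich`, the outcome fields and the weights are all hypothetical, and the earlier post may define intensity and normalisation differently.

```python
from typing import Callable, Dict, List, Tuple

Outcome = Dict[str, float]
Utility = Callable[[Outcome], float]


def linear_synthesis(preferences: List[Tuple[float, Utility]]) -> Utility:
    """Combine (weight, utility) pairs into one utility by a weighted sum."""
    def combined(outcome: Outcome) -> float:
        return sum(weight * utility(outcome) for weight, utility in preferences)
    return combined


# Hypothetical contradictory preferences: "be generous" and "be rich".
def generous(outcome: Outcome) -> float:
    return outcome["donated"]


def rich(outcome: Outcome) -> float:
    return outcome["kept"]


# ASSUMPTION: the weights stand in for some measure of preference intensity.
synth = linear_synthesis([(2.0, generous), (1.0, rich)])

give_all = {"donated": 10.0, "kept": 0.0}
keep_all = {"donated": 0.0, "kept": 10.0}
print(synth(give_all), synth(keep_all))
```

With these made-up weights the more intensely held preference wins the contradiction, but the weaker one still contributes to the total.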

In a more recent post, I realised that linear addition may not be the natural thing to do for some types of preferences (which I dubbed "identity" preferences). The smooth minimum gives another way of combining utilities, though it needs a natural zero as well as a weight. So the human's model of the status quo is relevant here. For preferences combined in a smoothmin, we can just reset the natural zero (raising it to make the preference less important, lowering it to make it more) rather than changing the weight.
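One way to picture a smooth minimum with a natural zero is sketched below. The log-sum-exp form of `smoothmin` and the `sharpness` parameter are assumptions made for illustration, not necessarily the form used in the earlier post. The sketch also assumes that each preference's utility is its natural zero plus the change relative to the status quo, so that raising the zero makes that preference less likely to be the binding one.

```python
import math
from typing import Sequence


def smoothmin(values: Sequence[float], sharpness: float = 1.0) -> float:
    """Soft minimum: -(1/k) * log(sum(exp(-k * v))).

    Always at most min(values), and approaches it as the sharpness k grows.
    """
    k = sharpness
    lowest = min(values)
    # Factor out the true minimum so the exponentials cannot overflow.
    return lowest - math.log(sum(math.exp(-k * (v - lowest)) for v in values)) / k


def smoothmin_synthesis(zeros, changes, sharpness=1.0):
    # ASSUMPTION: utility = natural zero + change relative to the status quo.
    return smoothmin([z + c for z, c in zip(zeros, changes)], sharpness)


# Two hypothetical preferences; the first is currently the worse-off one.
before = smoothmin_synthesis([0.0, 0.0], [1.0, 3.0], sharpness=5.0)
# Raising the first preference's natural zero makes the second one binding.
after = smoothmin_synthesis([5.0, 0.0], [1.0, 3.0], sharpness=5.0)
print(before, after)
```

Under these assumptions the synthesis tracks whichever preference is doing worst, which is why shifting a zero changes which preference matters without touching any weight.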

I'm distinguishing between identity and world preferences, but the real distinction is between preferences that humans prefer to combine linearly, and those they prefer to combine in a smoothmin. So it could work that along with preference and weight (and natural zero), one thing we could ask of basic preferences is whether they should go in the linear or the smoothmin group.

Also, though I'm very willing to let a linear preference get sent to zero if the human's meta-preferences unendorse them, I'm less sure about those in the other group; it's possible that unendorsing of a smoothmin preference should raise the "natural zero" rather than sending the preference to zero. After all, we've identified these preferences as key parts of our identity, even though we unendorse them.
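The two treatments of unendorsement can be put side by side in a small sketch. Everything here is hypothetical: the `BasicPreference` record, the `strength` scale from 0 to 1, and the `zero_shift` parameter are illustrative choices, and the smoothmin branch encodes only the possibility raised above, not a settled rule.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class BasicPreference:
    name: str
    weight: float        # intensity, used when combining linearly
    natural_zero: float  # reference level, used when combining in a smoothmin
    group: str           # "linear" or "smoothmin"


def unendorse(pref: BasicPreference, strength: float,
              zero_shift: float = 1.0) -> BasicPreference:
    """Apply a meta-preference that unendorses `pref` with strength in [0, 1]."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must lie in [0, 1]")
    if pref.group == "linear":
        # Full unendorsement sends a linear preference's weight to zero.
        return replace(pref, weight=pref.weight * (1.0 - strength))
    if pref.group == "smoothmin":
        # ASSUMPTION: a smoothmin preference keeps its place in the synthesis,
        # but its natural zero is raised so that it matters less.
        return replace(pref, natural_zero=pref.natural_zero + strength * zero_shift)
    raise ValueError(f"unknown group: {pref.group!r}")


envy = BasicPreference("envy", weight=2.0, natural_zero=0.0, group="linear")
pride = BasicPreference("pride", weight=2.0, natural_zero=0.0, group="smoothmin")
print(unendorse(envy, 1.0))
print(unendorse(pride, 1.0))
```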

Meta-changes to the synthesis method

Then finally, on point 3 above, the relevant human meta-preferences can change the synthesis process. Heavily weighted meta-preferences of this type will result in completely different processes than described above; lightly weighted meta-preferences will make only small changes. The original post looked into that in more detail.

Notice that I am making some deliberate and somewhat arbitrary choices: using linear or smoothmin to combine meta-preferences (including those that might want to change the methods of combinations). How much weight a meta-preference must have, before it seriously changes the synthesis method, is somewhat arbitrary.

I'm also starting with two types of preference combinations, linear and smoothmin, rather than many more or just one. The idea is that these two ways of combining preferences seem the most salient to me, and our own meta-values can change these ways if we feel strongly about them. It's as if I'm starting the design of a Formula One car, before an AI trains itself to complete the design. I know it'll change a lot of things, but if I start with "four wheels, a cockpit and a motor", I'm hoping to get it started on the right path, even if it eventually overrules me.

Or, if you prefer, I think starting with this design is more likely to nudge a bad outcome into a good one, than to do the opposite.

Non-terrible outcomes

Now for the most tricky part of this: given the above, can we expect non-terrible outcomes?

This is a difficult question to answer, because "terrible outcomes" remains undefined (if we had a full definition, it could serve as a utility function itself), and, in a sense, there is no principled trade-off between two preferences: the only general optimality measure is Pareto, and that can be reached by any linear combination of utilities.
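The point about Pareto optimality can be illustrated on a toy example. The outcome set and the weights below are invented. The sketch only shows that maximising different positive-weight linear combinations picks out different Pareto-optimal outcomes, so Pareto optimality alone does not select a trade-off.

```python
from typing import List, Sequence, Tuple

Utilities = Tuple[float, ...]  # one utility value per preference


def is_pareto_optimal(candidate: Utilities, outcomes: List[Utilities]) -> bool:
    """True if no other outcome is at least as good everywhere and better somewhere."""
    return not any(
        all(o >= c for o, c in zip(other, candidate))
        and any(o > c for o, c in zip(other, candidate))
        for other in outcomes
    )


def argmax_linear(outcomes: List[Utilities], weights: Sequence[float]) -> Utilities:
    """Outcome maximising the weighted sum of utilities."""
    return max(outcomes, key=lambda u: sum(w * x for w, x in zip(weights, u)))


# Hypothetical outcomes, scored by two preferences.
outcomes = [(0.0, 4.0), (3.0, 3.0), (4.0, 0.0), (1.0, 1.0)]

picks = [argmax_linear(outcomes, w) for w in [(1.0, 4.0), (1.0, 1.0), (4.0, 1.0)]]
for pick in picks:
    print(pick, is_pareto_optimal(pick, outcomes))
```

Each choice of weights lands on a different outcome, and all three are Pareto-optimal. Only the dominated outcome (1.0, 1.0) is ruled out.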

Scope insensitivity to the rescue?

There are two obvious senses in which an outcome could be terrible:

  1. We could lose something of great value, never to have it again.
  2. We could fall drastically short of maximising a utility function to the utmost.

From the perspective of a utility maximiser, both these outcomes could be equally terrible - it's just a question of computing the expected utility difference between the two scenarios.

However, for actual humans, the first scenario seems to loom much larger. This can be seen as a form of scope insensitivity: we might say that we believe in total utilitarianism, but we don't feel that a trillion people is really a thousand times better than a billion people, so the larger the numbers grow, the more we are, in practice, willing to trade off total utilitarianism for other values.

Now, we might deplore that state of affairs (that deploring is a valid meta-preference), but that does seem to be how humans work. And though there are arguments against scope insensitivity for actually existent beings, it is perfectly consistent to reject them when considering whether we have a duty to create new beings.

What this means is that people's preferences seem much closer to smooth minimums than to linear sums. Some are explicitly set up like that from the beginning (those that go in the smoothmin bucket). Others may be like that in practice, either because meta-preferences want them to be, or because of the vast size of the future: see next section.

The size of the future

The future is vast, with the energy of billions of galaxies, efficiently used, at our disposal. Potentially far, far larger than that, if we're clever about our computations.

That means that it's far easier to reach "agreement" between two utility functions with diminishing marginal returns (as most of them will be, in practice and in theory). Even without diminishing marginal returns, and without using smoothmin, it's unlikely that one utility function will retain the highest marginal returns all the way up to all resources being used up. At some point, benefiting a tiny little preference slightly will likely be easier.
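A toy allocation shows why a larger budget helps when returns diminish. The logarithmic utility, the weights and the unit-by-unit greedy rule are all assumptions chosen for illustration. Each unit of resource simply goes to whichever preference currently offers the highest marginal gain.

```python
import math
from typing import List, Sequence


def marginal_gain(weight: float, allocated: int) -> float:
    """Gain from one more unit under the utility weight * log(1 + allocated)."""
    return weight * (math.log(2 + allocated) - math.log(1 + allocated))


def allocate(weights: Sequence[float], budget: int) -> List[int]:
    """Hand out `budget` units one at a time to the highest marginal gain."""
    allocation = [0] * len(weights)
    for _ in range(budget):
        best = max(range(len(weights)),
                   key=lambda i: marginal_gain(weights[i], allocation[i]))
        allocation[best] += 1
    return allocation


# Hypothetical case: one preference weighted 100 times more heavily than the other.
weights = [100.0, 1.0]
small = allocate(weights, 10)
large = allocate(weights, 1000)
print(small, large)
```

With a small budget the heavily weighted preference takes everything. With a large one, its marginal gain eventually drops below that of the minor preference, which then starts receiving resources too.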

The exception to this is if preferences are explicitly opposed to each other; eg masochism versus pain-reduction. But even there, they are unlikely to be completely and exactly negations of one another. The masochist may find some activities that don't fit perfectly under "increased pain" as traditionally understood, so some compromise between the two preferences becomes possible.

The underdefined nature of some preferences may be a boon here; if A is forbidden, but only in situations in S, then going outside of S may allow the A-loving preferences their space to grow. So, for example, obeying promises might become a general value, but we might allow games, masked balls, or similar situations where lying is allowed, because the advantages of honesty - reputation, ease of coordination - are deliberately absent.

Growth, learning, and following your own preferences

I've argued that our values and preferences will soon become stable as we start to self-modify.

This is going to be hard for those who put an explicit premium on continual moral growth. Now, it's still possible to have continued moral change within a narrow band, but

Finally, there's the issue of what happens when the AI tells you "here is U, the synthesis of your preferences", and you go "well, I have all these problems with it". Since humans are often contrarian by nature, it may be impossible for an AI to construct a U that we would ever explicitly endorse. This is a sort of "self-reference" problem in synthesising preferences.

Tolerance levels

The whole design - with an initial framework, liberal use of smoothmin, a default for standard combinations of preferences, and a vast amount of resources available - is designed to reach an adequate, rather than an optimal solution. Optimal solutions are very subject to Goodhart's law if we don't include everything we care about; if we do include everything we care about, the process may come to resemble the one I've defined here, above.

Conversely, if the human fears that such a synthesis will become badly behaved in certain extreme situations, then that fear will be included in the synthesis. And, if the fear is strong enough, it will serve to direct the outcomes away from those extreme situations.

So the whole design is somewhat tolerant to changes in the initial conditions: different starting points may end up in different end points, but all of them will hopefully be acceptable.

Did I think of everything?

With all such methods, there's the risk of not including everything, and so ending up in a terrible place by omission. That risk is certainly there, but it seems that we couldn't end up in a terrible hellworld, or at least not in one that could be meaningfully described/summarised to the human (because avoiding hellworlds is high on human preferences and meta-preferences, and there is little explicit force pushing the other way).

And I've argued that it's unlikely that indescribable hellworlds are even possible.

However, there are still a lot of holes to fill, and I have to ensure that this doesn't just end up as a series of patches until I can't think of any further patches. That's my greatest fear, and I'm not yet sure how to address it.
