Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

At the end of my post on needing a theory of human values, I stated that the three components of such a theory were:

  1. A way of defining the basic preferences (and basic meta-preferences) of a given human, even if these are under-defined or situational.
  2. A method for synthesising such basic preferences into a single utility function or similar object.
  3. A guarantee we won't end up in a terrible place, due to noise or different choices in the two definitions above.

In summary, this post sketches out methods for 1. and 2., then looks at what 3. might look like, what we can expect from such a guarantee, and some of the issues with it.

Basic human preferences

For the first point, I'm defining a basic preference as existing within the mental models of a human.

Any preference judgement within that model - that some outcome was better than another, that some action was a mistake, that some behaviour was foolish, that someone is to be feared - is defined to be a basic preference.

Basic meta-preferences work in the same way, with meta-preferences just defined to be preferences over preferences (or over methods of synthesising preferences). I also include odd meta-preferences here - such as preferences over beliefs. I'll try to transform these odd preferences into "identity preferences": preferences over the kind of person you want to be.

"Reasonable" situations

To define basic preferences properly, we need to define the class of "reasonable" situations in which to have these mental models. These could be real situations (Mrs X thought that she'd like some sushi as she went past the restaurant) or counterfactual (if Mr Y had gone past that restaurant, he would have wanted sushi). The "one-step hypotheticals post" is about defining these reasonable situations.

Anything that occurs outside of a reasonable situation is discarded as not indicative of genuine basic human preference; this is due to the fact that humans can be persuaded to endorse/unendorse almost anything in the right situation (eg by drugs or brain surgery, if all else fails).

We can have preferences and meta-preferences over non-reasonable situations (what to do in a world where plants were conscious?), as long as these preferences and meta-preferences were expressed in reasonable situations. We can have a CEV style meta-preference ("I wish my preferences were more like what a CEV would generate"), but, apart from that, the preferences a CEV would generate are not directly relevant: the situations where "we knew more, thought faster, were more the people we wished we were, had grown up farther together" are highly non-typical.

We would not want the AI itself manipulating the definition of "reasonable" situations. It's for this that I've looked into ways of quantifying and removing AI rigging and influencing of the learning process.

Synthesising human preferences

The simple preferences and meta-preferences constructed above will often be wildly contradictory (eg we want to be generous and rich), inconsistent across time, and generally underdefined. They can also be weakly or strongly held.

The important thing now is to synthesise all of these into some adequate overall reward or utility function. Not because utility functions are intrinsically good, but because they are stable: if you're not an expected utility maximiser, events will likely push you into becoming one. And it's much better to start off with an adequate utility function, than to hope that random-drift-until-our-goals-are-stable will get us to an adequate outcome.

Synthesising the preference utility function

The idea is to start with three things:

  1. A way of resolving contradictions between preferences (and between meta-preferences, and so on).
  2. A way of applying meta-preferences to preferences (endorsing and anti-endorsing other preferences).
  3. A way of allowing (relevant) meta-preferences to change the methods used in the two points above.

This post showed one method of doing that, with contradictions resolved by weighting the reward/utility function for each preference and then adding them together linearly. The weights were proportional to some measure of the intensity of each preference.
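As a minimal sketch of that linear method, assume each basic preference is given as a pair of an intensity and a reward function over outcomes. The names `synthesise_linear`, `money_given` and `money_kept`, and all the numbers, are hypothetical illustrations and not taken from the earlier post.

```python
from typing import Callable, Dict, List, Tuple

Outcome = Dict[str, float]
# A preference is (intensity, reward function). Both are illustrative assumptions.
Preference = Tuple[float, Callable[[Outcome], float]]


def synthesise_linear(preferences: List[Preference]) -> Callable[[Outcome], float]:
    """Combine rewards linearly, with weights proportional to intensity."""
    total = sum(intensity for intensity, _ in preferences)
    if total <= 0:
        raise ValueError("need at least one preference with positive intensity")

    def utility(outcome: Outcome) -> float:
        # Each reward is scaled by its share of the total intensity.
        return sum((intensity / total) * reward(outcome)
                   for intensity, reward in preferences)

    return utility


# Hypothetical contradictory pair: wanting to be generous and wanting to be rich.
prefs: List[Preference] = [
    (3.0, lambda o: o["money_given"]),  # strongly held
    (1.0, lambda o: o["money_kept"]),   # weakly held
]
U = synthesise_linear(prefs)

keep_all = {"money_given": 0.0, "money_kept": 10.0}
give_half = {"money_given": 5.0, "money_kept": 5.0}
print(U(keep_all), U(give_half))
```

In this toy setup, a meta-preference that fully unendorses a preference could be modelled by setting that preference's intensity to zero before synthesis.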

In a more recent post, I realised that linear addition may not be the natural thing to do for some types of preferences (which I dubbed "identity" preferences). The smooth minimum gives another way of combining utilities, though it needs a natural zero as well as a weight. So the human's model of the status quo is relevant here. For preferences combined in a smoothmin, we can just reset the natural zero (raising it to make the preference less important, lowering it to make it more) rather than changing the weight.
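To illustrate how a smooth minimum differs from linear addition, here is a sketch using the log-sum-exp soft minimum. This particular formula is an assumption chosen for illustration and may not be the exact smoothmin used in the earlier post. The utilities are assumed to be already measured relative to each preference's natural zero, and `sharpness` is a hypothetical parameter.

```python
import math
from typing import Sequence


def smoothmin(values: Sequence[float], sharpness: float = 1.0) -> float:
    """Soft minimum: -(1/k) * log(sum(exp(-k * v))), computed stably.

    The result never exceeds min(values), and approaches it as sharpness grows.
    """
    if not values:
        raise ValueError("smoothmin needs at least one value")
    if sharpness <= 0:
        raise ValueError("sharpness must be positive")
    lowest = min(values)
    # Subtracting the minimum keeps every exponent <= 0 and avoids overflow.
    total = sum(math.exp(-sharpness * (v - lowest)) for v in values)
    return lowest - math.log(total) / sharpness


# Two hypothetical profiles with the same linear sum.
balanced = [5.0, 5.0]
lopsided = [9.0, 1.0]

print(sum(balanced), sum(lopsided))              # a linear sum cannot tell them apart
print(smoothmin(balanced), smoothmin(lopsided))  # smoothmin favours the balanced one
```

The design point is that a smoothmin is dominated by whichever preference is currently doing worst, so large gains on one preference cannot fully compensate for neglecting another.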

I'm distinguishing between identity and world preferences, but the real distinction is between preferences that humans prefer to combine linearly, and those they prefer to combine in a smoothmin. So it could work that along with preference and weight (and natural zero), one thing we could ask of basic preferences is whether they should go in the linear or the smoothmin group.

Also, though I'm very willing to let a linear preference get sent to zero if the human's meta-preferences unendorse them, I'm less sure about those in the other group; it's possible that unendorsing of a smoothmin preference should raise the "natural zero" rather than sending the preference to zero. After all, we've identified these preferences as key parts of our identity, even though we unendorse them.

Meta-changes to the synthesis method

Then finally, on point 3 above, the relevant human meta-preferences can change the synthesis process. Heavily weighted meta-preferences of this type will result in completely different processes than described above; lightly weighted meta-preferences will make only small changes. The original post looked into that in more detail.

Notice that I am making some deliberate and somewhat arbitrary choices: using linear or smoothmin to combine meta-preferences (including those that might want to change the methods of combinations). How much weight a meta-preference must have, before it seriously changes the synthesis method, is somewhat arbitrary.

I'm also starting with two types of preference combinations, linear and smoothmin, rather than many more or just one. The idea is that these two ways of combining preferences seem the most salient to me, and our own meta-values can change these ways if we feel strongly about them. It's as if I'm starting the design of a Formula One car, before an AI trains itself to complete the design. I know it'll change a lot of things, but if I start with "four wheels, a cockpit and a motor", I'm hoping to get them started on the right path, even if they eventually overrule me.

Or, if you prefer, I think starting with this design is more likely to nudge a bad outcome into a good one, than to do the opposite.

Non-terrible outcomes

Now for the most tricky part of this: given the above, can we expect non-terrible outcomes?

This is a difficult question to answer, because "terrible outcomes" remains undefined (if we had a full definition, it could serve as a utility function itself), and, in a sense, there is no principled trade-off between two preferences: the only general optimality measure is Pareto, and that can be reached by any positively weighted linear combination of utilities.
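The Pareto point can be checked on a toy example. The sketch below assumes a finite set of outcomes, each scored by two utility functions (the numbers are made up), and confirms that maximising a weighted sum with strictly positive weights always picks an outcome on the Pareto front, whichever weights are used.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (utility under preference 1, utility under preference 2)


def dominates(a: Point, b: Point) -> bool:
    """a dominates b if a is at least as good everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points: Sequence[Point]) -> List[Point]:
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def best_by_weights(points: Sequence[Point], weights: Tuple[float, float]) -> Point:
    """Pick the point maximising the weighted sum of the two utilities."""
    return max(points, key=lambda p: sum(w * u for w, u in zip(weights, p)))


# Hypothetical outcomes; (3, 3) is dominated by (4, 8) and by (7, 5).
outcomes: List[Point] = [(0.0, 10.0), (4.0, 8.0), (7.0, 5.0), (9.0, 1.0), (3.0, 3.0)]
front = pareto_front(outcomes)

# Different positive weightings select different points, all of them Pareto-optimal.
picks = {best_by_weights(outcomes, (w, 1.0 - w)) for w in (0.1, 0.3, 0.5, 0.7, 0.9)}
print(front)
print(picks)
```

So Pareto optimality alone does not single out a trade-off: the choice of weights does that work, and nothing in the Pareto criterion says which weights to use.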

Scope insensitivity to the rescue?

There are two obvious senses in which an outcome could be terrible:

  1. We could lose something of great value, never to have it again.
  2. We could fall drastically short of maximising a utility function to the utmost.

From the perspective of a utility maximiser, both these outcomes could be equally terrible - it's just a question of computing the expected utility difference between the two scenarios.

However, for actual humans, the first scenario seems to loom much larger. This can be seen as a form of scope insensitivity: we might say that we believe in total utilitarianism, but we don't feel that a trillion people is really a thousand times better than a billion people, so the larger the numbers grow, the more we are, in practice, willing to trade off total utilitarianism for other values.

Now, we might deplore that state of affairs (that deploring is a valid meta-preference), but that does seem to be how humans work. And though there are arguments against scope insensitivity for actually existent beings, it is perfectly consistent to reject them when considering whether we have a duty to create new beings.

What this means is that people's preferences seem much closer to smooth minimums than to linear sums. Some are explicitly set up like that from the beginning (those that go in the smoothmin bucket). Others may be like that in practice, either because meta-preferences want them to be, or because of the vast size of the future: see next section.

The size of the future

The future is vast, with the energy of billions of galaxies, efficiently used, at our disposal. Potentially far, far larger than that, if we're clever about our computations.

That means that it's far easier to reach "agreement" between two utility functions with diminishing marginal returns (as most of them will be, in practice and in theory). Even without diminishing marginal returns, and without using smoothmin, it's unlikely that one utility function will retain the highest marginal returns all the way up to all resources being used up. At some point, benefiting a tiny little preference slightly will likely be easier.
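A toy allocation makes the diminishing-returns argument concrete. The sketch assumes each preference has utility `weight * log(1 + resources)`, a hypothetical functional form chosen only because it has diminishing marginal returns, and hands out resource units one at a time to whichever preference currently gains the most.

```python
import heapq
import math
from typing import Dict


def allocate(weights: Dict[str, float], budget: int) -> Dict[str, int]:
    """Greedily give each unit to the preference with the highest marginal utility.

    Assumes utility_i(r) = weights[i] * log(1 + r), an illustrative concave form.
    """
    allocation = {name: 0 for name in weights}
    if not weights:
        return allocation

    def marginal(name: str) -> float:
        # Utility gained by moving this preference from r to r + 1 units.
        r = allocation[name]
        return weights[name] * (math.log(r + 2) - math.log(r + 1))

    # heapq is a min-heap, so store negated marginal gains.
    heap = [(-marginal(name), name) for name in weights]
    heapq.heapify(heap)
    for _ in range(budget):
        _, name = heapq.heappop(heap)
        allocation[name] += 1
        heapq.heappush(heap, (-marginal(name), name))
    return allocation


# Hypothetical weights: one preference matters a hundred times more than the other.
weights = {"major": 100.0, "minor": 1.0}
small_world = allocate(weights, budget=10)
large_world = allocate(weights, budget=1000)
print(small_world, large_world)
```

With a small budget the minor preference gets nothing, but once the budget is large enough the major preference's marginal returns fall below the minor one's, and the minor preference starts receiving resources.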

The exception to this is if preferences are explicitly opposed to each other; eg masochism versus pain-reduction. But even there, they are unlikely to be completely and exactly negations of one another. The masochist may find some activities that don't fit perfectly under "increased pain" as traditionally understood, so some compromise between the two preferences becomes possible.

The underdefined nature of some preferences may be a boon here; if X is forbidden, but only in situations in S, then going outside of S may allow the X-loving preferences their space to grow. So, for example, obeying promises might become a general value, but we might allow games, masked balls, or similar situations where lying is allowed, because the advantages of honesty - reputation, ease of coordination - are deliberately absent.

Growth, learning, and following your own preferences

I've argued that our values and preferences will soon become stable as we start to self-modify.

This is going to be hard for those who put an explicit premium on continual moral growth. Now, it's still possible to have continued moral change within a narrow band, but

Finally, there's the issue of what happens when the AI tells you "here is U, the synthesis of your preferences", and you go "well, I have all these problems with it". Since humans are often contrarian by nature, it may be impossible for an AI to construct a U that we would ever explicitly endorse. This is a sort of "self-reference" problem in synthesising preferences.

Tolerance levels

The whole design - with an initial framework, liberal use of smoothmin, a default for standard combinations of preferences, and a vast amount of resources available - is designed to reach an adequate, rather than an optimal solution. Optimal solutions are very subject to Goodhart's law if we don't include everything we care about; if we do include everything we care about, the process may come to resemble the one I've defined here, above.

Conversely, if the human fears that such a synthesis will become badly behaved in certain extreme situations - then that fear will be included in the synthesis. And, if the fear is strong enough, it will serve to direct the outcomes away from those extreme situations.

So the whole design is somewhat tolerant to changes in the initial conditions: different starting points may end up in different end points, but all of them will hopefully be acceptable.

Did I think of everything?

With all such methods, there's the risk of not including everything, and so ending up in a terrible place by omission. That risk is certainly there, but it seems that we couldn't end up in a terrible hellworld, or at least not in one that could be meaningfully described/summarised to the human (because avoiding hellworlds is high on human preferences and meta-preferences, and there is little explicit force pushing the other way).

And I've argued that it's unlikely that indescribable hellworlds are even possible.

However, there are still a lot of holes to fill, and I have to ensure that this doesn't just end up as a series of patches until I can't think of any further patches. That's my greatest fear, and I'm not yet sure how to address it.

13 comments

This seems to assume a fairly specific (i.e., anti-realist) metaethics. I'm quite uncertain about metaethics and I'm worried that if moral realism is true (and say for example that total hedonic utilitarianism is the true moral theory), and what you propose here causes the true moral theory to be able to control only a small fraction of the resources of our universe, that would constitute a terrible outcome. Given my state of knowledge, I'd prefer not to make any plans that imply commitment to a specific metaethical theory, like you seem to be doing here.

What's your response to people with other metaethics or who are very uncertain about metaethics?

However, for actual humans, the first scenario seems to loom much larger.

I don't think this is true for me, or maybe I'm misunderstanding what you mean by the two scenarios.

Leaning on this, someone could write a post about the "infectiousness of realism" since it might be hard to reconcile openness to non-zero probabilities of realism with anti-realist frameworks? :P

For people who believe their actions matter infinitely more if realism is true, this could be modeled as an overriding meta-preference to act as though realism is true. Unfortunately if realism isn't true this could go in all kinds of directions depending on how the helpful AI system would expect to get into such a judged-to-be-wrong epistemic state.

Probably you were thinking of something like teaching AIs metaphilosophy in order to perhaps improve the procedure? This would be the main alternative I see, and it does feel more robust. I am wondering though whether we'll know by that point whether we've found the right way to do metaphilosophy (and how approaching that question is different from approaching whichever procedures philosophically sophisticated people would pick to settle open issues in something like the above proposals). It seems like there has to come a point where one has to hand off control to some in-advance specified "metaethical framework" or reflection procedure, and judged from my (historically overconfidence-prone) epistemic state it doesn't feel obvious why something like Stuart's anti-realism isn't already close to there (though I'd say there are many open questions and I'd feel extremely unsure about how to proceed regarding for instance "2. A method for synthesising such basic preferences into a single utility function or similar object," and also to some extent about the premise of squeezing a utility function out of basic preferences absent meta-preferences for doing that). Adding layers of caution sounds good though as long as they don't complicate things enough to introduce large new risks.

Probably you were thinking of something like teaching AIs metaphilosophy in order to perhaps improve the procedure? This would be the main alternative I see, and it does feel more robust. I am wondering though whether we’ll know by that point whether we’ve found the right way to do metaphilosophy

I think there's some (small) hope that by the time we need it, we can hit upon a solution to metaphilosophy that will just be clearly right to most (philosophically sophisticated) people, like how math and science were probably once methodologically quite confusing but now everyone mostly agrees on how math and science should be done. Failing that, we probably need some sort of global coordination to prevent competitive pressures leading to value lock-in (like the kind that would follow from Stuart's scheme). In other words, if there wasn't a race to build AGI, then there wouldn't be a need to solve AGI safety, and there would be no need for schemes like Stuart's that would lock in our values before we solve metaphilosophy.

it doesn’t feel obvious why something like Stuart’s anti-realism isn’t already close to there

Stuart's scheme uses each human's own meta-preferences to determine their own (final) object-level preferences. I would be less concerned if this was used on someone like William MacAskill (with the caveat that correctly extracting William MacAskill's meta-preferences seems equivalent to learning metaphilosophy from William) but a lot of humans have seemingly terrible meta-preferences or at least different meta-preferences which likely lead to different object-level preferences (so they can't all be right, assuming moral realism).

To put it another way, my position is that if moral realism or relativism (positions 1-3 in this list) is right, we need "metaphilosophical paternalism" to prevent a "terrible outcome", and that's not part of Stuart's scheme.

I would be less concerned if this was used on someone like William MacAskill [...] but a lot of humans have seemingly terrible meta-preferences

In those cases, I'd give more weight to the preferences than the meta-preferences. There is the issue of avoiding ignorant-yet-confident meta-preferences, which I'm working on writing up right now (partially thanks to your very comment here, thanks!)

or at least different meta-preferences which likely lead to different object-level preferences (so they can't all be right, assuming moral realism).

Moral realism is ill-defined, and some allow that humans and AI would have different types of morally true facts. So it's not too much of a stretch to assume that different humans might have different morally true facts from each other, so I don't see this as being necessarily a problem.

Moral realism through acausal trade is the only version of moral realism that seems to be coherent, and to do that, you still have to synthesise individual preferences first. So "one single universal true morality" does not necessarily contradict "contingent choices in figuring out your own preferences".

There is the issue of avoiding ignorant-yet-confident meta-preferences, which I’m working on writing up right now (partially thanks to your very comment here, thanks!)

I look forward to reading that. In the meantime can you address my parenthetical point in the grand-parent comment: "correctly extracting William MacAskill’s meta-preferences seems equivalent to learning metaphilosophy from William"? If it's not clear, what I mean is that suppose Will wants to figure out his values by doing philosophy (which I think he actually does), does that mean that under your scheme the AI needs to learn how to do philosophy? If so, how do you plan to get around the problems with applying ML to metaphilosophy that I described in Some Thoughts on Metaphilosophy?

There is one way of doing metaphilosophy this way, which is "run (simulated) William MacAskill until he thinks he's found a good metaphilosophy" or "find a description of metaphilosophy to which WM would say 'yes'."

But what the system I've sketched would most likely do is come up with something to which WM would say "yes, I can kinda see why that was built, but it doesn't really fit together as I'd like and has some ad hoc and object-level features". That's the "adequate" part of the process.

Uncertainty about metaethics seems a serious source of risk in AI safety, and especially AI alignment. I've written a paper detailing how we might approach such fundamental uncertainty, analysing which positions minimize risk so that we don't expose ourselves to unnecessary risk by making assumptions we need not make.

My aim is to find a decent synthesis of human preferences. If someone has a specific metaethics and compelling reasons why we should follow that metaethics, I'd then defer to that. I'm focusing my research on the synthesis because I find that possibility very unlikely (the more work I do, the less coherent moral realism seems to become).

But, as I said, I'm not opposed to moral realism in principle. Looking over your post, I would expect that if 1, 4, 5, or 6 were true, that would be reflected in the synthesis process. Depending on how I interpret it, 2 would be partially reflected in the synthesis process, and 3 maybe very partially.

If there were strong evidence for 2 or 3, then we could either a) include them in the synthesis process, or b) tell humans about them, which would include them in the synthesis process indirectly.

Since I see the synthesis process as aiming for an adequate outcome, rather than an optimal one (which I don't think exists), I'm actually ok with adding in some moral-realism or other assumptions, as I see this as making a small shift among adequate outcomes.

As you can see in this post, I'm also ok with some extra assumptions in how we combine individual preferences.

There's also some moral-realism-for-humans variants, which assume that there are some moral facts which are true for humans specifically, but not for agents in general; this would be like saying there is a unique synthesis process. For those variants, and some other moral realist claims, I expect the process of figuring out partial preferences and synthesising them, will be useful building blocks.

But mainly, my attitude to most moral realist arguments, is "define your terms and start proving your claims". I'd be willing to take part in such a project, if it seemed realistically likely to succeed.

I don't think this is true for me, or maybe I'm misunderstanding what you mean by the two scenarios.

You may not be the most typical of persons :-) What I mean is that if we cut people's lifetimes to a third, or had a vicious totalitarian takeover, or made everyone live in total poverty, then people would find any of these outcomes quite bad, even if we increased lifetimes/democracy/GDP to compensate for the loss along one axis.

I am relatively new to the (large number of) utility / preference discussions on Lesswrong. Can you please tell me what a reasonable and relatively short introduction to the foundations would be?

My problem is that the discussion or research project seems to be detached from the economics literature. I also do not see any discussion of "contribution to the literature" in your post, so it is hard for me to see the starting point.

Just to give a little background to see where I am starting. The following is my understanding of welfare evaluations in economics. I hope I do not misuse your post too much, because my comment may have little concrete relation to what you write.

In theoretical Microeconomics, there are basically four approaches:

1. Understanding utility as preferences. This is completely ordinal, and it's unclear how utility between people should be compared. From a welfare-maximization perspective, this is very problematic, as shown by Arrow's impossibility theorem.

2. von-Neumann-Morgenstern expected utility. Here, utility functions are cardinal, but expected utility is ordinal and again it's not clear how utility could be compared. So I guess that the impossibility theorem still applies.

3. Welfare economics. Here we just ignore the problem by adding up market surplus, implicitly or explicitly assuming that all utility functions are quasi-linear, and linear in income. And additionally, we implicitly assume almost always that it is not a problem that people are not compensated compared to a pre-policy state of the world, as long as the winners could compensate the losers (Kaldor-Hicks criterion). This is a value assumption, though I have read economists that have claimed that the opposite would be a value assumption. Welfare economics includes an expected-value version, which is no problem because everything is cardinal.

4. Prospect theory and similar approaches that include reference points (of a person's consumption, income, whatever). While there is a lot of evidence that this is more successful at explaining behavior, I am not sure whether there is any accepted welfare theory based on that. I guess the problem is that if reference points and social preferences enter the utility function, strange implications may arise. If there are rich and poor people, then redistribution has to take into account their reference points, which would limit redistribution, which seems unfair. Additionally, if I can somehow convince myself that I deserve more money, and a benevolent utilitarian planner would be omniscient and thus see my conviction, then he should give me more money.
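To make the Kaldor-Hicks criterion from approach 3 concrete, here is a small sketch contrasting it with a Pareto improvement. It assumes each person's gain or loss from a policy is already expressed in money, as welfare economics does; the function names and numbers are hypothetical.

```python
from typing import Sequence


def is_pareto_improvement(changes: Sequence[float]) -> bool:
    """Nobody loses, and at least one person gains."""
    return all(c >= 0 for c in changes) and any(c > 0 for c in changes)


def is_kaldor_hicks_improvement(changes: Sequence[float]) -> bool:
    """Total gains exceed total losses, so winners could compensate losers.

    The compensation does not actually have to be paid.
    """
    return sum(changes) > 0


# Hypothetical policy: one winner gains 50, two losers lose 20 and 10.
policy = [50.0, -20.0, -10.0]
print(is_pareto_improvement(policy), is_kaldor_hicks_improvement(policy))
```

The policy passes the Kaldor-Hicks test while leaving two people worse off, which is where the value assumption mentioned above enters.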

Reading Kahneman's research summarized in Thinking, Fast and Slow also leads to weird conclusions, because when people evaluate their life, their evaluations are weird. Kahneman writes, for example, that people evaluate the pain suffered in some span of time by the pain at the end and the highest value of pain. Which makes people choose "60 seconds of strong pain plus 30 seconds of moderate pain" over "60 seconds of strong pain".
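The pain example can be reproduced with a simple peak-end score. The sketch assumes the common formalisation in which the remembered evaluation is the average of the peak intensity and the final intensity; the intensity values (8 for strong, 4 for moderate) are made up for illustration.

```python
from typing import List, Tuple

Episode = List[Tuple[float, int]]  # list of (pain intensity, duration in seconds)


def total_pain(episode: Episode) -> float:
    """Pain integrated over time: what a duration-sensitive evaluator would use."""
    return sum(intensity * seconds for intensity, seconds in episode)


def peak_end_score(episode: Episode) -> float:
    """Average of the worst moment and the final moment, ignoring duration."""
    peak = max(intensity for intensity, _ in episode)
    end = episode[-1][0]
    return (peak + end) / 2


short_episode: Episode = [(8.0, 60)]              # 60 seconds of strong pain
long_episode: Episode = [(8.0, 60), (4.0, 30)]    # the same, plus 30 seconds of moderate pain

print(total_pain(short_episode), total_pain(long_episode))
print(peak_end_score(short_episode), peak_end_score(long_episode))
```

Under these assumptions the longer episode contains strictly more pain but gets the lower remembered score, which matches the choice pattern described above.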

Then there are many welfare discussions that use macroeconomic models, i.e., assuming a cardinal utility function of a representative agent (usually expected utilitarian discounted utility, sometimes max-min / Rawlsian). I think there is no real theoretical foundation.

Finally, there are empirical redistribution preferences that show that people have a preference for giving money to people who "deserve" it by some measure. This could be understood as similar to welfare evaluations based on prospect theory, but it additionally tells us where the reference points would come from.

I think in terms of economics, vNM expected utility is closest to how we tend to think about utility/preferences. The problem with vNM (from our perspective) is that it assumes a coherent agent (i.e., an agent that satisfies the vNM axioms) but humans aren't coherent, in part because we don't know what our values are or should be. ("Humans don't have utility functions" is a common refrain around here.) From academia in general, the approach that comes closest to how we tend to think about values is reflective equilibrium, although other meta-ethical views are not unrepresented around here.

For utility comparisons between people, I think a lot of thinking here has been based on or inspired by game theory, e.g., bargaining games.

Of course there is a lot of disagreement and uncertainty between and within individuals on LW, so specific posts may well be based on different foundations or are just informal explorations that aren't based on any theoretical foundations.

In this post, Stuart seems to be trying to construct an extrapolated/synthesized (vNM or vNM-like) utility function out of a single human's incomplete and inconsistent preferences and meta-preferences, which I don't think has much of a literature in economics?

In this post, Stuart seems to be trying to construct an extrapolated/synthesized (vNM or vNM-like) utility function out of a single human's incomplete and inconsistent preferences and meta-preferences

Indeed that's what I'm trying to do. The reasons are that utility functions are often more portable (easier to extend to new situations) and more stable (less likely to change under self-improvement).

This has prompted me to get off my butt and start publishing the more useful bits of what I've been thinking about. Long story short, I disagree with you while still almost entirely agreeing with you.

This isn't really the full explanation of why I think the AI can't just be given a human model and told to fill it in, though. For starters, there's also the issue about whether the human model should "live" in the AI's native ontology, or whether it should live in its own separate, "fictional" ontology.

I've become more convinced of the latter - that if you tell the AI to figure out "human values" in a model that's interacting with whatever its best-predicting ontology is, it will come up with values that include things as strange as "Charlie wants to emit CO2" (though not necessarily in the same direction). Instead, its model of my values might need to be described in a special ontology in which human-level concepts are simple but the AI's overall predictions are worse, in order for a predictive human model to actually contain what I'd consider to be my values.

While reading this, I had a thought, which may be tangential to everything said above, but is still a kind of comment.

The thought is that there could be different types of answers to the question "what are human values":

1) One is formal: what are the correct ways of representing human values: words, utility functions, equations, choices?

2) Another is factual: what are the actual preferences of this person, or of humans in general?

3) The third is procedural: what should I do to learn this person's preferences?

4) The fourth is philosophical: what are "human values" as a type of object in the ontological sense: moral facts, observations, approximations, predictive models, opinions, self-models, qualia, etc.?

5) The last is neurological: how are said values preserved in the brain? What is the neural correlate of value?

Also, what is the axiological value of human values - why are they good at all? Are they an end point or a starting point, a sin or a blessing?