This post is an example of how one could go about synthesising contradictory human preferences and meta-preferences, using a somewhat simplified version of the method of this post.

The aim is to illustrate how I imagine the process going. I'll take the example of population ethics, since it's an area with divergent intuitions and arguments.

Over-complicated values?

One key point of this synthesis is that I am very reluctant to throw a preference away completely, preferring to keep it around in a very weakened form. The reasoning behind this decision is that 1) a weak preference will make little practical difference for most decisions, and 2) most AI catastrophes involve over-simple utility functions, so simplicity itself is something to be wary of, and there is no clear boundary between "good simplicity" and "bad simplicity".

So we might end up losing most of human value if we go too simple, and be a little bit inefficient if we don't go simple enough.

Basic pieces

Human H has not thought hard about population ethics at all. In terms of theory, they have a vague preference for total utilitarianism, and they have some egalitarian preferences. They also have a lot of "good/bad" examples drawn from some knowledge of history and some fictional experiences (i.e. books, movies, and TV shows).

Note that it is wrong to use fiction as evidence of how the world works. However, using fiction as a source of values is not intrinsically wrong.

At the meta-level, they have one relevant preference: a preference for simple models.

Weights of the preferences

H holds these preferences with different strengths, denoted by (un-normalised) weights. Their weight on total utilitarianism is , their weight on egalitarianism is , their weight on their examples is , and their weight on model simplicity is .

Future arguments

The AI checks how the human would respond if presented with the mere addition argument, the repugnant conclusion, and the very repugnant conclusion.

They accept the logic of the mere addition argument, but feel emotionally against the two repugnant conclusions; most of the examples they can bring to mind are against them.

Synthesis

There are multiple ways of doing the synthesis; here is one I believe is acceptable. The model simplicity weight is ; however, the mere addition argument only works under total-utilitarian preferences, not under egalitarian preferences. The ratio of total to egalitarian is , so the weight given to total utilitarian-style arguments by the mere addition argument is .

(Notice that egalitarianism has no problem with the repugnant conclusion at all; however, it doesn't endorse the mere addition argument, so this provides no extra weight.)

If we tried to do a best fit of utility theory to all of H's mental examples, we'd end up with a utility that is mainly some form of prioritarianism with some extra complications; write this as , for a utility function whose changes in magnitude are small compared with .

Let and be total and egalitarian population ethics utilities. Putting this together, we could get an overall utility function:

  • .

The on is its initial valuation, the is the extra component it picked up from the mere addition argument.
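To make the shape of this synthesis concrete, here is a minimal sketch in Python. Everything in it is a placeholder: the weights, the particular functional forms of the total, egalitarian, and prioritarian components, and the representation of a "world" as a list of individual welfare levels are all hypothetical illustrations, not the values or definitions from the text. The only structural claims it encodes are that the overall utility is a weighted sum of the components, and that the total-utilitarian component picks up extra weight from the mere addition argument.

```python
import math
from typing import List

# A "world" is summarised here (very crudely) by the welfare level of each
# individual in it; welfare is assumed non-negative in this toy model.
World = List[float]

def u_total(w: World) -> float:
    """Total utilitarianism: sum of individual welfare."""
    return sum(w)

def u_egal(w: World) -> float:
    """A crude egalitarian score: negative spread between best and worst off."""
    return -(max(w) - min(w))

def u_prior(w: World) -> float:
    """A prioritarian-ish fit to H's examples: concave in individual welfare,
    so the worst off count for more."""
    return sum(math.sqrt(x) for x in w)

# Hypothetical (un-normalised) weights; placeholders, not the post's numbers.
W_TOTAL_INITIAL = 1.0   # initial weight on total utilitarianism
W_TOTAL_MAA     = 0.5   # extra weight picked up from the mere addition argument
W_EGAL          = 1.0   # weight on egalitarianism
W_PRIOR         = 2.0   # weight on the fit to H's mental examples

def u_overall(w: World) -> float:
    """Weighted synthesis: no component is thrown away, only down-weighted.
    In a fuller version each component would be normalised first (see the
    min-max normalisation discussed below)."""
    return ((W_TOTAL_INITIAL + W_TOTAL_MAA) * u_total(w)
            + W_EGAL * u_egal(w)
            + W_PRIOR * u_prior(w))
```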

In practice

So what does maximising this overall utility look like in practice? Well, because the weights of the three main utilities are comparable, maximisation will push towards worlds that score well on all three: high total utility, high equality, and high welfare for the worst off. Mild increases in inequality are acceptable in exchange for large increases in total utility or in the welfare of the worst-off members of the population, and the other trade-offs are similar.

What is mainly ruled out are worlds where one factor is maximised strongly, but the others are ruthlessly minimised.
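As a toy illustration of that last point (the welfare numbers and weights below are invented, and the component definitions mirror the sketch above): a world that scores reasonably on all components beats one that maximises total welfare while ruthlessly minimising equality and the welfare of the worst off, unless the weight on total welfare is made extreme.

```python
def score(world, w_total=1.5, w_egal=1.0, w_prior=2.0):
    """Weighted synthesis of the three components, as in the sketch above."""
    total = sum(world)
    egal = -(max(world) - min(world))
    prior = sum(x ** 0.5 for x in world)  # concave: prioritises the worst off
    return w_total * total + w_egal * egal + w_prior * prior

balanced = [5.0, 5.0, 5.0, 5.0]      # modest total welfare, perfect equality
lopsided = [30.0, 0.1, 0.1, 0.1]     # higher total welfare, worst off near zero

print(score(balanced), score(lopsided))   # ~47.9 vs ~28.4: the balanced world wins
print(score(balanced, w_total=10.0),      # ~217.9 vs ~286.0: only an extreme
      score(lopsided, w_total=10.0))      # weight on total welfare flips it
```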

Issues

The above synthesis raises a number of issues (which is why I was keen to write it down).

Utility normalisation

First of all, there's the question of normalising the various utility functions. A slight preference for a utility function can swamp all other preferences if its changes in magnitude across different choices are huge. I tend to assume a min-max normalisation for any utility $u$, so that

$$\mathbb{E}[u \mid \pi_u] - \mathbb{E}[u \mid \pi_{-u}] = 1,$$

with $\pi_u$ the optimal policy for $u$, and $\pi_{-u}$ the optimal policy for $-u$ (hence the worst policy for $u$). This min-max normalisation doesn't have particularly nice properties, but then again, neither do any other normalisations.
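Here is a minimal sketch of how this normalisation could be applied, assuming we can already estimate each utility's expected value under its own optimal and pessimal policies; the million-scale spread below is an invented example, not a figure from the post.

```python
def min_max_normalise(u_best: float, u_worst: float):
    """Return (a, b) such that the rescaled utility a*u + b has expected
    value 1 under u's optimal policy and 0 under the optimal policy for -u
    (i.e. u's worst policy). The shift b is just a convention; only the
    best-to-worst spread of 1 matters for a utility function.

    u_best:  E[u | policy optimal for u]
    u_worst: E[u | policy optimal for -u]
    """
    spread = u_best - u_worst
    return 1.0 / spread, -u_worst / spread

# A utility with a huge spread across policies gets scaled down accordingly,
# so a slight preference for it cannot swamp all the other preferences.
a, b = min_max_normalise(u_best=1e6, u_worst=-1e6)
print(a, b)   # 5e-07, 0.5
```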

Factual errors and consequences

I said that fictional evidence and historical examples are fine for constructing preferences. But what if part of this evidence is based on factually wrong suppositions? For example, maybe one of the strong examples in favour of egalitarianism is H imagining themselves in very poor situations. But maybe people who are genuinely poor don't suffer as much as H imagines they would. Or, conversely, maybe people do suffer from lack of equality more than H might think.

Things like this seem like a simple division between factual beliefs and preferences, but the two are not so easy to disentangle. A religious fundamentalist might have a picture of heaven in which everyone would actually be miserable. To convince them of this, it is not sufficient to simply point it out; the process of causing them to believe it may break many other aspects of their preferences and world-view as well. More importantly, learning some true facts will likely cause people to change the strength (the weight) with which they hold certain preferences.

This does not seem insoluble, but it is a challenge to be aware of.

Order of operations and underdefinedness

Many people construct explicit preferences by comparing their consequences with mental examples, then rejecting or accepting those preferences based on the fit. This process is very vulnerable to which examples spring to mind at the time. I showed a process that extrapolated current preferences by imagining H encountering the mere addition argument and the very repugnant conclusion. But I chose those examples because they are salient to me and to a lot of philosophers in that area, so the choice is somewhat arbitrary. The point at which H's preferences are taken as fixed (before or after which hypothetical arguments) will be important for the final result.

Similarly, I used and separately to consider H's reactions to the mere addition argument. I could instead have used to do so, or even . For these utilities, the mere addition argument doesn't go through at all, so does not give any extra weight to . Why did I do it that way? Because I judged that the mere addition argument sounds persuasive, even to people who should actually reject it based on some synthesis of their current preferences.

So there remain a lot of details to fill in and choices to make.

Meta-meta-preferences

And, of course, H might have higher order preferences about utility normalisation, factual errors, orders of operation, and so on. These provide an extra layer of possible complications to add on.

Comments

I don't think all AI catastrophes come from oversimplification of value functions. Suppose we had 1000 weak preferences, , with . Each of them is supposed to be , but due to some weird glitch in the definition of , it has an unforeseen maximum of 1,000,000, and that maximum is paperclips. In this scenario, the AI is only as friendly as the least friendly piece.

Alternatively, if the value of each is linear or convex in resources spent maximizing it, or other technical conditions hold, then the AI just picks a single to focus all resources on. If some term is very easily satisfied, say is a slight preference that it not wipe out all beetles, then we get a few beetles living in a little beetle box, and 99.99...% of resources turned into whatever kind of paperclip it would otherwise have made.

If we got everyone in the world who is "tech literate" to program a utility function (in some easy-to-use utility function programming tool?), bounded them all and summed the lot together, then I suspect that the AI would still do nothing like optimizing human values. (To me, this looks like a disaster waiting to happen.)

I agree. (On that issue, I think a soft min is better than a sum.) However, throwing away 's is still a bad idea; my requirement is necessary but not sufficient.
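A minimal illustration of the sum-versus-soft-min contrast (the particular smooth-minimum formula, the temperature, and the "glitched" values are my own hypothetical choices, not anything specified in the thread):

```python
import math

def soft_min(values, temperature=1.0):
    """A smooth minimum: dominated by the worst-scoring components."""
    return -temperature * math.log(sum(math.exp(-v / temperature) for v in values))

# 1000 weak preferences, each meant to stay in [0, 1]. In scenario A one
# component has an unforeseen glitch and reaches a huge value while the rest
# are neglected; in scenario B every component is modestly satisfied.
scenario_a = [0.1] * 999 + [1_000_000.0]   # exploit the glitched component
scenario_b = [0.9] * 1000                  # weakly satisfy everything

print(sum(scenario_a), sum(scenario_b))            # A wins by ~1e6 under a sum
print(soft_min(scenario_a), soft_min(scenario_b))  # B wins under the soft min
```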

Questions:

  1. When you condition on , do you expect every other agent to also implement an optimal policy for , or do they keep doing what they're doing?
  2. Is a humanly realistic policy, or an unboundedly optimal policy? For example, conditional on , should I expect to quickly reduce all (-relevant) x-risks to 0?
  3. In the future we're likely to have much better knowledge about our universe and about logical facts that go into the expected utility computation. Do we keep redoing this normalization as time goes on, or fix it to the current time, or maybe do this normalization while pretending to know less than we actually do?

The way I'm imagining it:

  1. I generally consider that all other agents do what they would have done anyway. This agent follows some with probability , and follows and with probabilities . The conditionals condition on one of the two unlikely policies being chosen.

  2. I think either works for normalisation purposes, so I'd assume human-realistic. EDIT: "Either works" is wrong, see my next answer; we should use "human-realistic".

  3. The normalisation process is completely time-inconsistent, and so is done once, at a specific time, and not repeated.

Do you have ideas on how the normalisation process can be improved? Because it's very much a "better than all the alternatives I know" at the moment.

  1. I think either works for normalisation purposes, so I’d assume human-realistic.

But they lead to very different normalization outcomes, don't they? Say represents total hedonic utilitarianism. If (and ) are unboundedly optimal, then conditional on that, I'd take over the universe and convert everything into hedonium (respectively dolorium). But if is just human-realistic, then the difference between and is much smaller. (In one scenario, I get a 1/8 billionth share of the universe and turn that into hedonium/dolorium, so the ratio between human-realistic and unboundedly optimal is 8 billion.) On the other hand, if has strongly diminishing marginal utilities, then taking over the universe isn't such a huge improvement over a human-realistic policy. The ratio between human-realistic and unboundedly optimal might be only, say, 2 or 100 for this . So this leads to different ways to normalize the two utility functions depending on "human-realistic" or "unboundedly optimal".

For human-realistic, there's also a question of realistic for whom? For the person whose values we're trying to aggregate? For a typical human? For the most capable human who currently exists? Each of these leads to different weights amongst the utility functions.
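A sketch of this point about the normalisation depending on the policy class; the spreads below are invented orders of magnitude, chosen only to echo the ratios in the comment:

```python
# Hypothetical best-minus-worst spreads (arbitrary units) for two utilities,
# under two different policy classes used for the min-max normalisation.
spreads = {
    "u_hedonic": {"human_realistic": 1.0, "unboundedly_optimal": 8e9},  # ~linear in resources
    "u_dimin":   {"human_realistic": 1.0, "unboundedly_optimal": 2.0},  # diminishing returns
}

for policy_class in ("human_realistic", "unboundedly_optimal"):
    # Min-max normalisation scales each utility by 1 / spread.
    w_hedonic = 1.0 / spreads["u_hedonic"][policy_class]
    w_dimin = 1.0 / spreads["u_dimin"][policy_class]
    print(policy_class, w_hedonic / w_dimin)
# human_realistic     -> 1.0
# unboundedly_optimal -> 2.5e-10 (the hedonic utility is scaled way down)
```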

  1. The normalisation process is completely time-inconsistent, and so is done once, at a specific time, and not repeated.

Since this means we'll almost certainly regret doing this, it strongly suggests that something is wrong with the idea.

Do you have ideas on how the normalisation process can be improved? Because it’s very much a “better than all the alternatives I know” at the moment.

Maybe normalization won't be needed if we eventually just figure out what our true/normative values are. I think in the meantime the ideal solution would be keeping our options open rather than committing to a specific process. Perhaps you could argue for considering this idea as a "second best" option (i.e., if we're forced to pick something due to time/competitive pressure), in which case I think it would be good to state that clearly.

But they lead to very different normalization outcomes, don't they?

Apologies, I was wrong in my answer. The normalisation is "human-realistic", in that the agent is estimating "the best they themselves could do" vs "the worst they themselves could do".

Since this means we'll almost certainly regret doing this, it strongly suggests that something is wrong with the idea.

This is an inevitable feature of any normalisation process that depends on the difference in future expected values. Suppose is a utility that can be or within the next day; after that, any action or observation will only increase or decrease by at most . The utility , in contrast, is unless the human does the same action every day for ten years, when it will become . The normalisation of and will be very different depending on whether you normalise now or in two days' time.

You might say that there's an idealised time in the past where we should normalise it (a sort of veil of ignorance), but that just involves picking a time, or a counterfactual time.
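To make the time-dependence concrete, here is a sketch with invented numbers: a utility that can swing by ±100 within the next day and only by ±1 afterwards, versus one whose ten-year routine is worth 50 either way. None of these values come from the comment.

```python
def spread_u1(days_from_now: int) -> float:
    """Best-minus-worst attainable value of the 'big imminent swing' utility."""
    return 200.0 if days_from_now == 0 else 2.0   # the big swing is gone after day 0

def spread_u2(days_from_now: int) -> float:
    """Best-minus-worst attainable value of the 'ten-year routine' utility."""
    return 50.0   # still fully attainable either way

for day in (0, 2):
    w1, w2 = 1.0 / spread_u1(day), 1.0 / spread_u2(day)
    print(f"normalising on day {day}: weight ratio = {w1 / w2:.2f}")
# day 0: ratio 0.25  (the first utility is scaled down because of its huge imminent swing)
# day 2: ratio 25.00 (the swing is in the past, so it is scaled way up instead)
```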

Lastly, "regret" doesn't quite mean the same thing as usual, since this is regret between weights of preference which we hold.

Now, there is another, maybe more natural way of normalising things: cash out the utilities as examples, and see how intense our approval/disapproval of these examples is. But that approach doesn't allow us to overcome, e.g., scope insensitivity.

if we eventually just figure out what our true/normative values are.

I am entirely convinced that there are no such things. There are maps from {lists of assumptions + human behaviour + elements of the human internal process} to sets of values, but different assumptions will give different values, and we have no principled way to distinguish between them, except for using our own contradictory and underdefined meta-preferences.

The normalisation is “human-realistic”, in that the agent is estimating “the best they themselves could do” vs “the worst they themselves could do”.

But this means the normalization depends on how capable the human is, which seems strange, especially in the context of AI. In other words, it doesn't make sense that an AI would obtain different values from two otherwise identical humans who differ only in how capable they are.

I am entirely convinced that there are no such things.

In a previous post, you didn't seem this certain about moral anti-realism:

Even if the moral realists are right, and there is a true R, thinking about it is still misleading. Because there is, as yet, no satisfactory definition of this true R, and it's very hard to make something converge better onto something you haven't defined. Shifting the focus from the unknown (and maybe unknowable, or maybe even non-existent) R, to the actual P, is important.

Did you move further in the anti-realist direction since then? If so, why?

There are maps from {lists of assumptions + human behaviour + elements of the human internal process} to sets of values, but different assumptions will give different values, and we have no principled way to distinguish between them, except for using our own contradictory and underdefined meta-preferences.

I agree this is the situation today, but I don't see how we can be so sure that it won't get better in the future. Philosophical progress is a thing, right?

But this means the normalization depends on how capable the human is, which seems strange, especially in the context of AI.

The min-max normalisation is supposed to measure how much a particular utility function "values" the human moving from being a u-antagonist to a u-maximiser. The full impact of that change is included, so if the human is about to program an AI, the effect is huge. You might see it as the AI asking "utility u: maximise, yes or no?", with the spread between "yes" and "no" being normalised.

Did you move further in the anti-realist direction since then? If so, why?

How I describe my position can vary a lot. Essentially, I think that there might be a partial order among sets of moral axioms, in that it seems plausible to me that you could say that set A is almost-objectively better than set B (more rigorously: according to criterion c, A>B, and criterion c seems a very strong candidate for an "objectively true" axiom; something comparable to the basic properties of equality, https://en.wikipedia.org/wiki/Equality_(mathematics)#Basic_properties).

But it seems clear there is not going to be a total order, nor a maximum element.

I agree this is the situation today, but I don't see how we can be so sure that it won't get better in the future. Philosophical progress is a thing, right?

Progress in philosophy involves uncovering true things, not making things easier; mathematics is a close analogue. For example, computational logic would have been a lot simpler if in fact there existed an algorithm that figured out if a given Turing machine would halt. The fact that Turing's result made everything more complicated didn't mean that it was wrong.

Similarly, the only reason to expect that philosophy would discover moral realism to be true is if we currently had strong reasons to suppose that moral realism is true.