Value Stability and Aggregation

by jimrandomh, 7 min read, 6th Feb 2011, 41 comments



One of the central problems of Friendly Artificial Intelligence is goal system stability. Given a goal system - whether it's a utility function, a computer program, or a couple kilograms of neural tissue - we want to determine whether it's stable; that is, whether something might plausibly happen to it that would radically alter its behavior in a direction we don't like. As a first step in solving this problem, let's consider a classic example of a goal system that is not stable.

Suppose you are a true Bentham-Mill Utilitarian, which means you hold that the right thing to do is that which maximizes the amount of happiness minus the amount of pain in the world, summed up moment by moment. Call this HapMax for short. You determine this by assigning each person a happiness-minus-pain score at each moment, based on a complex neurological definition, and adding up the scores of each person-moment. One day, you are interrupted from your job as an antidepressant research chemist by a commotion outside. Rushing out to investigate, you find a hundred-foot tall monster rampaging through the streets of Tokyo, which says:

"I am a Utility Monster. Robert Nozick grew me in his underwater base, and now I desire nothing more than to eat people. This makes me very happy, and because I am so very tall and the volume of my brain's reward center grows with the cube of my height, it makes me *so* happy that it will outweigh the momentary suffering and shortened lifespan of anyone I eat."

As a true HapMaxer (not to be confused with a human, who might claim to be a HapMaxer but can't actually be one), you find this very convincing: the right thing to do is to maximize the number of people the monster can eat, so you heroically stand in front of the line of tanks that is now rolling down main street to buy it time. HapMax seemed like a good idea at first, but this example shows that it is very wrong. What lessons should we learn before trying to build another utility function? HapMax starts by dividing the world into pieces, one per agent, and the trouble starts when one of those agents doesn't behave as expected.

Dividing and Recombining Utility

Human values are too complex to specify in one go, so like other complex things, we manage the complexity by subdividing the problem, solving the pieces, then recombining them back into a whole solution. Let's call these sub-problems value fragments, and the recombination procedure utility aggregation. If all of the fragments are evaluated correctly and the aggregation procedure is also correct, then this yields a correct solution.

There are plenty of different ways of slicing up utility functions, and we can choose as many of them as desired. You can slice up a utility function by preference type - go through a list of desirable things like "amount of knowledge" and "minus-amount of poverty", assign a score to each representing the degree to which that preference is fulfilled, and assign a weighting to each representing its importance and degree of overlap. You can slice it up by branch - go through all the possible outcomes, assigning a score to each outcome representing how nice a world it is and a weighting for probability. You can slice it up by agent - go through all the people you know about, and assign a score for how good things are for them. And you can slice it up by moment - go through a predicted future step by step, and assign a score for how good the things in the world at that moment are. Any of these slices yields value fragments; a fragment is any reference class that describes a portion of the utility function.

Meta-ethics, then, consists of three parts. First, we choose an overall structure, most popularly a predictor and utility function, and subdivide it into fragments, such as by preference, branch, agent, and moment. Then we specify the subdivided parts - either with a detailed preference-extraction procedure like the one Coherent Extrapolated Volition calls for but doesn't quite specify, or something vague like "preferences". Finally, we add an aggregation procedure.

The aggregation procedure is what determines how stable a utility function is in the face of localized errors. It was a poor choice of aggregation function that made HapMax fail so catastrophically. HapMax aggregates by simple addition, and its utility function is divided by agent. That makes an awful lot of dissimilar fragments. What happens if some of them don't behave as expected? Nozick's Utility Monster problem is exactly that: one of the agents produces utilities that diverge to extremely large values, overpowering the others and breaking the whole utility function.
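The failure described above can be made concrete with a minimal sketch. It assumes that each agent's happiness-minus-pain score is already available as a number; the agent names and score values are hypothetical and chosen only for illustration.

```python
from typing import Dict


def hapmax_total(scores: Dict[str, float]) -> float:
    """Aggregate per-agent happiness-minus-pain scores by simple addition."""
    return sum(scores.values())


# Hypothetical scores for an ordinary day: everyone is mildly happy.
ordinary_day = {f"citizen_{i}": 1.0 for i in range(1000)}
ordinary_day["monster"] = 0.0

# Hypothetical scores when the monster eats everyone: each citizen's
# score drops sharply, but the monster's single score diverges.
monster_feast = {f"citizen_{i}": -100.0 for i in range(1000)}
monster_feast["monster"] = 1e9

print(hapmax_total(ordinary_day))
print(hapmax_total(monster_feast))
# One divergent fragment outweighs a thousand others.
print(hapmax_total(monster_feast) > hapmax_total(ordinary_day))
```

Nothing in the sum limits how much any single fragment can contribute, so one broken or extreme score decides the outcome.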


Aggregation and Error

If human values are as complex as we think, then it is extremely unlikely that we will ever manage to correctly specify every value fragment in every corner case. Therefore, to produce a stable system of ethics and avoid falling for any other sorts of utility monsters, we need to model the sorts of bugs that fragments of our utility function might have, and choose an aggregation function that makes the utility function resilient - that is, we'd like it to keep working and still represent something close to our values even if some of the pieces don't behave as expected. Ideally, every value would be specified multiple times from different angles, and the aggregation function would ensure that no one bug anywhere could cause a catastrophe.

We saw how linear aggregation can fail badly when aggregating over agents - one agent with a very steep utility function gradient can overpower every other concern. However, this is not just a problem for aggregating agents; it's also a problem for aggregating preferences, branches, and moments. Aggregation between branches breaks down in Pascal's Mugging, which features a branch with divergent utility, and in anthropic problems, where the number of branches is not as expected. Aggregation between moments breaks down when considering Astronomical Waste, which features a time range with divergent utility. The effect of linearly aggregating distinct preference types is a little harder to predict, since it depends on just what the inputs are and what bugs they have, but the failures are mostly as bad as tiling the universe with molecular smiley faces, and Goodhart's Law suggests that closing every loophole is impossible.
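The branch case can be sketched the same way. This is a toy version of Pascal's Mugging, assuming branches are given as (probability, utility) pairs; the probabilities and payoffs are made up for illustration.

```python
from typing import List, Tuple

Branch = Tuple[float, float]  # (probability, utility)


def expected_utility(branches: List[Branch]) -> float:
    """Probability-weighted sum of utilities across branches."""
    return sum(p * u for p, u in branches)


# Hypothetical numbers: refusing the mugger changes nothing.
refuse = [(1.0, 0.0)]

# Paying costs 5 units almost certainly, with a tiny chance of an
# astronomically large payoff claimed by the mugger.
tiny = 1e-20
pay = [(1.0 - tiny, -5.0), (tiny, 1e30)]

print(expected_utility(refuse))
print(expected_utility(pay))
# The divergent branch dominates despite its negligible probability.
print(expected_utility(pay) > expected_utility(refuse))
```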

If linear aggregation is so unstable, then how did it become so popular in the first place? It's not that no other possibilities were considered. For example, there's John Rawls' Maximin Principle, which says that we should arrange society so as to maximize how well off the worst-off person is. Now, the Maximin Principle is extremely terrible - it implies that if we find the one person who's been tortured the most, and we can't stop them from being tortured but can make them feel better about it by torturing everyone else, then we should do so. But there are some aggregation strategies that fail less badly, and aren't obviously insane. For example, we can aggregate different moral rules by giving each rule a veto, for predicted worlds and for possible actions. When this fails - if, for example, every course of action is vetoed - it shuts down, effectively reverting to a mostly-safe default. Unfortunately, aggregation by veto doesn't quite work because it can't handle trolley problems, where every course of action is somehow objectionable and there is no time to shut down and punt the decision to a human. The space of possible aggregation strategies, however, is largely unexplored. There is one advantage which has been proven unique to linear aggregation, which no other strategy can have: Dutch Book resistance. However, this may be less important than mitigating the damage bugs can do, and it may be partially recoverable by having utility be linear within a narrow range, and switching to something else (or calling on humans to clarify) in cases outside that range.
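The strategies mentioned above can be compared side by side in a short sketch. The welfare numbers, the rule predicates, and the clipping used in `clamped_linear` are all assumptions made for illustration; in particular, clipping is only one possible reading of "switching to something else" outside the narrow range.

```python
from typing import Callable, List, Sequence


def linear(scores: Sequence[float]) -> float:
    """Simple addition, as in HapMax."""
    return sum(scores)


def maximin(scores: Sequence[float]) -> float:
    """Maximin: a world is only as good as its worst-off fragment."""
    return min(scores)


def clamped_linear(scores: Sequence[float], limit: float = 10.0) -> float:
    """Linear inside [-limit, limit]; values outside are clipped (assumed)."""
    return sum(max(-limit, min(limit, s)) for s in scores)


def permitted_actions(actions: Sequence[str],
                      rules: Sequence[Callable[[str], bool]]) -> List[str]:
    """Aggregation by veto: keep only actions that no rule objects to.
    An empty result corresponds to shutting down and punting."""
    return [a for a in actions if not any(rule(a) for rule in rules)]


# Maximin and the torture example (hypothetical welfare numbers).
one_tortured = [-100.0, 5.0, 5.0, 5.0]
all_tortured = [-99.0, -90.0, -90.0, -90.0]
print(maximin(all_tortured) > maximin(one_tortured))  # maximin prefers torturing all
print(linear(all_tortured) > linear(one_tortured))    # addition does not

# Clamping and the Utility Monster (hypothetical scores).
ordinary_day = [1.0] * 1000 + [0.0]
monster_feast = [-100.0] * 1000 + [1e9]
print(linear(monster_feast) > linear(ordinary_day))
print(clamped_linear(monster_feast) > clamped_linear(ordinary_day))

# Veto and a trolley problem: each rule returns True when it objects.
rules = [lambda a: a == "pull_lever", lambda a: a == "do_nothing"]
print(permitted_actions(["pull_lever", "do_nothing"], rules))
```

Clamping gives up strict linearity, and with it some Dutch Book resistance, in exchange for bounding the damage any single fragment can do.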


Classifying Types of Errors

I believe the next step in tackling the value system stability problem is to explore the space of possible aggregation strategies, evaluating each according to how it behaves when the values it aggregates fail in certain ways. So here, then, is a classification of possible value-fragment errors. Each of these can apply to any reference class:

  • Deletion: The agent forgets about a fragment. A branch or a preference is forgotten, is incorrectly deemed inapplicable, or its fulfillment can't be predicted.
  • Insertion: A random extra preference is added; a branch that's actually impossible is predicted as an outcome; an agent that doesn't exist or isn't morally significant is posited.
  • Divergence: A value fragment or its gradient has a value with a much larger magnitude than expected, possibly infinite or as large as an arbitrary value chosen by some agent.
  • Noise: Each fragment's estimated utility has an error term added, from a Gaussian, log-normal or other distribution.
  • Scaling: The agent encounters or envisions a scenario in which the number of times a value is tested for is qualitatively different than expected.
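One way to start exploring is to write each error type as a function that corrupts a list of fragment scores, then measure how far an aggregate moves. The sketch below is only that: the function names are hypothetical, scaling is modeled as counting every fragment several times (one possible reading), and the median appears purely as a contrasting aggregator, not as a proposal.

```python
import random
import statistics
from typing import Callable, List, Sequence

Aggregator = Callable[[Sequence[float]], float]


def delete(scores: Sequence[float], index: int) -> List[float]:
    """Deletion: one fragment is forgotten."""
    return [s for i, s in enumerate(scores) if i != index]


def insert(scores: Sequence[float], bogus: float) -> List[float]:
    """Insertion: a spurious fragment is added."""
    return list(scores) + [bogus]


def diverge(scores: Sequence[float], index: int, factor: float = 1e9) -> List[float]:
    """Divergence: one fragment's magnitude blows up."""
    out = list(scores)
    out[index] *= factor
    return out


def add_noise(scores: Sequence[float], sigma: float, rng: random.Random) -> List[float]:
    """Noise: a Gaussian error term is added to every fragment."""
    return [s + rng.gauss(0.0, sigma) for s in scores]


def scale(scores: Sequence[float], copies: int) -> List[float]:
    """Scaling (assumed model): every fragment is counted `copies` times."""
    return list(scores) * copies


def damage(aggregate: Aggregator, clean: Sequence[float],
           corrupted: Sequence[float]) -> float:
    """How far the aggregate moves when the fragments are corrupted."""
    return abs(aggregate(corrupted) - aggregate(clean))


clean = [1.0, 2.0, 3.0, 4.0, 5.0]
rng = random.Random(0)  # fixed seed so the noise case is repeatable
corruptions = {
    "deletion": delete(clean, 0),
    "insertion": insert(clean, 100.0),
    "divergence": diverge(clean, 4),
    "noise": add_noise(clean, 0.5, rng),
    "scaling": scale(clean, 3),
}
for name, corrupted in corruptions.items():
    print(name,
          damage(sum, clean, corrupted),
          damage(statistics.median, clean, corrupted))
```

In this toy run the sum is moved most by divergence and scaling, while the median barely moves; the median of course has failure modes of its own, which is exactly the kind of trade-off the classification is meant to expose.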

A good utility function, if it contains subdivisions, must be able to survive errors in any one or even several of those divisions while still representing something close to our values. What sort of function might achieve that purpose?



HapMax seemed like a good idea at first, but this example shows that it is very wrong.

This has nothing to do with instability. HapMax is undesirable under your preferences, not its own preferences. HapMax itself just feeds the Utility Monster.

You either entirely misunderstood what I'm saying, or stopped reading before you got to the thesis statement of this article. You also appear to be using a different definition of the word "instability" than I am.

This wasn't meant to be a defense of HapMax; I used it as an example only because it's familiar and simple enough to use without pulling too much focus from the main point, which was about utility functions in general, including ones that are close enough to valid for you to care about them and including algorithmically-constructed utility functions as in CEV, and not about HapMax in particular. I realize that there are many other things wrong with HapMax and that it is not salvageable.

When I say that HapMax is unstable, I mean that a bug in one subdivision drastically alters the whole thing. Even if there were no utility monster, one might imagine a bug or cosmic ray hit causing an ordinary person to be treated as one.

You seem to be thinking of stability under self-modification, as opposed to what I'm talking about which is stability under introduction of localized qualitative errors.

what I'm talking about which is stability under introduction of localized qualitative errors.

A better word for your concept, then, might be "robustness", rather than "stability".

No - there's nothing unstable or buggy about HapMax. The utility monster is a large change in the input that causes a large change in the output. Instability is when a small change in the input causes a large change in the output. HapMax is stable by any measure I can think of. You just don't like HapMax because you don't think you implement it.

(If you could really perceive the vast orgasmic pleasure of the monster, rather than just reading a text description of it, you might find that you do implement HapMax.)

Note that "punt to a human" isn't just infeasible, it's undesirable, unless you can choose a human that gives the answer you want. Why do FAI folks treat humanity as the baseline, when it's clear that many (and perhaps all) humans are not friendly to start with?

I'm starting to consider whether the idea of individuality strongly implies that there can be no objective agreement on some ethical topics, and therefore that conflict simply has to be included in any ethics that accepts independence of agent beliefs.

With that in mind, it's not errors in aggregation that would be the problem, it's accepted (and valued, even: memicide is repugnant at first glance, though I'm not sure I can say that anything is strictly forbidden in all cases) that agents' preferences cannot be aggregated. You then hit Arrow's Theorem and have no choice but to decide which of the desirable aggregation properties you don't need.

I suggest the "There are several voting systems that side-step these requirements by using cardinal utility (which conveys more information than rank orders)" solution.

Fair point, and I now see that this is exactly what the post is about: assuming we have cardinal measurements, the problem becomes weighting in order to make interpersonal utility comparisons.

HapMax seemed like a good idea at first, but this example shows that it is very wrong.

As you may be aware, some ethical philosophers just bite that bullet.

David Pearce calls it a "utilitronium shockwave".

Linear aggregation works just fine for HapMax: it maximizes the complex neurological definition it wants to maximize, just as it wants. All the "catastrophes" you point out are not due to an agent having an internal error, but rather a conflict between what they want and what humans want. Additionally, the aggregation procedure is completely defined if you have a complete utility function.

So the problem is really "how does a human-like utility function look?" Because, as you argue, anything that tries to take a shortcut to human-like behavior can lead to catastrophes.

Yes, linear aggregation in HapMax produces a result that agrees with HapMax. But it does not agree with the intuitions that HapMax's creators had, and I think that utility functions which use linear aggregation will in general tend to produce results that are counterintuitive in dramatic (and therefore likely very bad) ways.

Rather than "counterintuitive," I'd prefer "inhuman" or "unfriendly." If the creators had linear utility functions on the same stuff, HapMax would fit in just fine. If humans have a near-linear utility function on something, then an AI that has a linear utility function there will cause no catastrophes. I can't think of any problems unique to linear weighting - the problem is really when the weighting isn't like ours.

I think we should figure out what our true utility function is, and then, if we have reason to be worried about bugs/errors in the AI, determine how to prevent or mitigate such errors, which may include programming a utility function into the AI that's not the true utility function, but an approximation that is less vulnerable to errors. I don't see how we can do that without knowing the true utility function, or the likelihood of various kinds of errors. It seems like you're jumping way too far ahead.

I think we know of at least one working idea of the utility function for one individual human in theory - show the human all the possible simulated futures and allow them to pick the best.

Of course that would be a complex vector utility function, and you couldn't hope to map that down to a single simple scalar value that aggregates (even with voting).

The true optimal theoretical 'utility function' for humanity as a whole is probably something akin to simulating the evolution of human culture/civilization forward into many possible futures, disseminating all that knowledge, feeding that back in on itself recursively with numerous rounds of re-simulation and then allowing current political/economic systems to evolve forward with all that processed future-path knowledge.

HapMax seemed like a good idea at first, but this example shows that it is very wrong.

I'm familiar with the Nozick utility monster. Convince me that the monster is wrong, instead of saying "it's obvious". Is there no utility level the monster could have that would justify giving it all resources? That seems very odd to me.

If there are 2 people in a space capsule that will take 1 day to reach Earth, and only enough oxygen for one person to live one day, would you have them both die? If not, how is that different?

"For example, there's John Rawls' Maximin Principle, which says that we should arrange society so as to maximize how well off the worst-off person is. Now, the Maximin Principle is extremely terrible - it implies that if we find the one person who's been tortured the most, and we can't stop them from being tortured but can make them feel better about it by torturing everyone else, then we should do so."

How does torturing everyone else better the condition of the helpless, most-tortured person? Unless somehow torturing others benefits the least well off, then it's just pointless torture, not "maximin."

The original poster did say, "but can make them feel better about it by torturing everyone else". The point is that the maximin principle implies that making the lives of many people much worse in order to make the life of a single person marginally better is justified as long as the one who benefits is the worst off among all people.

But Rawls is talking about distributing resources, not just making miserable people "feel better." He explicitly rejects the idea that basic liberties may be infringed by appeals to greater equality.

For maximin to be used in the way you're saying, we'd need a pretty bizarre scenario, one where, for instance, a single person has a terrible disease that can only be cured by more money and resources than all the money and resources of everyone in his Rawlsian society (anything short of that extreme would mean we could spread the cost around in a less onerous way; it would take a pretty big cost and a pretty small pool of resources to get close to making "many people much worse in order to make the life of a single person marginally better"). In addition to being outlandish, it seems like such a specific situation would fall outside the very general considerations of the original position.

Also, decimating the economy to cure one person would likely lead to lots of people being as bad or worse off than the sick person and an inability to cure future instances of disease. OP's discussion of the matter is wrong to the point of being embarrassing. Good catch.

If humans don't like feeding the Utility Monster, does that mean that they don't have the same terminal values as HapMax, or that they just have difficulty understanding just how happy the Utility Monster is?

(There are obvious anti-zombie reasons why "how happy the Utility Monster is" is probably ill-formed, but.)

It's also worth noting that lots of humans have thought that committing large-scale atrocities is okay if it leads to overall better results down the road, and in many cases made tremendous personal sacrifices to Feed the Utility Monster. Nowadays and hereaplace we tend to call these people "inhuman," but of course they were just as human as you or I.

You can't just give up Dutch book resistance, that's the same thing as not having preferences.

You don't have to linearly aggregate over people or whatever, but if you don't want to do it over probabilities then tell me which of the VNM axioms you disagree with.

VNM utility is a gross oversimplification of the types of complex value judgements a brain uses.

A better approximation of human value would be a vector utility function, and then all the VNM axioms beyond 1 break down.

The closest humans ever get to a scalar utility function is with money, but money never represents our actual utility.

$1 million isn't 1000 times better than $1000, for example. A particular X dollar evaluation is a gross approximation of a complex vector utility representing the set of things you could do with that money.

I wasn't saying that money=utility, I was just claiming that utility exists. Are you saying axiom 2 is inappropriate to use when modeling human preferences? Can you give any example of three things that preference is not transitive over? Obviously you could trick irrational brains into violating transitivity but, on a full rational analysis, all our preferences are transitive. Given the choice between irrational-brain-value-judgments and brain-judgments + rational reflection, I'd choose the second every time (except maybe some pathological cases that I don't care enough about to bother checking whether they're actually possible).

Can you give any example of three things that preference is not transitive over?

1. A(paper), B(rock), C(scissors). A>B>C>A

2. Transitivity may not hold in situations where all the choices are not available at once. For example given activity choices A(fishing), B(dancing), C(reading) I may pick A>B>C when made aware of all choices, but in isolation may pick A>B, B>C, C>A.

This becomes more pronounced as you factor in time to evaluate choices (such as complex potential futures)

  1. A(paper), B(rock), C(scissors). A>B>C>A

This doesn't seem to have anything to do with intransitive preferences. Paper beats rock if they are played against each other, but you don't prefer paper over rock. Well, you do if your opponent plays scissors, or if you have some foreknowledge that they're especially likely to play scissors, but in the absence of that...


Two isolated tribes in the Amazon meet every summer solstice at the Ziggurat of the Nameless Raven-God. In the previous year the elders of each respective tribe have debated frantically amongst themselves which sacrifice to bring the nameless one to win his favor for the coming year.

It is said that the Ashen-Feathered Night prefers its own kind as a sacrifice, at least over an offering of obsidian, for the latter lacks the inky blackness of the feathered kind.

In its aspect as the Demon Gate of Truth, however, the nameless one would rather feast on the rotten, amputated limbs of those maimed over the previous year. This gift would pale in comparison to ravens, who of course speak only lies.

Finally, the nameless one is also the Endless Macuahuitl, which requires precious obsidian blades to lengthen its diabolically long grinding edge. Bound by ancient law to only sever living flesh, it would recoil in anger upon an offering of dead flesh.

Three aspects then, for the same terrible Nameless Raven-God.

So, late into Midsummer's Eve the elders debate which aspect the other tribe will attempt to please. Raven, flesh, or obsidian -- only a gift more favorable to the nameless one will convince it to withhold its baleful curses from one tribe.

I would advise them to write down a probability distribution and calculate the utilities of pleasing vs displeasing the Raven-God; that transitivity holds should then be obvious.


The god's preferences are intransitive. I don't know how to make this clearer.

The god's preferences depend on its state. It prefers Ashen-Feathered Night+raven over Ashen-Feathered Night+obsidian, but does not in general prefer raven to obsidian. A preference must take into account all relevant factors.

Gods are not humans, nor rational. The only entities making actual choices are the tribes.


Eh, forget it. I'm turning in my Bardic Conspiracy membership.

Eh. I liked the story, and the imagery, I just didn't find it at all a good argument.

It's kind of silly, but I'm thinking of the subset games where you only ever get 2 options.

If I and my opponent only have A(paper), B(rock) to choose from, then always A > B. Likewise B>C, C>A.

I'm not sure how this maps to larger practical situations, but one may be able to make some analogy out of it.

Actually, rock-paper-scissors comes up in strategy games frequently.

The case in strategy games is not intransitive. Given any distribution, there is an optimal play against it. For example, if my opponent played 40% rock, 30% paper, and 30% scissors, I would prefer paper, then rock, then scissors. If your opponent plays all three equally, there are no preferences, not circular preferences. Randomization is used to prevent the opponent from gaining information about you. If you could use a pseudorandom method to exploit failures in their cognition and win in the long run, that is a preferable strategy.

In none of these cases would I decide to throw rock, then realize paper is better and change my choice, then realize scissors is better, ad infinitum. Paper is not necessarily a better choice than rock, it would just beat rock in a game. Equating these two concepts is a level confusion.
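The 40/30/30 example above can be checked with a small calculation. The scoring (+1 for a win, -1 for a loss, 0 for a tie) is an assumption; any scoring that treats wins and losses symmetrically gives the same ranking.

```python
from typing import Dict

# Each move mapped to the move it defeats.
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}


def expected_payoff(move: str, opponent: Dict[str, float]) -> float:
    """Expected score of `move` against the opponent's mixed strategy."""
    total = 0.0
    for other, p in opponent.items():
        if BEATS[move] == other:
            total += p   # we win
        elif BEATS[other] == move:
            total -= p   # we lose
    return total


opponent = {"rock": 0.4, "paper": 0.3, "scissors": 0.3}
ranking = sorted(BEATS, key=lambda m: expected_payoff(m, opponent), reverse=True)
print(ranking)  # ['paper', 'rock', 'scissors']

# Against a uniform opponent every move has the same expected payoff:
# indifference, not a cycle.
uniform = {"rock": 1 / 3, "paper": 1 / 3, "scissors": 1 / 3}
print([expected_payoff(m, uniform) for m in BEATS])
```

For any fixed opponent distribution the moves are ordered by a single number, so the resulting preferences are transitive.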

Transitivity may not hold in situations where all the choices are not available at once. For example given activity choices A(fishing), B(dancing), C(reading) I may pick A>B>C when made aware of all choices, but in isolation may pick A>B, B>C, C>A.

Would you really call that rational? If my brain behaved this way, I would attempt to correct it.

Either way, if axiom 2 were interpreted as referring to choices made when all options were known, for example if you knew you could fish, dance, or read and were asked to rank among all of them, the VNM theorem would still work. In this case, you would never say C is better than A because you would always be aware of B.

If I and my opponent only have A(paper), B(rock) to choose from, then always A > B. Likewise B>C, C>A.

If you and your opponent only have paper and rock to choose from, this is correct. But if that is the case, then you are not considering two options within the existing game, you are considering a different game entirely. To equate your preference for paper over rock in a game of Rock-Paper, with a preference for paper over rock in a game of Rock-Paper-Scissors, is a confusion. In that case, the scenario would read, "My opponent can throw Rock, Paper, or Scissors; if we assume I don't want to go Scissors (but my opponent does not know this), what should I do?" Within the given game, there are no intransitive preferences.

If you have a (VNM expected) utility function and those subdivisions are also (VNM expected) utility functions, the only reasonable way to aggregate them is linear weighting.

Otherwise, the big utility function won't agree with the small utility functions about which lotteries are best.

I acknowledge that this is a problem, but my claim is that this is less of a problem than allowing one broken small utility function to take over the whole utility function by rescaling itself.

Why do you think that the big utility function has to have problems?

I suppose because we're constructing it out of clearly-defined-but-wrong approximations to the small utility functions.

In which case, we should deviate from addition in accordance with the flaws in those approximations.

Suppose that we expect small functions to sometimes break. Then E(actual utility|calculated utility=x) looks similar to x when |x| is small, but is much closer to 0 when |x| is large. If we can estimate this S-curve, we make our method more robust against this particular problem.
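As a rough sketch of such a curve, suppose (purely as an assumption) that a working fragment reports its actual utility, drawn from a narrow zero-mean Gaussian, while a broken fragment reports a value from a much wider Gaussian that is unrelated to the actual utility. The parameter names and default values below are hypothetical.

```python
import math


def normal_pdf(x: float, sigma: float) -> float:
    """Density of a zero-mean Gaussian with standard deviation sigma."""
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def expected_actual(x: float, p_broken: float = 0.01,
                    sigma_ok: float = 1.0, sigma_broken: float = 1000.0) -> float:
    """E(actual utility | calculated utility = x) under the assumed mixture.

    If the fragment works, actual == x; if it is broken, the expected
    actual utility is 0. So the estimate is x times P(works | x).
    """
    ok = (1.0 - p_broken) * normal_pdf(x, sigma_ok)
    broken = p_broken * normal_pdf(x, sigma_broken)
    if ok + broken == 0.0:
        # Both densities underflowed; treat the reading as uninformative.
        return 0.0
    return x * ok / (ok + broken)


for x in (0.5, 2.0, 5.0, 50.0):
    print(x, expected_actual(x))
```

Under these assumptions the estimate tracks x for small |x| and falls back toward 0 once x is large enough that "the fragment is broken" becomes the likelier explanation.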

Another inference we can make is that, when |x| is large, investigating whether or not expected utility is closely approximating actual utility becomes more useful, and any systems that could do this are better ideas.

We should, usually, construct the analysis of further possible problems, such as problems with this approximation, in the same manner: By looking at what deviations between estimated utility and actual utility occur.

HapMax's aggregates by simple addition its utility function is divided by agent, but that makes an awful lot of dissimilar fragments.

I can't parse this sentence.

Assume some punctuation was intended between "HapMax's aggregates by simple addition" and "its utility function is divided by agent". Possibly change the comma after agent too.

Thanks. That does make sense.
