
When we talk of aiming for the good future for humanity – whether by aligning AGI or any other way – it's implicit that there are some futures that "humanity" as a whole would judge as good. That in some (perhaps very approximate) sense, humanity could be viewed as an agent with preferences, and that our aim is to satisfy said preferences.

But is there a theoretical basis for this? Could there be? What would it look like?

Is there a meaningful frame in which humanity can be viewed as optimizing for its purported preferences across history?

Is it possible or coherent to imagine a wrapper-mind set to the task of maximizing for the utopia, whose activity we'd actually endorse?

This post aims to sketch out answers to these questions. In the process, it also outlines how my current models of basic value reflection and extrapolation work.


Informal Explanation

Basic Case

Is a utopia that'd be perfect for everyone possible?

The short and obvious answer is no. Our civilization contains omnicidal maniacs and true sadists, whose central preferences are directly at odds with the preferences of most other people. Their happiness is diametrically opposed to other people's.

Less extremely, it's likely that most individuals' absolutely perfect world would fail to perfectly satisfy most others. As a safe example, we could imagine someone who loves pizza, yet really, really hates seafood, to such an extent that they're offended by the mere knowledge that seafood exists somewhere in the world. Their utopia would not have any seafood anywhere – and that would greatly disappoint seafood-lovers. If we now postulate the existence of a pizza-hating seafood-lover... Well, it would seem that their utopias are directly at odds.[1]

Nevertheless, there are worlds that would make both of them happy enough. A world in which everyone is free to eat food that's tasty according to their preferences, and is never forced to interact with the food they hate. Both people would still dislike the fact that their hated dishes exist somewhere. But as long as food-hating is not their core value that's dominating their entire personality, they'd end up happy enough.

Similarly, it intuitively feels that worlds which are strictly better according to most people's entire arrays of preferences are possible. Empowerment is one way to gesture at it – a world in which each individual is simply given more instrumental resources, a greater ability to satisfy whatever preferences they happen to have. (With some limitations on impacting other people, etc.)

But is it possible to arrive at this idea from first principles? By looking at humanity and somehow "eliciting"/"agglomerating" its preferences formally? A process like CEV? A target to hit that's "objectively correct" according to humanity's own subjective values, rather than your subjective interpretation of its values?

Paraphrasing, we're looking for a utility function such that the world-state maximizing it is ranked very highly by the standards of most humans' preferences; a utility function that's correlated with the "agglomeration" of most humans' preferences.

Let's consider what we did in the foods example. We discovered two disparate preferences, and then we abstracted up from them – from concrete ideas like "seafood" and "pizza", to an abstraction over them: food-in-general. And we discovered that, although the individuals' preferences disagreed on the concrete level, they ended up basically the same at the higher level. Trivializing, it turned out that a seafood-optimizer and a pizza-optimizer could both be viewed as tasty-food-optimizers.

The hypothesis, then, would go as follows: at some very high abstraction level, the level of global matters and fundamental philosophy, most humans' preferences converge to the same utility function over some variable. For example, "maximize eudaimonia" or "human empowerment" or "human flourishing".

There's a counting argument that slightly supports this. Higher abstraction levels are less expressive: they include fewer objects/variables (fewer countries than people, fewer stars than atoms, fewer galaxies than stars) and these objects have fewer states (fewer moods than the ways your brain's atoms could be arranged). So the mapping-up of values to them isn't injective. Thus, some conflicting low-level preferences would map to the same preference over the same high-level variable.

That is, of course, a hypothesis. Nevertheless, the mere fact that we can coherently state it is reassuring regarding our ability to eventually test it.

 

Is Humanity a Utopia-Maximizer?

Maybe. I don't strongly believe in this, but here's a sketch:

If human values indeed converge like this, then perhaps humanity can be viewed as an approximate agent that's been approximately optimizing for building a utopia for its entire history. But those "approximately"s do a lot of work; there's plenty of noise involved.

The primary issue is that the distribution of power among its constituents is non-uniform and changes dynamically. At different times, people with different preferences amass disproportionate amounts of resources (often by orders of magnitude), and "deviate" humanity's path away from the hypothetical averaged-out course, in their individual preferred directions.

But the balance of power frequently changes, and how technologies change it is relatively unpredictable. So potentially these effects actually cancel out on average, and humanity stays roughly on-target? (Pizza-lovers being in charge for 100 years are replaced by seafood-lovers ruling for 100 years; and while they cancel out their specific preferences, both end up advancing humanity towards having tasty food around.)

It would explain why our world, like, actually does mostly get better over time. As well as provide some grounding to the ideas of "moral progress".

Nevertheless, the approximations there may be extremely noisy, to the point that looking at things this way may not be useful.

 

Is a Utopia-Maximizer Desirable?

Assuming this hypothetical utopian utility function exists and we derive it, would it be possible to then plug it into some idealized agent/wrapper-mind, and not be horrified at the results?

On my view, the answer is obviously yes. There's a bunch of confusions around this idea that I'd like to address; mainly around what "a fixed goal" implies.

Consider a paperclip-maximizer. It wants the universe to be full of paperclips. If it gets its way, it'd reassemble all matter, including itself, into them.

Note, however, that it would not necessarily aim to freeze them in time. Intuitively, it would be fine with the paperclips still orbiting each other, impacting each other, and so on. Moreover, by the very definition of "a paperclip", there'd be all sorts of subatomic processes happening within them. The paperclip-maximizer would want those to run their natural course. Its utility would stay constant as that happens; invariant under these transformations of the world-state.

Similarly, the maximum of a utopia-maximizer would be defined over an enormous equivalence class of world-states. It would not aim to freeze humanity in time, or impose some specific unchanging social order, or tile the universe with copies of specific people that it deemed most optimal for experiencing happiness, etc.

Its utility would be invariant under individual humans changing over time, under them forging new relationships, under societal structures changing and events generally moving forward. As long as those processes don't wander into some nightmarish outcomes. That's the main function it'd provide: a sort of "safety net", lower-bounding how bad things could get. (And currently, they are very, very bad.)

Indeed, being a wrapper-mind doesn't even disqualify you from being a person (as nostalgebraist's post claims). Your utility can be invariant and maximal under many possible internal states. You can grow and change as a person, even if you have a fixed hard-wired goal that you ultimately serve.

Similarly, it's not unreasonable to suggest that most humans are (effectively isomorphic to) wrapper-minds.


Formal Model

Suppose that on your hands, you have an agent with a vast array of disparate preferences. It's a mess. They're stored in different formats (explicit vs. implicit, deontological vs. consequentialist, instrumental vs. terminal...), defined on different abstraction levels, often conflict with each other.

You want to optimize them, straighten them out. Resolve whatever conflicts they have, translate them to whatever domains you're working in, extrapolate them (to plan for the long-term), concretize them (to figure out what specific actions a philosophy demands of you), agglomerate them...

Why? Performance optimization. Sure, you could just do babble-and-prune search on your world-model, figuring out what would satisfy those preferences by brute force. But that'd be ruinously compute-intensive. You'd like to cache some of them, derive some heuristics from them, resolve conflicts to stop wasting time on those, etc.

How can you sort it out? What target are you even aiming at?

Well, the purpose of utility functions/preferences is to recommend what actions to take. Indeed, that's their main contribution: they define a preference ordering over candidate plans/actions, either directly (deontology), or by way of looking at what worlds a given action would bring about (consequentialism).

Thus, the correct process of value-system performance-optimization would be made up of transformations such that the preference ordering over actions is invariant under them. I. e., the value-optimized agent would always take the same actions in any given situation as the initial agent (if the latter were given sufficient time to think).

Let's see where that can get us.

 

Deontological Preferences

To start off, deontological preferences are isomorphic to utility functions, and utility functions are isomorphic to deontological preferences. They're related by the softmax function:

$$P(a) = \frac{e^{U(a)}}{\sum_{a'} e^{U(a')}}$$

Take a given deontological rule, like "killing is bad". Let's say we view it as a constraint on the allowable actions; or, in other words, a probability distribution over your actions that "predicts" that you're very likely/unlikely to take specific actions. The above transform would let us translate it into a utility function over actions.

The other way around, a utility function can be viewed as defining some "target distribution" for the variable over which it's defined. Maximizing expected utility would then be equivalent to minimizing the cross-entropy between that target distribution and the real distribution.
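As a sanity check on that equivalence, here is a minimal numeric sketch. It assumes a finite action set and a softmax target distribution; the action names, utilities and policies are made up for illustration. Under those assumptions the expected utility of a policy equals a constant (the log of the normalizer) minus the cross-entropy between the policy and the target, so ranking policies by one is the same as ranking them by the other.

```python
import math

def softmax(utilities):
    """Target distribution with P(a) proportional to exp(U(a))."""
    top = max(utilities.values())  # subtract the max for numerical stability
    weights = {a: math.exp(u - top) for a, u in utilities.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}

def expected_utility(policy, utilities):
    return sum(policy[a] * utilities[a] for a in policy)

def cross_entropy(policy, target):
    # H(policy, target) = -sum_a policy(a) * log target(a)
    return -sum(p * math.log(target[a]) for a, p in policy.items() if p > 0)

# Hypothetical utilities standing in for a rule like "killing is bad".
utilities = {"help": 2.0, "ignore": 0.0, "kill": -10.0}
target = softmax(utilities)
log_z = math.log(sum(math.exp(u) for u in utilities.values()))

# Two hypothetical policies (distributions over actions).
policies = {
    "mostly_help": {"help": 0.9, "ignore": 0.1, "kill": 0.0},
    "indifferent": {"help": 1 / 3, "ignore": 1 / 3, "kill": 1 / 3},
}

for name, policy in policies.items():
    eu = expected_utility(policy, utilities)
    ce = cross_entropy(policy, target)
    # Identity being illustrated: E_policy[U] = log Z - H(policy, target)
    print(name, round(eu, 4), round(log_z - ce, 4))
```

Note that the identity goes through the logarithm of the target distribution, not through the distribution itself.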

And that's not simply an overly abstract trick: it's how human minds are actually hypothesized to work. See Friston's predictive-processing framework in neuroscience (you can start from these comments).

This also covers shards. They're self-executing heuristics bidding for specific actions over others. Thus, each could be transformed into a utility function without loss of information.

That's not at odds with how deontology is usually presented, either. Deontologists reject utility-maximization in the sense that they refuse to engage in utility-maximizing calculations using their conscious intelligence. But similar dynamics can still be at play "under the hood".

 

Value Conflict Resolution

Imagine an agent having two utility functions, $U_1$ and $U_2$. It's optimizing for their sum, $U_1 + U_2$. If the values are in conflict, if taking an action that maximizes $U_1$ hurts $U_2$ and vice versa, well, one of them almost surely spits out a higher value, so the maximization of $U_1 + U_2$ is still well-defined.

That's roughly how humans do work in practice. If we face a value conflict, we hesitate a bit (calculating the sum, the "winner"), but ultimately end up taking some action that we endorse.

... unless we hesitate too long, and time chooses for us. Or if we know we have to take action fast, and so decide to use some very rough approximations – and potentially make a mistake which we later regret.

Thus, there's purely practical value in reducing the number of internal conflicts. Finding a value $U_3$ such that, for all situations, it has the same preference ordering as $U_1 + U_2$, but its computational complexity is much lower.
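A minimal sketch of what "same preference ordering, lower complexity" could mean, assuming a small finite action set. The actions, the two utility tables and the names `u1`, `u2`, `u3` are hypothetical. The cheap replacement here is just a cached rank table, which is one of many possible compressions.

```python
# Hypothetical actions and two conflicting utility functions over them.
actions = ["donate", "steal", "work", "rest"]
u1 = {"donate": 3.0, "steal": -5.0, "work": 1.0, "rest": 0.0}   # "be a good person"
u2 = {"donate": -1.0, "steal": 2.0, "work": -0.5, "rest": 1.5}  # "have some fun"

def combined(action):
    """The expensive path: evaluate both values and add them up every time."""
    return u1[action] + u2[action]

# Resolve the conflict once, then cache only the resulting ranking.
ranking = sorted(actions, key=combined, reverse=True)
u3 = {action: len(ranking) - position for position, action in enumerate(ranking)}

def same_ordering(f, g, options):
    """True if f and g rank every pair of options the same way."""
    return all((f(a) > f(b)) == (g(a) > g(b)) for a in options for b in options)

print(ranking)
print(same_ordering(combined, u3.get, actions))
```

A rank table like this only preserves the ordering over individual actions. It would not, in general, preserve preferences over gambles between actions, which need the actual magnitudes.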

 

Value Extrapolation

Value extrapolation seems to be straightforward: it's just the reflection of the fact that the world can be viewed as a series of hierarchical ever-more-abstract models.

  1. Suppose we have a low-level model of reality $M_0$, with $n_0$ variables (atoms, objects, whatever).
  2. Suppose we "abstract up", deriving a simpler model of the world $M_1$, with $n_1$ variables. Each variable $x^1_k$ in it is an abstraction over some set of lower-level variables $\{x^0_j\}$, such that $x^1_k = f(\{x^0_j\})$.
    • Recap: Higher-level variables are, by definition, less expressive, i. e. the number of states they could be in is lower than the number of states the underlying system can be in. By the counting argument, that means their states are defined over (very large in practice) equivalence classes of low-level states.
    • Example: "I'm happy" is a high-level state that corresponds to a combinatorially large number of configurations my body's atoms can be in. Stipulating "I'm happy" only constrains my low-level state up to that equivalence class.
  3. We iterate, to $M_2$, ..., $M_N$. We derive increasingly more abstract models of the world.
    • Note: $n_0 > n_1 > \dots > n_N$. Since each subsequent level is simpler, it contains fewer variables. People to social groups to countries to the civilization; atoms to molecules to macro-scale objects to astronomical objects; etc.
  4. Let's define the function $P(\{x^i_j\} \mid x^{i+1}_k)$. I. e.: it returns a probability distribution over the low-level variables given the state of a high-level variable that abstracts over them.
    • Note: As per (2), that only constrains the low-level system to a (very large) equivalence class of states. (Though the distribution needn't be uniform.)
    • Example: If the world economy is in this state, how happy is my grandmother likely to be?
  5. If we view our values as a utility function $U$, we can "translate" our utility function from any $M_i$ to $M_{i+1}$ roughly as follows: $U_{i+1}(x^{i+1}_k) = \mathbb{E}_{\{x^i_j\} \sim P(\cdot \mid x^{i+1}_k)}\left[U_i(\{x^i_j\})\right]$
    • (There's a ton of complications there, but this expression conveys the core idea.)

... and then value extrapolation just naturally falls out of this.

Suppose we have a bunch of values at the $i$th abstraction level. Once we start frequently reasoning at the $(i+1)$th level, we "translate" our values to it, and cache the resultant functions. Since the $(i+1)$th level likely has fewer variables than the $i$th, the mapping-up is not injective: some values defined over different low-level variables end up translated to the same higher-level variable ("I like pizza and seafood" -> "I like tasty food", "I like Bob and Alice" -> "I like people"). This effect only strengthens as we go up higher and higher. At $M_N$, we can plausibly end up with only one variable we value (as previously speculated, "eudaimonia" or something).
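A toy sketch of one "translate up" step, under strong simplifying assumptions: a single low-level variable, a fixed abstraction map, and a made-up prior over low-level states. The state names, the prior and both utility tables are hypothetical. The high-level utility is taken to be the expected low-level utility conditional on the high-level state.

```python
# Hypothetical abstraction map: low-level state -> high-level state.
abstraction = {
    "eats_pizza": "eats_tasty_food",
    "eats_sushi": "eats_tasty_food",
    "eats_gruel": "eats_bland_food",
    "eats_nothing": "eats_bland_food",
}

# Hypothetical prior over low-level states (needn't be uniform).
prior = {"eats_pizza": 0.3, "eats_sushi": 0.2, "eats_gruel": 0.4, "eats_nothing": 0.1}

def conditional(high_state):
    """P(low | high): the prior restricted to one equivalence class, renormalized."""
    members = {low: p for low, p in prior.items() if abstraction[low] == high_state}
    total = sum(members.values())
    return {low: p / total for low, p in members.items()}

def translate_up(low_utility):
    """U_high(h) = E[U_low(low) | h] for every high-level state h."""
    high_states = sorted(set(abstraction.values()))
    return {
        h: sum(p * low_utility[low] for low, p in conditional(h).items())
        for h in high_states
    }

# Two people who disagree at the low level.
pizza_lover = {"eats_pizza": 10.0, "eats_sushi": 2.0, "eats_gruel": 0.0, "eats_nothing": -5.0}
sushi_lover = {"eats_pizza": 2.0, "eats_sushi": 10.0, "eats_gruel": 0.0, "eats_nothing": -5.0}

print(translate_up(pizza_lover))
print(translate_up(sushi_lover))
```

In this made-up example the two translated utilities differ in magnitude but agree on the ordering of the high-level states. Whether that holds in general is exactly what the convergence hypothesis asserts, so the sketch illustrates the mechanism and is not evidence for it.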

 

Putting It Together

Suppose we have a human on our hands, and we want to compile all of their values into a highly abstract utility function that the human would endorse. To do so, we:

  • Transform all values into the same format. (Either utility functions or probability distributions; doesn't really matter.)
  • Translate them around to reveal value conflicts.
  • Resolve those conflicts by finding equivalent-but-simpler utility functions.
  • Extrapolate them upwards, to the highest abstraction level.
  • We end up with[2] a distillation/compilation of that human's entire selfhood, in a format isomorphic to a utility function. The endpoint of their moral philosophy.

... if only it were this easy.

 

Major Problem: Meta-Preferences

Humans have preferences not only about object-level stuff, but also about the way they do the whole value-compilation process. The above model assumed an idealized process, in the sense of deriving a utility function that would always recommend the same actions as the initial array of values, but have dramatically lower computational complexity.

However, humans have meta-values that can express arbitrarily custom preferences regarding the process of value reflection itself. We might have preferences over...

  • ... basic translations. E. g., a deontologist's refusal to take money into account when choosing whose life to save. (Refusing to translate and account for that preference.)
  • ... how we extrapolate things up the abstraction levels. E. g., "I'm not going to let my petty preferences impact the future of humanity", such that you ignore your preference for pizza when defining the AGI's utility function (rather than biasing it towards it).
  • ... how we resolve value conflicts. E. g., if we have $U_1$ = "I want to be a good person" and $U_2$ = "I'd get a thrill out of stealing something", we often wouldn't just tweak $U_2$ such that it still fires, but only when stealing something wouldn't be against the society's interests. No: we just flat-out delete $U_2$.
  • Etc.

These complications currently have me worried that there's basically no way to elicit and compile a given human's preferences except directly simulating their mind. No shortcuts whatsoever. (And then that simulation would be path-dependent, such that, depending on what stimuli you show the human in what order, they might end up at vastly-different-yet-equally-legitimate endpoints. But that's a whole separate topic.)

Regardless, this doesn't kill the core idea. I'm reasonably sure (something like) the procedures I've defined are still what humans use most of the time. But there are more complex cases where meta-preferences are involved, they're often crucial, and I'm not sure there are elegant ways to handle them.

 

Egalitarian Agglomeration

Now onto the last step: how do we agglomerate values between different people? That is, suppose we've "compiled" the preferences of all individual people into a set of utility functions, and then picked just their most-abstract components, getting this set: $\{U_1, U_2, \dots, U_n\}$. How do we transform that into $U_{\text{humanity}}$?

Well, ideally, it'll turn out that $U_1 = U_2 = \dots = U_n$. That's the "strong" version of the "human value convergence hypothesis".

What if not, though?

The naive idea would be to just proceed as we had before, and find a simpler function that recommends the same actions as the individual functions' sum. But that has some undesirable properties, like a sensitivity to "utility monsters". The Geometric Rationality sequence has made that point rather well.

Thus, a better target would be a function that's equivalent to the product of individual humans' utility functions. Maximizing the product is equivalent to maximizing the expected logarithm of the utility of a randomly-chosen human, which heavily penalizes leaving anyone near zero; thus, it pushes towards distributing utility evenly across everyone. (I really recommend reading the Geometric Rationality sequence.)
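A small sketch of the difference between the two aggregation rules, assuming strictly positive utilities (the product rule needs that) and two made-up outcomes. The outcome names and numbers are hypothetical.

```python
import math

# Hypothetical outcomes: the utility each of three people receives.
outcomes = {
    "feed_the_monster": [1000.0, 0.1, 0.1],
    "share_evenly": [30.0, 30.0, 30.0],
}

def sum_score(utilities):
    """Additive aggregation: total utility."""
    return sum(utilities)

def product_score(utilities):
    """Log of the product of utilities, i.e. n times the mean log-utility."""
    return sum(math.log(u) for u in utilities)

best_by_sum = max(outcomes, key=lambda name: sum_score(outcomes[name]))
best_by_product = max(outcomes, key=lambda name: product_score(outcomes[name]))

print(best_by_sum)      # the additive rule favours the single huge utility
print(best_by_product)  # the product rule favours the even split
```

The additive rule is dominated by the one person with enormous utility, while the product rule is dragged down by anyone left close to zero.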

And that result is, theoretically,

  • A utility function that humanity-as-a-whole could be said to have been (very roughly) maximizing throughout its history.
  • A utility function that something like CEV might spit out.
  • A utility function whose maximization would rank high by most individual humans' preferences/utility functions.
  • A utility function we could hook up to a wrapper-mind, and then be happy with the result.
  1. ^

    I'm sure you can come up with less tame examples from, say, politics or social issues. Fill them in as needed.

  2. ^

    Well, that was a simplified description of the process. In practice, you'd need to mix these steps up repeatedly.

23 comments

Sure, every time you go more abstract there are fewer degrees of freedom. But there's no free lunch - there are degrees of freedom in how the more-abstract variables are connected to less-abstract ones.

People who want different things might make different abstractions. E.g. if you're calling some high level abstraction "eat good food," it's not that this is mathematically the same abstraction made by someone who thinks good food is pizza and someone else who thinks good food is fish. Not even if those people independently keep going higher in the abstraction hierarchy - they'll never converge to the same object, because there's always that inequivalence in how they're translated back to the low level description.

Yes, at high levels of abstraction, humans can all recommend the same abstract action. But I don't care about abstract actions, I care about real-world actions.

E.g. suppose we abstract the world to an ontology where there are two states, "good" and "bad," and two actions - stay or swap. Lo and behold, ~everyone who abstracts the world to this ontology will converge to the same policy in terms of abstract actions: make the world good rather than bad. But if two people disagree utterly about which low-level states get mapped onto the "good" state, they'll disagree utterly about which low-level actions get mapped onto the "swap from bad to good" action, and this abstraction hasn't really bought us anything.

People who want different things might make different abstractions

That's a direct rejection of the natural abstractions hypothesis. And some form of it increasingly seems just common-sensically true.

It's indeed the case that one's choice of what system to model is dependent on what they care about/where their values are housed (whether I care to model the publishing industry, say). But once the choice to model a given system is made, the abstractions are in the territory. They fall out of noticing to which simpler systems a given system can be reduced.

(Imagine you have a low-level description of a system defined in terms of individual gravitationally- and electromagnetically-interacting particles. Unbeknownst to you, the system describes two astronomical objects orbiting each other. Given some abstracting-up algorithm, we can notice that this system reduces to these two bodies orbiting each other (under some definition of approximation).

It's not value-laden at all: it's simply a true mathematical fact about the system's dynamics.

The NAH is that this generalizes, very widely.)

Not even if those people independently keep going higher in the abstraction hierarchy - they'll never converge to the same object, because there's always that inequivalence in how they're translated back to the low level description.

I mean, that's clearly not how it works in practice? Take the example in the post literally: two people disagree on food preferences, but can agree on the "food" abstraction and on both of them having a preference for subjectively tasty ones.

suppose we abstract the world to an ontology where there are two states, "good" and "bad," 

If your model is assumed, i. e. that abstractions are inherently value-laden, then yes, this is possible. But that's not how it'd work under the NAH and on my model, because "good" and "bad" are not objective high-level states a given system could be in.

It'd be something like State A and State B. And then the "human values converge" hypothesis is that all human values would converge to preferring one of these states.

Not even if those people independently keep going higher in the abstraction hierarchy - they'll never converge to the same object, because there's always that inequivalence in how they're translated back to the low level description.

I mean, that's clearly not how it works in practice? Take the example in the post literally: two people disagree on food preferences, but can agree on the "food" abstraction and on both of them having a preference for subjectively tasty ones.

I agree with the part of what you just said that's the NAH, but disagree with your interpretation.

Both people can recognize that there's a good abstraction here, where what they care about is subjectively tasty food. But this interpersonal abstraction is no longer an abstraction of their values, it simply happens to be about their values, sometimes. It can no longer be cashed out into specific recommendations of real-world actions in the way someone's values can[1].

  1. ^

    For certain meanings of "values," ofc.

Okay, let's build a toy model.

  1. We have some system with a low-level state $l$, which can take on one of six values: $\{a, b, c, d, e, f\}$.
  2. We can abstract over this system's state and get a high-level state $h$, which can take on one of two states: $\{x, y\}$.
  3. We have an objective abstracting-up function $F(l) = h$.
  4. We have the following mappings between states: $\{a, b, c\} \to x$ and $\{d, e, f\} \to y$.
  5. We have a utility function $U_A$, with a preference ordering of $a > b > c > \{d, e, f\}$, and a utility function $U_B$, with a preference ordering of $c > b > a > \{d, e, f\}$.
  6. We translate both utility functions to $h$, and get the same utility function: $U_H$ whose preference ordering is $x > y$.

Thus, both $A$ and $B$ can agree on which high-level state they would greatly prefer. No low-level state would maximally satisfy both of them, but they both would be happy enough with any low-level state that gets mapped to the high-level state of $x$. ($b$ is the obvious compromise.)

Which part of this do you disagree with?

I disagree that translating to x and y lets you "reduce the degrees of freedom" or otherwise get any sort of discount lunch. At the end you still had to talk about the low level states again to say they should compromise on b (or not compromise and fight it out over c vs. a, that's always an option).

At the end you still had to talk about the low level states again to say they should compromise on b 

"Compromising on $b$" is a more detailed implementation that can easily be omitted. The load-bearing part is "both would be happy enough with any low-level state that gets mapped to the high-level state of $x$".

For example, the policy of randomly sampling any $l$ such that $F(l) = x$ is something both utility functions can agree on, and doesn't require doing any additional comparisons of low-level preferences, once the high-level state has been agreed upon. Rising tide lifts all boats, etc.
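A sketch of the toy model and the sampling policy, with every number filled in as an assumption. The mapping (a, b, c to x; d, e, f to y) and the two preference orderings are inferred from how the thread talks about these states; the utility values and the uniform weighting within each class are arbitrary, and `u_first` and `u_second` are hypothetical names for the two agents.

```python
import random

# Assumed abstraction map from low-level to high-level states.
abstract_up = {"a": "x", "b": "x", "c": "x", "d": "y", "e": "y", "f": "y"}

# Assumed utilities for the two agents; only the orderings are meant to matter.
u_first = {"a": 3, "b": 2, "c": 1, "d": -3, "e": -3, "f": -3}
u_second = {"a": 1, "b": 2, "c": 3, "d": -3, "e": -3, "f": -3}

def translate(utility):
    """High-level utility as the plain average over each equivalence class."""
    classes = {}
    for low, high in abstract_up.items():
        classes.setdefault(high, []).append(utility[low])
    return {high: sum(values) / len(values) for high, values in classes.items()}

def sample_low_level(high_state, rng):
    """Pick any low-level state that maps to the agreed high-level state."""
    candidates = sorted(low for low, high in abstract_up.items() if high == high_state)
    return rng.choice(candidates)

high_first = translate(u_first)
high_second = translate(u_second)
agreed_state = max(high_first, key=high_first.get)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
world = sample_low_level(agreed_state, rng)
print(high_first, high_second, agreed_state, world)
```

With these numbers both agents translate to the same high-level utility, and any sampled world beats every state in the other class for both of them. The flatworm objection that follows targets precisely the assumptions baked into these tables.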

Suppose the two agents are me and a flatworm.
a = ideal world according to me
b = status quo
c = ideal world according to the flatworm
d, e, f = various deliberately-bad-to-both worlds

I'm not going to stop trying to improve the world just because the flatworm prefers the status quo, and I wouldn't be "happy enough" if we ended up in flatworm utopia.

What bargains I would agree to, and how I would feel about them, are not safe to abstract away.

I wouldn't be "happy enough" if we ended up in flatworm utopia

You would, presumably, be quite happy compared to "various deliberately-bad-to-both worlds".

I'm not going to stop trying to improve the world just because the flatworm prefers the status quo

Because you don't care about the flatworm and the flatworm is not perceived by you as having much bargaining power for you to bend to its preferences.

In addition, your model rules out more fine-grained ideas like "the cubic mile of terrain around the flatworm remains unchanged while I get the rest of the universe". Which is plausibly what CEV would result in: everyone gets their own safe garden, with the only concession the knowledge that everyone else's safe gardens also exist.

It's interesting that part of human value might be having our actions matter. But if you build an AI that can give you all the things, or even if you could've built such an AI but chose not to, then objectively your actions no longer matter much after that. I've no idea how even CEV could approach this problem.

Edit: I think I've figured it out. The AI shouldn't try to build the best world according to CEV, it should take the best action for an AI to take according to CEV. So if the AI notices that humans strongly prefer to be left alone with their problems, it'll just shut down. Or find some other way to ensure that humans can't rely on AIs for everything.

I think the "deontological preferences are isomorphic to utility functions" is wrong as presented.

First, the formula has issues with dividing by zero and not summing probabilities to one (and re-using a variable as a local variable in the sum). So you probably meant something like $P(a) = \frac{e^{U(a)}}{\sum_{a'} e^{U(a')}}$. Even then, I don't think this describes any isomorphism of deontological preferences to utility functions.

  • Utility functions are invariant when multiplied with a positive constant. This is not reflected in the formula.

  • Utility maximizers usually take the action with the best utility with probability $1$, rather than using different probabilities for different utilities.

  • Modelling deontological constraints as probability distributions doesn't seem right to me. Let's say I decide between drinking green tea and black tea, and neither of those violates any deontological constraints; then assigning some values (which ones?) to P("I drink green tea") or P("I drink black tea") doesn't describe these deontological constraints well.

  • Any behavior can be encoded as utility functions, so finding any isomorphisms to utility functions is usually possible, but not always meaningful.

the formula has issues with dividing by zero and not summing probabilities to one 

Well, that was one embarrassing typo. Fixed, and thanks for pointing it out.

Utility functions are invariant when multiplied with a positive constant. This is not reflected in the formula.

It is. Utility functions are invariant under ordering-preserving transformations. Exponentiation is order-preserving (rises monotonically), and so is multiplying by the constant of $1/\sum_{a'} e^{U(a')}$.

Let's say I decide between drinking green tea and black tea, and neither of those violates any deontological constraints; then assigning some values (which ones?) to P("I drink green tea") or P("I drink black tea") doesn't describe these deontological constraints well.

Interpreted as a probability distribution, it assigns the same probability to both actions. In practice, you can imagine some sort of infrabayesianism-style imprecise probabilities being involved: the "preference" being indifferent between the vast majority of actions (and so providing no advice one way or another) and only expressing specific for vs. against preferences in a limited set of situations.

Utility functions are invariant under ordering-preserving transformations.

Utility functions in the sense of VNM, Savage, de Finetti, Jeffrey-Bolker, etc. are not invariant under all ordering-preserving transformations, only affine ones. Exponentiation is not affine.

What sort of utility function do you have in mind?

Oops, you're right. I clearly took too many mental shortcuts when formulating that response.

What sort of utility function do you have in mind?

The reason this still works is because in the actual formulation I had in mind, we then plug the utility-function-transformed-into-a-probability-distribution into a logarithm function, canceling out the exponentiation. Indeed, that was the actual core statement in my post: that maximizing expected utility is equivalent to minimizing the cross-entropy between some target distribution and the real distribution.

But evidently I decided to skip some steps and claim that the utility function is directly equivalent to the target distribution. That was, indeed, unambiguously incorrect.

I think this post is mostly about how to do the reflection, consistentising, and so on.

But at the risk of oversimplifying, let's pretend for a moment we just have some utility functions.

Then you can for sure aggregate them into a mega utility function (at least in principle). This is very underspecified, predominantly because of the open question of how to weight individual utility functions in the aggregation. (Holden has a nice discussion of Harsanyi's aggregation theorem which goes into some more discussion, but yes, we have not found it written in the universe how to weight the aggregation.)

There's also an interesting relationship (almost 1-1 aside from edge-cases) between welfare optima (that is, optima of some choice of weighted aggregation of utilities as above) and Pareto optima[1] (that is, outcomes unimprovable for anyone without worsening for someone). I think this, together with Harsanyi, tells us that some sort of Pareto-ish target would be the result of 'the most coherent' possible extrapolation of humanity's goals. But this still leaves wide open the coefficients/weighting of the aggregation, which in the Pareto formulation corresponds to the position on the Pareto frontier. BTW Drexler has an interesting discussion of cooperation and conflict on the Pareto frontier.

I have a paper+blogpost hopefully coming out soon which goes into some of this detail and discusses where that missing piece (the welfare weightings or 'calibration') comes from (descriptively, mainly; we're not very prescriptive unfortunately).


  1. This connection goes back, as far as I know, to the now-eponymous ABB theorem of Arrow, Barankin and Blackwell in 1953, and there's a small lineage of follow-up research exploring the connection. ↩︎

yes, it is not written in the universe how to weight the aggregation

I think it's written, but not in behavior.

Imagine two people whose behavior is encoded by the same utility function - they both behave as if they valued chocolate as 1 and vanilla as 2. But internally, the first person feels very strongly about all of their preferences, while the second one is very even-tempered and mostly feels ok no matter what. (They'd climb the same height of stairs to get vanilla, too, because the second person is more indifferent about vanilla but also is less bothered by climbing stairs.) Then we want to give them different weight in the aggregation, even though they have the same utility function. That means the correct weighting should be inferred from internal feelings, not only from behavior.

Another, more drastic thought experiment: imagine a box that has no behavior at all, but in fact there's a person inside. You have to decide whether to send resources into the box. For that you need to know what's in the box and what feelings it contains.

I swiftly edited that to read

we have not found it written in the universe

but your reply obviously beat me to it! I agree, there is plausibly some 'actual valence magnitude' which we 'should' normatively account for in aggregations.

In behavioural practice, it comes down to what cooperative/normative infrastructure is giving rise to the cooperative gains which push toward the Pareto frontier. e.g.

  • explicit instructions/norms (fair or otherwise)
  • 'exchange rates' between goods or directly on utilities
  • marginal production returns on given resources
  • starting state/allocation in dynamic economy-like scenarios (with trades)
  • differential bargaining power/leverage

In discussion I have sometimes used the 'ice cream/stabbing game' as an example

  • either you get ice cream and I get stabbed
  • or neither of those things
  • neither of us is concerned with the other's preferences

It's basically a really extreme version of your chocolate and vanilla case. But they're preference-isomorphic!
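
A minimal sketch of the "preference-isomorphic" point, using the ice cream/stabbing game. The intensity factors are hypothetical stand-ins for whatever 'actual valence magnitude' might be; nothing in either player's behaviour pins them down.

```python
from typing import Dict, Sequence

OUTCOMES = ("ice_cream_and_stabbing", "neither")

# Behavioural utilities: each player simply prefers one outcome to the other.
you: Dict[str, float] = {"ice_cream_and_stabbing": 1.0, "neither": 0.0}
me: Dict[str, float] = {"ice_cream_and_stabbing": 0.0, "neither": 1.0}


def best_choice(utility: Dict[str, float]) -> str:
    """What a player would pick if the decision were theirs alone."""
    return max(OUTCOMES, key=lambda o: utility[o])


def rescale(utility: Dict[str, float], intensity: float) -> Dict[str, float]:
    """Multiply a utility function by a positive intensity factor (an assumption)."""
    return {o: intensity * v for o, v in utility.items()}


def aggregate_choice(utilities: Sequence[Dict[str, float]]) -> str:
    """Outcome maximizing the plain sum of the given utility functions."""
    return max(OUTCOMES, key=lambda o: sum(u[o] for u in utilities))


# Rescaling never changes what a player chooses for themselves...
me_intense = rescale(me, 1000.0)
you_intense = rescale(you, 1000.0)

# ...but it decides which outcome the aggregate picks.
stabbing_matters_more = aggregate_choice([you, me_intense])
ice_cream_matters_more = aggregate_choice([you_intense, me])
```

Each player's individual behaviour is identical before and after rescaling, so the two scenarios cannot be told apart from choices alone, yet the aggregate flips between them.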

Is an utopia that'd be perfect for everyone possible?

The short and obvious answer is no. Our civilization contains omnicidal maniacs and true sadists, whose central preferences are directly at odds with the preferences of most other people. Their happiness is diametrically opposed to other people's.

If you are constrained to keep omnicidal maniacs in the same world as everyone else, this is obviously true. 

But it doesn't seem to be obviously true in every possible future. Imagine a world where every single sentient mind is wireheaded into their own personal paradise, interacting only with figments designed to maximize their preferences.

Being embedded in a fake reality and fooled into believing it's true would be against many people's preferences.

(I suppose you can imagine some galaxy-brained ideas for emigration into different regions of Tegmark IV, Permutation City-style?)

Being embedded in a fake reality and fooled into believing it's true would be against many people's preferences.

Strongly agree; I have an old, short post about this. See also Contact with reality.

Some people might (under reflection) be locally-caring entities, but most people's preferences are about what reality actually contains, and they (even under reflection) wouldn't want to, for example, press a button that causes them to mistakenly believe that everything is fine.

Being embedded in a fake reality and fooled into believing it's true would be against many people's preferences.

Only if they can see through the illusion in the first place. Suppose that the happiness-maximizing AGI could throw you into a perfect fake reality without you even noticing. One day you wake up, and from that point on everything goes perfectly well for you in some very plausible manner. After a while, your happiness has reached immensely high levels. "This poor schmuck doesn't want to be wireheaded" is not a valid enough justification for not wireheading them in secret. (For comparison, consider an angry teenager who breaks up with their partner and firmly says they want to immediately die. You are probably going to ignore the stated preference and prevent them from committing suicide, knowing that this will result in much higher future happiness.)

Only if they can see through the illusion in the first place

Well, yes, but the question is what humanity wants. Humanity does not want that to be done to it. It's not what CEV would output, and if you're in the position to implement a wrapper-mind with some utility function of your choice, putting that in it would be a violation of human preferences on a cosmological scale. That's very much not leaving your fingerprints off the future.

consider an angry teenager who breaks up with their partner and firmly says they want to immediately die

That's probably not what they'd actually want on reflection (i.e., if you ran the value conflict resolution and value extrapolation steps), just what they want in the moment.

"Humanity" is not a single agent. It can have preferences only insofar as the individual humans share similar preferences. If you are a happiness maximizer, for every individual human you look at, you'll probably find that their happiness would be maximized by wireheading (because most whimsical desires like becoming the king of the world are not feasible to satisfy otherwise).

I'm not even that sure that CEV would avoid this problem. In what way is being enclosed in a perfect world not the best thing for you? Because it would be fake? But how do you know it's fake? Imagine that an AGI offers to teleport you to another planet, which is perfectly suitable for you; you'll land there and thrive forever. Now imagine that instead of actually teleporting you to another planet, it just lets you believe that it did; you'll (mentally) land there and thrive forever anyway. I mean, your brain is experiencing the very same thing! It's not obvious that the second option is worse than the first one, unless you have a hardcoded limit like "Thou Shalt Not Wirehead People".

we often wouldn't just tweak  such that it still fires, but only when stealing something wouldn't be against the society's interests. No: we just flat-out delete .

Heh maybe. I also enjoy stealing things in RPGs :P