It seems that a good portion of the "maximizing utility" strategy a sovereign might use relies on actually being able to consolidate human preferences into utilities. I think there are a few stages here, each of which may present obstacles. I'm not sure what the current state of the art is for overcoming them, and I'd like to find out.

First, here are a few assumptions that I'm using just to make the problem a bit more navigable (dealing with one or two hard problems instead of a bunch at once). I will need to go back and drop each of these (and each combination thereof) to see what additional problems result.

  1. The sovereign has infinite computing power (and to shorten the list of assumptions, can do 2-6 below)
  2. We're maximizing across the preferences of a single human (Alice for convenience). To the extent that Alice cares about others, we're accounting for their preferences, too. But we're not dealing with aggregating preferences across different sentient beings, yet. I think this is a separate hard problem.
  3. Alice has infinite computing power.
  4. We're assuming that Alice's preferences do not change and cannot change, ever, no matter what happens. So as Alice experiences different things in her life, she has the exact same preferences. No matter what she learns or concludes about the world, she has the exact same preferences. To be explicit, this includes preferences regarding the relative weightings of present and future worldstates. (And in CEV terms, no spread, no distance.)
  5. We're assuming that Alice (and the sovereign) can deductively conclude the future from the present, given a particular course of action by the sovereign. Picture a single history of the universe from the beginning of the universe to now, and a bunch of worldlines running into the future depending on what action the sovereign takes. To clarify, if you ask Alice about any single little detail across any of the future worldlines, she can tell you that detail.
  6. Alice can read minds and the preferences of other humans and sentient beings (implied by 5, but trying to be explicit.)

So Alice can conclude anything and everything, pretty much (and so can our sovereign.) The sovereign is faced with the problem of figuring out what action to take to maximize across Alice's preferences. However, Alice is basically a sack of meat that has certain emotions in response to certain experiences or certain conclusions about the world, and it doesn't seem obvious how to get the preference ordering of the different worldlines out of these emotions. Some difficulties:

  1. The sovereign notices that Alice experiences different feelings in response to different stimuli. How does the sovereign determine which types of feelings to maximize, and which to minimize? There are a bunch of ways to deal with this, but most of them seem to have a chance of error (and the cumulative probability of error across all the times that the sovereign will need to do this approaches 1). For example, one could train on an existing data set, or have the sovereign simulate other humans with access to Alice's feelings and cognition and have a simulated committee discuss and reach a decision on each one, etc. But all of these bootstrap off of the assumed ability of humans to determine which feelings to maximize (just with amped-up computing power) - this doesn't strike me as a satisfactory solution.
  2. Assume 1. is solved. The sovereign knows which feelings to maximize. However, it's ended up with a bunch of axes. How does it determine the appropriate trade-offs to make? (Or, to put it another way, how does it determine the relative value of different positions along each axis with different positions along different axes?)
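
The compounding-error worry in difficulty 1 can be made concrete. As a rough sketch, assuming (this is an added simplification, not part of the argument above) that each of the n classification decisions errs independently with the same probability p > 0:

```latex
P(\text{at least one error in } n \text{ decisions}) = 1 - (1 - p)^{n} \;\longrightarrow\; 1 \quad \text{as } n \to \infty
```

For instance, with p = 0.01 and n = 500 this is already about 0.993. If errors are correlated the number changes, but the direction of the concern stays the same.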

So, to rehash my actual request: what's the state of the art with regard to these difficulties, and how confident are we that we've reached a satisfactory answer?


Humans don't. "Utility" is part of the map, not part of the territory. We make choices, but utility theory is only a modeling language used to describe choice-making processes.

One area of research you may want to investigate is "Revealed Preference", a concept developed primarily by Paul Samuelson.

"Revealed Preference" has issues, because of things like circular preferences - although it's a mistake to conclude that circular preferences are proof that humans are irrational. Rather, it demonstrates that utility theory in general is just a model, and an incomplete one.

The fundamental issue is that utility, as a model, attempts to compress a topography of many dimensions - human preferences - into a topography of exactly one - a utility value for each potential choice. Impossible Objects - "contradictions", such as circular preferences - are to be expected in the abbreviated topography.
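
To illustrate the point about circular preferences, here is a minimal Python sketch. It assumes preferences are given as a set of strict pairwise comparisons (an illustrative encoding, and the function name is hypothetical), and it tries to assign each option a single number that respects every comparison. When the comparisons contain a cycle, no such assignment exists.

```python
from graphlib import TopologicalSorter, CycleError  # Python 3.9+


def utilities_from_strict_preferences(prefers):
    """Try to build a one-dimensional utility from pairwise preferences.

    `prefers` is an iterable of (better, worse) pairs, an assumed encoding.
    Returns a dict mapping option -> number with u[better] > u[worse] for
    every pair, or None if the preferences are circular.
    """
    # Map each option to the set of options it beats. TopologicalSorter
    # treats the mapped values as predecessors, so worse options come first.
    beats = {}
    for better, worse in prefers:
        beats.setdefault(better, set()).add(worse)
        beats.setdefault(worse, set())
    try:
        order = list(TopologicalSorter(beats).static_order())
    except CycleError:
        # A cycle (A > B > C > A) cannot be laid out on a single number line.
        return None
    # Position in the ordering serves as an ordinal utility.
    return {option: rank for rank, option in enumerate(order)}


transitive = [("apple", "banana"), ("banana", "cherry")]
circular = transitive + [("cherry", "apple")]

u = utilities_from_strict_preferences(transitive)
print(u["apple"] > u["banana"] > u["cherry"])
print(utilities_from_strict_preferences(circular))
```

The numbers produced are only ordinal: any order-preserving relabelling would serve equally well, which is part of why a single utility value carries less information than the underlying preferences.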


Why is self-reference expected when reducing the dimensions? Is it because these dimensions might influence each other in a circular way?

Circles are valid two-dimensional objects. What mapping do you use to represent a circle in one dimension?


Well, there is one case in which naive utility theory makes perfect sense: when the utility function is just measuring the value of some real-number random variable inside the epistemic model (i.e., when reading a number off your map tells you the utility of the territory). Since utility theory was invented to deal with economics, in which such a random variable exists and is called "money", nobody ever bothered to ask what happened when you didn't have such a convenient real-valued, assumed-monotonic random variable.

True. Although I think most utility theorists would be somewhat horrified if you suggested that money was the only thing worth measuring, when measuring utility.


Well of course, because they conceived of utility theory as giving value to money. They also invented a utility theory that only really applies to measuring money. It was a kind of doublethink in which, if real human preferences don't fit a model constructed to deal with money, then economists conclude that humans are Irrational (in a capital-letter ideological sense) rather than trying to come up with a model of evaluative reasoning that actually explains the data gained from real people.


And utility starts to become absurd when discussing multi-agent systems. For instance, if every person assigns utility preferences to every other person, at different levels of confidence depending on their mentalising ability... Voting theory constructs some solutions to this problem, but if I've ever come across someone who's familiar enough with voting theory for us to interact in optimal ways, I've never known it. I would also recommend looking up the social and cognitive determinants of altruistic behaviour and kindness to supplement the dry economic blurb you'll get by looking at the research on revealed preference.

[This comment is no longer endorsed by its author]

This general problem has been studied by Stuart Russell, Andrew Ng, and others. It's called "Inverse Reinforcement Learning", and the general idea is to infer an approximation of an agent A's utility function from training data that includes A's actual decisions, and then use that approximation in an RL agent B, where B can satisfy A's goals, perhaps better than A itself (by thinking faster and/or predicting the future better).

You need to start with some sensible priors over A's utility function for the problem to be well-formed, but after that it becomes a machine learning problem.
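
As a toy illustration of the "prior plus observed decisions" framing, here is a minimal Python sketch. It is not the algorithm from the IRL literature: it assumes a small hand-picked set of candidate utility functions, and it assumes the agent chooses noisily in proportion to exp(beta * utility) (a softmax choice model). All names and numbers are hypothetical.

```python
import math


def posterior_over_utilities(candidates, prior, observations, beta=1.0):
    """Bayesian update over candidate utility functions.

    candidates:   dict name -> dict option -> utility (assumed hypotheses)
    prior:        dict name -> prior probability (must be > 0)
    observations: list of (menu, chosen) pairs, where menu lists the options
                  that were available and chosen is the one the agent picked
    beta:         assumed rationality parameter; higher means less noisy
    """
    log_post = {name: math.log(prior[name]) for name in candidates}
    for menu, chosen in observations:
        for name, utility in candidates.items():
            # Softmax likelihood of the observed choice under this hypothesis.
            log_z = math.log(sum(math.exp(beta * utility[o]) for o in menu))
            log_post[name] += beta * utility[chosen] - log_z
    # Normalise in a numerically stable way.
    top = max(log_post.values())
    weights = {name: math.exp(lp - top) for name, lp in log_post.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


candidates = {
    "likes_tea": {"tea": 1.0, "coffee": 0.0},
    "likes_coffee": {"tea": 0.0, "coffee": 1.0},
}
prior = {"likes_tea": 0.5, "likes_coffee": 0.5}
observations = [(["tea", "coffee"], "tea")] * 3

print(posterior_over_utilities(candidates, prior, observations))
```

In this sketch the prior does real work: the posterior can only redistribute probability among the hypotheses it was handed, so a poor hypothesis set yields a poor answer no matter how much data is observed.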

What does this method produce if there is no utility function that accurately models the agent's decisions?

I'm not sure, but I'd guess it wouldn't produce much. For example, if the agent is just making random decisions, well you won't be able to learn from that.

The IRL research so far has used training data provided by humans, and can infer human goal shaped utility functions for at least the fairly simple problem domains tested so far. Most of this research was done almost a decade ago and hasn't been as active recently. In particular if you scaled it up with modern tech, I bet that IRL techniques could learn the score function of Atari from watching human play - for example.

One relevant idea is that there is a duality between assigning utilities to actions (or equivalently, being able to pick your favorite option out of a probabilistic mix of actions), and assigning utilities to outcomes. Acting consistently in one way implies that you are also acting consistently in the other.
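
One direction of this correspondence can be written down directly under the standard expected-utility assumption. The symbols here are illustrative and not taken from any particular writeup: if outcomes o have utilities u(o), and an action a leads to outcome o with probability P(o | a), then the action inherits the utility

```latex
U(a) = \sum_{o} P(o \mid a)\, u(o)
```

The converse direction, recovering u from consistent choices among probabilistic mixes, is what the von Neumann-Morgenstern theorem provides.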

Since humans are much better at picking actions than we are at evaluating entire world-states, this is pretty handy (though it comes nowhere near solving the entire problem). Paul Christiano has a writeup of what a naive-ish application of this idea would look like here.

So if I understand this correctly, Alice and the Sovereign are identically omniscient, and the Sovereign additionally has some power and influence upon the world that Alice does not. In the case where Alice herself is the sovereign the problem is solved, right? The sovereign just has to figure out what she prefers and do that. The solution then, is to simulate the scenario where Alice has the power to make the decision herself and then match Alice's decision. This solves both 1 and 2.

My short answer to the broader "How do we know what sacks of meat / circuits / whatever prefer" question is "you look at the behavioral output". Here, if Alice can make the decision herself, the decision represents her behavioral output.

(I'm about halfway through writing about how to make this idea more workable without resorting to omniscient things with consistent preferences. If I still like the idea after writing it out, I'll cross-post it on lw.)