The human utility hypothesis is much more vague than the others, and seems ultimately context-dependent. To my knowledge, the main argument in its favor is the fact that most of economics is founded on it.

I would say, rather, that the arguments in its favor are the same ones which convinced economists.

Humans aren't well-modeled as perfect utility maximizers, but utility theory is a theory of what we can reflectively/coherently value. Economists might have been wrong to focus only on rational preferences, and have moved toward prospect theory and the like to remedy this. But it may make sense to think of alignment in these terms nonetheless.

I am not saying that it does make sense -- I'm just saying that there's a much better argument for it than "the economists did it", and I really don't think prospect theory addresses issues which are of great interest to alignment.

  • If a system is trying to align with idealized reflectively-endorsed values (similar to CEV), then one might expect such values to be coherent. The argument for this position is the combination of the various arguments for expected utility theory: the VNM theorem; money-pump arguments; the various Dutch-book arguments; Savage's theorem; the Jeffrey-Bolker theorem; the complete class theorem. One can take these various arguments and judge them on their own terms (perhaps finding them lacking).
  • Arguably, you can't fully align with inconsistent preferences; if so, one might argue that there is no great loss in making a utility-theoretic approximation of human preferences: it would be impossible to perfectly satisfy inconsistent preferences anyway, so representing them by a utility function is a reasonable compromise.
  • In aligning with inconsistent preferences, the question seems to be what standards to hold a system to in attempting to do so. One might argue that the standards of utility theory are among the important ones; and thus, that the system should attempt to be consistent even if humans are inconsistent.
  • To the extent that human preferences are inconsistent, it may make more sense to treat humans as fragmented multi-agents, and combine the preferences of the sub-agents to get an overall utility function -- essentially aligning with one inconsistent human the same way one would align with many humans. This approach might be justified by Harsanyi's theorem.
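
To make the last bullet concrete, here is a minimal sketch of Harsanyi-style aggregation, assuming the human has already been decomposed into sub-agents with known utility functions and weights. The sub-agents, outcomes, weights, and numbers below are hypothetical placeholders, not anything proposed above:

```python
# Minimal sketch of Harsanyi-style aggregation of "sub-agent" preferences.
# Everything here (sub-agents, outcomes, weights, numbers) is a hypothetical
# placeholder for illustration, not a proposal for decomposing real humans.

outcomes = ["status_quo", "more_leisure", "more_income"]

# Each sub-agent is represented by a utility function over outcomes.
subagent_utilities = {
    "planner":  {"status_quo": 0.0, "more_leisure": 0.2, "more_income": 1.0},
    "hedonist": {"status_quo": 0.0, "more_leisure": 1.0, "more_income": 0.3},
}

# Harsanyi's theorem says that (under its assumptions) any Pareto-respecting,
# VNM-rational aggregation must be a weighted sum of the individual utilities.
weights = {"planner": 0.6, "hedonist": 0.4}

def aggregate_utility(outcome: str) -> float:
    """Weighted sum of sub-agent utilities for a single outcome."""
    return sum(weights[name] * u[outcome] for name, u in subagent_utilities.items())

print({o: round(aggregate_utility(o), 2) for o in outcomes})
print("aggregate-preferred outcome:", max(outcomes, key=aggregate_utility))
```

Note that the theorem only says the aggregate must be some weighted sum; it is silent on where the weights come from, which is where much of the alignment-relevant difficulty lives.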

On the other hand, there are no strong arguments for representing human utility via prospect theory. It holds up better in experiments than utility theory does, but not so well that we would want to make it a bedrock assumption of alignment. The various arguments for expected utility make me somewhat happy for my preferences to be represented utility-theoretically even though my actual preferences aren't really like that; but there is no similar argument in favor of a prospect-theoretic representation of my preferences. Essentially, I think one should either stick to a more-or-less utility-theoretic framework, or take a much more empirical approach in which human preferences are learned in all their inconsistent detail (without a background assumption such as prospect theory).
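
For reference, here is a minimal sketch of the kind of structure a prospect-theoretic representation commits one to. The functional forms and parameter values are the standard ones reported by Tversky and Kahneman (1992); weighting each outcome's probability separately, and using one weighting function for both gains and losses, are simplifications to keep the example short:

```python
# Minimal sketch of a prospect-theory style evaluation of a simple gamble.
# Functional forms and parameter values follow Tversky & Kahneman (1992);
# weighting each outcome separately and using a single weighting function for
# gains and losses are simplifications for this two-outcome illustration.

ALPHA = 0.88    # diminishing sensitivity for gains
BETA = 0.88     # diminishing sensitivity for losses
LAMBDA = 2.25   # loss aversion: losses loom roughly 2.25x larger than gains
GAMMA = 0.61    # probability-weighting curvature

def value(x: float) -> float:
    """S-shaped value function: concave for gains, convex and steeper for losses."""
    return x ** ALPHA if x >= 0 else -LAMBDA * ((-x) ** BETA)

def weight(p: float) -> float:
    """Inverse-S probability weighting: overweights small probabilities."""
    return p ** GAMMA / (p ** GAMMA + (1 - p) ** GAMMA) ** (1 / GAMMA)

# A 50/50 gamble: win 100 or lose 100. Expected value is 0, but the
# prospect-theoretic evaluation is negative because of loss aversion.
gamble = [(0.5, 100.0), (0.5, -100.0)]
print(round(sum(weight(p) * value(x) for p, x in gamble), 2))
```

Nothing here is being proposed as an alignment assumption; the point is just to show the extra empirical machinery one would be baking in.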

That's still a false dichotomy, but I think it is an appropriate response to many critiques of utility theory.

Davidmanheim (6 points, 5 months ago):

My current best understanding is that, if we assume people have arbitrary inconsistencies, it will be impossible to do better than satisficing on different human values by creating near-Pareto improvements for intra-human values. But inconsistent values don't even allow Pareto improvements! Any change makes things incomparable.

Given that, I think we do need a super-prospect theory that explains in a systematic way what humans do "wrong", so that we can pick which human preferences an AI should respect and which can be ignored. For instance, I love my children, and I like chocolate. I'm also inconsistent with these preferences in ways that differ: at a given moment, I'm much more likely to be upset with my kids and not want them around than I am to not want chocolate. I want the AI to respect my greater but inconsistent preference for my children over the more consistent preference for candy. I don't know how to formalize this in a way that would generalize, which seems like a problem. Similar problems exist for time preference and other typical inconsistencies -- they are either inconsistent, or at least can be exploited by an AI whose model doesn't think about resolving those inconsistencies.

With a super-prospect theory, I would hope we could define a CEV or something similar which allows large improvements by ignoring the fact that those improvements are bad for some tiny part of my preferences. And perhaps the AI should find the needed super-prospect theory and CEV itself -- but I am deeply unsure about the safety of doing this, or the plausibility of trying to solve it first.

(Beyond this, I think we need to expect that between-human values will differ, and we can keep things safe by insisting on near-Pareto improvements: only changes that are a Pareto improvement with respect to a very large portion of people, with relatively minor dis-improvements for the remainder. But that's a different discussion.)
abramdemski (2 points, 5 months ago):

Yeah, I think something like this is pretty important. Another reason is that humans inherently don't like to be told, top-down, that X is the optimal solution. A utilitarian AI might redistribute property forcefully, where a Pareto-improving AI would seek to compensate people.

An even more stringent requirement which seems potentially sensible: only Pareto improvements which both parties understand and endorse (i.e., there should be something like consent). This seems very sensible with small numbers of people, but unfortunately it seems infeasible for large numbers of people, given the way all actions have side effects for many, many people.

See my other reply about pseudo-Pareto improvements -- but I think the "understood + endorsed" idea is really important, and worth further thought.
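
As a concrete illustration of the exploitability point in Davidmanheim's comment above (my sketch, not the commenter's): an agent with cyclic preferences who will pay a small fee for any trade up to something it prefers can be pumped for money indefinitely. The goods, the fee, and the trading loop are hypothetical placeholders.

```python
# Minimal money-pump sketch: cyclic preferences A > B > C > A, plus a
# willingness to pay a small fee for any trade to a preferred item, can be
# exploited indefinitely. Goods, fee, and rounds are hypothetical placeholders.

preferred_swap = {"B": "A", "C": "B", "A": "C"}  # the item strictly preferred to each holding

fee = 1.0          # what the agent will pay per preferred swap
holding = "A"
money_extracted = 0.0

for _ in range(3):                     # one full lap around the preference cycle
    holding = preferred_swap[holding]  # every single trade looks like an upgrade
    money_extracted += fee

print(holding, money_extracted)        # back at "A", but 3.0 poorer
```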

What are we assuming about utility functions?

by Grue_Slinky · 2 min read · 2nd Oct 2019 · 24 comments

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

I often notice that in many (not all) discussions about utility functions, one side is "for" their relevance while the other tends to be "against" their usefulness, without either side explicitly saying what they mean. I don't think this is causing any deep confusions among researchers here, but I'd still like to take a stab at disambiguating some of this, if nothing else for my own sake. Here are some distinct (albeit related) ways that utility functions can come up in AI safety, in terms of what assumptions/hypotheses they give rise to:

AGI utility hypothesis: The first AGI will behave as if it is maximizing some utility function

ASI utility hypothesis: As AI capabilities improve well beyond human-level, it will behave more and more as if it is maximizing some utility function (or will have already reached that ideal earlier and stayed there)

Human utility hypothesis: Even though in some experimental contexts humans seem to not even be particularly goal-directed, utility functions are often a useful model of human preferences to use in AI safety research

Coherent Extrapolated Volition (CEV) hypothesis: For a given human H, there exists some utility function V such that if H is given the appropriate time/resources for reflection, H's values would converge to V
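
One way to read the first and last hypotheses slightly more formally; this gloss and its notation are mine, not the post's:

```latex
% AGI/ASI utility hypothesis: there is some utility function u over outcomes
% such that the agent's policy maximizes its expectation.
\exists\, u : \mathcal{O} \to \mathbb{R}
  \quad \text{such that} \quad
  \pi_{\text{agent}} \in \arg\max_{\pi} \mathbb{E}_{\pi}\!\left[ u \right]

% CEV hypothesis: given enough idealized reflection, H's values converge to
% some fixed utility function V.
\exists\, V
  \quad \text{such that} \quad
  \lim_{t \to \infty} \operatorname{Values}\!\big(\operatorname{Reflect}(H, t)\big) = V
```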

Some points to be made:

  • The "Goals vs Utility Functions" chapter of Rohin's Value Learning sequence, and the resulting discussion focused on differing intuitions about the AGI and ASI utility hypotheses. Specifically, the main post there pointed out that seemingly anything can be trivially modeled as being a "utility maximizer" (further discussion here), whereas only some intelligent agents can be described as being "goal-directed" (as defined in this post), and the latter is a more useful concept for reasoning about AI safety.
  • AGI utility doesn't logically imply ASI utility, but I'd be surprised if anyone thinks it's very plausible for the former to be true while the latter fails. In particular, the coherence arguments and other pressures that move agents toward VNM seem to roughly scale with capabilities. A plausible stance could be that we should expect most ASIs to hew close to the VNM ideal, but these pressures aren't quite so overwhelming at the AGI level; in particular, humans are fairly goal-directed but only "partially" VNM, so the goal-directedness pressures on an AGI will likely be of roughly this order of magnitude. Depending on takeoff speeds, we might get many years to try aligning AGIs at this level of goal-directedness, which seems less dangerous than playing sorcerer's apprentice with VNM-based AGIs at the same level of capability. (Note: I might be reifying VNM here too much, in thinking of things as having a measure of "goal-directedness" with "very goal-directed" approximating VNM. But this basic picture could be wrong in all sorts of ways.)
  • The human utility hypothesis is much more vague than the others, and seems ultimately context-dependent. To my knowledge, the main argument in its favor is the fact that most of economics is founded on it. On the other hand, behavioral economists have formulated models like prospect theory for when greater precision is required than the simplistic VNM model gives, not to mention the cases where it breaks down more drastically. I haven't seen prospect theory used in AI safety research; I'm not sure if this reflects more a) the size of the field and the fact that few researchers have had much need to explicitly model human preferences, or b) that we don't need to model humans more than superficially, since this kind of research is still at a very early theoretical stage with all sorts of real-world error terms abounding.
  • The CEV hypothesis can be strengthened, consistent with Yudkowsky's original vision, to say that every human will converge to about the same values. But the extra "values converge" assumption seems orthogonal to one's opinions about the relevance of utility functions, so I'm not including it in the above list.
  • In practice a given researcher's opinions on these tend to be correlated, so it makes sense to talk of "pro-utility" and "anti-utility" viewpoints. But I'd guess the correlation is far from perfect, and at any rate, the arguments connecting these hypotheses seem somewhat tenuous.
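
As referenced in the first bullet above, here is a minimal sketch of the trivial-utility-maximizer construction (the agent and names are placeholders of mine): define a utility function over complete action-histories that assigns 1 to whatever the agent actually does and 0 to everything else. Any agent, however incoherent, then "maximizes" this function, which is why the bare utility hypotheses have little content without some further goal-directedness or simplicity requirement.

```python
# Minimal sketch: any agent can be modeled as maximizing some utility function,
# namely one that assigns utility 1 to the agent's own action-history and 0 to
# every other. The "agent" below is a deliberately arbitrary placeholder.

from typing import Callable, List

def observed_history(agent: Callable[[int], str], horizon: int) -> List[str]:
    """Run the agent and record whatever it actually does."""
    return [agent(t) for t in range(horizon)]

def trivial_utility(history: List[str], actual: List[str]) -> float:
    """Utility 1 for the agent's own action-history, 0 for any other."""
    return 1.0 if history == actual else 0.0

def silly_agent(t: int) -> str:
    # A deliberately arbitrary, not-goal-directed policy (placeholder).
    return "left" if t % 3 == 0 else "right"

actual = observed_history(silly_agent, horizon=5)
print(trivial_utility(actual, actual))        # 1.0 -- the agent "maximizes" it
print(trivial_utility(["left"] * 5, actual))  # 0.0 -- every other history scores lower
```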
