[Epistemic status: half-baked, elucidating an intuition. Possibly what I’m saying here is just wrong, and someone will helpfully explain why.]

Thesis: I now think that utility functions might be a bad abstraction for thinking about the behavior of agents in general, including highly capable agents.

Over the past few years, in thinking about agency and AI, I’ve taken the concept of a “utility function” for granted as the natural way to express an entity's goals or preferences.

Of course, we know that humans don’t have well-defined utility functions (they’re inconsistent, and subject to all kinds of framing effects), but that’s only because humans are irrational. According to my prior view, to the extent that a thing acts like an agent, its behavior corresponds to some utility function. That utility function might or might not be explicitly represented, but if an agent is rational, there’s some utility function that reflects its preferences.

Given this, I might be inclined to scoff at people who scoff at “blindly maximizing” AGIs. “They just don’t get it”, I might think. “They don’t understand why agency has to conform to some utility function, and why an AI would try to maximize expected utility.”

Currently, I’m not so sure. I think that using "my utility function" as a stand-in for "my preferences" is biting a philosophical bullet, importing some unacknowledged assumptions. Rather than being the natural way to conceive of preferences and agency, I think utility functions might be only one possible abstraction, and one that emphasizes the wrong features, giving a distorted impression of what agents are actually like.

I want to explore that possibility in this post.

Before I begin, I want to make two notes. 

First, all of this is going to be hand-wavy intuition. I don’t have crisp knock-down arguments, only a vague discontent. But it seems like more progress will follow if I write up my current, tentative stance even without formal arguments.

Second, I don’t think utility functions being a poor abstraction for agency in the real world has much bearing on whether there is AI risk. It might change the shape and tenor of the problem, but highly capable agents with alien seed preferences are still likely to be catastrophic to human civilization and human values. I mention this because the sentiments expressed in this essay are causally downstream of conversations that I’ve had with skeptics about whether there is AI risk at all. So I want to highlight: I think I was previously mistakenly overlooking some philosophical assumptions, but that is not a crux.

[Thanks to David Deutsch (and other Critical Rationalists on twitter), Katja Grace, and Alex Zhu, for conversations that led me to this posit.]

Is coherence overrated? 

The tagline of the “utility” page on Arbital is “The only coherent way of wanting things is to assign consistent relative scores to outcomes.”

This is true as far as it goes, but to me, at least, that sentence implies a sort of dominance of utility functions. “Coherent” is a technical term, with a precise meaning, but it also has connotations of “the correct way to do things”. If someone’s theory is incoherent, that seems like a mark against it. 

But it is possible to ask, “What’s so good about coherence anyway?"

The standard reply, of course, is that if your preferences are incoherent, you’re dutchbookable, and someone will come along to pump you for money. 

But I’m not satisfied with this argument. It isn’t obvious that being dutch booked is a bad thing.

In Coherent Decisions Imply Consistent Utilities, Eliezer says:

Suppose I tell you that I prefer pineapple to mushrooms on my pizza. Suppose you're about to give me a slice of mushroom pizza; but by paying one penny ($0.01) I can instead get a slice of pineapple pizza (which is just as fresh from the oven). It seems realistic to say that most people with a pineapple pizza preference would probably pay the penny, if they happened to have a penny in their pocket.

After I pay the penny, though, and just before I'm about to get the pineapple pizza, you offer me a slice of onion pizza instead--no charge for the change! If I was telling the truth about preferring onion pizza to pineapple, I should certainly accept the substitution if it's free.

And then to round out the day, you offer me a mushroom pizza instead of the onion pizza, and again, since I prefer mushrooms to onions, I accept the swap.

I end up with exactly the same slice of mushroom pizza I started with... and one penny poorer, because I previously paid $0.01 to swap mushrooms for pineapple.

This seems like a qualitatively bad behavior on my part.

Eliezer asserts that this is “qualitatively bad behavior.” I think that this is biting a philosophical bullet. I think it isn't obvious that that kind of behavior is qualitatively bad.

As an intuition pump: In the actual case of humans, we seem to get utility not from states of the world, but from changes in states of the world. (This is one of the key claims of prospect theory). Because of this, it isn’t unusual for a human to pay to cycle between states of the world. 

For instance, I could imagine a human being hungry, eating a really good meal, feeling full, and then happily paying a fee to be instantly returned to their hungry state, so that they can enjoy eating a good meal again. 

This is technically a dutch booking ("which does he prefer, being hungry or being full?"), but from the perspective of the agent’s values there’s nothing qualitatively bad about it. Instead of the dutch-booker pumping money from the agent, he’s offering a useful and appreciated service.
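To make this concrete, here's a minimal sketch (the numbers and the fee are hypothetical) of an agent whose value attaches to the change from hungry to full rather than to either state on its own. By its own accounting, being "pumped" is just paying for a service it wants repeated:

```python
# Minimal sketch (hypothetical numbers): the value attaches to the transition
# from hungry to full, not to either state on its own.

MEAL_VALUE = 10.0   # how much the agent enjoys going from hungry to full
RESET_FEE = 1.0     # what the "dutch booker" charges to return the agent to hungry

def run_cycles(n_cycles):
    """Cycle hungry -> full -> hungry, paying the reset fee each time around."""
    net = 0.0
    for _ in range(n_cycles):
        net += MEAL_VALUE   # eat a good meal (hungry -> full)
        net -= RESET_FEE    # pay to be made hungry again (full -> hungry)
    return net

print(run_cycles(5))  # 45.0 -- "pumped" for five fees, yet better off by its own lights
```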

Of course, we can still back out a utility function from this dynamic: instead of a ranking over world states, we can have a ranking over changes from one world state to another.

But that just passes the buck one level. I see no reason in principle why an agent couldn't prefer to rotate between different changes in the world, just as it might rotate between different states of the world.

But this also misses the central point. You can always construct a utility function that represents some behavior, however strange and gerrymandered. But if one is no longer compelled by dutch book arguments, this raises the question of why we would want to do that. If coherence is no longer a desideratum, it’s no longer clear that a utility function is the natural way to express preferences.

And I wonder, maybe this also applies to agents in general, or at least the kind of learned agents that humanity is likely to build via gradient descent. 

Maximization behavior

I think this matters, because many of the classic AI risk arguments go through a claim that maximization behavior is convergent. If you try to build a satisficer, there are a number of pressures for it to become a maximizer of some kind. (See this Rob Miles video, for instance.)

I think that most arguments of that sort depend on an agent acting according to an expected utility maximization framework. And if utility maximization turns out not to be a good abstraction for agents in the real world, I don't know if these arguments are still correct.

I posit that straightforward maximizers are rare in the distribution of advanced AI that humanity creates across the multiverse. And I suspect that most evolved or learned agents are better described by some other abstraction.  

If not utility functions, then what?

If we accept for the time being that utility functions are a warped abstraction for most agents, what might a better abstraction be?

I don’t know. I’m writing this post in the hopes that others will think about this question and perhaps come up with productive alternative formulations. I've put some of my own half-baked thoughts in a comment.

Comments

If we accept for the time being that utility functions are a warped abstraction for most agents, what might a better abstraction be?

For AI risk in particular:

The Value Learning sequence answer is:

  1. Replace "utility function" with "goal", and "expected utility maximizer" with "goal-directed system"
  2. Figure out an argument for why the AI systems we build will be goal-directed
  3. Make peace with the fact that you don't get to have formal answers that apply with the certainty of theorems, and you have to rely on intuitions

The Human Compatible answer is:

  1. We use expected utility maximization because that's what the standard model says: all of the AI systems we build today optimize a definite specification that is effectively assumed to be handed down from God, we are simply mimicking that in our expected utility maximization model.
  2. What do you mean, "learned models"? Deep learning is going to hit a brick wall; we're going to build AI systems out of legible algorithms that really are optimizing their specifications, like planning algorithms.

My current answer is basically the Value Learning sequence answer with a more fleshed out version of point 2 (though I would probably prefer to state it differently now).

----

Btw, the original point of utility functions is to compactly describe a given set of preferences. It was originally primarily descriptive; the problem occurs when you treat it as prescriptive. But for the descriptive purpose, utility functions are still great; the value add is that the sentence "my utility function is X" is expected to be much shorter than the sentence "my preferences are X".

I think the key issue here is what you take as an "outcome" over which utility functions are defined. If you take states to be outcomes, then trying to model sequential decisions is inherently a mess. If you take trajectories to be outcomes, then this problem goes away - but then for any behaviour you can very easily construct totally arbitrary utility functions which that behaviour maximises. At this point I really don't know what people who talk about coherence arguments on LW are actually defending. But broadly speaking, I expect that everything would be much clearer if phrased in terms of reward rather than utility functions, because reward functions are inherently defined over sequential decisions.
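As a minimal sketch of the "totally arbitrary utility functions" point (the states, actions, and example policy are hypothetical): given any policy at all, a utility function over trajectories that scores 1 exactly when the trajectory follows that policy makes the policy an expected utility maximizer by construction.

```python
# Sketch: any policy maximizes *some* utility function, once "outcomes" are trajectories.
# The states, actions, and example policy are hypothetical.

def make_trajectory_utility(policy):
    """Return a utility over trajectories that the given policy maximizes by construction."""
    def utility(trajectory):  # trajectory: list of (state, action) pairs
        return 1.0 if all(action == policy(state) for state, action in trajectory) else 0.0
    return utility

# An arbitrary, even silly-looking, policy...
silly_policy = lambda state: "spin in circles" if state == "kitchen" else "recite pi"

# ...is an expected-utility maximizer for the utility function built from it:
u = make_trajectory_utility(silly_policy)
print(u([("kitchen", "spin in circles"), ("garden", "recite pi")]))  # 1.0, the maximum
print(u([("kitchen", "make dinner"), ("garden", "recite pi")]))      # 0.0
```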

I don’t think utility functions being a poor abstraction for agency in the real world has much bearing on whether there is AI risk. It might change the shape and tenor of the problem, but highly capable agents with alien seed preferences are still likely to be catastrophic to human civilization and human values.

If argument X plays an important role in convincing you of conclusion Y, and also the proponents of Y claim that X is important to their views, then it's surprising to hear that X has little bearing on Y. Was X redundant all along? Also, you currently state this in binary terms (whether there is AI risk); maybe it'd be clearer to state how you expect your credences to change (or not) based on updates about utility functions.

I think the key issue here is what you take as an "outcome" over which utility functions are defined. If you take states to be outcomes, then trying to model sequential decisions is inherently a mess. If you take trajectories to be outcomes, then this problem goes away

Right, it seems pretty important that utility not be defined over states like that. Besides, relativity tells us that a simple "state" abstraction isn't quite right.

But broadly speaking, I expect that everything would be much clearer if phrased in terms of reward rather than utility functions, because reward functions are inherently defined over sequential decisions.

I don't like reward functions, since that implies observability (at least usually it's taken that way).

I think a reasonable alternative would be to assume that utility is a weighted sum of local value (which is supposed to be similar to reward).

Example 1: reward functions. Utility is a weighted sum over a reward which can be computed for each time-step. You can imagine sliding a little window over the time-series, and deciding how good each step looks. Reward functions are single-step windows, but we could also use larger windows to evaluate properties over several time-steps (although this is not usually important).

Example 2: (average/total) utilitarianism. Utility is a weighted sum over (happiness/etc) of all people. You can imagine sliding a person-sized window over all of space-time, and judging how "good" each view is; in this case, we set the value to 0 (or some other default value) unless there is a person in our window, in which case we evaluate how happy they are (or how much they are thriving, or their preference satisfaction, or what-have-you).
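Here's a minimal sketch of the weighted-sum-of-local-value picture (the evaluators, weights, and data are hypothetical placeholders, and I'm flattening the utilitarian space-time window into a simple sequence for illustration):

```python
# Sketch: utility as a weighted sum of a local value computed on a sliding window.
# The evaluators, weights, and data below are hypothetical placeholders.

def windowed_utility(trajectory, window_value, window_size=1, weight=lambda t: 1.0):
    """Slide a window over the time-series and sum up weighted local values."""
    total = 0.0
    for t in range(len(trajectory) - window_size + 1):
        total += weight(t) * window_value(trajectory[t:t + window_size])
    return total

# Example 1: a reward function is the single-step-window special case.
reward = lambda window: window[0].get("reward", 0.0)

# Example 2: a utilitarian-style evaluator: default 0 unless someone is "in view",
# in which case we score their happiness (a stand-in for welfare, preference
# satisfaction, or what-have-you).
happiness = lambda window: window[0].get("happiness", 0.0)

steps = [{"reward": 1.0, "happiness": 0.3}, {"reward": 0.5}, {"happiness": 0.9}]
print(windowed_utility(steps, reward))     # 1.5
print(windowed_utility(steps, happiness))  # 1.2
```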

At this point I really don't know what people who talk about coherence arguments on LW are actually defending.

One thought I had: it's true that utility functions had better be a function of all time, not just a frozen state. It's true that this means we can justify any behavior this way. The utility-theory hypothesis therefore doesn't constrain our predictions about behavior. We could well be better off just reasoning about agent policies rather than utility functions.

However, there seems to be a second thing we use utility theory for, namely, representing our own preferences. My complaint about your proposed alternative, "reward", was that it was not expressive enough to represent preferences I can think of, and which seem coherent (EG, utilitarianism).

So it might be that we're defending the ability to represent preferences we think we might have.

(Of course, I think even utility functions are too restrictive.)

Another thought I had:

Although utility theory doesn't strictly rule out any policy, a simplicity assumption over agent beliefs and utility functions yields a very different distribution over actions than a simplicity assumption over policies.

It seems to me that there are cases which are better-represented by utility theory. For example, predicting what humans do in unusual situations, but where they have time to think, I expect "simple goals and simple world-models" is going to generalize better than "simple policies". I suspect this precisely because humans have settled on describing behaviors in terms of goals and beliefs, in addition to habits/norms (which are about policy). If habits/norms did good enough a job of constraining expectations on their own, we probably would not do that.

This also relates to the AI-safety-debate-relevant question, of how to model highly capable systems. If your objection to "utility theory" as an answer is "it doesn't constrain my expectations", then I can reply "use a simplicity prior". The empirical claim made by utility theory here is: highly capable agents will tend to have behavior explainable via simple utility functions. As opposed to merely having simple policies.

OK, but then, what is the argument for this claim? Certainly not the usual coherence arguments?

Well, I'm not sure. Maybe we can modify the coherence arguments to have simplicity assumptions run through them as well. Maybe not.

What I feel more confident about is that the simplicity assumption embodies the content of the debate (or at least an important part of the content).

relativity tells us that a simple "state" abstraction isn't quite right

Hmm, this sentence feels to me like a type error. It doesn't seem like the way we reason about agents should depend on the fundamental laws of physics. If agents think in terms of states, then our model of agent goals should involve states regardless of whether that maps onto physics. (Another way of saying this is that agents are at a much higher level of abstraction than relativity.)

I don't like reward functions, since that implies observability (at least usually it's taken that way).

Hmm, you mean that reward is taken as observable? Yeah, this does seem like a significant drawback of talking about rewards. But if we assume that rewards are unobservable, I don't see why reward functions aren't expressive enough to encode utilitarianism - just let the reward at each timestep be net happiness at that timestep. Then we can describe utilitarians as trying to maximise reward.

I expect "simple goals and simple world-models" is going to generalize better than "simple policies".

 I think we're talking about different debates here. I agree with the statement above - but the follow-up debate which I'm interested in is the comparison between "utility theory" and "a naive conception of goals and beliefs" (in philosophical parlance, the folk theory), and so this actually seems like a point in favour of the latter. What does utility theory add to the folk theory of agency? Here's one example: utility theory says that deontological goals are very complicated. To me, it seems like folk theory wins this one, because lots of people have pretty deontological goals. Or another example: utility theory says that there's a single type of entity to which we assign value. Folk theory doesn't have a type system for goals, and again that seems more accurate to me (we have meta-goals, etc).

To be clear, I do think that there are a bunch of things which the folk theory misses (mostly to do with probabilistic reasoning) and which utility theory highlights. But on the fundamental question of the content of goals (e.g. will they be more like "actually obey humans" or "tile the universe with tiny humans saying 'good job'") I'm not sure how much utility theory adds.

Hmm, this sentence feels to me like a type error. It doesn't seem like the way we reason about agents should depend on the fundamental laws of physics. If agents think in terms of states, then our model of agent goals should involve states regardless of whether that maps onto physics. (Another way of saying this is that agents are at a much higher level of abstraction than relativity.)

True, but states aren't at a much higher level of abstraction than relativity... states are a way to organize a world-model, and a world-model is a way of understanding the world.

From a normative perspective, relativity suggests that there's ultimately going to be something wrong with designing agents to think in states; states make specific assumptions about time which turn out to be restrictive.

From a descriptive perspective, relativity suggests that agents won't convergently think in states, because doing so doesn't reflect the world perfectly.

The way we think about agents shouldn't depend on how we think about physics, but it accidentally did, in that we accidentally baked linear time into some agent designs. So the reason relativity is able to say something about agent design, here, is because it points out that some agent designs are needlessly restrictive, and rational agents can take more general forms (and probably should).

This is not an argument against an agent carrying internal state, just an argument against using POMDPs to model everything.

Also, it's pedantic; if you give me an agent model in the POMDP framework, there are probably more interesting things to talk about than whether it should be in the POMDP framework. But I would complain if POMDPs were a central assumption needed to prove a significant claim about rational agents, or something like that. (To give an extreme example, if someone used POMDP-agents to argue against the rationality of assenting to relativity.)

Hmm, you mean that reward is taken as observable? Yeah, this does seem like a significant drawback of talking about rewards. But if we assume that rewards are unobservable, I don't see why reward functions aren't expressive enough to encode utilitarianism - just let the reward at each timestep be net happiness at that timestep. Then we can describe utilitarians as trying to maximise reward.

I would complain significantly less about this, yeah. However, the relativity objection stands.

I think we're talking about different debates here. I agree with the statement above - but the follow-up debate which I'm interested in is the comparison between "utility theory" and "a naive conception of goals and beliefs" (in philosophical parlance, the folk theory), and so this actually seems like a point in favour of the latter. What does utility theory add to the folk theory of agency?

To state the obvious, it adds formality. For formal treatments, there isn't much of a competition between naive goals and utility theory: utility theory wins by default, because naive goal theory doesn't show up to the debate.

If I thought "goals" were a better way of thinking than "utility functions", I would probably be working on formalizing goal theory. In reality, though, I think utility theory is essentially what you get when you try to do this.

Here's one example: utility theory says that deontological goals are very complicated. To me, it seems like folk theory wins this one, because lots of people have pretty deontological goals. 

So, my theory is not that it is always better to describe realistic agents as pursuing (simple) goals. Rather, I think it is often better to describe realistic agents as following simple policies. It's just that simple utility functions are often enough a good explanation, that I want to also think in those terms.

Deontological ethics tags actions as good and bad, so, it's essentially about policy. So, the descriptive utility follows from the usefulness of the policy view. [The normative utility is less obvious, but, there are several reasons why this can be normatively useful; eg, it's easier to copy than consequentialist ethics, it's easier to trust deontological agents (they're more predictable), etc.]

To state it a little more thoroughly:

  1. A good first approximation is the prior where agents have simple policies. (This is basically treating agents as regular objects, and investigating the behavior of those objects.)
  2. Many cases where that does not work well are handled much better by assuming simple utility functions and simple beliefs. So, it is useful to sloppily combine the two.
  3. An even better combination of the two conceives of an agent as a model-based learner who is optimizing a policy. This combines policy simplicity with utility simplicity in a sophisticated way. Of course, even better models are also possible.

Or another example: utility theory says that there's a single type of entity to which we assign value. Folk theory doesn't have a type system for goals, and again that seems more accurate to me (we have meta-goals, etc).

I'm not sure what you mean, but I suspect I just agree with this point. Utility functions are bad because they require an input type such as "worlds". Utility theory, on the other hand, can still be saved, by considering expectation functions (which can measure the expectation of arbitrary propositions, linear combinations of propositions, etc). This allows us to talk about meta-goals as expectations-of-goals ("I don't think I should want pizza").
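As a minimal sketch of what I mean by expectation functions (my own toy illustration, with hypothetical worlds and values, not a full Jeffrey-Bolker treatment): the desirability of a proposition is the value you expect conditional on it holding, so any proposition you can assign probabilities to can be evaluated, not just fully-specified worlds.

```python
# Sketch: evaluating arbitrary propositions by conditional expectation.
# Worlds, probabilities, values, and propositions are hypothetical.

worlds = {  # world: (probability, how much the agent values that world)
    "pizza_and_regret": (0.2, -1.0),
    "pizza_no_regret":  (0.3,  2.0),
    "salad":            (0.5,  1.0),
}

def desirability(proposition):
    """Expected value conditional on the proposition being true."""
    mass = sum(p for w, (p, v) in worlds.items() if proposition(w))
    return sum(p * v for w, (p, v) in worlds.items() if proposition(w)) / mass

print(desirability(lambda w: w.startswith("pizza")))  # 0.8
print(desirability(lambda w: w == "salad"))           # 1.0
```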

To be clear, I do think that there are a bunch of things which the folk theory misses (mostly to do with probabilistic reasoning) and which utility theory highlights. But on the fundamental question of the content of goals (e.g. will they be more like "actually obey humans" or "tile the universe with tiny humans saying 'good job'") I'm not sure how much utility theory adds.

Again, it would seem to add formality, which seems pretty useful.

To state the obvious, it adds formality.

Here are two ways to relate to formality.  Approach 1: this formal system is much less useful for thinking about the phenomenon than our intuitive understanding, but we should keep developing it anyway because eventually it may overtake our intuitive understanding.

Approach 2: by formalising our intuitive understanding, we have already improved it. When we make arguments about the phenomenon, using concepts from the formalism is better than using our intuitive concepts.

I have no problem with approach 1; most formalisms start off bad, and get better over time. But it seems like a lot of people around here are taking the latter approach, and believe that the formalism of utility theory should be the primary lens by which we think about the goals of AGIs.

I'm not sure if you defend the latter. If you do, then it's not sufficient to say that utility theory adds formalism, you also need to explain why that formalism is net positive for our understanding. When you're talking about complex systems, there are plenty of ways that formalisms can harm our understanding. E.g. I'd say behaviourism in psychology was more formal and also less correct than intuitive psychology. So even though it made a bunch of contributions to our understanding of RL, which have been very useful, at the time people should have thought of it using approach 1 not approach 2. I think of utility theory in a similar way to how I think of behaviourism: it's a useful supplementary lens to see things through, but (currently) highly misleading as a main lens to see things like AI risk arguments through.

If I thought "goals" were a better way of thinking than "utility functions", I would probably be working on formalizing goal theory.

See my point above. You can believe that "goals" are a better way of thinking than "utility functions" while still believing that working on utility functions is more valuable. (Indeed, "utility functions" seem to be what "formalising goal theory" looks like!)

Utility theory, on the other hand, can still be saved

Oh, cool. I haven't thought enough about the Jeffrey-Bolker approach enough to engage with it here, but I'll tentatively withdraw this objection in the context of utility theory.

From a descriptive perspective, relativity suggests that agents won't convergently think in states, because doing so doesn't reflect the world perfectly.

I still strongly disagree (with what I think you're saying). There are lots of different problems which agents will need to think about. Some of these problems (which involve relativity) are more physically fundamental. But that doesn't mean that the types of thinking which help solve them need to be more mentally fundamental to our agents. Our thinking doesn't reflect relativity very well (especially on the intuitive level which shapes our goals the most), but we manage to reason about it alright at a high level. Instead, our thinking is shaped most to be useful for the types of problems we tend to encounter at human scales; and we should expect our agents to also converge to thinking in whatever way is most useful for the majority of problems which they face, which likely won't involve relativity much.

(I think this argument also informs our disagreement about the normative claim, but that seems like a trickier one to dig into, so I'll skip it for now.)


If agents think in terms of states, then our model of agent goals should involve states regardless of whether that maps onto physics.

Realistic agents don't have the option of thinking in terms of detailed world states anyway, so the relativistic objection is the least of their worries.

I vaguely agree that preference cycles can be a good way to represent some human behavior. A somewhat grim example I like is: serial monogamists subjectively feel like each step of the process is moving toward something good (I imagine), but at no point would they endorse the whole cycle (I imagine).

I think of it this way: evolution isn't "good enough" to bake in a utility function which exactly pursues evolutionary fitness. So why should we expect it to put in a coherent function at all? Sometimes, an incoherent preference system will be the best available way to nudge organisms into fitness-inducing behavior.

However, your binge-eating example has a different character, in that the person may endorse the whole cycle explicitly while engaging in it. This character is essential to your story: you claim that incoherent preferences might be fine in a normative sense.

Here's my intuitive problem with your argument.

I think of the preference relation, >, in VNM, Savage, Jeffrey-Bolker, and other preference representation theorems (all of which justify some form of utility theory) as indicating behavior during deliberation.

We're in the process of deciding what to do. Currently, we'd default to A. We notice the option B. We check whether B > A. If yes, we switch to B. Now we notice C, and C > B, so we switch it to our default and keep looking. BUT WAIT: we immediately notice that A > C. So we switch back to A. We then do all these steps over and over again.

If there's a cycle, we get stuck in indecision. This is bad because it wastes cognition. It's strictly better to set A=B=C, so that the computation can move on and look for a D which is better than all of them. So, in my view, the true "money" in the money pump is thinking time.
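A minimal sketch of that picture (the options and preference relations here are hypothetical): with a cyclic preference relation, the deliberation loop never settles, so the resource being pumped is iterations of thought rather than pennies.

```python
# Sketch: deliberation as repeatedly switching to any noticed option that beats
# the current default. A preference cycle turns this into an endless loop
# (capped here), so the pumped resource is thinking time.

def deliberate(options, prefers, default, max_steps=1000):
    current, steps = default, 0
    switched = True
    while switched and steps < max_steps:
        switched = False
        for option in options:
            steps += 1
            if prefers(option, current):   # "option > current"
                current = option
                switched = True
    return current, steps

cycle = {("B", "A"), ("C", "B"), ("A", "C")}          # A < B < C < A
transitive = {("B", "A"), ("C", "B"), ("C", "A")}     # A < B < C

print(deliberate("ABC", lambda x, y: (x, y) in cycle, "A"))       # runs into the step cap
print(deliberate("ABC", lambda x, y: (x, y) in transitive, "A"))  # settles on 'C' in a few steps
```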

Eliezer's example with pizza fits with this, more or less: we see someone trying to decide between options, with some switching cost.

Your example does not fit: the person would want to actually eat the food before switching to being hungry again, not just think about it.

Why do I choose to operationalize A>B via deliberation, and think it a mistake to operationalize A>B as a fact about an agent already in state B, as you do? Well, part of the big problem with money-pump arguments is that the way they're usually framed, they seem to require a magical genie who can swap any reality A for a different one B (for a fee, of course). This is dumb. In the absence of such a genie, are incoherent preferences OK?

So it makes a lot more sense to interpret A>B as a fact about cognition. Being "offered" C just means you've thought to examine the possibility C. No magical genie required to define preferences.

Another justification is that, importantly, A, B and C can be lotteries, rather than fully-fleshed-out-worlds. Suppose C is just "the Riemann hypothesis is true". We have a lot of uncertainty about what else is going on in that world. So what does it mean to ask whether an agent "prefers" C or not-C, from the behavioral perspective? We have to come up with a concrete instantiation, like giving them a button which makes the conjecture true or false. But obviously this could skew the results (EG, the agent likes pushing red buttons and hates pushing blue ones).

On the other hand, if we define > cognitively, it becomes just a matter of whether the agent prefers one or the other hypothetically -- IE we only have to suppose that the agent can compare the desirability of abstract statements. This is still a bullet to bite (it constrains what cognition can look like, in a potentially unnecessary way), but it is cleaner.

Bottom line: you're operationalizing A>B as information about what an agent would be willing to do if it was already in situation A, and was offered to switch to situation B. I think this is a misconception propagated by the way philosophers talk about money-pump arguments. I prefer to operationalize A>B as information about deliberation behavior, which I think fits better with most uses of >. Money-pumps are then seen as infinite loops in cognition.

Money-pumps are then seen as infinite loops in cognition.

And setting A=B=C is deciding not to allocate the time to figure out their values (hard to decide -> similar). Usually, such a thing indicates there are multiple things you want (and as bad as 'buy 3 pizzas, one of each' might sound, it seems like it resolves this issue).

If someone feels like formalizing a mixed pizza theorem, or just testing this one experimentally, let me know.


This doesn't seem like a problem that shows up a lot, outside of 'this game has strategies that cyclically beat each other/what option should you play in Rock Paper Scissors?'

And setting A=B=C is deciding not to allocate the time to figure out their values (hard to decide -> similar). 

This sentence seems to pre-suppose that they have "values", which is in fact what's at issue (since numerical values ensure transitivity). So I would not want to put it that way. Rather, cutting the loop saves time without apparently losing anything (although to an agent stuck in a loop, it might not seem that way).

Usually, such a thing indicates there are multiple things you want

I think this is actually not usually an intransitive loop, but rather, high standards for an answer (you want to satisfy ALL the desiderata). When making decisions, people learn an "acceptable decision quality" based on what is usually achievable. That becomes a threshold for satisficing. This is usually good for efficiency; once you achieve the threshold, you know returns for thinking about this decision are rapidly diminishing, so you can probably move on.

However, in the rare situations where the threshold is simply not achievable, this causes you to waste a lot of time searching (because your termination condition is not yet met!).

My first half-baked thoughts about what sort of abstraction we might use instead of utility functions:

Maybe instead of thinking about preferences as rankings over worlds, we think of preferences as like gradients. Given the situation that an agent finds itself in, there are some directions to move in state space that it prefers and some that it disprefers. And as the agent moves through the world, and its situation changes, its preference gradients might change too.

This allows for cycles, where from a, the agent prefers b, and from b, the agent prefers c, and from c, the agent prefers a.

It also means that preferences are inherently contextual. It doesn’t make sense to ask what an agent wants in the abstract, only what it wants given some situated context. This might be a feature, not a bug, in that it resolves some puzzles about values. 

This implies a sort of non-transitivity of preferences. If you can predict that you'll want something in the future, that doesn't necessarily imply that you want it now.
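Here's a minimal sketch of this gradient picture (the states and local preferences are hypothetical): the agent only ever answers "from here, which way do I prefer to move?", and nothing stops those local answers from chaining into a cycle.

```python
# Sketch: preferences as local "gradients" -- a context-dependent choice of
# direction from the current state, rather than a global ranking of states.
# The states and preferred moves here are hypothetical.

preferred_move = {   # from this state, which neighbouring state does the agent move toward?
    "a": "b",
    "b": "c",
    "c": "a",
}

def follow_gradient(state, steps):
    """Trace where purely local preferences take the agent."""
    path = [state]
    for _ in range(steps):
        state = preferred_move[state]   # no global utility consulted, only local direction
        path.append(state)
    return path

print(follow_gradient("a", 6))  # ['a', 'b', 'c', 'a', 'b', 'c', 'a'] -- a happy cycle
```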

Relaxing independence rather than transitivity is the most explored angle of attack IIRC.

The problem with this is also that it's too expressive. For any policy π, you can encode that policy into this sort of gradient: if π takes action a in state s, you say that your gradient points towards a (or the state s' that results from taking action a), and away from every other action / state.

I happen to agree with this generalization, provided we also respect the constraint "if you can predict that you'll want something in the future, then you want it now". (There might also be other coherence constraints I would want to impose! But this is a central one.)

On the one hand, an agent that violates this will usually prefer to self-modify to remove the violation. It might not entirely stop its preferences from changing, but it would certainly want to change the method of change, at least. This is very much like a philosopher who doesn't trust their own deliberation process. They might not want to entirely stop thinking (some ways of changing your mind are good), but they would want to modify their reasoning somehow.

(Furthermore, an agent who sees this kind of thing coming, but does not yet inhabit either conflicting camp, would probably want to self-modify in some way to avoid the conflict.)

On the other hand, suppose an agent passes through this kind of belief change without having an opportunity to self-modify. The agent will think its past self was wrong to want to resist the change. It will want to avoid that type of mistake in the future. If we make the assumption that learning will tend to make modifications which would have 'helped' its past self, then such an agent will learn to predict value changes and learn to agree with those predictions.

This gives us something similar to logical induction.

You mentioned in the article that you intuitively want some kind of "dominance" argument which dutch-books/money-pumps don't give you. I would propose logical-induction style dominance. What you have is essentially the guarantee that someone with cognitive powers comparable to yours can't come in and do a better job of satisfying your (future) values.

Why do we want that guarantee?

  1. The usefulness of the current action to future preferences is what's important for learning, since future preferences are the ones which get to decide how to modify things. So this is a notion of "doing the best we can" with respect to learning: we couldn't benefit from the advice of someone with similar cognitive strength to us.
  2. Relatedly, this is important for tiling agents: if (it looks to you like) a different configuration of a similar amount of processing power would do a better job, then you'd prefer to self-modify to that configuration.

I don't have a unified answer at the moment, but a few comments/pointers...

First, I currently think that utility-maximization is the right way to model goal-directedness/optimization-in-general, for reasons unrelated to dutch book arguments. Basically, if we take Flint's generalized-notion-of-optimization (which was pretty explicitly not starting from a utility-maximization assumption), and formalize it in a pretty reasonable way (as reducing the number-of-bits required to encode world-state under some model), then it turns out to be equivalent to expected utility maximization. This definitely seems like the sort of argument which should apply to e.g. evolved systems.
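Very roughly, the formal core of that equivalence (a sketch only, assuming a finite set of world-states and eliding the details of the actual argument):

```latex
% Rough sketch: define a model M over a finite set of world-states by
%   M(x) = 2^{u(x)} / Z,  with  Z = \sum_x 2^{u(x)}.
\[
  -\log_2 M(x) \;=\; -\,u(x) + \log_2 Z
\]
\[
  \arg\max_{P}\; \mathbb{E}_{X \sim P}\!\left[\,u(X)\,\right]
  \;=\;
  \arg\min_{P}\; \mathbb{E}_{X \sim P}\!\left[\,-\log_2 M(X)\,\right]
\]
```

Since the log Z term doesn't depend on the distribution the agent's choices induce, maximizing expected utility under u and minimizing the expected number of bits to encode the world-state under M pick out the same behavior.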

One caveat, though: that argument says that the system is an expected utility maximizer under some God's-eye world-model, not necessarily under the system's own world model (if it even has one). I expect something like the (improved) Good Regulator theorem to be able to bridge that gap, but I haven't written up a full argument for that yet, much less started on the empirical question of whether the Good Regulator conditions actually match the conditions under which agenty systems evolve in practice.

Second, there's the issue from Why Subagents?. Markets of EU maximizers are "inexploitable" in exactly the same sense used in the dutch book theorems, but a well-known result in economics says that a market is not always equivalent to single EU maximizer. What gives? We have an explicit example of a system which is inexploitable but not an EU maximizer, so what loophole in the dutch book theorems is it using? Turns out, the dutch book theorems implicitly assume that the system has no "internal state" - which is clearly false for most real-world agenty systems. I conjecture that markets, rather than single EU maximizers, are the most general inexploitable systems once we allow for internal state.

I still expect evolved systems to be well-modeled as EU maximizers at the level of terminal goals and from an external perspective, for the reasons above. But in terms of instrumental goals/behavior, or internal implementation, I expect to see market-like mechanisms, rather than single-utility-maximization.

Third, there's the issue of type-signatures of goals. You mentioned utilities over world-states or changes-in-world-states, but the possibilities are much broader than that - in principle, the variables which go into a utility function need not be (fully) grounded in physical world-state at all. I could, for instance, care about how elegant the mathematical universe is, e.g. whether P = NP, independent of the real-world consequences of that.

More importantly, the variables which go into a utility function need not be things the agent can or does observe. I think this is true for rather a lot of things humans care about - for instance, I care about the welfare of random people in Mumbai, even if I will never meet them or have any idea how they're doing. This is very different from the dutch-book theorems, which assume not only that we can observe every variable, but that we can even bet on every variable. This is another aspect which makes more sense if we view EU maximization as compression (i.e. reducing the number-of-bits required to encode world-state under some model) rather than as a consequence of dutch-book theorems.

I conjecture that markets, rather than single EU maximizers, are the most general inexploitable systems once we allow for internal state.

I note that this may be very similar to Eli's own proposal, provided we do insist on the constraint that "if you can predict how your values will change then you agree with that change" (aka price today equals expected value of price tomorrow).

Two arguments against "you must not be Dutch-bookable" that feel vaguely relevant here:

1) The _extent_ to which you are Dutch-bookable might matter. IE, if you can pump $X per day from me, that only matters for large X. So viewing Dutch-bookability as binary might be misleading.

2) Even if you are _in theory_ Dutch-bookable, it only matters if you can _actually_ be Dutch-booked. EG, if I am the meanest thing in the universe that controls everything (say, a singleton AI), I could probably ensure that I won't get into situation where my incoherent goals could hurt me.

My takeaway: It shouldn't be necessary to build AI with a utility function. And it isn't sufficient to only defend against misaligned AIs with a utility function.

This seems somewhat connected to this previous argument. Basically, coherent agents can be modeled as utility-optimizers, yes, but what this really proves is that almost any behavior fits into the model "utility-optimizer", not that coherent agents must necessarily look like our intuitive picture of a utility-optimizer.

Paraphrasing Rohin's arguments somewhat, the arguments for universal convergence say something like "for "most" "natural" utility functions, optimizing that function will mean acquiring power, killing off adversaries, acquiring resources, etc". We know that all coherent behavior comes from a utility function, but it doesn't follow that most coherent behavior exhibits this sort of power-seeking.

I have an intuition that the dutch-book arguments still apply in very relevant ways. I mostly want to talk about how maximization appears to be convergent. Let's see how this goes as a comment.

My main point: if you think an intelligent agent forms and pursues instrumental goals, then I think that agent will be doing a lot of maximization inside, and will prefer to not get dutch-booked relative to its instrumental goals.

---

First, an obvious take on the pizza non-transitivity thing.

If I'm that person desiring a slice of pizza, I'm perhaps desiring it because it will leave me full + taste good + not cost too much.

Is there something wrong with me paying some money to switch the pizza slice back and forth? Well, if the reason I cared about the pizza was that it was low-cost tasty food, then I guess I'm doing a bad job at getting what I care about.

If I enjoy the process of paying for a different slice of pizza, or am indifferent to it, then that's a different story. And it doesn't hurt much to pay 1 cent a couple of times anyway.

----

Second, suppose I'm trying to get to the moon. How would I go about it?

I might start with estimates about how valuable different suboutcomes are, relative to my attempt to get to the moon. For instance, I might begin with the theory that I need to have a million dollars to get to the moon, and that I'll need to acquire some rocket fuel too.

If I'm trying to get to the moon soon, I will be open to plans that make me money quickly, and teach me how to get rocket fuel. I would also like better ideas about how I should get to the moon, and if you told me about how calculus and finite-element-analysis would be useful, I'll update my plans. (And if I were smarter, I might have figured that out on my own.)

If I think that I need a much better grasp of calculus, I might then dedicate some time to learning about it. If you offer me a plan for learning more about calculus, better and faster, I'll happily update and follow it. If I'm smart enough to find a better plan on my own, by thinking, I'll update and follow it.

----

So, you might think that I can be an intelligent agent, and basically not do anything in my mind that looks like "maximizing". I disagree! In my above parable, it should be clear that my mind is continually selecting options that look better to me. I think this is happening very ubiquitously in my mind, and also in agents that are generally intelligent.

All models are wrong, some models are useful.  

I think it's unambiguous that mapping perceived/expected state of the universe to a value for instantaneous decision-making is a useful (and perhaps necessary) abstraction to model anything about decision-making.  You don't seem to be arguing about that, but only claiming that a consistent utility function over time is ... unnecessary, incomplete, or incorrect (unsure which).

You also seem to be questioning the idea that more capable/effective agents have more consistent utility functions.  You reject the dutch book argument, which is one of the intuitive supports for this belief, but I don't think you've proposed any case where inconsistency does optimize the universe better than consistent decisions.  Inconsistency opens the identity problem (are they really the same agent if they have different desires of a future world-state?), but even if you handwave that, it's clear that making inconsistent decisions optimizes the universe less than making consistent ones.

I think I'm with you that it may be impossible to have a fully-consistent long-lived agent. The universe has some irreducible complexity that any portion of the universe can't summarize well enough to evaluate a hypothetical. But I'm not with you if you say that an agent can be equally or more effective if it changes its goals all the time.

Perhaps relevant, I wrote a post a while back (all the images are broken and irretrievable; sorry) about the idea that suffering and happiness (and possibly many other states) should be considered separate dimensions around which we intuitively try to navigate, and that compressing all these dimensions onto the single dimension of utility gives you some advantages (you can rank things more easily) but discards a tremendous amount of detail. Fundamentally, forcing yourself to use utility in all circumstances is like throwing away your detailed map in exchange for a single number representing how far you are from your destination. In theory you can still find your way home, but you'll encounter obstacles you might have avoided otherwise.

I can't read it via that link. I have two options on that page, 'publish changes' and 'move to drafts', plus a toggle for whether 'Moderators can promote to frontpage'.

Did you mean:

https://www.lesswrong.com/posts/3mFmDMapHWHcbn7C6

?

Yes, thanks, I’ll fix it.

My intuitive feeling about the value of the utility function abstraction is that for systems like humans and current neural networks, the pre-experiential mind is like a rough, rocky mountain, without the effect of wind or rain, while a utility maximizing agent is analogous to a perfectly flat plain.

Under the influence of consistent incentives, certain sections of this mountain are worn smooth over time, allowing it to be well approximated as a utility maximizing system. We should not expect it to act as a utility maximizer outside of this smoothed area, but we should expect this smooth area to exist even with only weak assumptions about the agent.

Reasoning a little less poetically:

The world of potential preferences is a super-high dimensional space, of which what I care about is only a tiny subset (though still complex in its own right). 

Taking actions or accepting offers which would improve the world according to my preferences has both absolute cost and opportunity cost, meaning I only take actions with at least some threshold positive impact.

Preferences outside my core domain of action are:

  • generally very weak at best, with total indifference being the norm, and
  • chaotically noisy, such that they may vary according to all kinds of situational characteristics unpredictable to myself or an outside observer

Also, my understanding of how my actions affect those areas I truly care about is sufficiently imperfect that outside of a well-understood range the expected value is low, especially with conservative preferences.

These factors greatly reduce the scope of potential dutch books which I would ever actually take, reducing the ability of somebody to exploit any inconsistencies. 

Also, repeated exposure to simple dutch books and failures to maximize is likely to train away the most exploitable inconsistencies over time.

We should therefore expect to find agents which are utility-maximizing only as a contingent outcome of the trajectory of their learning: they display utility-maximizing behaviors in areas that are both reward-relevant and within the training domain.

While self-modification to create a simple smooth plain is a plausible action, it shouldn't be seen as dominant since less drastic actions are likely to be sufficient to avoid being dutch booked.

A major crux for this view as applied to systems like humans is the explanation of how our (relatively) simple, compressible goals and ideas emerge out of the outrageous complexity of our minds. My feeling is that, once learning has started, adjustments to the nature of the mind pick up broad contours as a way to act, but only as a reflection of the world and reward system in which they are placed. If instead there is some kind of core underlying drive towards logical simplicity, or if a logically simple set of drives, once in place, is somehow dominant or tends to spread through a network, then I would expect smarter agents to quickly become more agent-like.

It seems to me that the hungry->full Dutch book can be resolved by just considering the utility function one level deeper: we don't value hungriness or fullness (or the transition from hungry to full) as terminal goals themselves. We value moving from hungry to full, but only because doing so makes us feel good (and gives nutrients, etc). In this case, the "feeling good" is the part of the equation that really shows up in the utility function, and a coherent strategy would be one for which this amount of "feeling good" can not be purchased for a lower cost.

alien seed preferences

What (strange) preferences might aliens have?

I think this becomes clearer if we distinguish people from agents. People are somewhat agentic beings of high moral value, while dedicated agents might have little moral value. Maximizing agency of people too well probably starts destroying their value at some point. At present, getting as much agency as possible out of potentially effective people is important for instrumental reasons, since only humans can be agentic, but that will change.

It's useful for agents to have legible values, so that they can build complicated systems that serve much simpler objectives well. But for people it's less obviously important to have much clarity to their values, especially if they are living in a world managed by agents. Agents managing the world do need clear understanding of values of civilization, but even then it doesn't necessarily make sense to compare these values with those of individual people.

(It's not completely obvious that individual people are one of the most valuable things to enact, so a well-developed future might lack them. Even if values of civilization are determined by people actually living through eons of reflection and change, and not by a significantly less concrete process, that gives enough distance from the present perspective to doubt anything about the result.)

You might be interested in reading about aspiration adaptation theory: https://www.sciencedirect.com/science/article/abs/pii/S0022249697912050

To me the most appealing part of it is that goals are incomparable and multiple goals can be pursued at the same time without the need for a function that aggregates them and assigns a single value to a combination of goals.