[Epistemic status: half-baked, elucidating an intuition. Possibly what I’m saying here is just wrong, and someone will helpfully explain why.]
Thesis: I now think that utility functions might be a bad abstraction for thinking about the behavior of agents in general, including highly capable agents.
Over the past years, in thinking about agency and AI, I’ve taken the concept of a “utility function” for granted as the natural way to express an entity's goals or preferences.
Of course, we know that humans don’t have well-defined utility functions (they’re inconsistent, and subject to all kinds of framing effects), but that’s only because humans are irrational. According to my prior view, to the extent that a thing acts like an agent, its behavior corresponds to some utility function. That utility function might or might not be explicitly represented, but if an agent is rational, there’s some utility function that reflects its preferences.
Given this, I might be inclined to scoff at people who scoff at “blindly maximizing” AGIs. “They just don’t get it,” I might think. “They don’t understand why agency has to conform to some utility function, and why an AI would try to maximize expected utility.”
Currently, I’m not so sure. I think that using "my utility function" as a stand-in for "my preferences" is biting a philosophical bullet, importing some unacknowledged assumptions. Rather than being the natural way to conceive of preferences and agency, I think utility functions might be only one possible abstraction, and one that emphasizes the wrong features, giving a distorted impression of what agents are actually like.
I want to explore that possibility in this post.
Before I begin, I want to make two notes.
First, all of this is going to be hand-wavy intuition. I don’t have crisp knock-down arguments, only a vague discontent. But it seems like more progress will follow if I write up my current, tentative, stance even without formal arguments.
Second, I don’t think utility functions being a poor abstraction for agency in the real world has much bearing on whether there is AI risk. It might change the shape and tenor of the problem, but highly capable agents with alien seed preferences are still likely to be catastrophic to human civilization and human values. I mention this because the sentiments expressed in this essay are causally downstream of conversations that I’ve had with skeptics about whether there is AI risk at all. So I want to highlight: I think I was previously mistakenly overlooking some philosophical assumptions, but that is not a crux.
[Thanks to David Deutsch (and other Critical Rationalists on twitter), Katja Grace, and Alex Zhu, for conversations that led me to this posit.]
Is coherence overrated?
The tagline of the “utility” page on Arbital is “The only coherent way of wanting things is to assign consistent relative scores to outcomes.”
This is true as far as it goes, but to me, at least, that sentence implies a sort of dominance of utility functions. “Coherent” is a technical term, with a precise meaning, but it also has connotations of “the correct way to do things”. If someone’s theory is incoherent, that seems like a mark against it.
But it is possible to ask, “What’s so good about coherence anyway?"
The standard reply, of course, is that if your preferences are incoherent, you’re Dutch-bookable, and someone will come along to pump you for money.
But I’m not satisfied with this argument. It isn’t obvious that being Dutch-booked is a bad thing.
In Coherent Decisions Imply Consistent Utilities, Eliezer says:
Suppose I tell you that I prefer pineapple to mushrooms on my pizza. Suppose you're about to give me a slice of mushroom pizza; but by paying one penny ($0.01) I can instead get a slice of pineapple pizza (which is just as fresh from the oven). It seems realistic to say that most people with a pineapple pizza preference would probably pay the penny, if they happened to have a penny in their pocket.
After I pay the penny, though, and just before I'm about to get the pineapple pizza, you offer me a slice of onion pizza instead--no charge for the change! If I was telling the truth about preferring onion pizza to pineapple, I should certainly accept the substitution if it's free.
And then to round out the day, you offer me a mushroom pizza instead of the onion pizza, and again, since I prefer mushrooms to onions, I accept the swap.
I end up with exactly the same slice of mushroom pizza I started with... and one penny poorer, because I previously paid $0.01 to swap mushrooms for pineapple.
This seems like a qualitatively bad behavior on my part.
Eliezer asserts that this is “qualitatively bad behavior.” I think that this is biting a philosophical bullet. I think it isn't obvious that that kind of behavior is qualitatively bad.
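To make the money-pump mechanics concrete, here is a minimal Python sketch (all names and numbers are my own, hypothetical) of an agent with the circular pizza preferences from the quote being cycled by a trader:

```python
# A minimal money-pump sketch. The agent's pairwise preferences are
# circular (pineapple > mushroom, onion > pineapple, mushroom > onion),
# so a trader can walk it around the cycle, collecting a penny per loop.

# prefers[(a, b)] == True means the agent would swap b for a.
prefers = {
    ("pineapple", "mushroom"): True,
    ("onion", "pineapple"): True,
    ("mushroom", "onion"): True,
}

def offer_swap(holding, offered, price, wealth):
    """Accept the swap if the agent prefers `offered` and can pay `price`."""
    if prefers.get((offered, holding)) and wealth >= price:
        return offered, wealth - price
    return holding, wealth

wealth = 1.00  # dollars
holding = "mushroom"

# One full cycle: pay $0.01 to get pineapple, then take the two free swaps.
holding, wealth = offer_swap(holding, "pineapple", 0.01, wealth)
holding, wealth = offer_swap(holding, "onion", 0.00, wealth)
holding, wealth = offer_swap(holding, "mushroom", 0.00, wealth)

print(holding, round(wealth, 2))  # mushroom 0.99 — same slice, a penny poorer
```

The standard argument treats the ever-shrinking `wealth` as self-evidently bad; the question this post is raising is whether, from inside the agent's values, each accepted swap might nonetheless be worth it.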
As an intuition pump: In the actual case of humans, we seem to get utility not from states of the world, but from changes in states of the world. (This is one of the key claims of prospect theory). Because of this, it isn’t unusual for a human to pay to cycle between states of the world.
For instance, I could imagine a human being hungry, eating a really good meal, feeling full, and then happily paying a fee to be instantly returned to their hungry state, so that they can enjoy eating a good meal again.
This is technically a Dutch booking ("which does he prefer, being hungry or being full?"), but from the perspective of the agent’s values there’s nothing qualitatively bad about it. Instead of the Dutch-booker pumping money from the agent, he’s offering a useful and appreciated service.
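The arithmetic of that scenario can be sketched as follows, assuming (hypothetically) that the agent's value attaches to transitions between states rather than to the states themselves, in the spirit of prospect theory. All the numbers are illustrative:

```python
# A sketch of an agent that values *transitions* between world states
# rather than the states themselves (as prospect theory suggests humans do).
# Cycling hungry -> full -> hungry at a fee is a net gain for this agent,
# even though over *states* it is technically a Dutch book.

transition_value = {
    ("hungry", "full"): 10.0,  # the enjoyment of eating a good meal
    ("full", "hungry"): 0.0,   # being reset to hungry is neutral in itself
}

MEAL_PRICE = 2.0  # cost of the meal
RESET_FEE = 1.0   # fee paid to be returned to the hungry state

def one_cycle():
    """Net value the agent gets from one hungry -> full -> hungry loop."""
    gain = transition_value[("hungry", "full")] - MEAL_PRICE
    gain += transition_value[("full", "hungry")] - RESET_FEE
    return gain

print(one_cycle())  # 7.0 — each trip around the "money pump" is a win
```

On these (made-up) numbers, the agent happily pays on every loop: the "pump" is a service, not an exploit.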
Of course, we can still back out a utility function from this dynamic: instead of assigning scores to world states, we can assign them to changes from one world state to another.
But that just passes the buck one level. I see no reason in principle why an agent couldn’t prefer to rotate between different changes in the world, just as it might rotate between different states of the world.
But this also misses the central point. You can always construct a utility function that represents some behavior, however strange and gerrymandered. But if one is no longer compelled by Dutch book arguments, it’s unclear why we would want to. If coherence is no longer a desideratum, it’s no longer clear that a utility function is the natural way to express preferences.
And I wonder, maybe this also applies to agents in general, or at least the kind of learned agents that humanity is likely to build via gradient descent.
I think this matters, because many of the classic AI risk arguments go through a claim that maximization behavior is convergent. If you try to build a satisficer, there are a number of pressures for it to become a maximizer of some kind. (See this Rob Miles video, for instance.)
I think that most arguments of that sort depend on an agent acting according to an expected utility maximization framework. If utility maximization turns out not to be a good abstraction for agents in the real world, I don't know whether these arguments still go through.
I posit that straightforward maximizers are rare in the distribution of advanced AIs that humanity creates across the multiverse. And I suspect that most evolved or learned agents are better described by some other abstraction.
If not utility functions, then what?
If we accept for the time being that utility functions are a warped abstraction for most agents, what might a better abstraction be?
I don’t know. I’m writing this post in the hopes that others will think about this question and perhaps come up with productive alternative formulations. I've put some of my own half-baked thoughts in a comment.