Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

[Epistemic status: strong intuitions that I have had for a long time, but not yet compared with other people's intuitions, and I am not sure how well I can convey my intuitions.]

So, a lot of agent foundations research seems to be built on the premise that agenthood is about maximizing the expected value of some utility function, or about having actions that bear some relation to some utility function, or something like that. I think that this premise is wrong except in special cases like where there is only one agent in the world. I don't know exactly how to explain this, so I will just say some stuff.

The Ultimatum Game is a standard example of a game in which there are many Pareto optimal outcomes, and no clear way to choose between them. So if we imagine two "perfectly rational" agents playing the Ultimatum Game against each other, what happens? I think that this question is meaningless. OK, so what happens in real life? People use their notions of fairness to resolve situations like the Ultimatum Game. Fairness is a part of human values, so essentially the answer is that an agent's values are used to resolve the ambiguity of multiple Pareto optimal outcomes.
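To make the "many Pareto optimal outcomes" claim concrete, here is a minimal sketch of a discretized Ultimatum Game. The pot size `POT` and the helper names `outcomes` and `pareto_optimal` are assumptions made up for illustration, not part of any standard formulation. It lists every reachable payoff pair and keeps the ones that no other pair dominates.

```python
POT = 10  # hypothetical pot size, in indivisible units

def outcomes(pot=POT):
    """All reachable payoff pairs (proposer, responder)."""
    # An accepted offer of `offer` units leaves the proposer with the rest.
    accepted = [(pot - offer, offer) for offer in range(pot + 1)]
    # Any rejected offer leaves both players with nothing.
    rejected = [(0, 0)]
    return accepted + rejected

def dominates(a, b):
    """True if a is at least as good as b for both players and not equal to b."""
    return a != b and all(x >= y for x, y in zip(a, b))

def pareto_optimal(points):
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

front = pareto_optimal(outcomes())
print(front)
```

Under these assumptions every accepted split survives the filter and only the rejection outcome is dropped, so Pareto optimality by itself does nothing to single out one split.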

But wait! In the classical understanding, a utility function is supposed to encode all information about an agent's values. So if there is a notion of fairness relevant to a real-life Ultimatum Game based on monetary payouts, then it supposedly means that the utility function is not just the same as the monetary payouts, and the game is not a true Ultimatum Game at all. But then what is a true Ultimatum Game? Does such a mythical beast even exist? Eliezer had to invent a fairly far-fetched scenario before he found something that he was willing to call the "true Prisoner's dilemma".

But even in the "true Prisoner's dilemma", the utility function does not appear to capture all of Eliezer's values -- he seems to still be motivated to say that "cooperate" is the right answer based on symmetry and hope, which again appear to be human values. So I propose that in fact there is no "true Prisoner's dilemma", simply because calling something a dilemma is asking you to resolve it with your own values, but your own values are allegedly encapsulated by the utility function which is subject to the dilemma.

I propose instead that agenthood is a robustness to these sorts of games, a sort of continual supply of values which are sufficient to escape from any logical argument purporting to prove that there is no determinate answer as to what action you should take. There is no "pure essence of rational agenthood" that we can simply graft onto the "utility function" of human values to make an aligned AI, because human values are not only a function on world-states to be learned but also a part of the decisionmaking process itself. This seems to suggest radically different approaches to the alignment problem than what is present so far, but I am not sure what they are yet.

12 comments

I am very skeptical of this proposal for multiple reasons:

1) The Ultimatum Game is difficult to resolve because agent A's strategy depends on agent B's strategy, which in turn depends on A's, and there isn't an obvious way to shortcut this infinite recursion. The difficulty has nothing to do with utility functions. Perfectly selfish rational agents won't ever add fairness qua fairness; at best they'll add something approximating it in order to prevent the other player from rejecting the offer or low-balling. Lastly, if we wanted to add in fairness, we could simply include it in the utility function, so that isn't a critique of utility functions.

2) The True Prisoner's Dilemma/Ultimatum Game shouldn't be dismissed. Firstly, there actually are people who are that selfish. Secondly, we can end up in situations where we care almost zero for the other players, even if not quite zero. But more generally, you're falling into The Direct Application Fallacy. These games aren't important in and of themselves, but because of what we can deduce from them about more complicated games.

3) In the True Prisoner's Dilemma, I don't see Eliezer saying that co-operate is the right answer. If I'm wrong, can you point out where he says that? The right option in the True Prisoner's Dilemma, where your decision can't affect the other player, is to defect. Calling something a dilemma doesn't mean you should resolve it with your own values. You are only supposed to import your own values when the situation doesn't specify them (see Please Don't Fight the Hypothetical).
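The claim that defecting is the right option when your decision can't affect the other player can be checked with a small sketch. The payoff numbers below are the conventional textbook ones, assumed here for illustration, and `PAYOFF`, `best_reply` and `strictly_dominates` are made-up names.

```python
# PAYOFF[(my_move, their_move)] = my payoff; "C" = cooperate, "D" = defect.
# The numbers are illustrative assumptions, not taken from the discussion.
PAYOFF = {
    ("C", "C"): 3,
    ("C", "D"): 0,
    ("D", "C"): 5,
    ("D", "D"): 1,
}

def best_reply(their_move):
    """My payoff-maximizing move, holding the other player's move fixed."""
    return max("CD", key=lambda mine: PAYOFF[(mine, their_move)])

def strictly_dominates(a, b):
    """True if move a beats move b against every move of the other player."""
    return all(PAYOFF[(a, theirs)] > PAYOFF[(b, theirs)] for theirs in "CD")

print(best_reply("C"), best_reply("D"), strictly_dominates("D", "C"))
```

With these payoffs, defecting is the best reply to either move, yet mutual defection pays each player less than mutual cooperation, which is the tension the rest of the thread argues about.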

1) The notion of a "perfectly selfish rational agent" presupposes the concept of a utility function. So does the idea that agent A's strategy must depend on agent B's, which must in turn depend on agent A's. It doesn't need to depend on anything; you can literally just do something. And that is what people do in real life. And it seems silly to call it "irrational" when the "rational" action is a computation that doesn't converge.

2) I think humanity as a whole can be thought of as a single agent. Sure, maybe you can have a person who is "approximately that selfish", but if they are playing a game against human values, there is nothing symmetrical about that. Even if you have two selfish people playing against each other, it is in the context of a world infused by human values, and this context necessarily informs their interactions.

I realize that simple games are only a proxy for complicated games. I am attacking the idea of simple games as a proxy for attacking the idea of complicated games.

3) Eliezer definitely says that when your decision is "logically correlated" with your opponent's decision then you should cooperate regardless of whether or not there is anything causal about the correlation. This is the essential idea of TDT/UDT. Although I think UDT does have some valuable insights, I think there is also an element of motivated reasoning in the form of "it would be nice if rational agents played (C,C) against each other in certain circumstances rather than (D,D), how can we argue that this is the case".

"I don't know exactly how to explain this, so I will just say some stuff."

Good attitude!

I think I get your general problem with the sorts of dilemmas/games you mentioned, but I didn't quite get how it was supposed to point to problems with the idea of utility functions. I will also say some stuff.

I agree that calling something the "Blanks dilemma" is a bit more suggestive than what I'd like. Maybe "The prisoner's game"?

Another thing. There are a handful of games where an agent who "values fairness" does better than your typical CDT agent. FDT and UDT seem to sometimes produce the result of "Here is how you can get to a higher payout equilibrium, not by terminally valuing fairness, but by being smarter and thinking more broadly."

Also, specifically on EY and True Prisoner's Dilemma, I believe he's making the particular claim, "If one was running FDT, and they knew the other person was as well, they would both choose to cooperate."

I don't understand what it means to say that an agent who "values fairness" does better than another agent. If two agents have different value systems, how can you say that one does better than another? Regarding EY and the Prisoner's Dilemma, I agree that EY is making that claim but I think he is also making the claim "and this is evidence that rational agents should use FDT".

To your first point: If two agents had identical utility functions, except for one or two small tweaks, it feels reasonable to ask "Which of these agents got more utility/actualized its values more?" This might be hard to actually formalize. I'm mostly running on the intuition that sometimes humans who are pretty similar might look at one another and say, "It seems like this other person is getting more of what they want than I am."

Fair enough. Though in this case the valuing fairness is a big enough change that it makes a difference to how the agents act, so it's not clear that it can be glossed over so easily.

Yeah, when Alice and Bob play a game, their interaction determines the outcome. But I'm not sure we should say the interaction of their values determines the outcome. What about their ability to model each other, doesn't it play a role as well? Usually that's not considered part of values...

Sure, their ability to model each other matters. Their inability to model each other also matters, and this is where non-utility values come in.

Fairness is a higher-order value -- it only kicks in if you already have agents with conflicting object-level values. Maybe the problem is that there is an infinite stack of possible meta-levels on top of any utility function.

It is not the problem, but the solution.

The solution to what?

Games can have multiple Nash equilibria, but agents still need to do something. The way they are able to do something is that they care about something other than what is strictly written into their utility function so far. So the existence of a meta-level on top of any possible level is a solution to the problem of indeterminacy of what action to take.
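As a rough illustration of the indeterminacy, here is a sketch that enumerates the pure-strategy Nash equilibria of a discretized Ultimatum Game. It assumes a hypothetical pot of `POT` units and, as a simplification, restricts the responder to threshold strategies ("accept any offer of at least `threshold`"). The function names are made up for this example.

```python
POT = 10  # hypothetical pot size, in indivisible units

def payoffs(offer, threshold, pot=POT):
    """Payoff pair (proposer, responder) for one strategy profile."""
    if offer >= threshold:
        return (pot - offer, offer)  # offer accepted
    return (0, 0)                    # offer rejected

def is_nash(offer, threshold, pot=POT):
    """True if neither player can gain by changing only their own strategy."""
    proposer_now, responder_now = payoffs(offer, threshold, pot)
    proposer_ok = all(
        payoffs(o, threshold, pot)[0] <= proposer_now for o in range(pot + 1)
    )
    responder_ok = all(
        payoffs(offer, t, pot)[1] <= responder_now for t in range(pot + 1)
    )
    return proposer_ok and responder_ok

equilibria = [
    (o, t)
    for o in range(POT + 1)
    for t in range(POT + 1)
    if is_nash(o, t)
]
print(equilibria)
```

Under these assumptions every profile in which the offer exactly meets the threshold is an equilibrium, so the equilibrium concept alone does not say which split the players should end up at.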

(Sorry about my cryptic remark earlier, I was in an odd mood)