# Ω 5

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

[Followup to: Conceptual problems with utility functions]

The previous post was me flailing my arms about wildly trying to gesture at something. After some conversations I think I have a somewhat better idea of what it is and can try to explain it somewhat more systematically. Let us see if it works.

One point I want to make clear that I agree with is that there is a difference between "valuing fairness" (for example) because you want everyone to get something and "valuing fairness" for tactical reasons. For example, suppose that you are playing the Ultimatum Game, where the goal is to divide the profit from a joint project, and each person thinks they put in more effort than the other and so deserves a larger share. Then maybe you think that the "fair outcome" is for you to get 60%, while the other person thinks that the "fair outcome" is for you to get 40%.

I think a reasonable way to deal with this scenario is: each of you plays a mixed strategy with the end result that on average, both players will get 40% (meaning that 20% of the time the players will fail to reach agreement and nobody will get anything). The nice things about this strategy are (A) it gives both players at least some money and (B) it satisfies a "meta-fairness" criterion that agents with more biased notions of fairness aren't able to exploit the system to get more money out of it.
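To check the arithmetic, here is a minimal sketch of one lottery that produces those averages. The particular mixture (each side's preferred split 40% of the time, no deal 20% of the time) and the names `LOTTERY`, `expected_shares`, and `agreement_probability` are my own assumptions for illustration; the argument above only fixes the averages, not the mechanism that produces them.

```python
from fractions import Fraction

# Hypothetical lottery over outcomes: (probability, my share, other's share).
# This specific mixture is an assumption chosen to match the stated averages.
LOTTERY = [
    (Fraction(2, 5), Fraction(3, 5), Fraction(2, 5)),  # the 60/40 split I consider fair
    (Fraction(2, 5), Fraction(2, 5), Fraction(3, 5)),  # the 40/60 split they consider fair
    (Fraction(1, 5), Fraction(0), Fraction(0)),        # no agreement, nobody gets anything
]


def expected_shares(lottery):
    """Return (my expected share, other's expected share) of the total profit."""
    mine = sum(p * a for p, a, _ in lottery)
    theirs = sum(p * b for p, _, b in lottery)
    return mine, theirs


def agreement_probability(lottery):
    """Probability that a split is agreed on (the two shares sum to the whole)."""
    return sum(p for p, a, b in lottery if a + b == 1)


mine, theirs = expected_shares(LOTTERY)
print(mine, theirs, agreement_probability(LOTTERY))  # 2/5 2/5 4/5
```

Since any agreed split hands out the whole profit, the two expected shares always sum to the probability of agreement. That is why an average of 40% each forces agreement to fail 20% of the time, whatever mixture is used to get there.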

So now in our decision-making process there are two notions of "fairness": the initial notion of fairness that made us decide that the fair split is 60/40, and the "meta-fairness" or "inexploitability" that made us sacrifice some of the other person's money so that we prefer 40/40 to 40/60. Certainly these are different. But my point is that they are not as different as they appear, and I am not sure that there is a principled way of separating your decision-making process into two components, a "utility function" and a "decision theory". Moreover, even if there is a clean way of making such a separation, it is not clear to me that "meta-fairness" is a much simpler concept than "object-level fairness". And it seems to me that the utility functions/game theory framework presumes not only that it is a simpler concept, but that it is somehow conceptually equivalent to the concept of "maximization" (which doesn't seem to be true at all).

(The example might seem deceptively simple in mathematical terms, but the difficulty in formalizing it in a utility functions framework is that you are comparing different possible worlds in which your opponent has different utility functions, and you are measuring inexploitability in terms of how many resources they take, not how their utility function is affected. So you at least need the concept of a "resource".)


> Moreover, even if there is a clean way of making such a separation, it is not clear to me that "meta-fairness" is a much simpler concept than "object-level fairness". And it seems to me that the utility functions/game theory framework presumes not only that it is a simpler concept, but that it is somehow conceptually equivalent to the concept of "maximization" (which doesn't seem to be true at all).

Well, MIRI's framework presumes that. But MIRI's framework is weird. I prefer the framework of utility functions and game theory as it's usually studied, where multiplayer games are treated as their own thing and nobody expects the answer to fall out of single player utility maximization. Your post sounds to me like acceptance of that framework.

Fair enough, maybe I don't have enough familiarity with non-MIRI frameworks to make an evaluation of that yet.

> I think a reasonable way to deal with this scenario is: each of you plays a mixed strategy with the end result that on average, both players will get 40% (meaning that 20% of the time the players will fail to reach agreement and nobody will get anything). The nice things about this strategy are (A) it gives both players at least some money and (B) it satisfies a "meta-fairness" criterion that agents with more biased notions of fairness aren't able to exploit the system to get more money out of it.

Just wanted to note that this bit reminds me of this post: https://www.lesswrong.com/posts/z2YwmzuT7nWx62Kfh/cooperating-with-agents-with-different-ideas-of-fairness

(For those who haven't seen it.)

This sounds like fairness as an instrumental value vs. fairness as a terminal value.

I agree with the matching of the concepts, but I don't think it means that there is a clear difference between instrumental and terminal values.