Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This is related to wireheading, utility functions, taste, and rationality. It is a series of puzzles meant to draw attention to certain tensions in notions of rationality and utility functions for embedded agents. I often skip the particulars of embedded situations, and I often show multiple sides of an argument without staking out a clear position myself. Even though I often write a utility function as if it were unique, this is shorthand --- it is of course unique only up to positive affine transformation.
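As a quick check of the affine-uniqueness remark, here is a minimal sketch showing that rescaling a utility function by $aU + b$ with $a > 0$ leaves the ranking of lotteries by expected utility unchanged. The utilities, lotteries, and constants are made-up illustrations, not anything taken from this post.

```python
# Hypothetical utilities over three outcomes; the numbers are illustrative only.
utility = {"A": 1.0, "C": 0.5, "B": 0.0}

# Two hypothetical lotteries (probability distributions over outcomes).
lotteries = {
    "mostly_A": {"A": 0.7, "C": 0.1, "B": 0.2},
    "sure_C": {"C": 1.0},
}


def expected_utility(lottery, u):
    """Probability-weighted sum of the utilities of a lottery's outcomes."""
    return sum(p * u[outcome] for outcome, p in lottery.items())


def rank(u):
    """Lottery names sorted from highest to lowest expected utility under u."""
    return sorted(
        lotteries,
        key=lambda name: expected_utility(lotteries[name], u),
        reverse=True,
    )


def affine(u, a, b):
    """Return the transformed utility a*u + b; a must be positive."""
    if a <= 0:
        raise ValueError("a must be positive to preserve the ordering")
    return {outcome: a * value + b for outcome, value in u.items()}


original_ranking = rank(utility)
transformed_ranking = rank(affine(utility, a=3.0, b=-7.0))
print(original_ranking == transformed_ranking)
```

A negative `a` would reverse the ranking, which is why the transformation has to be positive.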


Imagine you are in a situation with three outcomes: A, B, and C. You prefer A to C and C to B, and your preference ordering is transitive. Furthermore, assume that your preferences can be represented by a utility function.

I now give you a choice. You can either choose C, or you can choose to make a second decision between A and B. Obviously if this were the only thing going on, you would choose to choose between A and B and then you would choose A. However, there is something else going on. You know that if you choose to reject C and make the second choice, then an Angel will reorder your preferences so that your new preference ordering is $B \succ C \succ A$. Thus, at the second choice, you would presumably choose B. Remember, this was the worst outcome in your original preference ordering.

Now, my question is whether a rational agent should pick C, or whether it should choose to choose between A and B.


Position 1: You should pick C.

An agent is rational if it makes choices that maximize its expected utility. This utility is relative to a utility function (which is grounded in preferences). Thus, what is rational for one agent may be irrational for another if they have different utility functions.

When you are in the decision problem as described above, you are trying to maximize your utility as generated by your preferences. If you choose C, you will definitely get C. If you choose to choose, then you will definitely get B, which is the worst outcome relative to your utility function. Thus, being rational, you should choose not to choose, and take C.
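Position 1's reasoning can be sketched as follows: work out which outcome each first choice actually leads to, then score that outcome with the utility function you hold at the moment of choosing. The numeric utilities and the option names are assumptions for illustration; only the orderings (A over C over B before the Angel, and B over A afterwards) come from the puzzle.

```python
# Assumed utilities encoding the original ordering A > C > B (numbers are illustrative).
original_utility = {"A": 2, "C": 1, "B": 0}
# Assumed utilities after the Angel's reordering, under which B beats A.
reordered_utility = {"B": 2, "C": 1, "A": 0}


def outcome_of(first_choice):
    """Return the outcome that actually results from the first choice."""
    if first_choice == "take_C":
        return "C"
    # Rejecting C triggers the Angel, so the second choice is made
    # with the reordered preferences.
    return max(["A", "B"], key=reordered_utility.get)


def position_1_choice():
    """Score each first choice by the utility function held when choosing."""
    options = ["take_C", "choose_between_A_and_B"]
    return max(options, key=lambda option: original_utility[outcome_of(option)])


print(position_1_choice())  # take_C
```

The key design choice is that `original_utility` does the scoring even for outcomes that arrive after the Angel has intervened.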

Consider the following offer. I give you two options: I can either murder everyone you love and give you both $100 and a pill that makes you no longer care about them, or I can give you $10. Assume your utility is linear in money. Then, if you are willing to choose to choose in the earlier case, you have no problem with choosing to have your utility function changed, and you should be fine with having everyone you love murdered. You would never actually do this. Thus, you should choose C.


Position 2: You should choose to choose.

As a rational agent, you should maximize your utility. Even if your utility function changes, it is still yours. If you choose C you get a middle-ranked option, whereas if you choose to choose then you will choose B, which at that time will be the best outcome for you.
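Position 2's reasoning can be sketched by changing a single thing: each outcome is scored with the utility function held at the time the outcome arrives. The numbers and option names are assumptions for illustration, and the sketch also assumes the two utility scales can be compared directly, which the argument needs but does not defend.

```python
# Assumed utilities encoding the original ordering A > C > B (numbers are illustrative).
original_utility = {"A": 2, "C": 1, "B": 0}
# Assumed utilities after the Angel's reordering, under which B beats A.
reordered_utility = {"B": 2, "C": 1, "A": 0}


def resolve(first_choice):
    """Return (outcome, utility function held when the outcome arrives)."""
    if first_choice == "take_C":
        return "C", original_utility
    # The Angel has already acted by the time of the second choice.
    outcome = max(["A", "B"], key=reordered_utility.get)
    return outcome, reordered_utility


def position_2_choice():
    """Score each first choice by the utility function held at the outcome."""

    def score(option):
        outcome, utility_then = resolve(option)
        return utility_then[outcome]

    return max(["take_C", "choose_between_A_and_B"], key=score)


print(position_2_choice())  # choose_between_A_and_B
```

The same decision tree gives the opposite recommendation purely because of which utility function is used for scoring.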

Consider this in the case of choosing a snack. Say there are three options: a piece of chocolate, an apple, and a banana. Say you prefer apples to chocolate, and chocolate to bananas. Now, I give you a choice: you can either have a chocolate, or I can let you take a pill which changes your preferences to preferring bananas to chocolate, and chocolate to apples. Then I let you choose between an apple and a banana.

Naturally, you would prefer to take the pill and then choose the banana. This is the same as the abstract case above; thus you should choose to choose. QED.


My own intuition from the example still tells me to pick C. I wonder if the people who would choose to choose are confusing levels of utility functions.

This is apparent when considering the example given by the person holding position 2. They use the example to try to illustrate how one's utility function changes. I take a pill which changes my preferences. I'd reply by saying that you didn't change my utility function, you just changed my taste function (so to speak).

How would this work? Well, presumably I prefer (in a simple case) one snack to another because of the flavour of the snacks. My taste buds respond in a certain way to the chemical compounds in the snacks, and then secrete such and such chemicals into my brain. It is really these chemicals (or whatever subsequent process they trigger or of which they are a part) that I desire. Thus, the pill doesn't change my utility function, but rather my taste function.

How do we represent this? The arguments to my utility function could be outputs from other functions. Imagine that all I care about is how good the food I am eating tastes, and how good the book I am reading is. Let my taste function be $T(f)$ where $f$ is the food I am eating, and my book function be $B(b)$ where $b$ is the book I am reading. Then, my utility function has two inputs, $x$ and $y$ --- $U(x, y)$ --- where $x = T(f)$ and $y = B(b)$. If I care equally about books and food and there is some kind of comparability of the utilities, then it might be the case that $U(x, y) = x + y$.

Now, if $T$ changes, has my utility function changed?

I don't think so. My utility function is still $U(x, y) = x + y$, even if the values $x$ and $y$ take differ from case to case.
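A minimal sketch of this layering, assuming an additive top-level utility and made-up taste and book scores (the dictionaries, the numbers, and the name `best_snack` are hypothetical): the pill swaps the taste function, the preferred snack changes, and the top-level `utility` function is never touched.

```python
# Hypothetical taste scores before and after the pill (illustrative numbers).
taste_before = {"apple": 3, "chocolate": 2, "banana": 1}
taste_after = {"banana": 3, "chocolate": 2, "apple": 1}
# Hypothetical enjoyment score for the one book being read.
book_score = {"novel": 2}


def utility(x, y):
    """Top-level utility: weighs food enjoyment and book enjoyment equally."""
    return x + y


def best_snack(taste):
    """Pick the snack that maximizes the top-level utility under a taste function."""
    return max(taste, key=lambda food: utility(taste[food], book_score["novel"]))


print(best_snack(taste_before))  # apple
print(best_snack(taste_after))   # banana
```

On this reading, the pill changes an input to `utility`, not `utility` itself.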

I can see this being a problem, though. What is to stop us from doing the following? Let $U_t(x)$ be my true utility function and let $U_p(y)$ be my so-called *practical* utility function. Furthermore, let $U_p(y) = x$ so that $U_t(x) = U_t(U_p(y))$. If we agree that changing the taste function doesn't alter the utility function, then changing $U_p(y)$ shouldn't alter my utility function --- but this is all it is based on!


Is what matters that it is my utility function? If there is some other agent in the world, Atticus, and he has a utility function, I don't directly care about maximizing Atticus' utility function (I might indirectly care about trying to maximize his utility function if he is my friend or something, and we could incorporate that into my practical utility function).

Atticus cares, ultimately, only about his, and I mine. If someone were to actually change my utility function in a strong way (say, an angel), would I still be I? Or would Whispermute qua rational agent have gone out of existence?

This seems to be getting into difficult questions about identity and whatnot, which is an area away from which I had hoped to stay. Alack, I must venture forth.

If choosing to choose places me in a position in which I know I will have my utility function messed around with, and I think that having my utility function changed ends me qua rational agent, then in some sense it seems that I would cease being I were I to have my utility function changed. If this were the case, I think I would almost certainly be irrational were I to choose to choose.


Maybe for some clarity let us use a thought experiment. Suppose Maltrion is an assassin, and he serves his Master. Maltrion's utility function takes as direct input his Master's utility function: $U_{\text{Maltrion}} = U_{\text{Maltrion}}(U_{\text{Master}})$.

However, Maltrion is clever, and he has found a certain spell in his Master's cabinet that he can use to change his own utility function. He contemplates changing it to be maximized by his immediate death, which he could easily bring about thanks to his technical assassin skills. But he knows that this would greatly displease his Master.

Given that $U_{\text{Maltrion}} = U_{\text{Maltrion}}(U_{\text{Master}})$, what should he do?


Position 1: Maltrion should not use the spell. This is easily seen --- if his Master's utility $U_{\text{Master}}$ is lowered by his action, then so is Maltrion's own utility $U_{\text{Maltrion}}$, which is what is guiding Maltrion's actions. Thus, Maltrion should quite obviously not use the spell, as this would decrease his utility.


Position 2: Maltrion should use the spell. Maltrion wants to increase his own utility, and there is nothing essential about the fact that $U_{\text{Maltrion}} = U_{\text{Maltrion}}(U_{\text{Master}})$. He is in a position to change it so that $U_{\text{Maltrion}}$ only ever takes the value $k$, where $k$ is the highest $U_{\text{Maltrion}}$ could ever be *in any possible form of $U_{\text{Maltrion}}$* (suppose …). Maltrion, being a rational agent, is perfectly able to reason this through. Thus, since he wants to maximize $U_{\text{Maltrion}}$, he should use the spell.
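The two positions on Maltrion can be put side by side in a small sketch. Everything concrete here is an assumption for illustration: the payoff numbers, the simplification that Maltrion's current utility just equals his Master's, and the form of the post-spell utility. The only difference between the two choosers is which utility function scores the resulting state.

```python
# Hypothetical payoffs: the Master is displeased if Maltrion dies.
master_utility = {"serving": 10, "dead": 0}


def current_utility(state):
    """Assumed form: Maltrion's utility simply equals his Master's."""
    return master_utility[state]


def spell_utility(state):
    """Assumed post-spell utility: maximal exactly when Maltrion is dead."""
    return 100 if state == "dead" else 0


# action -> (resulting world state, utility function Maltrion holds afterwards)
actions = {
    "keep_serving": ("serving", current_utility),
    "use_spell": ("dead", spell_utility),
}


def choose_by_current_utility():
    """Position 1: score each resulting state with the utility held now."""
    return max(actions, key=lambda a: current_utility(actions[a][0]))


def choose_by_resulting_utility():
    """Position 2: score each resulting state with the utility held afterwards."""
    return max(actions, key=lambda a: actions[a][1](actions[a][0]))


print(choose_by_current_utility())    # keep_serving
print(choose_by_resulting_utility())  # use_spell
```

As with the Angel, the disagreement is not about what happens but about which function does the evaluating.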

7 comments

Nice point. I think most of the time this doesn't apply, because agents don't just try to maximize "utility" as some abstract quantity, but possess (transparently or opaquely) some function which they concretely try to increase. Utility is a device, specific to each agent, that refers to states of the world, and isn't a kind of universal brownie point that everyone likes.

On the other hand, putting "utility" as your utility is fertile ground for wireheading temptations. I define my utility to be $\infty$! There, I win.

> I can see this being a problem, though. What is to stop us from doing the following. Let $U_t(x)$ be my true utility function and let $U_p(y)$ be my so called *practical* utility function. Furthermore, let $U_p(y) = x$ so that $U_t(x) = U_t(U_p(y))$. If we agree that changing the taste function doesn't alter the utility function, then changing $U_p(y)$ shouldn't alter my utility function --- but this is all it is based on!

In the apple/chocolate/banana case, I prefer worlds in which I have a subjective feeling of good taste, so taking the pill doesn't change my preferences or utility function. In this case, I care directly about $y$, so if you change $U_p(y)$ that is going to change my preferences/utility function. It is not the case that if I can define my utility function in terms of some other function, I can just change that other function and things are still fine -- it has to be judged case by case.

In particular, with my current preferences/utility function, in the apple/chocolate/banana case, I would say "I would prefer to have a banana and the ability to find bananas tasty to having chocolate right now". I wouldn't say the corresponding statement for $U_p(y)$.

Also, general note -- when you aren't dealing with probability, "having a utility function" means "having transitive preferences about all world-histories" (or world-states if you don't care about actions or paths). In that case, it's better to stick to thinking about preferences; they are easier to work with. (For example, this comment is probably best understood from a preferences perspective, not a utility function perspective.)

Also, "Self-Modification of Policy and Utility Function in Rational Agents" looks into related issues, and my take on "Towards Interactive Inverse Reinforcement Learning" is that the problem it points at is similar in flavor to the ideas in this post.

NRW:

My personal intuition is that what is 'rational' depends exclusively on your objective function at the time you make the choice.

I may value $10, and I may also care a lot about not eating bugs; if you offer me $30 to eat a cricket along with a pill that gets rid of my sense of disgust and makes me care about money exclusively, I wouldn't take that deal, because until I take it I still want to avoid eating bugs. That I would no longer regret the decision once I had taken the pill is not that interesting. If on the other hand I dislike eating bugs, but don't value not eating bugs, well then I would happily take your offer. But these aren't two different arguments about what is 'rational'; I see them as entirely different setups.

We don't need to go as far as to posit angels, honestly; heroin is amazing (I assume). I accept that were I to try it, I would be supremely glad I did - and yet I am perfectly comfortable not trying it.

I think the more interesting question (which I personally wrestle with and have not discovered an answer to) is what to do when you don't know your true objective function, or if your true objective function fluctuates significantly over time, or if your actions/brain/conscious experience are best modeled by multiple agents with different objective functions which have more control at different times.

Just saw this informally treated on Facebook; check for additional comments. Also, sorry for the weird link, but FB linking is weird.

https://www.facebook.com/plugins/post.php?href=https%3A%2F%2Fwww.facebook.com%2Frobert.wiblin%2Fposts%2F830658254085&width=500

This seems possibly related to "maximize quantity of positive-expectation-feeling" vs "maximize rationally-predicted-expectation of positive-feeling" as expansions of "maximize utility". For instance, both of them have an "in practice" algorithm and an "in theory" algorithm which give different answers in edge cases. Also, it's a question of when to calculate utility (present or future). My dilemma is because the expectation of utility doesn't perfectly correspond to later utility, and yours is because present and future utility functions don't always resemble each other. I'm not sure how meaningful the connection is, but I thought of my dilemma when considering changing utility functions over time. Hopefully the resolution to one will help the other, then.

> Naturally, you would prefer to take the pill and then choose the banana.

I wouldn’t.