Rationality is often informally defined as means-end reasoning or utility maximization.  This idea becomes less clear, however, when an agent has the option of modifying its own utility function.  Does rationality prescribe avoiding any change to one's current utility function, because such a change would obviously reduce expected utility as measured by the current function?  Or does it prescribe taking whatever actions result in the highest utility, in which case a change would be rational iff the new utility function yields higher expected utility given known background information about the world?

This is obviously relevant to AI alignment, where one concern is that AIs may hack their own utility functions, and another is that they may prevent humans from modifying them (or shutting them off) because of the risk to their current goals.  It's also relevant to questions of human rationality: on the one hand, we imagine that Gandhi would not take a pill that makes him want to murder people, but on the other hand, we regularly believe that unhappy people should change their own psychology and goals to be happier.


6 Answers

The informal part of your opening sentence really hurts here.  Humans don't have time-consistent (or in many cases self-consistent) utility functions.  It's not clear whether an AI could theoretically have such a thing, but let's presume it's possible.

The confusion comes in having a utility-maximizing framework to describe "what the agent wants".  If you want to change your utility function, that implies that you don't want what your current utility function says you want.  Which means it's not actually your utility function.  

You can add epicycles here - a meta-utility-function that describes what you want to want, probably at a different level of abstraction.  That makes your question sensible, but also trivial - of course your meta-utility function wants to change your utility function to more closely match your meta-goals.  But then you have to ask whether you'd ever want to change your meta-function.  And you get caught recursing until your stack overflows.
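The regress can be made literal in a few lines.  This is a toy sketch only: `endorsed` is a hypothetical stand-in for "level-n preferences count as really yours only if the level-(n+1) meta-preferences endorse them," with no base case.

```python
# Toy illustration of the meta-utility regress: each level's function is
# justified only by appeal to the level above it, so the justification
# never bottoms out.  `endorsed` is a made-up name for this post.
import sys

sys.setrecursionlimit(100)  # keep the demo fast

def endorsed(level):
    # Level-n preferences are "really yours" only if the level-(n+1)
    # meta-preferences endorse them... and so on, forever.
    return endorsed(level + 1)

try:
    endorsed(0)
    result = "bottomed out"
except RecursionError:
    result = "stack overflow"

print(result)  # -> stack overflow
```

There's no fact at the bottom of the recursion that settles the question, which is why "if you want to change it, it's not your actual utility function" is the cleaner move.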

Much simpler and more consistent to say "if you want to change it, it's not your actual utility function".

The received wisdom in this community is that modifying one's utility function is at least usually irrational. The classic source here is Steve Omohundro's 2008 paper, "The Basic AI Drives," and Nick Bostrom gives basically the same argument in Superintelligence, pp. 132-34. The argument runs roughly as follows: imagine you have an AI that is solely maximizing the number of paperclips that exist. Obviously, if it abandons that goal, there will be fewer paperclips than if it maintains that goal. And if it adds another goal, say maximizing staples, then this other goal will compete with the paperclip goal for resources, e.g. time, attention, steel, etc. So again, if it adds the staple goal, there will be fewer paperclips than if it doesn't. So if it evaluates every option by how many paperclips result in expectation, it will choose to maintain its paperclip goal unchanged. This argument isn't mathematically rigorous, and it allows that there may be special cases where changing one's goal may be useful. But the thought is that, by default, changing one's goal is detrimental from the perspective of one's current goals.
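The structure of the argument can be sketched as a toy calculation.  All the numbers below are illustrative assumptions, not anything from Omohundro or Bostrom; the point is only that an agent scoring every option by expected paperclips will never score "change my goal" highest.

```python
# Toy sketch of the Omohundro/Bostrom argument: an agent that evaluates
# every option by expected paperclips will not choose to modify its goal.
# Outcome numbers are made up purely for illustration.

def expected_paperclips(option):
    outcomes = {
        "keep_paperclip_goal": 1000,  # full effort goes to paperclips
        "add_staple_goal": 600,       # staples compete for time and steel
        "drop_paperclip_goal": 0,     # no effort goes to paperclips
    }
    return outcomes[option]

options = ["keep_paperclip_goal", "add_staple_goal", "drop_paperclip_goal"]
best = max(options, key=expected_paperclips)
print(best)  # -> keep_paperclip_goal
```

The circularity is visible here too: the comparison is only decisive because the scoring function is the current goal.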

As I said, though, there may be exceptions, at least for certain kinds of agents. Here's an example. It seems as though, at least for humans, we're more motivated to pursue our final goals directly than we are to pursue merely instrumental goals (which child do you think will read more: the one who intrinsically enjoys reading, or the one you pay $5 for every book they finish?). So, if a goal is particularly instrumentally useful, it may be useful to adopt it as a final goal in itself in order to increase your motivation to pursue it. For example, if your goal is to become a diplomat, but you find it extremely boring to read papers on foreign policy... well, first of all, I question why you want to become a diplomat if you're not interested in foreign policy, but more importantly, you might be well-served to cultivate an intrinsic interest in foreign policy papers. This is a bit risky: if circumstances change so that it's no longer as instrumentally useful, it may end up competing with your initial goals as described by the Bostrom/Omohundro argument. But it could work out that, at least some of the time, the expected value of changing your goal for this reason is positive.

Another paper to look at might be Steve Petersen's paper, "Superintelligence as Superethical," though I can't summarize the argument for you off the top of my head.

For purposes of alignment, it's probably more important that there is not going to be a known aligned utility function over all feasible plans. It's relatively easy to arrange the world in a way that's too hard to judge the value of. Within an island of more well-understood plans whose value can be judged in an aligned way, this value might have the form of expectation of a utility function. But this is less urgently salient, because naive optimization will quickly move the state of the world away from that island. Preserving a utility function doesn't help with this problem.

You can modify your utility function as part of a bargaining or precommitment strategy.

I'd argue that's (in a VNM-rational agent) not changing a utility function, but simply following it - maximizing utility via trade or prediction-of-prediction calculations.

There are probably theoretical cases where real-world agents might alter their preferences (warning: don't update too much on fiction.  anti-warning: this is a fun read.  https://www.lesswrong.com/s/qWoFR4ytMpQ5vw3FT, chapter 5).  These are not perfectly rational agents (edit: or maybe they are, but it's not clear how "utility" and "preferences" are interacting in this case).

Possibly if the goals you're pursuing in practice are an imperfect approximation of a utility function that you don't actually know (maybe you have some ideas about what kinds of thing it might value, but don't know for sure which are included or with what weighting).

Then there would be room to update your de facto utility function, based on new evidence about the content of the "true" function.
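One way to make this concrete: treat the "true" utility function as unknown, maintain a posterior over a few candidate functions, and let the de facto utility be the posterior-weighted mixture.  This is a minimal sketch under assumptions I'm adding; the candidate functions, likelihoods, and numbers are all hypothetical.

```python
# Sketch: the de facto utility is a posterior-weighted mixture over
# candidate "true" utility functions.  Evidence shifts the weights, so
# the de facto function changes without any candidate being modified.
# Candidates and likelihood numbers are illustrative assumptions.

candidates = {
    "values_art":    lambda outcome: outcome.get("art", 0.0),
    "values_health": lambda outcome: outcome.get("health", 0.0),
}
prior = {"values_art": 0.5, "values_health": 0.5}

def update(prior, likelihood):
    # Standard Bayesian update over which candidate is the true function.
    posterior = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(posterior.values())
    return {h: p / total for h, p in posterior.items()}

def de_facto_utility(weights, outcome):
    return sum(w * candidates[h](outcome) for h, w in weights.items())

# Evidence favoring "values_health" (say, reflecting on past choices).
posterior = update(prior, {"values_art": 0.2, "values_health": 0.8})
outcome = {"art": 1.0, "health": 0.0}
print(de_facto_utility(posterior, outcome))  # -> 0.2
```

On this picture, "changing your utility function" is just ordinary belief updating about what you actually value, which sidesteps the Omohundro-style objection.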

Human utility is basically a function of image recognition, which is not the sort of straightforward thing I can point to and say, "this is that." Sure, computers can do image recognition, and what they are doing is image recognition. But what we can currently describe algorithmically is only a pale shadow of the human function, as demonstrated by every reCAPTCHA everywhere.

Given this, the complex confounder is that our utility function is part of the image.

Also, we like images that move.

In sum, modifying our utility function is natural and normal, and is actually one of the clauses of our utility function. Whether that's rational depends on your definition. If you grant the above, and define rationality as self-alignment, then of course it's rational. If you ask whether changing your utility function is a "winning" move, probably not. I think it's a very lifelike move, though, and anything lacking a mobile function is fundamentally eldritch in a way that is dangerous and not good.