Precommitting to paying Omega.

Related to: Counterfactual Mugging, The Least Convenient Possible World

MBlume said:

"What would you do in situation X?" and "What would you like to pre-commit to doing, should you ever encounter situation X?" should, to a rational agent, be one and the same question.

Applied to Vladimir Nesov's counterfactual mugging, the reasoning is then:

Precommitting to paying $100 to Omega has expected utility of $4950 × p(Omega appears). Not precommitting has strictly less utility; therefore I should precommit to paying. Therefore I should, in fact, pay $100 in the event (Omega appears, coin is tails).
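The $4950 figure can be checked with a few lines of arithmetic. This is a minimal sketch assuming the usual payoffs of the counterfactual mugging (a fair coin, $10000 received on heads if Omega predicts you would pay on tails, $100 paid on tails); the function name and the sample probability are illustrative only.

```python
def precommit_value(p_omega, reward=10_000, cost=100, p_heads=0.5):
    """Expected dollar value, before the coin flip, of being an agent who pays on tails."""
    # Value conditional on Omega appearing: win the reward on heads, pay the cost on tails.
    value_if_omega_appears = p_heads * reward - (1 - p_heads) * cost
    return p_omega * value_if_omega_appears


# Conditional on Omega appearing (p_omega = 1), precommitting is worth $4950.
print(precommit_value(1.0))  # 4950.0

# With a hypothetical one-in-a-million chance that Omega ever appears:
print(precommit_value(1e-6))
```

An agent that never pays gets $0 in both branches, so any positive p(Omega appears) favours precommitting.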

To combat the argument that it is more likely that one is insane than that Omega has appeared, Eliezer said:

So imagine yourself in the most inconvenient possible world where Omega is a known feature of the environment and has long been seen to follow through on promises of this type; it does not particularly occur to you or anyone that believing this fact makes you insane.

My first reaction was that it is simply not rational to give $100 away when nothing can possibly happen in consequence. I still believe that, with a small modification: I believe, with moderately high probability, that it will not be instrumentally rational for my future self to do so. Read on for the explanation.

Suppose we lived in Eliezer's most inconvenient possible world:

  • Omega exists.
  • Omega has never been found untrustworthy.
  • Direct brain simulation has verified that Omega has a 100% success rate in predicting the response to its problem, thus far.
  • Omega claims that no other Omega-like beings exist (so no perverse Omegas that cancel out Omega's actions!).
  • Omega never speaks to anyone except if it is asking them for payment. It never meets anyone more than once.
  • Omega claims that actual decisions never have any consequences. It is only what you would have decided that can ever affect its actions.

Did you see a trap? Direct brain simulation instantiates precisely what Omega says does not exist, a "you" whose decision has consequences. So forget that. Suppose Omega privately performs some action for you (for instance, a hypercomputation) that is not simulable. Then direct brain simulation of this circumstance cannot occur. So just assume that you find Omega trustworthy in this world, and assume it does not itself simulate you to make its decisions. Other objections exist: numerous ones, actually. Forget them. If you find that a certain set of circumstances makes it easier for you to decide not to pay the $100, or to pay it, change the circumstances. For myself, I had to imagine knowing that the Tegmark ensemble didn't exist*. If, under the MWI of quantum mechanics, you find reasons (not) to pay, then assume MWI is disproven. If the converse, then assume MWI is true. If you find that both suppositions give you reasons (not) to pay, then assume some missing argument invalidates those reasons.

Under these circumstances, should everyone pay the $100?

No. Well, it depends what you mean by "should".

Suppose I live in the Omega world. Then prior to the coin flip, I assign equal value to my future self in the event that it is heads, and my future self in the event that it is tails. My utility function is, very roughly, the expected utility function of my future self, weighted by the probabilities I assign that I will actually become some given future self. Therefore if I can precommit to paying $100, my utility function will possess the term $4950 × p(Omega appears), and if I can only partially precommit, in other words I can arrange that with probability q I will pay $100, then my utility function will possess the term $4950 × q × p(Omega appears). So the dominant strategy is to precommit with probability one. I can in fact do this if Omega guarantees to contact me via email, or a trusted intermediary, and to take instructions thereby received as "my response", but I may have a slight difficulty if Omega chooses to appear to me in bed late one night.
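Because the term is linear in q, the best partial precommitment is always the full one. The sketch below illustrates this under the same assumed payoffs ($10000 on heads, $100 on tails, fair coin); the value of p(Omega appears) and the candidate values of q are hypothetical.

```python
def partial_precommit_value(q, p_omega, reward=10_000, cost=100, p_heads=0.5):
    """Expected value when I will pay on tails with probability q.

    Assumes, as the text does, that the whole $4950 term simply scales with q.
    """
    return q * p_omega * (p_heads * reward - (1 - p_heads) * cost)


p_omega = 0.01  # hypothetical probability that Omega ever appears
candidates = [0.0, 0.25, 0.5, 0.75, 1.0]
best_q = max(candidates, key=lambda q: partial_precommit_value(q, p_omega))
print(best_q)  # 1.0
```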

On the principle of the least convenient world, I'm going to suppose that is in fact how Omega chooses to appear to me. I'm also going to suppose that I have no tools available to me in Omega world that I do not in fact possess right now. Here comes Omega:

Hello Nathan. Tails, I'm afraid. Care to pay up?

"Before I make my decision: Tell me the shortest proof that P = NP, or the converse."

Omega obliges (it will not, of course, let me remember this proof - but I knew that when I asked).

"Do you have any way of proving that you can hypercompute to me?"

Yes. (Omega proves it.)

"So, you're really Omega. And my choice will have no other consequences?"

None. Had heads appeared, I would have predicted precisely this current sequence of events and used it to make a decision. But heads has not appeared. No consequences will ensue.

"So you would have simulated my brain performing these actions? No, you don't do that, do you? Can you prove that's possible?"

Yes. (Omega proves it.)

"Right. No, I don't want to give you $100."


What the hell just happened? Before Omega appeared, I wanted this sequence of events to play out quite differently. In fact this was my wish right up to the 't' of "tails". But now I've decided to keep the $100 after all!

The answer is that there is no equivalence between my utility function at time t, where t < timeOmega, and my utility function at time T, where timeOmega < T. Before timeOmega, my utility function contains terms from states of the world where Omega appears and the coin turns up heads; after, it doesn't. Add to that the fact that my utility function is increasing in money possessed, and my preferred action at time T changes (predictably so) at timeOmega. To formalise:

Suppose we index possible worlds with a time, t, and a state, S: a world state is then (S,t). Now let the utility function of 'myself' at time t and in world state S be denoted U_{S,t}: A_S → R, where A_S is my set of actions and R the real numbers. Then in the limit of a small time differential Δt, we can use the Bellman equation to pick an optimal policy π*: S → A_S such that we maximise U_{S,t} as U_{S,t}(π*(S)).
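For reference, the usual discrete-time form of the Bellman equation is written out below. The reward r, the discount factor γ and the transition probabilities P are the standard ingredients of that equation; the article does not name them at this point (reward and discounting come up only in the footnotes), so the mapping onto the article's utility function is an assumption: the utility of an action is read here as the bracketed quantity being maximised.

```latex
\begin{align}
V_t(S) &= \max_{a \in A_S} \Big[ r(S, a) + \gamma \sum_{S'} P(S' \mid S, a)\, V_{t+1}(S') \Big], \\
\pi^*(S) &= \arg\max_{a \in A_S} \Big[ r(S, a) + \gamma \sum_{S'} P(S' \mid S, a)\, V_{t+1}(S') \Big].
\end{align}
```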

Before Omega appears, I am in (S,t). Suppose that the action "paying $100 to Omega if tails appears" is denoted a_100. Then, obviously, a_100 is not in my action set A_S. Let "not paying $100 to Omega if tails appears" be denoted a_0. a_0 isn't in A_S either. If we suppose Omega is guaranteed to appear shortly before time T (not a particularly restricting assumption for our purposes), then precommitting to paying is represented in our formalism by taking an action a_p at (S,t) such that either:

  1. The probability of being in a state § in which tails has appeared and for which a_0 ∈ A_§ at time T is 0, or
  2. For all states § with tails having appeared, with a_0 ∈ A_§ and with non-zero probability at time T, U_{§,T}(a_0) < U_{§,T}(a_100), so that π*(§) = a_100. Note that a 'world state' S includes my brain.
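The two conditions can be phrased as a simple check over the tails states reachable after the precommitting action. This is only a sketch: the `TailsState` record, its field names and the utility numbers are hypothetical stand-ins for the formalism above, with utility measured in dollars.

```python
from dataclasses import dataclass


@dataclass
class TailsState:
    """A possible world state at time T in which the coin came up tails."""
    probability: float  # probability of reaching this state after the precommitting action
    can_refuse: bool    # True if "not paying" is in the action set of this state
    u_pay: float        # utility of paying the $100 in this state
    u_refuse: float     # utility of not paying in this state


def is_precommitment(states):
    """Return True if the reachable tails states satisfy condition 1 or condition 2."""
    # States that matter: reachable with non-zero probability and refusal still possible.
    live = [s for s in states if s.probability > 0 and s.can_refuse]
    condition_1 = not live  # refusing is never available
    condition_2 = all(s.u_refuse < s.u_pay for s in live)  # paying always preferred
    return condition_1 or condition_2


# An unmodified agent whose utility is increasing in money prefers to keep the $100.
unmodified = [TailsState(1.0, True, u_pay=-100.0, u_refuse=0.0)]
# Handing control to an intermediary removes the option to refuse.
intermediary = [TailsState(1.0, False, u_pay=-100.0, u_refuse=0.0)]

print(is_precommitment(unmodified))    # False
print(is_precommitment(intermediary))  # True
```

The first case is the preference reversal described earlier: nothing done before Omega appears has changed the fact that, once tails is announced, keeping the money scores higher.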

Then if Omega uses a trusted intermediary, I can easily carry out an action a_p = "give bank account access to intermediary and tell intermediary to pay $100 from my account to Omega under all circumstances". This counts as taking option 1 above. But suppose that option 1 is closed to us. Suppose we must take an action such that 2 is satisfied. What does such an action look like?

Firstly, brain hacks. If my utility function in state § at time T is increasing in money, then U_{§,T}(a_0) > U_{§,T}(a_100), contra the desired property of a_p. Therefore I must arrange for my brain in world-state § to be such that my utility function is not so fashioned. But by supposition my utility function cannot "change"; it is simply a mapping from world-states × possible actions to real numbers. In fact the function itself is an abstraction describing the behaviour of a particular brain in a particular world state**. If, in addition, we desire that the Bellman equation actually holds, then we cannot simply abolish the process of determining an optimal policy at some arbitrary point in time T. I propose one more desired property: the general principle of more money being better than less should not cease to operate due to a_p, as this is sure to decrease U_{S,t}(a_p) below optimum (would we really lose less than $4950?). So the modification I make to my brain should be minimal in some sense. This is, after all, a highly exceptional circumstance. What one could do is arrange for my brain to experience strong reward for a short time period after taking action a_100. The actual amount chosen should be such that the reward outweighs the time-discounted future loss in utility from surrendering the $100 (it follows that the shorter the duration of reward, the stronger its magnitude must be). I must also guarantee that I am not simply attaching a label called "reward" to something that does not actually represent reward as defined in the Bellman equation. This would, I believe, require some pretty deep knowledge of the nature of my brain which I do not possess. Add to that the fact that I do not know how to hack my brain, and in a least convenient world, this option is closed to me also***.
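The trade-off between the duration and the strength of the reward can be made concrete. The sketch below assumes a constant reward per time step, a geometric discount factor, and a loss already expressed in the same units as the reward; all three are simplifying assumptions and the numbers are hypothetical.

```python
def break_even_reward(loss, steps, discount=0.99):
    """Per-step reward at which the discounted reward stream exactly equals the loss.

    The reward attached to paying must exceed this value for paying to be preferred.
    """
    # Discounted number of steps: 1 + discount + discount**2 + ...
    discounted_steps = sum(discount ** k for k in range(steps))
    return loss / discounted_steps


# The shorter the reward lasts, the stronger it has to be.
for steps in (1, 10, 100):
    print(steps, round(break_even_reward(100.0, steps), 2))
```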

It's looking pretty grim for my expected utility. But wait: we do not simply have to increase U_{§,T}(a_100). We can also decrease U_{§,T}(a_0). Now we could implement a brain hack for this also, but the same arguments against apply. A simple solution might be to use a trusted intermediary for another purpose: give him $1000, and tell him not to give it back unless I do a_100. This would, in fact, motivate me, but it reintroduces the factor of how probable it is Omega will appear, which we were previously able to neglect, by altering the utility between time t and timeOmega. Suppose we give the intermediary our account details instead. This solves the probability issue, but there is a potential for either myself to frustrate him, a solvable problem, or for Omega to frustrate him in order to satisfy the "no further consequences" requirement. And so on: the requirements of the problem are such that only our own utility function is sacrosanct to Omega. It is through that mechanism only that we can win.

This is my real difficulty: that the problem appears to require cognitive understanding and technology that we do not possess. Eliezer may very well give $100 whenever he meets this problem; so may Cameron; but I wouldn't, probably not, anyway. It wouldn't be instrumentally rational for me, given my utility function under those circumstances, at least not unless something happens that can put the concepts they carry around with them into my head, and stop me - or rather, make it instrumentally irrational for me, in the sense of being part of a suboptimal policy - from removing those concepts after Omega appears.

However, on the off-chance that Omega~, a slightly less inconvenient version of Omega, appears before me: I hereby pledge one beer to every member of Less Wrong, if I fail to surrender my $100 when asked. Take that, obnoxious omniscient being!


*It's faintly amusing, though only faintly, that despite knowing full well that I was supposed to consider the least convenient possible world, I neglected to think of my least convenient possible world when I first tried to tackle the problem. Ask yourself the question.

**There are issues with identifying what it means for a brain/agent to persist from one world-state to another, but if such a persisting agent cannot be identified, then the whole problem is nonsense. It is more inconvenient for the problem to be coherent, as we must then answer it. I've also decided to use the Bellman equations with discrete time steps, rather than the time-continuous HJB equation, simply because I've never used the latter and don't trust myself to explain it correctly.

***There is the question: would one not simply dehack after Omega arrives announcing 'tails'? If that is of higher utility than other alternatives: but then we must have defined "reward" inappropriately while making the hack, as the reward for being in each state, together with the discounting factor, serves to fully determine the utility function in the Bellman equation.

(I've made a few small post-submission edits, the largest to clarify my conclusion)

27 comments

I am admittedly amazed that so much intellectual energy is devoted to a question that is not only an extremely improbable hypothetical but one that has absolutely no implications for our daily lives or the rest of our endeavors.

A question I wish to ask you all: why are you thinking about this subject?

I don't think that's true. I mentioned one real-world case that is very close to the hypothesised game in the other post: the Mutually Assured Destruction policy, or ultimatums in general.

First note that Omega's perfection as a predictor is not necessary. With an appropriate payoff matrix, even a 50.1% accurate Omega doesn't change the optimal strategy. (One proviso is that the method of prediction must be non-spoofable. For example, I could perhaps play Omega with a 90% success rate, but knowing that I don't have access to brain-scanning abilities, you could probably conclude that I'm using more mundane methods (like reading the responses people give on blog posts about Newcomb's paradox) and so would be able to fool me. This might not hurt my percentage much if I predict that people smart enough to do this will two-box, but it does change the optimal strategy, because you now know you've already lost no matter what.)
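How "appropriate" the payoff matrix has to be can be sketched for the mugging itself. The model below is an assumption, not something stated in the thread: on heads Omega pays out exactly when it predicts "would pay", and its prediction is right with the given accuracy whichever policy you follow.

```python
def pay_minus_refuse(accuracy, reward=10_000, cost=100, p_heads=0.5):
    """Expected value of the policy "pay on tails" minus that of "refuse"."""
    # A payer is correctly predicted (and rewarded on heads) with probability `accuracy`.
    eu_pay = p_heads * accuracy * reward - (1 - p_heads) * cost
    # A refuser is only rewarded on heads when Omega mispredicts.
    eu_refuse = p_heads * (1 - accuracy) * reward
    return eu_pay - eu_refuse


def min_reward(accuracy, cost=100, p_heads=0.5):
    """Reward above which paying is the better policy (needs accuracy > 0.5)."""
    return (1 - p_heads) * cost / (p_heads * (2 * accuracy - 1))


print(pay_minus_refuse(1.0))    # 4950.0, the perfect-predictor case
print(pay_minus_refuse(0.501))  # negative: $10000 is too small a prize at 50.1%
print(min_reward(0.501))        # roughly 50000
```

Under these assumptions a 50.1% predictor still favours paying, but only once the heads reward is around five hundred times the tails cost rather than one hundred times.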

With MAD, the situation is similar:

  • In the event that the enemy launches a nuclear attack, it is irrational (in a life-valuing sense) to destroy millions of innocent civilians when it won't help you. This corresponds to the "pay $100 when the coin comes up tails" case.

  • Prior to war, it is advantageous for the enemy to predict that you would destroy the world. If he believes that, then a first attack is a net loss for him, so he doesn't destroy your half of the world (the "win $10000" case).

The object then is to convince the enemy that you would make the irrational decision in the loss case. But you must assume an intelligent enemy, with access to whatever technology or situations that might possibly be developed / occur in the future (truth drugs? Kidnapping and fooling a decision maker under a controlled environment and seeing what they do? something we haven't even thought of?) The only way to be sure that no test will reveal you're bluffing is not to bluff.

It follows that you should convince an enemy you actually find killing innocent civilians pleasurable, and are looking for an excuse to do so.

That would seem to be a very easy thing for them to test. Unless we keep committing atrocities every now and again to fool them, they're going to work out that it's false. Even if they do believe us (or it's true), that would itself be a good argument why our leaders would want to start the war - leading to the conclusion that they should do so to get the first strike advantage, maximising their chances.

It would seem better to convince them in some way that doesn't require us to pay such a cost if possible: and to convince the enemy that we're generally rational, reasonable people except in such circumstances where they attack us.

Many countries involved in protracted disputes do commit atrocities against third parties every now and again; perhaps not for this reason, though.

The problem is that "generally rational, reasonable people" will generally remain so even if attacked. It's much easier to convince an enemy that you are irrational, to some extent. If you can hide your level of rationality, then in a game like MAD you increase your expected score and reduce your opponent's by reducing the information available to them.

One difference between MAD and the Omega mugging is that Omega is defined so as to make any such concealment useless.

ETA: This (short and very good) paper by Yamin Htun discusses the kind of irrationality I mean. Quote:

the rational players disguise themselves as irrational; they make others believe they are altruistic, thus forcing others to play cooperatively.

Substitute "anti-altruistic" for "altruistic" and this is what I was aiming at.

But that fooling can only go so far. The better your opponent is at testing your irrational mask, the higher the risk of them spotting a bluff, and thus the closer the gap between acting irrational and being irrational. Only by being irrational can you be sure they won't spot the lie.

Beyond a certain payoff ratio, the risk from being caught out lying is bigger than the chance of having to carry through. For that reason, you end up actually appointing officers who will actually carry through - even to the point of blind testing them with simulated tests and removing those who don't fire in such positions (even if it was the right choice) - and letting your opponent know and verify this as much as possible.

If I can take this back to the "agents maximising their utility" interpretation: this is then a genuine example of a brain hack, the brain in this case being the institutional decision structure of a Cold War government (let's say the Soviets). Having decided that only by massively retaliating in the possible world where America has attacked is there a win, and having realised that as currently constituted the institution would not retaliate under those circumstances, the institution modified itself so that it would retaliate under those circumstances. I find it interesting that it would have to use irrational agents (the retaliatory officers) as part of its decision structure in order to achieve this.

This points to another difference between Omega mugging and MAD: whereas in the former, it's assumed you have the chance to modify yourself in between Omega appearing and your making the decision, in the MAD case, it is deliberately arranged that retaliation is immediate and automatic (corresponding to removing the ability not to retaliate from the Soviet command structure).

Yes - it is effectively the organisational level of such a brain hack (though it would be advantageous if the officers were performing such a hack on their own brains, rather than being irrational in general - rationality in other situations is a valuable property in those with their fingers on the button.)

In the MAD case, it is deliberately arranged that retaliation is immediate and automatic

Isn't that exactly the same as the desired effect of your brain-hack in the mugging situation? Instead of removing the ability to not retaliate, we want to remove the ability to not pay. The methods differ (selecting pre-hacked / appropriately damaged brains to make the decisions, versus hacking our own), but the outcome seems directly analogous. Nor is there any further warning: the mugging situation finds you directly in the loss case (as you'd presumably be directly in the win case if the coin flip went differently) potentially before you'd even heard of Omega. Any brain-hacking must occur before the situation comes up unless you're already someone who would pay.

Isn't that exactly the same as the desired effect of your brain-hack in the mugging situation? Instead of removing the ability to not retaliate, we want to remove the ability to not pay... the mugging situation finds you directly in the loss case ... potentially before you'd even heard of Omega.

OK, so to clarify, the problem you're considering is the one where, with no preparation on your part, Omega appears and announces tails?

EDIT: Oops. Clearly you don't mean that. Do you want me to imagine a general hack we can make that increases our expected utility conditional on Omega appearing, but that we can profitably make even without having proof or prior evidence of Omega's existence?

EDIT 2: I do want to answer your question "Isn't that exactly the same as the desired effect of your brain-hack in the mugging situation?", but I'd rather wait on your reply to mine before I formulate it.

Yes, exactly. I think this post by MBlume gives the best description of the most general such hack needed:

If there is an action to which my past self would have precommited, given perfect knowledge, and my current preferences, I will take that action.

By adopting and sticking to such a strategy, I will on average come out ahead in a wide variety of Newcomblike situations. Obviously the actual benefit of such a hack is marginal, given the unlikeliness of an Omega-like being appearing, and me believing it. Since I've already invested the effort through considering the optimal route for the thought experiment though, I believe I am now in fact hacked to hardcode the future-irrational decision if it does occur.

By adopting and sticking to such a strategy, I will on average come out ahead in a wide variety of Newcomblike situations.


I believe I am now in fact hacked to hardcode the future-irrational decision if it does occur.

Here lies my problem. I would like to adopt such a strategy (or a better one if any exists), and not alter my strategy when I actually encounter a Newcomblike situation. Now in the original Newcomb problem, I have no reason to do so: if I alter my strategy so as to two-box, then I will end up with less money (although I would have difficulties proving this in the formalism I use in the article). But in the mugging problem, altering my strategy to "keep $100 in this instance only" will, in an (Omega appears, coin is tails) state, net me more money. Therefore I believe that keeping to my strategy must have intrinsic value to me, greater than that of the $100 I would lose, in order for me to keep it.

Now I can answer your question about how the MAD brain-hack and the mugging brain-hack are related. In the MAD situation, the institution's actions are "hardcoded" to occur. In the case of the mugging brain-hack, this would count as, say, wiring a device to one's brain that takes over in Omega situations. This may well be possible in some situations, but I wanted to deal with the harder problem of how to fashion the brain that, on learning it is in a "tails" state, does not then want to remove such a hack.

Now if I expect to be faced with many Omega mugging problems in the future, then a glimmer of hope appears; although "keep $100 in this instance only" may then seem to be an improved strategy, I know that this conclusion must in fact be incorrect, as whatever process I use to arrive at it is, if allowed to operate, highly likely to lose money for me in the future. In other words, this makes the problem more similar to Newcomb's problem: in the states of the world in which I make the modification, I lose money <-> in the states of the world in which I two-box, I make less money. But the problem as posed involves an Omega turning up and convincing you that this problem is the last Newcomblike problem you will ever face.

ETA: In case it wasn't clear, if I assign intrinsic value > keeping $100 to keeping my strategy, then I will surely keep my strategy. My question is: in the case of Omega appearing and my becoming convinced that I am facing my last ever Newcomblike problem, will keeping my strategy still have intrinsic value to me?

It all depends on how the hack is administered. If future-me does think rationally, he will indeed come to the conclusion that he should not pay. Any brain-hack that will actually be successful must then be tied to a superseding rational decision or to something other than rationality. If not tied to rationality, it needs to be a hardcoded response, immediately implemented, rather than one that is thought about.

There are obvious ways to set up a superseding condition: put $101 in escrow, hire an assassin to kill you if you renege, but obviously the cost from doing this now is far higher than is justified by the probability of the situation, so we need something completely free. One option is to tie it to something internally valued, e.g. you value your given word or self-honesty sufficiently that living with yourself after compromising it is worse than a negative $100 utility. (This only scales to the point where you value integrity, however: you may be able to live with yourself better after finding you're self-deluding than after murdering 15 people to prove a point.)
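A superseding condition works by making reneging cost more than paying. The sketch below treats the penalty as a dollar-equivalent disutility, whether it is forfeited escrow or the felt cost of breaking one's word; that conversion, and the function name, are assumptions made for illustration.

```python
def tails_choice(renege_penalty, cost=100.0):
    """Action chosen after tails by an agent comparing the two outcomes directly."""
    u_pay = -cost               # hand over the $100
    u_refuse = -renege_penalty  # keep the $100 but suffer the penalty
    return "pay" if u_pay > u_refuse else "refuse"


print(tails_choice(0.0))    # refuse
print(tails_choice(101.0))  # pay
```

The same comparison shows why the approach stops scaling: once the cost of following through exceeds whatever penalty integrity can supply, the choice flips back to refusing.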

Had we access to our own source code, and capacity for self-modification, we could put a hardcoded path when this decision arises. Currently we have to work with the hardware we have, but I believe our brains do have mechanisms for tying future decisions to then-irrational decisions. Making credible threats requires us to back up what we say, even to someone who we will never encounter again afterwards, so similar situations (without the absolute predictive ability) are quite common in life. I know in the past I have acted perversely against my own self-interest to satisfy a past decision / issued threat. In most cases this should be considered irrationality to be removed from myself, but I think I can reuse the same mechanism to achieve an improvement here.

Obviously I can only guess whether this will in fact work in practice. I believe it will for the $100 case, but suspect that with some of the raised stakes examples given (committing murder etc), my future self may wiggle out of the emotional trap I've set for him. This is a flaw with my brain-hacking methods however - hardcoding would still be the right thing to do if possible, if the payoff were one that I would willingly trade the cost for.

(This only scales to the point where you value integrity, however: you may be able to live with yourself better after finding you're self-deluding than after murdering 15 people to prove a point.)

This is precisely my reasoning too. It doesn't seem at all sensible to me that the principle of "acting as one would formerly have liked to have precommitted to acting" should have unbounded utility.

ETA: When you say:

Making credible threats requires us to back up what we say, even to someone who we will never encounter again afterwards, so similar situations (without the absolute predictive ability) are quite common in life. I know in the past I have acted perversely against my own self-interest to satisfy a past decision / issued threat.

Now this seems a very good point to me indeed. If we have evolved machinery present in our brains that predictably and unavoidably makes us feel good about following through on a threat and bad about not doing so - and I think that we do have that machinery - then this comes close to resolving the problem. But the point about such a mechanism is that it is tuned to have a limited effect - an effect that I am pretty sure would be insufficient to cause me to murder 15 people in the vast majority of circumstances.

It doesn't seem at all sensible to me that the principle of "acting as one would formerly have liked to have precommitted to acting" should have unbounded utility.

Mostly agreed, though I'd quibble that it does have unbounded utility, but that I probably don't have unbounded capability to enact the strategy. If I were capable of (cheaply) compelling my future self to murder in situations where it would be a general advantage to precommit, I would.

In my case, because it helps me understand self-modification and precommitment.

If your reasoning works on edge cases, you can be more confident of reasoning correctly in less difficult cases.

I must say that I'm not completely sure what the correct answer to the problem is. It is a question of the ultimate source of morality: should you do something because your past self would want you to do so? You judge for yourself; who is your past self to literally dictate your actions? If he didn't precommit, you are free to do as you will. The past self's decision is no more natural than your decision to give up the $100, as it may follow from similar considerations, rooted in an even earlier game, a perspective on preference order drawn from the assumption of even more counterfactual paths that didn't follow.

At the same time, you are not just determining your action, you are determining what sort of person you are, how you make your decisions, and that goes deeper than the specific action-disagreement.

What if you are a consequence of decisions and morality of some ancestor removed from you by 50000 generations, should you start acting according to his preference order, following from his different psychology? I don't think so. You may judge your current morality to be a mistake, one that you want to correct, using your knowledge of your past self, but if you decide not to, who else is to judge?

Thanks for commenting on my article.

I made a few implicit assumptions when I was writing it that I was not then aware of; these assumptions go directly to what you speak of, so I'll sketch them out.

The 'me' I speak of in the article is not actually me. It is an idealised 'me' that happens to also be perfectly rational. Suppose that my motive is to determine what such an idealised version will do, and then do it. To your question "should you do something because your past self would want you to do so?" the "me" of the article can only reply "if that is a value represented in my then-current utility function." Now I, myself, have to actually introspect and determine if that's a value I hold, but my idealised self just knows. If it expects to be faced with Omega situations, and it doesn't represent such a value, then my article proves that ideal-Nathan will modify itself such that it does represent such a value at the time it decides whether or not to pay Omega. Therefore I should try at all costs to hold such a value when I decide as well, right?

That's the difficulty. Do I really want to hold a super-general principle that covers all Newcomblike situations, and to keep that principle whatever may happen to me in future? Such a principle would mean that my future self would actually feel better, in the end, about killing 15 people because Omega "would have" given me the FAI recipe in other circumstances, than he would if he did not kill those people. Do I want to feel better when that happens? I don't think I do. But if I'm correct about that, then I must have made an error in my previous reasoning. After all, by assumption, ideal-Nathan has the same values as me; it just thinks better.

Where I think I went wrong is in assuming that the modification (i.e. the action a_p in my article) has no cost. I think that what I was really trying to say in my article is that taking on a Newcomb-busting value system can itself have a very high cost to one's current self, and it is worth considering very carefully if one is willing to pay that cost.

The answer is perhaps that even though you'd want to meta-precommit to preserving the will of your current self on the decisions of your future self, you (as a human) actually can't do that. You can model what your decision should've been, had you been precommitted by your past self, but this ideal decision is not what you actually want. This is just an abstract belief planted in your mind by the process that constructed it, from the past to the future, but it's not your true preference anymore, no more than the "survival of the fittest" is the utmost concern to our kind. You may in fact believe in this decision, but you are now wrong from your new perspective, and you should probably change your mind on reflection.

and you should probably change your mind on reflection.

Ideal-Nathan would not want to do so. It may seem completely irrational, but if paperclippers can not-want not to paperclip, then ideal-Nathan can not-want not to kill 15 people for no particularly consequential reason. Your reply

but this ideal decision is not what you actually want.

is true - it really is true - but it is true because I cannot with current technology radically alter myself to make it false. Ideal-Nathan can and does - unless it puts a strong disutility on such an action, which means that I myself put a strong disutility on such an action. Which I do.

[...] I cannot with current technology radically alter myself [...] Ideal-Nathan can and does - unless it puts a strong disutility on such an action, which means that I myself put a strong disutility on such an action.

That's a mistake: you are not him. You make your own decisions. If you value following the ideal-self-modifying-you, that's fine, but I don't believe that's in human nature, it's only a declarative construction that doesn't actually relate to your values. You may want to become the ideal-you, but that doesn't mean that you want to follow the counterfactual actions of the ideal-you if you haven't actually become one.

the ideal-self-modifying-you

The ideal-potentially-self modifying me. No such being exists. I know, for a fact, that I am not perfectly rational in the sense that I construe "rational" to mean. That doesn't mean that Omega couldn't write a utility function that, if maximised, would perfectly describe my actions. Now in fact I am going to end up maximising that utility function: that's just mathematics/physics. But I am structured so as to value "me", even if "me" is just a concept I hold of myself. When I talk of ideal-Nathan, I mean a being that has the utility function that I think I have, which is not the same as the utility function that I do have. I then work out what ideal-Nathan does. If I find it does something that I know for a fact I do not want to do, then I'm simply mistaken about ideal-Nathan - I'm mistaken about my own utility function. That means that by considering the behaviour of ideal-Nathan (not looking so ideal now, is he?) I can occasionally discover something about myself. In this case I've discovered:

  • I don't care about my past selves nearly as much as I thought I did
  • I place a stronger premium on not modifying myself in such a way as to find killing pleasurable than I do on human life itself.