Selfish preferences and self-modification

by Manfred · 1 min read · 14th Jan 2015 · 24 comments


Anthropics, Decision Theory
Personal Blog

One question I've had recently is "Are agents acting on selfish preferences doomed to having conflicts with other versions of themselves?" A major motivation of TDT and UDT was the ability to just do the right thing without having to be tied up with precommitments made by your past self - and to trust that your future self would just do the right thing, without you having to tie them up with precommitments. Is this an impossible dream in anthropic problems?


In my recent post, I talked about preferences where "if you are one of two copies and I give the other copy a candy bar, your selfish desires for eating candy are unfulfilled." If you would buy a candy bar for a dollar but not buy your copy a candy bar, this is exactly a case of strategy ranking depending on indexical information.

This dependence on indexical information is not equivalent to UDT, and thus incompatible with peace and harmony.


To be thorough, consider an experiment where I am forked into two copies, A and B. Both have a button in front of them, and 10 candies in their account. If A presses the button, it deducts 1 candy from A. But if B presses the button, it removes 1 candy from B and gives 5 candies to A.

Before the experiment begins, I want my descendants to press the button 10 times (assuming candies come in units such that my utility is linear). In fact, after the copies wake up but before they know which is which, they want to press the button!
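To make the arithmetic concrete, here is a minimal sketch of the experiment's payoffs. It assumes that the order of presses doesn't matter, that utility is linear in candies, and that a copy who doesn't yet know which one it is assigns 50/50 odds to being A or B. The function names are made up for illustration.

```python
START = 10     # candies each copy begins with
COST = 1       # candies the presser loses per press
TRANSFER = 5   # candies A gains each time B presses


def final_candies(a_presses, b_presses):
    """Return (A's candies, B's candies) after the given numbers of presses."""
    a = START - COST * a_presses + TRANSFER * b_presses
    b = START - COST * b_presses
    return a, b


def ex_ante_value(presses):
    """Expected candies for a copy that doesn't know whether it is A or B,
    assuming both copies press the same number of times (50/50 odds)."""
    a, b = final_candies(presses, presses)
    return 0.5 * a + 0.5 * b


for n in (0, 1, 10):
    print(n, final_candies(n, n), ex_ante_value(n))
# 0 (10, 10) 10.0
# 1 (14, 9) 11.5
# 10 (50, 0) 25.0
```

Under these assumptions every extra press raises the ex ante expectation by 1.5 candies, which is why the pre-fork self and the not-yet-located copies all favor pressing. If only B presses ten times, `final_candies(0, 10)` gives A 60 candies and B none.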

The model of selfish preferences that is not UDT-compatible looks like this: once A and B know who is who, A wants B to press the button but B doesn't want to do it. And so earlier, I should try and make precommitments to force B to press the button.

But suppose that we simply decided to use a different model. A model of peace and harmony and, like, free love, where I just maximize the average (or total, if we specify an arbitrary zero point) amount of utility that myselves have. And so B just presses the button.

(It's like non-UDT selfish copies can make all Pareto improvements, but not all average improvements)
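A rough way to see the Pareto-versus-average distinction in this experiment, assuming linear utility in candies and looking only at B's presses (the helper names here are hypothetical):

```python
def outcome_after_b_presses(b_presses, start=10, cost=1, transfer=5):
    """Return (A's candies, B's candies) when only B presses the button."""
    return (start + transfer * b_presses, start - cost * b_presses)


def is_pareto_improvement(before, after):
    """Nobody ends up worse off, and at least one copy ends up better off."""
    pairs = list(zip(before, after))
    return all(y >= x for x, y in pairs) and any(y > x for x, y in pairs)


def is_average_improvement(before, after):
    """The mean number of candies across copies goes up."""
    return sum(after) / len(after) > sum(before) / len(before)


before = outcome_after_b_presses(0)   # (10, 10)
after = outcome_after_b_presses(1)    # (15, 9)

print(is_pareto_improvement(before, after))    # False: B loses a candy
print(is_average_improvement(before, after))   # True: average goes from 10 to 12
```

So a single press by B is exactly the kind of change that every-copy-for-themself agents turn down once they know who is who, while the averaging model takes it.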


Is the peace-and-love model still a selfish preference? It sure seems different from the every-copy-for-themself algorithm. But on the other hand, I'm doing it for myself, in a sense.

And at least this way I don't have to waste time with precommitment. In fact, self-modifying to this form of preferences is such an effective action that conflicting preferences are self-destructive. If I have selfish preferences now but I want my copies to cooperate in the future, I'll try to become an agent who values copies of myself - so long as they date from after the time of my self-modification.


If you recall, I made an argument in favor of averaging the utility of future causal descendants when calculating expected utility, based on this being the fixed point of selfish preferences under modification when confronted with Jan's tropical paradise. But if selfish preferences are unstable under self-modification in a more intrinsic way, this rather goes out the window.


Right now I think of selfish values as a somewhat anything-goes space occupied by non-self-modified agents like me and you. But it feels uncertain. On the mutant third hand, what sort of arguments would convince me that the peace-and-love model actually captures my selfish preferences?


24 comments

I am confident that, in this experiment, my B-copy would push the button, my A-copy would walk away with 60 candies, and shortly thereafter, if allowed to confer, they would both have 30. And that this would happen with almost no angst.

I'm puzzled as to why you think this is difficult. Are people being primed by fiction where they invariably struggle against their clones to create drama?

Hm, this points out to me that I could have made this post more stand-alone. The idea was that you eat the candy and experience a non-transferable reward. But let me give an example of what I mean by selfish preferences.

If someone made a copy of me and said they could either take me hang-gliding, or take my copy, I'd prefer that I go hang-gliding. Selfishly :P

Assuming we substitute something I actually want to do for hang-gliding...

("Not the most fun way to lose 1/116,000th of my measure, thanks!" say both copies, in stereo)

...and that I don't specifically want to avoid non-shared experiences, which I probably do...

("Why would we want to diverge faster, anyway?" say the copies, raising simultaneous eyebrows at Manfred)

...that's what coinflips are for!

(I take your point about non-transferability, but I claim that B-me would press the button even if it was impossible to share the profits.)

I think that's a totally okay preference structure to have (or to prefer with metapreferences or whatever).

Delicious reinforcement! Thank you, friend.

I'd throw dice and select based on the outcome. That way both of me get half a hang-glide (on average).

But then I guess I'd work well together with clones of myself, too. Quite well. Better than with most other people. Not everybody I know would. Some claimed they couldn't live together with another instance of themselves.

How is this different from the Rawlsian fix to selfishness?

If we have some UDT-incompatible selfish preferences, that's fine with me - so most likely this is so different from Rawls that it's not even a fix to selfishness.

But yeah, the peace and harmony model is quite Rawlsian, and is somewhat equivalent to the claim that if we want to self-modify, we should also act as if we self-modified in the past. (Where there is some vagueness in how to cash out 'past self', which probably makes the UDT formulation nicer).

How much do your other selves need to diverge from you for you to stop caring about them?

Obviously you still view the other one as "you" even though your brain contains a pattern that says "I am B" and the other has a pattern that says "I am A".

Can you rigorously define at what point you no longer consider the "other" one as part of you?

What if your alter ego has a long conversation with a philosopher and comes out as no longer selfish, and now wants to help the world, giving you a severe distaste and making you not want to help them any more? (You prefer to use your resources yourself rather than let them use those resources, even if they can produce more from them.)

Can you rigorously define at what point you no longer consider the "other" one as part of you?

Presumably this is like trying to solve the Sorites paradox. The best you can do is to find a mutually acceptable Schelling point, e.g. 100 grains of sand make a heap, or disagreeing on 10% or more of all decisions means you are different enough.

A gradual falling-off of concern with distance seems more graceful than suddenly going from all to nothing. It's not like the legal driving age, where there's strong practical reason for a small number of sharp cut-offs.

10% or more of all decisions

Then we have the problem of deciding what counts as a decision. Even very minor changes will invalidate a broad definition like "body movements", as most body movements will be different after the 2 diverge.

My preferred diverging point is as soon as the cloning happens. I'm open to accepting that as long as they are identical, they can cooperate, but that can be justified by pure TDT without invoking "caring for the other". But any diverging stops this; that's my Schelling point.

Do you really think your own nature that fragile?

(Please don't read that line in a judgemental tone. I'm simply curious.)

I would automatically cooperate with a me-fork for quite a while if the only "divergence" that took place was on the order of raising a different hand, or seeing the same room from a different angle. It doesn't seem like value divergence would come of that.

I'd probably start getting suspicious in the event that "he" read an emotionally compelling novel or work of moral philosophy I hadn't read.

If we raised different hands, I do think it would quickly cause us to completely diverge in terms of how many body movements are equal. That doesn't mean we would be very different, or that I'm fragile. I'm pretty much the same as I was a week ago, but my movements now are different. I was just pointing out that "decisions" isn't much better defined than the thing it was meant to define (divergence).

I would automatically cooperate

In a True Prisoner's Dilemma, or even in situations like the OP? The divergence there is that one person knows they are "A" and the other "B", in ways relevant to their actions.

Ah, I see. We may not disagree, then. My angle was simply that "continuing to agree on all decisions" might be quite robust versus environmental noise, assuming the decision is felt to be impacted by my values (i.e. not chocolate versus vanilla, which I might settle with a coinflip anyway!)

In the OP's scenario, yes, I cooperate without bothering to reflect. It's clearly, obviously, the thing to do, says my brain.

I don't understand the relevance of the TPD. How can I possibly be in a True Prisoner's Dilemma against myself, when I can't even be in a TPD against a randomly chosen human?

OP is assuming selfishness, which makes this True. Any PD is TPD for a selfish person. Is it still the obvious thing to do if you're selfish?

Yes, for a copy close enough that he will do everything that I will do and nothing that I won't. In simple resource-gain scenarios like the OP's, I'm selfish relative to my value system, not relative to my locus of consciousness.

So we have different models of selfishness, then. My model doesn't care about anything but "me", which doesn't include clones.

any diverging stops this

The trouble is, of course, that if you both predictably (say, with 98% probability) switch to defecting after one sees 'A' and the other sees 'B', you could just as easily (following some flavor of TDT) predictably cooperate.

This issue is basically the oversimplification within TDT where it treats algorithms as atomic causes of actions, rather than as a lossy abstraction from complex physical states. This is a very difficult AI problem that I'm pretending is solved for the purposes of my posts.

I agree, "as soon as the cloning happens" is an obvious Schelling point with regard to caring. However, if you base your decision to cooperate or defect on how similar the other clone is to you in following the same decision theory, then this leads to "not at all similar", resulting in defection as the dominant strategy. If instead you trust the other clone to apply TDT the way you do, then you behave in a way that is equivalent to caring even after you profess that you do not.

I don't think so. When I say I would cooperate, I mean standard Prisoner's Dilemma stuff. I don't have to care about them to do that.

The things I wouldn't care about are the kinds of situations mentioned in the OP. In a one-sided Dilemma, where the other person has no choice, TDT does not say you should cooperate. If you cared about them, then you should cooperate as long as you will lose less than they gain. In that case I would not cooperate, even though I might self-modify to cooperating now if given the choice.

I see. I understand what you mean now.

If one cares about their copies because their past self self-modified to a stable point, then what matters are the preferences of this causal ancestor. If I don't want my preferences to be satisfied if I am given a pill that makes me evil, then I will self-modify so that if one of my future copies takes the evil pill, my other future copies will not help them.

In other words, there is absolutely not one true definition here.

However, at a minimum, agents will self-modify so that copies of them with the same values and world-model, but who locate themselves at different places within that model, will sacrifice for each other.

You are just giving yourself a large incentive to lie to your alter ego if you suspect that you are diverging. That doesn't sound good.

On the original post: I don't think that it's practical to commit to something like that right now as a human. I have the same problem with TDT. I can agree that self-modifying is best, but still not do as I would wish to have precommitted. But as we're talking about cloning here anyway, we can assume that self-modification is possible, in which case the question arises whether this modification has positive expected utility. I think it does, but you seem to be trying to say that you wouldn't need to modify, as each side would stay selfish but still do what they would have preferred in the past. Why would you continue doing something that you committed to if it no longer has positive utility?

Would you pay the traveler in Parfit's hitchhiker as a selfish agent? If not, why cooperate with your alter ego after you find out that you are B? (Yes, I'm comparing this to Parfit's hitchhiker, with your commitment to press the button if you turn out to be B being analogous to a commitment to give money later. It's a little different as it's symmetrical, but the question of whether you should pay up seems isomorphic. This assumes the traveler isn't reading your mind, in which case TDT enters the picture.)