Simulation argument meets decision theory

by pallas, 1 min read, 24th Sep 2014, 57 comments

Person X stands in front of a sophisticated computer playing the decision game Y, which allows for the following options: press the button "sim" or press "not sim". If she presses "sim", the computer will simulate X*_1, X*_2, ..., X*_1000, which are a thousand identical copies of X. All of them will face the game Y*, which, from the standpoint of each X*, is indistinguishable from Y. But the simulated computers in the games Y* don't run simulations. Additionally, we know that if X presses "sim" she receives a utility of 1, but "not sim" would only lead to 0.9. If X*_i (for i = 1, 2, ..., 1000) presses "sim" she receives 0.2, with "not sim" 0.1. For each agent it is true that she does not gain anything from the utility of another agent, despite the fact that she and the other agents are identical! Since all the agents are identical egoists facing apparently the same situation, all of them will take the same action.

Now the game starts. We face a computer and know all the above. We don't know whether we are X or one of the X*'s. Should we now press "sim" or "not sim"?

 

EDIT: It seems to me that "identical" agents with "independent" utility functions were a clumsy setup for the above question, especially since one can interpret it as a contradiction. Hence, it might be better to switch to identical egoists, where each agent only cares about the money she herself receives (linear monetary value function). If X presses "sim" she will be given $10 (else $9) at the end of the game; each X* who presses "sim" receives $2 (else $1). Each agent in the game wants to maximize the expected monetary value they themselves will hold in their own hand after the game. So, intrinsically, they don't care how much money the other copies make.
To spice things up: What if the simulation only happens a year later? Are we then able to "choose" which year it is?

So on the face of it it seems that the only accessible outcomes are:

  • original-X chooses "sim" and gets +1; all simulated copies also choose "sim" and get +0.2 (and then get destroyed?)
  • original-X chooses "not sim" and gets +0.9; no simulated copies are made

and it seems like in fact everyone does better to choose "sim" and will do so. This is also fairly clearly the best outcome on most plausible attitudes to simulated copies' utility, though the scenario asks us to suppose that X doesn't care about those.

I'm not sure what the point of this is, though. I'm not seeing anything paradoxical or confusing (except in so far as the very notion of simulated copies of oneself is confusing). It might be more interesting if the simulated copies get more utility when they choose "not sim" rather than less as in the description of the scenario, so that your best action depends on whether you think you're in a simulation or not (and then if you expect to choose "sim", you expect that most copies of you are simulations, in which case maybe you shouldn't choose "sim"; and if you expect to choose "not sim", you expect that you are the only copy, in which case maybe you should choose "sim").

I'm wondering whether perhaps something like that was what pallas intended, and the current version just has "sim" and "not sim" switched at one point...

It's tempting to say that, but I think pallas actually meant what he wrote. Basically, hitting "not sim" gets you a guaranteed 0.9 utility. Hitting "sim" gets you an expected utility slightly above 0.2, approaching 0.2 as the number of copies increases. Even though each person strictly prefers "sim" to "not sim," and a CDT agent would choose "sim," it appears that choosing "not sim" gets you more expected utility.

Edit: "not sim" has higher expected utility for an entirely selfish agent who does not know whether he is simulated or not, because his choice affects not only his utility payout but also, acausally, whether he is simulated. Of course, this depends on my interpretation of anthropics.
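To make the numbers concrete, here is a rough sketch of that calculation. It assumes one particular reading of the anthropics, which the post does not settle: after "sim" is chosen, you are equally likely to be any of the n + 1 indistinguishable agents. The symbol n (the number of copies) is my own label.

```latex
\begin{align*}
\mathbb{E}[U \mid \text{sim}]
  &= \frac{1}{n+1}\cdot 1 + \frac{n}{n+1}\cdot 0.2
   = \frac{1 + 0.2\,n}{n+1}
   \;\longrightarrow\; 0.2 \quad (n \to \infty), \\
\mathbb{E}[U \mid \text{not sim}] &= 0.9 .
\end{align*}
```

For n = 1000 the first line gives 201/1001, roughly 0.2008, which is where "about 0.2" comes from.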

Oh, I see. Nice. Preferring "not sim" in this case feels rather like falling victim to Simpson's paradox, but I'm not at all sure that's not just a mistake on my part.

Thanks for the explanation. I had no idea what was actually going on here.

This is a formal version of a real-life problem I've been thinking about lately.

Should we commit to creating ancestor-simulations in the future, where those ancestor-simulations will be granted a pleasant afterlife upon what appears to their neighbors to be death? If we do, then arguably we increase the likelihood that we ourselves have a pleasant afterlife to look forward to.

I'm pretty sure there's something wrong with this argument, but I can't seem to put my finger on what it is. It reminds me of this post, in a way.

This is pretty much like one guy destroying evidence about global warming so that everyone else should predict a lower probability of a catastrophe. It fails for the same reasons.

It seems to me that you should only do this if everyone has utility functions that are completely anthropically selfish (i.e. they only care about their own subjective experience). Otherwise, wouldn't it be cruel to intentionally simulate a world with so many unpleasant characteristics that we could otherwise remove if we weren't focused on making the simulation subjectively indistinguishable from our own world?

As such, I don't think we should commit to any such thing.

The point you raise is by far the strongest argument I know of against the idea.

However, it is a moral objection rather than a decision-theory objection. It sounds like you agree with me on the decision theory component of the idea: that if we were anthropically selfish, it would be rational for us to commit to making ancestor-simulations with afterlives. That's an interesting result in itself, isn't it? Let's go tell Ayn Rand.

When it comes to the morality of the idea, I might end up agreeing with you. We'll see. I think there are several minor considerations in favor of the proposal, and then this one massive consideration against it. Perhaps I'll make a post on it soon.

I don't actually have a rigorous answer at the moment, but let me go into what I think of as the "two-fluid model of anthropics."

The two "fluids" are indexical probability measure and anthropic measure. Indexical probability is "how likely you are to be a particular person" - it is determined by what you know about the world. Anthropic measure is magical reality fluid - it's "how much you exist." Or if we project into the future, your probability measure is how likely you are to see a certain outcome. Anthropic measure is how much that outcome will exist.

Usually these two measures correspond. We see things and things exist at about the same rate. But sometimes they diverge, and then we need a two-fluid model.

A simple example of this is the quantum suicide argument. It says: "No matter what, I'll always have some chance of surviving. From my perspective, then, I'll never die. After all, once I die, I stop having a perspective. So, let's play high-stakes Russian roulette!" There are multiple ways to frame this mistake, but the relevant one here is that it substitutes what is seen (your probability that you died) for what is real (whether you actually died).

Another case where they diverge is making copies. If I make some identical copies of you, your probability that you are some particular copy should go down as I increase the number of copies. But you don't exist any less as I make more copies of you - making copies doesn't change your anthropic measure.

Two-fluid model decision-making algorithm: follow the strategy that maximizes the final expected utility (found using anthropic measure and causal structure of the problem) for whoever you think you are (found using indexical probability measure). This is basically UDT, slightly generalized.

But (and this is why I was wrong before), this problem is actually outside the scope of my usual two-fluid model. It doesn't have a well-defined indexical probability (specifically which person you are, not just what situation you're in) to work with. It depends on your decision. We'll need to figure out the correct generalization of TDT to handle this.

Okay, so you choose as if you're controlling the output of the logical node that causes your decision. Non-anthropically you can just say "calculate the causal effect of the different logical-node-outputs, then output the one that causes the best outcome." But our generalization needs to be able to answer the question "best outcome for whom?" I would love to post this comment with this problem resolved, but it's tricky and so I'll have to think about it more / get other people to tell me the answer.

Firstly, let's assume that we deny physical continuity of personal identity, but embrace psychological continuity instead. In that case, you don't want to press the button, as each simulation has just as much claim to be the continuation of "you" as the original you, even if each instance knows whether it is in the real world or in a simulation.

The rest of this analysis will assume continuity of physical identity instead. From this perspective, pressing the button reduces your expectation of future utility, but it only does this by changing your epistemic state, rather than actually making things worse for the person who pushed the button. As an example, if I was to completely erase the strongest evidence of global warming from everyone's mind, I would reduce the expectation of a catastrophe, but not the likelihood of it occurring. Here, we've done the same, just by creating simulations so that you no longer are certain whether you are the original or a simulation. Since this is only a change of your epistemic state, you should actually push the button.

This is actually quite an important problem for building an AI as we don't want it to mess with its expectation.

I smiled when I realized why the answer isn't trivially "press sim", but that slight obfuscation is causing a lot of confused people to get downvoted.

I think there's an ambiguity here:

  • if X*_i presses "sim" and X*_i receives .2 utility, then "sim" is the dominant choice and there's nothing more to say.

  • if X*_i presses "sim" and X gets .2 utility, then each X' (X or X*_i) faces a variant of the Absent Minded Driver.

The second interpretation, though, seems to be contradicted by the OP's words:

For every agent it is true that she does not gain anything from the utility of another agent despite the fact she and the other agents are identical!

Your problem setup contains a contradiction. You said that X and X*_i are identical copies, and then you said that they have different utility functions. This happened because you defined the utility function over the wrong domain; you specified it as (world-history, identity)=>R when it should be (world-history)=>R.

Here is how I interpreted the problem: it's not that identical agents have different utility functions, it's just that different things happen to them. In reality, what's behind the door is behind the door, while the simulation rewards X with something else. X is only unaware of whether or not he's in a simulation before he presses the button. Obviously, once he actually receives the utility he can tell the difference. The fact that nobody else has stated this makes me unsure, though. OP, can you clarify a little bit more?

Yes, this is how I view the problem as well.

If that is the only way that utility functions can be defined then it means that anthropic egotism is incoherent.

What's the point of utility functions if you can't even in principle know their value for the universe you're actually in? Utility functions are supposed to guide decisions. A utility function that can't be approximated, even a little, even with infinite computing power, can't be linked to a decision theory or used in any other way.

I'm generally inclined to agree with you because there are generally a lot of issues that come up with anthropics, but in order to drop the matter altogether you would need to genuinely dissolve the question.

The steelman response to your point is this: For each possible strategy you could choose, you can evaluate the probability of which "you" you actually are. You can then evaluate the utility values conditional on each possible self, and calculate the expected value over the probability distribution of selves. As such, it is clearly possible to approximate and calculate utilities for such functions, and use them to make decisions.

The question is not whether or not you can do the calculations, the question is whether or not those calculations correspond to something meaningful.

A simpler version of the original post is this. Let there be a single, consistent utility function shared by all copies of the agent (X and all Xi). It assigns these utility values:

  1. X chooses "sim", and then N instances of Xi choose "sim" and 1000-N instances choose "don't sim" -> 1.0 + 0.2N + 0.1(1000-N)
  2. X chooses "don't sim", no Xi gets created -> 0.9

Of course, the post's premise is that the only actually possible universe in category 1 is that where all 1000 Xi instances choose "sim" (because they can't tell if they're in the simulation or not), so the total utility is then 1 + 0.2*1000 = 201.
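As a quick check of that arithmetic, here is a minimal sketch of the shared utility function described above. The function and constant names are my own, and exact fractions are used only to avoid floating-point noise.

```python
from fractions import Fraction

NUM_COPIES = 1000  # number of simulated copies, taken from the post


def total_utility(x_presses_sim: bool, copies_pressing_sim: int = 0) -> Fraction:
    """Sum of payoffs over X and all copies, per the two cases listed above."""
    if not x_presses_sim:
        # Case 2: no copies are created, so only X is paid.
        return Fraction(9, 10)
    if not 0 <= copies_pressing_sim <= NUM_COPIES:
        raise ValueError("copies_pressing_sim must be between 0 and NUM_COPIES")
    # Case 1: X gets 1; each copy gets 0.2 for "sim" and 0.1 for "don't sim".
    n = copies_pressing_sim
    return 1 + Fraction(2, 10) * n + Fraction(1, 10) * (NUM_COPIES - n)


print(total_utility(True, NUM_COPIES))  # 201
print(total_utility(False))             # 9/10
```

Note that this sums utility across agents, which is exactly the aggregation the original post says the egoistic agents do not care about.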

This is a simple demonstration of TDT giving the right answer which maximizes the utility ("sim") while CDT doesn't (I think?)

What didn't make sense to me was saying X and Xi somehow have "different" utility functions. Maybe this was just confusion generated by imprecise use of words, and not any real difference.

The post then says:

For every agent it is true that she does not gain anything from the utility of another agent despite the fact she and the other agents are identical!

I'm not sure if this is intended to change the situation. Once you have a utility function that gives out actual numbers, you don't care how it works on the inside and whether it takes into account another agent's utility or anything else.

The idea is that they have the same utility function, but the utility function takes values over anthropic states (values of "I").

U(I am X and X chooses sim) = 1
U(I am Xi and Xi chooses sim) = 0.2 etc.

I don't like it, but I also don't see an obvious way to reject the idea.

Thanks for mentioning this. I know this wasn't put very nicely.
Imagine you were a very selfish person X only caring about yourself. If I make a really good copy of X, call it X', which is then placed 100 meters away from X, then this copy X' only cares about the spatiotemporal dots of what we define as X'. Both agents, X and X', are identical if we formalize their algorithms incorporating indexical information. If we don't do that then a disparity remains, namely that X is different from X' in that, intrinsically, X only cares about the set of spatiotemporal dots constituting X. The same goes for X' accordingly. But this semantic issue doesn't seem to be relevant for the decision problem itself. The kind of similarity that is of interest here seems to be the one that determines similar behavior in such games. (Probably you could set up games where the non-indexical formalizations of the agents X and X' are relevantly different; I merely claim that this game is not one of them.)

It should be (world-history, identity)=>R. Different agents have different goals, which give different utility values to actions.

You've then incorporated identity twice: once when you gave each agent its own goals, and again inside of those goals. If an agent's goals have a dangling identity-pointer inside, then they won't stay consistent (or well-defined) in case of self-copying, so by the same argument which says agents should stop their utility functions from drifting over time, it should replace that pointer with a specific value.
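A toy sketch of what "replacing the pointer with a specific value" could look like. Everything here is hypothetical illustration: the `World` type, the goal of being "king", and the agent names "original" and "copy" are my own assumptions, not anything specified in the post.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class World:
    king: str  # name of whoever ends up on top in this world-history


def indexical_utility(world: World, me: str) -> int:
    """Goal with a dangling identity pointer: "I am king"."""
    return 1000 if world.king == me else 0


def bind_identity(name: str) -> Callable[[World, str], int]:
    """Return the same goal with the pointer fixed to `name` before copying."""
    def utility(world: World, me: str) -> int:
        # `me` is deliberately ignored: the goal no longer depends on who asks.
        return 1000 if world.king == name else 0
    return utility


world = World(king="original")
bound = bind_identity("original")

# With the dangling pointer, the original and its copy rate the same world differently.
print(indexical_utility(world, "original"), indexical_utility(world, "copy"))  # 1000 0
# After binding, both rate it the same way.
print(bound(world, "original"), bound(world, "copy"))  # 1000 1000
```

The bound version is a function of the world-history alone, which is the (world-history)=>R form argued for above.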

So, in other words: If I am D and all I want is to be king of the universe, then before stepping into a copying machine I should self-modify so that my utility function will say "+1000 if D is king of the universe" rather than "+1000 if I am king of the universe", because then my copy D2 will have a utility function of "+1000 if D is king of the universe", and that maximises my chances of being king of the universe.

That is what you mean, right?

I guess the anthropic counter is this: What if, after stepping into the machine, I will end up being D2 instead of being D!? If I was to self-modify to care only about D then I wouldn't end up being king of the universe, D would!

The agent, and the utility function's implementation in the agent, are already part of the world and its world-history. If two agents in two universes cannot be distinguished by any observation in their universes, then they must exhibit identical behavior. I claim it makes no sense to say two agents have different goals or different utility functions if they are physically identical.

There is a difference between X and Xi: the original X can choose to simulate copies of herself, which exist in the world_history and are legitimate subjects to assign utility to.

A copy X_i can't create further copies (pressing "sim" does nothing in the simulation), so her utility for the action is different.

The non-anthropic version is this:

X self-modifies to ensure that any copy of X (i.e. any X*) will only care about X and not X*, presses "sim", and everyone "gets" 1 utility.

It seems like "sim" is the strictly dominant action for X and all X*. Thus we should always press "sim". The more interesting question would be what would happen if the incentives for pressing "sim" were reversed for the agents (i.e., the payoff for an agent choosing "not sim" exceeded "sim"). Then we'd have a cool mixed strategy problem.

If you press Sim, you get 0.2 utilons, but if you press No Sim, you get 0.9. Isn't that interesting?

Either I'm missing something or it isn't interesting. If you aren't a Sim, you get 1 for Sim and .9 for not Sim. So your best play is Sim. If you are a Sim, you get .2 for Sim and .1 for not Sim, so your best play is Sim. Regardless of the piece of information you lack, your best play is Sim.
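The dominance argument can be checked mechanically. This is a minimal sketch using the payoffs from the post; the table layout and function name are my own. It holds the state (original or simulated) fixed and does not model whether your action changes which state you are likely to be in.

```python
# Payoffs from the post, indexed as PAYOFF[state][action].
PAYOFF = {
    "original":  {"sim": 1.0, "not sim": 0.9},
    "simulated": {"sim": 0.2, "not sim": 0.1},
}


def strictly_dominates(a: str, b: str) -> bool:
    """True if action `a` pays strictly more than `b` in every state."""
    return all(row[a] > row[b] for row in PAYOFF.values())


print(strictly_dominates("sim", "not sim"))  # True
print(strictly_dominates("not sim", "sim"))  # False
```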

The question is whether this is analogous to Newcomb's problem. After all, one can make the same argument you make: that "no matter what the contents of the box, taking both boxes is the dominant strategy."

I agree. It seems to me that the special feature of Newcomb's Problem is that actions "influence" states, and that this is the reason why the dominance principle alone doesn't give the right answer. The same applies to this game. Your action (sim or not sim) determines the probability of which agent you have been all along and therefore "influences" the state of the game, that is, whether you are X or X*. Many people dislike this use of the word "influence" but I think there are some good reasons in favour of a broader use of it (e.g. quantum entanglement).

Actually, when working through it, it looks like pre-nap self was wrong! (Okay no distancing, I was wrong :P). I still wrote up a big comment though, you'll see what I neglected before.

Short counterargument - if rather than a mild reward, the simulated copies got a horrible punishment, would that change whether you picked Sim when you might be one of those copies? Then because of how utility works the simulated copies always matter.

Actually, upon consideration, I will endorse answering "Sim." Will explain in a top level comment after nap. But I certainly agree that the resemblance to Newcomb's problem is why this is interesting.

EDIT: Whoops, I was wrong.

Thanks for pointing this out. I (clearly) missed the thinly veiled Newcomb-likeness of this puzzle.

It is analogous if and only if anthropic utility is coherent and reasonable.

If you decide not to press "sim", you know that there are no simulations. It's impossible for there to be an original who presses "sim" only for the simulations to make different decisions. You're the original and will leave with 0.9.

If you decide to press "sim", you know that there are 1000 simulations. You've only got a 1 in 1001 chance of being the original. Your expected utility for pressing the button is slightly more than 0.2.

Oh! I got it. Thanks for patiently sticking this out with me!

Isn't this another case of the halting problem in disguise? When the computer simulates you, it also simulates your attempt to figure out what the computer would do.

I don't follow. The OP only describes one computer simulating people, and it doesn't care if they reach a decision or not. It just performs fixed actions if they do. For a given decision, everyone by assumption already knows what the computer will do. (I assumed that each X* defined zero utility to include non-termination of the decision procedure, though I doubt the number matters.) Perhaps for this reason, my own decision procedure terminates quickly and should be easy to simulate here.

Your utility calculation determines the result of each button, and therefore which button you will press. But the likelihood of being in a simulation determines the result of your utility calculation. And which button you press determines (via the computer simulating you or not) the likelihood of being in a simulation. So your utility calculation is indirectly trying to determine its own result.

Just do it this way:

  • Assume you pick "sim" => calculate the probability of being a simulation conditional on picking "sim" => calculate the expected utility conditional on picking "sim" and on the calculated probabilities.
  • Assume you pick "don't sim" => calculate the probability of being a simulation conditional on picking "don't sim" => calculate the expected utility conditional on picking "don't sim" and on the calculated probabilities.

Then just pick whichever of the two has the higher expected utility. No infinite regress there!
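The two-step procedure can be sketched as follows. This assumes uniform self-location (you are equally likely to be any of the agents that exist given the action), which is an anthropic assumption rather than something the post fixes; the names are my own.

```python
from fractions import Fraction

NUM_COPIES = 1000  # from the post
PAYOFF = {
    ("original", "sim"): Fraction(1),
    ("original", "don't sim"): Fraction(9, 10),
    ("simulated", "sim"): Fraction(1, 5),
    ("simulated", "don't sim"): Fraction(1, 10),
}


def p_simulated(action: str) -> Fraction:
    """P(I am a simulation | everyone picks `action`), under uniform self-location."""
    copies = NUM_COPIES if action == "sim" else 0  # copies exist only if X picks "sim"
    return Fraction(copies, copies + 1)


def expected_utility(action: str) -> Fraction:
    """Expected payoff of `action`, weighting each possible self by its probability."""
    p = p_simulated(action)
    return (1 - p) * PAYOFF[("original", action)] + p * PAYOFF[("simulated", action)]


best = max(("sim", "don't sim"), key=expected_utility)
for action in ("sim", "don't sim"):
    print(action, expected_utility(action))
print("pick:", best)
```

Each expected utility is computed once, conditional on a hypothesized action, so nothing in the calculation refers back to its own result.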

I guess he's positing that you yourself might simulate the computer in order to figure out what happens.

You're right, though; I don't see any reason to actually do that, because you already have a sufficient specification to work out the consequences of all of your available strategies for the problem.

Problems of that nature are pretty easy to resolve. For example:

You have five seconds to make your decision; if you run out of time, the computer chops your head off.

Assuming the subject doesn't want to get his head chopped off, then you're no longer asking the question "what does decision theory say you should do", you're asking "what does decision theory say you should do, given that certain types of analysis to determine what decision is the best are not allowed". Such a question may provide an incentive for the person sitting there in front of a homicidal computer, but doesn't really illuminate decision theory much.

Also, the human can't avoid getting his head chopped off by saying "I'll just not make any decisions that trigger the halting problem"--trying to determine if a line of reasoning will trigger the halting problem would itself trigger the halting problem. You can't think of this as "either the human answers in a split second, or he knows he's doing something that won't produce an answer".

(Of course, the human could say "I'll just not make any decisions that are even close to the halting problem", and avoid triggering the halting problem by also avoiding a big halo of other analyses around it. If he does that, then my first objection is even worse.)

but doesn't really illuminate decision theory much.

I don't know about that. The study of making decisions under significant constraints (e.g. time) looks very useful to me.

You're the one who brought computational constraints into the problem, not me. In the abstract sense, a decision-theoretically optimal agent has to be able to solve the halting problem in order to be optimal.

If we start to consider real-world constraints, such as being unable to solve the halting problem, then real-world constraints like having a limit of five seconds to make a decision are totally reasonable as well.

As for how to avoid getting your head chopped off, it's pretty easy; just press a button within five seconds.

If we start to consider real-world constraints, such as being unable to solve the halting problem

What? Being unable to solve the halting problem is a theoretical constraint, not a real-world constraint.

If it's purely theoretical then why can't I have a hypercomputer? What's wrong with simply solving the halting problem by using an oracle, or by running a Turing machine for infinitely many steps before I make my decision?

If I can't have infinite time, then I might as well have 5 seconds.

If it's purely theoretical then why can't I have a hypercomputer? What's wrong with simply solving the halting problem by using an oracle, or by running a Turing machine for infinitely many steps before I make my decision?

You're asking the same question three times.

Anyway, an oracle can determine whether a program in a Turing machine can halt. It can't determine whether it itself can halt.

Any attempt to use an oracle could lead to X predicting Y who tries to predict X using an oracle. That can be equivalent to the oracle trying to determine whether it itself can halt.

If I can't have infinite time, then I might as well have 5 seconds.

This is of course true, but it just means that both finite time and 5 seconds are bad.

OK, I think I've found a source of confusion here.

There are two fundamentally different questions one could ask:

  • what is the optimal action for X/X* to perform?
  • what computations should X/X* perform in order to work out which action ve should perform?

The first question is the standard decision-theoretic question, and in that context the halting problem is of no relevance because we're solving the problem from the outside, not from the inside.

On the other hand, there is no point to taking the inside or "embedded" view unless we specifically want to consider computational or real-world constraints. In that context, the answer is that it's pretty stupid for the agent to run a simulation of itself because that obviously won't work.

Any decision-making algorithm in the real world has to be smart enough not to go into infinite loops. Of course, such an algorithm won't be optimal, but it would be very silly to expect it to be optimal except in relatively easy cases.

I think the issue may be that the "egoistic" utility functions are incoherent in this context, because you're actually trying to compare the utility functions of two different agents as if they were one.

Let's say, for example, that X is a paperclip maximiser who gets either 10 paperclips or 9 paperclips, and each X* is a human who either saves 2 million lives or 1 million lives.

If you don't know whether you're X or X*, how can you compare 10 paperclips to 2 million lives?
