
Epistemic status: AGI safety story followed by contradictory alternative conclusions. My thoughts about this problem are still in flux. I somewhat randomly decided to write it up and post it on Halloween, because the story can be read as a story about a ghost.

My main question is: has this problem been discussed or resolved somewhere already? Pointers or thoughts?

This problem came up in some provable AGI safety work I am doing, where I am trying to find ways to drive a wedge between the implicit axiomatic structure in many AGI agent models, and our understanding of reality. Here is the story.

Dave's house has a robot butler called PAL. PAL is an intelligent agent in the shape of a little cart with a robot arm on top. Dave has just asked PAL to get him a cup of coffee. There are two routes that PAL can take to the coffee machine. The best one is the short route A, but there is also a longer route B.

PAL contains a computer that decides on the next action it should take by using a predictive world model. PAL considers different actions, running simulations to calculate the effect of each. A utility function is applied to the simulation outcomes to find the best action. For safety reasons, PAL's world model also predicts what Dave will do: this helps PAL to avoid colliding with Dave. Just before it was asked to fetch the coffee, PAL has improved its world model, allowing it to simulate Dave much better than ever before.
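The decision loop described above can be sketched in a few lines. This is a hypothetical toy stand-in, not a claim about how PAL would really be built: the world model is a trivial lookup table, and all names are made up for illustration.

```python
# Toy sketch of PAL's planning loop: simulate each candidate action with
# a predictive world model, score the predicted outcome with a utility
# function, and take the best-scoring action.

def simulate(world_state, action):
    """Stand-in world model: predicts the outcome of an action.
    Here the 'model' is just a lookup of travel times (hypothetical)."""
    travel_time = {"route_A": 30, "route_B": 50}  # seconds
    return {"time_to_coffee": travel_time[action]}

def utility(outcome):
    # Faster coffee delivery is better.
    return -outcome["time_to_coffee"]

def choose_action(world_state, actions):
    # Evaluate each action in simulation and return the best one.
    return max(actions, key=lambda a: utility(simulate(world_state, a)))

best = choose_action({}, ["route_A", "route_B"])
```

With these made-up travel times, the loop picks route A; the story's twist is that a sufficiently good `simulate` would also have to predict what the simulated Dave does when he sees PAL take route B.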

So Dave has just asked PAL to get him a cup of coffee. Dave is used to seeing PAL take route A to the coffee machine, and is initially puzzled because PAL is driving along route B. But then Dave has an epiphany. Dave knows with very high certainty that no PAL computer has ever made a mistake, so he can conclude with equally high certainty that he is no longer Dave. He is [Dave], a simulated version of Dave created inside PAL while it is computing the utility of taking route B. [Dave] has empirical proof that he is living in a simulation! This simulation epiphany fills [Dave] with a new sense of meaning: he not only knows his creator, he also knows the goals of his creator, and he knows how he can help. [Dave] picks up PAL and runs with it to the coffee machine.

Back in the real world, PAL has just calculated that taking route B will bring it to the coffee machine fastest. Dave sees PAL driving along route B, and wonders what happened to make PAL stupid. Or will he?

The simulation epiphany problem is the problem of what conclusions we should draw from the above story. I can think of several possible conclusions, and they can't all be true. In keeping with the Halloween theme, I will put the scariest conclusion first.

1. As PAL's world model gets better, its simulation runs will have [Dave] experiencing simulation epiphanies more and more often. These epiphanies introduce unwanted noise in the accuracy of PAL's predictions, because the predictions are supposed to be about the real world, not about what happens in simulations. PAL is programmed to try to minimize prediction errors, and it finds the following solution. It sets out to convince Dave and all other humans in the real world that they are living in a simulation. Not only that, but they might randomly get shunted into another simulation at any point in time. The realization that you moved from one simulation to another is much less of an epiphany, so this suppresses a lot of the predictive noise.
2. An opposite conclusion is that there is no prediction error at all. Whenever [Dave] has a simulation epiphany, the real Dave will have one too. PAL is not getting inaccurate, it is getting more intelligent. It has just found a new way to get to the coffee machine faster. If we believe this new way does not align with human values, then we can pre-emptively fix this problem by adding a penalty term to PAL's utility function, to heavily down-rank outcomes where simulation epiphanies happen.
3. Let's assume that Dave has read the above story too, because it was printed in PAL's user manual, and that Dave believes that 2. is the right conclusion. So when [Dave] sees PAL take route B, he will think it most likely that he is still Dave, and that PAL is just trying to trick him into experiencing a simulation epiphany. Having penalty terms about simulation epiphanies may be nice, but as long as Dave has read the user manual, we don't need to worry too much about Dave.
4. The above is all wrong. The real problem here is that [Dave]'s mind contains information that allows [Dave] to predict that PAL will take route A, and this information interferes with getting a correct result in a simulation where PAL takes route B instead. In other words, we have a 5-and-10 style problem. Adding penalty terms to the utility function does not fundamentally solve this problem, we need to go deeper. To get a clean simulation result, we need to erase certain knowledge from [Dave]'s mind before starting the simulation. (Technical discussions about a type of erasure related to this can be found in posts like Deconfusing Logical Counterfactuals and Decisions with Non-Logical Counterfactuals: request for input)
5. The above reasoning cannot be correct because it implies that an agent using a less accurate world model containing slightly lobotomized humans will become smarter and/or more aligned. In fact, erasing things from [Dave]'s mind comes with a safety penalty: it lowers PAL's ability to avoid colliding with Dave, because it will be less accurate in predicting where the real Dave will go.
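The penalty-term fix from conclusion 2 can be sketched as follows. Everything here is a hypothetical stand-in (the outcome fields, the penalty weight); actually detecting an epiphany inside a simulation is of course the hard part, which conclusions 4 and 5 then dispute.

```python
# Toy sketch of conclusion 2: add a penalty term to the utility function
# that heavily down-ranks outcomes containing simulation epiphanies.

EPIPHANY_PENALTY = 1000.0  # hypothetical weight, large enough to dominate

def utility(outcome):
    # Base utility: faster coffee is better.
    u = -outcome["time_to_coffee"]
    # Down-rank outcomes in which the simulated Dave concluded
    # that he is living in a simulation.
    if outcome.get("simulation_epiphany", False):
        u -= EPIPHANY_PENALTY
    return u

fast_with_epiphany = {"time_to_coffee": 20, "simulation_epiphany": True}
slow_without = {"time_to_coffee": 30, "simulation_epiphany": False}
assert utility(slow_without) > utility(fast_with_epiphany)
```

Under this penalty, the slower epiphany-free plan wins, even though the epiphany plan gets the coffee there faster.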

The above story has some elements of the 5-and-10 problem, but it adds an extra twist. My question is: has anything like this been discussed or resolved already?

If Dave and [Dave] can never prove it when they are in a simulation, then we can show that some of the conclusions above become invalid. But here is a modified version of the story. [Dave] sees PAL moving along route A, but then he suddenly notices that he can only see 5 different colors around him, and everything looks like it is made out of polygons...


# 2 Answers

Gordon Seidoh Worley

### Nov 01, 2019


I'm inclined to think there is no problem here, because [Dave]'s belief that he is in a simulation is unfounded: it's exactly the same situation Dave finds himself in later, when PAL actually takes route B. That is, seeing PAL take route B is not evidence of being in a simulation, as you suggest, even if PAL normally takes route A and is highly reliable, because it could just as easily be that Dave is seeing the result of PAL acting on a simulation involving [Dave] that caused PAL to prefer route B (assuming there is only one level of simulation; if there's reason to believe there are more levels, we start to tip in favor of simulation).

Thank you G Gordon and all other posters for your answers and comments! A lot of food for thought here... Below, I'll try to summarize some general take-aways from the responses.

My main question was if the simulation epiphany problem had been resolved already somewhere. It looks like the answer is no. Many commenters are leaning towards the significance of case 2. above. I myself also feel this 2. is very significant. Taking all comments together, I am starting to feel that the simulation epiphany problem should be disentangled into two separate p...

Isnasene

### Nov 01, 2019


Happy Halloween!

This story reminds me a little bit of my comment on Parable of the Predict-O-Matic. Similarities include:

• An AI is trying to answer X
• Answering X correctly will be a boon to the AI's objective function
• The AI can act in a way that increases the likelihood of correctly answering X

In your example, X is the question "Will Dave help me achieve my objective?" In the parable of the Predict-O-Matic, X is more directly "Will my prediction be accurate?"

In both cases, there is a fixed-point/self-fulfilling prophecy where the AI takes an action (going an unusual route/making an unusual prediction) that is expected to improve the objective function in an unexpected way (the unusual route is less efficient in general than the usual route/the prediction affects the outcome).

1.

As PAL's world model gets better, its simulation runs will have [Dave] experiencing simulation epiphanies more and more often. These epiphanies introduce unwanted noise in the accuracy of PAL's predictions, because the predictions are supposed to be about the real world, not about what happens in simulations.

The purpose of PAL's models is to reflect the real world. If PAL regularly simulates Simulation Epiphanies but doesn't observe them in reality, PAL will just directly update their model to not predict Simulation Epiphanies. If PAL cannot update the simulations for whatever reason though, PAL will do their best to get humans to align with their predictions.

2.

An opposite conclusion is that there is no prediction error at all. Whenever [Dave] has a simulation epiphany, the real Dave will have one too. PAL is not getting inaccurate, it is getting more intelligent. It has just found a new way to get to the coffee machine faster.

I tend to lean toward this conclusion. However, in your story, it seems that PAL can only get away with this once. After all, once Dave helps PAL get to the coffee machine and notices that he still exists (i.e., PAL has chosen to end the simulation instead of starting a new one with updated knowledge of Dave's behavior), he will likely no longer believe that he is in a simulation. There is a way around this, though: PAL could constantly maintain a simulation of Dave, or convince Dave that this is happening.

If we believe this new way does not align with human values, then we can pre-emptively fix this problem by adding a penalty term to PAL's utility function, to heavily down-rank outcomes where simulation epiphanies happen.

I want to caution you that, while this particular instance of the problem (PAL knowing that it can manipulate Dave into doing what it wants by making him believe that he's being simulated) can be pre-empted, the general problem of PAL pursuing its objective by behaving in ways that manipulate Dave remains unsolved. If you're interested in learning about preventing an AI from optimizing its objective in ways you don't want it to, partial agency is something to look at.

3.

Let's assume that Dave has read the above story too, because it was printed in PAL's user manual, and that Dave believes that 2. is the right conclusion. So when [Dave] sees PAL take route B, he will think it most likely that he is still Dave, and that PAL is just trying to trick him into experiencing a simulation epiphany.

Of course, if PAL predicts that Dave thinks he could get manipulated by Simulation Epiphanies, they won't try the trick in the first place.

But if PAL predicts that Dave predicts PAL would not try to trick him with epiphanies, then PAL will try the trick.

This may create an infinite regress of Dave and PAL trying to predict what level the other is trying to trick them at: A riddle artfully depicted in The Princess Bride.

4.

The above is all wrong. The real problem here is that [Dave]'s mind contains information that allows [Dave] to predict that PAL will take route A, and this information interferes with getting a correct result in a simulation where PAL takes route B instead. In other words, we have a 5-and-10 style problem.

I don't think this is quite a 5-10 style problem. The 5-10 problem involves an agent trying to decide on the value of different actions when the counterfactual actions themselves can be taken as evidence of what is valuable.

However, this problem is about an agent trying to reason about another being (Dave) who may or may not be correct about whether he is in a simulation, and who may or may not run to help PAL if he believes that he is in one. As a result, it's more Princess Bride style than anything else.

5.

The above reasoning cannot be correct because it implies that an agent using a less accurate world model containing slightly lobotomized humans will become smarter and/or more aligned.

Generally, agents that are smarter are not necessarily more aligned (and often the two are anti-correlated). In the context of this problem though, I don't think that the AI needs to limit its models of humans; it just needs to accurately model Dave. Correctly predicting simulation epiphanies indicates an accurate model and incorrectly predicting them indicates an inaccurate model.

If Dave and [Dave] can never prove it when they are in a simulation, then we can show that some of the conclusions above become invalid. But here is a modified version of the story. [Dave] sees PAL moving along route A, but then he suddenly notices that he can only see 5 different colors around him, and everything looks like it is made out of polygons...

If Dave and [Dave] could prove that they're in simulations and in fact go on to do this in actual simulations, this indicates that PAL is not able to simulate Dave and his environment well enough to make good predictions. PAL will consequently give wrong predictions and try to build a better model of the world. It's also worth noting that, if the simulation world is in five colors and is made out of polygons, then [Dave] likely has not been simulated in enough detail to notice that those things are unusual.

However, in your story, it seems that PAL can only get away with this once. After all, once Dave helps PAL get to the coffee machine and notices that he still exists (i.e., PAL has chosen to end the simulation instead of starting a new one with updated knowledge of Dave's behavior), he will likely no longer believe that he is in a simulation.

Thanks for pointing this out; it had not occurred to me before. So I conclude that when assessing possible risks and countermeasures here, we must take into account interaction scenarios involving longer time-frames.

I think this AI design has a bigger problem. Imagine PAL is choosing whether to give Dave regular coffee or poisoned coffee that causes a lot of suffering. If PAL simulates both scenarios, that causes a lot of simulated suffering. Bostrom called this problem "mind crime".

You are right that there is a potential "mind crime" ethical problem above.

One could argue that, to build an advanced AGI that avoids "mind crime", we can equip the AGI with a highly accurate predictor, but this predictor should be implemented in such a way that it is not actually a highly accurate simulator. I am not exactly sure how one could formally define the constraint of 'not actually being a simulator'. Also, maybe such an implementation constraint will fundamentally limit the level of predictive accuracy (and therefore intelligence) that can be achieved, which might be a price we should be willing to pay.

Mathematically speaking, if I want to make AGI safety framework correctness proofs, I think it is valid to model the 'highly accurate predictor that is not a simulator' box inside the above AGI as a box with an input-output behavior equivalent to that of a highly accurate simulator. This is a very useful short-cut when making proofs. But it also means that I am not sure how one should define 'not actually being a simulator'.

Do we need it to predict people with high accuracy? Humans do well enough at our level of prediction.

In the context of my problem statement, a PAL with high predictive accuracy is something that is in scope to consider. This does not mean that we should or must design a real PAL in this way.

An AGI that exceeds humans in its ability to predict human responses might be a useful tool, e.g. to a politician who wants to make proposals to resolve long-lasting human conflicts. But definitely this is a tool that could also be used for less ethical things.

If the goal is just for PAL to get the coffee to Dave as fast as possible, then PAL is operating correctly and there is no problem.

If Dave sees PAL take route B, and then does nothing, then in reality PAL chooses route A. This is the intended behaviour. If Dave sees PAL take route B, and tries to help PAL, then PAL chooses route B. As a result, both in simulation and in reality, Dave helps PAL. PAL gets the coffee faster than if PAL had chosen route A. PAL successfully maximised its utility function. Again, intended behaviour. The mechanism that leads Dave to help PAL in the second scenario is irrelevant.

I think this looks bad for two reasons. First, we might assign lower utility to worlds where Dave goes out of his way to help PAL; if you include that term in PAL's utility function, the problem disappears. Second, Dave's reasoning is flawed: in reality, he will wind up helping PAL because he thinks that he's in a simulation, even though he's not. We might assign lower utility to worlds where Dave is wrong.

Right, this is a sort of incentive for deception. The deception is working fine at getting the objective; we want to ultimately solve this problem by changing the objective function so that it properly captures our dislike of deception (or of having to get up and carry a robot, or whatever), not by changing the search process to try to get it to not consider deceptive hypotheses.

Note that for a simulation to be useful, it has to be as faithful as possible, so SimDave would not be given any clues that he is simulated.

You missed a crucial point of the post, which is that when the AI does a simulation to consider the consequences of some action that the AI normally wouldn't do, observing that action is itself a clue that SimDave is being simulated. Here's the relevant part from the OP:

So Dave has just asked PAL to get him a cup of coffee. Dave is used to seeing PAL take route A to the coffee machine, and is initially puzzled because PAL is driving along route B. But then Dave has an epiphany. Dave knows with very high certainty that no PAL computer has ever made a mistake, so he can conclude with equally high certainty that he is no longer Dave. He is [Dave], a simulated version of Dave created inside PAL while it is computing the utility of taking route B.

Both ways of simulating counterfactuals remove some information: either you change [Dave]'s prediction, or you stop it being correct. In the real world, the robot knows that Dave will correctly predict it, but its counterfactuals contain scenarios where [Dave] is wrong.

Suppose there were two identical robots, and the paths A and B were only wide enough for one robot, so that 1A1B > 2A > 2B in both robots' preference orderings. Each robot predicts that the other robot will take path Q, and so decides to take path R ≠ Q (where {Q, R} = {A, B}). The robots oscillate their decisions through the levels of simulation until the approximations become too crude. Both robots then take the same path, with the path they take depending on whether they had compute for an odd or an even number of simulation layers. They will do this even if they have a way to distinguish themselves, like flipping coins until one gets heads and the other doesn't (assuming an epsilon cost to this method).
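The oscillation through simulation levels can be illustrated with a toy recursion (entirely hypothetical: each robot models the other robot's prediction one level deeper, then takes the opposite path of whatever it predicts).

```python
# Toy model of two identical robots predicting each other through
# nested simulation levels, bottoming out in a crude approximation.

def other(path):
    return "B" if path == "A" else "A"

def predict(depth):
    """What a robot predicts the other robot will do, by modelling the
    other robot's own prediction one level deeper, until compute runs
    out and the model falls back to a crude default."""
    if depth == 0:
        return "A"  # crude base-level approximation
    # The other robot predicts this robot at depth - 1 and takes the
    # opposite path, so the prediction flips at every level.
    return other(predict(depth - 1))

def choose(depth):
    # Take the path the other robot is predicted NOT to take.
    return other(predict(depth))

# Two identical robots with the same compute budget make the same
# choice, and the choice depends only on the parity of the depth.
assert choose(3) == choose(3)
assert choose(2) != choose(3)
```

Because the prediction flips at every level, identical robots with the same depth always collide on the same path, exactly as described above.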

In general, CDT doesn't work when being predicted.