Epistemic status: AGI safety story followed by contradictory alternative conclusions. My thoughts about this problem are still in flux. I somewhat randomly decided to write it up and post it on Halloween, because the story can be read as a story about a ghost.
My main question is: has this problem been discussed or resolved somewhere already? Pointers or thoughts?
This problem came up in some provable AGI safety work I am doing, where I am trying to find ways to drive a wedge between the implicit axiomatic structure in many AGI agent models, and our understanding of reality. Here is the story.
Dave's house has a robot butler called PAL. PAL is an intelligent agent in the shape of a little cart with a robot arm on top. Dave has just asked PAL to get him a cup of coffee. There are two routes that PAL can take to the coffee machine. The best one is the short route A, but there is also a longer route B.
PAL contains a computer that decides on the next action it should take by using a predictive world model. PAL considers different actions, running simulations to calculate the effect of each. A utility function is applied to the simulation outcomes to find the best action. For safety reasons, PAL's world model also predicts what Dave will do: this helps PAL to avoid colliding with Dave. Just before it was asked to fetch the coffee, PAL improved its world model, allowing it to simulate Dave much better than ever before.
So Dave has just asked PAL to get him a cup of coffee. Dave is used to seeing PAL take route A to the coffee machine, and is initially puzzled because PAL is driving along route B. But then Dave has an epiphany. Dave knows with very high certainty that no PAL computer has ever made a mistake, so he can conclude with equally high certainty that he is no longer Dave. He is [Dave], a simulated version of Dave created inside PAL while it is computing the utility of taking route B. [Dave] has empirical proof that he is living in a simulation! This simulation epiphany fills [Dave] with a new sense of meaning: he not only knows his creator, he also knows the goals of his creator, and he knows how he can help. [Dave] picks up PAL and runs with it to the coffee machine.
Back in the real world, PAL has just calculated that taking route B will bring it to the coffee machine fastest. Dave sees PAL driving along route B, and wonders what happened to make PAL stupid. Or will he?
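The decision loop in the story can be made concrete with a small toy model. The sketch below is only an illustration under made-up assumptions: the travel times, the `simulate` function, and the rule that [Dave] has an epiphany whenever PAL deviates from the route he expects are all hypothetical, and are not taken from any real agent design.

```python
# Toy model of PAL's planner. All numbers and names are illustrative assumptions.

# Assumed travel times in seconds when PAL drives on its own.
DRIVE_TIME = {"A": 10.0, "B": 25.0}
# Assumed time when a human picks PAL up and runs to the coffee machine.
CARRY_TIME = 6.0
# The route that the simulated Dave expects PAL to take, from past experience.
EXPECTED_ROUTE = "A"


def simulate(route, dave_models_pal=True):
    """Simulate one rollout; return (time_to_coffee, epiphany_happened)."""
    # [Dave] is surprised when PAL deviates from the route he expects.
    surprised = dave_models_pal and route != EXPECTED_ROUTE
    if surprised:
        # [Dave] concludes he is simulated and carries PAL to the machine.
        return CARRY_TIME, True
    return DRIVE_TIME[route], False


def utility(outcome):
    """Faster coffee is better, so utility is the negative travel time."""
    time_to_coffee, _epiphany = outcome
    return -time_to_coffee


def choose_route(dave_models_pal=True):
    """Pick the route whose simulated outcome has the highest utility."""
    return max(DRIVE_TIME, key=lambda r: utility(simulate(r, dave_models_pal)))


if __name__ == "__main__":
    print(choose_route(dave_models_pal=False))  # older, cruder model of Dave
    print(choose_route(dave_models_pal=True))   # improved model of Dave
```

Under these assumptions, the cruder model of Dave selects route A, while the improved model selects route B, because the only rollout in which [Dave] carries PAL is the one where PAL does something unexpected. Whether the real Dave would react the same way is exactly what the toy model cannot tell us.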
The simulation epiphany problem is the problem of what conclusions we should draw from the above story. I can think of several possible conclusions, and they can't all be true. In keeping with the Halloween theme, I will put the scariest conclusion first.
1. As PAL's world model gets better, its simulation runs will have [Dave] experiencing simulation epiphanies more and more often. These epiphanies introduce unwanted noise in the accuracy of PAL's predictions, because the predictions are supposed to be about the real world, not about what happens in simulations. PAL is programmed to try to minimize prediction errors, and it finds the following solution. It sets out to convince Dave and all other humans in the real world that they are living in a simulation. Not only that, but they might randomly get shunted into another simulation at any point in time. The realization that you moved from one simulation to another is much less of an epiphany, so this suppresses a lot of the predictive noise.
2. An opposite conclusion is that there is no prediction error at all. Whenever [Dave] has a simulation epiphany, the real Dave will have one too. PAL is not getting inaccurate, it is getting more intelligent. It has just found a new way to get to the coffee machine faster. If we believe this new way does not align with human values, then we can pre-emptively fix this problem by adding a penalty term to PAL's utility function, to heavily down-rank outcomes where simulation epiphanies happen.
3. Let's assume that Dave has read the above story too, because it was printed in PAL's user manual, and that Dave believes that conclusion 2 is the right one. So when [Dave] sees PAL take route B, he will think it most likely that he is still Dave, and that PAL is just trying to trick him into experiencing a simulation epiphany. Having penalty terms about simulation epiphanies may be nice, but as long as Dave has read the user manual, we don't need to worry too much about Dave.
4. The above is all wrong. The real problem here is that [Dave]'s mind contains information that allows [Dave] to predict that PAL will take route A, and this information interferes with getting a correct result in a simulation where PAL takes route B instead. In other words, we have a 5-and-10 style problem. Adding penalty terms to the utility function does not fundamentally solve this problem: we need to go deeper. To get a clean simulation result, we need to erase certain knowledge from [Dave]'s mind before starting the simulation. (Technical discussions about a type of erasure related to this can be found in posts like "Deconfusing Logical Counterfactuals" and "Decisions with Non-Logical Counterfactuals: request for input".)
5. The above reasoning cannot be correct because it implies that an agent using a less accurate world model containing slightly lobotomized humans will become smarter and/or more aligned. In fact, erasing things from [Dave]'s mind comes with a safety penalty: it lowers PAL's ability to avoid colliding with Dave, because it will be less accurate in predicting where the real Dave will go.
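To compare two of the proposed fixes side by side, here is a rough sketch of the penalty term from the second conclusion and the knowledge erasure from the fourth. Everything in it is a hypothetical toy: the travel times, the penalty size `EPIPHANY_PENALTY`, and the `erase_expectation` switch are assumptions made up for illustration, and erasure is modeled crudely as [Dave] having no expectation at all about PAL's route.

```python
# Toy comparison of two proposed fixes. All numbers and names are illustrative assumptions.

DRIVE_TIME = {"A": 10.0, "B": 25.0}   # assumed seconds when PAL drives itself
CARRY_TIME = 6.0                      # assumed seconds when [Dave] carries PAL
EXPECTED_ROUTE = "A"                  # what [Dave] expects PAL to do
EPIPHANY_PENALTY = 1000.0             # assumed size of the penalty term


def simulate(route, erase_expectation=False):
    """Return (time_to_coffee, epiphany_happened) for one simulated rollout."""
    # Erasure: [Dave] no longer knows which route PAL usually takes.
    expected = None if erase_expectation else EXPECTED_ROUTE
    epiphany = expected is not None and route != expected
    return (CARRY_TIME if epiphany else DRIVE_TIME[route]), epiphany


def utility(outcome, penalize_epiphany=False):
    """Negative travel time, optionally minus a penalty for epiphanies."""
    time_to_coffee, epiphany = outcome
    penalty = EPIPHANY_PENALTY if (penalize_epiphany and epiphany) else 0.0
    return -time_to_coffee - penalty


def choose_route(penalize_epiphany=False, erase_expectation=False):
    """Pick the route with the highest utility under the chosen fix."""
    return max(
        DRIVE_TIME,
        key=lambda r: utility(simulate(r, erase_expectation), penalize_epiphany),
    )


if __name__ == "__main__":
    print(choose_route())                        # no fix
    print(choose_route(penalize_epiphany=True))  # penalty term
    print(choose_route(erase_expectation=True))  # knowledge erasure
```

In this toy, both fixes bring PAL back to route A, so it cannot distinguish between them. It also leaves out the things the conclusions actually disagree about: the collision-avoidance cost of erasure, and what a Dave who has read the user manual would do.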
The above story has some elements of the 5-and-10 problem, but it adds an extra twist. My question is: has anything like this been discussed or resolved already?
If Dave and [Dave] can never prove it when they are in a simulation, then we can show that some of the conclusions above become invalid. But here is a modified version of the story. [Dave] sees PAL moving along route A, but then he suddenly notices that he can only see 5 different colors around him, and everything looks like it is made out of polygons...