Transparent Newcomb's Problem and the limitations of the Erasure framing

by Chris_Leong · 3 min read · 28th Nov 2019 · 25 comments

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

One of the aspects of the Erasure Approach that always felt kind of shaky was that in Transparent Newcomb's Problem it required you to forget that you'd seen that the box was full. I've recently come to believe that this really isn't the best way of framing the situation.

Let's begin by recapping the problem. In a room there are two boxes: one containing $1,000, and a transparent box containing either nothing or $1 million. Before you entered the room, a perfect predictor predicted what you would do if you saw $1 million in the transparent box. If it predicted that you would one-box, it put $1 million in the transparent box; otherwise it left the box empty. If you can see $1 million in the transparent box, which choice should you pick?
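The setup can be sketched in code (my own illustration, not part of the original post; the `policy` function is a hypothetical stand-in for the agent, mapping what it observes to an action):

```python
# Illustrative sketch of Transparent Newcomb's Problem.
# A policy maps an observation ('full' or 'empty') to 'one-box' or 'two-box'.

def run_transparent_newcomb(policy):
    # The perfect predictor asks: what would this agent do on SEEING the full box?
    predicted = policy('full')
    # It fills the transparent box only if the agent would one-box in that case.
    transparent_box = 1_000_000 if predicted == 'one-box' else 0
    # The agent then observes the actual contents and chooses.
    observation = 'full' if transparent_box > 0 else 'empty'
    choice = policy(observation)
    # Two-boxing additionally takes the $1,000 box.
    return transparent_box + (1_000 if choice == 'two-box' else 0)

one_boxer = lambda obs: 'one-box'
two_boxer = lambda obs: 'two-box'

print(run_transparent_newcomb(one_boxer))  # 1000000
print(run_transparent_newcomb(two_boxer))  # 1000 (sees an empty box, takes both)
```

Note that in this toy model a committed two-boxer never actually sees a full box, which is exactly the tension the rest of the post examines.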

The argument I provided before was as follows: if you see a full box, then you must be going to one-box, assuming the predictor really is perfect. So there would only be one decision consistent with the problem description, and to produce a non-trivial decision theory problem we'd have to erase some information. The most logical thing to erase would be what you see in the box.

I still mostly agree with this argument, but I feel the reasoning is a bit sparse, so this post will try to break it down in more detail. I'll just note in advance that when you start breaking it down, you end up performing a kind of psychological or social analysis. However, I think this is inevitable when dealing with ambiguous problems; if you could provide a mathematical proof of what an ambiguous problem meant then it wouldn't be ambiguous.

As I noted in Deconfusing Logical Counterfactuals, there is only one choice consistent with the problem (one-boxing), so in order to answer this question we'll have to construct some counterfactuals. A good way to view this is that instead of asking what choice the agent should make, we will ask whether the agent made the best choice.

In order to construct these counterfactuals, we'll have to consider situations where at least one of the problem's assumptions is missing, since we want counterfactuals covering both one-boxing and two-boxing. Unfortunately, it is impossible for a two-boxer to a) see $1 million in the box if b) the money is only in the box when the predictor predicts the agent will one-box in this situation and c) the predictor is perfect. So we'll have to relax at least one of these assumptions.

Speaking very roughly, it is typically understood that the way to resolve this is to relax the assumption that the agent must really be in that situation and to allow the possibility that the agent may only be simulated as being in such a situation by the predictor. I want to reiterate that what counts as the same problem is really just a matter of social convention.

Another note: I said I was speaking very roughly because many people claim that the agent could actually be in the simulation. In my mind these people are confused; in order to predict an agent, we may only need to simulate the decision-theoretic parts of its mind, not all the other parts that make you you. A second reason why this isn't precise is that it isn't defined how to simulate an impossible situation; one of my previous posts points out that we can get around this by simulating what an agent would do when given input representing an impossible situation. There may also be some people who have doubts about whether a perfect predictor is possible even in theory. I'd suggest that these people read one of my past posts on why the sense in which you "could have chosen otherwise" doesn't break the prediction and how there's a sense in which you are pre-committed to every action you take.

In any case, once we have relaxed this assumption, the consistent counterfactuals become either a) the agent actually seeing the full box and one-boxing, or b) the agent seeing the empty box. In case b), it is actually consistent for the agent to one-box or two-box, since the predictor only predicts what would happen if the agent saw a full box. It is then trivial to pick the best counterfactual.
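To make this concrete, here is a small consistency check (my own sketch, not from the original argument): once the prediction only constrains behaviour on seeing a full box, we can enumerate which (observation, choice) pairs remain consistent.

```python
# Which counterfactuals are consistent? The predictor's prediction only
# constrains what the agent does on seeing a FULL box.

def is_consistent(policy_on_full, observation, choice):
    # The predictor fills the box iff the agent would one-box on seeing it full.
    box_full = (policy_on_full == 'one-box')
    # A genuine observation must match the box's actual contents.
    if (observation == 'full') != box_full:
        return False
    # On seeing a full box, the actual choice must match the prediction.
    if observation == 'full' and choice != policy_on_full:
        return False
    return True

# Case a): seeing the full box is only consistent with one-boxing.
print(is_consistent('one-box', 'full', 'one-box'))   # True
print(is_consistent('one-box', 'full', 'two-box'))   # False
# Case b): seeing the empty box is consistent with either action.
print(is_consistent('two-box', 'empty', 'one-box'))  # True
print(is_consistent('two-box', 'empty', 'two-box'))  # True
```

This matches the analysis above: the only consistent full-box counterfactual is one-boxing, while the empty-box case leaves both actions open.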

This problem actually demonstrates a limitation of the erasure framing. After all, we didn't justify the counterfactuals by removing the assumption that you saw a full box; instead, we modified it to seeing a full box OR being simulated seeing a full box. In one sense, this is essentially the same thing: since we already knew you were being simulated by the predictor, we essentially just removed the assumption. On the other hand, it is easier to justify that it is the same problem by turning it into an OR than by just removing the assumption.

In other words, thinking about counterfactuals in terms of erasure can be incredibly misleading, and in this case it actively made it harder to justify our counterfactuals. The key question seems to be not, "What should I erase?", but, "What assumption should I erase or relax?". I'm beginning to think that I'll need to choose a better term, but I am reluctant to rename this approach until I have a better understanding of what exactly is going on.

At risk of repeating myself, the fact that it is natural to relax this assumption is a matter of social convention and not mathematics. My next post on this topic will try to help clarify how certain aspects of a problem may make it seem natural to relax or remove certain assumptions.
