# Transparent Newcomb's Problem and the limitations of the Erasure framing

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

One of the aspects of the Erasure Approach that always felt kind of shaky was that in Transparent Newcomb's Problem it required you to forget that you'd seen that the box was full. I've recently come to believe that this really isn't the best way of framing the situation.

Let's begin by recapping the problem. In a room there are two boxes, one containing \$1000 and the other a transparent box that contains either nothing or \$1 million. Before you entered the room, a perfect predictor predicted what you would do if you saw \$1 million in the transparent box. If it predicted that you would one-box, then it put \$1 million in the transparent box; otherwise it left the box empty. If you can see \$1 million in the transparent box, which choice should you pick?
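The setup can be made concrete with a minimal sketch. It assumes an agent is just a function from what it sees to a choice, and that the predictor "predicts" by calling that function on a full box. All names here (`play`, `predictor_fills_box`, and so on) are hypothetical, chosen for illustration only.

```python
from typing import Callable

FULL, EMPTY = "full", "empty"
ONE_BOX, TWO_BOX = "one-box", "two-box"

# Assumption: an agent is modelled as a function from observation to choice.
Agent = Callable[[str], str]


def predictor_fills_box(agent: Agent) -> bool:
    """Fill the transparent box iff the agent would one-box on seeing it full."""
    return agent(FULL) == ONE_BOX


def play(agent: Agent) -> int:
    """Run the problem once and return the agent's payoff in dollars."""
    filled = predictor_fills_box(agent)
    observation = FULL if filled else EMPTY
    choice = agent(observation)
    transparent = 1_000_000 if filled else 0
    small = 1_000 if choice == TWO_BOX else 0  # two-boxers also take the small box
    return transparent + small


def always_one_box(observation: str) -> str:
    return ONE_BOX


def always_two_box(observation: str) -> str:
    return TWO_BOX


print(play(always_one_box))  # 1000000
print(play(always_two_box))  # 1000
```

Note that in this sketch the two-boxer never actually sees a full box; it is only queried about one.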

The argument I provided before was as follows: If you see a full box, then you must be going to one-box if the predictor really is perfect. So there would only be one decision consistent with the problem description and to produce a non-trivial decision theory problem we'd have to erase some information. And the most logical thing to erase would be what you see in the box.

I still mostly agree with this argument, but I feel the reasoning is a bit sparse, so this post will try to break it down in more detail. I'll just note in advance that when you start breaking it down, you end up performing a kind of psychological or social analysis. However, I think this is inevitable when dealing with ambiguous problems; if you could provide a mathematical proof of what an ambiguous problem meant then it wouldn't be ambiguous.

As I noted in Deconfusing Logical Counterfactuals, there is only one choice consistent with the problem (one-boxing), so in order to answer this question we'll have to construct some counterfactuals. A good way to view this is that instead of asking what choice should the agent make, we will ask whether the agent made the best choice.

To construct these counterfactuals, we want to consider both one-boxing and two-boxing. Unfortunately, it is impossible for a two-boxer to a) see \$1 million in the box if b) the money is only in the box when the predictor predicts that the agent will one-box in this situation and c) the predictor is perfect. So we'll have to relax at least one of these assumptions.
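The inconsistency can be checked by brute force. The sketch below enumerates every candidate world and keeps only those satisfying a), b) and c) read literally, plus the implicit assumption that the agent really is in the room. The encoding (a policy is just the action taken on seeing a full box) and the names are my own assumptions.

```python
from itertools import product

ONE_BOX, TWO_BOX = "one-box", "two-box"


def consistent(policy_on_full: str, box_full: bool, sees_full: bool, action: str) -> bool:
    """Check one candidate world against the literal problem statement."""
    # (b) and (c): a perfect predictor fills the box iff the agent
    # would one-box on seeing it full.
    if box_full != (policy_on_full == ONE_BOX):
        return False
    # Implicit assumption: the agent is really in the room, so what it
    # sees matches what is actually in the box.
    if sees_full != box_full:
        return False
    # (a): the agent sees the money.
    if not sees_full:
        return False
    # The agent's action follows its policy for a full box.
    return action == policy_on_full


candidates = product([ONE_BOX, TWO_BOX], [True, False], [True, False], [ONE_BOX, TWO_BOX])
worlds = [w for w in candidates if consistent(*w)]
print(worlds)  # [('one-box', True, True, 'one-box')]
```

Only the one-boxing world survives, so a two-boxing counterfactual requires dropping or weakening one of the checks.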

Speaking very roughly, it is typically understood that the way to resolve this is to relax the assumption that the agent must really be in that situation and to allow the possibility that the agent may only be simulated as being in such a situation by the predictor. I want to reiterate that what counts as the same problem is really just a matter of social convention.

Another note: I said I was speaking very roughly because many people claim that the agent could actually be in the simulation. In my mind these people are confused; in order to predict an agent, we may only need to simulate the decision theory parts of its mind, not all the other parts that make you you. A second reason why this isn't precise is that it isn't defined how to simulate an impossible situation; one of my previous posts points out that we can get around this by simulating what an agent would do when given input representing an impossible situation. There may also be some people who have doubts about whether a perfect predictor is possible even in theory. I'd suggest that these people read one of my past posts on why the sense in which you "could have chosen otherwise" doesn't break the prediction and how there's a sense in which you are pre-committed to every action you take.

In any case, once we have relaxed this assumption, the consistent counterfactuals become either a) the agent actually seeing the full box and one-boxing, or b) the agent seeing the empty box. In case b), it is actually consistent for the agent to one-box or two-box, since the predictor only predicts what would happen if the agent saw a full box. It is then trivial to pick the best counterfactual.
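A rough sketch of that comparison follows. It assumes a policy is a pair (action on a full box, action on an empty box) and that the predictor only ever queries the first component. The names and the encoding are hypothetical.

```python
from itertools import product

FULL, EMPTY = "full", "empty"
ONE_BOX, TWO_BOX = "one-box", "two-box"


def outcome(on_full: str, on_empty: str) -> tuple:
    """Return (what the real agent sees, what it does, payoff in dollars)."""
    # The predictor simulates the agent seeing a full box; the real
    # agent may never be in that situation.
    filled = on_full == ONE_BOX
    seen = FULL if filled else EMPTY
    action = on_full if filled else on_empty
    payoff = (1_000_000 if filled else 0) + (1_000 if action == TWO_BOX else 0)
    return seen, action, payoff


# Distinct real-world outcomes across all four policies.
counterfactuals = {outcome(f, e) for f, e in product([ONE_BOX, TWO_BOX], repeat=2)}
for c in sorted(counterfactuals, key=lambda c: -c[2]):
    print(c)

best = max(counterfactuals, key=lambda c: c[2])
```

Under these assumptions there are three distinct outcomes: seeing the full box and one-boxing, and seeing the empty box with either action. The first has the highest payoff.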

This problem actually demonstrates a limitation of the erasure framing. After all, we didn't justify the counterfactuals by removing the assumption that you saw a full box; instead, we modified it to seeing a full box OR being simulated seeing a full box. In one sense, this is essentially the same thing: since we already knew you were being simulated by the predictor, we essentially just removed the assumption. On the other hand, it is easier to justify that it is the same problem by turning it into an OR than by just removing the assumption.

In other words, thinking about counterfactuals in terms of erasure can be incredibly misleading, and in this case it actively made it harder to justify our counterfactuals. The key question seems to be not, "What should I erase?", but, "What assumption should I erase or relax?". I'm beginning to think that I'll need to choose a better term, but I'm reluctant to rename this approach until I have a better understanding of what exactly is going on.

At risk of repeating myself, the fact that it is natural to relax this assumption is a matter of social convention and not mathematics. My next post on this topic will try to help clarify how certain aspects of a problem may make it seem natural to relax or remove certain assumptions.

# Comments

> There may also be some people [who] have doubts about whether a perfect predictor is possible even in theory.

While perfect predictors are possible, perfect predictors who give you some information about their prediction are often impossible. Since you learn of their prediction, you really can just do the opposite. This is not a problem here, because Omega doesn't care if he leaves the box empty and you one-box anyway, but it's not something to forget about in general.

The trick in open-box Newcomb's is that the predictor only predicts whether or not you will one-box if you see a full box. If you are the kind of agent who always does "the opposite", you'll see an empty box and one-box. That isn't actually a problem, as it only predicted whether you'd one-box if you saw a full box.
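A small sketch of the "opposite" agent, assuming the agent is a function from observation to choice and the predictor only queries it on a full box (names are hypothetical):

```python
FULL, EMPTY = "full", "empty"
ONE_BOX, TWO_BOX = "one-box", "two-box"


def contrarian(observation: str) -> str:
    # A full box signals "predicted to one-box", so do the opposite;
    # an empty box signals "predicted to two-box", so one-box.
    return TWO_BOX if observation == FULL else ONE_BOX


predicted_on_full = contrarian(FULL)         # the only thing the predictor predicts
box_filled = predicted_on_full == ONE_BOX    # so the box is left empty
actual_observation = FULL if box_filled else EMPTY
actual_choice = contrarian(actual_observation)

print(predicted_on_full, box_filled, actual_observation, actual_choice)
```

The agent ends up one-boxing on an empty box, which does not contradict the prediction, because the prediction was only about the full-box case.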

That's... exactly what my last sentence meant. Are you repeating on purpose or was my explanation so unclear?

Oh sorry, hadn't fully woken up when I read your comment

I still don't understand the fascination with this problem. A perfect predictor pretty strongly implies some form of determinism, right? If it predicts one-boxing and it's perfect, you don't actually have a choice - you are going to one-box, and justify it to yourself however you need to.

Thanks for this comment. I accidentally left a sentence out of the original post: "A good way to view this is that instead of asking what choice should the agent make, we will ask whether the agent made the best choice"

> If you see a full box, then you must be going to one-box if the predictor really is perfect.

Huh? If I'm a two-boxer, the predictor can still make a simulation of me, show it a simulated full box, and see what happens. It's easy to formalize, with computer programs for the agent and the predictor.

I've already addressed this in the article above, but my understanding is as follows: This is one of those circumstances where it is important to differentiate between you being in a situation and a simulation of you being in a situation. I really should write a post about this - but in order for a simulation to be accurate it simply has to make the same decisions in decision theory problems. It doesn't have to have anything else the same - in fact, it could be an anti-rational agent with the opposite utility function.

Note that I'm not claiming that an agent can ever tell whether it is in the real world or in a simulation, but that's not the point. I'm adopting the viewpoint of an external observer which can tell the difference.

I think the key here is to think about what is happening both in terms of philosophy and mathematics, but you only seem interested in the former?

I (somewhat) agree that there are cases where you need to keep identity separate between levels of simulation (which "you" may or may not be at the outermost of). But I don't think it matters to this problem. When you add "perfect" to the descriptor, it's pretty much just you. It makes every relevant decision identically.

When you are trying to really break down a problem, I think it is good practise to assume they are separate at the start. You can then immediately justify talking about a simulation as you in a certain sense, but starting with them separate is key.

I may not have gotten to the part where it matters that they're separate (in perfect simulation/alignment cases). But no harm in it. Just please don't obscure the fundamental implication that in such a universe, free will is purely an illusion.

I haven't been defending free will in my posts at all

I couldn't understand your comment, so I wrote a small Haskell program to show that two-boxing in the transparent Newcomb problem is a consistent outcome. What parts of it do you disagree with?

Okay, I have to admit that that's kind of cool; but on the other hand, that also completely misses the point.

I think we need to backtrack. A maths proof can be valid but the conclusion false if at least one premise is false, right? So unless a problem has already been formally defined, it's not enough to just throw down a maths proof; you also have to justify that you've formalised it correctly.

Well, the program is my formalization. All the premises are right there. You should be able to point out where you disagree.

In other words, the claim isn't that your program is incorrect, it's that it requires more justification than you might think in order to persuasively show that it correctly represents Newcomb's problem. Maybe you think understanding this isn't particularly important, but I think knowing exactly what is going on is key to understanding how to construct logical-counterfactuals in general.

I actually don't know Haskell, but I'll take a stab at decoding it tonight or tomorrow. Open-box Newcomb's is normally stated as "you see a full box", not "you or a simulation of you sees a full box". I agree with this reinterpretation, but I disagree with glossing it over.

My point was that if we take the problem description super-literally as you seeing the box and not a simulation of you, then you must one-box. Of course, since this provides a trivial decision problem, we'll want to reinterpret it in some way and that's what I'm providing a justification for.

I see, thanks, that makes it clearer. There's no disagreement, you're trying to justify the approach that people are already using. Sorry about the noise.

Not at all. Your comments helped me realise that I needed to make some edits to my post.

> in fact, it could be an anti-rational agent with the opposite utility function.

These two people might look the same, they might be identical on a quantum level, but one of them is a largely rational agent, and the other is an anti-rational agent with the opposite utility function.

I think that calling something an anti-rational agent with the opposite utility function is a weird description that doesn't cut reality at its joints. There is a simple notion of a perfect sphere. There is also a simple notion of a perfect optimizer. Real world objects aren't perfect spheres, but some are pretty close. Thus "sphere" is a useful approximation, and "sphere + error term" is a useful description. Real agents aren't perfect optimisers (ignoring contrived goals like "1 for doing whatever you were going to do anyway, 0 else"), but some are pretty close, hence "utility function + biases" is a useful description. This makes the notion of an anti-rational agent with opposite utility function like an inside-out sphere with its surface offset inwards by twice the radius. It's a cack-handed description of a simple object in terms of a totally different simple object and a huge error term.

> This is one of those circumstances where it is important to differentiate between you being in a situation and a simulation of you being in a situation.

I actually don't think that there is a general procedure to tell what is you, and what is a simulation of you. Standard argument about slowly replacing neurons with nanomachines, slowly porting it to software, slowly abstracting and proving theorems about it rather than running it directly.

It is an entirely meaningful utility function to only care about copies of your algorithm that are running on certain kinds of hardware. That makes you a "biochemical brains running this algorithm" maximizer. The paperclip maximizer doesn't care about any copy of its algorithm. Humans worrying about whether the predictor's simulation is detailed enough to really suffer is due to specific features of human morality. From the perspective of the paperclip maximizer doing decision theory, what we care about is logical correlation.

"I actually don't think that there is a general procedure to tell what is you, and what is a simulation of you" - Let's suppose I promise to sell you an autographed Michael Jackson CD. But then it turns out that the CD wasn't signed by Michael, but by me. Now I'm really good at forgeries, so good in fact that my signature matches his atom to atom. Haven't I still lied?

Imagine sitting outside the universe, and being given an exact description of everything that happened within the universe. From this perspective you can see who signed what.

You can also see whether your thoughts are happening in biology or silicon or whatever.

My point isn't "you can't tell whether or not you're in a simulation so there is no difference"; my point is that there is no sharp cut-off point between simulation and not simulation. We have a "know it when you see it" definition with ambiguous edge cases. Decision theory can't have different rules for dealing with dogs and not dogs because some things are on the ambiguous edge of dogginess. Likewise, decision theory can't have different rules for you, copies of you and simulations of you, as there is no sharp cut-off. If you want to propose a continuous "simulatedness" parameter, and explain where that gets added to decision theory, go ahead. (Or propose some sharp cut-off.)

Some people want to act as though a simulation of you is automatically you and my argument is that it is bad practise to assume this. I'm much more open to the idea that some simulations might be you in some sense than the claim that all are. This seems compatible with a fuzzy cut-off.

[-]jmh10

I did a quick search but didn't find any nuggets so perhaps someone here might know. What is the back story here, why was the problem even constructed?

I think it's more often called Transparent Newcomb's? That might be why it's hard to find.