When trying to understand a problem, it is often helpful to reduce it to something simpler. Even if the problem seems as simple as possible, it may still be possible to simplify it further. This post will demystify Newcomb's Problem by reducing it to the Prediction Problem, which works as follows:
- Step 1: An AI called Alpha scans you and makes a prediction about what action you are going to take. A thousand people have already played this game and Alpha was correct every time.
- Step 2: You pull either the left or the right lever. Neither does anything.
- Step 3: Alpha gives you $1 million if it predicted that you'd pull the left lever, otherwise it gives you $1000.
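The three steps above can be sketched in a few lines of Python. This is only an illustrative model, not part of the original problem statement: the names `alpha_predict`, `play`, `always_left` and `always_right` are hypothetical, and a perfect Alpha is assumed to be modelled by simply running the agent's deterministic policy.

```python
from typing import Callable

LEFT, RIGHT = "left", "right"

# An agent is modelled as a deterministic policy: no inputs, one lever out.
Agent = Callable[[], str]


def alpha_predict(agent: Agent) -> str:
    """Assumed model of a perfect Alpha: 'scanning' means running the policy."""
    return agent()


def play(agent: Agent) -> int:
    """Run one round of the Prediction Problem and return the payout in dollars."""
    prediction = alpha_predict(agent)  # Step 1: Alpha predicts
    _action = agent()                  # Step 2: the lever itself does nothing
    # Step 3: the payout depends only on the prediction
    return 1_000_000 if prediction == LEFT else 1_000


def always_left() -> str:
    return LEFT


def always_right() -> str:
    return RIGHT


if __name__ == "__main__":
    print(play(always_left))   # 1000000
    print(play(always_right))  # 1000
```

Note that `_action` is never used when computing the payout, which mirrors the fact that neither lever does anything.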
The empirical answer seems to be that you ought to pull the left lever. On the other hand, someone strictly following Causal Decision Theory ought to be indifferent between the two options. After all, the reasoning goes, Alpha has already made its prediction and nothing you do now can change this.
At this point someone who thinks they are smarter than they actually are might decide that pulling the left lever may have an upside, but doesn't have a downside, so they may as well pull it and then go about their life without thinking about this problem any more. This is the way to win if you were actually thrust into such a situation, but a losing strategy if your goal is to actually understand decision theory. I've argued before that practise problems don't need to be realistic; it's also fine if they are trivial. If we can answer why exactly you ought to pull the left lever, then we should also be able to justify one-boxing for Newcomb's Problem and also Timeless Decision Theory.
"Decision Theory" is misleading
The name "decision theory" seems to suggest a focus on making an optimal decision, which then causes the optimal outcome. For the Prediction Problem, the actual decision does absolutely nothing in and of itself, while if I'm correct, the person who pulls the left lever gains 1 million extra dollars. However, this is purely a result of the kind of agent that they are; all the agent has to do in order to trigger this is exist. The decision doesn't actually have any impact apart from the fact that it would be impossible to be the kind of agent that always pulls the left lever without actually pulling the left lever.
The question then arises: do you (roughly) wish to be the kind of agent that gets good outcomes or the kind of agent that makes good decisions? I need to clarify this before it can be answered. "Good outcomes" is evaluated by the expected utility that an agent receives, with the counterfactual being that an agent with a different decision-making apparatus encountered this scenario instead. To avoid confusion, we'll refer to these counterfactuals as timeless-counterfactuals and the outcomes as holistic-outcomes and the optimal such counterfactual as holistically-optimal. I'm using "good decisions" to refer to the causal impact of a decision on the outcome. The counterfactuals are the agent "magically" making a different decision at that point, with everything else that happened before being held static, even the decision-making faculties of the agent itself. To avoid confusion, we'll refer to these counterfactuals as point-counterfactuals and the decisions over these as point-decisions and the optimal such counterfactual as point-optimal.
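To make the two kinds of counterfactual concrete, here is a minimal sketch for the Prediction Problem with a perfect Alpha. The function names are hypothetical and chosen only for illustration. A point-counterfactual holds Alpha's prediction fixed and varies only the action; a timeless-counterfactual varies the agent, so the prediction varies with it.

```python
LEFT, RIGHT = "left", "right"


def payoff(prediction: str) -> int:
    """Payout in dollars; it depends only on what Alpha predicted."""
    return 1_000_000 if prediction == LEFT else 1_000


def point_counterfactuals(actual_choice: str) -> dict:
    """Hold the past (and so the prediction) fixed; 'magically' vary the action."""
    prediction = actual_choice  # a perfect Alpha predicted the actual agent
    return {action: payoff(prediction) for action in (LEFT, RIGHT)}


def timeless_counterfactuals() -> dict:
    """Vary the agent itself; a perfect Alpha's prediction tracks each agent."""
    return {policy: payoff(policy) for policy in (LEFT, RIGHT)}


if __name__ == "__main__":
    print(point_counterfactuals(LEFT))   # {'left': 1000000, 'right': 1000000}
    print(point_counterfactuals(RIGHT))  # {'left': 1000, 'right': 1000}
    print(timeless_counterfactuals())    # {'left': 1000000, 'right': 1000}
```

Within either set of point-counterfactuals the two actions pay the same, which is the indifference attributed to Causal Decision Theory. Only the timeless-counterfactuals distinguish the left-lever agent from the right-lever agent.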
I will argue that we should choose good outcomes as the method by which this is obtained is irrelevant. In fact, I would almost suggest using the term Winning Theory instead of Decision Theory. Eliezer made a similar case very elegantly in Newcomb's Problem and Regret of Rationality, but this post aims to identify the exact flaw in the two-boxing argument. Since one-boxing obtains the holistically-optimal outcome, but two-boxing produces the point-optimal decision, I need to show why the former deserves preference.
At this point, we can make two interesting observations. Firstly, two-boxers gain an extra $1000 as a result of their decision, but miss out on $1 million as a result of who they are. Why are these two figures accounted for differently? Secondly, both of these approaches are self-affirming after the prediction. At this point, the point-optimal decision is to choose point-optimality and the holistically-optimal decision is to choose holistic-optimality. This might appear to be a stalemate, but we can resolve this conflict by investigating why point-optimality is usually considered important.
Why do we care about point-optimality anyway?
Both the Prediction Problem and Newcomb's Problem assume that agents don't have libertarian free will; that is, the ability to make decisions unconstrained by the past. For if they did, Alpha wouldn't be able to perfectly or near perfectly predict the agent's future actions from their past state without some kind of backwards causation which would then make one-boxing the obvious choice. So we can assume a deterministic or probabilistically deterministic universe. For simplicity, we'll just work with the former and assume that agents are deterministic.
The absence of free will is important because it affects what exactly we mean by making a decision. Here's what a decision is not: choosing from a variety of options all of which were (in the strictest sense) possible at the time given the past. Technically, only one choice was possible and that was the choice taken. The other choices only become strictly possible when we imagine the agent counterfactually having a different brain state.
The following example may help: Suppose a student has a test on Friday. Reasoning that determinism means that the outcome is already fixed, the student figures that they may as well not bother to study. What's wrong with this reasoning?
The answer is that the outcome is only known to be fixed because whether or not the student studies is fixed. When making a decision, you don't loop over all of the strictly possible options, because there is only one of them and that is whatever you actually choose. Instead, you loop over a set of counterfactuals (and the one actual factual, though you don't know it at the time). While the outcome of the test is fixed in reality, the counterfactuals can have a different outcome as they aren't reality.
So why do we actually care about the point-optimal decision if it can't strictly change what you choose, as this was fixed from the beginning of time? Well, even if you can't strictly change your choice, you can still be fortunate enough to be an agent that always was going to try to calculate the best point-decision and then carry it out (this is effective for standard decision theory problems). If such an agent can't figure out the best point-decision itself, it would choose to pay a trivial amount (say 1 cent) to an oracle to find this out, assuming that the differences in the payoffs aren't similarly trivial. And over a wide class of problems, so long as this process is conducted properly, the agent ends up in the world with the highest expected utility.
So what about the Prediction Problem?
The process described for point-optimality assumes that outcomes are purely a result of actions. But for the Prediction Problem, the outcome isn't dependent on actions at all, but instead on the internal algorithm at time of prediction. Even if our decision doesn't cause the past state that is analysed by Alpha to create its prediction, these are clearly linked in some kind of manner. But point-optimality assumes outcomes are fixed independently of our decision algorithm. The outcomes are fixed for a given agent, but it is empty to say fixed for a given agent whatever its choice as each agent can only make one choice. So allowing any meaningful variation over choices requires allowing variation over agents in which case we can no longer assume that the outcomes are fixed. At this point, whatever the specific relationship, we are outside the intended scope of point-optimal decision making.
Taking this even further, asking, "What choice ought I make?" is misleading because given who you are, you can only make a single choice. Indeed, it seems strange that we care about point-optimality, even in regular decision theory problems, given that point-counterfactuals indicate impossible situations. An agent cannot be such that it would choose X, but then magically choose Y instead, with no causal reason. In fact, I'd suggest that the only reason why we care about point-counterfactuals is that they are equivalent to the actually consistent timeless-counterfactuals in normal decision theory problems. After all, in most decision theory problems, we can alter an agent to carry out a particular action at a particular point of time without affecting any other elements of the problem.
Getting more concrete, for the version of the Prediction Problem where we assume Alpha is perfect, you simply cannot pull the right lever and have Alpha predict the left lever. This counterfactual doesn't correspond to anything real, let alone anything that we care about. Instead, it makes much more sense to consider the timeless-counterfactuals, which are the most logical way of producing consistent counterfactuals from point-counterfactuals. In this example, the timeless-counterfactuals are pulling the left lever and having Alpha predict left; or pulling the right lever and having Alpha predict right.
In the probabilistic version where Alpha correctly identifies you pulling the right lever 90% of the time and the left lever 100% of the time, we will imagine that a ten-sided dice is rolled and Alpha correctly identifies you pulling the right lever as long as the dice doesn't show a ten. You simply cannot pull the right lever with the dice showing a number that is not ten and have Alpha predict you will pull the left lever. Similarly, you cannot pull the right lever with the dice showing a ten and have Alpha predict the correct result. The point-counterfactuals allow this, but these situations are inconsistent. In contrast, the timeless-counterfactuals insist on consistency between the dice-roll and your decision, so actually correspond to something meaningful.
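The probabilistic version can be worked through by enumerating the ten dice faces. This is a rough sketch under stated assumptions: the dice is taken to be fair, and the names `alpha_prediction` and `expected_payoff` are hypothetical. Each (policy, dice face) pair fixes Alpha's prediction, so only consistent timeless-counterfactuals are ever evaluated.

```python
from fractions import Fraction

LEFT, RIGHT = "left", "right"
DICE_FACES = range(1, 11)  # assumed fair ten-sided dice


def alpha_prediction(policy: str, dice: int) -> str:
    """Alpha errs only when the agent pulls right and the dice shows a ten."""
    if policy == RIGHT and dice == 10:
        return LEFT
    return policy


def payoff(prediction: str) -> int:
    """Payout in dollars; it depends only on what Alpha predicted."""
    return 1_000_000 if prediction == LEFT else 1_000


def expected_payoff(policy: str) -> Fraction:
    """Average the payout over all equally likely dice faces."""
    total = sum(payoff(alpha_prediction(policy, dice)) for dice in DICE_FACES)
    return Fraction(total, len(DICE_FACES))


if __name__ == "__main__":
    print(expected_payoff(LEFT))   # 1000000
    print(expected_payoff(RIGHT))  # 100900
```

Under these assumptions the right-lever agent averages $100,900 (nine outcomes of $1000 and one of $1 million), still far below the left-lever agent's $1 million.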
If you are persuaded to reject point-optimality, I would suggest a switch to a metric built upon a notion of good outcomes instead for two reasons. Firstly, point-optimality is ultimately motivated by the fact that it provides good outcomes within a particular scope. Secondly, both the one-boxers and two-boxers see their strategy as producing better outcomes.
In order to make this work, we just need to formulate good outcomes in a way that accounts for agents being predestined to perform strategies as opposed to agents exercising some kind of libertarian free will. The natural way to do this is to work with timeless-counterfactuals instead of point-counterfactuals.
But doesn't this require backwards causation?
How can a decision affect a prediction at an earlier time? Surely this should be impossible. If a human adopts the timeless approach in the moment, it's because either:
a) They were fooled into it by reasoning that sounded convincing, but was actually flawed
b) They realised that the timeless approach best achieves their intrinsic objectives, even accounting for the past being fixed. For example, they value whatever currency is offered in the experiment and they ultimately value achieving the best outcome in these terms, then they realise that Timeless Decision Theory delivers this.
Remember that an agent's "choice" of what decision theory to adopt is already predestined, even if the agent only figured this out when faced with this situation. You don't really make a decision in the sense we usually think about it; instead you are just following an inevitable process. For an individual who ultimately values outcomes as per b), the only question is whether the individual will carry out this process of producing a decision theory that matches their intrinsic objectives correctly or incorrectly. An individual who adopts the timeless approach wins because Alpha knew that they were going to carry out this process correctly, while an individual who adopts point-optimality loses because Alpha knew they were always going to make a mistake in this process.
The two-boxers are right that you can only be assured of gaining the million if you are pre-committed in some kind of manner, although they don't realise that determinism means that we are all pre-committed in a general sense to whatever action we end up taking. That is, in addition to explicit pre-commitments, we can also talk about implicit pre-commitments. An inevitable flaw in reasoning as per a) is equivalent to pre-commitment, although from the inside it will feel as though you could have avoided it. So are unarticulated intrinsic objectives that are only identified and clarified at the point of the decision as per b); clarifying these objectives doesn't cause you to become pre-committed, it merely reveals what you were pre-committed to. Of course, this only works with super-human predictors. Normal people can't be relied upon to pick up on these deep aspects of personality and so require more explicit pre-commitment in order to be convinced (I expanded this into a full article here).
What about agents that are almost pre-committed to a particular action? Suppose 9/10 times you follow the timeless approach, but 1/10 times you decide to do the opposite. More specifically, we'll assume that when a ten-sided dice roll shows a 10, you experience a mood that convinces you to take the latter course of action. Since we're assuming determinism, Alpha will be aware of this before they make their prediction. When the dice shows a ten, you feel really strongly that you have exercised free will as you would have acted differently in the counterfactual where your mood was slightly different. However, given that the dice did show a ten, your action was inevitable. Again, you've discovered your decision rather than made it. For example, if you decide to be irrational, the predictor knew that you were in that mood at the start, even if you did not.
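As a rough illustration of the almost pre-committed agent, the sketch below applies the scenario to the Prediction Problem. It assumes a fair ten-sided dice, takes "the timeless approach" to mean pulling the left lever, and models Alpha as seeing the same dice roll that determines the mood. All names are hypothetical.

```python
from fractions import Fraction

LEFT, RIGHT = "left", "right"
DICE_FACES = range(1, 11)  # assumed fair ten-sided dice


def agent_action(dice: int) -> str:
    """The mood triggered by a ten makes the agent pull the right lever."""
    return RIGHT if dice == 10 else LEFT


def alpha_prediction(dice: int) -> str:
    """Under determinism Alpha knows the dice roll, so it predicts the action."""
    return agent_action(dice)


def payoff(prediction: str) -> int:
    """Payout in dollars; it depends only on what Alpha predicted."""
    return 1_000_000 if prediction == LEFT else 1_000


def expected_payoff() -> Fraction:
    """Average the payout over all equally likely dice faces."""
    total = sum(payoff(alpha_prediction(dice)) for dice in DICE_FACES)
    return Fraction(total, len(DICE_FACES))


if __name__ == "__main__":
    print(expected_payoff())  # 900100
```

Because Alpha's prediction is computed from the same dice roll as the action, there is no face of the dice on which the agent acts on the mood and still collects the million.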
Or going further, a completely rational agent that wants to end up in the world with the most dollars doesn't make that decision in the Prediction Problem so that anything happens, it makes that decision because it can make no other. If you make another decision, you either have different objectives or you have an error in your reasoning, so you weren't the agent that you thought you were.
When you learn arguments in favour of one side or another, it changes what your choice would have been in the counterfactual where you were forced to make a decision just before you made that realisation, but what happens in reality is fixed. It doesn't change the past either, but it does change your estimation of what Alpha would have predicted. When you lock in your choice, you've finalised your estimate of the past, but this looks a lot like changing the past, especially if you had switched to favouring a different decision at the last minute. Additionally, when you lock in your choice, it isn't as though the future was only locked in at that time, as it was already fixed. Actually, making a decision can be seen as a process that makes the present line up with the past predictions and again this can easily be mistaken for changing the past.
But further than this, I want to challenge the question: "How does my decision affect a past prediction?" Just like, "What choice ought I make?", if we contemplate a fixed individual, then we must fix the decision as well. If instead we consider a variety of individuals taking a variety of actions, then the question becomes, "How does an individual/decision pair affect a prediction prior to the decision?", which isn't exactly a paradox.
Anna Salamon started writing an incomplete sequence on this problem. I only read them after finishing the first version of this post, but she provides a better explanation than I do of why we need to figure out what kind of counterfactual we are talking about, what exactly "should", "would" and "could" mean, and what the alternatives are.