Functional Decision Theory (FDT) (see also causal, evidential, timeless, updateless, and anthropic decision theories) recommends taking cooperative, non-greedy actions in twin prisoners’ dilemmas, Newcombian problems, Parfit’s hitchhiker-like games, and counterfactual muggings, but not in smoking lesion situations. It’s a controversial concept with important implications for designing agents that behave optimally when embedded in environments in which they may interact with models of themselves. Unfortunately, I think that FDT is sometimes explained confusingly and misunderstood by its proponents and opponents alike. To help dissolve confusion about FDT and address key concerns of its opponents, I refute the criticism that FDT assumes that causation can happen backward in time and offer two key principles that provide a framework for clearly understanding it:
- Questions in decision theory are not questions about what choices you should make with some sort of unpredictable free will. They are questions about what type of source code you should be running.
- I should consider predictor P to “subjunctively depend” on agent A to the extent that P makes predictions of A’s actions based on correlations that cannot be confounded by my choice of what source code A runs.
Getting Up to Speed
I think that functional decision theory (FDT) is a beautifully counterintuitive and insightful framework for instrumental rationality. I will not make it my focus here to talk about what it is and what types of situations it is useful in. To gain a solid background, I recommend this post of mine or the original paper on it by Eliezer Yudkowsky and Nate Soares.
Additionally, here are four different ways that FDT can be explained. I find them all complementary for understanding and intuiting it well.
- The decision theory that tells you to act as if you were setting the output to an optimal decision-making process for the task at hand.
- The decision theory that has you cooperate in situations similar to a prisoners’ dilemma against a model of yourself--including when your opponent locks in their choice and shows it to you before you make yours.
- The decision theory that has you one-box it in situations similar to Newcombian games--including when the boxes are transparent; see also Parfit’s Hitchhiker.
- The decision theory that shifts focus from what type of decisions you should make to what type of decision-making agent you should be.
I’ll assume a solid understanding of FDT from here on. I’ll be arguing in favor of it, but it’s fairly controversial. Much of what inspired this post was an AI Alignment Forum post called A Critique of Functional Decision Theory by Will MacAskill which raised several objections to FDT. Some of his points are discussed below. The rest of this post will be dedicated to discussing two key principles that help to answer criticisms and dissolve confusions around FDT.
1. Acknowledging One’s Own Predictability
Opponents of FDT, usually proponents of causal decision theory (CDT), will look at a situation such as the classic Newcombian game and reason like so:
I can choose to one-box it and take A or two-box it and take A+B. Regardless of the value of A, A+B is greater, so it can only be rational to take both. After all, when I’m sitting in front of these boxes, what’s in them is already in them regardless of the choice I make. The functional decision theorist’s perspective requires assuming that causation can happen backwards in time! Sure, one-boxers might do better at these games, but non-smokers do better in smoking lesion problems. That doesn’t mean they are making the right decision. Causal decision theorists may be dealt a bad hand in Newcombian games, but it doesn’t mean they play it badly.
The problem with this argument, I’d say, is subtle. I actually fully agree with the perspective that for causal decision theorists, Newcombian games are just like smoking lesion problems. I also agree with the point that causal decision theorists are dealt a bad hand in these games but don’t play it badly. The problem lies in a subtle confusion about the word ‘choice’ and in the claim that FDT assumes that causation can happen backwards in time.
The mistake that a causal decision theorist makes isn’t in two-boxing. It’s in being a causal decision theorist in the first place. In Newcombian games, the assumption that there is a highly-accurate predictor of you makes it clear that you are, well, predictable and not really making free choices. You’re just executing whatever source code you’re running. If this predictor thinks that you will two-box it, your fate is sealed and the best you can do is then to two-box it. The key is to just be running the right source code. And hence the first principle:
Questions in decision theory are not questions about what choices you should make with some sort of unpredictable free will. They are questions about what type of source code you should be running.
And in this sense, FDT is actually just what happens when you use causal decision theory to select what type of source code you want to enter a Newcombian game with. There’s no assumption that causation can occur backwards. FDT simply acknowledges that the source code you’re running can have a, yes, ***causal*** effect on what types of situations you will be presented with when models of you exist. FDT, properly understood, is a type of meta-causal theory. I, in fact, lament that FDT was named "functional" and not "meta-causal."
Instead of FDT assuming causal diagrams like these:
It really only assumes ones like these:
I think that many proponents of FDT fail to make this point: FDT’s advantage is that it shifts the question to what type of agent you want to be--not misleading questions of what types of “choices” you want to make. But this isn’t usually how functional decision theorists explain FDT, including Yudkowsky and Soares in their paper. And I attribute some unnecessary confusion, including misunderstandings like “FDT requires us to act as if causation happens backward in time,” to this omission.
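The "choose your source code" framing can be made concrete with a small sketch. The payoffs below are the commonly used Newcomb values, which this post does not specify, and the function names are hypothetical. The point is only that the contents of the opaque box are a function of the installed policy, so comparing policies is an ordinary forward-in-time calculation.

```python
from fractions import Fraction

# Assumed payoffs (the usual textbook values, not taken from this post).
OPAQUE_PRIZE = 1_000_000   # in the opaque box only if one-boxing was predicted
VISIBLE_PRIZE = 1_000      # always in the second box


def expected_payoff(policy: str, accuracy: Fraction) -> Fraction:
    """Expected winnings of an agent whose source code implements `policy`.

    The predictor examines the source code before the game, so the chance
    that the opaque box is full depends on the policy that was installed.
    """
    if policy == "one-box":
        p_full = accuracy            # predictor correctly foresees one-boxing
        return p_full * OPAQUE_PRIZE
    if policy == "two-box":
        p_full = 1 - accuracy        # box is full only if the predictor erred
        return p_full * OPAQUE_PRIZE + VISIBLE_PRIZE
    raise ValueError(f"unknown policy: {policy}")


def best_source_code(accuracy: Fraction) -> str:
    """Pick the policy to install before the predictor looks at the agent."""
    return max(("one-box", "two-box"), key=lambda p: expected_payoff(p, accuracy))
```

Under these assumed payoffs, a 99%-accurate predictor makes the one-boxing source code worth $990,000 in expectation against $11,000 for the two-boxing one, while a coin-flip predictor makes two-boxing come out ahead.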
To see this principle in action, let’s look at a situation presented by Will MacAskill. It’s similar to a Newcombian game with transparent boxes. And I say “similar” instead of “isomorphic” because of some vagueness which will be discussed soon. MacAskill presents this situation as follows:
You face two open boxes, Left and Right, and you must take one of them. In the Left box, there is a live bomb; taking this box will set off the bomb, setting you ablaze, and you certainly will burn slowly to death. The Right box is empty, but you have to pay $100 in order to be able to take it.
A long-dead predictor predicted whether you would choose Left or Right, by running a simulation of you and seeing what that simulation did. If the predictor predicted that you would choose Right, then she put a bomb in Left. If the predictor predicted that you would choose Left, then she did not put a bomb in Left, and the box is empty.
The predictor has a failure rate of only 1 in a trillion trillion. Helpfully, she left a note, explaining that she predicted that you would take Right, and therefore she put the bomb in Left.
You are the only person left in the universe. You have a happy life, but you know that you will never meet another agent again, nor face another situation where any of your actions will have been predicted by another agent. What box should you choose?
MacAskill claims that you should take Right because it results in a “guaranteed payoff”. Unfortunately, there is some vagueness here about what it means for a long-dead predictor to have run a simulation of you and for it to have an error rate of one in a trillion trillion. Is this simulation true to your actual behavior? What type of information about you did this long-dead predictor have access to? What is the reference class for the error rate?
Let’s assume that your source code was written long ago, that the predictor understood how it functioned, that it ran a true-to-function simulation, and that you were given an unaltered version of that source code. Then this situation is isomorphic to a transparent-box Newcombian game in which you see no money in box A (albeit more dramatic), and the confusion goes away! If this is the case, then there are only two possibilities.
- You are a causal decision theorist (or similar), the predictor made a self-fulfilling prophecy by putting the bomb in the left box alongside a note, and you will choose the right box.
- You are a functional decision theorist (or similar), the predictor made an extremely rare, one-in-a-trillion-trillion mistake, and you will unfortunately take the left box with a bomb (just as a functional decision theorist in a transparent-box Newcombian game would take only box A).
So what source code would you rather run when going into a situation like this? Assuming that you want to maximize expected value and that you don’t value your life at more than 100 trillion trillion dollars, then you want to be running the functional decision theorist’s source code. Successfully navigating this game, transparent-box Newcombian games, twin-opponent-reveals-first prisoners’ dilemmas, Parfit’s Hitchhiker situations, and the like all require that you have source code that would tell you to commit to making the suboptimal decision in the rare case in which the predictor/twin made a mistake.
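The 100 trillion trillion dollar figure follows from a short expected-value comparison, sketched here. The error rate and the $100 fee come from MacAskill's problem statement; the function names and the idea of pricing a life in dollars as an integer are illustrative assumptions, and the sketch relies on the true-to-function simulation assumed above.

```python
from fractions import Fraction

ERROR_RATE = Fraction(1, 10**24)   # "1 in a trillion trillion"
RIGHT_FEE = 100                    # dollars paid to take Right


def expected_loss(policy: str, value_of_life: int) -> Fraction:
    """Expected dollar loss of running `policy`, judged before the prediction.

    Assumes a true-to-function simulation: an error means the predictor
    foresaw the opposite of what the policy actually does.
    """
    if policy == "left":
        # Usually the predictor foresees Left and places no bomb.
        # Only on an error is there a bomb in Left, which the agent still takes.
        return ERROR_RATE * value_of_life
    if policy == "right":
        # The agent pays the fee whether or not a bomb was placed.
        return Fraction(RIGHT_FEE)
    raise ValueError(f"unknown policy: {policy}")


def break_even_value_of_life() -> Fraction:
    """Value of life at which the two policies have equal expected loss."""
    return RIGHT_FEE / ERROR_RATE
```

The break-even point is 10^26 dollars, which is 100 trillion trillion. Below it, the Left-taking source code has the lower expected loss; above it, paying the fee wins.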
Great! But what if we drop our assumptions? What if we don’t assume that this predictor’s simulation was functionally true to your behavior? Then it becomes unclear how this prediction was made, and what the reference class of agents is for which this predictor is supposedly only wrong one in a trillion trillion times. And this leads us to the second principle.
2. When a Predictor is Subjunctively Entangled with an Agent
An alternate title for this section could be “when statistical correlations are and aren’t mere.”
As established above, functional decision theorists need not assume that causation can happen backwards in time. Instead, they only need to acknowledge that a prediction and an action can both depend on an agent’s source code. This is nothing special whatsoever. It is an ordinary correlation between an agent and a predictor that arises from a common factor: the source code.
However, Yudkowsky and Soares give this type of correlation a special name in their paper: subjunctive dependence. I don’t love this term because it gives a fancy name to something that is not fancy at all. I think this might be responsible for some of the confused criticism that FDT assumes that causation can happen backward in time. Nonetheless, “subjunctive dependence” is at least workable. Yudkowsky and Soares write:
When two physical systems are computing the same function, we will say that their behaviors “subjunctively depend” upon that function.
This concept is very useful when a predictor actually knows your source code and runs it to simulate you. However, this notion of subjunctive dependence isn’t very flexible and quickly becomes less useful when a predictor is not doing this. And this is a bit of a problem that MacAskill pointed out. A predictor could potentially make good predictions without querying a model of you that is functionally equivalent to your decision-making process. He writes:
...the predictor needn’t be running your algorithm, or have anything like a representation of that algorithm, in order to predict whether you’ll one box or two-box. Perhaps the Scots tend to one-box, whereas the English tend to two-box. Perhaps the predictor knows how you’ve acted prior to that decision. Perhaps the Predictor painted the transparent box green, and knows that’s your favourite colour and you’ll struggle not to pick it up. In none of these instances is the Predictor plausibly doing anything like running the algorithm that you’re running when you make your decision. But they are still able to predict what you’ll do. (And bear in mind that the Predictor doesn’t even need to be very reliable. As long as the Predictor is better than chance, a Newcomb problem can be created.)
Here, I think that MacAskill is getting at an important point, but one that’s hard to see clearly with the wrong framework. On its face though, there’s a significant problem with this argument. Suppose that in Newcombian games, 99% of brown-eyed people one-boxed it, and 99% of blue-eyed people two-boxed it. If a predictor only made its prediction based on your eye color, then clearly the best source code to be running would be the kind that always made you two-box it regardless of your eye color. There’s nothing Newcombian, paradoxical, or even difficult about this case. And pointing out these situations is essentially how critics of MacAskill’s argument have answered it. Their counterpoint is that unless the predictor is querying a model of you that is functionally isomorphic to your decision making process, then it is only using “mere statistical correlations,” and subjunctive dependence does not apply.
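To see why the eye-color case is not Newcombian, here is a minimal sketch. It assumes the usual Newcomb payoffs (not given in this post) and a hypothetical predictor that predicts one-boxing for brown eyes and two-boxing for blue eyes. Because the prediction never looks at the policy, two-boxing simply dominates.

```python
# Assumed payoffs (the usual textbook values, not taken from this post).
OPAQUE_PRIZE = 1_000_000
VISIBLE_PRIZE = 1_000


def predicts_one_box(eye_color: str) -> bool:
    """Hypothetical predictor that uses eye color and nothing else."""
    return eye_color == "brown"


def payoff(eye_color: str, policy: str) -> int:
    """Winnings for an agent with a given eye color and installed policy."""
    # The opaque box is filled based on eye color alone.
    opaque = OPAQUE_PRIZE if predicts_one_box(eye_color) else 0
    # The policy only decides whether the visible prize is also taken.
    visible = VISIBLE_PRIZE if policy == "two-box" else 0
    return opaque + visible


def gain_from_two_boxing(eye_color: str) -> int:
    """How much better the two-boxing source code does for this eye color."""
    return payoff(eye_color, "two-box") - payoff(eye_color, "one-box")
```

For either eye color the gain from two-boxing is exactly the visible prize, which is what it means for the correlation to be confounded by the choice of source code.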
But this counterpoint and Yudkowsky and Soares’ definition of subjunctive dependence miss something! MacAskill had a point. A predictor need not know an agent’s decision-making process to make predictions based on statistical correlations that are not “mere”. Suppose that you design some agent who enters an environment with whatever source code you gave it. Then if the agent’s source code is fixed, a predictor could exploit certain statistical correlations without knowing the source code. For example, suppose the predictor used observations of the agent to make probabilistic inferences about its source code. These could even be observations about how the agent acts in other Newcombian situations. Then the predictor could, without knowing what function the agent computes, make better-than-random guesses about its behavior. This falls outside of Yudkowsky and Soares’ definition of subjunctive dependence, but it has the same effect.
So now I’d like to offer my own definition of subjunctive dependence (even though still, I maintain that the term can be confusing, and I am not a huge fan of it).
I should consider predictor P to “subjunctively depend” on agent A to the extent that P makes predictions of A’s actions based on correlations that cannot be confounded by my choice of what source code A runs.
And hopefully, it’s clear why this is what we want. When we remember that questions in decision theory are really just questions about what type of source code we want to enter an environment using, then the choice of source code can only affect predictions that depend in some way on the choice of source code. If the correlation can’t be confounded by the choice of source code, the right kind of entanglement to allow for optimal updateless behavior is present.
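One way to picture "to the extent that" is to ask how much a predictor's output would shift if a different source code were installed. The sketch below is only an illustration under invented assumptions: a predictor that takes a majority vote over a few noisy observations of the agent's past behavior, and a `sensitivity` measure that is my own shorthand rather than a definition from this post or from Yudkowsky and Soares.

```python
from fractions import Fraction
from math import comb


def behavior_predictor(policy: str, n_obs: int = 3,
                       reliability: Fraction = Fraction(3, 4)) -> Fraction:
    """Chance of predicting one-boxing from a majority of noisy observations.

    Each observation shows the agent's true policy with probability
    `reliability` (a hypothetical noise model).
    """
    p = reliability if policy == "one-box" else 1 - reliability
    majority = n_obs // 2 + 1
    return sum(
        (comb(n_obs, k) * p**k * (1 - p) ** (n_obs - k)
         for k in range(majority, n_obs + 1)),
        Fraction(0),
    )


def eye_color_predictor(policy: str, eye_color: str = "brown") -> Fraction:
    """Chance of predicting one-boxing when only eye color is consulted."""
    return Fraction(1) if eye_color == "brown" else Fraction(0)


def sensitivity(predictor) -> Fraction:
    """Shift in the prediction caused by swapping the installed policy."""
    return predictor("one-box") - predictor("two-box")
```

With three observations that are each right three times out of four, the behavior-based predictor has a sensitivity of 11/16, while the eye-color predictor has a sensitivity of zero. Only the first correlation survives a change of source code.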
Consider what I call a Mind Police situation: Suppose that there is a powerful mind policing agent that is about to encounter agent A and read its mind (look at its source code). Afterward, if the mind policer judges A to be using decision theory X, they will destroy A. Else they will do nothing.
Suppose that decision theory X is FDT (but it could be anything) and that you are agent A who happens to use FDT. If you were given the option of overwriting your source code to implement some alternative, tolerated decision theory, would you? You’d be better off if you did, and it would be the output of an optimal function for the decision-making task at hand, but it’s sort of unclear whether this is a very functional decision theorist thing to do. Because of situations like these, I think that we should consider decision theories to come in two flavors: static, which will never overwrite itself, and autoupdatable, which might.
Also, note that the example above is only a first-order version of this type of problem, but there are higher-order ones too. For example, what if the mind police destroyed agents using autoupdatable decision theories?
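The static versus autoupdatable distinction can be sketched as a toy model. Everything here is hypothetical: the `Agent` type, the use of "CDT" as the tolerated theory, and the assumption that the second-order mind police inspect the agent before it has rewritten itself.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Agent:
    theory: str           # label for the decision theory in the source code
    autoupdatable: bool   # whether the agent may overwrite its own source code


def prepare(agent: Agent, banned: str, tolerated: str) -> Agent:
    """What the agent does on learning that the mind police are coming."""
    if agent.autoupdatable and agent.theory == banned:
        # An autoupdatable agent swaps in the tolerated theory.
        return replace(agent, theory=tolerated)
    # A static agent never overwrites itself.
    return agent


def survives_first_order(agent: Agent, banned: str) -> bool:
    """First-order mind police destroy agents running the banned theory."""
    return agent.theory != banned


def survives_second_order(agent: Agent) -> bool:
    """Second-order mind police destroy agents that are autoupdatable."""
    return not agent.autoupdatable
```

In this toy model a static FDT agent is destroyed by the first-order police and an autoupdatable one escapes, while the second-order police reverse the outcome. That reversal is the higher-order problem in miniature.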
Why Roko’s Basilisk is Nonsense
A naive understanding of FDT has led some people to ask whether a superintelligent sovereign, if one were ever developed, would be rational to torture everyone who didn’t help to bring it into existence. The idea would be that this sovereign might consider this to be part of an updateless strategy to help it come into existence more quickly and accomplish its goals more effectively.
Fortunately, a proper understanding of subjunctive dependence tells us that an optimally-behaving embedded agent doesn’t need to pretend that causation can happen backward in time. Such a sovereign would not be in control of its source code, and it can’t execute an updateless strategy if there was nothing there to not-update on in the first place before that source code was written. So Roko’s Basilisk is only an information hazard if FDT is poorly understood.
It's all about the source code.