I found Joe Carlsmith’s discussion of the “Perfect deterministic twin prisoner’s dilemma” (section II at this link) to be quite convincing on the narrow question of whether there’s something wrong with CDT. Can you read that and share what you think?
Second, FDT doesn’t always get you the most utility. For example, consider the following exotic possible world: the actual world. In this one, if you hang around academic philosophers, they will think you’re silly if you adopt FDT. This will make you sad. So adopting FDT gets you less utility. Additionally, in the actual world, I would get less utility if I were an FDTer, because I find it fun to argue with FDTers about decision-theory. Or imagine that the government passed a law where they tortured everyone who thought FDT was the right view. FDTers wouldn’t be better off.
What? This is a bad argument, because this doesn't depend on the decision theory in question at all. For any decision theory XDT, it is possible to construct a world where Omega gives you one bazillion utilons if you don't follow XDT, and murders you if you do follow XDT. This is part of why these problems are called "unfair". The point of FDT is that for fair problems like Parfit's hitchhiker (isomorphic to transparent Newcomb and similar to William MacAskill's version of Bomb! but not as contrived) FDT wins.
I'm going to spend some time on Parfit's hitchhiker because it illustrates the issue with EDT + CDT: they can't commit to anything. I claim that lots of problems like Parfit's hitchhiker come up in real life all the time. Blackmail is just evil Parfit's hitchhiker. Lots of employment situations are Parfit's hitchhiker: someone might hire you if and only if they think you'll actually do the work (and have limited recourse to stop you if not). FDT is the only decision theoretical framework which lets you commit to anything at all.[1]
Yes there are non-decision theoretic frameworks which do something like commitment (virtue-ish ethics or deontology) but these aren't mathematically formulated.
As to your point that FDT isn't well-defined mathematically yet... uhh... yeah, everyone knows that. That's one of the main points of the logical uncertainty research agenda. That's why thousands of keystrokes have been spilled over Löb's theorem. There are lots of Löbian obstacles to get around when thinking about thinking. It's possible that something like Logical Inductors (which can handle logical uncertainty) can solve FDT if given the right series of inputs, but I don't know.
I'm aware that you endorse giving in to blackmail in extremis, which I disagree with as a general position (and I would definitely caution against posting it publicly on the internet).
it illustrates the issue with EDT + CDT: they can't commit to anything
What does this mean? Of course an agent who endorses EDT or CDT can commit to things — commitments are actions they decide between, like anything else.
In this case "commitment" means something specific.
Suppose you are a selfish CDT agent, and I am considering whether to hire you to clean my house. Once you're inside my house, you might steal my stuff instead of cleaning my house. Suppose that California Labour Laws require that I pay you up-front and I know I have no chance of getting my money or stuff back. Say your preference order is "Steal" > "Do the job" > "Don't get hired"
You, before being hired, might say "Oh JB, I promise not to steal, please give me this job" but, once you're inside the house, the only causal effect you can have on the outcome is to steal or not steal. And since CDT only considers utilities downstream from each individual decision at any point in time, CDT will always steal. A CDT-operating agent is incapable of committing not to steal from me, in this case.
Therefore I will not hire you to clean my house, and you get minimal utility.
An FDT agent reasons thusly: suppose FDT endorses stealing. In that universe, JB knows this and does not hire me, so I do not get hired and get minimal utility. If FDT endorses doing the job, JB knows this and does hire me, so I do get hired and then do the job. Therefore I will do the job.
Therefore I will hire an FDT agent to clean my house, and the FDT agent will get the middling utility.
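To make the comparison concrete, here is a minimal sketch of the house-cleaning example. The utility numbers (2, 1, 0) are my own assumption, chosen only to respect the stated preference order, and `employer_hires` assumes JB predicts the worker's policy perfectly. It is an illustration of the argument, not a formalization of either decision theory.

```python
# Hypothetical utilities respecting "Steal" > "Do the job" > "Don't get hired".
UTILITY = {"steal": 2, "work": 1, "not_hired": 0}


def employer_hires(policy):
    """Assumption: JB predicts the worker's policy perfectly and hires only non-thieves."""
    return policy == "work"


def outcome(policy):
    """Worker's utility once JB's prediction of the policy is taken into account."""
    return UTILITY[policy] if employer_hires(policy) else UTILITY["not_hired"]


def cdt_choice_once_inside():
    """CDT evaluated inside the house: the hiring decision is already fixed, so stealing wins."""
    return max(["steal", "work"], key=lambda action: UTILITY[action])


def fdt_choice():
    """FDT-style choice: pick the policy whose prediction-inclusive outcome is best."""
    return max(["steal", "work"], key=outcome)


cdt_policy = cdt_choice_once_inside()
fdt_policy = fdt_choice()
cdt_payoff = outcome(cdt_policy)  # the predicted thief never gets hired
fdt_payoff = outcome(fdt_policy)  # the predicted worker gets hired and works

print("CDT:", cdt_policy, cdt_payoff)
print("FDT:", fdt_policy, fdt_payoff)
```

The only difference between the two functions is what gets held fixed while comparing actions: `cdt_choice_once_inside` treats the hiring decision as settled, while `fdt_choice` lets it vary with the policy.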
You're stipulating that CDT-me in your thought experiment doesn't have access to any (psychological) actions that causally bind me to not steal from you. Right? Then sure, CDT-me would steal if he ended up in your house, and you'd want to prevent this.
But you're also stipulating that I do have access to the action "decide to follow FDT". That's something that would causally bind me to not steal from you, if I took it before you made your decision whether to hire me. Why is this action a legitimate option in the hypothetical, while various other non-FDT ways of binding oneself aren't?
Fair point, and CDT agents self-modifying is a thing that has been studied. You might e.g. modify yourself to specifically not prefer the stealing in this case. My understanding is that these modifications are equivalent to the agent changing its decision theory: since the modifications that an agent chooses are predictable based on the decision properties of the scenarios it finds itself in, they can themselves be captured by a description of a decision procedure, which is just what a decision theory is.
I think the resulting decision theory is called son-of-CDT, and is mostly like FDT but not quite in certain circumstances. But this is deep MIRI knowledge which I'm not sure is actually published and I'm entirely going off of what I've seen ex-MIRIans post on LessWrong.
For any decision theory XDT, it is possible to construct a world where Omega gives you one bazillion utilons if you don’t follow XDT, and murders you if you do follow XDT. This is part of why these problems are called “unfair”.
That's like saying that the Halting Problem isn't an issue because problems that involve self-reference are unfair. You can't just avoid the Halting Problem by saying "no explicit self-reference", because seemingly reasonable stipulations that don't explicitly have self-reference in them may imply it anyway.
It may turn out that for some decision theories, reasonable-seeming problems that don't explicitly say "Omega punishes you if you follow XDT" may be equivalent to "Omega punishes you if you follow XDT" anyway.
That's like saying that the Halting Problem isn't an issue because problems that involve self-reference are unfair. You can't just avoid the Halting Problem by saying "no explicit self-reference", because seemingly reasonable stipulations that don't explicitly have self-reference in them may imply it anyway.
If people hadn't done roughly this, we would never have gotten the entire field of verifiable programs. Likewise, no-free-lunch theorems provide evidence that no brain can ever exist, and similar impossibility results show that GPT-style language models cannot exist (it's impossible to learn the rules to a formal language from only positive examples of that language).
Ruling out a class of things as "unfair" or "unrealistic" is sometimes necessary, for the same reason that Gödel's incompleteness theorem shouldn't stop you doing maths.
FDT is good because it works in the (relatively) un-contrived Newcomb's problem which is equivalent to the totally un-contrived Parfit's Hitchhiker problem. If you want to extend decision theory to problems where the agent is deceived, or is punished because of following a particular decision theory, you find that you need to do some mixture of
If you want to extend decision theory to problems where the agent is deceived, or is punished because of following a particular decision theory,
The issue is that there may not be a choice. If you don't want to extend decision theory to that kind of problem, exactly what are you going to do to exclude problems like that? It won't necessarily have a line "if decision_theory == 'XDT'" in it. It may not be very obvious, and it may not even be possible to determine, that some problem falls into a category that you want to exclude.
No, what I mean is that there's a symmetry between the setup "Omega kills you if you follow FDT for the crime of following FDT" and the setup "Omega kills you if you follow CDT for the crime of following CDT" which isn't necessarily present in setups like "Omega simulates you and then based on your actions in the simulation, does X or Y". There are other problems with the second kind of system in some cases if Omega is allowed to lie, since this also allows symmetry into the system.
You can set up a version of Newcomb's problem where Omega never lies, but you can't do that for e.g. Newcomb's revenge, since in that case, Omega has to tell its simulation of you that you're in the regular Newcomb's problem.
Actually, the halting problem (well, its generalization, Rice's theorem) allows you to get a more precise intuition for why punishing agents iff they follow XDT is 'unfair' (it would be Turing-uncomputable for Omega to decide if an agent follows XDT, even with his omniscience and infinite compute).
I address that. The point is, CDTers have a parallel claim to Newcomb's problem being unfair.
The point isn't just that FDT isn't well-defined mathematically. I explain in-principle reasons why it can't be.
I don't think the CDTers' claim that Newcomb's problem is unfair is a valid one. The parallel claim is Omega saying "OK, you follow CDT, I am going to kill you." which is a separate problem from Newcomblike problems where you're being simulated.
And I don't think your in-principle reasons actually hold up in practice. FDT's output can be understood as "Output X such that, given the statement 'FDT outputs X', value is maximized" which in classical logic does break by the principle of explosion. This is, again, why the entire logical uncertainty agenda exists.
(My guess is that the answer is something like a modified FDT(N) outputting "Output X such that, if you feed 'FDT outputs X' into a logical inductor and run it for N steps, expected value is maximized" which provably[1] converge as
A claim I would agree with is something like: "FDT as currently written down is not well-defined, but appears to be intuitively navigable and may be modifiable into something like F'DT which basically gets at the same flavour, using the language of logical uncertainty"? I get the impression your view is much stronger than that, along the lines of "FDT is so completely poorly-defined it's pointless to ever look in that direction again, salt the earth." based on the general vibe of this article, but that's not justifiable from the evidence that we have!
I don't have a proof but this is the general kind of thing that logical inductors prove.
Why is Newcomb's any less fair than Newcomb's revenge? https://www.umsu.de/blog/2022/772
"FDT outputs X" is importantly underspecified. If you just change what it says in your exact situation, then you have no account of how it changes the other algorithms like the simulator.
The bigger problem though isn't the counterlogicals but analyzing how in the counterlogical world other algorithms would be different.
Newcomb's Revenge actually is a fair problem which is not quite as much of a problem as "If you use FDT, I will kill you". But if you accept Newcomb's Revenge as symmetric to Newcomb's problem then you have to also deny that any decision theory can be better than any other. For any decision problem where we simulate an agent in situation X, decide a payoff matrix based on X for a different instance which is in a different situation Y, then you can come up with a flip. Newcomb's problem X is important because it depends on your behaviour in the situation that you are actually in! That breaks the symmetry.
And ok if you actually want to think about this in depth it gets super cursed super quickly, because any FDT agent will have to have a prior over being in a Newcomblike problem vs a Revengelike problem and if it turns out that in the real world, Newcomb problems are rare and Revenge problems are common, then FDT will update accordingly and start one-boxing, but actually the concept of high-fidelity simulation at some level itself breaks down because the simulator can induce any behaviour in your simulated self by giving you arbitrary inputs.
Newcomb's problem also works if you treat the simulated agents as agents themselves who have an accurate prior over their own situation (the maths takes a while to shake out but it definitely does). While Revenge only works in this way if the simulated agents think they're being simulated by a Newcomblike Omega (otherwise, if they know they're 50:50 real or being simulated by a Revengelike Omega, they will two-box).
I don't understand your claim on how exactly the counterlogicals break. Normally the obstacle is Löb's theorem, which IIUC Logical Induction fixes, but if you have a stronger argument then I would like to hear it.
So you need some explanation of why it is that nearly all the decision-theory experts—who write monstrously complicated papers with math that would go over your head—think FDT is wrong, but intro philosophy students who know almost nothing about decision theory think that it’s right. In general, you should be skeptical of views that are rejected by ~100% of relevant experts, even after considering them at length.
This is an incorrect characterization of the field. If you define a "relevant expert" as an academic who publishes exclusively in academic philosophy journals (which are mostly read, at best, by other academics), then 15 years ago that might have been a valid (if lame) argument from authority / status. But these days, the most accomplished and high impact philosophers do at least some of their work in industry (e.g. frontier AI labs and non-university-backed research orgs), and a significant chunk of them accept some flavor of FDT as true.
"Adopt FDT" is probably too strong as a rephrasing; by "accept some flavor of FDT as true", I intended to include anyone who one-boxes and / or cooperates in the prisoner's dilemma because they disagree with you and academic-only philosophers on this:
One is called causal decision theory (CDT). I’m trying to be impartial, so I won’t tell you that it’s (probably) the correct view.
Which is definitely a minority view among rationalists, and (I am pretty confident) a minority view among researchers and philosophers at frontier AI labs and similar. cf. https://www.lesswrong.com/posts/n6wajkE3Tpfn6sd5j/christiano-decision-theory-excerpt, particularly:
Robert Wiblin: So you think it's the case that when it comes to programming an AI, there's actually a lot of agreement on what kind of decision theory it should be using in practice. Or at least, people agree that it needn't be causal decision theory, even though philosophers think in some more fundamental sense causal decision theory is the right decision theory.
Paul Christiano: I think that's right. I don't know exactly what... I think philosophers don't think that much about that question. But I think it's not a tenable position, and if you really got into an argument with philosophers it wouldn't be a tenable position that you should program your AI to use causal decision theory.
Of course there's still a bunch of intra-rationalist and AI researcher disagreement on the specifics of FDT / UDT, e.g. a bunch of them probably disagree specifically because of updatelessness / EDT-with-tickle reasons, which are (debatably) not just "some flavor of FDT" as I said above.
IDK who exactly the most prominent people who have published specific thoughts on this are, but as a starting point the MacAskill post from 2019 that you cited got a bunch of substantive replies. I didn't check all of their credentials / accomplishments and what they're up to these days, but I suspect the real locus of our disagreement is more about the bounds of who qualifies as a philosopher / expert. I think "academic philosophy credentials" are not a good proxy for legible philosophy expertise (see sibling reply for more).
A big chunk of philosophers one-box. These people generally adopt evidential decision theory. They also cooperate in PD with a twin. So no, FDT isn't just the same as one-boxing. If it was, then all EDTers would be FDTers.
I don't think you should program an AI to follow CDT. This is because decision theories are theories of which actions are rational, not the dispositions that you want to have. So I agree with Christiano--this obviously can't be the criterion for being a functional decision theorist.
Here's one test: there is not a single person, to the best of my knowledge, anywhere in the world who adopts FDT and has a Ph.D in philosophy. This fact is surprising if it's the right view.
Here's one test: there is not a single person, to the best of my knowledge, anywhere in the world who adopts FDT and has a Ph.D in philosophy. This fact is surprising if it's the right view.
False:
With that said I was a CDTer who became an FDTer and also a rationalist who became an empiricist so maybe it's my job to bring balance to the force. (source)
Tyler holds a PhD in analytic philosophy and democratic theory from Rutgers University (source)
Insofar as we think we should defer to some extent to [group of people] on [topic], shouldn't we defer to [group of people whose job it is to scrutinize arguments on that topic] more so than [group of people who "are most accomplished and high impact"]? What does the latter have to do with their expertise about decision theory?
As a starting point / outside view heuristic, maybe? But I think this:
> [group of people whose job it is to scrutinize arguments on that topic]
Is not a particularly good description of what academia and the academic peer review process is actually doing in many fields, including academic philosophy. Philosophy is not the worst field in terms of pathologies of academia, but IMO it is punching far below its weight given the intellectual horsepower and human capital it attracts. Compare to academic economics: https://marginalrevolution.com/marginalrevolution/2026/03/what-is-economics-these-days.html which often punches above its weight, in part because it is far less insular. Accomplishment and impact across disciplines are alternatives to academic credentials as a way of making expertise legible.
Is not a particularly good description of what academia and the academic peer review process is actually doing in many fields, including academic philosophy
I don't understand why you think this. I know academia has pathologies, sure, but it seems pretty clear — as a mechanistic claim, not an "outside view heuristic" — that academic philosophers are trained to scrutinize philosophical arguments. I don't think you've really explained why I should believe that "accomplished and high impact" people have more experience with scrutinizing philosophical arguments than academic philosophers.
My experience taking philosophy classes at MIT was that academic philosophers are trained very poorly at scrutinizing philosophical arguments. People adjacent to philosophy (logicians, mathematicians, and linguists, but not physicists) are generally better able to work through philosophical arguments. The first two because learning the language forces them to learn how to think argumentatively, the third because it deals with similar structure without the baggage introduced by academic philosophers.
For example, if I present the Grandfather Paradox to a professor in these four fields, I expect:
The logician would ask me for the axioms, and keep pushing until I admit one of them is, "your grandfather cannot die."
The mathematician would say, "obviously you cannot kill your grandfather, otherwise you have a contradiction."
The linguist would ask me what exactly I mean by grandfather—biological? has he frozen sperm? does he have children yet? a twin brother?—and conclude, "you can kill your grandfather, at least the person that word refers to in the mind of the time traveler."
The philosopher would say, "oh that's such an interesting thought experiment. I don't know, can you? It doesn't seem like you can't, and yet that seems like it would create a contradiction." Then they would try putting the paradox in premise-conclusion form, and two hours later conclude, "well many of these axioms and implications seem a little fishy, but I would have to say you can/cannot [equally likely] kill your grandfather".
Note: decision theories are theories about rationality. They tell you what decisions are wise and sensible to make. They are not theories about the desirable dispositions to have or about how you should program an AI. There are interesting questions about those sorts of things, but they aren’t what decision theory is about.
Tendentious. Certainly you are welcome to define your way to victory. At that point, why bother writing the rest of the post?
There’s a response to this that I’ve heard from a lot of functional decision theorists. Here’s the idea: you don’t really know if you are the algorithm being simulated in Newcomb’s problem or the actual person. For all you know, you might be the simulation, in which case you outputting “one-box” leads to more utility. I find this response very bizarre:
- I know I’m not a simulated algorithm. The simulated algorithm isn’t conscious (we can stipulate). I am.
This disagreement probably isn't tractable, but, uh, false. I assume it's not a load-bearing argument for you, though.
Note that the real world contains many Newcomb-like problems. People do in fact go around making decisions that depend on their beliefs about other people's decision algorithms. So one question I have is whether you think CDT is good either as a descriptive or normative decision theory. It seems quite obvious that it's a bad descriptive theory of human behavior: people model other people's decision algorithms, and people implement decision algorithms in that context. It also really leaves quite a lot to be desired as a normative theory: there are wildly fewer positive-sum trades that can be made with agents that actually implement (or approximate) CDT, compared to FDT.
Note that the real world contains many Newcomb-like problems. People do in fact go around making decisions that depend on their beliefs about other people's decision algorithms.
I disagree, see here for why. (I think "decisions that depend on their beliefs about other people's decision algorithms" is too weak to get you a Newcomblike structure.)
>Certainly you are welcome to define your way to victory
I mean, you’re free to use the term “decision theory” to refer to whatever you want. But the claim that the theory about what decisions to make is the same thing as the theory about how to program an AI or what are the most desirable dispositions to have is what actually requires an argument.
Yep. Presumably BB's implicit claim here is something like: "People make claims that we should do such and such thing because of FDT, but those claims don't follow from 'we should program an AI to follow FDT'."
For example, here's Richard Ngo saying we should cooperate with the values of civilizations outside our lightcone because of FDT.
The first calculator is like calculators we are used to. The second calculator is from a foreign land: it’s identical except that the numbers it outputs always come with a negative sign (‘–’) in front of them when you’d expect there to be none, and no negative sign when you expect there to be one. Are these calculators running the same algorithm or not?
Yes: Knowing what output the one gives helps you figure out what output the other gives.
"But wait, don't we already know what output the other gives? It's a calculator."
Yes, this test admittedly only works for programs that are hard enough to run that you notice a shortcut.
Fortunately, the test only needs to work for finding instances of yourself, and you don't yet know what you will do while you are calculating the consequences of your possible actions. So: In order to figure out whether to one-box, start with the assumption that running your sourcecode on your inputs outputs "one-box", notice that you can now infer more about what you were predicted to do, act accordingly. Changing in what language you are being predicted does not break this.
I explain at length why you can't just rely on correlations between two algorithms wrt their outputs.
I'm not trying to rely on statistical correlation. Suppose I'm in a Prisoner's Dilemma with a twin, I'm sitting in a red room, he's sitting in a blue room. I am implemented in C and he is implemented in c, which is the same programming language except all the lower- and uppercase characters have the opposite interpretation. In order to figure out whether to cooperate, I assume that FDT cooperates in my situation and see what I can deduce. Let us reason about the computation trace t by which FDT cooperated in my situation. (We can't write out t fully, since then it would be longer than itself.) If in t we swap all occurrences of "blue" and "red", it is still a legal computation trace; thus we can infer that FDT cooperates also in the situation where I am sitting in a blue room and my twin is sitting in a red room. Next, we can prove that translating a program between C and c never changes behavior; thus we can infer that FDT cooperates also in the situation where my twin is sitting in a blue room and I am sitting in a red room. Now we have shown that if FDT cooperates in my situation, we achieve mutual cooperation. Similarly, if FDT defects in my situation, we achieve mutual defection. Therefore, I cooperate.
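A toy version of that argument, as a sketch only: the payoff numbers are hypothetical standard Prisoner's Dilemma values, and `twin_action` simply hard-codes the conclusion of the symmetry argument (the twin's output equals mine because the procedure never branches on room colour or on the C/c encoding). It does not reason about computation traces.

```python
# Hypothetical payoffs for "me", indexed by (my_action, twin_action).
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}


def swap_colours(room):
    """The relabelling used in the symmetry argument."""
    return {"red": "blue", "blue": "red"}[room]


def twin_action(my_hypothesised_action, my_room):
    """Deduce the twin's move from the hypothesis about my own output.

    Assumption: the twin runs the same procedure with the colours swapped,
    and the procedure never branches on colour, so the outputs coincide.
    """
    twin_room = swap_colours(my_room)
    assert twin_room != my_room  # the rooms differ; the behaviour does not
    return my_hypothesised_action


def fdt_decide(my_room):
    """Suppose 'the procedure outputs a', deduce the twin's move, keep the best a."""
    return max("CD", key=lambda a: PAYOFF[(a, twin_action(a, my_room))])


def cdt_decide():
    """Hold the twin's move fixed: defect if it is better against every twin move."""
    defect_dominates = all(PAYOFF[("D", t)] > PAYOFF[("C", t)] for t in "CD")
    return "D" if defect_dominates else "C"


fdt_move = fdt_decide("red")
cdt_move = cdt_decide()
print("FDT-style:", fdt_move, "CDT-style:", cdt_move)
```

Only the diagonal outcomes (C, C) and (D, D) are ever compared in `fdt_decide`, which is the whole content of the deduction above.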
- Decision theory generally assumes that you’re self-interested. But if I’m the algorithm, then I care about algorithm me—not the version in the real world. So then I wouldn’t care about what the output of the algorithm was.
Yes, I think FDT (and every other decision rule) becomes kind of incoherent if you are trying to do indexical selfishness. But I'm not indexically selfish! If I'm in a simulation, I want the version of me outside the sim to be happy too, and in general for the world to be good. We both know that (under materialistic assumptions) personal identity is not well-defined, and selfishness is irrational. You even wrote a post about this! So I don't think it's FDT's fault that it doesn't work well under indexical selfishness.
I also find it weird that you stipulate that you can tell about yourself that you are conscious but the simulated being is not. If that's the case, then there is a noticeable difference between you and the sim that influences your decisions, so the sim is not a good predictor of your behavior.
Why does CDT become incoherent if you're indexically selfish?
Something can be a good predictor of you even if there is a noticeable difference between you and it. The standard Newcomb's problem doesn't assume that there's a conscious organism running in the simulator's mind.
Suppose you're copied and one of the copies is asked to decide between A) he gets one scoop of ice cream or B) the other copy gets two scoops of ice cream. Under CDT with indexical selfishness, before you're copied (or know which copy you are) you want your future self to choose B (and would self-bind or self-modify if you could to cause this), but afterwards you'd choose A.
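The arithmetic behind that reversal, as a rough sketch. It assumes utility is linear in scoops and that, before copying, you assign probability 0.5 to ending up as the copy who gets asked; both are my assumptions for illustration.

```python
# Scoops received by each copy under each option.
SCOOPS = {
    "A": {"chooser": 1, "other": 0},
    "B": {"chooser": 0, "other": 2},
}


def ex_ante_value(option, p_chooser=0.5):
    """Expected scoops for 'me' before I know which copy I will be."""
    payoff = SCOOPS[option]
    return p_chooser * payoff["chooser"] + (1 - p_chooser) * payoff["other"]


def ex_post_value(option):
    """Scoops for 'me' once I know I am the copy being asked."""
    return SCOOPS[option]["chooser"]


best_before = max(SCOOPS, key=ex_ante_value)
best_after = max(SCOOPS, key=ex_post_value)
print("before copying:", best_before, "after copying:", best_after)
```

The preferred option flips between the two evaluations, which is the dynamic inconsistency being pointed at.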
Look, I reckon the issues that you raise here are real and good to ponder about. It's true that FDT is about intervening on logical dependency graphs and it's unclear foundationally how we ought to go about constructing logical dependency graphs.
But almost the same thing is true for causal dependency graphs. Pearl's definitions of causality are ultimately fairly circular. What is an intervention, platonically? Not as a sort of practical "how do you go and simulate an intervention in an experiment?", but what fundamentally is this intervention thing we're trying to approximate? And the answer is, it's the thing that makes the resulting distribution match the true causal graph. In Pearl's words, the true interventions are conducted "on God's computer". God's computer here being precisely the platonically ideal causal graph. Okay, great. So then what's the correct causal graph; how do we determine the program on God's computer? Well, the correct causal graph is the one that gives us the right consequences of interventions. So this is circular. I don't think the non-Pearlean alternatives are better; I think they're worse.
One objection here might go like: sure we don't know really how causal dependency graphs are founded, but we know we can't construct logical dependency graphs of the type we need, so we have a stronger objection to FDT. And... maybe, but I still think informal FDT has some advantages over informal CDT.
I might cheekily characterize FDT versus CDT like this: CDT is what you get when you go and try and model your situation while whispering the word "causal" in the back of your mind while FDT is what you get when you go and try and model your situation while whispering the word "logical" in the back of your mind. And I think FDT might genuinely win this one. If your logical vibes coincide with your causal vibes, then there's no problem. Where they come apart, like in Newcomb's problem, the FDT vibes are better.
Now you might say, in the smoking lesion, where do the logical vibes push me? Here, I happen to agree with you that the answer is unclear, I don't think they clearly agree with the causal vibes. But I can't chalk this up as a win for CDT. If I make the problem precise in such a way that the CDT answer is clearly intuitively correct (e.g. no one else has thought through the decision theory of this problem yet, so my doing so puts them outside my reference class), then I think the logical vibes and causal vibes do agree. In other cases - all the other people truly are in my reference class - then I'm not sure that the causal recommendation is appropriate.
A more specific criticism is that the kind of dependency that CDT says is the correct one to evaluate does not itself depend on any deeper justification. It's just that intuitively, causal-vibe dependencies are usually the ones we want to track. But, while intuition is a good guide as to what we should do in typical cases, it is a poor guide as to what we should choose in unusual cases. So I don't see much reason to presume in favour of CDT here.
I am aware that I am somewhat outlier-skeptical of causation with respect to most philosophers, but I am prepared to defend this position.
But almost the same thing is true for causal dependency graphs. Pearl's definitions of causality are ultimately fairly circular.
Note that BB approvingly quotes Schwarz, who thinks that other formulations of CDT by Lewis, Joyce, or Skyrms are better than Pearl's.
But even Pearl’s theory plausibly doesn’t appeal to impossible propositions to evaluate ordinary options. Lewis’s or Joyce’s or Skyrms’s certainly doesn’t.
I feel this is a dubious move by Schwarz. Maybe it's true that Joyce's and Skyrms' theories of causation don't appeal to impossible propositions. Lewis' - "take the closest possible world where your action is B instead of A" doesn't, but we can equally ask: is there a fact about which possible world in which our action is B instead of A is the closest? Lewis' guidance on this account, as I recall, is that we should take the closest possible world "in the ordinary sense", which strikes me as a dumbfoundingly inadequate attempt to answer this question. I do not think it is an accident that Pearl's "comparatively sketchy" theory is much more widespread.
The point is: maybe you can pin a particular difficulty on FDT but not CDT, but that doesn't mean the various accounts of causation don't suffer from objections of similar strength, it just means that if they do, they differ in their particularities. I think they probably do.
Back when the FDT paper came out, I considered counterpossibles an inherently interesting question.
For example, I wondered, is there a Bayesian formulation of probabilistic primality testing? Well, there is, as Gaifman explains, but can we compute P(x is prime|witnesses) for a particular x and particular witnesses? That seems to require counterpossibles: even if the number is prime, we have to consider the probability that these would have been witnesses if it were not.
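Here is a minimal sketch of the easy part of that calculation, in Python. It is not Gaifman's formulation. It assumes a prior P(x is prime), that a prime passes every Miller-Rabin round, and that a composite passes each independently chosen round with probability at most 1/4 (the standard worst-case bound). The function name and parameters are hypothetical. Because it uses the worst-case bound, the result is only a lower bound on the posterior, and it sidesteps the harder question of what that likelihood means for one particular x.

```python
from fractions import Fraction

def posterior_prime_lower_bound(prior: Fraction, rounds_passed: int) -> Fraction:
    """Lower bound on P(x is prime | x passed `rounds_passed` Miller-Rabin rounds).

    Illustrative assumptions:
      - P(pass | prime) = 1
      - P(pass | composite) <= 1/4 per independently chosen base
    """
    # Worst-case chance that a composite survives every round.
    like_composite = Fraction(1, 4) ** rounds_passed
    # Bayes' rule with P(pass | prime) = 1.
    return prior / (prior + (1 - prior) * like_composite)

# Example: a prior of 1/100 and 10 passed rounds.
bound = posterior_prime_lower_bound(Fraction(1, 100), 10)
print(float(bound))
```

The counterpossible worry lives entirely in `like_composite`: for a number that is in fact prime, that term is the probability of these witnesses "if it were not".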
Maybe back then, it didn't seem so strange to me to say, oh by the way when we figure out counterpossibles we can plug that solution into a decision theory.
Now though, I'm more interested in decision theory than in counterpossibles. And I don't think counterpossibles are essential to decision theory, so saying you'll solve counterpossibles as a subproblem seems like a distraction.
It would be one thing if counterpossibles were easy, so whatever you think of counterpossibles, it's just the best way to solve decision theory. But... well, October 2027 will be the ten-year anniversary of the FDT paper.
Who knows what I would have thought of this post if you had made it when FDT first came out. I mean, a lot of your post is just repeating the unsolved subproblems in the FDT paper itself. But many years later, my response to hearing that is not "yes, we know, that's right in the paper" but "yes, that's why FDT was a disappointment and I'm thinking about the next thing".
I think you might benefit from designing your own UDT 1.0 variant (i.e. a variant of Wei Dai's original) that doesn't require logical omniscience/infinite compute. Note that UDT 1.1 is mentioned indirectly in the published Death in Damascus paper, but what is actually described works similarly to UDT 1.0. Martín Soto and Abram Demski have put some effort into writing useful notes. I think they're still missing a precise constructive description of Bayesian Logical Inductors (BLIs), but you could partner with a mathematician and go around to the researchers involved until you've collected enough mental model content to have it written up.
4 Should you light yourself on fire for no benefit?
I think this section is addressed best by https://intelligence.org/2017/04/07/decisions-are-for-making-bad-outcomes-inconsistent/. Your reasoning about it not providing any benefits to you "at this point" is basically assuming CDT and you could apply the same line of argumentation to Newcomb's paradox to argue for twoboxing.
I know I’m not a simulated algorithm. The simulated algorithm isn’t conscious (we can stipulate). I am.
This is effectively denying the possibility of the predictors, in general.
Third, FDT isn’t actually the theory that leaves you with most expected utility on average. In fact, in many cases, it’s EDT (perhaps updateless) that leaves you with the most expected utility. For example, in the smoker’s lesion case, EDTers tend to finish better-off than other people. In smoker’s lesion, smoking correlates with worse health, but it doesn’t cause it. But EDTers are less likely to smoke, so on average they’ll have better health.
Uhh, suppose you get +1 utility for smoking and -10 utility if you get cancer. A population of EDT agents does not smoke at all, but let's say 50% of them get cancer. All of the population of CDT agents smoke, and the same 50% of them get cancer.
Question: which population gets higher utility?
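Spelling out the arithmetic for the numbers above, as a small Python sketch (the function name is just for illustration):

```python
def average_utility(smoking_rate: float, cancer_rate: float,
                    u_smoke: float = 1.0, u_cancer: float = -10.0) -> float:
    """Average utility per agent: +1 per smoker, -10 per cancer case."""
    return smoking_rate * u_smoke + cancer_rate * u_cancer

# EDT population: nobody smokes, 50% get cancer.
edt = average_utility(smoking_rate=0.0, cancer_rate=0.5)
# CDT population: everybody smokes, the same 50% get cancer.
cdt = average_utility(smoking_rate=1.0, cancer_rate=0.5)
print(edt, cdt)  # -5.0 -4.0
```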
EDIT: okay, I think I violated the premise of the thought experiment. HMMM. You can postulate that the correlation and the mechanism were just discovered; what would those agents do on the next time step? EDT agents would miss one smoking opportunity and then correct as the correlation vanishes, or maybe they get stuck not smoking at all, if all of them are EDT.
But you do agree that in my corrected version, EDT half of the agents miss one smoking opportunity?
E.g. suppose someone decides to collect statistics for the first time, and will do it again 10 years later. They discover that smoking correlates with cancer and moreover find out that it's entirely through the lesion. For the next 10 years EDT agents do not smoke. All CDT agents do smoke. Then they collect the statistics again, and there is no correlation. Do EDT agents start smoking too, or what? So they just missed some smoking time?
Like, wouldn't acting on this correlation destroy the correlation?
I think that you severely misunderstood the way predictors work.
Suppose that you have an army of
Indeed, suppose that the army of agents was put into the Newcomb's box experiment. If the box is non-transparent, then the entire army has the same probability
Similarly, if the agents face a transparent Newcomb, then suppose that they have a probability
Not following what you think I'm misunderstanding or what this has to do with the things I say. I grant, of course, that the average returns are higher if you're disposed to one-box in transparent Newcomb. So if you're just looking at timelessly beneficial dispositions, those are the same across transparent and opaque Newcomb. But the entire dispute is about whether those perfectly match up with rational choice.
(Let me know if I got your point right).
But the entire dispute is about whether those perfectly match up with rational choice.
It is a rational choice between what and what? Between honestly two-boxing, honestly one-boxing and fooling the Predictor into believing that you'll one-box, then two-boxing? The problem's point is that the Predictor isn't THAT stupid.
For what you should do in the problem. No one thinks you'll fool the predictor. What we think is that it is irrational to take some action if, at the time you take it, you know it will simply result in you having less money than if you took a different action. We're not under the illusion that such people on average leave with more money or that you should precommit to two-boxing or any such things.
it is irrational to take some action if, at the time you take it, you know it will simply result in you having less money than if you took a different action
Suppose, alternatively, that you and another person (or even an optimizer for entirely different values) are both given a chance to pay $1 so that the counterpart receives $2. Does it become rational to pay or not to pay?
I lean towards thinking you shouldn't pay, but I'm somewhat uncertain about that. I lean causal, but am sympathetic both to EDT and some third undiscovered theory.
My personal ranking of decision theories goes EDT > UDT/FDT > CDT
I don’t really see why I should believe the only important thing to consider about my decisions is their causal impact. Even more convincing to me than Newcomb’s problem is the perfect deterministic twin prisoner’s dilemma. On the other hand, the UDT/FDT belief that the true theory of rationality can’t be diachronically inconsistent seems like wishful thinking to me. There is a certain sort of beauty and elegance in thinking that the correct theory about what actions to take is the same as the theory about what dispositions it would be luckiest to have, but I don’t know why I should believe that either. EDT is closer to my common sense: it never makes you sacrifice your expected utility on the altar of theoretical ideas about causality or subjunctive dependency or ideal-agenthood. It never forces you to think about possible worlds that you know aren’t the actual world.
I do wonder whether at least some proponents of FDT are espousing it strategically. It seems like the best strategy wrt Newcomb style problems is to spend your life espousing one boxing and even one box for lower stakes examples, but ultimately two box at the first critical try.
Normally I find this sort of "my opponents don't actually believe what they say" reasoning to be lazy, but there are a couple things that make me find it plausible in this case.
The ASI isn't going to be worse at prediction than you, currently predicting that people are going to two-box in Newcomb's problem.
I have one more contribution. I have myself attempted to justify causal interventional semantics from more fundamental principle. The justification I landed on was (abridged):
NOW, here is an interesting point. In Newcomb's problem, a naive model would include two separate functions
Now my justification for interventions is clearly far from broadly accepted, and in fact I myself think it's quite half baked. Furthermore, this is an argument from convenience: introducing
*With an imperfect predictor, the situation is more complicated, and it seems to come down to: my theory can license you to conclude that consequences given
In your example in 3.3, I don't think it's true that FDT recommends one-boxing in scenario 2. Your choice of boxes is a product of (among other things) both your genes and of FDT's recommendation (just the former if you don't use FDT). But since "[t]he cases where the simulation is inaccurate are the same as the ones where there isn’t an overlap between your gene and which box you take", it follows that your simulated action is unaffected by FDT's recommendation. So FDT acts as if it has no control over your simulated action, and therefore two-boxes.
I think the weirdness of this example comes from the stipulation that "[t]he cases where the simulation is inaccurate are the same as the ones where there isn’t an overlap between your gene and which box you take". This is actually a pretty strong hypothesis! It makes the scenario very different from how it would be if you just talked about a 99.9% accurate simulator.
A more minor issue: when you write "Imagine that there’s some gene that correlates 99.9% with two-boxing" it is not clear whether you refer to two-boxing in scenario 1 or two-boxing in scenario 2. They can be different, of course. But I think your argument is closest to correct if we assume you are talking about two-boxing in scenario 2.
To give a meta summary of the problems in this essay: the author does not define their terms, runs rampant with them, and is then shocked when they run into contradictory intuitions.
Here FDT’s answer is that you should two-box in the first case but not in the second case.
No, it says to two-box in both cases for exactly that axiom of extensionality ("equivalent prediction principle").
No fact about whether two algorithms are the same
This entire objection is a failure to recognize that 0% and 100% are not probabilities. Rather than saying, "are these the exact same?" which is always impossible to verify to 100% confidence, not just with algorithms, you should ask, "how similar are these policies?" The goal of functional decision theory is to maximize the utility of all agents with similar policies to your own, weighted by exp(-KL(their policy||your policy)). So, for example, your twin that presents exactly the same except on Newcomb's problem has the maximally distant policy to your own when you run into Newcomb.
The calculator objection is made up. You define your terms and then claim they are not defined! You gave us the isomorphism between the calculator's algorithm and the neg-calculator's algorithm. Under the axiom of extensionality, these are the same algorithms. Unless you want to define "algorithm" differently, which involves the minus sign. In which case, they're slightly different algorithms (see the previous paragraph).
Determining how one algorithm being different would affect another algorithm being different, without depending on the epistemic probability of the second being different if the first was.
Is being exponentially accurate in polynomial time good enough? If so, just use annealing to find the trembling-hand equilibria. If not, you're asking to solve P=NP.
The predictor has a failure rate of only 1 in a trillion trillion.
How? A rational decision-maker chooses mixed policies (for that entropy bonus), and even when there is certain, painful death on the table, they will almost certainly choose it more than 1 in a trillion trillion times. Even an irrational actor will simply mess up and walk in the wrong direction more often than 1 in a trillion trillion. If they claim they can predict my randomness ahead of time, they are lying. As a good functional decision theorist, I use a quantum random number generator (e.g. stare at a lightbulb while thinking) to prevent my randomness being hacked like that.
The simulation is highly correlated with you, so his guesses about whether you’ll cut off your leg for no benefit are 99.9% accurate.
The issue with most of these scenarios is you are unclear on what you mean by 99.9% accurate. Is this epistemic or aleatoric uncertainty? If it is epistemic, Newcomb is not powerful enough to change his decision based on your algorithm. Your algorithm should be: pretend to be a one-boxer (or leg-cutter) up until actually put in that scenario. If he gives you tests before putting you in the scenario, well, an EDT or CDT agent would certainly do their best to pass the tests as well for the future utility.
If it is only the aleatoric uncertainty in your policy, we have set up as an axiom that Newcomb knows your policy. Then you object, "why not just change the policy?" But you literally just made a stipulation that you can't! It's exactly the failure most people make with the Grandfather's Paradox. What seems to be possible is physically impossible when you impose axiomatic restrictions.
The goal of functional decision theory is to maximize the utility of all agents with similar policies to your own, weighted by exp(-KL(their policy||your policy)).
Huh, may I have a source on this? I thought you could point FDT at maximizing any utility function you like.
The issue with saying, "this agent," is you do not actually know its policy. The best anyone can do is generate all programs that output the seen distribution of actions, using error-correction codes for nondeterministic policies. Now you have many theories of varying description lengths for the agent, which you weight according to the Solomonoff prior. We can always describe another agent's policy with a fixed KL(their policy||your policy) extra error-correction bits, so the utils under a given theory are
sum_{policy} exp(-|theory| - KL(policy||your policy)) utils(policy)
and the total utils are
sum_{theory} sum_{policy} ... = constant * sum_{policy} exp(-KL(policy||your policy)) utils(policy)
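A rough sketch of that last sum in Python, under some assumptions of mine: policies are distributions over a finite action set, KL is measured in nats, and the sum over theories is left folded into the constant rather than computed. The function names and the example policies are hypothetical.

```python
import math

def kl_divergence(p: dict, q: dict) -> float:
    """KL(p || q) in nats for distributions over a finite action set."""
    total = 0.0
    for action, p_a in p.items():
        if p_a == 0.0:
            continue  # 0 * log(0 / q) is taken to be 0
        q_a = q.get(action, 0.0)
        if q_a == 0.0:
            return math.inf  # p puts mass where q has none
        total += p_a * math.log(p_a / q_a)
    return total

def weighted_utility(your_policy: dict, others: list) -> float:
    """Sum over agents of exp(-KL(their policy || your policy)) * utils."""
    return sum(math.exp(-kl_divergence(policy, your_policy)) * utils
               for policy, utils in others)

you = {"one-box": 0.99, "two-box": 0.01}
copy = {"one-box": 0.99, "two-box": 0.01}
opposite = {"one-box": 0.0, "two-box": 1.0}
print(weighted_utility(you, [(copy, 10.0), (opposite, 10.0)]))
```

An exact copy gets weight 1. An agent that deterministically does the opposite of a deterministic policy has infinite KL and so gets weight 0.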
using error-correction codes for nondeterministic policies
I assume you mean Arithmetic coding.
Why do you need to know the policy in order to figure out the utility function? I thought you could point FDT at, like, maximizing Chaitin's constant. I am hoping to look at whatever reference document you are getting your definitions from, is there no such thing?
There's a lot of snark here but it's all incorrect.
//No, it says to two-box in both cases for exactly that axiom of extensionality ("equivalent prediction principle").//
I explain why this isn't right. Only one algorithm is dependent on yours. The other is just correlated.
//This entire objection is a failure to recognize that 0% and 100% are not probabilities. Rather than saying, "are these the exact same?" which is always impossible to verify to 100% confidence, not just with algorithms, you should ask, "how similar are these policies?"//
This is wrong on a number of counts. First of all, measures of similarities are not the same as probabilities. So this doesn't require any claim about similarity. Note that FDT wants to say that if you're in prisoner's dilemma against a perfect twin, your actions are correlated 100% with what they do (even if you don't have a credence of 1 that that is so). Second, as I explain, measuring similarity is even more difficult.
Re calculator, whether they output the same thing depends on how you interpret their outputs, as explained in the post.
Re being able to determine what's true of one algorithm from the other, that's just looking at correlation which can't be the relevant notion for the reasons I explain.
We imagine an agent who never messes up and accidentally picks the wrong one. By the lights of FDT, you get more expected utility timelessly if you're always disposed not to pay. And it's a stipulation of the thought experiment that the predictor is reliable--doesn't matter if this could exist in the real world.
99.9% accurate in the sense that 99.9% of the time, the predictor guesses right. We additionally can imagine that his decision depends on what you do at the last moment, not just on what you're pretending to do until then.
I acknowledge the snark. I get annoyed when people repeatedly make the mistake of not defining their terms, running rampant with them, being shocked when they get mismatched intuitions, and concluding that the undefined thing is wrong. It's no more logical than, "you're wrong because I feel that way."
I explain why this isn't right. Only one algorithm is dependent on yours. The other is just correlated.
Correlated is doing a lot of equivocating in your intuitions. It's merely correlated not causal, he says! What's the difference? everyone asks. Oh, there is none, they are extensionally identical, but using the word correlated will trick the functional decision theorist into taking a different action.
One man's modus tollens is another man's modus ponens.
You say, "since my intuitions imply the functional decision theorist will take different actions in these extensionally equivalent scenarios, clearly the functional decision theorist can be Dutch booked."
I say, "since the functional decision theorist is rational and cannot be Dutch booked, clearly your intuitions are wrong about what the functional decision theorist will do. Go back and straighten out your definitions."
Second, as I explain, measuring similarity is even more difficult.
I literally told you how to measure similarity: KL(p||q).
Re calculator, whether they output the same thing depends on how you interpret their outputs, as explained in the post.
I didn't think this was your main objection because you told us the isomorphism. However, if the isomorphism is unknown, or there isn't an isomorphism but some other transformation, you can use the mutual information to recover it. Re: mutual information neural estimation.
99.9% accurate in the sense that 99.9% of the time, the predictor guesses right.
And it's a stipulation of the thought experiment that the predictor is reliable--doesn't matter if this could exist in the real world.
Ah yes, the principle of explosion proves every proposition true and false. Just sneak in contradictory axioms (by not defining your terms) and you can prove anything you want!
CDT: I have all the information needed to decide correctly.
EDT: I don't have all the information but need to decide anyway.
FDT: I want to create a scenario where I can justify my decision even when I should know better.
Crosspost.
1 The analytic philosophers vs the rationalists
A lot of analytic philosophers are sympathetic to Rationalism (the social movement, not the alternative to empiricism). I don’t know if I’m senior enough yet to count as a philosopher, but I certainly count myself as among those sympathetic. Yet virtually all of them have the same complaint: Rationalists very often make philosophical errors, especially when it comes to decision theory.
The Rationalist community, for those unaware, is a group devoted to forming beliefs rationally. They disproportionately live in the Bay Area, post on LessWrong, think AI is going to be a big deal, adopt various reductionist philosophical views, etc. I’ve written about my thoughts about the Rationalists here—they’re very smart and interesting; they get a lot right but are sometimes overconfident and wrong about philosophy.
The Rationalist decision theory du jour is called functional decision theory (FDT). Academic decision theorists don’t like the theory. The number of academic decision theorists who adopt it could be counted on one hand by someone missing all of their fingers. (Edit: this used to say by someone missing four of their fingers, as I thought that Ben Levinstein adopted FDT. He doesn’t. So I think the number is literally zero!). My position on the view is as simple as can be: I think the view is definitely wrong. It both is sufficiently underspecified so as to give no real recommendations, and also the recommendations that it supposedly gives are extremely implausible on their face.
I have had debates with about 5 million Rationalists on this subject. Half my time in the Bay Area was spent arguing with people about decision theory. When I sleep, I am haunted by the ghosts of FDTers. If you keep saying some point over and over again, it sometimes makes sense to write it up. I thought I’d do that. But if you want to read more from other people who are better at decision theory than me, and also more sensible and measured, read Will MacAskill’s great piece and also Wolfgang Schwarz’s piece.
2 What the heck is FDT?
(Skip this section if you know what each of the main decision theories is.)
Decision theories tell you how to get what you want. Specifically, they tell you how to reason about cases where different options get you different amounts of what you want (the amounts of what you want are measured in units of utility. This doesn’t have anything to do with utilitarianism the moral theory—it just denotes the amounts of whatever it is that you’re optimizing for).
There are two major decision theories that academic philosophers like. One is called causal decision theory (CDT). I’m trying to be impartial, so I won’t tell you that it’s (probably) the correct view. It says that you should take the action that causes you to have the most utility. Specifically, it says that when taking an action, you should ignore non-causal influences that your actions might have on the state of the world and only do what causes the best thing.
There’s a second view called evidential decision theory (EDT). It says that you should take the action which leaves you with the expectation of having the most utility. So when deciding between acts A and B, ask: how much utility would I expect to have if I take A? What about if I take B? If you’d expect to have more if you took A than B, then you should take A. If you’d expect to have more if you took B, then you should take B.
Functional decision theory is different from either. It says you should think of your action as determining the outcome of your decision algorithm. You should take the act which is such that across time, you expect to get the most utility if your algorithm outputs that act.
So EDT asks: what action leaves me with the expectation that I’ll be richest? CDT asks: what action causes me to be the richest? And FDT asks: what action would my algorithm outputting make me expect to be the richest if it was settled at the start of time?
Here’s a famous case to distinguish the theories. It’s called Newcomb’s problem. It’s the most famous dilemma in decision theory.
There are two boxes, A and B. You have the option of either taking just A or both A and B. B has $1,000. One hour ago, a very accurate predictor guessed whether you would take both boxes or just box A. If he predicted you would just take box A, he put $2,000 in box A. If he predicted you’d take both boxes, he put nothing in box A.
Question: should you take both boxes or just box A?
CDTers say: both boxes. Taking the second box causes you to get an extra $1,000. The fact that it correlates with there being less money in the box is irrelevant. By taking one box, CDTers claim, you’re just passing up an extra $1,000.
EDTers say: just one box. If you take just the first box, you’ll generally end up with $2,000 instead of $1,000. EDTers say: you expect to end up with more money if you take one box, so you should take one box!
FDTers say: it depends on how the predictor predicts what you’ll do. Suppose they run your algorithm or an algorithm very much like yours to predict what you’ll do. Well then, by changing the results of your algorithm, you change their prediction. So then you should one-box. The output of your algorithm, then, determines how much money is in the box—FDT thinks of your decisions as determining the results of your algorithm.
But suppose instead that they make predictions by looking at some other characteristic that merely correlates with one-boxing. E.g. maybe they look at whether you had a professor that two-boxed. In this case, FDT says you should two box. The predictor isn’t running your algorithm, so changing the outcome of the algorithm doesn’t change what is in the box.
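To make the CDT and EDT verdicts above concrete, here is a minimal Python sketch using the dollar amounts from this version of the problem. The accuracy figure is an assumption of mine (the case only says "very accurate"), and the function names are made up for illustration. FDT is left out because, as just described, its verdict depends on how the predictor works and not on these numbers alone.

```python
SMALL = 1_000   # box B always holds this
BIG = 2_000     # box A holds this only if one-boxing was predicted

def edt_value(action: str, accuracy: float) -> float:
    """Expected payoff conditional on the action, given predictor accuracy."""
    if action == "one-box":
        return accuracy * BIG
    return SMALL + (1 - accuracy) * BIG

def cdt_value(action: str, p_big_in_a: float) -> float:
    """Expected payoff holding fixed the probability that box A is already full."""
    base = p_big_in_a * BIG
    return base if action == "one-box" else base + SMALL

accuracy = 0.99  # assumed value, not from the case description
print(edt_value("one-box", accuracy), edt_value("two-box", accuracy))

# Whatever is already in box A, two-boxing comes out exactly SMALL ahead.
for p in (0.0, 0.5, 1.0):
    print(p, cdt_value("two-box", p) - cdt_value("one-box", p))
```

With these stakes, the conditional calculation favours one-boxing only when the accuracy is above 0.75. The causal calculation favours two-boxing at every value of `p_big_in_a`.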
So what’s wrong with FDT? I have two main gripes: what FDT says is wildly underspecified—there’s no remotely plausible way to fill in the details. Also, the few judgments that FDT supposedly gives are often wildly implausible!
3 FDT doesn’t say anything
The biggest problem with FDT is that it is devoid of genuine content.
3.1 Is there a fact about how other functions would be different in the impossible world where mine was?
FDT says that when taking an action, you should consider how the world would be if your decision procedure gave some recommendation. But what does that mean? Specifically, suppose that you are kind of like me but different in a bunch of respects. Maybe you’re my brother. Maybe you’re Claude Opus 4.7 and I’m Claude Opus 4.6. Maybe you’re an almost exact copy of me. Does changing my algorithm change your algorithm? How could we possibly answer this question?
Remember, my decision algorithm is some mathematical function. So we’re asked to imagine in the mathematically impossible world where some math function outputted something different from what it mathematically has to output, whether other mathematical functions would be different. What could this mean? How could there possibly be an answer to this question? How can you have a theory that depends on there being determinate answers to the question: in the logically impossible world where some necessary mathematical fact was different, how would other necessary mathematical facts be different? What?
FDTers often claim that CDT requires considering counterpossibles too, because it instructs you to hold fixed what the world is independent of your choice and then make the decision that maximizes utility with respect to that. Now, even if this is right, it’s a lot sketchier to consider how other algorithms would be different in counterpossible worlds than just considering irrelevant features of generic counterpossibles. But CDT holds fixed only which things causally depend on your act, not the initial conditions. So it never has to consider a situation where, say, the initial conditions determine that you’ll take some act A, yet you take act B. As Wolfgang Schwarz put it:
And note: this isn’t just some minor quibble with what FDT says in a few cases. This is the core mechanic of FDT. This is what FDT needs to generate a single result in a single case! Every case where FDT gives a recommendation, it does so by analyzing the counterfactual where the output of a mathematical function was different. Insofar as there’s no fact of the matter about that, FDT doesn’t give any recommendations in any cases.
Let’s apply this to Newcomb’s problem. Suppose the predictor predicts what I’ll do by running an algorithm. Presumably it won’t be exactly the same algorithm as the one I’m employing. He’s not running an exact mental simulation of me even if his simulation reliably correlates with what I’ll do. Suppose my algorithm will in fact output one-boxing. FDT requires we answer: in the logically impossible world where my algorithm outputted two-boxing, would the predictor’s algorithm output two-boxing? Clearly there’s no fact of the matter about that! So FDT doesn’t even get clear results in Newcomb’s problem! As long as the predictor isn’t running an exact simulation of you, FDT falls silent on the question of what you should do.
3.2 Statistical correlations aren’t enough
Now, there’s an obvious-sounding solution to this problem. Just consider the nearest epistemically possible world where your decision theory outputs some recommendation, and then tabulate the amount of utility you expect to get. So suppose that you learned that your algorithm was disposed to two-box. Then ask: how much money would you expect to get. Compare that to how much you’d expect to get if you learned your algorithm was disposed to one-box. If you’d expect to have more after learning your algorithm one-boxes than two-boxes, then you should one-box.
But this obvious-sounding solution doesn’t work. It makes the theory into updateless EDT.[2] To see this, imagine that some people are born with a gene that correlates heavily with two-boxing. The predictor predicts what I’ll do by looking at whether I have the gene. Two-boxing doesn’t cause the gene or affect whether you have the gene in any way. This solution would recommend one-boxing in this case. If I knew that my algorithm was disposed to one-box, I’d have a high credence in my not having the gene, and in my getting rich. But FDT isn’t supposed to say that!
In fact, this leaves FDT vulnerable to the very smoker’s lesion result that FDTers take to be decisive against EDT. Imagine that smoking doesn’t cause your health to be worse. Instead, smoking correlates with having a lesion on your lung that both makes you likelier to smoke and makes your health worse. It seems rational to smoke, because smoking has no effect on whether you have the lesion on your lung. Yet if your algorithm outputs smoking, that makes you expect that you have the lesion, and so it lowers the expected utility that you get according to this solution.
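To see the structure of the worry with numbers, here is a small Python sketch. Every figure in it is made up for illustration, and the function names are hypothetical. It compares the expected utility you get by conditioning on your algorithm's output (the proposed fix) with the expected utility you get by holding the lesion probability fixed.

```python
# All numbers are hypothetical, chosen only to illustrate the structure.
P_LESION = 0.2                 # prior probability of having the lesion
P_SMOKE_GIVEN_LESION = 0.8     # the lesion makes smoking likelier
P_SMOKE_GIVEN_NO_LESION = 0.2
U_SMOKE = 1.0                  # enjoyment from smoking
U_LESION = -10.0               # health cost of the lesion

def p_lesion_given(smokes: bool) -> float:
    """P(lesion | my algorithm outputs smoke / don't smoke), by Bayes' rule."""
    like_lesion = P_SMOKE_GIVEN_LESION if smokes else 1 - P_SMOKE_GIVEN_LESION
    like_clear = P_SMOKE_GIVEN_NO_LESION if smokes else 1 - P_SMOKE_GIVEN_NO_LESION
    joint_lesion = P_LESION * like_lesion
    joint_clear = (1 - P_LESION) * like_clear
    return joint_lesion / (joint_lesion + joint_clear)

def conditional_value(smokes: bool) -> float:
    """Expected utility after conditioning on the output (the proposed fix)."""
    return (U_SMOKE if smokes else 0.0) + p_lesion_given(smokes) * U_LESION

def causal_value(smokes: bool) -> float:
    """Expected utility holding the lesion probability fixed at its prior."""
    return (U_SMOKE if smokes else 0.0) + P_LESION * U_LESION

for smokes in (True, False):
    print(smokes, conditional_value(smokes), causal_value(smokes))
```

Under these assumed numbers the conditional calculation ranks not smoking higher, while the fixed-lesion calculation ranks smoking higher, which is the divergence described above.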
Now, you could modify the view once again so that you only analyze your expectations concerning other algorithms. This way, you wouldn’t look at how much utility you’d expect to get if your algorithm outputted some action. Or, at the very least, you wouldn’t take the action which, if your algorithm outputs, leaves you with the highest amount of expected utility. Instead, when deciding between two actions A and B, you’d imagine:
That way, you only analyze your algorithm’s probabilistic impact on other algorithms. Whether you have lung cancer is not an algorithm. So you don’t treat your algorithm being different as affecting it in the way relevant to decision making.
But this is of no help. Imagine a modified case where the lesion doesn’t make your health worse. Instead, there’s an algorithm that checks to see if you have the lesion. If you do, then it makes your health worse and also makes you likelier to smoke. Now there’s an algorithm in the mix, so this view is back to thinking (wrongly, and contrary to the spirit of FDT) that you shouldn’t smoke. After all, your algorithm outputting “don’t smoke” makes you expect that the other algorithm output “is less likely to smoke and has better health.”
So now the FDTers are in pretty rough shape. They need to have some account of how your algorithm outputting A would affect other different algorithms. But this can’t just be about your credence in the other algorithm having some outcome, conditional on yours outputting A. FDT depends on analyzing how your action being different (counterpossibly) would make other algorithms different (counterpossibly) without looking at how likely other algorithms would be different in the nearest epistemically possible world where yours is different. How could there possibly be a satisfying solution to this problem?
What it needs is some precise specification of how similar two algorithms are that doesn’t depend on:
But what could it possibly depend on? Isn’t it obvious that there’s no single privileged joint-carving way to decide the similarity of algorithms that doesn’t just look at the statistical correlation between their outputs? Certainly FDTers owe us some account of how this works. It doesn’t do to call it an unsolved problem, when this is the entire engine of the theory, when there’s no plausible story of what a solution would even look like, strong active reason to think there is no such solution, and a solution is needed for the theory to give any result in any case.[3]
Let’s be a bit more concrete. Imagine that I’m in a prisoner’s dilemma against my twin (note that my twin isn’t exactly like me but is similar). I understand having a credence in my twin cooperating conditional on my cooperating. But if we’re not talking about conditional credences, how could there be a uniquely privileged sharp fact about the non-statistical algorithmic correlation between us two?
3.3 No fact about whether two algorithms are the same
Things get even worse. How do we determine if two functions are running the same algorithm? I’m told this is an “unsolved problem” for FDT. There seem to be a lot of those. And remember, you can’t just look at whether they always output the same thing, because FDT distinguishes between mere correlations and paired algorithms. As Will MacAskill put it in his piece:
Now, as Will notes, standard attempts to measure whether two algorithms are the same generally imply that one system may run many different algorithms simultaneously. If the ultimate account has to do with the mapping between inputs and outputs, then changing the output of your algorithm may have bizarre effects on other features of the world. As Will writes:
There’s a related problem. Suppose that there is someone who is psychologically identical to me at all times before Newcomb’s problem. In Newcomb’s problem, they one-box. Should we think of changing the results of my “algorithm” as changing the results of theirs? What could possibly determine this?
There’s a somewhat strange paradox here. Imagine that there’s someone who is psychologically identical to me at all times before the prisoner’s dilemma. I’m in a prisoner’s dilemma against them. They defect. On FDT, I should defect too. But then we’re running the same algorithm. So then I should cooperate. But then we’re running different algorithms, so I should defect.
Now, you might object that the scenario, as I’ve described it, is impossible. If I’m basing my decision on theirs, then we can’t be running the exact same algorithm. Here we should imagine that my decision is not based on theirs. We should then consider the question: which action do I have the most objective reason to take (as opposed to which one is best for me to take given what I know)?
3.4 Conclusion
So let’s recap. FDT needs a solution to each of the following to give almost any judgment in almost any case:
Then, even if we had a solution to both of those, FDT would have the problem:
Absent a solution to the first two, FDT isn’t a theory. It’s a collection of suggestions. In every case that has ever arisen in the history of the species and all the standard thought experiments, it is wildly unclear what FDT says. There are deep reasons to think it doesn’t say anything.
4 Should you light yourself on fire for no benefit?
My answer is “no.” FDT’s answer is “yes.” Here’s the case (from Will MacAskill, though similar examples abound):
FDT says that you should slowly and painfully burn yourself to death. After all, having the disposition to do that makes you better off in expectation timelessly. It makes it so that probably there won’t be the bomb in the box in the first place, and you won’t have to pay $100.
But this just seems irrelevant. The bomb is in the box. I have no uncertainty about what will happen if I choose Left. In cases where you have no uncertainty about how the world is, where one action simply leaves you with less utility, you shouldn’t take that action. The fact that this case is rare doesn’t matter! It’s a crazy recommendation of FDT that it tells you to light yourself on fire when you know that if you do so, you will not benefit at all.
FDTers I’ve talked to sometimes have said this is unfairly rhetorically loaded. “It’s not for no benefit,” they claim. “The benefit comes from you being better off if your decision algorithm is disposed to make it.” But at the time you’re taking the act, you have no uncertainty about how the world is. You know what benefits will come about if you take the act: none. So this phrasing is accurate.
And there are an infinite number of other similar examples. Imagine that everyone in the world is put into a deep slumber. Then, the predictor simulates you and guesses if you’ll, thirty years after waking up, painfully cut off your leg for no benefit. The simulation is highly correlated with you, so his guesses about whether you’ll cut off your leg for no benefit are 99.9% accurate. If he predicts that you’ll slice off your leg, he wakes you up. If he predicts that you won’t, then he doesn’t. Assume that waking up is very good for you.
FDT implies that because being disposed to slice off your leg for no benefit makes you likelier to wake up, you should slice off your leg thirty years later. But that just seems crazy. At the time you’re making decisions, you’re already awake. If you’re already awake, it makes no sense to slice off your leg on grounds that it makes you likelier to be awake. The odds that you’re awake are already 100%.
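To make the two ways of scoring this case concrete, here is a minimal sketch. The 99.9% accuracy comes from the case above; the utilities (`U_AWAKE`, `U_LEG`) and the function names are made up purely for illustration. It scores the disposition before the prediction is made, and then scores the act once you already know you are awake.

```python
# Illustrative numbers only: the case fixes the 99.9% accuracy but no utilities.
ACCURACY = 0.999   # from the case: the predictor's guesses are 99.9% accurate
U_AWAKE = 1000.0   # assumed value of being woken up
U_LEG = -100.0     # assumed cost of cutting off your leg


def policy_value(cuts_leg: bool) -> float:
    """Expected utility of a disposition, scored before the prediction is made."""
    # The predictor wakes you only if he predicts that you will cut.
    p_wake = ACCURACY if cuts_leg else 1.0 - ACCURACY
    payoff_if_awake = U_AWAKE + (U_LEG if cuts_leg else 0.0)
    return p_wake * payoff_if_awake  # never waking up is scored as 0


def act_value_once_awake(cuts_leg: bool) -> float:
    """Utility of the act itself, scored when you already know you are awake."""
    return U_AWAKE + (U_LEG if cuts_leg else 0.0)


if __name__ == "__main__":
    for cuts in (True, False):
        print(cuts, round(policy_value(cuts), 2), act_value_once_awake(cuts))
```

Under these assumed numbers the disposition to cut scores higher, while the act of cutting scores lower once you are awake. The dispute is over which of these two numbers is the one that matters for rational choice.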
Note: decision theories are theories about rationality. They tell you what decisions are wise and sensible to make. They are not theories about the desirable dispositions to have or about how you should program an AI. There are interesting questions about those sorts of things, but they aren’t what decision theory is about. So don’t think to yourself “would I be better off timelessly having the disposition to slice off my leg?” Think “is it rational, at the time I’m making the decision, to cut off my leg for no benefit?” I think the answer is clear: no! I’ll talk more about this distinction in the next section.
There’s a response to this that I’ve heard from a lot of functional decision theorists. Here’s the idea: you don’t really know if you are the algorithm being simulated in Newcomb’s problem or the actual person. For all you know, you might be the simulation, in which case you outputting “one-box” leads to more utility. I find this response very bizarre:
Thus, I think FDT gives incorrect recommendations.
5 Does FDT get more utility?
A claim that FDTers are fond of making is that following FDT gets you the most utility. Take, for example, the version of Newcomb’s problem where the boxes are transparent, so you can peer into both boxes and see how much money is in each. Here, both CDT and EDT recommend taking two boxes. After all, at this point you have no uncertainty about how the world is: taking two boxes leaves you with an extra $1,000. FDTers recommend you take one box, because that timelessly leaves you with more utility.
Thus, FDTers generally leave transparent Newcomb’s problem richer than either EDTers or CDTers. FDT proponents claim that FDT “gets you more utility,” and is thus the right criterion of action. I have four problems with this argument.
First, I don’t think FDT says anything in any case because it’s not a complete theory (see section 3). If FDT says nothing, it can’t get you the most utility.
Second, FDT doesn’t always get you the most utility. For example, consider the following exotic possible world: the actual world. In this one, if you hang around academic philosophers, they will think you’re silly if you adopt FDT. This will make you sad. So adopting FDT gets you less utility. Additionally, in the actual world, I would get less utility if I were an FDTer, because I find it fun to argue with FDTers about decision theory. Or imagine that the government passed a law under which they tortured everyone who thought FDT was the right view. FDTers wouldn’t be better off.
Or imagine the following setup. You’re offered a box full of cash. A predictor predicts if you’d one-box or two-box in Newcomb’s problem. If you one-box, they put nothing in. If you two-box, they put a million dollars in. Now, suddenly, it’s the two-boxers who are rich.
These examples may seem unfair. You directly get rewarded based on things that are downstream of your decision theory. But Newcomb’s problem is also unfair in precisely this way. It ties how much money you get in a box to your judgments in a decision problem.
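Here is a small sketch of that symmetry. It assumes a perfectly accurate predictor and the usual Newcomb amounts ($1,000,000 in the opaque box, $1,000 in the other one); the function names are hypothetical, and the only number taken from the reversed setup above is the million dollars.

```python
# Assumptions: a perfect predictor and the usual Newcomb dollar amounts.
MILLION = 1_000_000
THOUSAND = 1_000


def newcomb_payoff(disposition: str) -> int:
    """Standard Newcomb: the opaque box is filled only for predicted one-boxers."""
    opaque = MILLION if disposition == "one-box" else 0
    bonus = THOUSAND if disposition == "two-box" else 0
    return opaque + bonus


def reversed_payoff(disposition: str) -> int:
    """Reversed setup: the single box is filled only for predicted two-boxers."""
    return MILLION if disposition == "two-box" else 0


if __name__ == "__main__":
    for d in ("one-box", "two-box"):
        print(d, newcomb_payoff(d), reversed_payoff(d))
```

In both games the payoff is a function of your disposition in Newcomb’s problem, not of anything you do causally. Which disposition ends up rich just depends on which game the predictor happens to be running.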
Now, you can get around this by narrowing the claim. You can say something like “FDT gets you the most utility with respect to the utility that’s downstream of your decision algorithm.” But similarly, CDTers can claim “CDT gets you the most utility causally,” and EDTers can claim “EDT gets you the most utility evidentially.” The different theories disagree about what kind of utility is decision-theoretically relevant. So just pounding the table and saying “my theory is best by the lights of the criterion that my theory says is decision-theoretically relevant” is obviously question-begging.
Third, FDT isn’t actually the theory that leaves you with the most expected utility on average. In fact, in many cases, it’s EDT (perhaps updateless) that leaves you with the most expected utility. For example, in the smoker’s lesion case, EDTers tend to finish better off than other people. In smoker’s lesion, smoking correlates with worse health, but it doesn’t cause it. But EDTers are less likely to smoke, so on average they’ll have better health.
Now, FDTers’ reply will presumably be that what matters isn’t just leaving with the most utility on average. Fair enough. But then they can’t appeal to this criterion. They don’t do best by it. Which kind of utility you get the most of can’t straightforwardly tell you which decision theory is right, because the decision theories disagree about which kind of utility matters.
Fourth, this argument begs the question in a different way. Other theories make a distinction between the dispositions that are beneficial to have and the ones that are rational. For example, imagine that a highly reliable predictor checks to see if you’ll give in to blackmail for $100. If so, then he blackmails you. If not, he doesn’t. In this case, non-FDT views grant that it’s timelessly better to not give in to blackmail. They simply think that once you’re being blackmailed, the rational thing to do is to give in. At that time, you’re simply paying $100 to avoid having your life ruined.
Now, FDTers reject such a distinction. But we’ll need some argument against this distinction. Otherwise, this objection simply assumes that there’s no distinction between dispositions that are rational and ones that are beneficial. Non-FDTers have a perfectly sensible reply to this objection: in situations where you are directly rewarded for being irrational—for making some unwise decision—then of course the irrational people will be better off!
And non-FDTers have their own claim that their theory gets you the most utility. In, say, MacAskill’s bomb case, FDTers blow themselves up while CDTers and EDTers don’t. CDTers and EDTers thus leave with more utility when they’re in this situation.
Non-FDTers can grant: something FDTish might describe the kinds of dispositions you timelessly want to have, depending on how the world is. But that’s different from it being the right account of rationality. The dispositions that are beneficial aren’t necessarily the ones that are rational. Decision theories are theories of rationality, not of how to program an AI. If you are only interested in the question of how to program an AI, don’t purport to be giving a decision theory that is superior to the ones philosophers endorse.
6 Conclusion
FDT is both implausible and underbaked. It sometimes licenses setting yourself on fire for no benefit. It depends on analyzing how other algorithms would be different in the logically impossible world where your algorithm was different, but has no account of how to analyze logically impossible worlds, how to analyze what it means for your algorithm to be different, and how to analyze the impact that your algorithm being different has on other algorithms. This isn’t a minor technicality—it means that there is literally no situation where we can derive the correct answer from the theory.
Permit me to go slightly meta for a moment. Ideas like FDT are not unknown to academic philosophers. Various ideas in the vicinity have been proposed. Indeed, a view like FDT—where you one-box in Newcomb’s problem even if the boxes are transparent—is intuitive to a lot of people. But the view is pretty widely rejected because it doesn’t really hold up when you scrutinize it and filling in the details is very difficult. There’s a line by Scott Alexander that I sometimes think of:
The response from academic philosophers has been more in the direction of “write papers explaining their reasoning.” FDTers who think their theory is unfairly neglected by the experts need some explanation of why the academic philosophers who hear of FDT nearly always think it’s wrong.
Lots of laypeople who hear about decision theory adopt something FDTish. So you need some explanation of why nearly all the decision-theory experts (who write monstrously complicated papers with math that would go over your head) think FDT is wrong, while intro philosophy students who know almost nothing about decision theory think that it’s right. In general, you should be skeptical of views that are rejected by ~100% of relevant experts even after those experts have considered them at length.