(Crossposted from MidWittgenstein, as part of my lame resolution to write more this year.)
This is a post about what I see as a bit of a confusion in some decision theory arguments. It touches on FDT criticisms of CDT/EDT, so I wanted to post it here in case I've misunderstood something about FDT and can get some pushback from people more familiar with it. Edited slightly, but the tone/style is probably not very LessWrong-y, so apologies.
I think there’s a lot of cross-talk and confusion around how different decision theories approach a class of decision problems, specifically in the criticisms FDT proponents have of more established theories. I want to briefly go through why I think these disagreements come from some muddy abstractions (namely, treating the agents in a decision problem as one homogeneous/continuous agent rather than as multiple distinct agents at different timesteps). I think spelling this out explicitly shows why FDT is a bit confused in its criticisms of more “standard” recommendations, and moreover why the evidence in favour of it (outperforming on certain problems) ends up being a kind of circular argument.
I’m going to mostly focus on FDT vs EDT here, because I think the misunderstanding FDT has is pretty much the same when compared to either EDT or CDT, and I personally like EDT more. The rough TL;DR of what I want to say is:
Part 1: The First Part
It’s an overwhelmingly natural intuition to think of future and past (and counterfactual) versions of you as “you” in a very fundamental sense. It’s a whole other can of worms as to why, but it feels pretty hard-coded into us that the person who e.g. is standing on the stage tomorrow in Newcomb’s problem is the exact same agent as you. This is a very useful abstraction that works so well partly because other versions of you are so well aligned with you, and so similar to you. But from a bottom-up perspective they’re distinct agents - they tend to work pretty well together, but there’s no a priori reason why they should judge the same choices in the same way, or even have the same preferences/utility functions. There are toy models where they clearly don’t, e.g. if at each timestep you only ever care about immediate reward.
It’s a distinction that’s usually so inconsequential as to be pretty ignorable (unless you’re trying to rationalise your procrastination), but a lot of the thornier decision problems that split people into camps are - I think - largely deriving their forcefulness from this distinction being swept under the rug. Loosely speaking, the problems work by driving a wedge between what’s good for some agents in the set-up and what’s good for others; by applying this fuzzy abstraction we lose track of this, and see what looks like a single agent doing sub-optimally for themselves, rather than misaligned agents not perfectly cooperating.
To spell it out, let’s consider two flavours of a fun thought experiment:
Single-agent Counterfactual coin-toss
You’re offered a fair coin toss on which you would win $2 on Heads if and only if it’s predicted (by a virtually omniscient predictor etc.) that you would pay $1 on Tails. The prediction is only about your conditional behaviour, not about the coin itself. You agree to the game and the coin lands Tails - what do you do?
I think this thought experiment is one of the cleanest separators of the intuitions of various decision theories. CDT obviously refuses to pay, reasoning that not paying causes them to keep their $1. EDT agrees with the answer but reasons differently: conditioned on all the information they have - including the fact that the coin is already definitely Tails - their expected value is greater if they don’t pay, so they don’t. FDT - as I understand it - pays, however, since the ex-ante EV of your decision algorithm is higher if it pays on Tails (it expects to win $0.50 on average, whereas EDT and CDT both know they’ll get nothing on Heads because they know they won’t pay on Tails). This ability to “cooperate” with counterfactual versions of you is a large part of why people like FDT, while the fact that you feel like you’re “locally” giving up free money on certain branches - when you know that’s the branch you’re on - feels equally weird to others. I think the key to understanding what’s going on here is that the abstraction mentioned above - treating all of these branches as containing the same agent - is muddying the water.
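To make the numbers explicit, here's a minimal sketch (function names are mine; the payoffs are just the ones from the setup above) comparing the ex-ante EV of the two policies with the value seen by the agent already standing on the Tails branch:

```python
def ex_ante_ev(pays_on_tails: bool) -> float:
    """Ex-ante EV of a policy, assuming a perfect predictor:
    the $2 Heads prize is awarded iff the policy would pay on Tails."""
    heads_payoff = 2.0 if pays_on_tails else 0.0
    tails_payoff = -1.0 if pays_on_tails else 0.0
    return 0.5 * heads_payoff + 0.5 * tails_payoff

def ex_post_tails_value(pays_on_tails: bool) -> float:
    """Value for the agent who has already seen Tails land."""
    return -1.0 if pays_on_tails else 0.0

print(ex_ante_ev(True))           # 0.5  -> the paying policy wins ex ante
print(ex_ante_ev(False))          # 0.0
print(ex_post_tails_value(True))  # -1.0 -> but loses on the Tails branch
print(ex_post_tails_value(False)) # 0.0
```

The whole disagreement is visible in which of these two functions you think the agent on Tails should be maximising.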
Consider the much less interesting version of this problem:
Multi-agent Counterfactual coin-toss
Alice gets offered the coin toss, and Bob will get paid $2 on Heads iff it’s predicted that Clare will pay $1 on Tails. Assume also that:
Alice cares equally about Bob and Clare, whereas Bob and Clare care only about themselves
What do the various agents think in this case? Alice thinks that this is a pretty good deal and wants Clare to pay on Tails, since it raises the EV across the set of people she cares about. Bob obviously thinks the deal is fantastic and that Clare should pay. Clare, however, understandably feels a bit screwed over and not inclined to play ball. If she cared about Bob’s ex-ante expected value from the coin-toss (i.e. she had the same preferences as Alice), she would pay, but she doesn’t, so she doesn’t.
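The disagreement falls straight out of the arithmetic. A minimal sketch (names and payoffs as in the setup; the function names are mine): Alice evaluates the whole lottery over Bob and Clare, while Clare evaluates only her own branch:

```python
def alice_ev(clare_pays: bool) -> float:
    """Alice weighs Bob's Heads payoff and Clare's Tails payoff equally.
    Bob is paid iff Clare is predicted to pay."""
    bob_heads = 2.0 if clare_pays else 0.0
    clare_tails = -1.0 if clare_pays else 0.0
    return 0.5 * bob_heads + 0.5 * clare_tails

def clare_value(clare_pays: bool) -> float:
    """Clare only ever acts on the Tails branch and only values her own money."""
    return -1.0 if clare_pays else 0.0

print(alice_ev(True), alice_ev(False))        # 0.5 0.0  -> Alice wants Clare to pay
print(clare_value(True), clare_value(False))  # -1.0 0.0 -> Clare prefers not to
```

Neither of them is making an EV mistake; they're just maximising over different sets of people.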
A key point I want to make is that we can think of the single-agent coin-toss involving just “you” as actually being the same as a multi-agent coin-toss, just with Alice, Bob, and Clare being more similar and related in a bunch of causal ways. If me-seeing-Tails is a different agent to me-seeing-Heads is a different agent to me-before-the-toss, then it’s not necessarily irrational for “you” to not pay on Tails, or for “you” to not pay the Driver in Parfit’s Hitchhiker, because thinking of these agents as the same “you” that thought it was a great deal ex-ante and wanted to commit is just a convenient abstraction that breaks down here.
One of the main ways they’re different agents, which is also the way that’s most relevant to this problem, is that they plausibly care about different things. IF (and I’ll come back to this “if” in a sec) what rational agents in these problems are trying to do is something like “I want the expected value over future possible versions of this specific instance of me to be maximised”, then at different stages in the experiment different things are being maximised over, since the set of possible future people are not identical for each agent. For example, in the coin-toss case the set for me-seeing-Tails contains only people who saw tails, whereas for me-before-the-toss it doesn’t. If both “me”s are trying to maximise EV over these different sets, it’s not surprising that they disagree on what’s the best choice, any more than it’s surprising that Clare and Alice disagree above.
And I think an EDT proponent says the above - i.e. “maximise the EV of future possible mes” - is what rational agents are doing, and so we should accept that rational agents will decline to pay the driver in Parfit’s Hitchhiker, and not pay on Tails above etc. But crucially, through the above lens this isn’t a failure of rationality as much as a sad consequence of having imperfectly aligned agents pulling in different directions. Moreover, EDT proponents will still say things like “you should try to find a way to pre-commit to paying the driver” or “you should alter yourself in such a way that you have to pay on Tails”, because those are rational things for the ex-ante agent to do given what they are trying to maximise. I think some FDT proponents see this as an advantage of their theory - “look at the hoops this idiot has to jump through to arrive at the decision we can just see is rational”. But this is misguided, since properly viewed these aren’t weird hacks to make a single agent less predictably stupid, but rather a natural way in which agents would try to coordinate with other, misaligned agents.
Part 2: FDT response and circularity
Of course this isn’t the only way we can posit what rational agents try to maximise - we could claim they’re thinking something like “What maximises the ex-ante EV of agents running the same decision algorithm as me in this problem?”, in which case me-seeing-Tails should indeed pay, since doing so makes their decision algorithm ex-ante more profitable in expectation. This is, as I understand it, the crux between classic decision theories like EDT and CDT on the one hand and things like FDT on the other. The disagreement is really encapsulated in FDT’s rejection of the “Sure Thing” principle. FDT says that it’s rational for you to forgo a “sure thing” (walking away with your free $1 on Tails), because if your decision algorithm forgoes, then it makes more money ex-ante in expectation. In other words, in this specific situation you (i.e. the specific distinct agent who just flipped Tails and is now eyeing the door) might be unfortunately losing money, but on average FDT agents who take this bet are walking around richer for it!
I don’t think EDT actually disagrees with any FDT assessment here, it just disagrees that this is the correct framing of what a rational actor is trying to maximise. If what a rational agent should do is maximise the ex-ante EV of its decision algorithm in this problem, then FDT recommendations are right - but why is this what they should be maximising?
The fact that EDT is internally consistent by its own lights obviously isn't novel to FDT proponents. But I think an FDT proponent here says “Well ok, EDT has an internally consistent principle here too, but the FDT principle is better because the agents do better overall in expectation. Look at all those FDT agents walking around with $2! None of you dummies have $2”. But then this is clearly circular. They do better in expectation according to the ex-ante agent, but the whole point of carving the problem into more “fine-grained” agents is that the ex-ante agent isn’t the only agent through whose lens we can evaluate a given problem. In other words, we can’t justify choosing a principle which privileges a specific agent in the problem (in this case, the ex-ante agent) by appealing to how much better the principle does for that agent. It’s no better than the EDT agent insisting EDT is better because, for any given agent conditioning on all their evidence, it maximises their EV and FDT doesn’t.
So really the question that first needs to be answered before you can give a verdict on what is rational to do on Tails is “Given a decision problem with multiple distinct agents involved, how should they decide what to maximise over?” If the answer is they should maximise the EV of “downstream” future agents, they’ll end up with EDT decisions, and be misaligned with other agents. And if the answer is they should be maximising over ex-ante EV of agents running their decision algorithm, they’ll all be aligned and end up with FDT decisions.
But the difference in performance of these decisions can’t be used to answer the question, because the evaluation of the performance depends on which way you answer the question.
To be fair to FDT proponents, this line of reasoning is just as circular when used by an EDT agent. I bring it up as a failing of FDT proponents here, though, because I see them levelling the above kind of performance-related arguments against EDT, whereas my read of EDT criticisms of FDT is more like “Huh? Shouldn’t you figure out that whole counterpossible thing before you say your theory is even coherent, let alone better?”
Part 3: Is there an objective answer then?
So if this way of deciding between decision theories is circular, how do we decide which one to use? Is there some other way to fill out this “free parameter” of what’s rational to be maximising over? I’m not sure. We can rely on our intuitions somewhat - if both theories can coherently perform better by their own metrics, we can look at which metric feels less “natural” to use. For most people this will probably be EDT-like verdicts, given how overwhelmingly intuitive things like the Sure Thing principle are. This seems pretty weak though - intuitions are incredibly slippery in the cases where these theories come apart, and I think you can think your way into finding either intuitive.
My hunch is instead that some of the machinery of thinking about decision theory just doesn’t survive this level of zooming in, i.e. removing the abstraction that collapses multiple agents into one. It’s equipped to adjudicate decisions given an agent with defined goals/preferences, but it just doesn’t seem to have an answer for “what exactly should these multiple agents all be caring about?” It seems almost more in the realm of game theory - but even there the players have well-defined goals, whereas here we’re essentially arguing over whether the players in a game should a priori have aligned goals. It just seems like a non-starter. Which is a very anticlimactic way to end a post, but there you go.
In Parts 2 and 3 it seems like you independently discovered the lack of performance metrics for decision theories.
Ah I hadn't realised Caspar wrote that, thanks for the link! I agree that seems to be getting at the same idea, and it's kind of separable from the multi-agent point
I've recently been leaning more in the direction of there being multiple coherent approaches to the Counterfactual Mugging problem, but I was still feeling quite confused by it. I really think that the multi-agent perspective adds a lot of clarity.
Cousin_it and I discovered the Counterfactual Prisoner's Dilemma, which is similar to the Counterfactual Mugging, but presents a case where the kind of agent that pays attention to its counterfactual selves does better in each arm. I think it presents a limitation to thinking of yourself as completely disconnected from agents in other branches, but I suppose we can still model the situation as independent agents with some kind of subjunctive link.
You may find it interesting that I've also argued in favour of viewing counterfactuals as circular here and that I even ran a competition on this last year.
The main opposition appears to be between updatelessness and expected utility. Actions that make up updateless behavior result from a single decision, and can lack coherence. This is basically what the thought experiments about counterfactuals are demonstrating: actions that are individually clearly at odds with the agent's purported utility function, and only become correct as elements of a larger policy. But it gets worse, for expected utility is itself a product of the assumption of coherence between decisions! When an agent is acting updatelessly, it only makes a single decision to choose that single updateless policy, and there is no other decision for this one to be coherent with, so no reason at all for preference to have the form of expected utility. When the updateless point of view is internalized deeply enough, it starts to seem like the whole idea of expected utility has no foundation behind it. And that's fair; for updateless decisions it doesn't.
But some kind of coherence between actions and resulting preference that is consistent in some sense (even if it's not necessarily expected utility specifically) intuitively seems like a natural principle in its own right, putting the idea of updatelessness into question. The incoherent actions of an updateless agent are bound by a single updateless decision jointly directing all of them, but coherent actions of an updateful agent are bound by some different kind of shared identity. If updateful actions shouldn't be attributed to a single agent, there is still motivation to introduce an entity that encompasses them. This entity is then an origin of consistent preference, but this entity is not the scope of optimization by a single updateless decision determining all actions.
Didn't read the whole thing, don't think anything so far is multi-agent related, but
You’re offered a fair coin toss on which you’ll win $2 on Heads if and only if it’s predicted (by a virtually omniscient predictor etc.) that you will pay $1 on Tails. You agree to the game and the coin lands Tails - what do you do?
You don't have a choice in the matter, you will pay $1 since the predictor predicted it. It does not matter what you think you will decide, or what you think you will do. What actually happens is that you will find yourself paying $1, it is an inevitability. There is nothing else to discuss, and decision theories do not enter into it. Your mental processes may depend on the decision theories you entertain, but your actions do not.
Having read this comment, I can now see an ambiguity in the language. I read it as "You’re offered a fair coin toss on which (you’ll win $2 on Heads if and only if it’s predicted that you will pay $1 on Tails)". That is, you're offered the coin flip regardless of the prediction, but can only win money on heads if it's predicted that you would have paid on tails.
you’ll win $2 on Heads if and only if it’s predicted that you will pay $1 on Tails
that is my reading as well... Still means you will pay $1 as predicted if the outcome is tails, regardless of your internal decision theory.
Why? Being predicted to not pay on tails is perfectly consistent with seeing a flip of tails (and not paying).
As I see it, the game proceeds as follows: You flip a coin. If it comes up tails, you are asked whether or not you want to pay $1. If it comes up heads, the predictor estimates whether you would have paid up on a result of tails: you get $2 if they predict that you would, otherwise you get nothing.
You know these rules, and that the predictor is essentially perfect for all practical purposes.
Hmm, I guess I misunderstood the setup, oops. I assumed that only those who are predicted to pay $1 on tails would be offered the game. Apparently... something else is going on? The game is offered first, and then the predictor makes the prediction?
Yes, that's why I bracketed my interpretation as I did: in my reading, the only clause to which the prediction result applies is "you’ll win $2 on Heads".
I'm probably misunderstanding you, or I've worded things in a confusing way that I haven't noticed - I don't think it's implied anywhere what you do on Tails? The "iff" here is just saying you would be paid on Heads iff you would pay on Tails - the flip will happen regardless, and the predictor hasn't made any prediction about the coin itself, just your conditional behaviour.
Edit: Maybe the "iff you will pay $1 on Tails" makes it sound like the predictor is predicting both the coin and your response, I'll edit to make more clear