Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Predictors exist: CDT going bonkers... forever


[note: this is bugging me more than it should. I really don't get why this is worth so much repetition of examples that don't show anything new.]

I'll admit I'm one of those who doesn't see CDT as hopeless. It takes a LOT of hypothetical setup to show cases where it fails, and neither Newcomb nor this seems to be as much about decision theory as about free will.

Part of this is my failing. I keep thinking CDT is "classical decision theory", and it means "make the best conditional predictions you can, and then maximize your expected value". This is very robust, but describes all serious decision theories. The actual discussion is about "causal decision theory", and there are plenty of failure cases, where the agent has a flawed model of causality.

But for some reason, we can't just say "incorrect causal models make bad predictions" and move on. We keep bringing up really contrived cases where a naive agent, which we label CDT, makes bad conditional predictions, and it's not clear why they're so stupid as to not notice. I don't know ANYONE who claims an agent should make and act on incorrect predictions.

For your Newcomb-like example (and really, any Omega causality violation), I assert that a CDT agent could notice outcomes and apply Bayes' theorem to the chance that they can trick Omega just as well as any other DT. Assuming that Omega is cheating, changing the result after my choice, is sufficient to get the right answer.

Cases of mind-reading and the like are similarly susceptible to better causality models - recognizing that the causality is due to the agent's intent, not their actions, makes CDT recognize that to the extent it can control the intent, it should.

Your summary includes "the CDT agent can never learn this", and that seems the crux. To me, not learning something means that _EITHER_ the CDT agent is a strawman that we shouldn't spend so much time on, _OR_ this is something that cannot be true, and it's probably good if agents can't learn it. If you tell me that a Euclidean agent knows pi and can accurately make wagers on the circumference of a circle knowing only its diameter, but it's flawed because a magic being puts it on a curved surface and it never re-considers that belief, I'm going to shrug and say "okay... but here in flatland that doesn't happen". It doesn't matter how many thought experiments you come up with to show counterfactual cases where C/D is different for a circle, you're completely talking past my objection that Euclidean decision theory is simple and workable for actual use.

To summarize my confusion, does CDT require that the agent unconditionally believe in perfect free will independent of history (and, ironically, with no causality for the exercise of will)? If so, that should be the main topic of dispute - the frequency of actual cases where it makes bad predictions, not that it makes bad decisions in ludicrously-unlikely-and-perhaps-impossible situations.

> To summarize my confusion, does CDT require that the agent unconditionally believe in perfect free will independent of history (and, ironically, with no causality for the exercise of will)? If so, that should be the main topic of dispute - the frequency of actual cases where it makes bad predictions, not that it makes bad decisions in ludicrously-unlikely-and-perhaps-impossible situations.

Sorta, yes. CDT requires that you choose actions not by thinking "conditional on my doing A, what happens?" but rather by some other method (there are different variants) such as "For each causal graph that I think could represent the world, what happens when I intervene (in Pearl's sense) on the node that is my action, to set it to A?" or "Holding fixed the probability of all variables not causally downstream of my action, what happens if I do A?"

In the first version, notice that you are choosing actions by imagining a Pearl-style intervention into the world--but this is not something that actually happens; the world doesn't actually contain such interventions.

In the second version, well, notice that you are choosing actions by imagining possible scenarios that aren't actually possible--or at least, you are assigning the wrong probabilities to them. ("holding fixed the probability of all variables not causally downstream of my action...")

So one way to interpret CDT is that it believes in crazy stuff like hardcore incompatibilist free will. But the more charitable way to interpret it is that it doesn't believe in that stuff, it just acts as if it does, because it thinks that's the rational way to act. (And they have plenty of arguments for why CDT is the rational way to act, e.g. the intuition pump "If the box is already either full or empty and you can't change that no matter what you do, then no matter what you do you'll get more money by two-boxing, so...")
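To make the contrast between "conditional on my doing A" and "holding fixed everything not downstream of A" concrete, here is a minimal sketch on Newcomb's problem. The payoffs (1,000,000 and 1,000) are the conventional ones and are an assumption here, as are the function names `edt_value` and `cdt_value`; this is an illustration of the two calculations, not a formalisation of either theory.

```python
# Conventional Newcomb payoffs, assumed for illustration only.
BIG = 1_000_000   # opaque box, filled only if one-boxing was predicted
SMALL = 1_000     # transparent box, always present


def payoff(action: str, box_full: bool) -> int:
    """Money received for an action given the state of the opaque box."""
    total = BIG if box_full else 0
    if action == "two-box":
        total += SMALL
    return total


def edt_value(action: str, accuracy: float) -> float:
    """Conditional expectation: P(box full | action) depends on the action."""
    p_full = accuracy if action == "one-box" else 1.0 - accuracy
    return p_full * payoff(action, True) + (1.0 - p_full) * payoff(action, False)


def cdt_value(action: str, p_full: float) -> float:
    """Interventional expectation: P(box full) is held fixed across actions."""
    return p_full * payoff(action, True) + (1.0 - p_full) * payoff(action, False)
```

Whatever fixed `p_full` is plugged in, `cdt_value("two-box", p_full)` exceeds `cdt_value("one-box", p_full)` by exactly `SMALL`, which is the intuition pump in code form; `edt_value` ranks the actions the other way once `accuracy` is high.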

Not believing it, but thinking it's rational to act that way, seems even worse than believing it in the first place.

I -totally- understand the arguing-against-the-premise response to such things. It's coherent and understandable to say "CDT is good enough, because these examples can't actually happen, or are so rare that I'll pay that cost in order to have a simpler model for the other 99.9999% of my decisions". I'd enjoy talking to someone who says "I accept that I'll get the worse result, but it's the right thing to do because ... ". I can't ITT the ending to this sentence.

> these examples can't actually happen, or are so rare that I'll pay that cost in order to have a simpler model for the other 99.9999% of my decisions

Indeed, if it were true that Newcomb-like situations (or more generally, situations where other agents condition their behavior on predictions of your behavior) do not occur with any appreciable frequency, there would be much less interest in creating a decision theory that addresses such situations.

But far from constituting a mere 0.0001% of possible situations (or some other, similarly minuscule percentage), Newcomb-like situations are simply the norm! Even in everyday human life, we frequently encounter other people and base our decisions off what we expect them to do—indeed, the ability to model others and act based on those models is integral to functioning as part of any social group or community. And it should be noted that humans do *not* behave as causal decision theory predicts they ought to—we do not betray each other in one-shot prisoner’s dilemmas, we pay people we hire (sometimes) well in advance of them completing their job, etc.

This is not mere “irrationality”; otherwise, there would have been no reason for us to develop these kinds of pro-social instincts in the first place. The observation that CDT is inadequate is fundamentally a combination of (a) the fact that it does not accurately predict certain decisions we make, and (b) the claim that the decisions we make are in some sense *correct* rather than incorrect—and if CDT disagrees, then so much the worse for CDT. (Specifically, the sense in which our decisions are correct—and CDT is not—is that our decisions result in more expected utility in the long run.)

All it takes for CDT to fail is the presence of predictors. These predictors don’t have to be Omega-style superintelligences—even moderately accurate predictors who perform significantly (but not ridiculously) above random chance can create Newcomb-like elements with which CDT is incapable of coping. I really don’t see any justification at all for the idea that these situations somehow constitute a superminority of possible situations, or (worse yet) that they somehow “cannot” happen. Such a claim seems to be missing the forest for the trees: you don’t need perfect predictors to have these problems show up; the problems show up anyway. The only purpose of using Omega-style perfect predictors is to make our thought experiments clearer (by making things more extreme), but they are by no means *necessary*.
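As a rough illustration of how little accuracy is needed, the sketch below computes the break-even predictor accuracy in a Newcomb-style setup. The payoffs and the use of an action-conditional expectation as the yardstick are assumptions made for this example, and the helper names are hypothetical.

```python
def break_even_accuracy(big: float, small: float) -> float:
    """Accuracy above which conditioning on one's action favours one-boxing.

    One-boxing is worth a * big; two-boxing is worth (1 - a) * big + small.
    Setting the two equal gives a = (big + small) / (2 * big).
    """
    return (big + small) / (2.0 * big)


def conditional_gain(accuracy: float, big: float, small: float) -> float:
    """Expected advantage of one-boxing over two-boxing at a given accuracy."""
    return accuracy * big - ((1.0 - accuracy) * big + small)
```

With the conventional payoffs, `break_even_accuracy(1_000_000, 1_000)` is 0.5005, so under these assumptions a predictor only slightly better than a coin flip is already enough for the two calculations to come apart.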

Which summarizes my confusion. If CDT is this clearly broken, why is it so discussed (and apparently defended, though I don't actually know any defenders)?

Dagon, I sympathize. CDT seems bonkers to me for the reasons you have pointed out. My guess is that academic philosophy has many people who support CDT for three main reasons, listed in increasing order of importance:

(1) Even within academic philosophy, many people aren't super familiar with these arguments. They read about CDT vs. EDT, they read about a few puzzle cases, and they form an opinion and then move on--after all, there are lots of topics to specialize in, even in decision theory, and so if this debate doesn't grip you you might not dig too deeply.

(2) Lots of people have pretty strong intuitions that CDT vindicates. E.g. iirc Newcomb's Problem was originally invented to prove that EDT was silly (because, silly EDT, it would one-box, which is obviously stupid!). My introductory textbook to decision theory was an attempt to build for CDT an elegant mathematical foundation to rival the Jeffrey-Bolker axioms for EDT. And why do this? It said, basically, "EDT gives the wrong answer in Newcomb's Problem and other problems, so we need to find a way to make some version of CDT mathematically respectable."

(3) EDT has lots of problems too. Even hardcore LWer fans of EDT like Caspar Oesterheld admit as much, and even waver back and forth between EDT and CDT for this reason. And the various alternatives to EDT and CDT that have been thus far proposed also seem to have problems.

> My introductory textbook to decision theory was an attempt to build for CDT an elegant mathematical foundation to rival the Jeffrey-Bolker axioms for EDT. And why do this? It said, basically, “EDT gives the wrong answer in Newcomb’s Problem and other problems, so we need to find a way to make some version of CDT mathematically respectable.”

Joyce's Foundations of Causal Decision Theory, right? That was the book I bought to learn decision theory too. My focus was on anthropic reasoning instead of Newcomb's problem at the time, so I just uncritically accepted the book's contention that two-boxing is the rational thing to do. As a result, while trying to formulate my own decision theory, I had to come up with complicated ways to force it to two-box. It was only after reading Eliezer's posts about Newcomb's problem that I realized that if one-boxing is actually the right thing to do, the decision theory could be made much more elegant. (Too bad it turns out to still have a number of problems that we don't know how to solve.)

But considering that randomness as an antidote to perfect predictions is ubiquitously available in this universe, it's hard to see what practical implications these CDT failures in highly contrived thought experiments have.

You may like this, then: https://www.lesswrong.com/posts/9m2fzjNSJmd3yxxKG/acdt-a-hack-y-acausal-decision-theory

I have guessed that by CDT you mean https://en.wikipedia.org/wiki/Causal_decision_theory

But why make people guess?

Protip: define and/or provide links for opaque terms upon first use.

Even if most people on LW are probably familiar with the abbreviation, someone may come here following a link from elsewhere.

Well said.

I had a similar idea a while ago and am working it up into a paper ("CDT Agents are Exploitable"). Caspar Oesterheld and Vince Conitzer are also doing something like this. And then there is Ahmed's *Betting on the Past* case.

In their version, the Predictor offers bets to the agent, at least one of which the agent will accept (for the reasons you outline) and thus they get money-pumped. In my version, there is no Predictor, but instead there are several very similar CDT agents, and a clever human bookie can extract money from them by exploiting their inability to coordinate.

Long story short, I would bet that an actual AGI which was otherwise smarter than me but which doggedly persisted in doing its best to approximate CDT would fail spectacularly one way or another, "hacked" by some clever bookie somewhere (possibly in its hypothesis space only!). Unfortunately, arguably the same is true for all decision theories I've seen so far, but for different reasons...

>Caspar Oesterheld and Vince Conitzer are also doing something like this

That paper can be found at https://users.cs.duke.edu/~ocaspar/CDTMoneyPump.pdf . And yes, it is structurally essentially the same as the problem in the post.

Cool!

I notice that you assumed there were no independent randomising devices available. But why would the CDT agent ever opt to use a randomising device? Why would it see that as having value?

Apologies, I only saw your comment just now! Yes, I agree, CDT never strictly prefers randomizing. So there are agents who abide by CDT and never randomize. As our scenarios show, these agents are exploitable. However, there could also be CDT agents who, when indifferent between some set of actions (and when randomization is not associated with any cost), *do* randomize (and choose the probability according to some additional theory -- for example, you could have the decision procedure: "follow CDT, but when indifferent between multiple actions, choose a distribution over these actions that is ratifiable"). The updated version of our paper -- which has now been published Open Access in *The Philosophical Quarterly* -- actually contains some extra discussion of this in Section IV.1, starting with the paragraph "Nonetheless, what happens if we grant the buyer in Adversarial Offer access to a randomisation device...".

Decision theories map world models into actions. If you ever make a claim like "This decision-theory agent can never learn X and is therefore flawed", you're either misphrasing something or you're wrong. The capacity to learn a good world-model is outside the scope of what decision theory is[1]. In this case, I think you're wrong.

> For example, suppose the CDT agent estimates the prediction will be "zero" with probability p, and "one" with probability 1−p. Then if p≥1/2, they can say "one", and have a probability p≥1/2 of winning, in their own view. If p<1/2, they can say "zero", and have a subjective probability 1−p>1/2 of winning.

This is not what a CDT agent would do. Here is what a CDT agent would do:

1. The CDT agent makes an initial estimate that the prediction will be "zero" with probability 0.9 and "one" with probability 0.1.

2. The CDT agent considers making the decision to say "one" but notices that Omega's prediction aligns with its actions.

3. Given that the CDT agent was just considering saying "one", the agent updates its initial estimate by reversing it. It declares "I planned on guessing one before but the last time I planned that, the predictor also guessed one. Therefore I will reverse and consider guessing zero."

4. Given that the CDT agent was just considering saying "zero", the agent updates its initial estimate by reversing it. It declares "I planned on guessing zero before but the last time I planned that, the predictor also guessed zero. Therefore I will reverse and consider guessing one."

5. The CDT agent realizes that, given the predictor's capabilities, its own prediction will be undefined.

6. The CDT agent walks away, not wanting to waste the computational power.
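The steps above can be sketched as a loop that looks for a plan which survives its own revision. This models only the procedure described in this comment, under its assumption that the agent treats its current plan as evidence about the prediction; the function names are hypothetical, and nothing here is claimed to be standard CDT.

```python
from typing import List, Optional, Tuple

OPPOSITE = {"zero": "one", "one": "zero"}


def plan_against(predicted: str) -> str:
    """The agent wins only by saying the opposite of the prediction."""
    return OPPOSITE[predicted]


def revise_prediction(plan: str) -> str:
    """Assumption from the steps above: the predictor foresees the current plan."""
    return plan


def deliberate(p_zero: float, max_steps: int = 10) -> Tuple[Optional[str], List[str]]:
    """Return (stable plan or None, list of plans considered in order)."""
    predicted = "zero" if p_zero >= 0.5 else "one"
    trace: List[str] = []
    for _ in range(max_steps):
        plan = plan_against(predicted)
        trace.append(plan)
        revised = revise_prediction(plan)
        if revised == predicted:
            return plan, trace  # revising changed nothing: a fixed point
        predicted = revised
    return None, trace  # no fixed point found within the step budget
```

Starting from `p_zero = 0.9`, the trace alternates "one", "zero", "one", ... and the first return value is `None`, which corresponds to step 5: the agent's own estimate never settles.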

The longer and longer the predictor is accurate for, the higher and higher the CDT agent's prior becomes that its own thought process is causally affecting the estimate[2]. Since the CDT agent is embedded, it's impossible for the CDT agent to reason outside its thought process and there's no use in it nonsensically refusing to leave the game.

Furthermore, any good decision-theorist knows that you should never go up against a Sicilian when death is on the line[3].

[1] This is not to say that world-modeling isn't relevant to evaluating a decision theory. But in this case, we should be fully discussing things that may/may not happen in the actual world we're in and picking the most appropriate decision theory for this one. Isolated thought experiments do not serve this purpose.

[2] Note that, in cases where this isn't true, the predictor should get worse over time. The predictor is trying to model the CDT agent's predictions (which depend on how the CDT agent's actions affect its thought-process) without accounting for the way the CDT agent is changing as it makes decisions. As a result, a persevering CDT agent will ultimately beat the predictor here and gain infinite utility by playing the game forever.

[3] The Battle of Wits from The Princess Bride is isomorphic to the problem in this post.

Since when does CDT include backtracking on noticing other people's predictive inconsistency? And, I'm not sure that any such explicitly iterative algorithm would be stable.

> The CDT agent considers making the decision to say “one” but notices that Omega’s prediction aligns with its actions.

This is the key. You're not playing CDT here, you're playing "human-style hacky decision theory." CDT cannot notice that Omega's prediction aligns with its hypothetical decision because Omega's prediction is causally "before" CDT's decision, so any causal decision graph cannot condition on it. This is why post-TDT decision theories are also called "acausal."

[Comment edited for clarity]

> Since when does CDT include backtracking on noticing other people's predictive inconsistency?

I agree that CDT does not include backtracking on noticing other people's predictive inconsistency. My assumption is that decision theories (including CDT) take a world-map and output an action. I'm claiming that this post is conflating an error in constructing an accurate world-map with an error in the decision theory.

> CDT cannot notice that Omega's prediction aligns with its hypothetical decision because Omega's prediction is causally "before" CDT's decision, so any causal decision graph cannot condition on it. This is why post-TDT decision theories are also called "acausal."

Here is a more explicit version of what I'm talking about. CDT makes a decision to act based on the expected value of its action. To produce such an action, we need to estimate an expected value. In the original post, there are two parts to this:

Part 1 (Building a World Model):

- I believe that the predictor modeled my reasoning process and has made a prediction based on that model. This prediction happens before I actually instantiate my reasoning process
- I believe this model to be accurate/quasi-accurate
- I start unaware of what my causal reasoning process is so I have no idea what the predictor will do. In any case, the causal reasoning process must continue because I'm thinking.
- As I think, I get more information about my causal reasoning process. Because I know that the predictor is modeling my reasoning process, this lets me update my prediction of the predictor's prediction.
- Because the above step was part of my causal reasoning process and information about my causal reasoning process affects my model of the predictor's model of me, I must update on the above step as well
- [The Dubious Step] Because I am modeling myself as CDT, I will make a statement intended to invert the predictor. Because I believe the predictor is modeling me, this requires me to invert *myself*. That is to say, every update my causal reasoning process makes to my probabilities inverts the previous update.
- Note that this only works if I believe my reasoning process (but not necessarily the ultimate action) gives me information about the predictor's prediction.
- The above leads to infinite regress

Part 2 (CDT)

- Ask the world model what the odds are that the predictor said "one" or "zero"
- Find the one with higher likelihood and invert it

I believe Part 1 fails and that this isn't the fault of CDT. For instance, imagine the above problem with zero stakes such that decision theory is irrelevant. If you ask any agent to give the inverse of its probabilities that Omega will say "one" or "zero" with the added information that Omega will perfectly predict those inverses and align with them, that agent won't be able to give you probabilities. Hence, the failure occurs in building a world model rather than in implementing a decision theory.

-------------------------------- Original version

> Since when does CDT include backtracking on noticing other people's predictive inconsistency?

Ever since the process of updating a causal model of the world based on new information was considered an epistemic question *outside the scope of decision theory*.

To see how this is true, imagine the exact same situation as described in the post with *zero* stakes. Then ask any agent with any decision theory about the inverse of the prediction it expects the predictor to make. The answer will always be "I don't know", independent of decision theory. Ask that same agent if it can assign probabilities to the answers and it will say "I don't know; every time I try to come up with one, the answer reverses."

All I'm trying to do is compute the probability that the predictor will guess "one" or "zero" and *failing*. The output of failing here isn't "well, I guess I'll default to fifty-fifty so I should pick at random"[1], it's NaN.

Here's a causal explanation:

- I believe the predictor modeled my reasoning process and has made a prediction based on that model.
- I believe this model to be accurate/quasi-accurate
- I start unaware of what my causal reasoning process is so I have no idea what the predictor will do. But my prediction of the predictor *depends* on my causal reasoning process.
- Because my causal reasoning process is contingent on my prediction and my prediction is contingent on my causal reasoning process, I end up in an infinite loop where my causal reasoning process cannot converge on an actual answer. Every time it tries, it just *keeps* updating.
- I quit the game because my prediction is *incomputable*.

> I'm claiming that this post is conflating an error in constructing an accurate world-map with an error in the decision theory.

The problem is not that CDT has an inaccurate world-map; the problem is that CDT has an accurate world map, and then breaks it. CDT would work much better with an inaccurate world-map, one in which its decision causally affects the prediction.

See this post for how you can hack that: https://www.lesswrong.com/posts/9m2fzjNSJmd3yxxKG/acdt-a-hack-y-acausal-decision-theory

Having done some research, it turns out the thing I was actually pointing to was ratifiability and the stance that any reasonable separation of world-modeling and decision-selection should put ratifiability in the former rather than the latter. This specific claim isn't new. From "Regret and Instability in Causal Decision Theory":

> Second, while I agree that deliberative equilibrium is central to rational decision making, I disagree with Arntzenius that CDT needs to be amended in any way to make it appropriately deliberational. In cases like Murder Lesion a deliberational perspective is forced on us by what CDT says. It says this: A rational agent should base her decisions on her best information about the outcomes her acts are likely to causally promote, and she should ignore information about what her acts merely indicate. In other words, as I have argued, the theory asks agents to conform to Full Information, which requires them to reason themselves into a state of equilibrium before they act. The deliberational perspective is thus already a part of CDT

However, it's clear to me now that you were discussing an older, more conventional, version of CDT[1] which does not have that property. With respect to that version, the thought-experiment goes through but, with respect to the version I believe to be sensible, it doesn't[2].

[1] I'm actually kind of surprised that the conventional version of CDT is *that* dumb -- and I had to check a bunch of papers to verify that this was actually happening. Maybe if my memory had complied at the time, it would've flagged your distinguishing between CDT and EDT here from past LessWrong articles I've read like CDT=EDT. But this wasn't meant to be so I didn't notice you were talking about something different.

[2] I am now confident it does not apply to the thing I'm referring to -- the linked paper brings up "Death in Damascus" specifically as a place where ratifiable CDT does not fail.

Can you clarify what you mean by "successfully formalised"? I'm not sure if I can answer that question but I can say the following:

Stanford's encyclopedia has a discussion of ratifiability dating back to the 1960s and (by the 1980s) it has been applied to both EDT and CDT (which I'd expect, given that constraints on having an accurate world model should be independent of decision theory). This gives me confidence that it's not just a random Less Wrong thing.

Abram Demski from MIRI has a whole sequence on when CDT=EDT which leverages ratifiability as a sub-assumption. This gives me confidence that ratifiability is actually onto something (the Less Wrong stamp of approval is important!)

Whether any of this means that it's been "successfully formalised", I can't really say. From the outside-view POV, I literally did not know about the conventional version of CDT until yesterday. Thus, I do not really view myself as someone *currently* capable of verifying the extent to which a decision theory has been successfully formalised. Still, I consider this version of CDT old enough historically and well-enough-discussed on Less Wrong by Known Smart People that I have high confidence in it.

I like this formulation. Personally, I've felt that Newcomb's problem is a bit overly complex and counter-intuitive. Arguably Newcomb's problem with transparent boxes would be the same as regular Newcomb's problem, for instance.

Andrew Critch once mentioned a similar problem around rock-paper-scissors and Bayes. The situation was, "Imagine you are playing a game of rock-paper-scissors against an omega who can near-perfectly predict your actions. What should your estimate be for the winning decisions?" The idea was that a Bayesian would have to admit to having a 33.33...% + delta chance of winning, and would then expect to win 33.33...% + delta of the time, but they would predictably win ~0 times, so this showcases a flaw in Bayes. However, it was claimed that Logical Induction would handle this.
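A small simulation can show the size of the gap between the naive one-in-three estimate and what actually happens. Everything specific here is an assumption for illustration: the player picks uniformly at random, the predictor guesses the player's move with probability `accuracy` and otherwise plays at random, and `win_rate` is a hypothetical helper name.

```python
import random

MOVES = ("rock", "paper", "scissors")
# BEATS[m] is the move that m defeats.
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}
# COUNTER[m] is the move that defeats m.
COUNTER = {loser: winner for winner, loser in BEATS.items()}


def win_rate(accuracy: float, rounds: int, seed: int = 0) -> float:
    """Fraction of rounds the player wins against a predictor of given accuracy."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(rounds):
        player = rng.choice(MOVES)
        if rng.random() < accuracy:
            opponent = COUNTER[player]   # predictor read the move and counters it
        else:
            opponent = rng.choice(MOVES)  # predictor missed and plays at random
        if BEATS[player] == opponent:
            wins += 1
    return wins / rounds
```

Under these assumptions the long-run win rate is (1 − accuracy)/3, so at `accuracy = 0.99` the player wins well under 1% of rounds while a naive estimate that ignores the predictor still says roughly one in three.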

Another game that came to mind from your post is Three-card Monte with a dealer who chose randomly but was really good at reading minds.

I definitely would acknowledge this as a nasty flaw in a Bayesian analysis, but could easily imagine that it's a flaw in the naive use of Bayesian analysis, rather than the ideal.

I was a bit curious about the possibility of imagining what reflective Bayes would look like. Something like,

In the case of rock-paper-scissors, the agent knows that

It could condition on this, making a much longer claim,

One obvious issue that comes up is that the justifications for Bayes lie in axioms of probability that clearly are not effectively holding up here. I'd assume that the probability space of some outcomes is not at all a proper measure, as the sum doesn't equal 1.

Miller's principle: p(x|p(x) = y) = y

> It could condition on this, making a much longer claim, p(B | (p(B|ω)=0.3333+δ), ω, (p(B | (p(B|ω)=0.3333+δ), ω) = 0+γ))

This equation didn't have a final = and right side.

> Omega will predict their action, and compare this to their actual action. If the two match...

For a perfect predictor the above simplifies to "lose 1 utility", of course. Are you saying that your interpretation of EDT would fight the hypothetical and refuse to admit that perfect predictors can be imagined?

CDT would fight the hypothetical, and refuse to admit that perfect predictors *of their own actions* exist (the CDT agent is perfectly fine with perfect predictors of other people's actions).

That... doesn't seem like a self-consistent decision theory at all. I wonder if any CDT proponents agree with your characterization.

I'm using CDT as it's formally stated (in, eg, the FDT paper).

The best defence I can imagine from a CDT proponent: CDT is decision theory, not game theory. Anything involving predictors is game theory, so doesn't count.

I've been wanting to get a better example of CDT (causal decision theory) misbehaving, where the behaviour is more clearly suboptimal than it is in the Newcomb problem (which many people don't seem to accept as CDT being suboptimal), and simpler to grasp than Death in Damascus.

## The "predictors exist" problem

So consider this simple example: the player is playing against Omega, who will predict their actions[1]. The player can take three actions: "zero", "one", or "leave".

If ever they do "leave", then the experiment is over and they leave. If they choose "zero" or "one", then Omega will predict their action, and compare this to their actual action. If the two match, then the player loses 1 utility and the game repeats; if the action and the prediction differs, then the player gains 3 utility and the experiment ends.

Assume that actually Omega is a perfect or quasi-perfect predictor, with a good model of the player. An FDT or EDT agent would soon realise that they couldn't trick Omega, after a few tries, and would quickly end the game.

But the CDT player would be incapable of reaching this reasoning. Whatever distribution they compute over Omega's prediction, they will always estimate that they (the CDT player) have at least a 50% chance of choosing the other option[2], for an expected utility gain of at least 0.5(3)+0.5(−1)=1.

Basically, the CDT agent can never learn that Omega is a good predictor of themselves[3]. And so they will continue playing, and continue losing... for ever.

[1] Omega will make this prediction not necessarily before the player takes their action, not even necessarily without seeing this action, but still makes the prediction independently of this knowledge. And that's enough for CDT.

[2] For example, suppose the CDT agent estimates the prediction will be "zero" with probability p, and "one" with probability 1−p. Then if p≥1/2, they can say "one", and have a probability p≥1/2 of winning, in their own view. If p<1/2, they can say "zero", and have a subjective probability 1−p>1/2 of winning.

[3] The CDT agent has no problem believing that Omega is a perfect predictor of other agents, however.
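The argument can be sketched in a few lines of Python. The payoffs (+3 for a mismatch, −1 for a match, 0 for leaving) come from the setup above; the names `cdt_choice` and `play`, and the choice to model Omega as a perfect predictor, are assumptions made for this illustration.

```python
from typing import Iterable, Tuple

WIN = 3.0    # utility when the action differs from the prediction
LOSS = -1.0  # utility when the action matches the prediction


def cdt_choice(p_zero: float) -> Tuple[str, float]:
    """Return (action, subjective expected utility) for a CDT player.

    p_zero is the player's credence that Omega predicted "zero"; it is
    treated as fixed no matter which action is being evaluated.
    """
    expected = {
        "zero": (1.0 - p_zero) * WIN + p_zero * LOSS,  # wins if prediction is "one"
        "one": p_zero * WIN + (1.0 - p_zero) * LOSS,   # wins if prediction is "zero"
        "leave": 0.0,
    }
    action = max(expected, key=expected.get)
    return action, expected[action]


def play(credences: Iterable[float]) -> float:
    """Total utility against a perfect predictor, one credence per round."""
    total = 0.0
    for p_zero in credences:
        action, _ = cdt_choice(p_zero)
        if action == "leave":
            break
        total += LOSS  # perfect predictor: the prediction always matches
    return total
```

The better of "zero" and "one" is worth max(4p − 1, 3 − 4p), which is at least 1 for every p in [0, 1], so "leave" is never chosen whatever credences are supplied; `play([0.0, 0.2, 0.5, 0.9, 1.0])` returns −5.0, one lost unit per round.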