So... what's the deal with counterfactuals?

Over the past couple of years, I've been writing about the CDT=EDT perspective. I've now organized those posts into a sequence for easy reading.

I call CDT=EDT a "perspective" because it is a way of consistently answering questions about what counterfactuals are and how they work. At times, I've argued strongly that it is the *correct* way. That's basically because:

- it has been the *only* coherent framework I put any stock in (more for lack of other proposals for dealing with logical counterfactuals than for an abundance of bad ones);
- there *are* strong arguments for it, *if* you're willing to make certain assumptions;
- it would be awfully nice to settle this whole question of counterfactual reasoning and move on. CDT=EDT is in a sense the most boring possible answer, IE that all approaches we've thought of are essentially equivalent and there's no hope for anything better.

However, recently I've realized that there's a perspective which unifies *even more* approaches, while being *less boring* (more optimistic about counterfactual reasoning helping us to do well in decision-theoretic problems). It's been right in front of me the whole time, but I was blind to it due to the way I factored the problem of formulating decision theory. It suggests a research direction for making progress in our understanding of counterfactuals; I'll try to indicate some open curiosities of mine by the end.

# Three > Two

The claim I'll be elaborating on in this post is, essentially, that the framework in Jessica Taylor's post about memoryless cartesian environments is better than the CDT=EDT way of thinking. You'll have to read the post to get the full picture if you haven't, but to briefly summarize: if we formalize decision problems in a framework which Jessica Taylor calls "memoryless cartesian environments" (which we can call "memoryless POMDPs" if we want to be closer to academic CS/ML terminology), reasoning about anthropic uncertainty in a certain way (via the self-indication assumption, SIA for short) makes it possible for CDT to behave like UDT.

The result there is sometimes abbreviated as UDT=CDT+SIA, although UDT⊆CDT+SIA is more accurate, because the optimal UDT policies are a subset of the policies which CDT+SIA can follow. This is because UDT has self-coordination power which CDT+SIA lacks. (We could say UDT=CDT+SIA+coordination, but unfortunately "coordination" lacks a snappy three-letter acronym. Or, to be even more pedantic, we could say that UDT1.0 = CDT+SIA, and UDT1.1 = CDT+SIA+coordination. (The difference between 1.0 and 1.1 is, after all, the presence of global policy coordination.)) [EDIT: This isn't correct. See Wei Dai's comment.]

Caspar Oesterheld commented on that post with an analogous EDT+SSA result. SSA (the self-sampling assumption) is one of the main contenders beside SIA for correct anthropic reasoning. Caspar's comment shows that we can think of the correct anthropics as a function of your preference between CDT and EDT. So, we could say that CDT+SIA = EDT+SSA = UDT1.0; or, CDT=EDT=UDT for short. [EDIT: As per Wei Dai's comment, the equation "CDT+SIA = EDT+SSA = UDT1.0" is really not correct due to differing coordination strengths; as he put it, UDT1.0 > EDT+SSA > CDT+SIA.]

My CDT=EDT view came from being pedantic about how decision problems are represented, and noticing that when you're pedantic, it becomes awfully hard to drive a wedge between CDT and EDT; you've got to do things which are strange enough that it becomes questionable whether it's a fair comparison between CDT and EDT. However, I didn't notice the extent to which my "being very careful about the representation" was really *insisting that Bayes nets are the proper representation*.

*(Aside: Bayes nets which are representing decision problems are usually called influence diagrams rather than Bayes nets. I think this convention is silly; why do we need a special term for that?)*

It is rather curious that LIDT also exhibited CDT=EDT-style behavior. That is part of what made me feel like CDT=EDT was a convergent result of many different approaches, rather than noticing its reliance on certain Bayes-net formulations of decision problems. Now, I instead find it curious and remarkable that logical induction seems to think as if the world were made of Bayes nets.

If CDT=EDT comes from insisting that decision problems are represented as Bayes nets, CDT=EDT=UDT is the view which comes from insisting that decision problems be represented as memoryless cartesian environments. At the moment, this just seems like a better way to be pedantic about representation. It unifies three decision theories instead of two.

# Updatelessness Doesn't Factor Out

In fact, I thought about Jessica's framework frequently, but I didn't think of it as an objection to my CDT=EDT way of thinking. I was blind to this objection because I thought (logical-)counterfactual reasoning and (logically-)updateless reasoning could be dealt with as separate problems. The claim was not that CDT=EDT-style decision-making did well, but rather, that any decision problem where it performed poorly could be analyzed as a case where updateless reasoning is needed in order to do well. I let my counterfactual reasoning be simple, blaming all the hard problems on the difficulty of logical updatelessness.

Once I thought to question this view, it seemed very likely wrong. The Dutch Book argument for CDT=EDT seems closer to the true justification for CDT=EDT reasoning than the Bayes-net argument, but the Dutch Book argument is a dynamic consistency argument. I know that CDT and EDT both violate dynamic consistency, in general. So, why pick on one special type of dynamic consistency violation which CDT can illustrate but EDT cannot? In other words, the grounds on which I can argue CDT=EDT seem to point more directly to UDT instead.

# What about all those arguments for CDT=EDT?

## Non-Zero Probability Assumptions

I've noted before that each argument I make for CDT=EDT seems to rely on an assumption that actions have non-zero probability. I leaned heavily on an assumption of epsilon exploration, although one could also argue that all actions must have non-zero probability on different grounds (such as the implausibility of knowing so much about what you are going to do that you can completely rule out any action, before you've made the decision). Focusing on cases where we have to assign probability zero to some action was a big part of finally breaking myself of the CDT=EDT view and moving to the CDT=EDT=UDT view.

(I was almost broken of the view about a year ago by thinking about the XOR blackmail problem, which has features in common with the case I'll consider now; but, it didn't stick, perhaps because the example doesn't actually force actions to have probability zero and so doesn't point so directly to where the arguments break down.)

Consider the transparent Newcomb problem with a perfect predictor:

**Transparent Newcomb.** *Omega runs a perfect simulation of you, in which you face two boxes, a large box and a small box. Both boxes are made of transparent glass. The small box contains $100, while the large one contains $1,000. In the simulation, Omega gives you the option of either taking both boxes or only taking the large box. If Omega predicts that you will take only one box, then Omega puts you in this situation for real. Otherwise, Omega gives the real you the same decision, but with the large box empty. You find yourself in front of two full boxes. Do you take one, or two?*

Apparently, since Omega is a perfect predictor, we are forced to assign probability zero to 2-boxing even if we follow a policy of epsilon-exploring. In fact, if you implement epsilon-exploration by refusing to take any action which you're very confident you'll take (you have a hard-coded response: if **P("I do action X")>1-epsilon**, do anything but **X**), which is how I often like to think about it, then **you are forced to 2-box in transparent Newcomb**. I was *expecting* CDT=EDT type reasoning to 2-box (at which point I'd say "but we can fix that by being updateless"), but this is a *really weird reason* to 2-box.
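The hard-coded response can be written down in a few lines. This is only a minimal sketch: the function name `chicken_rule`, the dictionary of self-beliefs, and the tie-breaking choice of "the first other action" are all assumptions made for illustration, not part of any particular agent design.

```python
def chicken_rule(action_probs, preferred, epsilon=0.01):
    """Return the action actually taken under the hard-coded override.

    action_probs: the agent's own credence in each action it might take.
    preferred: the action the ordinary decision procedure would pick.
    """
    for action, prob in action_probs.items():
        if prob > 1 - epsilon:
            # Too confident about `action`: do anything but that.
            return next(a for a in action_probs if a != action)
    return preferred


# With a perfect predictor, seeing two full boxes implies certainty of 1-boxing.
beliefs = {"one-box": 1.0, "two-box": 0.0}
print(chicken_rule(beliefs, preferred="one-box"))  # two-box
```

The override fires exactly because the observation makes the agent certain of its own action, which is the weird route to 2-boxing described above.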

Still, that's not in itself an argument against CDT=EDT. Maybe the rule that we can't take actions we're overconfident in is at fault. The argument against CDT=EDT style counterfactuals in this problem is that the agent should expect that if it 2-boxes, then it won't ever be in the situation to begin with; at least, not in the *real* world. As discussed somewhat in the happy dance problem, this breaks important properties that you might want out of conditioning on conditionals. (There are some interesting consequences of this, but they'll have to wait for a different post.) More importantly for the CDT=EDT question, this can't follow from evidential conditioning, or learning about consequences of actions through epsilon-exploration, or any other principles in the CDT=EDT cluster. So, there would at least have to be other principles in play.

A very natural way of dealing with the problem is to represent the agent's uncertainty about whether it is in a simulation. If you think you might be in Omega's simulation, observing a full box doesn't imply certainty about your own action anymore, or even about whether the box is really full. This is exactly how you deal with the problem in memoryless cartesian environments. But, if we are willing to do this here, we might as well think about things in the memoryless cartesian framework all over the place. This contradicts the CDT=EDT way of thinking about things in lots of problems where updateless reasoning gives different answers than updateful reasoning, such as counterfactual mugging, rather than only in cases where some action has probability zero.
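To see how the simulation uncertainty plays out numerically, here is a rough sketch that treats the problem as a memoryless environment: one stochastic policy is used both in Omega's simulation and in the real world, because both present the same observation. It assumes (beyond what the problem statement says) that the simulated and real copies sample their actions independently, and that the agent takes both boxes when the large one is empty. The function names are hypothetical.

```python
def expected_payoff(p_one_box):
    """Expected real-world payoff for a memoryless policy.

    p_one_box: probability of 1-boxing upon seeing two full boxes.
    """
    small, large = 100, 1000
    # Simulated copy 1-boxes -> the real agent faces two full boxes.
    payoff_if_full = p_one_box * large + (1 - p_one_box) * (large + small)
    # Simulated copy 2-boxes -> the large box is empty (assumed: take both).
    payoff_if_empty = small
    return p_one_box * payoff_if_full + (1 - p_one_box) * payoff_if_empty


def sia_prob_simulated(p_one_box):
    """SIA-style credence of being the simulated copy, given full boxes.

    The simulated copy always sees full boxes; the real copy sees them
    with probability p_one_box, so the expected counts are 1 and p_one_box.
    """
    return 1 / (1 + p_one_box)


for p in (0.0, 0.5, 1.0):
    print(p, expected_payoff(p), round(sia_prob_simulated(p), 3))
```

Under these assumptions the payoff is increasing in `p_one_box`, so the best memoryless policy 1-boxes, and an agent looking at full boxes never assigns less than 1/2 to being the simulated copy.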

(I should actually say "problems where updateless reasoning gives different answers than *non-anthropic* updateful reasoning", since the whole point here is that updateful reasoning *can* be consistent with updateless reasoning so long as we take anthropics into account in the right way.)

I also note that trying to represent this problem in Bayes nets, while possible, is very awkward and dissatisfying compared to the representation in memoryless cartesian environments. You could say I shouldn't have gotten myself into a position where this felt like significant evidence, but, reliant on Bayes-net thinking as I was, it did.

Ok, so, looking at examples which force actions to have probability zero made me revise my view even for cases where actions all have non-zero probability. So again, it makes sense to ask: but what about the arguments in favor of CDT=EDT?

## Bayes Net Structure Assumptions

The argument in the Bayes-net setting makes some assumptions about the structure of the Bayes net, illustrated earlier. Where do those go wrong?

In the Bayes net setting, observations are represented as parents of the epistemic state (which is a parent of the action). To represent the decision conditional on an observation, we condition on the observation being true. This stops us from putting some probability on our observations being false due to us being in a simulation, as we do in the memoryless cartesian setup.

In other words: the CDT=EDT setup makes it impossible to update on something and still have rational doubt in it, which is what we need to do in order to have an updateful DT act like UDT.

There's likely *some* way to fix this while keeping the Bayes-net formalism. However, memoryless cartesian environments model it naturally.

Question: how can we model memoryless cartesian environments in Bayes nets? Can we do this in a way such that the CDT=EDT theorem applies (making the CDT=EDT way of thinking compatible with the CDT=EDT=UDT way of thinking)?

## CDT Dutch Book

What about the Dutch-book argument for CDT=EDT? I'm not quite sure how this one plays out. I need to think more about the setting in which the Dutch-book can be carried out, especially as it relates to anthropic problems and anthropic Dutch-books.

## Learning Theory

I said that I think the Dutch-book argument gets closer to the real reason CDT=EDT seems compelling than the Bayes-net picture does. Well, although the Dutch-book argument against CDT gives a crisp justification of a CDT=EDT view, I felt the learning-theoretic intuitions which led me to formulate the Dutch book are closer to the real story. It doesn't make sense to ask an agent to have good counterfactuals in any single situation, because the agent may be ignorant about how to reason about the situation. However, any errors in counterfactual reasoning which result in observed consequences predictably differing from counterfactual expectations should eventually be corrected.

I'm still in the dark about how this argument connects to the CDT=EDT=UDT picture, just as with the Dutch-book argument. I'll discuss this more in the next section.

# Static vs Dynamic

A big update in my thinking recently has been to cluster frameworks into "static" and "dynamic", and ask how to translate back and forth between static and dynamic versions of particular ideas. Classical decision theory has a strong tendency to think in terms of statically given decision problems. You could say that the epistemic problem of figuring out what situation you're in is assumed to factor out: decision theory deals only with what to do once you're in a particular situation. On the other hand, learning theory deals with more "dynamic" notions of rationality: rationality-as-improvement-over-time, rather than an absolute notion of perfect performance. (For our purposes, "time" includes logical time; even in a single-shot game, you can learn from relevantly similar games which play out in thought-experiment form.)

This is a messy distinction. Here are a few choice examples:

**Static version:** Dutch-book and money-pump arguments.

**Dynamic version:** Regret bounds.

Dutch-book arguments rely on the idea that you shouldn't *ever* be able to extract money from a rational gambler without a chance of losing it instead. Regret bounds in learning theory offer a more relaxed principle, that you can't ever extract *too much* money (for some notion of "too much" given by the particular regret bound). The more relaxed condition is more broadly applicable; Dutch-book arguments only give us the probabilistic analog of logical consistency properties, whereas regret bounds give us inductive learning.
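As one concrete instance of "can't extract too much", here is a minimal sketch using the standard Hedge (exponential weights) forecaster. This particular learner is chosen purely for illustration and is not something the argument here depends on. For losses in [0, 1], `N` experts and `T` rounds, the learning rate `sqrt(8 ln N / T)` guarantees regret of at most `sqrt(T ln N / 2)` against the best single expert.

```python
import math
import random


def hedge_regret(losses, eta):
    """Run the Hedge forecaster and return its regret vs. the best expert.

    losses: list of rounds, each a list of per-expert losses in [0, 1].
    """
    n_experts = len(losses[0])
    cumulative = [0.0] * n_experts
    learner_total = 0.0
    for round_losses in losses:
        weights = [math.exp(-eta * c) for c in cumulative]
        total = sum(weights)
        # Expected loss when each expert is followed with probability w / total.
        learner_total += sum(w / total * l for w, l in zip(weights, round_losses))
        cumulative = [c + l for c, l in zip(cumulative, round_losses)]
    return learner_total - min(cumulative)


rng = random.Random(0)
T, N = 1000, 5
losses = [[rng.random() for _ in range(N)] for _ in range(T)]
eta = math.sqrt(8 * math.log(N) / T)
bound = math.sqrt(T * math.log(N) / 2)
regret = hedge_regret(losses, eta)
print(regret <= bound)
```

The learner may lose something relative to the best expert in hindsight, but the loss grows only like the square root of the number of rounds, which is the relaxed, "dynamic" analogue of never being Dutch-booked at all.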

**Static:** Probability theory.

**Dynamic:** Logical induction.

In particular, the logical induction criterion gives a notion of regret which implies a large number of nice properties. Typically, the difference between logical induction and classical probability theory is framed as one of logical omniscience vs logical uncertainty. The static-vs-dynamic frame instead sees the critical difference as one of rationality in a static situation (where it makes sense to think about perfect reasoning) vs learning-theoretic rationality (where it doesn't make sense to ask for perfection, and instead, one thinks in terms of regret bounds).

**Static:** Bayes-net decision theory (either CDT or EDT as set up in the CDT=EDT argument).

**Dynamic:** LIDT.

As I mentioned before, the way LIDT seems to naturally reason as if the world were made of Bayes nets now seems like a curious coincidence rather than a convergent consequence of correct counterfactual conditioning. I would like a better explanation of why this happens. Here is my thinking so far:

- Logical induction lacks a way to question its perception. As with the Bayes-net setup used in the CDT=EDT argument, to observe something is to think that thing is true. There is not a natural way for logical induction to reason anthropically, especially for information which comes in through the traders thinking longer. If one of the traders calculates digits of π and bets accordingly, this information is simply known by the logical inductor; how can it entertain the possibility that it's in a simulation and the trader's calculation is being modified by Omega?
- Logical induction knows its own epistemic state to within high accuracy, as is assumed in the Bayes-net CDT=EDT theorem.
- LIDT makes the action a function of the epistemic state alone, as required.

There's a lot of formal work one could do to try to make the connection more rigorous (and look for places where the connection breaks down!).

**Static:** UDT.

**Dynamic:** ???

The problem of logical updatelessness has been a thorn in my side for some time now. UDT is a good reply to a lot of decision-theoretic problems when they're framed in a probability-theoretic setting, but moving to a logically uncertain setting, it's unclear how to apply UDT. UDT requires a fixed prior, whereas logical induction gives us a picture in which logical uncertainty is fundamentally about how to revise beliefs as you think longer.

The main reason the static-vs-dynamic idea has been a big update for me is that I realized that a lot of my thinking has been aimed at turning logical uncertainty into a "static" object, to be able to apply UDT. I haven't even posted about most of those ideas, because they haven't led anywhere interesting. Tsvi's post on thin logical priors is definitely an example, though. I now think this type of approach is likely doomed to failure, because the dynamic perspective is simply superior to the static one.

The interesting question is: how do we translate UDT to a dynamic perspective? How do we learn updateless behavior?

For all its flaws, taking the dynamic perspective on decision theory feels like something asymptotic decision theory got right. I have more to say about what ADT does right and wrong, but perhaps it is too much of an aside for this post.

A general strategy we might take to approach that question is: how do we translate individual things which UDT does right into learning-theoretic desiderata? (This may be more tractable than trying to translate the UDT optimality notion into a learning-theoretic desideratum whole-hog.)

**Static:** Memoryless Cartesian decision theories (CDT+SIA or EDT+SSA).

**Dynamic:** ???

The CDT=EDT=UDT perspective on counterfactuals is that we can approach the question of learning logically updateless behavior by thinking about the learning-theoretic version of anthropic reasoning. How do we learn which observations to take seriously? How do we learn about what to expect supposing we *are* being fooled by a simulation? Some optimistic speculation on that is the subject of the next section.

# We Have the Data

Part of why I was previously very pessimistic about doing any better than the CDT=EDT-style counterfactuals was that we *don't have any data* about counterfactuals, almost by definition. How are we supposed to learn what to counterfactually expect? We only observe the real world.

Consider LIDT playing transparent Newcomb with a perfect predictor. Its belief that it will 1-box in cases where it sees that the large box is full must converge to 100%, because it only ever sees a full box in cases where it does indeed 1-box. Furthermore, the expected utility of 2-boxing can be anything, since it will never see cases where it sees a full box and 2-boxes. This means I can make LIDT 1-box by designing my LI to think 2-boxing upon seeing a full box will be catastrophically bad: I simply include a trader with high initial wealth who bets it will be bad. Similarly, I can make LIDT 2-box whenever it sees the full box by including a trader who bets 2-boxing will be great. Then, the LIDT will never see a full box except on rounds where it is going to epsilon-explore into 1-boxing.

*(The above analysis depends on details of how epsilon exploration is implemented. If it is implemented via the probabilistic chicken-rule, mentioned earlier, making the agent explore whenever it is very confident about which action it takes, then the situation gets pretty weird. Assume that LIDT is epsilon-exploring pseudorandomly instead.)*
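The claim that the agent only ever sees a full box on rounds where it 1-boxes can be checked with a toy simulation. This is a sketch under strong simplifying assumptions: the "agent" is a fixed default action plus pseudorandom exploration (no logical inductor is modelled), exploration means flipping to the other action, and the agent takes both boxes when the large one is empty. The name `play_rounds` is made up for this example.

```python
import random


def play_rounds(default_action, epsilon=0.1, rounds=10_000, seed=0):
    """Transparent Newcomb against a perfect predictor, over many rounds.

    default_action: what the agent does on seeing a full large box when
    it is not exploring. Exploration is fixed per round, so Omega's
    simulation and the real agent agree exactly.
    """
    rng = random.Random(seed)
    other = "two-box" if default_action == "one-box" else "one-box"
    seen = []  # (observation, action) pairs from the real world only
    for _ in range(rounds):
        explore = rng.random() < epsilon
        action_if_full = other if explore else default_action
        if action_if_full == "one-box":
            seen.append(("full", "one-box"))
        else:
            # Omega empties the large box (assumed: the agent takes both).
            seen.append(("empty", "two-box"))
    return seen


history = play_rounds(default_action="two-box")
full_rounds = [action for obs, action in history if obs == "full"]
print(len(full_rounds), set(full_rounds))
```

Even for an agent whose default is to 2-box, every real-world record with a full box shows 1-boxing, so the empirical frequency of 1-boxing given a full box is 100%.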

LIDT's confidence that it 1-boxes whenever it sees a full box is jarring, because I've just shown that I can make it either 1-box or 2-box depending on the underlying LI. Intuitively, an LIDT agent who 2-boxes upon seeing the full box should not be near-100% confident that it 1-boxes.

The problem is that the cases where LIDT sees a full box and 2-boxes are all counterfactual, since Omega is a perfect predictor and doesn't show us a full box unless we in fact 1-box. LIDT doesn't learn from counterfactual cases; the version of the agent in Omega's head is shut down when Omega is done with it, and never reports its observations back to the main unit.

(The LI *does* correctly learn the *mathematical fact* that its algorithm 2-boxes when input observations of a full box, but, this does not help it to have the intuitively correct expectations when Omega feeds it false sense-data.)

In the terminology of The Happy Dance Problem, LIDT isn't learning the right observation-counterfactuals: the predictions about what action it takes given different possible observations. However, *we have the data:* the agent *could* simulate itself under alternative epistemic conditions, and train its observation-counterfactuals on what action it in fact takes in those conditions.

Similarly, the action-counterfactuals are wrong: LIDT can believe anything about what happens when it 2-boxes upon seeing a full box. Again, *we have the data:* LI can observe that on rounds when it is mathematically true that the LIDT agent would have 2-boxed upon seeing a full box, it doesn't get the chance. This knowledge simply isn't being "plugged in" to the decision procedure in the right way. Generally speaking, an agent can observe the real consequences of counterfactual actions, because (1) the counterfactual action is a mathematical fact of what the agent does under a counterfactual observation, and (2) the important effects of this counterfactual action occur in the real world, which we can observe directly.

This observation makes me much more optimistic about learning interesting counterfactuals. Previously, it seemed like *by definition* there would be no data from which to learn the correct counterfactuals, other than the (EDTish) requirement that they should match the actual world for actions actually taken. Now, it seems like I have not one, but *two* sources of data: the observation-counterfactuals can be simulated outright, and the action-counterfactuals can be trained on what actually happens when counterfactual actions are taken.
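The two data sources can at least be exhibited side by side in a toy setting. The sketch below is not a counterfactual-learning algorithm; it only shows where each stream of data would come from. The `agent` function, its `exploring` flag, and the payoffs for the empty-box case are hypothetical choices for illustration.

```python
from collections import defaultdict


def agent(observation, exploring):
    """Hypothetical agent: 2-boxes by default, 1-boxes on a full box when exploring."""
    if observation == "full" and exploring:
        return "one-box"
    return "two-box"


def real_payoff(action_if_full):
    """Real-world payoff against a perfect predictor.

    action_if_full: the mathematical fact of what the agent does this
    round when shown a full large box (it may never actually see one).
    """
    if action_if_full == "one-box":
        return 1000  # Omega shows full boxes; the agent takes the large one
    return 100       # Omega empties the large box (assumed: agent takes both)


# Source 1: observation-counterfactuals, by running the agent on each observation.
observation_counterfactuals = {
    (obs, exploring): agent(obs, exploring)
    for obs in ("full", "empty")
    for exploring in (False, True)
}

# Source 2: action-counterfactuals, by averaging real payoffs grouped by
# what the agent would have done on a full box in that round.
payoffs = defaultdict(list)
for exploring in (False, False, True, False, True):
    would_do = agent("full", exploring)
    payoffs[would_do].append(real_payoff(would_do))
action_counterfactuals = {a: sum(v) / len(v) for a, v in payoffs.items()}

print(observation_counterfactuals[("full", False)], action_counterfactuals)
```

Here the first table correctly records that the non-exploring agent 2-boxes on a full box, and the second records that rounds with that disposition really do pay worse, even though the agent never sees a full box on those rounds.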

I haven't been able to plug these pieces together to get a working counterfactual-learning algorithm yet. It might be that I'm still missing a component. But ... it *really* feels like there should be something here.