Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Open Philanthropy's Joe Carlsmith and Nick Beckstead had a short conversation about decision theory a few weeks ago with MIRI's Abram Demski and Scott Garrabrant (and me) and LW's Ben Pace. I'm copying it here because I thought others might find it useful.

Terminology notes:

  • CDT is causal decision theory, the dominant theory among working decision theorists. CDT says to choose the action with the best causal consequences.
  • EDT is evidential decision theory, CDT's traditional rival. EDT says to choose the action such that things go best conditional on your choosing that action.
  • TDT is timeless decision theory, a theory proposed by Eliezer Yudkowsky in 2010. TDT was superseded by FDT/UDT because TDT fails on dilemmas like counterfactual mugging, refusing to pay the mugger.
  • UDT is updateless decision theory, a theory proposed by Wei Dai in 2009. UDT in effect asks what action "you would have pre-committed to without the benefit of any observations you have made about the universe", and chooses that action.
  • FDT is functional decision theory, an umbrella term introduced by Yudkowsky and Nate Soares in 2017 to refer to UDT-ish approaches to decision theory.
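
As a rough illustration of the CDT and EDT definitions above, here is a minimal Python sketch on Newcomb's problem. It assumes the usual textbook payoffs ($1,000,000 in the opaque box if one-boxing was predicted, $1,000 in the transparent box) and a predictor with a fixed `accuracy`. All function names are hypothetical, and this is the naive treatment that parts of the conversation below push back on, not anyone's endorsed formalization.

```python
# Assumed textbook payoffs; none of these numbers come from the conversation.
BIG, SMALL = 1_000_000, 1_000
ACTIONS = ("one-box", "two-box")


def payoff(action: str, predicted_one_box: bool) -> int:
    """Money received, given the action and what the predictor foresaw."""
    opaque = BIG if predicted_one_box else 0
    return opaque if action == "one-box" else opaque + SMALL


def edt_value(action: str, accuracy: float) -> float:
    """EDT: condition on the action, so the action is evidence about the prediction."""
    p_predicted_one_box = accuracy if action == "one-box" else 1 - accuracy
    return (p_predicted_one_box * payoff(action, True)
            + (1 - p_predicted_one_box) * payoff(action, False))


def cdt_value(action: str, p_predicted_one_box: float) -> float:
    """CDT: the prediction is already fixed, so its probability ignores the action."""
    return (p_predicted_one_box * payoff(action, True)
            + (1 - p_predicted_one_box) * payoff(action, False))


def best(value_fn, param: float) -> str:
    """Action with the highest value under the given evaluation rule."""
    return max(ACTIONS, key=lambda a: value_fn(a, param))


if __name__ == "__main__":
    print("EDT, accuracy 0.99:", best(edt_value, 0.99))
    print("CDT, credence 0.99:", best(cdt_value, 0.99))
```

Under these assumptions, the naive EDT calculation favors one-boxing once the predictor's accuracy exceeds about 0.5005, while the CDT calculation favors two-boxing by exactly $1,000 for any fixed credence about the prediction.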

Carlsmith:  Anyone have an example of a case where FDT and updateless EDT give different verdicts?

Beckstead:  Is smoking lesion an example?

I haven't thought about how updateless EDT handles that differently from EDT.

Demski:  FDT is supposed to be an overarching framework for decision theories "in the MIRI style", whereas updateless EDT is a specific decision theory.

In particular, FDT may or may not be updateless.

Updateful FDT is basically TDT.

Now, I generally claim it's harder to find examples where EDT differs from causal counterfactuals than people realize; eg, EDT and CDT do the same thing on smoking lesion. So be aware that you're not going to get the "standard view" from me.

However, TDT gets some problems wrong which UDT gets right, eg, counterfactual mugging.

Updateless FDT would not get this wrong, though; it appears to be all about the updatelessness.

To get EDT-type and CDT-type DTs to really differ, we have to go to Troll Bridge. EDT fails, FDT variants will often succeed.

Garrabrant:  I feel like it is not hard to come up with examples where updateful EDT and CDT differ (XOR blackmail), and for the updateless question, I think the field is small enough that whatever Abram says is the “standard view.”
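
For readers unfamiliar with XOR blackmail, the following is a minimal sketch of the dilemma as it is usually stated, which is an assumption here since the conversation does not spell it out: a letter arrives if and only if exactly one of "your house has termites" and "you pay on receiving the letter" is true. The damage figure, the demand, the prior `P_TERMITES`, and all function names are illustrative choices.

```python
# Illustrative stakes and prior; all three numbers are assumptions.
DAMAGE, DEMAND = 1_000_000, 1_000
P_TERMITES = 0.01


def worlds(pays_on_letter: bool):
    """Yield (probability, letter_sent, loss) for each termite state under a policy."""
    for termites in (True, False):
        prob = P_TERMITES if termites else 1 - P_TERMITES
        letter_sent = termites != pays_on_letter  # the XOR condition
        paid = letter_sent and pays_on_letter
        yield prob, letter_sent, (DAMAGE if termites else 0) + (DEMAND if paid else 0)


def edt_loss_given_letter(pays: bool) -> float:
    """Expected loss conditional on a letter arriving and on the chosen action."""
    rows = [(p, loss) for p, letter, loss in worlds(pays) if letter]
    return sum(p * loss for p, loss in rows) / sum(p for p, _ in rows)


def cdt_loss_given_letter(pays: bool, p_termites: float) -> float:
    """Expected loss when the termite probability is held fixed across actions."""
    return p_termites * DAMAGE + (DEMAND if pays else 0)


def policy_loss(pays_on_letter: bool) -> float:
    """Prior expected loss of committing to a policy before any letter arrives."""
    return sum(p * loss for p, _, loss in worlds(pays_on_letter))
```

In this sketch, conditioning on the letter makes paying look far better (the evidential calculation), whereas holding the termite probability fixed makes paying a pure $1,000 loss (the causal calculation). Comparing the two policies from the prior also favors refusing.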

I think that to get EDT and CDT to differ updatelessly, you need to cheat and have the agent have some weird non-Bayesian epistemics (Bayesians get the tickle defense), so it is hard to construct formal examples.

Unfortunately, all agents have weird non-Bayesian epistemics, so that doesn’t mean we get to just skip the question.

My “standard view” position is that EDT is obviously philosophically correct the way that Bayesianism is obviously philosophically correct; CDT is an uglier thing that gets the same answer in ideal conditions; but then embeddedness gives you non-ideal conditions everywhere, and CDT is closer to being structured in a way that can handle getting the right answer in spite of having weird epistemics.

My non-standard-view answer is that hopefully the successors to the Cartesian Frame ontology / Factored Set ontology will make this question go away.

Bensinger:  Terminology/history side-note: Abram's right that "FDT is supposed to be an overarching framework for decision theories 'in the MIRI style'", but I don't think it's meant to be so overarching as to include TDT. I think the original intended meaning was basically 'FDT = UDT-ish approaches to decision theory'.

From the comments on Let’s Discuss Functional Decision Theory:

My model is that 'FDT' is used in the paper instead of 'UDT' because:

  • The name 'UDT' seemed less likely to catch on.
  • The term 'UDT' (and 'modifier+UDT') had come to refer to a bunch of very different things over the years. 'UDT 1.1' is a lot less ambiguous, since people are less likely to think that you're talking about an umbrella category encompassing all the 'modifier+UDT' terms; but it's a bit of a mouthful.
  • I've heard someone describe 'UDT' as "FDT + a theory of anthropics" -- i.e., it builds in the core idea of what we're calling "FDT" ("choose by imagining that your (fixed) decision function takes on different logical outputs"), plus a view to the effect that decisions+probutilities are what matter, and subjective expectations don't make sense. Having a name for the FDT part of the view seems useful for evaluating the subclaims separately.

The FDT paper introduces the FDT/UDT concept in more CDT-ish terms (for ease of exposition), so I think some people have also started using 'FDT' to mean something like 'variants of UDT that are more CDT-ish', which is confusing given that FDT was originally meant to refer to the superset/family of UDT-ish views. Maybe that suggests that researchers feel more of a need for new narrow terms to fill gaps, since it's less often necessary in the trenches to crisply refer to the superset.

[...]

Nate says: "The main datapoint that Rob left out: one reason we don't call it UDT (or cite Wei Dai much) is that Wei Dai doesn't endorse FDT's focus on causal-graph-style counterpossible reasoning; IIRC he's holding out for an approach to counterpossible reasoning that falls out of evidential-style conditioning on a logically uncertain distribution. (FWIW I tried to make the formalization we chose in the paper general enough to technically include that possibility, though Wei and I disagree here and that's definitely not where the paper put its emphasis. I don't want to put words in Wei Dai's mouth, but IIRC, this is also a reason Wei Dai declined to be listed as a co-author.)"

Footnote: a philosopher might say 'FDT is an overarching framework for approaches to decision theory in the MIRI style'; and they might be happy calling FDT "a decision theory", in the same sense that 'CDT' and 'EDT' are deemed decision theories even though they've been interpreted and operationalized in dozens of different ways by philosophers.

(The FDT paper calls FDT a 'decision theory' rather than 'a family of decision theories' because it's written for mainstream philosophers.)

As a matter of terminology, I think MIRI-cluster people are more likely to (e.g.) see 10 distinct decision algorithms and group them in ~8 distinct 'decision theories' where a philosopher might group them into ~2 distinct 'decision theories'. 🤷‍♀️

Carlsmith:  Thanks for the comments, all. My hazy understanding had been something like: updateful CDT and updateful EDT are both focused on evaluating actions, but CDT evaluates them using counterfactuals/do-operators or some such, whereas EDT evaluates them using conditionals.

The difference that updatelessness makes is that you instead evaluate overall policies (mappings from inputs to outputs) relative to some prior, and act on that even after you’ve “learned more.” The CDT version of this, I thought, would do something like counterfactual/do-operator type reasoning about what sort of policy to have — and this sounded a lot like FDT to me, so I’ve been basically rounding FDT off to “updateless CDT.” The EDT version, I imagined, would do something like conditional reasoning about what sort of policy to have. Thus, the whiteboard diagram below.

On this framework, I’m a bit confused by the idea that FDT is a neutral over-arching term for MIRI-style decision theory, which can be updateless or not. For example, my impression from the paper was that FDT was supposed to be updateless in the sense of e.g. paying up in counterfactual mugging, and my sense was that FDT was taking a stand on the “counterfactuals vs. conditionals” question at least to some extent, insofar as it was using counterfactuals/do-operators on causal graphs. But it sounds like I’m missing some of the relevant distinctions here, and/or just misremembering what the paper was committed to (this is just me speaking from impressions skimming through the LessWrong-ish literature on this stuff).

Garrabrant:  I think that there is this (obvious to LessWrongers, because it is deeply entangled with the entire LessWrong philosophy) ontology in which “I am an algorithm” rather than “I am a physical object.” I think that most decision theorists haven’t really considered this ontology. I mostly view FDT (the paper) as a not-fully-formal attempt to bridge that inferential difference and argue for identifying with your algorithm.

I view it as a 2x2x2 cube for (algorithm vs physical), (CDT vs EDT), (updateless vs updateful).

And FDT is mostly about the first axis, because that is the one people are being stupid about. I think that the general MIRI-LW consensus is that the third axis should go on the updateless side, although there is also some possibility of preferring to build tools that are not updateless/do not identify with their algorithm (for the purpose of enslaving them 🙂).

Pace:  😆

Garrabrant:  However, the CDT vs EDT axis is more controversial, and maybe the actual answer looks more like “the question doesn’t really make sense once you identify with your algorithm correctly”.

One view I partially hold is that updateless-algorithm EDT is correct for an ideal reasoner, but all three axes represent tools to get approximately the right answer in spite of not being an ideal reasoner. 

Where naively pretending you are an ideal reasoner leads to catastrophe.

And this does not mean they are just hacks. Not being an ideal reasoner is part of being an embedded agent.

(Anyway, I think the paper may or may not make some choices about the other axes, but the heart of FDT is about the algorithm question.)

Pace:  Is 'ideal embedded reasoner' a wrong concept?

Garrabrant:  It depends on your standards for “ideal.” I doubt we will get anywhere near as ideal as “Bayesianism/Solomonoff induction.”

Carlsmith:  OK, so is this getting closer? EDT vs. CDT: evaluate X using conditionals vs. counterfactuals/do-operators/some causation-like thing.

Algorithm vs. Physical: X is your algorithm vs. X is something else. (I would’ve thought: the action? In which case, would this reduce to something like the policy vs. action? Not sure if evaluating policies on an “I am a physical object” view ends up different from treating yourself as an algorithm.)

Updateful vs. updateless: evaluate X using all your information, vs. relative to some prior.

Bensinger:  Agreed the FDT paper was mainly about the algorithm axis. I think the intent behind the paper was to make FDT = 'yes algorithm, yes updateless, agnostic about counterfactuals-vs-conditionals', but because the paper's goal was "begin to slowly bridge the inferential gap between mainstream philosophers and MIRI people" it brushed a lot of the interesting details under the rug.

And we thought it would be easier to explain LW-style decision theory using a counterfactual-ish version of FDT than using a conditional-ish version of FDT, and I think the paper failed to make it clear that we wanted to allow there to be conditional-ish FDTs.

Demski:  

"Algorithm vs. Physical: X is your algorithm vs. X is something else. (I would’ve thought: the action? [...]"

I don't agree with this part, quite. You can't change your whole algorithm, so it doesn't really make sense to evaluate different possible algorithms. You do, however, have control over your policy (which, at least roughly, is "the input-to-output mapping implemented by your algorithm").

Algorithm-vs-physical isn't quite "evaluate possible policies", however. I think a central insight is "think as if you have control over all your instances, rather than just one instance". This is commonly associated with EDT (because a causal decision theorist can "interpret" EDT as mistakenly thinking it has control over everything correlated with it, including other instances -- the you-are-your-algorithm insight says this is actually good, if perhaps not quite the right reasoning to use). So it's possible for someone to heavily endorse the "you are your algorithm" insight without grokking the "act as if you control your policy" idea.

(Such a person might get Newcomb right but be confused about how to handle Transparent Newcomb, a state which I think was common on LW at one point??)

(Transparent Newcomb is super confusing if you don't get the action/policy distinction, because Omega is actually choosing based on your policy -- what you would do if you saw a full box. But that's not easy to see if you're used to thinking about actions, to the point where you can easily think Transparent Newcomb isn't well-defined, or have other confusions about it.)
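
The action/policy distinction in Transparent Newcomb can be made concrete with a small sketch. It assumes the common version of the problem in which Omega fills the big box if and only if your policy one-boxes upon seeing it full, and it uses the usual illustrative payoffs. The names `outcome`, `all_policies`, and `best_policies` are hypothetical.

```python
from itertools import product

# Illustrative payoffs (assumed, as in the standard statement of the problem).
BIG, SMALL = 1_000_000, 1_000
OBSERVATIONS = ("full", "empty")
ACTIONS = ("one-box", "two-box")


def outcome(policy: dict) -> int:
    """Payoff of a policy, i.e. a mapping from what you see to what you do."""
    # Omega's choice depends on the policy: what you WOULD do on seeing a full box.
    box_full = policy["full"] == "one-box"
    seen = "full" if box_full else "empty"
    action = policy[seen]
    return (BIG if box_full else 0) + (SMALL if action == "two-box" else 0)


def all_policies():
    """Enumerate every mapping from observations to actions."""
    for choices in product(ACTIONS, repeat=len(OBSERVATIONS)):
        yield dict(zip(OBSERVATIONS, choices))


def best_policies():
    """Return the top payoff and every policy that achieves it."""
    scored = [(outcome(p), p) for p in all_policies()]
    top = max(score for score, _ in scored)
    return top, [p for score, p in scored if score == top]
```

Every top-scoring policy here one-boxes on seeing a full box. By contrast, an evaluator who looks only at the action, sees a full box, and holds its contents fixed would compute $1,001,000 for two-boxing against $1,000,000 for one-boxing, which is the kind of confusion the policy view is meant to dissolve.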

But I think TDT did have both the "you control your instances" insight and the "you control your policy" insight, just not coupled with the actual updatelessness part!

So if we really wanted to, we could make a 3x2x2, with the "alg vs physical" split into:

-> I am my one instance (physical); evaluate actions (of one instance; average over cases if there is anthropic uncertainty)

-> I am my instances: evaluate action (for all instances; no worries about anthropics)

-> I am my policy: evaluate policies

"is 'ideal embedded reasoner' a wrong concept?"

A logical inductor is ideal in the specific sense that it has some theoretical guarantees which we could call "rationality guarantees", and embedded in the sense that with a large (but finite) computer you could actually run it. I think this scales: we can generally talk about bounded rationality notions which can actually apply to actual algorithms, and these rationality conditions are "realizable ideals" in some sense.

The true "ideal" rationality should be capable of giving sound advice to embedded agents. But the pre-existing concepts of rationality are very far from this. So yeah, I think "Scott's Paradox" ("updateless EDT is right for ideal agents, but pretending you're an ideal agent can be catastrophic, so in some cases you should not follow the advice of ideal DT") is one which should dissolve as we get better concepts.

"Updateful vs. updateless: evaluate X using all your information, vs. relative to some prior"

And this is a basic problem with UDT: which prior? How much information should we use / not use?

Do realistic embedded agents even "have a prior"?

Or just, like, a sequence of semi-coherent belief states?

"On this framework, I’m a bit confused by the idea that FDT is a neutral over-arching term for MIRI-style decision theory, which can be updateless or not. For example, my impression from the paper was that FDT was supposed to be updateless in the sense of e.g. paying up in counterfactual mugging."

I currently see it as a "reasonable view" (it's been my view at times) that updatelessness is doomed, so we have to find other ways to achieve eg paying up in counterfactual mugging. It still points to something about "MIRI-style DT" to say "we want to pay up in counterfactual mugging", even if one does not endorse updatelessness as a principle.

So I see all axes except the "algorithm" axis as "live debates" -- basically anyone who has thought about it very much seems to agree that you control "the policy of agents who sufficiently resemble you" (rather than something more myopic like "your individual action"), but there are reasonable disagreements to be had about updatelessness and counterfactuals.

Beckstead:  One thing I find confusing here is how to think about the notion of "sufficiently resemble." E.g., how would I in principle estimate how many more votes go to my favored presidential candidate in a presidential election (beyond the standard answer of "1")?

(I have appreciated this discussion of the 3 x 2 x 2 matrix. I had previously been thinking of it in the 2 x 2 terms of CDT vs. EDT and updateless/updateful.)

Demski:  

"One thing I find confusing here is how to think about the notion of 'sufficiently resemble.' E.g., how would I in principle estimate how many more votes go to my favored presidential candidate in a presidential election (beyond the standard answer of '1')?"

My own answer would be the EDT answer: how much does your decision correlate with theirs? Modulated by ad-hoc updatelessness: how much does that correlation change if we forget "some" relevant information? (It usually increases a lot.)

For voting in particular, if these esoteric DT considerations would change my answer, then they usually wouldn't, actually (because if the DT is important enough in my computation, then I'm part of a very small reference class of voters, and so, should mostly act like it's just my one vote anyway).

But I think this line of reasoning might actually underestimate the effect for subtle reasons I won't get into (related to Agent Simulates Predictor).

Beckstead:  Cool, thanks.

Carlsmith:  For folks who think that CDT and EDT basically end up equivalent in practice, does that mean updateful EDT two-boxes in non-transparent Newcomb, and you need to appeal to updatelessness to get one-boxing?

Anyone have a case that differentiates between (a) updateful, but evaluates policies, and (b) updateless?

Demski:  EDT might one-box at first due to simplicity priors making it believe its actions are correlated with similar things (but then again, it might not; depends on the prior), but eventually it'll learn the same thing as CDT.

Now, that doesn't mean it'll two-box. If Omega is a perfect predictor (or more precisely, if it knows the agent's action better than the agent itself), EDT will learn so, and one-box. And CDT will do the same (under some potentially contentious assumptions about CDT learning empirically) because no experiments will be able to show that there's no causal relationship.

On the other hand, if Omega is imperfect (more precisely, if its predictions are worse than or equal to the agent's own), EDT will learn to two-box like CDT, because its knowledge about its own action "screens off" the probabilistic relationship.

"Anyone have a case that differentiates between (a) updateful, but evaluates policies, and (b) updateless?"

Counterfactual mugging!
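
A minimal sketch of why counterfactual mugging separates the two: it assumes a fair coin and the commonly quoted stakes (Omega asks for $100 on tails, and pays $10,000 on heads only if you would have paid on tails). Both evaluators below score whole policies; they differ only in whether the coin observation has been folded into the probabilities. Function names and stakes are illustrative assumptions.

```python
# Commonly quoted stakes, assumed here for illustration.
PRIZE, COST = 10_000, 100


def policy_value(pays_on_tails: bool, p_heads: float = 0.5) -> float:
    """Expected value of the policy under a given probability of heads."""
    heads_payoff = PRIZE if pays_on_tails else 0
    tails_payoff = -COST if pays_on_tails else 0
    return p_heads * heads_payoff + (1 - p_heads) * tails_payoff


def updateless_choice() -> bool:
    """Score policies against the prior, ignoring the observed coin flip."""
    return max((True, False), key=policy_value)


def updateful_choice() -> bool:
    """Score policies after updating on having seen tails (so P(heads) = 0)."""
    return max((True, False), key=lambda pays: policy_value(pays, p_heads=0.0))
```

With these numbers the prior-relative evaluation prefers paying (4,950 versus 0), while the evaluation that has updated on tails prefers refusing (-100 versus 0), even though both are comparing policies rather than single actions.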

7 comments

Rob, are you able to disclose why people at Open Phil are interested in learning more decision theory? It seems a little far away from the AI strategy reports they've been publishing in recent years, and it also seemed like they were happy to keep funding MIRI (via their Committee for Effective Altruism Support) despite disagreements about the value of HRAD research, so the sudden interest in decision theory is intriguing.

Mostly personal interest on my part (I was working on a blog post on the topic, now up), though I do think that the topic has broader relevance.

I was in the chat and don't have anything especially to "disclose". Joe and Nick are both academic philosophers who've studied at Oxford and been at FHI, with a wide range of interests. And Abram and Scott are naturally great people to chat about decision theory with when they're available.

My (uninformed) (non-)explanation: NB and JC are both philosophers by training, and it's not surprising for philosophers to be interested in decision theory.

"My own answer would be the EDT answer: how much does your decision correlate with theirs? Modulated by ad-hoc updatelessness: how much does that correlation change if we forget 'some' relevant information? (It usually increases a lot.)"

I found this part particularly interesting and would love to see a fleshed-out example of this reasoning so I can understand it better.

"How would I in principle estimate how many more votes go to my favored presidential candidate in a presidential election (beyond the standard answer of '1')?"

 

I'm happy to see Abram Demski mention this as I've long seen this as a crucial case for trying to understand subjunctive linking.

"My own answer would be the EDT answer: how much does your decision correlate with theirs?"

This is my perspective as well. I can't imagine that subjunctive linking exists ontologically. That is, there isn't some objective fact in the universe, in and of itself, linking someone's decision to yours; instead, it is about how you model other actors (I don't know if I still fully embrace this post, but it's still illustrative of my position).

So unless we actually start getting into the details of how you're modelling the situation, we can't really answer it. In a way this means that the concept of subjunctive linking can be a misleading frame for this question. The way this question is answered is by updating the model given the new information (that a particular person voted a particular way) rather than trying to identify some mysterious free-floating effect that we have no reason to think exists.

One way to try to understand this would be to construct the simplest case we can understand. So let's imagine a world where there are two candidates, Hillary and Obama. We'll assume there are 10 voters and that you have no information about the other voters apart from the fact that:

  • There's a 50% chance that every voter has a 40% chance of voting for Hillary and 60% for Obama
  • There's a 50% chance that every voter has a 60% chance of voting for Hillary and 40% for Obama

Once you've decided on your vote, it should cause you to update your probability about which world you are in, and then you can calculate the chance of winning the election. Anyway, this is just a comment, but I'll probably solve this problem and put it in its own separate post afterwards.

I imagine that by constructing a whole bunch of similar scenarios we might be able to make solid progress here.
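
A minimal sketch of that calculation, under assumptions the comment leaves open: you are one of the 10 voters, your own vote is treated as one more draw from the same per-voter distribution, the other 9 votes are independent given the world, and winning requires a strict majority (a 5-5 tie counts as not winning). All names are hypothetical.

```python
from math import comb

N_VOTERS = 10              # total electorate, assumed to include you
P_HIGH, P_LOW = 0.6, 0.4   # per-voter chance of backing your candidate in each world
PRIOR_HIGH = 0.5           # prior probability of the favorable world


def posterior_high_given_my_vote() -> float:
    """P(favorable world | I vote for my candidate), by Bayes' rule."""
    joint_high = PRIOR_HIGH * P_HIGH
    joint_low = (1 - PRIOR_HIGH) * P_LOW
    return joint_high / (joint_high + joint_low)


def p_at_least(k: int, n: int, p: float) -> float:
    """P(Binomial(n, p) >= k)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


def p_win(p_high_world: float) -> float:
    """Chance of a strict majority, given my vote and a credence in the favorable world."""
    others = N_VOTERS - 1
    needed_from_others = N_VOTERS // 2  # strict majority (6 of 10) minus my own vote
    return (p_high_world * p_at_least(needed_from_others, others, P_HIGH)
            + (1 - p_high_world) * p_at_least(needed_from_others, others, P_LOW))


if __name__ == "__main__":
    print("without updating on my vote:", p_win(PRIOR_HIGH))
    print("after updating on my vote:  ", p_win(posterior_high_given_my_vote()))
```

With these numbers the credence in the favorable world moves from 0.5 to 0.6 after conditioning on your own vote, and the estimated chance of winning rises from 0.5 to roughly 0.55. Different tie rules or a different model of how your vote relates to the others would change the figures.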

"For voting in particular, if these esoteric DT considerations would change my answer, then they usually wouldn't, actually (because if the DT is important enough in my computation, then I'm part of a very small reference class of voters, and so, should mostly act like it's just my one vote anyway)."

Strongly agreed and something that people often miss.
 

Is there/Should there be a boolean table of undominated decision theories vs. enough problems to disprove any domination?