Anyone want to debate publicly about FDT?


I can debate you if it's in October or later. I'm an AI alignment researcher specializing in AI theory, with extensive background in math, physics and computer science and just a little dabbling in academic philosophy. My position is not quite "FDT is 100% correct" (I have my own theories on the subject, which are more rigorous), but it can probably be rounded off to "FDT is 100% correct" from where you're standing.

I assume this is the post you're talking about? Reading it, I'm finding it difficult to identify why you think what you think. You seem to find a lot of things obvious, and unfortunately they are things that seem incorrect to me, and why they seem obvious to you is opaque. You think Guaranteed Payoffs is obvious; I think it rules out a wide class of desirable behavior, and agree with critical takes from Abram Demski, Ofer, Stuart Armstrong, and Zvi. If you put forward arguments with parts, I imagine it might be interesting to look at them in detail and identify a narrower, resolvable disagreement, but as is it doesn't seem like that would be likely to happen in a debate, which would make it an exercise in frustration.

To understand why FDT is true, it's best to start with Newcomb's problem. Since you believe you should two-box, it might be best to debate Newcomb's problem with somebody first. Debating FDT at this stage seems like a waste of time for both parties.

Is there a formalization of FDT that can be fed into a computer rather than argued about by fallible humans?

Some years ago I made a version of it that works on formulas in provability logic. That logic is decidable, so you can go ahead and code it, and it'll solve any decision problem converted into such formulas. The same approach can deal with observations and probability (but can't deal with other agents or any kind of logical probability). You could say it's a bit tautological though: once you've agreed to convert decision problems into such formulas, you've gone most of the way, and FDT is the only answer that works at all.

Interesting! It seems like something like that should be a canonical reference for "let's enter a problem" e.g. smoking lesion, then "select decision theory", and out pops the answer. Of course, formalizing the problem seems like the hard part.

I'd be keen to debate you on FDT, although I expect you'd probably want to debate someone who would more centrally defend FDT.

In short, my position is:

- FDT constitutes a major advance in decision theory
- It is broken in particular ways; I think I have some ideas of how to fix it, but then you end up with a theory that is FDT-like rather than FDT itself
- However, I'm not convinced that any of Schwarz's arguments are fatal to FDT (though I'd have to reread his piece, which I read many years ago, in order to be certain).

So my expectation is that you'd rather debate someone who would defend FDT more fully and not just critique your specific counterarguments.

In terms of my background, I majored in maths/philosophy, I've been posting about decision theory here on LW for years and I'm currently working on an adversarial collaboration with Abram Demski about whether EDT is a promising approach to decision theory or not.

Well, firstly, there's the assumption that there's a unique way of mapping a physical system to a particular function. Physical systems can be interpreted in many different ways.

Secondly, I think it's a mistake to insist that we model subjunctive linking as logical counterfactuals. My memory isn't perfect, but I don't recall seeing a justification for this choice in the FDT paper, apart from "Wouldn't it be convenient if it were true?"

I suspect this comes from the allergy of much of the LW crowd to philosophy. If you say that you're dealing with logical counterfactuals, then it looks like you're dealing with mathematical formalisms, never mind that it isn't really a formalism until you pin down a lot more details, since there's no objective fact of the matter about what it would mean for a function to be equal to something that it's not.

It seems much more honest to just admit that you're not yet at the formalisation stage and to follow the philosophical route of asking, "So what do we really mean by counterfactuals?". And until you have a good answer to this question, you don't want to commit yourself to a particular route, such as assuming that the solution must be some kind of formalism for dealing with non-classical logic.

A further point: we aren't just trying to imagine that, say, f(x)=1 instead of 2 because we're interested in this question in and of itself, but rather because we're trying to figure out how to make better decisions. Throwing away the why is a mistake in my books. Even if we were only looking at non-classical logics, we would be throwing away our criteria for distinguishing between different schemes. And as long as we're keeping the why around, there's no reason to reduce the question to a merely logical one.

I wouldn't aim to debate you but I could help you prepare for it, if you want. I'm also looking for someone to help me write something about the Orthogonality Thesis and I know you've written about it as well. I think there are probably things we could both add to each other's standard set of arguments.

I would love to debate you on this. My view: there is no single known problem in which FDT makes an incorrect decision. I have thought about FDT a lot and it seems quite obviously correct to me.

You should take a look at this list of UDT open problems that Vladimir Slepnev wrote 13 years ago, where 2 and 3 are problems in which UDT/FDT seemingly make incorrect decisions, and 1 and 5 are definitely also serious open problems.

Here's what infra-Bayesianism has to say about these problems:

1. Understanding multi-agent scenarios is still an open problem in general, but a recent line of inquiry (hopefully I will write about it in a month or so) leads me to believe that, at least in repeated games, IB agents converge to outcomes that are (i) nearly Pareto efficient and (ii) above the maximin of each player (and arguably this might generalize to one-shot games as well, under reasonable assumptions). However, *which* Pareto-efficient outcome results depends on the priors (and maybe also on other details of the learning algorithm) those agents have. In the three-player prisoner's dilemma you described, CCD is an outcome satisfying (i) and (ii) (as opposed to CD in a two-player PD, which fails condition (ii)). Therefore, it is an outcome that *can* arise with three IB players, and being a "CDT" (or any kind of defector) is not strictly superior. Moreover, it seems natural that the precise outcome depends on priors: depending on what kind of agents you expect to meet, it might be better to drive a "harder" or "softer" bargain.
2. This depends on how you operationalize ASP. One operationalization is: consider a repeated Newcomb's problem, where the Predictor receives your state on every round and does a weak analysis of it, filling the box only if the weak analysis is sufficient to confirm you will one-box. In this case, a metacognitive IB agent with a sufficiently weak core will learn to win by not querying the envelope in order to make itself more predictable. (This idea appears in some comment I made a while ago, which I can't locate atm.) Now, this still requires that the core is sufficiently weak. However, I think this type of requirement is necessary, since there has to be some cutoff for what counts as a predictor.
3. This again comes down to selecting among Pareto-efficient outcomes, so the comments in #1 apply.
4. This is not a utility function compatible with IB, nor do I see why such utility functions must be admissible. For a metacognitive agent, we can write a utility function that rewards finding an inconsistency after time $t$ by $f(t)$, where $f$ is some function that goes to 0 as its argument goes to infinity.
5. See comments in #1.

Why the focus on repeated games? It seems like one of the central motivations for people to be interested in logical decision theories (TDT/UDT/FDT) is that they recommend one-boxing and playing C in PD (against someone with similar decision theory), even in non-repeated games, and you're not addressing that?

**First,** one problem with classical thinking about decision theory is that it assumes perfect separation between *learning* models and *deciding* based on those models. Moreover, different decision theories use different type signatures for what a "model" is: in EDT it's a probability distribution, in CDT it's a causal network (interpreted as physical causality), in FDT it's a "logical" causal network. This without giving any consideration to whether learning models of this type is even possible, or whether the inherent ambiguity makes some questions in decision theory moot.

In order to consider learning as part of the process, we need to set up the problem in a way that allows gathering evidence. The simplest way to do it is iterating the original problem. This quickly leads to insights that show that the perfect separation assumption is deeply flawed. For example, consider the iterated Newcomb's problem. On every round, the agent selects among two actions (one-box and two-box), makes one of two possible observations (empty-box or full-box) and gets a reward accordingly. Remarkably, virtually any reasonable reinforcement learning (RL) algorithm converges to one-boxing^{[1]}.
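
A minimal sketch of this convergence claim (a toy setup of my own, not the construction referenced in the comment: it assumes a perfect predictor and payoffs scaled down to 100 and 1), in which an epsilon-greedy averaging learner ends up one-boxing:

```python
import random

random.seed(0)

# Iterated Newcomb's problem against a *perfect* predictor: the opaque box
# is filled exactly when the agent one-boxes this round. Payoffs scaled down:
# opaque box worth 100 if filled, transparent box worth 1.
ONE_BOX, TWO_BOX = 0, 1
Q = [0.0, 0.0]      # running-average reward estimate per action
counts = [0, 0]
EPSILON = 0.1       # exploration rate

for _ in range(5000):
    if random.random() < EPSILON:
        action = random.choice([ONE_BOX, TWO_BOX])
    else:
        action = ONE_BOX if Q[ONE_BOX] >= Q[TWO_BOX] else TWO_BOX
    filled = (action == ONE_BOX)  # perfect prediction of the actual choice
    reward = (100 if filled else 0) + (1 if action == TWO_BOX else 0)
    counts[action] += 1
    Q[action] += (reward - Q[action]) / counts[action]

print(Q)  # one-box rewards average 100, two-box rewards average 1
print("greedy policy:", "one-box" if Q[ONE_BOX] > Q[TWO_BOX] else "two-box")
```

Against a perfect predictor, one-boxing always pays and two-boxing never does, so any reward-averaging learner's greedy policy settles on one-boxing; counterfactual mugging and XOR blackmail, mentioned below, are exactly the scenarios where this simple setup stops working.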

This happens despite the fact that RL algorithms can be naively thought of as CDT: there's an infinite repetitive causal network, with causation flowing from action to unobservable state, from unobservable state to observation, and from previous unobservable state to next unobservable state. The reason for the apparent paradox is that inferring causation from statistical correlations leads to labeling some kinds of logical causation as causation as well. This already shows that the EDT/CDT-style taxonomy gets murky and inadequate when describing learning agents.

However, some Newcombian scenarios don't yield that easily. For example, in counterfactual mugging RL fails. It also fails in iterated XOR blackmail^{[2]}.

Enter infra-Bayesianism. The motivation is solving realizability. As a "side effect", (iterated/learnable) Newcombian scenarios get solved as well^{[3]}, at least as long as they obey a condition called "pseudocausality"^{[4]}. Here by "solved" I mean, convergence to the FDT-payoff is ensured. This side effect is not accidental, because the existence of powerful predictors in the environment is indeed a nonrealizable setting.

**Second,** another line of thinking which leads to iterated games is Abram's thesis that in logical time, all games are iterated games. I now consider this insight even more powerful than I realized at the time.

A key conjecture informing my thinking about intelligence is: the core of intelligence lies in learning algorithms. In particular, the ability to solve problems by deductive reasoning is also the product of a learning algorithm that converges to deductive reasoning (augmented by some way of selecting promising deductive steps) as an effective method for solving various problems. In particular this frees us from committing to a particular system of axioms: learning algorithms can empirically decide which axiom systems are more useful than others.

However, classical learning theory flounders against the fact that real life priors are unlearnable, because of traps: actions can lead to irreversible long-term harm. But, we can rescue the power of learning algorithms via the *metacognitive agents* framework. The latter allows us to split learning into a "logical" part (imagine doing thought experiments in your head) and a "physical" part, where the latter navigates traps in a particular way that the former has learned. In machine learning parlance, the first part is "metalearning" and the second part is "object-level learning". (See my recent talk for some more details.)

This formalism for metacognition is also a realization of Abram's idea. The role of logical time is played by the amount of computing resources available for metalearning. Indeed, my recent line of inquiry about repeated games (working name: "infra-Bayesian haggling") is applicable in this framework, enabling the same conclusions for one-shot games. However, there is still the major challenge of solving games where we have one-shot transparent-source-code interactions with individual agents, but we also need to learn the semantics of source code *empirically* by interacting with many different agents over time (the latter is just the simplest way of operationalizing the gathering of empirical evidence about semantics). For this I only have vague ideas, even though I believe a solution exists.

**Third,** repeated games are an easier but still highly non-trivial problem, so they are a natural starting point for investigation. One of the reasons for the LessWrongian focus on one-shot games was (I think) that in the iterated Prisoner's Dilemma (IPD), cooperation is already "solved" by tit-for-tat. However, the latter is not really true: while tit-for-tat is a Nash equilibrium in the limit^{[5]}, defect-defect is *also* a Nash equilibrium. Worse, if instead of geometric time discount we use step-function time discount, defect-defect is again the only Nash equilibrium (although tit-for-tat is at least an $\varepsilon$-Nash equilibrium, with $\varepsilon$ vanishing in the long horizon limit). Also worse, we don't have a learning-theoretic explanation of convergence even to Nash equilibria, because of the grain-of-truth problem!
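
The two equilibrium claims for geometric discounting can be checked with folk-theorem arithmetic; the stage-game payoffs (T=5, R=3, P=1, S=0) and discount factor below are illustrative assumptions, not values from the comment:

```python
# Prisoner's Dilemma stage payoffs for the row player: T > R > P > S.
T, R, P, S = 5, 3, 1, 0
gamma = 0.9  # geometric time discount; cooperation needs gamma large enough

# Tit-for-tat vs tit-for-tat: mutual cooperation forever.
tft_vs_tft = R / (1 - gamma)
# Best deviation against tit-for-tat: grab T once, then mutual defection forever.
deviate_vs_tft = T + gamma * P / (1 - gamma)

# Always-defect vs always-defect: mutual defection forever.
alld_vs_alld = P / (1 - gamma)
# Deviating by cooperating one round against always-defect only loses (S < P).
deviate_vs_alld = S + gamma * P / (1 - gamma)

print(tft_vs_tft > deviate_vs_tft)    # tit-for-tat is a Nash equilibrium here
print(alld_vs_alld > deviate_vs_alld) # ...and so is defect-defect
```

With these numbers, neither player gains by deviating from either profile (the tit-for-tat condition works out to gamma >= (T - R) / (T - P), i.e. gamma >= 0.5 here), illustrating how the folk theorem leaves both cooperation and defection as equilibria.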

The grain-of-truth problem is just a special case of the nonrealizability problem. So it's natural to expect that infra-Bayesianism (IB) should help. Indeed, IB more or less solves convergence to Nash equilibria in repeated two-player games off the bat, but in a rather disappointing manner: due to the folk theorem, any payoff above the maximin is a Nash payoff, and IB indeed guarantees the maximin payoff. This doesn't require anything resembling learning a theory of mind, just learning the rules of the game. Getting any stronger results proved difficult, until very recently.

Enter infra-Bayesian haggling. Now^{[6]}, IB agents converge^{[7]} to Pareto efficient outcomes.

A major caveat is, infra-Bayesian haggling still looks like it isn't really learning a theory of mind, just sort of trial-and-erroring until something sticks. I suspect that a solution to the harder problem I mentioned before (learning the semantics of source code in one-shot transparent interactions with many agents) would look more theory-of-mind-esque.

So, there is still much work ahead but I feel that drawing some insight about your open problems is already somewhat justified.

^{^}It's also interesting to consider the impact on human intuition. I suspect it's much harder to feel that two-boxing makes sense when the experiment repeats every day, and every time you one-box you get out with much more money.

^{^}Perhaps we can rescue the classical taxonomy by declaring that RL is just EDT. But then the question is what kind of learning agent is CDT, if any. The classical Pearlian methods for causal inference don't seem to help, unless you assume the ability to magically intervene on arbitrary variables, which is completely unjustified. As we've seen, the agent's selection of its own actions cannot be interpreted as intervention, if we want to rescue CDT while preserving its insistence that causality is physical.

^{^}Also interesting is that the representation of Newcombian scenarios as an ultra-POMDP, allowing for homogeneous ultradistributions (this obviates the need for the so-called "Nirvana trick"), is a natural reflection of the formulation of the problem in terms of predictors. In particular, the original Newcomb's problem can be represented in this manner, which in some sense preserves the separation between "logical" and "physical" causality (although this alternative representation is observationally indistinguishable: but maybe it's a hint that IB agents would have an easier time learning it, given additional relevant cues).

^{^}Which rules out e.g. full-box-dependent transparent Newcomb, while allowing empty-box-dependent transparent Newcomb or full-box-dependent transparent Newcomb with noise. Pseudocausality is (borrowing the language of the original Timeless Decision Theory paper) a "fairness condition", weaker than the one required by e.g. CDT but a little stronger than the one FDT is supposed to satisfy. There are variants of infra-Bayesianism with a weaker fairness condition, including those specifically constructed for this purpose (noisy infra-Bayesianism and infra-Bayesianism with infinitesimals), which might be contrived, but also infra-Bayesian physicalism (originally motivated by naturalized induction and related issues). However, we haven't yet described the learning theory of the latter.

^{^}Notably, this is the same limit that we consider in learning theory.

^{^}I claim. I haven't actually written out the full proof rigorously, and it will probably take me a while to get around to that. The deterministic/sharp-hypotheses case seems fairly straightforward but the stochastic-hypotheses case might be quite hairy.

^{^}In some limit. Under some non-trivial and yet-to-be-fully-understood but arguably-plausible assumptions about the learning algorithm, and some reasonable assumptions about the prior.

Ah, I just read your substack post on this, and you've referenced two pieces I've already reacted to (and in my view debunked) before. Seems like we could have a good debate on this :)

I have written before about why FDT is relevant to solving the alignment problem. I'd be happy to discuss that with you.

I'd be happy to have a written or verbal debate, but not about FDT. FDT is, indeed, exploitable.

There's a couple different ways of exploiting an FDT agent. One method is to notice that FDT agents have implicitly precommitted to FDT (rather than the theorist's intended terminal value function). It's therefore possible to contrive scenarios in which those two objectives diverge.

Another method is to modify your own value function such that "make functional decision theorists look stupid" becomes a terminal value. After you do that, you can blackmail them with impunity.

FDT is a reasonable heuristic, but it's not secure against pathological hostile action.

"Modifying your utility function" is called a threat-by-proxy, and FDT agents ignore such threats, so you are disincentivized to do this.

"Saying you are gonna do it anyway in hope that FDT agent yields" and "doing it anyway" are two very different things.

Correct. The last time I was negotiating with a self-described FDT agent I did it anyway. 😛

My utility function is "make functional decision theorists look stupid", which I satisfy by blackmailing them. Either they cave, which means I win, or they don't cave, which demonstrates that FDT is stupid.

If your original agent is replacing themselves as a threat to FDT, because they want FDT to pay up, then FDT rightly ignores it. Thus the original agent, which just wants paperclips or whatever, has no reason to threaten FDT.

If we postulate a *different* scenario where your original agent literally terminally values messing over FDT, then FDT would pay up (if FDT actually believes it isn't a threat). Similarly, if part of your values has you valuing turning metal into paperclips and I value metal being anything-but-paperclips, I/FDT would pay you to avoid turning metal into paperclips. If you had different values - even opposite ones along various axes - then FDT just trades with you.

However FDT tries to close off the incentives for strategic alterations of values, even by proxy, to threaten.

So I see this as a non-issue.

I'm not sure I see the pathological version of the problem statement, where an agent has the utility function 'Do the worst possible action to agents who exactly implement (*Specific Decision Theory*)', as a problem either. You can construct an instance for any decision theory. Do you have a specific idea how you would get past this? FDT would obviously modify itself if it can use that to get around the detection (and the results are important enough to not just eat the cost).

My deontological terminal value isn't to causally win. It's for FDT agents to acausally lose. Either I win, or the FDT agents abandon FDT. (Which proves that FDT is an exploitable decision theory.)

I'm not sure I see the pathological version of the problem statement, where an agent has the utility function 'Do the worst possible action to agents who exactly implement (*Specific Decision Theory*)', as a problem either. Do you have a specific idea how you would get past this?

There's a Daoist answer: Don't legibly and universally precommit to a decision theory.

But the exploit I'm trying to point to is simpler than Daoist decision theory. Here it is: Functional decision theory conflates two decisions:

- Use FDT.
- Determine a strategy via FDT.

I'm blackmailing contingent on decision 1 and not on decision 2. I'm not doing this because I need to win. I'm doing it because I can. Because it puts FDT agents in a hilarious lose-lose situation.

The thing FDT disciples don't understand is that I'm happy to take the scenario where FDT agents don't cave to blackmail. Because of this, FDT demands that FDT agents cave to my blackmail.

I assume what you're going for with your conflation of the two decisions is this, though you aren't entirely clear on what you mean:

- Some agent starts with some (potentially broken in various manners, like bad heuristics or unable to consider certain impacts) decision theory, because there's no magical apriori decision algorithm
- So the agent is using that DT to decide how to make better decisions that get more of what it wants
- CDT would modify into Son-of-CDT typically at this step
- The agent is deciding whether it should use FDT.
- It is 'good enough' that it can predict that if it decides to just completely replace itself with FDT, it will get punched by your agent, or it will have to pay to avoid being punched.
- So it doesn't completely swap out to FDT, even if it is strictly better in all problems that aren't dependent on your decision theory
- But it can still follow FDT to generate actions it should take, which won't get it punished by you?

Aside: I'm not sure there's a strong definite boundary between 'swapping to FDT' (your 'use FDT') and taking FDT's outputs to get actions that you should take. Ex: If I keep my original decision loop but it just consistently outputs 'FDT is best to use', is that swapping to FDT according to you?

Does `if (true) { FDT() } else { CDT() }` count as FDT or not?

(Obviously you can construct a class of agents which have different levels that they consider this at, though)

There's a Daoist answer: Don't legibly and universally precommit to a decision theory.

But you're whatever agent you are. You are automatically committed to whatever decision theory you implement. I can construct a similar scenario for any DT.

'I value punishing agents that swap themselves to being `DecisionTheory`.'

Or just 'I value punishing agents that use `DecisionTheory`.'

Am I misunderstanding what you mean?

How do you avoid legibly being committed to a decision theory, when that's how you decide to take actions in the first place? Inject a bunch of randomness so others can't analyze your algorithm? Make your internals absurdly intricate to foil most predictors, and only expose a legible decision making part in certain problems?

FDT, I believe, would acquire uncertainty about its algorithm if it expects that to actually be beneficial. It isn't universally-glomarizing like your class of DaoistDTs, but I shouldn't commit to being illegible either.

I agree with the argument for not replacing your decision theory wholesale with one that does not actually get you the most utility (according to how your current decision theory makes decisions). However I still don't see how this exploits FDT.

Choosing FDT loses in the environment against you, so our thinking-agent doesn't choose to swap out to FDT - assuming it doesn't just eat the cost for all those future potential trades. It still takes actions as close to FDT as it can as far as I can tell.

I can still construct a symmetric agent which goes 'Oh you are keeping around all that algorithmic cruft around shelling out to FDT when you just follow it always? Well I like punishing those kinds of agents.'

If the problem specifies that it is an FDT agent from the start, then yes FDT gets punished by your agent. And, how is that exploitable?

The original agent before it replaced itself with FDT shouldn't have done that, given full knowledge of the scenario it faced (only one decision forevermore, against an agent which punishes agents which only implement FDT), but that's just the problem statement?

The thing FDT disciples don't understand is that I'm happy to take the scenario where FDT agents don't cave to blackmail.

That's the easy part. You are just describing an agent that likes messing over FDT, so it benefits you regardless of whether the FDT agent gives in to blackmail or not.

This encourages agents which are deciding what decision theory to self-modify into (or make servant agents with) to not use FDT for it, *if* they expect to get more utility by avoiding that.

I have a blog and a YouTube channel. I recently expressed the view that FDT is crazy. If anyone wants to have either a written or verbal debate about that, hit me up. Credit to Scott Alexander for this suggestion.