In this post, I clarify how far we are from a complete solution to decision theory, and the way in which high-level philosophy relates to the mathematical formalism. I’ve personally been confused about this in the past, and I think it could be useful to people who casually follow the field. I also link to some less well-publicized approaches.

The first disagreement you might encounter when reading about alignment-related decision theory is the disagreement between Causal Decision Theory (CDT), Evidential Decision Theory (EDT), and different logical decision theories emerging from MIRI and LessWrong, such as Functional Decision Theory (FDT) and Updateless Decision Theory (UDT). This is characterized by disagreements on how to act in problems such as Newcomb’s problem, the smoking lesion and the prisoner's dilemma. MIRI’s paper on FDT represents this debate from MIRI’s perspective, and, as exemplified by the philosopher who refereed that paper, academic philosophy is far from having settled on how to act in these problems.

I’m quite confident that the FDT-paper gets those problems right, and as such, I used to be pretty happy with the state of decision theory. Sure, the FDT-paper mentions logical counterfactuals as a problem, and sure, the paper only talks about a few toy problems, but the rest is just formalism, right?

As it turns out, there are a few caveats to this:

- CDT, EDT, FDT, and UDT are high-level clusters of ways to go about decision theory. They have multiple attempted formalisms, and it’s unclear to what extent different formalisms recommend the same things. For FDT and UDT in particular, it’s unclear whether any one attempted formalism (e.g. the graphical models in the FDT paper) will be successful. This is because:
- Logical counterfactuals is a really difficult problem, and it’s unclear whether there exists a natural solution. Moreover, any non-natural, arbitrary details in potential solutions are problematic, since some formalisms require everybody to know that everybody uses sufficiently similar algorithms. This highlights that:
- The toy problems are radically simpler than actual problems that agents might encounter in the future. For example, it’s unclear how they generalise to acausal cooperation between different civilisations. Such civilisations could use implicitly implemented algorithms that are more or less similar to each other’s, may or may not be trying and succeeding to predict each other’s actions, and might be in asymmetric situations with far more options than just cooperating and defecting. This poses a lot of problems that don’t appear when you consider pure copies in symmetric situations, or pure predictors with known intentions.

As a consequence, knowing what philosophical position to take in the toy problems is only the beginning. There’s no formalised theory that returns the right answers to all of them yet, and if we ever find a suitable formalism, it’s very unclear how it will generalise.

If you want to dig into this more, Abram Demski mentions some open problems in this comment. Some attempts at making better formalisations include Logical Induction Decision Theory (which uses the same decision procedure as evidential decision theory, but gets logical uncertainty by using logical induction), and a potential modification, Asymptotic Decision Theory. There’s also a proof-based approach called Modal UDT, for which a good place to start would be the 3rd section in this collection of links. Another surprising avenue is that some formalisations of the high-level clusters suggest that they're all the same. If you want to know more about the differences between Timeless Decision Theory (TDT), FDT, and versions 1.0, 1.1, and 2 of UDT, this post might be helpful.

Heterodox opinion: I think the entire MIRIesque (and academic philosophy) approach to decision theory is confused. The basic assumption seems to be, that we can decouple the problem of learning a model of the world from the problem of taking a decision given such a model. We then ignore the first problem, and assume a particular shape for the model (for example, causal network) which allows us to consider decision theories such as CDT, EDT etc. However, in reality the two problems cannot be decoupled. This is because the type signature of a world model is only meaningful if it comes with an algorithm for how to learn a model of this type.

For example, consider Newcomb's paradox. The agent makes a decision under the assumption that Omega behaves in a certain way. But, where did the assumption come from? Realistic agents have to learn everything they know. Learning normally requires a time sequence. For example, we can consider the iterated Newcomb's paradox (INP). In INP, any reinforcement learning (RL) algorithm will converge to one-boxing, simply because one-boxing gives it the money. This is despite RL naively looking like CDT. Why does it happen? Because in the learned model, the "causal" relationships are not physical causality. The agent comes to believe that taking the one box causes the money to appear there.

In Newcomb's paradox EDT succeeds but CDT fails. Let's consider an example where CDT succeeds and EDT fails: the XOR blackmail. The iterated version would be IXB. In IXB, classical RL doesn't guarantee much because the environment is more complex than the agent (it contains Omega). To overcome this, we can use RL with incomplete models. I believe that this indeed solves both INP and IXB.
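To make the INP claim concrete, here is a minimal sketch (not anyone's actual proposal) of an epsilon-greedy bandit learner facing a predictor that is assumed to be perfect on every round, exploration included. The box contents, the function names and the learner itself are all illustrative assumptions. The point is only that the learner ends up valuing one-boxing more, because that is the action it observes paying out.

```python
import random

BIG, SMALL = 1_000_000, 1_000  # hypothetical box contents


def newcomb_round(action):
    """One round against an assumed perfect predictor."""
    prediction = action  # assumption: the prediction always matches the action
    big_box = BIG if prediction == "one-box" else 0
    return big_box if action == "one-box" else big_box + SMALL


def run_bandit(rounds=1000, epsilon=0.1, seed=0):
    """Epsilon-greedy learner tracking the mean reward of each action."""
    rng = random.Random(seed)
    actions = ["one-box", "two-box"]
    value = {a: 0.0 for a in actions}  # running mean reward per action
    count = {a: 0 for a in actions}
    for t in range(rounds):
        if t < len(actions):
            action = actions[t]  # try each action once at the start
        elif rng.random() < epsilon:
            action = rng.choice(actions)  # explore
        else:
            action = max(actions, key=value.get)  # exploit
        reward = newcomb_round(action)
        count[action] += 1
        value[action] += (reward - value[action]) / count[action]
    return value, count


value, count = run_bandit()
print(value, count)
```

The learner has no notion of physical causality here. It only sees that one action is followed by more money than the other.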

Then we can consider e.g. counterfactual mugging. In counterfactual mugging, RL with incomplete models doesn't work. That's because the assumption that Omega responds in a way that depends on a counterfactual world is not in the space of models at all. Indeed, it's unclear how any agent can learn such a fact from empirical observations. One way to fix it is by allowing the agent to precommit. Then the assumption about Omega becomes empirically verifiable. But, if we do this, then RL with incomplete models can solve the problem again.

The only class of problems that I'm genuinely unsure how to deal with is game-theoretic superrationality. However, I also don't see much evidence the MIRIesque approach has succeeded on that front. We probably need to start with just solving the grain of truth problem in the sense of converging to ordinary Nash (or similar) equilibria (which might be possible using incomplete models). Later we can consider agents that observe each other's source code, and maybe something along the lines of this can apply.

I very much agree with the point about not decoupling learning and decision theory. I wrote a comment making somewhat similar points.

I'd like to understand this part.

I'm not sure I should find the precommitment solution satisfying. Won't it make some stupid precommitments early (before it has learned enough about the world to make reasonable precommitments) and screw itself up forever? Is there a generally applicable version of precommitments which ensures learning good behavior?

If we take the learning-theoretic view, then we get to bring in tools from iterated games. There's a Pavlov-like strategy for playing deterministic iterated games which converges to optimal responses to non-agentic environments and converges to Pareto optima for environments containing agents who use the Pavlov-like strategy. It is not the greatest at being unexploitable, and it also has fairly bad convergence.

However, I don't yet see how to translate the result to logical-induction type learners. Besides requiring deterministic payouts (a property which can probably be relaxed somehow), the algorithm requires an agent to have a definite history -- a well-defined training sequence. Agents based on logical induction are instead forming generalizations based on any sufficiently analogous situation within logic, so they don't have a well-defined history in the right way. (An actual instance of a logical induction agent has an actual temporal history, but this temporal history is not necessarily what it is drawing on to play the game -- it may have never personally encountered a similar situation.)

In other words, I'm hopeful that there could be a learning-theoretic solution, but I don't know what it is yet.

As for superrationality for agents w/o learning theory, there's cooperative oracles, right? We can make computable analogues with distributed oracles. It's not a real solution, specifically in that it ignores learning. So I sort of think we know how to do it in the "static" setting, but the problem is that we live in a learning-theoretic setting rather than a static-rationality setting.

Every round of IXB has the following structure:

Suppose that on every round, the termite infestation happens with probability p and its cost is c. Then, this fact corresponds to an incomplete model Φ1 (i.e., Φ1 says that regardless of whether the blackmail arrived and regardless of whether the blackmail was paid, the probability is p). Φ1 is incomplete because it doesn't say anything about whether blackmail arrives or not. If Φ1 is true, the agent can guarantee a payoff of −pc per round (by rejecting the blackmail). Therefore, if the agent has a learnable prior that includes Φ1, it will converge to a payoff which is no less than −pc. Of course achieving this payoff requires actually rejecting the blackmail.

This might seem surprising at first, because there is also a different incomplete model Φ2 that says "if you pay the blackmail, infestation will not happen". Φ2 is false if you use physical causal counterfactuals, but from the agent's perspective Φ2 is consistent with all observations. However, Φ2 only guarantees the payoff −c (because it is unknown whether the blackmail will arrive). Therefore, Φ2 will have no effect on the ultimate behavior of the agent.
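As a rough check of the −pc figure, the sketch below computes the exact expected per-round payoff of the two pure policies. It assumes the usual XOR blackmail setup: a letter arrives exactly when either there are termites and the agent would refuse, or there are no termites and the agent would pay. The blackmail price b and the numbers used are hypothetical.

```python
def expected_payoff(pays_when_blackmailed, p, c, b):
    """Expected per-round payoff of a fixed policy in one IXB round.

    p: probability of infestation, c: its cost, b: assumed blackmail price.
    """
    total = 0.0
    for termites, prob in ((True, p), (False, 1.0 - p)):
        # Assumed XOR rule: the letter is sent when exactly one of
        # "termites" and "agent would pay" holds.
        letter = termites != pays_when_blackmailed
        loss = c if termites else 0.0
        if letter and pays_when_blackmailed:
            loss += b  # the agent hands over the blackmail price
        total -= prob * loss
    return total


p, c, b = 0.1, 1000.0, 100.0
refuse = expected_payoff(False, p, c, b)
pay = expected_payoff(True, p, c, b)
print(refuse, pay)
```

Under these assumptions the refusing policy gets exactly −pc, while the paying policy gets −pc − (1−p)b, since paying never changes whether the termites are there.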

The precommitments have to expire after some finite time. Incidentally, the agent can still screw itself up forever if the environment contains traps, which is a separate problem.

Notice that I am not making the claim that any sophisticated agent has to be capable of precommitments. This is because I am not convinced that counterfactual mugging belongs to some class of problems that any sophisticated agent should be able to solve (the "fair" problems). Of course a sophisticated agent which suspects its environment contains this type of situation might want to create a descendant that is capable of precommitments.

Let me add that I am not even sure what the correct desiderata are. In particular, I don't think that we should expect any group of good agents to converge to a Pareto optimal outcome. IMO in general it is more reasonable to expect that they converge to a Nash equilibrium (or some refinement, like a proper equilibrium). If the setting is iterated, or if they see each other's source code, then some equilibria are Pareto optimal, and perhaps under some reasonable assumptions convergence to Pareto optimal outcomes should be overwhelmingly likely (because of something like, these outcomes have the largest basin of attraction).

What happens in ASP? (Say you're in an iterated Newcomb's problem with a predictor much slower than you, but which meets the LIC or similar.) I'm concerned that it will either settle on two-boxing, or possibly not settle on one strategy, since if it settles on two-boxing then a model which says "you can get the higher reward by one-boxing" (ie, the agent has control over the predictor) looks appealing; but, if it settles on one-boxing, a model which says "you can get higher reward by two-boxing" (ie, the agent's action doesn't control the predictor) looks appealing. This concern is related to the way asymptotic decision theory fails -- granted, for cases outside of its definition of "fair".

I agree that something like this generally does the right thing in most cases, with the exception of superrationality in games as a result of commitment races.

I still have a little hope that there will be a nice version, which doesn't involve a commitment-races problem and which doesn't make use of an arbitrary commitment cutoff. But I would agree that things don't look good, and so it is reasonable to put this kind of thing outside of "fair" problems.

I don't currently see why we shouldn't ask to converge to Pareto optima. Obviously, we can't expect to do so with arbitrary other agents; but it doesn't seem unreasonable to use an algorithm which has the property of reaching Pareto optima with other agents who use that same algorithm. This even seems reasonable in the standard iterated Nash picture (where not all strategies achieve Pareto optima, but there exist strategies which achieve Pareto optima with a broad-ish class of other strategies, including others who use strategies like their own -- while being very difficult to exploit).

But yeah, I'm pretty uncertain about what the desiderata should be -- both with respect to game theory, and with respect to scenarios which require updatelessness/precommitments in order to do well. I agree that it should all be approached with a learning-theoretic perspective.

I am not sure what you mean by "meets the LIC or similar" in this context. If we consider a predictor which is a learning algorithm in itself (i.e., it predicts by learning from the agent's past choices), then the agent will converge to one-boxing. This is because a weak predictor will be fully inside the agent's prior, so the agent will know that one-boxing for long enough will cause the predictor to fill the box. If we consider a predictor that just analyzes the agent's source code and ignores the agent's choices, the agent will converge to two-boxing.

I was never convinced that "logical ASP" is a "fair" problem. I once joked with Scott that we can consider a "predictor" that is just the single line of code "return DEFECT" but in the comments it says "I am defecting only because I know you will defect." It was a joke, but it was half-serious. The notion of "weak predictor" taken to the limit leads to absurdity, and if you don't take it to the limit it might still lead to absurdity. Logical inductors are one way to try specifying a "weak predictor", but I am not convinced that settings in which logic is inserted ad hoc should be made into desiderata.

I am not sure we need an arbitrary cutoff. There might be a good solution where the agent can dynamically choose any finite cutoff.

Maybe? The questions are, how robust is this cooperation (i.e. what counts as "same algorithm"), and whether there is a significant cost in other situations. And, on the philosophical-ish level, the question is whether such desiderata should be considered essential for rationality/intelligence. But, I agree that this is worth studying.

Yeah, that's what I meant.

Suppose the interval between encounters with the predictor is long enough that, due to the agent's temporal discounting, the immediate reward of two-boxing outweighs the later gains which one-boxing provides. In any specific encounter with the predictor, the agent may prefer to two-box, but prefer to have been the sort of agent who predictably one-boxes, and also prefer to pre-commit to one-box on the next example if a commitment mechanism exists. (This scenario also requires a carefully tuned strength for the predictor, of course.)
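One crude way to put numbers on "long enough": assume two-boxing now earns the small box immediately, and costs the big box at the next encounter k steps later, because the predictor has by then seen the two-boxing. With discount factor gamma, two-boxing wins once gamma^k times the big box falls below the small box. This is a toy model with hypothetical payoffs, not a description of the actual ASP setup.

```python
def min_interval_favoring_two_boxing(gamma, small, big):
    """Smallest gap k (in time steps) at which the immediate gain `small`
    exceeds the discounted future loss gamma**k * big.

    Toy assumption: the only consequence of two-boxing now is losing
    `big` at the next encounter, k steps later.
    """
    if not 0.0 < gamma < 1.0 or small <= 0:
        raise ValueError("need 0 < gamma < 1 and small > 0")
    k = 0
    while small <= (gamma ** k) * big:
        k += 1
    return k


k = min_interval_favoring_two_boxing(gamma=0.9, small=1_000, big=1_000_000)
print(k)
```

With a shallower discount (gamma closer to 1) the required gap grows without bound, which is one way to see why the choice of asymptotic matters in this exchange.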

But I wasn't sure this would be the result for your agent, since you described the agent using the hypothesis which gives the best picture about achievable utility.

As I discussed in Do Sufficiently Advanced Agents Use Logic, what I tend to think about is the case where the agent doesn't literally encounter the predictor repeatedly in its physical history. Instead, the agent must learn what strategy to use by reasoning about similar (but "smaller") scenarios. But we can get the same effect by assuming the temporal discounting is steep enough, as above.

Yeah, it is clear that there has to be a case where the predictor is so weak that the agent should not care. I'm fine with dropping the purely logical cases as desiderata in favor of the learning-theoretic versions. But, the ability to construct analogous problems for logic and for learning theory is notable. Paying attention to that analogy more generally seems like a good idea.

Yeah, I guess we can do a variety of things:

I suspect I want to evaluate a commitment scheme by asking whether it helps achieve a nice regret-bound notion, rather than defining the regret notion by evaluating regret-with-respect-to-making-commitments.

Thinking about LI policy selection where we choose a slow-growing function f(n) which determines how long we think before we choose the policy to follow on day n -- there's this weird trade-off between how (apparently) "good" the updatelessness is vs how long it takes to be any good at all. I'm fine with notions of rationality being parameterized by an ordinal or some such if it's just a choose-the-largest-number game. But in this case, choosing too slow-growing a function makes you worse off; so the fact that the rationality principle is parameterized (by the slow-growing function) is problematic. Choosing a commitment scheme seems similar.

So it would be nice to have a rationality notion which clarified this situation.

My main concern here is: the case for empirical updatelessness seems strong in realizable situations where the prior is meaningful. Things aren't as nice in the non-realizable cases such as logical uncertainty. But it doesn't make sense to abandon updateless principles altogether because of this!

Yes, but in this case the agent should two-box. This agent prefers sacrificing the future for short term gain, so that's what it does. Ofc if there is a way to precommit that is visible to the predictor it will take it: this way it enjoys the best of both worlds, getting two boxes and causing the predictor to cooperate on the next round.

Btw, I am always interested in the asymptotic when the time discount parameter goes to 1, rather than the asymptotic when time discount is fixed and time goes to ∞. That is, when I say the agent "converges" to something, I mean it in the former sense. Because, if the agent mostly cares only about the next thousand years, then it matters little how successful it is a million years from now. That is, the former asymptotic tracks the actual expected utility rather than some arbitrary portion thereof. (Also, I think it is interesting to consider what I dubbed "semi-anytime" algorithms: algorithms that receive a lower bound for the time discount parameter as input and guarantee a level of performance (usually regret) that improves as this lower bound grows. Even better are anytime algorithms: algorithms that don't depend on the time discount at all, but come with a performance guarantee that depends on the time discount. However, in settings such as Delegative RL it is impossible to have an anytime algorithm.)

Ok, sort of, but then this makes the discounting "wrong" in the same way that hyperbolic discounting is "wrong": dynamic inconsistency. One might then say such decision-theoretic questions reduce to a question of what the right way to discount is (constraints on discounting functions such that we rule out the ones which are "wrong"). I find this perspective somewhat plausible but not obviously right.

I wish I could get you to see the other possible perspective, in which there is something going wrong which is not about discounting behavior. IE, something closer to your remark about commitments. The two-boxer we are discussing can imagine the one-boxing strategy and can have regret with respect to it. Properly defining that notion of regret would allow a learner to one-box.

Maybe I'm still thinking about it the wrong way, but, I still am not convinced two-boxing is a legit values issue here. I think steep discounting doesn't justify two-boxing. Imagine that we let both the ASP predictor and the agent think for a while before the first round -- so the agent still has a lot more processing power than the predictor, but, the predictor is likely to get things right even on the first round. If the agent had been the sort of agent who would one-box, it would get higher reward on the first round. So it doesn't seem to me like the steep-discounting agent "really prefers to two-box".

The predictor can do well w/o any actual rounds being run because it is a logical inductor, so, is learning from closely analogous cases across logical possibility. Of course for this to make sense I have to assume the predictor has access to the agent's source code. I'm also imagining that the agent has the predictor's source code.

Ok. So in some sense you're thinking in the "there are incorrect discounting functions" way (ie, steeper is always worse than less steep).

But, what if this is the last newcomblike problem the agent can reasonably expect to be involved in? There will be no significant pressure from the future, so the agent will tend to defect/two-box/etc. But the predictor can see this coming (even with relatively little processing power, as in ASP). So the agent does worse overall than a (predictable) 1-boxer.

And, of course, if there's a predictable last interaction, the argument via pressure from the future tends to unwind so that actually there will be many rounds on which the gains from 1-boxing are lost rather than only one.

Look, a one-boxing agent would get less utility. Yes, it would have higher rewards on subsequent rounds, but the first round is more important, by the very assumption you made (steep time discount). Moreover, the two-boxing agent would get higher utility than any agent that one-boxes on some of the rounds, so on all rounds the correct action is two-box.

I don't think it makes sense to simultaneously assume that the agent has a lot more power than the predictor and that the predictor is likely to get things right even on the first round. If the agent has a lot more power than the predictor, then the agent can e.g. diagonalize against the predictor and make sure it will not get things right.

Once again, we need to ask how the agent knows the predictor gets things right on the first round. It needs to learn it somehow. For example, maybe it encounters many predictors one after the other. But then, it will again learn a model in which one-boxing causes the predictor to cooperate.

(Notice that having a perfect predictor requires that either (i) the agent is deterministic or (ii) the predictor has access to the agent's random bits. Option (i) is possible for the case of complete models (because there is a deterministic Bayes-optimal policy; to establish weak feasibility you can use PSRL + pseudorandom), which is sufficient for Newcomb's paradox. For incomplete models you usually want to randomize, although it is also possible to instead do deterministic maximin and optionally add an external random bit generator which is not a priori assumed to be invisible to the environment. In option (ii), you can consider a variant of Newcomb's where the predictor responds to the probability of the agent making a particular choice. This variant is more difficult for reasons similar to what happens with counterfactual mugging: it is hard for the agent to discover this behavior empirically. I think that, as opposed to counterfactual mugging, you can capture it by an incomplete model, although it will be a relatively complex, difficult to learn, model.)

Well, the agents I consider interesting start from a state of ignorance, so they have to learn. For this, the time discount has to be shallow, otherwise there is no time to learn.

I agree in the case where the predictor is only good at all later, ie, its initial guess is random/fixed. Which is consistent with the way I originally brought ASP up. In that case, it absolutely makes sense to 2-box if your discounting is steep enough.

But, deliberation takes time. We can think of logical inductors, or Skyrms-style deliberation setups (Skyrms, The Dynamics of Rational Deliberation). The agent has a certain amount of time to think. If it spends all that time calculating probabilities (as in LIDT), and only makes a decision at the end, then it makes sense to say you can't successfully 1-box in a way which makes you predictably do so. But if you model the decision as being made over an amount of time, then you could possibly make the decision early on in your deliberation, so that the predictor can see. So, we can think of something like a commitment which you can make while still refining probabilities.

Again, I agree this doesn't make sense if the predictor is absolutely immovable (only takes our choice as evidence to refine its prediction in later rounds). But if the predictor is something like a logical inductor, then even if it runs much slower than the agent, it can "see" some of the early things the agent does.

Of course, there is the question of how you would know that making an early commitment rather than thinking longer would be a good strategy. I'm imagining they get each other's source code. We can think about it in a learning-theoretic way as follows. Suppose I want good behavior on problems where two AIs exchange source code and then have to play a game (with a certain amount of time to think before they have to act). The system can do anything with its time; think longer, make early commitments, whatever.

Actually, let's simplify a little: since the predictor is not fully an agent, we can just think in terms of building a system which is given source code which outputs the end utility (so the source code includes the predictor, the agent itself, everything). I want a system which does well on such problems if it has enough time to think. In order to get that, what I'm going to do is have it spend the first fraction of its thinking time training itself on smaller problems sampled at random. (It could and should choose instances intelligently, but let's say it doesn't.) It treats these entirely episodically; in each instance it only tries to maximize the score in that one case.

Nonetheless, it can eventually learn to make its decision early in cases where it is beneficial to be predictable.

This is all supposed to be a model of what our agent can do in order to come to a decision in ASP. It can think about similar problems. It can notice a trend where it's useful to make certain decisions early. Sure, it might figure that out too late, in which case the best it can do is probably 2-box. But it might also figure it out in time.

It can do that, which would make it unpredictable. However, it can also not do that. In this case it has no special motivation to do that, so there's no reason to, which means there's no reason for the predictor to expect that.

Not sure whether this was a point you were trying to make, but, I agree that in the case of a perfect predictor learning theory does a fine job of 1-boxing regardless of temporal discounting or any commitment mechanisms.

I'm worrying that the overarching thread may get lost. I think this point about ASP is an important sub-point, for a few reasons, but I hope the discussion doesn't dead-end here. My picture of where we're at:

I've gravitated to ASP partly because I hope it illustrates something about how I see logic as relevant to a learning-theoretic approach to decision theory. But I'm still interested more broadly in what class of decision problems you think you can do well on, with and without commitment mechanisms.

Yes, sounds sort of reasonable. Here is how I think you can realize this using TRL.

As usual, we consider the agent playing an IPD against a predictor (Newcomb's paradox is essentially playing the Prisoner's Dilemma against a "FairBot"). On each round, the predictor gets to see the agent's state at the start of the round. (The state can be considered part of the "source code". For randomizing agents, we also assume the predictor sees the random bits). The predictor then tries to simulate the agent (we assume it knows the rest of the agent's source code as well), and is successful if the agent doesn't execute any programs that are too expensive for the predictor (for the sake of simplicity, assume that no program started on one round continues running during following rounds: I don't think that this assumption makes a difference of principle). Otherwise, the prediction might be wrong (for example, we can assume it defaults to D). The predictor then plays D or C according to its prediction of the agent's action.

In this setting, the agent can learn the incomplete hypothesis "if I don't run expensive programs and I play C, the predictor will also play C". (We assume that the prior allows for side effects of executing programs. Such a prior seems more realistic anyway, and in particular is required to counter non-Cartesian daemons. However, it also has a cost, so perhaps what we really want is a prior that is biased towards few side effects: but, this is immaterial for the current discussion.) This hypothesis guarantees a payoff of U(CC). Assuming that the predictor cannot be exploited, this is the best possible payoff and therefore the agent will converge to cooperation.

We might want a more explicit realization of the "simulates" part in "agent simulates predictor". For this, we can assume the agent also receives its own state as an observation (but, I'm not sure how generally useful this is for realistic agents). The agent can then also learn the incomplete hypothesis describing the exact function from agent states to predictor output. However, this hypothesis doesn't affect the outcome: it doesn't predict the agent's state and therefore can only guarantee the payoff U(DD).
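A single-round toy version of this setting can be written out directly. It is only a sketch under stated assumptions: the payoff numbers are the conventional Prisoner's Dilemma values (hypothetical here), the predictor mirrors the agent whenever simulation succeeds, and it defaults to D whenever the agent runs an expensive program. It does not model the learning process, only the payoff each fixed policy would receive.

```python
# Agent's payoff, indexed by (agent action, predictor action). Hypothetical values.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}


def predictor_move(runs_expensive, agent_action):
    """Mirror the agent if simulation succeeds, otherwise default to D."""
    if runs_expensive:
        return "D"  # assumed fallback when the agent is too costly to simulate
    return agent_action


def agent_payoff(runs_expensive, agent_action):
    return PAYOFF[(agent_action, predictor_move(runs_expensive, agent_action))]


policies = [(expensive, action) for expensive in (False, True) for action in "CD"]
payoffs = {policy: agent_payoff(*policy) for policy in policies}
best = max(policies, key=payoffs.get)
print(payoffs, best)
```

Under these assumptions, staying cheap and cooperating is the only policy that reaches U(CC); running an expensive program never helps, because the predictor's fallback removes any chance of getting U(DC).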

I'm leaning this way as well, but I think it's an important clue to figuring out commitment races. ASP Predictor, DefectBot, and a more general agent will make different commitments, and these things are already algorithms specialized for certain situations. How is the chosen commitment related to what the thing making the commitment is?

When an agent can manipulate a predictor in some sense, what should the predictor do? If it starts scheming with its thoughts, it's no longer a predictor, it's just another agent that wants to do something "predictory". Maybe it can only give up, as in ASP, which acts as a precommitment that's more thematically fitting for a predictor than for a general agent. It's still a commitment race then, but possibly the meaning of something being a predictor is preserved by restricting the kind of commitment that it makes: the commitment of a non-general agent is what it is rather than what it does, and a general agent is only committed to its preference. Thus a general agent loses all knowledge in an attempt to out-commit others, because it hasn't committed to that knowledge, didn't make it part of what it is.

I had an interesting thought that made me update towards your position. There is my old post about "metathreat equilibria", an idea I developed with Scott's help during my trip to Los Angeles. So, I just realized the same principle can be realized in the setting of repeated games. In particular, I am rather confident that the following, if formulated a little more rigorously, is a theorem:

Consider the Iterated Prisoner's Dilemma in which strategies are constrained to depend only on the action of the opponent in the previous round. Then, at the limit γ→1 (shallow time discount), the only thermodynamic equilibrium is mutual cooperation.

Then, there is the question of how to generalize it. Obviously we want to consider more general games and more general strategies, but allowing fully general strategies probably won't work (just like allowing arbitrary programs doesn't work in the "self-modification" setting). One obvious generalization is allowing the strategy to depend on some finite suffix of the history. But, this isn't natural in the context of agent theory: why would agents forget everything that happened before a certain time? Instead, we can constrain the strategy to be finite-state, and maybe require it to be communicating (i.e. forbid "grim triggers" where players change their behavior forever). On the game side, we can consider arbitrary repeated games, or communicating stochastic games (they have to be communicating because otherwise we can represent one-shot games), or even communicating partially observable stochastic games. This leads me to the following bold conjecture:

Consider any suitable (as above) game in which strategies are constrained to be (communicating?) finite-state. Then, at the limit γ→1, all thermodynamic equilibria are Pareto efficient.

This would be the sort of result that seems like it should be applicable to learning agents. That is, if the conjecture is true, there is good hope that appropriate learning agents are guaranteed to converge to Pareto efficient outcomes. Even if the conjecture requires some moderately strong additional assumptions, it seems worth studying.
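For readers who want to see what the restricted strategy space in the first statement looks like, here is a small sketch of deterministic strategies that depend only on the opponent's previous action, played against each other in the Iterated Prisoner's Dilemma. It does not compute thermodynamic equilibria, and the payoff values, the strategy encoding and the use of a plain per-round average (standing in for the γ→1 limit) are all assumptions made for illustration.

```python
# (row payoff, column payoff) for each pair of moves. Hypothetical values.
PAYOFF = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


def play(strategy_a, strategy_b, rounds=1000):
    """Average per-round payoffs of two strategies.

    Each strategy is a tuple (first_move, reply_to_C, reply_to_D), so after
    the first round it depends only on the opponent's previous action.
    """
    move_a, move_b = strategy_a[0], strategy_b[0]
    total_a = total_b = 0
    for _ in range(rounds):
        pay_a, pay_b = PAYOFF[(move_a, move_b)]
        total_a += pay_a
        total_b += pay_b
        next_a = strategy_a[1] if move_b == "C" else strategy_a[2]
        next_b = strategy_b[1] if move_a == "C" else strategy_b[2]
        move_a, move_b = next_a, next_b
    return total_a / rounds, total_b / rounds


TIT_FOR_TAT = ("C", "C", "D")
ALWAYS_DEFECT = ("D", "D", "D")

print(play(TIT_FOR_TAT, TIT_FOR_TAT))
print(play(ALWAYS_DEFECT, TIT_FOR_TAT))
```

In this toy run, two copies of tit-for-tat sustain mutual cooperation, while always-defect gains on the first round only and then averages close to the mutual-defection payoff.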

Pardon me, but I removed your first link, because it linked to a document not about alignment and said at the top that it shouldn’t be shared on the public internet. I think you used the wrong link. Apologies if that was a mistake.

Added: Have PM’d you the link.

Ahh thanks :p fixed

Yup, like Caspar, I think that model-free RL learns the EDT policy in most/all situations. I'm not sure what you mean by it looking like CDT.

Isn't it the other way around? The one-boxer gets more money, but gives in to blackmail, and therefore gets blackmailed in the first place.

RL is CDT in the sense that, your model of the world consists of actions and observations, and some causal link from past actions and observations to current observations, but there is no causal origin to the actions. The actions are just set by the agent to whatever it wants.

And, yes, I got CDT and EDT flipped there, good catch!

Utilising a model that assumes we can decouple is different from assuming we can decouple. For example, calculus assumes that space is infinitely divisible, but using a formula derived from calculus to calculate the volume of a sphere doesn't require you to assume space is infinitely divisible. It just has to work as an approximation.

I'm pretty confident in my forgetting approach to logical counterfactuals, though I've sadly been too busy the last few months to pick it up again. I haven't quite formalised it yet - I'm planning to try to collect all the toy problems and I think it'll likely drop out fairly quickly that there are a few distinct, but closely related notions of logical counterfactual. Anyway, I think I've finally figured out how to more precisely state my claim that we ought to build upon consistent counterfactuals.