Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Formalising decision theory is hard



Heterodox opinion: I think the entire MIRIesque (*and* academic philosophy) approach to decision theory is confused. The basic assumption seems to be that we can decouple the problem of learning a model of the world from the problem of taking a decision given such a model. We then ignore the first problem, and assume a particular shape for the model (for example, a causal network) which allows us to consider decision theories such as CDT, EDT etc. However, in reality the two problems cannot be decoupled. This is because the *type signature* of a world model is only meaningful if it comes with an algorithm for how to learn a model of this type.

For example, consider Newcomb's paradox. The agent makes a decision under the assumption that Omega behaves in a certain way. But, where did the assumption come from? Realistic agents have to *learn* everything they know. Learning normally requires a time sequence. For example, we can consider the *iterated* Newcomb's paradox (INP). In INP, any reinforcement learning (RL) algorithm will converge to one-boxing, simply because one-boxing gives it the money. This is despite RL naively looking like CDT. Why does it happen? Because in the learned model, the "causal" relationships are *not* physical causality. The agent comes to believe that taking the one box *causes* the money to appear there.
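To make the INP claim concrete, here is a minimal sketch rather than anything from the original discussion. It assumes an epsilon-greedy bandit learner, a predictor that foresees each round's action perfectly, and the conventional payoffs of 1,000,000 and 1,000. The function name and all parameters are hypothetical.

```python
import random

def iterated_newcomb(rounds=2000, epsilon=0.1, seed=0):
    """Epsilon-greedy bandit playing iterated Newcomb against an accurate predictor."""
    rng = random.Random(seed)
    actions = ["one_box", "two_box"]
    value = {a: 0.0 for a in actions}   # running average reward per action
    count = {a: 0 for a in actions}
    history = []

    for t in range(rounds):
        if t < len(actions):
            action = actions[t]                    # try each action once
        elif rng.random() < epsilon:
            action = rng.choice(actions)           # explore
        else:
            action = max(actions, key=value.get)   # exploit

        # Assumed predictor: it foresees this round's action perfectly, so the
        # opaque box is full exactly when the agent one-boxes.
        opaque = 1_000_000 if action == "one_box" else 0
        transparent = 1_000 if action == "two_box" else 0
        reward = opaque + transparent

        count[action] += 1
        value[action] += (reward - value[action]) / count[action]
        history.append(action)

    return value, history

value, history = iterated_newcomb()
late = history[-1000:]
print(value)
print(late.count("one_box") / len(late))
```

The learner never reasons about physical causality. It only records that one-boxing is followed by more money, which is the sense in which the learned "causal" relationship differs from the physical one.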

In Newcomb's paradox EDT succeeds but CDT fails. Let's consider an example where CDT succeeds and EDT fails: the XOR blackmail. The iterated version would be IXB. In IXB, classical RL doesn't guarantee much because the environment is more complex than the agent (it contains Omega). To overcome this, we can use RL with incomplete models. I believe that this indeed solves both INP and IXB.

Then we can consider e.g. counterfactual mugging. In counterfactual mugging, RL with incomplete models doesn't work. That's because the assumption that Omega responds in a way that depends on a counterfactual world is not in the space of models at all. Indeed, it's unclear how any agent can learn such a fact from empirical observations. One way to fix it is by allowing the agent to precommit. Then the assumption about Omega becomes empirically verifiable. But, if we do this, then RL with incomplete models can solve the problem again.

The only class of problems that I'm genuinely unsure how to deal with is game-theoretic superrationality. However, I also don't see much evidence the MIRIesque approach has succeeded on that front. We probably need to start with just solving the grain of truth problem in the sense of converging to ordinary Nash (or similar) equilibria (which might be possible using incomplete models). Later we can consider agents that observe each other's source code, and maybe something along the lines of this can apply.

I very much agree with the point about not decoupling learning and decision theory. I wrote a comment making somewhat similar points.

I believe that this indeed solves both INP and IXB.

I'd like to understand this part.

One way to fix it is by allowing the agent to precommit. Then the assumption about Omega becomes empirically verifiable.

I'm not sure I should find the precommitment solution satisfying. Won't it make some stupid precommitments early (before it has learned enough about the world to make reasonable precommitments) and screw itself up forever? Is there a generally applicable version of precommitments which ensures learning good behavior?

The only class of problems that I'm genuinely unsure how to deal with is game-theoretic superrationality.

If we take the learning-theoretic view, then we get to bring in tools from iterated games. There's a Pavlov-like strategy for playing deterministic iterated games which converges to optimal responses to non-agentic environments and converges to Pareto optima for environments containing agents who use the Pavlov-like strategy. It is not the greatest at being unexploitable, and it also has fairly bad convergence.
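As a rough illustration of the self-play half of that claim, the sketch below uses the classic win-stay, lose-shift ("Pavlov") rule for the Prisoner's Dilemma. This is an assumption standing in for the strategy mentioned above, which may differ. It shows only that two copies of the rule reach mutual cooperation from any starting pair of moves. It does not show optimal responses to non-agentic environments.

```python
def pavlov(my_last, their_last):
    """Win-stay, lose-shift: cooperate iff both players made the same move last round."""
    return "C" if my_last == their_last else "D"

def self_play(start, rounds=5):
    """Run two Pavlov players against each other from a given pair of opening moves."""
    a, b = start
    trace = [(a, b)]
    for _ in range(rounds):
        a, b = pavlov(a, b), pavlov(b, a)   # both update simultaneously
        trace.append((a, b))
    return trace

for start in [("C", "C"), ("C", "D"), ("D", "C"), ("D", "D")]:
    print(start, "->", self_play(start)[-1])
```

Note that the rule needs a definite previous round to react to, which is the "well-defined history" requirement discussed next.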

However, I don't yet see how to translate the result to logical-induction type learners. Besides requiring deterministic payouts (a property which can probably be relaxed somehow), the algorithm requires an agent to have a definite history -- a well-defined training sequence. Agents based on logical induction are instead forming generalizations based on any sufficiently analogous situation within logic, so they don't have a well-defined history in the right way. (An actual instance of a logical induction agent has an actual temporal history, but this temporal history is not necessarily what it is drawing on to play the game -- it may have never personally encountered a similar situation.)

In other words, I'm hopeful that there could be a learning-theoretic solution, but I don't know what it is yet.

As for superrationality for agents w/o learning theory, there's cooperative oracles, right? We can make computable analogues with distributed oracles. It's not a real solution, *specifically in that* it ignores learning. So I sort of think we know how to do it in the "static" setting, but the problem is that we live in a learning-theoretic setting rather than a static-rationality setting.

I believe that this indeed solves both INP and IXB.

I'd like to understand this part.

Every round of IXB has the following structure:

- Blackmail either arrives or not (observation)
- Agent either pays the blackmail or not (action)
- Infestation either happens or not (observation)

Suppose that on every round, the termite infestation happens with probability p and its cost is c. Then, this fact corresponds to an incomplete model (i.e., the model says that regardless of whether blackmail arrived and regardless of whether the blackmail was paid, the infestation probability is p). This model is incomplete because it doesn't say anything about whether blackmail arrives or not. If the model is true, the agent can guarantee a payoff of −pc per round (by rejecting the blackmail). Therefore, if the agent has a learnable prior that includes this model, it will converge to a payoff which is no less than −pc. Of course achieving this payoff requires actually rejecting the blackmail.

This might seem surprising at first, because there is also a different incomplete model Φ that says "if you pay the blackmail, infestation will not happen". Φ is false if you use physical causal counterfactuals, but from the agent's perspective Φ is consistent with all observations. However, Φ only guarantees the payoff −c (because it is unknown whether the blackmail will arrive). Therefore, Φ will have no effect on the ultimate behavior of the agent.
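The two guarantees can be checked with a small maximin calculation. This is only a sketch under stated assumptions. Each incomplete model is represented as a finite set of environments consistent with it. Whatever the model leaves open is chosen adversarially from a coarse grid, which suffices because payoffs are linear in those quantities. The blackmail price `B` is an extra hypothetical parameter that the discussion above does not specify.

```python
from itertools import product

# Infestation probability, infestation cost, and an assumed blackmail price.
P, C, B = 0.25, 100.0, 10.0
GRID = [0.0, 0.5, 1.0]   # adversarial choices for whatever a model leaves open

def expected_payoff(pay, q_blackmail, infest_if_paid, infest_otherwise):
    """Expected per-round payoff; `pay` means the policy pays whenever blackmailed."""
    if pay:
        blackmailed = -B - infest_if_paid * C
    else:
        blackmailed = -infest_otherwise * C
    not_blackmailed = -infest_otherwise * C
    return q_blackmail * blackmailed + (1 - q_blackmail) * not_blackmailed

def guarantee(environments):
    """Maximin: best policy against the worst environment consistent with a model."""
    return max(
        min(expected_payoff(pay, *env) for env in environments)
        for pay in (False, True)
    )

# Model 1: infestation probability is P no matter what; blackmail arrival is open.
model_fixed_rate = [(q, P, P) for q in GRID]
# Model 2: paying prevents infestation; everything else is open.
model_pay_prevents = [(q, 0.0, r) for q, r in product(GRID, GRID)]

print(guarantee(model_fixed_rate))    # equals -P * C
print(guarantee(model_pay_prevents))  # equals -C
```

Since the first guarantee is the larger of the two, an agent that secures the best guaranteed payoff among its learned models ends up rejecting the blackmail.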

I'm not sure I should find the precommitment solution satisfying. Won't it make some stupid precommitments early (before it has learned enough about the world to make reasonable precommitments) and screw itself up forever? Is there a generally applicable version of precommitments which ensures learning good behavior?

The precommitments have to expire after some finite time. Incidentally, the agent can still screw itself up forever if the environment contains traps, which is a separate problem.

Notice that I am *not* making the claim that any sophisticated agent has to be capable of precommitments. This is because I am not convinced that counterfactual mugging belongs to some class of problems that any sophisticated agent should be able to solve (the "fair" problems). Of course a sophisticated agent which suspects its environment contains this type of situation might want to create a descendant that *is* capable of precommitments.

The only class of problems that I'm genuinely unsure how to deal with is game-theoretic superrationality.

Let me add that I am not even sure what the correct desiderata are. In particular, I don't think that we should expect any group of good agents to converge to a Pareto optimal outcome. IMO in general it is more reasonable to expect that they converge to a Nash equilibrium (or some refinement, like a proper equilibrium). If the setting is iterated, or if they see each other's source code, then some equilibria are Pareto optimal, and perhaps under some reasonable assumptions convergence to Pareto optimal outcomes should be overwhelmingly likely (because of something like, these outcomes have the largest basin of attraction).

This might seem surprising at first, because there is also a different incomplete model Φ that says "if you pay the blackmail, infestation will not happen". Φ is false if you use physical causal counterfactuals, but from the agent's perspective Φ is consistent with all observations. However, Φ only guarantees the payoff −c (because it is unknown whether the blackmail will arrive). Therefore, Φ will have no effect on the ultimate behavior of the agent.

What happens in ASP? (Say you're in an iterated Newcomb's problem with a predictor much slower than you, but which meets the LIC or similar.) I'm concerned that it will either settle on two-boxing, or possibly not settle on one strategy, since if it settles on two-boxing then a model which says "you can get the higher reward by one-boxing" (ie, the agent has control over the predictor) looks appealing; but, if it settles on one-boxing, a model which says "you can get higher reward by two-boxing" (ie, the agent's action *doesn't* control the predictor) looks appealing. This concern is related to the way asymptotic decision theory fails -- granted, for cases outside of its definition of "fair".

The precommitments have to expire after some finite time.

I agree that something like this generally does the right thing in most cases, with the exception of superrationality in games as a result of commitment races.

I still have a little hope that there will be a nice version, which doesn't involve a commitment-races problem and which doesn't make use of an arbitrary commitment cutoff. But I would agree that things don't look good, and so it is reasonable to put this kind of thing outside of "fair" problems.

Let me add that I am not even sure what the correct desiderata are. In particular, I don't think that we should expect any group of good agents to converge to a Pareto optimal outcome.

I don't currently see why we shouldn't ask to converge to pareto optima. Obviously, we can't expect to do so with arbitrary other agents; but it doesn't seem unreasonable to use an algorithm which has the property of reaching pareto-optima with other agents who use that same algorithm. This even seems reasonable in the standard iterated Nash picture (where not all strategies achieve pareto optima, but there exist strategies which achieve pareto optima with a broad-ish class of other strategies, including others who use strategies like their own -- while being very difficult to exploit).

But yeah, I'm pretty uncertain about what the desiderata should be -- both with respect to game theory, and with respect to scenarios which require updatelessness/precommitments in order to do well. I agree that it should all be approached with a learning-theoretic perspective.

What happens in ASP? (Say you're in an iterated Newcomb's problem with a predictor much slower than you, but which meets the LIC or similar.)

I am not sure what you mean by "meets the LIC or similar" in this context. If we consider a predictor which is a learning algorithm in itself (i.e., it predicts by learning from the agent's past choices), then the agent will converge to one-boxing. This is because a weak predictor will be fully inside the agent's prior, so the agent will know that one-boxing for long enough will cause the predictor to fill the box. If we consider a predictor that just analyzes the agent's source code and ignores the agent's choices, the agent will converge to two-boxing.

I was never convinced that "logical ASP" is a "fair" problem. I once joked with Scott that we can consider a "predictor" that is just the single line of code "return DEFECT" but in the comments it says "I am defecting only because I *know* you will defect." It was a joke, but it was half-serious. The notion of "weak predictor" taken to the limit leads to absurdity, and if you don't take it to the limit it might still lead to absurdity. Logical inductors are one way to try specifying a "weak predictor", but I am not convinced that settings in which logic is inserted ad hoc should be made into desiderata.

I still have a little hope that there will be a nice version, which doesn't involve a commitment-races problem and which doesn't make use of an arbitrary commitment cutoff.

I am not sure we need an *arbitrary* cutoff. There might be a good solution where the agent can dynamically choose any finite cutoff.

...it doesn't seem unreasonable to use an algorithm which has the property of reaching pareto-optima with other agents who use that same algorithm.

Maybe? The questions are, how robust is this cooperation (i.e. what counts as "same algorithm"), and whether there is a significant cost in other situations. And, on the philosophical-ish level, the question is whether such desiderata should be considered essential for rationality/intelligence. But, I agree that this is worth studying.

I am not sure what you mean by "meets the LIC or similar" in this context. If we consider a predictor which is a learning algorithm in itself (i.e., it predicts by learning from the agent's past choices),

Yeah, that's what I meant.

, then the agent will converge to one-boxing. This is because a weak predictor will be fully inside the agent's prior, so the agent will know that one-boxing for long enough will cause the predictor to fill the box.

Suppose the interval between encounters with the predictor is long enough that, due to the agent's temporal discounting, the immediate reward of two-boxing outweighs the later gains which one-boxing provides. In any specific encounter with the predictor, the agent may prefer to two-box, yet prefer to have been the sort of agent who predictably one-boxes, and also prefer to pre-commit to one-box on the next example if a commitment mechanism exists. (This scenario also requires a carefully tuned strength for the predictor, of course.)

But I wasn't sure this would be the result for your agent, since you described the agent using the hypothesis which gives the best picture about achievable utility.

As I discussed in Do Sufficiently Advanced Agents Use Logic, what I tend to think about is the case where the agent doesn't literally encounter the predictor repeatedly in its physical history. Instead, the agent must learn what strategy to use by reasoning about similar (but "smaller") scenarios. But we can get the same effect by assuming the temporal discounting is steep enough, as above.

I was never convinced that "logical ASP" is a "fair" problem. I once joked with Scott that we can consider a "predictor" that is just the single line of code "return DEFECT" but in the comments it says "I am defecting only because I *know* you will defect." It was a joke, but it was half-serious. The notion of "weak predictor" taken to the limit leads to absurdity, and if you don't take it to the limit it might still lead to absurdity. Logical inductors are one way to try specifying a "weak predictor", but I am not convinced that settings in which logic is inserted ad hoc should be made into desiderata.

Yeah, it is clear that there has to be a case where the predictor is so weak that the agent should not care. I'm fine with dropping the purely logical cases as desiderata in favor of the learning-theoretic versions. But, the ability to construct analogous problems for logic and for learning theory is notable. Paying attention to that analogy more generally seems like a good idea.

I am not sure we need an *arbitrary* cutoff. There might be a good solution where the agent can dynamically choose any finite cutoff.

Yeah, I guess we can do a variety of things:

- Naming a time limit for the commitment.
- Naming a time at which a time limit for the commitment will be named.
- Naming an ordinal (in some ordinal notation), so that a smaller ordinal must be named every time-step, until a smaller ordinal cannot be named, at which point the commitment runs out.
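The third option can be illustrated with ordinals below ω², which is only one possible choice of ordinal notation. In this hedged sketch an ordinal ω·a + b is stored as a pair `(a, b)`, and the function name and the `refill` callback are hypothetical. The agent may postpone expiry by an arbitrary finite amount each time it crosses a limit ordinal, yet the commitment is still guaranteed to run out.

```python
def run_commitment(start, refill):
    """Count down a commitment labelled by an ordinal below omega**2.

    The ordinal omega*a + b is stored as the pair (a, b). Each time step the
    agent must name a strictly smaller ordinal; `refill(step)` is the natural
    number it picks whenever it drops from (a, 0) to (a - 1, k).
    """
    a, b = start
    named = [(a, b)]
    step = 0
    while (a, b) != (0, 0):
        step += 1
        if b > 0:
            b -= 1                      # successor ordinal: this sketch steps down by one
        else:
            a, b = a - 1, refill(step)  # limit ordinal: any finite k is allowed
        named.append((a, b))
    return named

trace = run_commitment((2, 1), refill=lambda step: 3)
print(trace)
print(len(trace) - 1, "steps before the commitment expires")
```

The first two options in the list correspond to starting at `(0, n)` and `(1, 0)` respectively. Larger ordinals generalize the idea of "naming a time at which a time limit will be named".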

I suspect I want to evaluate a commitment scheme by asking whether it helps achieve a nice regret-bound notion, rather than defining the regret notion by evaluating regret-with-respect-to-making-commitments.

Thinking about LI policy selection where we choose a slow-growing function *f(n)* which determines how long we think before we choose the policy to follow on day *n*

So it would be nice to have a rationality notion which clarified this situation.

My main concern here is: the case for empirical updatelessness seems strong in realizable situations where the prior is meaningful. Things aren't as nice in the non-realizable cases such as logical uncertainty. But it doesn't make sense to abandon updateless principles altogether because of this!

Suppose the interval between encounters with the predictor is long enough that, due to the agent's temporal discounting, the immediate reward of two-boxing outweighs the later gains which one-boxing provides. In any specific encounter with the predictor, the agent may prefer to two-box, yet prefer to have been the sort of agent who predictably one-boxes, and also prefer to pre-commit to one-box on the next example if a commitment mechanism exists. (This scenario also requires a carefully tuned strength for the predictor, of course.)

Yes, but in this case the agent *should* two-box. This agent prefers sacrificing the future for short term gain, so that's what it does. Ofc if there is a way to precommit that is visible to the predictor it will take it: this way it enjoys the best of both worlds, getting two boxes *and* causing the predictor to cooperate on the next round.

Btw, I am always interested in the asymptotic when the *time discount parameter* goes to 1, rather than the asymptotic when time discount is fixed and time goes to ∞. That is, when I say the agent "converges" to something, I mean it in the former sense. Because, if the agent mostly cares only about the next thousand years, then it matters little how successful it is a million years from now. That is, the former asymptotic tracks the actual expected utility rather than some arbitrary portion thereof. (Also, I think it is interesting to consider what I dubbed "semi-anytime" algorithms: algorithms that receive a *lower bound* for the time discount parameter as input and guarantee a level of performance (usually regret) that improves as this lower bound grows. Even better are anytime algorithms: algorithms that don't depend on the time discount at all, but come with a performance guarantee that depends on the time discount. However, in settings such as Delegative RL it is impossible to have an anytime algorithm.)

Yes, but in this case the agent *should* two-box. This agent prefers sacrificing the future for short term gain, so that's what it does. Ofc if there is a way to precommit that is visible to the predictor it will take it: this way it enjoys the best of both worlds, getting two boxes *and* causing the predictor to cooperate on the next round.

Ok, sort of, but then this makes the discounting "wrong" in the same way that hyperbolic discounting is "wrong": dynamic inconsistency. One might then say such decision-theoretic questions reduce to a question of what the right way to discount is (constraints on discounting functions such that we rule out the ones which are "wrong"). I find this perspective somewhat plausible but not obviously right.

I wish I could get you to see the other possible perspective, in which there is something going wrong which is not about discounting behavior. IE, something closer to your remark about commitments. The two-boxer we are discussing can imagine the one-boxing strategy and can have regret with respect to it. Properly defining that notion of regret would allow a learner to one-box.

Maybe I'm still thinking about it the wrong way, but, I still am not convinced two-boxing is a legit values issue here. I think steep discounting doesn't justify two-boxing. Imagine that we let both the ASP predictor and the agent think for a while before the first round -- so the agent still has a lot more processing power than the predictor, but, the predictor is likely to get things right even on the first round. If the agent had been the sort of agent who would one-box, it would get higher reward on the first round. So it doesn't seem to me like the steep-discounting agent "really prefers to two-box".

The predictor can do well w/o any actual rounds being run because it is a logical inductor, so, is learning from closely analogous cases across logical possibility. Of course for this to make sense I have to assume the predictor has access to the agent's source code. I'm also imagining that the agent has the predictor's source code.

Btw, I am always interested in the asymptotic when the *time discount parameter* goes to 1, rather than the asymptotic when time discount is fixed and time goes to ∞.

Ok. So in some sense you're thinking in the "there are incorrect discounting functions" way (ie, steeper is always worse than less steep).

But, what if this is the last newcomblike problem the agent can reasonably expect to be involved in? There will be no significant pressure from the future, so the agent will tend to defect/two-box/etc. But the predictor can see this coming (even with relatively little processing power, as in ASP). So the agent does worse overall than a (predictable) 1-boxer.

And, of course, if there's a predictable last interaction, the argument via pressure from the future tends to unwind so that actually there will be many rounds on which the gains from 1-boxing are lost rather than only one.

Yes, but in this case the agent *should* two-box.

Ok, sort of, but then this makes the discounting "wrong" in the same way that hyperbolic discounting is "wrong": dynamic inconsistency... I wish I could get you to see the other possible perspective, in which there is something going wrong which is not about discounting behavior... The two-boxer we are discussing can imagine the one-boxing strategy and can have regret with respect to it.

Look, a one-boxing agent would get *less utility*. Yes, it would have higher rewards on subsequent rounds, but the first round is more important, *by the very assumption you made* (steep time discount). Moreover, the two-boxing agent would get higher utility than any agent that one-boxes on some of the rounds, so on *all* rounds the correct action is two-box.
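The dependence on the discount rate can be seen in a toy calculation. The sketch below is an illustration under assumptions not stated in the thread. The predictor's first guess is two-boxing, it then predicts whatever the agent did on the previous round, the payoffs are 1 for the transparent box and 10 for the opaque one, and discounting is geometric with factor `gamma`.

```python
def discounted_utility(policy, gamma, rounds=200, small=1.0, big=10.0):
    """Geometrically discounted utility against a predictor that copies the last action.

    Assumptions of this sketch: the predictor's first guess is "two_box", and
    afterwards it predicts whatever the agent did on the previous round.
    """
    prediction = "two_box"
    total = 0.0
    for t in range(rounds):
        action = policy(t)
        reward = big if prediction == "one_box" else 0.0   # opaque box
        if action == "two_box":
            reward += small                                 # transparent box
        total += gamma ** t * reward
        prediction = action
    return total

always_one = lambda t: "one_box"
always_two = lambda t: "two_box"

for gamma in (0.05, 0.9):
    print(gamma, discounted_utility(always_one, gamma), discounted_utility(always_two, gamma))
```

With these particular numbers the crossover sits near `gamma = small / big = 0.1`. Below it, always two-boxing has the higher discounted utility. Above it, always one-boxing wins.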

Imagine that we let both the ASP predictor and the agent think for a while before the first round -- so the agent still has a lot more processing power than the predictor, but, the predictor is likely to get things right even on the first round... The predictor can do well w/o any actual rounds being run because it is a logical inductor, so, is learning from closely analogous cases across logical possibility. Of course for this to make sense I have to assume the predictor has access to the agent's source code. I'm also imagining that the agent has the predictor's source code.

I don't think it makes sense to simultaneously assume that the agent has a lot more power than the predictor and that the predictor is likely to get things right even on the first round. If the agent has a lot more power than the predictor, then the agent can e.g. diagonalize against the predictor and make sure it will *not* get things right.

Once again, we need to ask *how* does the agent know the predictor gets things right on the first round. It needs to learn it somehow. For example, maybe it encounters many predictors one after the other. But then, it will again learn a model in which one-boxing *causes* the predictor to cooperate.

(Notice that having a perfect predictor requires that either (i) the agent is deterministic or (ii) the predictor has access to the agent's random bits. Option (i) is possible for the case of complete models (because there is a deterministic Bayes-optimal policy; to establish weak feasibility you can use PSRL + pseudorandom), which is sufficient for Newcomb's paradox. For incomplete models you usually want to randomize, although it is also possible to instead do deterministic maximin and optionally add an external random bit generator which is not a priori assumed to be invisible to the environment. In option (ii), you can consider a variant of Newcomb's where the predictor responds to the *probability* of the agent making a particular choice. This variant is more difficult for reasons similar to what happens with counterfactual mugging: it is hard for the agent to discover this behavior empirically. I think that, as opposed to counterfactual mugging, you *can* capture it by an incomplete model, although it will be a relatively complex, difficult to learn, model.)

So in some sense you're thinking in the "there are incorrect discounting functions" way (ie, steeper is always worse than less steep).

Well, the agents I consider interesting start from a state of ignorance, so they have to learn. For this, the time discount has to be shallow, otherwise there is no time to learn.

Look, a one-boxing agent would get *less utility*. Yes, it would have higher rewards on subsequent rounds, but the first round is more important, *by the very assumption you made* (steep time discount). Moreover, the two-boxing agent would get higher utility than any agent that one-boxes on some of the rounds, so on *all* rounds the correct action is two-box.

I agree in the case where the predictor is only good at all later, ie, its initial guess is random/fixed. Which is consistent with the way I originally brought ASP up. In that case, it absolutely makes sense to 2-box if your discounting is steep enough.

But, deliberation takes time. We can think of logical inductors, or Skyrms-style deliberation setups (Skyrms, *The Dynamics of Rational Deliberation*). The agent has a certain amount of time to think. If it spends all that time calculating probabilities (as in LIDT), and only makes a decision at the end, then it makes sense to say you can't successfully 1-box in a way which makes you predictably do so. But if you model the decision as being made over an amount of time, then you could possibly make the decision early on in your deliberation, so that the predictor can see. So, we can think of something like a commitment which you can make while still refining probabilities.

Again, I agree this doesn't make sense if the predictor is absolutely immovable (only takes our choice as evidence to refine its prediction in later rounds). But if the predictor is something like a logical inductor, then even if it runs much slower than the agent, it can "see" some of the early things the agent does.

Of course, there is the question of how you would know that making an early commitment rather than thinking longer would be a good strategy. I'm imagining they get each other's source code. We can think about it in a learning-theoretic way as follows. Suppose I want good behavior on problems where two AIs exchange source code and then have to play a game (with a certain amount of time to think before they have to act). The system can do anything with its time; think longer, make early commitments, whatever.

Actually, let's simplify a little: since the predictor is not fully an agent, we can just think in terms of building a system which is given source code which outputs the end utility (so the source code includes the predictor, the agent itself, everything). I want a system which does well on such problems if it has enough time to think. In order to get that, what I'm going to do is have it spend the first fraction of its thinking time training itself on smaller problems sampled at random. (It could and should choose instances intelligently, but let's say it doesn't.) It treats these entirely episodically; in each instance it only tries to maximize the score in that one case.

Nonetheless, it can eventually learn to make its decision early in cases where it is beneficial to be predictable.

This is all supposed to be a model of what our agent can do in order to come to a decision in ASP. It can think about similar problems. It can notice a trend where it's useful to make certain decisions early. Sure, it might figure that out too late, in which case the best it can do is probably 2-box. But it might also figure it out in time.

I don't think it makes sense to simultaneously assume that the agent has a lot more power than the predictor and that the predictor is likely to get things right even on the first round. If the agent has a lot more power than the predictor, then the agent can e.g. diagonalize against the predictor and make sure it will *not* get things right.

It can do that, which would make it unpredictable. However, it can also not do that. In this case it has no special motivation to do that, so there's no reason to, which means there's no reason for the predictor to expect that.

Notice that having a perfect predictor requires that either (i) the agent is deterministic or (ii) the predictor has access to the agent's random bits.

Not sure whether this was a point you were trying to make, but, I agree that in the case of a perfect predictor learning theory does a fine job of 1-boxing regardless of temporal discounting or any commitment mechanisms.

I'm worrying that the overarching thread may get lost. I think this point about ASP is an important sub-point, for a few reasons, but I hope the discussion doesn't dead-end here. My picture of where we're at:

- You initially made a claim that you could get some subset of decision problems, such as XOR, via learning-theoretic means. I was trying to learn more about that.
- You made a comment about problems outside the subset you can get being addressable with commitment mechanisms, while adding that you weren't certain this should be a desideratum.
- I was interested in the claim about commitment mechanisms, because I think you can get somewhat messy/patchwork kind-of-updateless-ish stuff out of that, but would be pretty interested if there's an elegant version.
- We started into differences of intuition about what is desirable.
- We settled into going back and forth about ASP.

I've gravitated to ASP partly because I hope it illustrates something about how I see logic as relevant to a learning-theoretic approach to decision theory. But I'm still interested more broadly in what class of decision problems you think you can do well on, with and without commitment mechanisms.

...But if you model the decision as being made over an amount of time, then you could possibly make the decision early on in your deliberation, so that the predictor can see.

Yes, sounds sort of reasonable. Here is how I think you can realize this using TRL.

As usual, we consider the agent playing an IPD against a predictor (Newcomb's paradox is essentially playing the Prisoner's Dilemma against a "FairBot"). On each round, the predictor gets to see the agent's state at the start of the round. (The state can be considered part of the "source code". For randomizing agents, we also assume the predictor sees the random bits.) The predictor then tries to simulate the agent (we assume it knows the rest of the agent's source code as well), and is successful *if* the agent doesn't execute any programs that are too expensive for the predictor (for the sake of simplicity, assume that no program started on one round continues running during following rounds: I don't think that this assumption makes a difference of principle). Otherwise, the prediction might be wrong (for example, we can assume it defaults to D). The predictor then plays D or C according to its prediction of the agent's action.

In this setting, the agent can learn the incomplete hypothesis "if I don't run expensive programs *and* I play C, the predictor will also play C". (We assume that the prior allows for side effects of executing programs. Such a prior seems more realistic anyway, and in particular is required to counter non-Cartesian daemons. However, it also has a cost, so perhaps what we really want is a prior that is biased towards few side effects: but, this is immaterial for the current discussion.) This hypothesis guarantees a payoff of U(CC). Assuming that the predictor cannot be exploited, this is the best possible payoff and therefore the agent will converge to cooperation.
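To make the setting concrete, here is a minimal sketch of a single round. The payoff numbers, the `budget` parameter, and the function names are illustrative assumptions rather than part of the proposal; the only things taken from the description above are that the predictor mirrors the agent's action when its simulation succeeds and defaults to D when the agent runs something too expensive.

```python
# Hypothetical agent payoffs for one round, indexed by
# (agent_action, predictor_action). The numbers are illustrative only.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}


def predictor_move(agent_action, agent_cost, budget):
    """Return the predictor's move for this round.

    If the agent's computation fits within the predictor's budget, the
    simulation succeeds and the predictor mirrors the agent's action
    (FairBot-style). Otherwise the prediction defaults to D.
    """
    if agent_cost <= budget:
        return agent_action
    return "D"


def play_round(agent_action, agent_cost, budget=10):
    """Agent's payoff for one round against the bounded predictor."""
    return PAYOFF[(agent_action, predictor_move(agent_action, agent_cost, budget))]


if __name__ == "__main__":
    for action in ("C", "D"):
        for cost in (1, 100):  # cheap vs. too expensive for the predictor
            print(action, cost, play_round(action, cost))
```

In this toy version the pair "cheap computation, play C" is the only choice that yields U(CC), and no choice yields the exploiting payoff U(DC), which is the situation the incomplete hypothesis describes.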

We might want a more explicit realization of the "simulates" part in "agent simulates predictor". For this, we can assume the agent also receives its own state as an observation (but, I'm not sure how generally useful this is for realistic agents). The agent can then also learn the incomplete hypothesis describing the exact function from agent states to predictor output. However, this hypothesis doesn't affect the outcome: it doesn't predict the agent's state and therefore can only guarantee the payoff U(DD).

I was never convinced that "logical ASP" is a "fair" problem. I once joked with Scott that we can consider a "predictor" that is just the single line of code "return DEFECT" but in the comments it says "I am defecting only because I know you will defect."

I'm leaning this way as well, but I think it's an important clue to figuring out commitment races. ASP Predictor, DefectBot, and a more general agent will make different commitments, and these things are already algorithms specialized for certain situations. How is the chosen commitment related to what the thing making the commitment is?

When an agent can manipulate a predictor in some sense, what should the predictor do? If it starts scheming with its thoughts, it's no longer a predictor, it's just another agent that wants to do something "predictory". Maybe it can only give up, as in ASP, which acts as a precommitment that's more thematically fitting for a predictor than for a general agent. It's still a commitment race then, but possibly the meaning of something being a predictor is preserved by restricting the kind of commitment that it makes: the commitment of a non-general agent is what it *is* rather than what it *does*, and a general agent is only committed to its preference. Thus a general agent loses all knowledge in an attempt to out-commit others, because it hasn't committed to that knowledge, didn't make it part of what it is.

I don't currently see why we shouldn't ask to converge to Pareto optima. Obviously, we can't expect to do so with arbitrary other agents; but it doesn't seem unreasonable to use an algorithm which has the property of reaching Pareto optima with other agents who use that same algorithm.

I had an interesting thought that made me update towards your position. There is my old post about "metathreat equilibria", an idea I developed with Scott's help during my trip to Los Angeles. So, I just realized the same principle can be realized in the setting of repeated games. In particular, I am rather confident that the following, if formulated a little more rigorously, is a theorem:

*Consider the Iterated Prisoner's Dilemma in which strategies are constrained to depend only on the action of the opponent in the previous round. Then, at the limit (shallow time discount), the only thermodynamic equilibrium is mutual cooperation.*
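As a small illustration of the strategy class in this claim (not of the equilibrium concept itself, which is not defined in this thread), the sketch below represents a strategy as a map from the opponent's previous action to the player's next action and computes average per-round payoffs. The payoff matrix, the strategy names, and the choice of opening moves are assumptions made for the example.

```python
# Illustrative payoffs (own payoff given own action and opponent's action).
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

# The four deterministic strategies that depend only on the opponent's
# previous action. The names are just labels for this example.
STRATEGIES = {
    "AllC": {"C": "C", "D": "C"},
    "AllD": {"C": "D", "D": "D"},
    "TitForTat": {"C": "C", "D": "D"},
    "Contrarian": {"C": "D", "D": "C"},
}


def average_payoffs(s1, s2, first=("C", "C"), rounds=1000):
    """Average per-round payoffs of two reactive strategies.

    `first` fixes the opening actions, since a strategy of this type
    does not by itself say what to do on round one.
    """
    a1, a2 = first
    total1 = total2 = 0
    for _ in range(rounds):
        total1 += PAYOFF[(a1, a2)]
        total2 += PAYOFF[(a2, a1)]
        # Each player reacts only to the other's action in this round.
        a1, a2 = s1[a2], s2[a1]
    return total1 / rounds, total2 / rounds


if __name__ == "__main__":
    for n1, s1 in STRATEGIES.items():
        for n2, s2 in STRATEGIES.items():
            print(n1, n2, average_payoffs(s1, s2))
```

With a long horizon the opening moves wash out of the average, which loosely mirrors the shallow-discount limit in the statement above.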

Then, there is the question of how to generalize it. Obviously we want to consider more general games and more general strategies, but allowing fully general strategies probably won't work (just like allowing arbitrary programs doesn't work in the "self-modification" setting). One obvious generalization is allowing the strategy to depend on some finite suffix of the history. But, this isn't natural in the context of agent theory: why would agents forget everything that happened before a certain time? Instead, we can constrain the strategy to be finite-state, and maybe require it to be communicating (i.e. forbid "grim triggers" where players change their behavior forever). On the game side, we can consider arbitrary repeated games, or communicating stochastic games (they have to be communicating because otherwise we can represent one-shot games), or even communicating partially observable stochastic games. This leads me to the following bold conjecture:

*Consider any suitable (as above) game in which strategies are constrained to be (communicating?) finite-state. Then, at the limit, all thermodynamic equilibria are Pareto efficient.*

This would be the sort of result that seems like it should be applicable to learning agents. That is, if the conjecture is true, there is good hope that appropriate learning agents are guaranteed to converge to Pareto efficient outcomes. Even if the conjecture requires some moderately strong additional assumptions, it seems worth studying.

In INP, any reinforcement learning (RL) algorithm will converge to one-boxing, simply because one-boxing gives it the money. This is despite RL naively looking like CDT.

Yup, like Caspar, I think that model-free RL learns the EDT policy in most/all situations. I'm not sure what you mean with it looking like CDT.
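A minimal sketch of the claim about the iterated Newcomb's paradox, under simplifying assumptions that are mine rather than the thread's: the predictor is taken to be perfect on every round, the payoffs are the conventional $1,000,000 and $1,000, and the learner is a plain epsilon-greedy bandit that tracks the average reward of each action.

```python
import random


def iterated_newcomb(rounds=5000, epsilon=0.1, seed=0):
    """Run an epsilon-greedy learner on an iterated Newcomb problem.

    Assumes a perfect predictor, so one-boxing always pays 1,000,000
    and two-boxing always pays 1,000. Returns the learned action
    values and how often each action was taken.
    """
    rng = random.Random(seed)
    reward = {"one-box": 1_000_000, "two-box": 1_000}
    value = {"one-box": 0.0, "two-box": 0.0}  # running average reward
    count = {"one-box": 0, "two-box": 0}
    for _ in range(rounds):
        if rng.random() < epsilon:
            action = rng.choice(list(value))   # explore
        else:
            action = max(value, key=value.get)  # exploit current estimate
        count[action] += 1
        # Incremental update of the average reward for this action.
        value[action] += (reward[action] - value[action]) / count[action]
    return value, count


if __name__ == "__main__":
    value, count = iterated_newcomb()
    print(value)
    print(count)
```

The learner never represents the predictor at all; it simply observes that one-boxing is followed by more money, so its greedy choice settles on one-boxing.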

In Newcomb's paradox CDT succeeds but EDT fails. Let's consider an example where EDT succeeds and CDT fails: the XOR blackmail.

Isn't it the other way around? The one-boxer gets more money, but gives in to blackmail, and therefore gets blackmailed in the first place.

RL is CDT in the sense that, your model of the world consists of actions and observations, and some causal link from past actions and observations to current observations, but there is no causal origin to the actions. The actions are just set by the agent to whatever it wants.

And, yes, I got CDT and EDT flipped there, good catch!

Utilising a model that assumes we can decouple is different from assuming we can decouple. For example, calculus assumes that space is infinitely divisible, but using a formula derived from calculus to calculate the volume of a sphere doesn't require you to assume space is infinitely divisible. It just has to work as an approximation.

I'm pretty confident in my forgetting approach to logical counterfactuals, though I've sadly been too busy the last few months to pick it up again. I haven't quite formalised it yet - I'm planning to try to collect all the toy problems and I think it'll likely drop out fairly quickly that there are a few distinct, but closely related notions of logical counterfactual. Anyway, I think I've finally figured out how to more precisely state my claim that we ought to build upon consistent counterfactuals.

In this post, I clarify how far we are from a complete solution to decision theory, and the way in which high-level philosophy relates to the mathematical formalism. I’ve personally been confused about this in the past, and I think it could be useful to people who casually follow the field. I also link to some less well-publicized approaches.

The first disagreement you might encounter when reading about alignment-related decision theory is the disagreement between Causal Decision Theory (CDT), Evidential Decision Theory (EDT), and different logical decision theories emerging from MIRI and LessWrong, such as Functional Decision Theory (FDT) and Updateless Decision Theory (UDT). This is characterized by disagreements on how to act in problems such as Newcomb’s problem, the smoking lesion, and the prisoner's dilemma. MIRI’s paper on FDT represents this debate from MIRI’s perspective, and, as exemplified by the philosopher who refereed that paper, academic philosophy is far from having settled on how to act in these problems.

I’m quite confident that the FDT-paper gets those problems right, and as such, I used to be pretty happy with the state of decision theory. Sure, the FDT-paper mentions logical counterfactuals as a problem, and sure, the paper only talks about a few toy problems, but the rest is just formalism, right?

As it turns out, there are a few caveats to this:

As a consequence, knowing what philosophical position to take in the toy problems is only the beginning. There’s no formalised theory that returns the right answers to all of them yet, and if we ever find a suitable formalism, it’s very unclear how it will generalise.

If you want to dig into this more, Abram Demski mentions some open problems in this comment. Some attempts at making better formalisations include Logical Induction Decision Theory (which uses the same decision procedure as evidential decision theory, but gets logical uncertainty by using logical induction), and a potential modification, Asymptotic Decision Theory. There’s also a proof-based approach called Modal UDT, for which a good place to start would be the 3rd section in this collection of links. Another surprising avenue is that some formalisations of the high-level clusters suggest that they're all the same. If you want to know more about the differences between Timeless Decision Theory (TDT), FDT, and versions 1.0, 1.1, and 2 of UDT, this post might be helpful.