# Ω 6

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This post aims to provide a strong philosophical foundation for logical counterfactuals, while sketching out an informal scheme that will hopefully be formalised further in the future. I believe that providing such a philosophical foundation is important for the same reasons that Sam listed in Motivating a Semantics of Logical Counterfactuals.

Introductory Material

• MIRI's Functional Decision Theory (FDT) paper defines subjunctive dependence to refer to the situation where two decision processes involve the same calculation. This can result from causation, such as if I calculate an equation, then tell you the answer and we both implement a decision based on this sum. It can also occur non-causally, such as a prediction being linked to a decision as per Newcomb's problem. The problem of logical counterfactuals can be characterised as figuring out what processes subjunctively depend on other processes, so that we can apply FDT.
• In the Co-operation Game, I argued that logical counterfactuals are more about your knowledge of the state of the world than the world itself. Suppose there are two people who can choose A or B. Suppose that a predictor knows that both people will choose A conditional on them being told one of the following two facts a) the other person will choose A b) the other person will choose the same as you. Then whether your decision is modelled to subjunctively depend on the other person depends on which of the two facts you are told. Going further than the original post, one might be told a) and the other b), so that the first sees themselves as not subjunctively linked to the second, while the second sees themselves as subjunctively linked to the first.
• "Your knowledge of the state of the world" can be explicated as being about internally consistent counterfactuals, which we'll label Raw Counterfactuals. When there are multiple raw counterfactuals consistent with your state of knowledge, you can pick the one with the highest utility.
• However, there will also be cases where only a single counterfactual is consistent with your state of knowledge, which results in a rather trivial problem. Consider for example Transparent Newcomb's Problem, where a perfect predictor places the million in a transparent box if and only if it predicts that you will one-box if it does. If you see the million, you know that you must have one-boxed, so it doesn't strictly make sense to ask what you should do in this situation. We'll have to ask something slightly different instead. So, I've slightly modified my position since writing the Co-operation Game: in some situations logical counterfactuals will be defined relative to an imagined, rather than actual, epistemic state. We will construct these states by erasing some information as described later in the post.
• Other degenerate cases include when you already know what decision you'll make or when you have the ability to figure it out. For example, when you have perfect knowledge of the environment and the agent, unless you run into issues with unprovability. Note that degeneracy is more common than you might think, since knowing, for example, that an agent is a utility maximiser tells you its exact behaviour in situations without options that are tied. Again, in these cases, the answer to the question, "What should the agent do?" is, "The only action consistent with the problem statement". However, as we'll see, it is sometimes possible to make these questions less trivial if you're willing to accept a slight tweak to the problem statement.
• We'll consider two kinds of problems, acknowledging that these aren't the only types. In external problems, we imagine decisions from the perspective of a theoretical, unbounded, non-embedded observer who exists outside of the problem statement. Clearly we can't fully adopt the perspective of such an external agent, but describing the high-level details will usually suffice. Critically, in the external perspective, the observer can have goals, such as choosing the agent with the maximum utility, without those being the goals of the agent within the problem.
• In (fully) reflective problems, we imagine decisions from the perspective of an agent considering its own decisions or potential decisions with full knowledge of its own source code. These problems will complicate the counterfactuals since the agent's goals limit the kind of agent that it could be. For example, an agent that wants to maximise utility should only search over the possibility space where it is a utility maximiser.
• Making this distinction more explicit: An external problem would ask "What decision maximises utility?", as opposed to a reflective problem which asks: "What decision maximises utility for a utility maximiser?". This distinction will mainly be important here in terms of when it makes a problem trivial or not.
• The external/reflective distinction is very similar to the difference between embedded and non-embedded problems, but external problems can include embedded agents, just from the perspective of a non-embedded agent. So we can do a surprising amount of our theorising from within the external perspective.
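The idea of picking the best raw counterfactual consistent with a state of knowledge, and the way it becomes degenerate once too much is known, can be illustrated with a minimal Python sketch. The `World` class, the attribute names and the helper functions are hypothetical names introduced only for illustration; the payoffs are the ones used for Transparent Newcomb's later in this post, and a perfect predictor is assumed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class World:
    """One internally consistent (raw) counterfactual for Transparent Newcomb's."""
    action: str      # "one-box" or "two-box"
    box_full: bool   # whether the transparent box holds the million
    utility: int     # payoff in dollars


# Assumption: with a perfect predictor the box is full exactly when the agent
# one-boxes, so these are the only two internally consistent worlds.
CONSISTENT_WORLDS = [
    World(action="one-box", box_full=True, utility=1_000_000),
    World(action="two-box", box_full=False, utility=1_000),
]


def consistent_with(knowledge, worlds=CONSISTENT_WORLDS):
    """Return the worlds matching every (attribute, value) pair that is known."""
    return [
        w for w in worlds
        if all(getattr(w, attr) == value for attr, value in knowledge.items())
    ]


def best_world(knowledge):
    """Pick the highest-utility world among those consistent with the knowledge."""
    return max(consistent_with(knowledge), key=lambda w: w.utility)


# Knowing the box is full leaves a single counterfactual: a trivial problem.
degenerate = consistent_with({"box_full": True})

# Knowing nothing about the box or the action leaves a real choice.
open_problem = consistent_with({})
```

Under these assumptions, `degenerate` contains only the one-boxing world, while `best_world({})` selects one-boxing from two candidates, which is the non-trivial version of the question.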

Raw Counterfactuals

• Raw counterfactuals are produced as follows: Starting with the territory, we use some process to produce a causal model. We can then imagine constructing different models by switching out or altering parts of the model. These represent a coherent concept separate from any discussion of decisions. In so far as we care about what could have been, we need to ultimately relate our claims to raw counterfactuals, as inconsistent models could not have been.
• Causal Decision Theory uses its own notion of counterfactuals, which we'll term Decision Counterfactuals. These are created by performing world surgery on models of causal processes. Unlike raw counterfactuals, decision counterfactuals are inconsistent. Suppose that you actually defect in the Prisoner's Dilemma, but we are considering the counterfactual where you cooperate. Up until the point of the decision you are the kind of person who defects, but when we arrive at the decision, you magically cooperate.
• Decision counterfactuals are useful because they approximate raw counterfactuals. Performing world surgery all the way back in time would require a lot of work. In theoretical decision problems it is usually easy to imagine a raw counterfactual that would match the problem description and provide the same answer as the decision counterfactual. In practical decision problems, we don't have the data to do this.
• Unfortunately, this approximation breaks down when performing world surgery to make a counterfactual decision consistent with the past requires us to change an element of the environment that is important for the specific problem. For example, in Newcomb's problem, changing your current decision requires changing your past self which involves changing a predictor which is considered part of the environment. In this case, it makes sense to fall back to raw counterfactuals and build a new decision theory on top.
• Functional decision theory, as normally characterised, is closer to this ideal, as world surgery isn't just performed on your decision, but also on all decisions that subjunctively depend on you. This removes a bunch of inconsistencies; however, we've still introduced an inconsistency by assuming that f(x)=b when f(x) really equals a. The raw counterfactual approach provides a stronger foundation because it avoids this issue. However, since proof-based FDT is very effective at handling reflective problems, it would be worthwhile rebuilding it upon this new foundation.
• Let's consider Newcomb's problem from the external perspective. The external observer is trying to maximise utility rather than the agent within the problem, so there is no restriction on whether the agent can one-box or two-box whilst being consistent with the problem statement. We can then immediately observe that if we use raw counterfactuals, the reward only depends on the agent's ultimate decision and agents that one-box score better than those who don't. Simple cases like this which allow multiple consistent counterfactuals don't require erasure.
• On the other hand, there are problems which only allow a single raw counterfactual and hence require us to tweak the problem to make it well-defined. Consider, for example, Transparent Newcomb's, where if you see money in the box, you know that you will receive exactly $1 million. Some people say this fails to account for the agent in the simulator, but it's entirely possible that Omega may be able to figure out what action you will take based on high-level reasoning, as opposed to having to run a complete simulation of you. We'll describe later a way of tweaking the problem statement into something that is both consistent and non-trivial.

General Approach to Logical Counterfactuals

• We will now attempt to produce a more solid theory of external problems using FDT. This will allow us to interpret decision problems where only one decision is consistent with the problem statement in a non-trivial way.
• FDT frames logical counterfactuals as "What would the world be like if f(x)=b instead of a?" which doesn't strictly make sense, as noted in the discussion on raw counterfactuals. Two points: a) I think it should be clear that this question only makes sense in terms of thinking of perturbations of the map and not as a direct claim about the territory (see map and territory). b) We'll address this problem by proposing a different approach for foundations, which these proof-based approaches should ultimately be justified in terms of.
• There are two possible paths towards a more consistent theory of logical counterfactuals for these situations. In both cases we interpret the question of what it would mean to change the output of a function as an informal description of a similar question that is actually well-defined. The first approach is to see what consequences can be logically deduced from f(x)=b while implementing a strategy to prevent us from deducing incorrect statements from the inconsistencies. This is often done by playing chicken with the universe. We will term this a paraconsistent approach: even though it doesn't explicitly make use of paraconsistent logic, it is paraconsistent in spirit.
• An alternative approach would be to interpret this sentence as making claims about raw counterfactuals. In FDT terms, the raw counterfactual approach finds an f' such that f'(x)=b, with certain as-yet-unstated similarities to f, and substitutes this into all subjunctively linked processes. The paraconsistent approach is easier to do informally, but I suspect that the raw counterfactual approach would be more amenable to formalisation and provides more philosophical insight into what is actually going on. In so far as the paraconsistent approach may be more convenient from an implementation perspective than the raw counterfactual approach, we can justify it by tying it to raw counterfactuals.

Decisions

• Now that we've outlined the broad approach, we should dig more into the question of what exactly it means to make a decision. As I explained in a previous post, there's a sense in which you don't so much 'make' a decision as implement one. If you make something, it implies that it didn't exist before and now it does. In the case of decisions, it nudges you towards believing that the decision you were going to implement wasn't set, then you made a decision, and then it was. However, when "you" and the environment are defined down to the atom, you can only implement one decision. It was always the case from the start of time that you were going to implement that decision.
• We note that if you have perfect information about the agent and the environment, you need to forget, or at least pretend to forget, some information about the agent so that we can produce counterfactual versions of the agent who decide slightly differently. See Shminux's post on Logical Counterfactuals are Low Res for a similar argument, but framed slightly differently. The key difference is that I'm not suggesting just adding noise to the model, but forgetting specific information that doesn't affect the outcome.
• In Transparent Newcomb's, it would be natural to erase the knowledge that the box is full. This would then result in two counterfactuals: a) the one where the agent sees an empty box and two-boxes, b) the one where the agent sees a full box and one-boxes. It would be natural to relax the scope of who we care about from the agent who sees a million in the box to the agent at the end of the problem regardless of what they see. If we do so, we then have a proper decision problem and we can see that one-boxing is better.
• Actually there's a slight hitch here. In order to define the outcome an agent receives, we need to define what the predictor will predict when an agent sees the box containing the million. But it is impossible to place a two-boxer in this situation. We can resolve this by defining the predictor as simulating the agent responding to an input representing an inconsistent situation, as I've described in Counterfactuals for Perfect Predictors.
• In order to imagine a consistent world, when we imagine a different "you", we must also imagine the environment interacting with that different "you" so that, for example, the predictor makes a different prediction. Causal decision theorists construct these counterfactuals incorrectly and hence they believe that they can change their decision without changing the prediction. They fail to realise that they can't actually "change" their decision, as there is a single decision that they will inevitably implement. I suggest replacing "change a decision" with "shift counterfactuals" when it is important to be able to think clearly about these topics. It also clarifies why the prediction can change without backwards causation (my previous post on Newcomb's problem contains further material on why this isn't an issue).

Erasure

• Here's how the erasure may proceed for the example of Transparent Newcomb's Problem. Suppose we erase all information about what decision the agent is going to make. This also requires erasing the fact that you see a million in the transparent box. Then we look at all counterfactually possible agents and notice that the reward depends only on whether you are an agent who ultimately one-boxes or an agent who two-boxes. Those who one-box see the million and then receive it; those who two-box see no money in the transparent box and receive $1000 only. The counterfactual involving one-boxing performs better than that involving two-boxing, so we endorse one-boxing.
• Things become more complicated if we want to erase less information about the agent. For example, we might want an agent to know that it is a utility maximiser, as this might be relevant to evaluating the outcome the agent will receive from future decisions. Suppose that if you one-box in Transparent Newcomb's you'll then be offered a choice of $1000 or $2000, but if you two-box you'll be offered $0 or $10 million. We can't naively erase all information about your decision process in order to compute a counterfactual of whether you should one-box or two-box. Otherwise, we end up with a situation where, for example, there are agents who one-box and get different rewards. Here the easiest solution is to "collapse" all of the decisions and ask instead about a policy that covers all three decisions that may be faced. Then we can calculate the expectations without producing any inconsistencies in the counterfactuals, as it then becomes safe to erase the knowledge that it is a utility maximiser.
• The 5 and 10 problem doesn't occur with the forgetting approach. Compare: if you keep the belief that you are a utility maximiser, then the only choice you can implement is 10, so we don't have a decision problem. Instead, we can define all the possible strategies as choosing 5 with probability p and 10 with probability 1-p, and forget everything about yourself except that you are one of these strategies. There's no downside, as there is no need for an agent to know whether or not it is a utility maximiser. So we can solve the 5 and 10 problem without epsilon exploration.
• When will we be able to use this forgetting technique? One initial thought might be the same scope FDT is designed to be optimal on - problems where the reward only depends on your outputs or predicted outputs. Because only the output or predicted output matters and not the algorithm, these can be considered fair, unlike a problem where an alphabetical decision theorist (picks the first decision ordered in alphabetical order) is rewarded and every other type of agent is punished.
• However, some problems where this condition doesn't hold also seem fair. Like suppose there are long programs and short programs (in terms of running time). Further suppose programs can output either A or B. The reward is then determined purely based on these two factors. Now suppose that there exists a program that can calculate the utilities in each of these four worlds and then based upon this either terminate immediately or run for a long time and then output its choice of A or B. Assume that if it terminates immediately after the decision it'll qualify as a short program, while if it terminates after a long time it is a long program. Then, as a first approximation, we can say that this is also a fair problem since it is possible to win in all cases.
• It's actually slightly more complex than this. An AI usually doesn't have to win on only one problem. Adding code to handle more situations will extend the running time and may prevent such an AI from always being able to choose the dominant option. An AI might also want to do things like double check calculations or consider whether it is actually running on consistent outcomes, so winning the problem might put limitations on the AI in other ways.
• But nonetheless, so long as we can write such a program that picks the best option, we can call the problem "fair" in a limited sense. It's possible to extend our definition of "fair" further. Like suppose that it's impossible to analyse all the options and still return in a short amount of time. This isn't a problem if the maximum utility is in a long option.
• In regards to running time, we can also end up with a non-binary notion of "fair" according to how much extra processing a program can squeeze in before having to decide the short option. This limits the ability of the AI to check/recheck its work and examine its philosophical assumptions before having to make the decision.
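The policy-level "collapse" described above for the variant of Transparent Newcomb's with a follow-up choice can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function and variable names are hypothetical, the predictor is taken to be perfect, and the total reward is assumed to be the Newcomb payoff plus the follow-up payoff (the post does not say how the two combine).

```python
from itertools import product

# Payoff from the Transparent Newcomb stage, assuming a perfect predictor.
NEWCOMB_PAYOFF = {"one-box": 1_000_000, "two-box": 1_000}

# Follow-up offers from the example: they depend on the first decision.
FOLLOW_UP_OPTIONS = {"one-box": (1_000, 2_000), "two-box": (0, 10_000_000)}


def policy_payoff(first, after_one_box, after_two_box):
    """Total reward of a policy covering all three decisions.

    Assumption: the Newcomb payoff and the follow-up payoff simply add.
    Only the follow-up branch actually reached contributes to the total.
    """
    follow_up = after_one_box if first == "one-box" else after_two_box
    return NEWCOMB_PAYOFF[first] + follow_up


def all_policies():
    """Enumerate every (first choice, pick after one-box, pick after two-box)."""
    return product(
        NEWCOMB_PAYOFF,                 # iterates over "one-box", "two-box"
        FOLLOW_UP_OPTIONS["one-box"],
        FOLLOW_UP_OPTIONS["two-box"],
    )


def best_policy():
    """Return one policy with maximal total reward (ties broken by order)."""
    return max(all_policies(), key=lambda policy: policy_payoff(*policy))


best = best_policy()
```

Because each counterfactual is a whole policy, every candidate receives exactly one reward, so the inconsistency of "agents who one-box and get different rewards" does not arise. With the additive assumption above, the best policy two-boxes and then takes the $10 million, which shows why the follow-up decisions cannot be ignored when evaluating the first one.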

Final Thoughts

• Logical counterfactuals are often framed in such a way that it seems that we should be building a model of subjunctive dependence directly from the atoms in the universe. Instead we produce these from a causal model that identifies the current agent and a model of forgetting. This makes our job much easier as it allows us to disentangle this problem from many of the fundamental issues in ontology and philosophy of science.
• Comparing the method in this post to playing chicken with the universe: using raw counterfactuals clarifies that such proof-based methods are simply tricks that allow us to act as though we've forgotten without forgetting.
• In general, I would suggest that logical counterfactuals are about knowing which information to erase such that you can produce perfectly consistent counterfactuals. Further, I would suggest that if you can't find information to erase that produces perfectly consistent counterfactuals then you don't have a decision theory problem. Future work could explore exactly when this is possible and general techniques for making this work as the current exploration has been mainly informal.

Notes:

• I changed my terminology from Point Counterfactuals to Decision Counterfactuals and Timeless Counterfactuals to Raw Counterfactuals in this post as this better highlights where these models come from.

This post was written with the support of the EA Hotel

Other degenerative cases include when you already know what decision you'll make or when you have the ability to figure it out. For example, when you have perfect knowledge of the environment and the agent, unless you run into issues with unprovability. Note that degeneracy is more common than you might think since knowing, for example, that it is a utility maximiser, tells you its exact behaviour in situations without options that are tied. Again, in these cases, the answer to the question, "What should the agent do?" is, "The only action consistent with the problem statement".

Why should this be the case? What do you think of the motto "decisions are for making bad outcomes inconsistent"?

I kind of agree with it, but in a way that makes it trivially true. Once you have erased information to provide multiple possible raw counterfactuals, you have the choice to frame the decision problem as either choosing the best outcome or avoiding sub-optimal outcomes. But of course, this doesn't really make a difference.

It seems rather strange to talk about making an outcome inconsistent which was already inconsistent. Why is this considered an option that was available for you to choose, instead of one that was never available to choose? Consider a situation where the world and agent have both been precisely defined. Determinism means there is only one possible option, but decision problems have multiple possible options. It is not clear which decisions that are inconsistent with what actually happened count as "could have been chosen" and which count as "were never possible".

Actually, this relates to my post on Counterfactuals for Perfect Predictors. Talking about making your current situation inconsistent doesn't make sense literally, only analogically. After all, if you're in a situation it has to be consistent. The way that I get round this in my post is by replacing talk of decisions given a situation with talk of decisions given an input representing a situation. While you can't make your current situation inconsistent, it is sometimes possible for a program to be written such that it cannot be put in the situation representing an input as its output would be inconsistent with that. And that lets us define what we wanted to define, without having to fudge philosophically.

I kind of agree with it, but in a way that makes it trivially true. Once you have erased information to provide multiple possible raw counterfactuals, you have the choice to frame the decision problem as either choosing the best outcome or avoiding sub-optimal outcomes. But of course, this doesn't really make a difference.

I think our disagreement is around the status of decision problems before you've erased information, not after. In your post, you say that before erasing information, a problem where what you do is determined is trivial, in that you only have the one option. That's the position I'm disagreeing with. To the extent that erasing information is a useful idea, it is useful precisely for dealing with such problems -- otherwise you would not need to erase the information. The way you're describing it, it sounds like erasing information isn't something agents themselves are supposed to ever have to do. Instead, it is a useful tool for a decision theorist, to transform trivial/meaningless decision problems into nontrivial/meaningful ones. This seems wrong to me.

It seems rather strange to talk about making an outcome inconsistent which was already inconsistent. Why is this considered an option that was available for you to choose, instead of one that was never available to choose? Consider a situation where the world and agent have both been precisely defined. Determinism means there is only one possible option, but decision problems have multiple possible options. It is not clear which decisions that are inconsistent with what actually happened count as "could have been chosen" and which count as "were never possible".

I'm somewhat confused about what you're saying in this paragraph and what assumptions you might be making. I think it might help to focus on examples. Two examples which I think motivate the idea:

• Smoking lesion. It can often be quite a stretch to put an agent into a smoking lesion problem, because the problem assumes certain population statistics which may be impossible to achieve if the population is assumed to make decisions in a particular way. My impression is that some philosophers hold a decision theory like CDT and EDT responsible for what advice it offers in a particular situation, even if it would be impossible to put agents in that situation if they were the sort of agents who followed the advice of the decision theory in question. In other words, even if it is impossible to put EDT agents in a situation where they are representative of a population as described in the smoking lesion problem, EDT is held responsible for offering bad advice to agents in such a situation. I take the motto "decisions are for making bad outcomes inconsistent" as speaking against this view, instead giving EDT credit for making it impossible for an agent to end up in such a situation.
• (In my post on smoking lesion, I came up with a way to get EDT agents into a smoking-lesion situation; however, it required certain assumptions about their internal architecture. We could take the argument as speaking against such an architecture, rather than EDT. This interpretation seems quite natural to me, because the setup required to get EDT into a smoking lesion situation is fairly unnatural, and one could simply refuse to build agents with such an unnatural architecture.)
• Transparent Newcomb. In the usual setup, the agent is described as already facing a large sum of money. We are also told that this situation is only possible if the agent one-boxes -- a two-boxing agent won't get this opportunity (or, will get it with much smaller probability). Academic decision theorists tend to, again, judge the decision theory on the quality of advice offered under the assumption that an agent ends up in the situation, disregarding the effect of the decision on whether an agent could be in the situation in the first place. On this view, decision theories such as UDT which one-box are giving bad advice, because if you are already in the situation, you can get more money by two-boxing. In this case, the motto "decisions are for making bad outcomes inconsistent" is supposed to indicate that agents should one-box, so that they can end up in the better situation. A two-boxing decision theory like CDT is judged poorly for making it impossible to get a very good payout.

Importantly, transparent Newcomb (with a perfect predictor) is a case where the agent has enough information to know its own action: it must one-box, since it could not be in this situation if it two-boxed. Yet, we can talk about decision theories such as CDT which two-box in such cases. So it is not meaningless to talk about what happens if you take an action which is inconsistent with what you know! What you do in such situations has consequences.

I don't know that you disagree with any of this, since in your original essay you say:

For example, when you have perfect knowledge of the environment and the agent, unless you run into issues with unprovability. Note that degeneracy is more common than you might think since knowing, for example, that it is a utility maximiser, tells you its exact behaviour in situations without options that are tied.

However, you go on to say:

Again, in these cases, the answer to the question, "What should the agent do?" is, "The only action consistent with the problem statement".

which is what I was disagreeing with. We can set up a sort of reverse transparent Newcomb, where you should take the action which makes the situation impossible: Omega cooks you a dinner selected out of those which it predicts you will eat. Knowing this, you should refuse to eat meals which you don't like, even though when presented with such a meal you know you must eat it (since Omega only presents you with a meal you will eat).

(Aside: the problem isn't fully specified until we also say what Omega does if there is nothing you will eat. We could say that Omega serves you nothing in that case.)
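The dinner example can be made concrete with a small Python sketch. Everything named here is a hypothetical illustration: the menu, the utilities, and the choice to score an agent by its worst possible serving are assumptions, as is giving a utility of 0 to being served nothing.

```python
# Hypothetical menu with the agent's utility for eating each meal.
MEAL_UTILITY = {"pudding": -5, "pasta": 3, "curry": 8}

# Assumption: being served nothing is worth 0.
NOTHING_UTILITY = 0


def possible_servings(will_eat):
    """Meals Omega might serve: only those it predicts the agent will eat."""
    return sorted(meal for meal in MEAL_UTILITY if meal in will_eat)


def worst_case_utility(will_eat):
    """Utility of the least-liked meal Omega could serve under this policy."""
    servings = possible_servings(will_eat)
    if not servings:
        return NOTHING_UTILITY  # Omega serves nothing if nothing would be eaten
    return min(MEAL_UTILITY[meal] for meal in servings)


# Policy 1: eat whatever is served.
eats_anything = set(MEAL_UTILITY)

# Policy 2: refuse any meal the agent dislikes.
refuses_disliked = {meal for meal, u in MEAL_UTILITY.items() if u > 0}
```

Under these assumptions the refusing policy is never served pudding at all, so its worst case is better than that of the policy which eats anything, even though any agent actually presented with a meal is one that will eat it.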

Talking about making your current situation inconsistent doesn't make sense literally, only analogically. After all, if you're in a situation it has to be consistent. The way that I get round this in my post is by replacing talk of decisions given a situation with talk of decisions given an input representing a situation. While you can't make your current situation inconsistent, it is sometimes possible for a program to be written such that it cannot be put in the situation representing an input as its output would be inconsistent with that. And that lets us define what we wanted to define, without the nudge to fudge philosophically.

This seems basically consistent with what I'm saying (indeed, almost the same as what I'm saying), except I take strong objection to some of your language. I don't think you "analogically" make situations inconsistent; I think you actually do. Replacing "situation" with "input representing a situation" seems sort of in the right direction, but the notion of "input" is problematic, because it can be your own internal reasoning which predicts your action accurately.

Of the chicken rule, for example, it is literally (not analogically) correct to say that the algorithm takes a different action if it ever proves that it takes a certain action. It is also true that it never ends up in this situation. We could also say that you never take an action if you have an internal state representing certainty that you take that action. However, it is furthermore true that you never get into such an internal state.

Similarly, in the example where Omega cooks you something which you will eat, I would think it literally correct to say that you would not eat pudding (supposing that's a property of your decision algorithm).

(This comment was written before reading EDT=CDT. I think some of my views might update based on that when I have more time to think about it)

In your post, you say that before erasing information, a problem where what you do is determined is trivial, in that you only have the one option. That's the position I'm disagreeing with.

It will be convenient for me to make a slightly different claim than the one I made above. Instead of claiming that the problem is trivial in completely determined situations, I will claim that it is trivial given the most straightforward interpretation of a problem* (the set of possible actions for an agent is all those which are consistent with the problem statement, and the action which is chosen is selected from this set of possible actions). In so far as both of us want to talk about decision problems where multiple possible options are considered, we need to provide a different interpretation of what decision problems are. Your approach is to allow the selection of inconsistent actions, while I suggest erasing information to provide a consistent situation.

My response is to argue, as per my previous comment, that there don't seem to be any criteria for determining which inconsistent actions are considered and which ones aren't. I suppose you could respond that I haven't provided criteria for determining what information should be erased, but my approach has the benefit that if you do provide such criteria, logical counterfactuals are solved for free, while it's much less clear how to approach this problem in the allowing-inconsistency approach (although there has been some progress with things like playing chicken with the universe).

*excluding unprovability issues

The way you're describing it, it sounds like erasing information isn't something agents themselves are supposed to ever have to do

You're at the stage of trying to figure out how agents should make decisions. I'm at the stage of trying to understand what making a good decision even means. Once there is a clearer understanding of what a decision is, we can then write an algorithm to make good decisions or we may discover that the concept dissolves, in which case we will have to specify the problem more precisely. Right now, I'd be perfectly happy just to have a clear criterion by which an external evaluator could say whether an agent made a good decision or not, as that would constitute substantial progress.

I'm somewhat confused about what you're saying in this paragraph and what assumptions you might be making

My point was that there isn't any criteria for determining which inconsistent actions are considered and which ones aren't if you are just thrown a complete description of a universe and an agent. Transparent Newcomb's already comes with the options and counterfactuals attached. My interest is in how to construct them from scratch.

My impression is that some philosophers hold a decision theory like CDT and EDT responsible for what advice it offers in a particular situation, even if it would be impossible to put agents in that situation

I think it is important to use very precise language here. The agent isn't being rated on what it would do in such a situation, it is being rated on whether or not it can be put into that situation at all.

I suspect that sometimes when an agent can't be put into a situation it is because the problem has been badly formulated (or falls outside the scope of problems where its decision theory is defined), while in other cases this is a reason for or against utilising a specific decision theory algorithm. Holding an agent responsible for all situations it can't be in seems like the wrong move, as it feels that there is some more fundamental confusion that needs to be cleaned up.

I take the motto "decisions are for making bad outcomes inconsistent"

I'm not a fan of reasoning via motto when discussing these kinds of philosophical problems which turn on very precise reasoning.

So it is not meaningless to talk about what happens if you take an action which is inconsistent with what you know!... I don't know that you disagree with any of this... We can set up a sort of reverse transparent Newcomb, where you should take the action which makes the situation impossible

There's something of a tension between what I've said in this post about only being able to take decisions that are consistent and what I said in Counterfactuals for Perfect Predictors, where I noted a way of doing something analogous to acting to make your situation inconsistent. This can be cleared up by noting that erasing information in many decision theory problems provides a problem statement where input-output maps can define all the relevant information about an agent. So I'm proposing that this technique be used in combination with erasure, rather than separately.

In so far as both of us want to talk about decision problems where multiple possible options are considered, we need to provide a different interpretation of what decision problems are. Your approach is to allow the selection of inconsistent actions, while I suggest erasing information to provide a consistent situation.

I can agree that there's an interpretational issue, but something is bugging me here which I'm not sure how to articulate. A claim which I would make and which might be somehow related to what's bugging me is: the interpretation issue of a decision problem should be mostly gone when we formally specify it. (There's still a big interpretation issue relating to how the formalization "relates to real cases" or "relates to AI design in practice" etc -- ie, how it is used -- but this seems less related to our disagreement/miscommunication.)

If the interpretation question is gone once a problem is framed in a formal way, then (speaking loosely here and trying to connect with what's bugging me about your framing) it seems like either the formalism somehow forces us to do the forgetting (which strikes me as odd) or we are left with problems which really do involve impossible actions w/o any interpretation issue. I favor the latter.

My response is to argue as per my previous comment that there doesn't seem to be any criteria for determining which inconsistent actions are considered and which ones aren't.

The decision algorithm considers each output from a given set. For example, with proof-based decision theories such as MUDT, it is potentially convenient to consider the case where output is true or false (so that the decision procedure can be thought of as a sentence). In that case, the decision procedure considers those two possibilities. There is no "extract the set of possible actions from the decision problem statement" step -- so you don't run into a problem of "why not output 2? It's inconsistent with the problem statement, but you're not letting that stop you in other cases".

It's a property of the formalism, but it doesn't seem like a particularly concerning one -- if one imagines trying to carry things over to, say, programming a robot, there's a clear set of possible actions even if you know the code may come to reliably predict its own actions. The problem of known actions seems to be about identifying the consequences of actions which you know you wouldn't take, rather than about identifying the action set.
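A minimal sketch of this point, under loud assumptions: `decide` and `estimate_utility` are hypothetical names, and `estimate_utility` is a plain function standing in for whatever reasoning (for example proof search) a real decision procedure would use. The action set is passed in directly, so there is no step that extracts it from the problem statement.

```python
def decide(action_set, estimate_utility):
    """Pick the best action from a fixed, externally supplied action set.

    Returns the chosen action together with a trace of every action that
    was considered, in the order it was considered.
    """
    considered = []
    best_action, best_value = None, float("-inf")
    for action in action_set:
        considered.append(action)
        value = estimate_utility(action)
        if value > best_value:
            best_action, best_value = action, value
    return best_action, considered

# Assumption for illustration: outputs are True/False, with made-up utilities.
choice, trace = decide((True, False), lambda a: 10 if a else 5)
print(choice)  # True
print(trace)   # [True, False]
```

An output such as `2` is never evaluated here simply because it is not in `action_set`, not because of any consistency check against the problem statement.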

I suppose you could respond that I haven't provided criteria for determining what information should be erased, but my approach has the benefit that if you do provide such criteria, logical counterfactuals are solved for free, while it's much more unclear how to approach this problem in the allowing inconsistency approach (although there has been some progress with things like playing chicken with the universe).

I feel like I'm over-stating my position a bit in the following, but: this doesn't seem any different from saying that if we provide a logical counterfactual, we solve decision theory for free. IE, the notion of forgetting has so many free parameters that it doesn't seem like much of a reduction of the problem. You say that a forgetting criterion would solve the problem of logical counterfactuals, but actually it is very unclear how much or how little it would accomplish.

You're at the stage of trying to figure out how agents should make decisions. I'm at the stage of trying to understand what making a good decision even means. Once there is a clearer understanding of what a decision is, we can then write an algorithm to make good decisions or we may discover that the concept dissolves, in which case we will have to specify the problem more precisely. Right now, I'd be perfectly happy just to have a clear criterion by which an external evaluator could say whether an agent made a good decision or not, as that would constitute substantial progress.

I disagree with the 'stage' framing (I wouldn't claim to understand what making a good decision even means; I'd say that's a huge part of the confusion I'm trying to stare at -- for similar reasons, I disagree with your foundations foundations post in so far as it describes what I'm interested in as not being agent foundations foundations), but otherwise this makes sense.

This does seem like a big difference in perspective, and I agree that if I take that perspective, it is better to simply reject problems where the action taken by the agent is already determined (or call them trivial, etc). To me, that the agent itself needs to judge is quite central to the confusion about decisions.

My point was that there isn't any criteria for determining which inconsistent actions are considered and which ones aren't if you are just thrown a complete description of a universe and an agent.

As mentioned earlier, this doesn't seem problematic to me. First, if you're handed a description of a universe with an agent already in it, then you don't have to worry about defining what the agent considers: the agent already considers what it considers (just like it already does what it does). You can look at a trace of the executed decision procedure and read off which actions it considers. (Granted, you may not know how to interpret the code, but I think that's not the problem either of us is talking about.)

But there's another difference here in how we're thinking about decision theory, connected with the earlier-clarified difference. Your version of the 5&10 problem is that a decision theorist is handed a complete specification of the universe, including the agent. The agent takes some action, since it is fully defined, and the problem is that the decision theorist doesn't know how to judge the agent's decision.

(This might not be how you would define the 5&10 problem, but my goal here is to get at how you are thinking about the notion of decision problem in general, not 5&10 in particular -- so bear with me.)

My version of the 5&10 problem is that you give a decision theorist the partially defined universe with the $5 bill on the table and the $10 bill on the table, stipulating that whatever source code the decision theorist chooses for the agent, the agent itself should know the source code and be capable of reasoning about it appropriately. (This is somewhat vague but can be given formalizations such as that of the setting of proof-based DT.) In other words, the decision theorist works with a decision problem which is a "world with a hole in it" (a hole waiting for an agent). The challenge lies in the fact that whatever agent is placed into the problem by the decision theorist, the agent is facing a fully-specified universe with no question marks remaining.

So, for the decision theorist, the challenge presented by the 5&10 problem is to define an agent which selects the 10. (Of course, it had better select the 10 via generalizable reasoning, not via special-case code which fails to do the right thing on other decision problems.) For a given agent inserted into the problem, there might be an issue or no issue at all.

We can write otherwise plausible-looking agents which take the $5, and for which it seems like the problem is spurious proofs; hence part of the challenge for the decision theorist seems to be the avoidance of spurious proofs. But, not all agents face this problem when inserted into the world of 5&10. For example, agents which follow the chicken rule don't have this problem. This means that from the agent's perspective, the 5&10 problem does not necessarily look like a problem of how to think about inconsistent actions.
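The "world with a hole in it" framing can be sketched as a function that is only fully specified once an agent is plugged in. This is a deliberately simplified illustration: `five_and_ten`, `PAYOFF`, `always_five` and `always_ten` are hypothetical names, and the agents here are hard-coded stubs that do not reason about their own source code, so spurious proofs cannot arise in this toy at all.

```python
PAYOFF = {"take_5": 5, "take_10": 10}

def five_and_ten(agent):
    """World with a hole: the universe is fully determined once an agent is supplied."""
    return PAYOFF[agent()]

# Stub agents for illustration only; real candidates would reason from
# generalizable principles rather than hard-coding an answer.
def always_five():
    return "take_5"

def always_ten():
    return "take_10"

# The decision theorist's view: compare candidate agents by the outcome each produces.
candidates = {"always_five": always_five, "always_ten": always_ten}
scores = {name: five_and_ten(agent) for name, agent in candidates.items()}
best = max(scores, key=scores.get)

print(scores)  # {'always_five': 5, 'always_ten': 10}
print(best)    # always_ten
```

From inside any single call to `five_and_ten`, there are no question marks left: the inserted agent returns exactly one action. The comparison across candidates exists only at the decision theorist's level.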

Transparent Newcomb's already comes with the options and counterfactuals attached. My interest is in how to construct them from scratch.

In the framing above, where we distinguish between the view of the decision theorist and the view of the agent, I would say that:

• Often, as is (more or less) the case with Transparent Newcomb, a decision problem as-presented-to-the-decision-theorist does come with options and counterfactuals attached. Then, the interesting problem is usually to design an agent which (working from generalizable principles) recovers these correctly from within its embedded perspective.
• Sometimes, we might write down a decision problem as source code, or in some other formalism. Then, it may not be obvious what the counterfactuals are / should be, even from the decision theorist's perspective. We take something closer to the agent's perspective, having to figure out for ourselves how to reason counterfactually about the problem.
• Sometimes, a problem is given with a full description of its counterfactuals, but the counterfactuals as stated are clearly wrong: putting on our interpret-what-the-counterfactuals-are hats, we come up with an answer which differs from the one given in the problem statement. This means we need to be skeptical of the first case I mentioned, where we think we know what the counterfactuals are supposed to be and we're just trying to get our agents to recover them correctly.

Point being, in all three cases I'm thinking about the problem of how to construct the counterfactuals from scratch -- even the first case where I endorse the counterfactuals as given by the problem. This is only possible because of the distinction I'm making between a problem as given to a decision theorist and the problem as faced by an agent.

The interpretation issue of a decision problem should be mostly gone when we formally specify it

In order to formally specify a problem, you will have already explicitly or implicitly expressed an interpretation of what decision theory problems are. But this doesn't make the question, "Is this interpretation valid?" disappear. If we take my approach, we will need to provide a philosophical justification for the forgetting; if we take yours, we'll need to provide a philosophical justification that we care about the results of these kinds of paraconsistent situations. Either way, there will be further work beyond the formalisation.

The decision algorithm considers each output from a given set... It's a property of the formalism, but it doesn't seem like a particularly concerning one

This ties into the point I'll discuss later about how I think being able to ask an external observer to evaluate whether an actual real agent took the optimal decision is the core problem in tying real world decision theory problems to the more abstract theoretical decision theory problems. Further down you write:

The agent already considers what it considers (just like it already does what it does)

But I'm trying to find a way of evaluating an agent from the external perspective. Here, it is valid to criticise an agent for not selecting an action that it didn't consider. Further, it isn't always clear what actions are "considered", as not all agents might have a loop over all actions and they may use shortcuts to avoid explicitly evaluating a certain action.

I feel like I'm over-stating my position a bit in the following, but: this doesn't seem any different from saying that if we provide a logical counterfactual, we solve decision theory for free

"Forgetting" has a large number of free parameters, but so does "deontology" or "virtue ethics". I've provided some examples and key details about how this would proceed, but I don't think you can expect too much more at this very preliminary stage. When I said that a forgetting criterion would solve the problem of logical counterfactuals for free, this was a slight exaggeration. We would still have to justify why we care about raw counterfactuals, but, since raw counterfactuals are actually consistent, this would seem to be a much easier task than arguing that we should care about what happens in the kind of inconsistent situations generated by paraconsistent approaches.

I disagree with your foundations foundations post in so far as it describes what I'm interested in as not being agent foundations foundations

I actually included the Smoking Lesion Steelman (https://www.alignmentforum.org/s/fgHSwxFitysGKHH56/p/5bd75cc58225bf0670375452) as Foundations Foundations research. And CDT=EDT is pretty far along in this direction as well (https://www.alignmentforum.org/s/fgHSwxFitysGKHH56/p/x2wn2MWYSafDtm8Lf), although in my conception of what Foundations Foundations research should look like, more attention would have been paid to the possibility of the EDT graph being inconsistent, while the CDT graph was consistent.

Your version of the 5&10 problem... The agent takes some action, since it is fully defined, and the problem is that the decision theorist doesn't know how to judge the agent's decision.

That's exactly how I'd put it. Except I would say I'm interested in the problem from the external perspective and the reflective perspective. I just see the external perspective as easier to understand first.

From the agent's perspective, the 5&10 problem does not necessarily look like a problem of how to think about inconsistent actions

Sure. But the agent is thinking about inconsistent actions beneath the surface which is why we have to worry about spurious counterfactuals. And this is important for having a way of determining if it is doing what it should be doing. (This becomes more important in the edge cases like Troll Bridge - https://agentfoundations.org/item?id=1711)

My interest is in how to construct them from scratch

Consider the following types of situations:

1) A complete description of a world, with an agent identified

2) A theoretical decision theory problem viewed by an external observer

3) A theoretical decision theory problem viewed reflectively

I'm trying to get from 1->2, while you are trying to get from 2->3. Whatever formalisations we use need to ultimately relate to the real world in some way, which is why I believe that we need to understand the connection from 1->2. We could also try connecting 1->3 directly, although that seems much more challenging. If we ignore the link from 1->2 and focus solely on a link from 2->3, then we will end up implicitly assuming a link from 1->2 which could involve assumptions that we don't actually want.

Sounds like the disagreement has mostly landed in the area of questions of what to investigate first, which is pretty firmly "you do you" territory -- whatever most improves your own picture of what's going on, that is very likely what you should be thinking about.

On the other hand, I'm still left feeling like your approach is not going to be embedded enough. You say that investigating 2->3 first risks implicitly assuming too much about 1->2. My sketchy response is that what we want in the end is not a picture which is necessarily even consistent with having any 1->2 view. Everything is embedded, and implicitly reflective, even the decision theorist thinking about what decision theory an agent should have. So, a firm 1->2 view can hurt rather than help, due to overly non-embedded assumptions which have to be discarded later.

Using some of the ideas from the embedded agency sequence: a decision theorist may, in the course of evaluating a decision theory, consider a lot of #1-type situations. However, since the decision theorist is embedded as well, the decision theorist does not want to assume realizability even with respect to their own ontology. So, ultimately, the decision theorist wants a decision theory to have "good behavior" on problems where no #1-type view is available (meaning some sort of optimality for non-realizable cases).

I really appreciate "Here's a collection of a lot of the work that has been done on this over the years, and important summaries" type posts. Thanks for writing this!

I should note: This is my own idiosyncratic take on Logical Counterfactuals, with many of the links referring to my own posts and I don't know if I've convinced anyone else of the merits of this approach yet.

• How does the forgetting approach differ from an updateless approach (if it is supposed to)?
• Why do you think there is a good way to determine which information should be forgotten in a given problem, aside from hand analysis? (Hand analysis utilizes the decision theorist's perspective, which is an external perspective the agent lacks.)

UDT* provides a decision theory given a decision tree and a method of determining subjunctive links between choices. I'm investigating how to determine these subjunctive links, which requires understanding what kind of thing a counterfactual is and what kind of thing a decision is. The idea is that any solution should naturally integrate with UDT.

Firstly, even if this technique were limited to hand analysis, I'd be quite pleased if this turned out to be a unifying theory behind our current intuitions about how logical counterfactuals should work. Because if it were able to cover all or even just most of the cases, we'd at least know what assumptions we were implicitly making and it would provide a target for criticism. Different subtypes of forgetting might be able to be identified; it wouldn't surprise me if it turns out that the concept of a decision actually needs to be dissolved.

Secondly, even if there doesn't turn out to be a good way to figure out what information should be forgotten, I expect that figuring out different approaches would prove insightful, as would discovering why there isn't a good way to determine what to forget, if this is indeed the case.

But, to be honest, I've not spent much time thinking about how to determine what information should be forgotten. I'm still currently in the stage of trying to figure out whether this might be a useful research direction.

*Perhaps there are other updateless approaches, but I don't know of any except TDT, which is generally considered inferior.

Some people say this fails to account for the agent in the simulator, but it's entirely possible that Omega may be able to figure out what action you will take based on high level reasoning, as opposed to having to run a complete simulation of you.

Unless you are the simulation?

In so far as the paraconsistent approach may be more convenient from an implementation perspective than the first, we can justify it by tying it to raw counterfactuals.

Like one might justify deontology in terms of consequentialism?

However, when "you" and the environment are defined down to the atom, you can only implement one decision.

Does QM enable 'true randomness' (generators)?

They fail to realize that they can't actually "change" their decision as there is a single decision that they will inevitably implement.

Or they fail to realize others can change their minds.

When “you” is defined down to the atom, you can only implement one decision.

Once again: physical determinism is not a fact.

I'm confused, I'm claiming determinism, not indeterminism

That was a typo, although actually neither is a fact.

What?