A lightning-fast recap of counterfactuals in decision theory: generally when facing a decision problem, a natural approach is to consider questions of the form "what would happen if I took action ?". This particular construction has ambiguity in the use of the word "I" to refer to both the real and counterfactual agent, which is the root of the 5-and-10 problem. The Functional Decision Theory of Yudkowsky and Soares (2017) provides a more formal version of this construction, but suffers from (a more explicit version of) the same ambiguity by taking as its counterfactuals logical statements about its own decision function, some of which must necessarily be false. This poses an issue for an embedded agent implementing FDT, which, knowing itself to be part of its environment, might presumably therefore determine the full structure of its own decision function, and subsequently run into trouble when it can derive a contradiction from its counterlogical assumptions.

Soares and Levenstein (2017) explore this issue but defer it to future development of "a non-trivial theory of logical counterfactuals" (p. 1). I've been trying to explore a different approach: replacing the core counterfactual question of "what if my algorithm had a different output?" with "what if an agent with a different algorithm were in my situation?". Yudkowsky and Soares (2017) consider a related approach, but dismiss it:

In attempting to avoid this dependency on counterpossible conditionals, one might suggest a variant FDT′ that asks not “What if my decision function had a different output?” but rather “What if I made my decisions using a different decision function?” When faced with a decision, an FDT′ agent would iterate over functions from some set , consider how much utility she would achieve if she implemented that function instead of her actual decision function, and emulate the best . Her actual decision function is the function that iterates over , and .

However, by considering the behavior of FDT′ in Newcomb’s problem, we see that it does not save us any trouble. For the predictor predicts the output of , and in order to preserve the desired correspondence between predicted actions and predictions, FDT′ cannot simply imagine a world in which she implements instead of ; she must imagine a world in which all predictors of predict as if behaved like —and then we are right back to where we started, with a need for some method of predicting how an algorithm would behave if (counterpossibly) behaved different from usual. (p. 7)

I think it should be possible to avoid this objection by eliminating from consideration entirely. By only considering a separate agent that implements directly, we don't need to imagine predictors of attempting to act as though were - we can just imagine predictors of instead.

More concretely, we can consider an embedded agent , with a world-model of type (a deliberately-vague container for everything the agent believes about the real world, constructed by building up from sense-data and axiomatic background assumptions by logical inference), and a decision function of type (interpreted as "given my current beliefs about the state of the world, what action should I take?"). Since is embedded, we can assume contains (as a background assumption) a description of , including the world-model (itself) and the decision function . For a particular possible action , we can simultaneously construct a counterfactual world-model and a counterfactual decision function such that maps to , and otherwise invokes (with some diagonalisation-adjacent trickery to handle the necessary self-references), and is the result of replacing all instances of in with . is a world-model similar to , but incrementally easier to predict, since instead of the root agent , it contains a counterfactual agent with world-model and decision function , which by construction immediately takes action . Given further a utility function of type , we can complete a description of as "return the that maximizes ".
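To make the shape of this construction more concrete, here is a minimal toy sketch in Python, under heavy simplifying assumptions. All names (`WorldModel`, `constant_agent`, `decide`, `ACTIONS`) are hypothetical and chosen for illustration only. The world-model is reduced to an opaque Newcomb's problem, utility is just the dollar payoff, and the counterfactual decision function is simplified to a constant function rather than one that falls back to the root decision function on other inputs, so the self-reference trickery is skipped entirely. The toy also ignores any uncertainty about being simulated.

```python
from dataclasses import dataclass, replace
from typing import Callable

Action = str
DecisionFn = Callable[["WorldModel"], Action]

ACTIONS = ("one-box", "two-box")


@dataclass(frozen=True)
class WorldModel:
    # The description of the embedded agent, reduced here to its decision function.
    agent: DecisionFn

    def outcome(self) -> int:
        # Toy Newcomb's problem: the predictor simulates the embedded agent,
        # then the embedded agent makes its actual choice.
        prediction = self.agent(self)
        choice = self.agent(self)
        opaque_box = 1_000_000 if prediction == "one-box" else 0
        return opaque_box if choice == "one-box" else opaque_box + 1_000


def constant_agent(action: Action) -> DecisionFn:
    # Simplified counterfactual decision function: always returns `action`.
    return lambda _world: action


def decide(world: WorldModel) -> Action:
    # Root decision function: for each action, swap the embedded agent for the
    # counterfactual agent and score the resulting counterfactual world-model.
    def value(action: Action) -> int:
        counterfactual_world = replace(world, agent=constant_agent(action))
        return counterfactual_world.outcome()

    return max(ACTIONS, key=value)


real_world = WorldModel(agent=decide)
print(decide(real_world))
```

Note that `decide` never has to evaluate `real_world.outcome()`, which would require the predictor to simulate `decide` itself; it only evaluates world-models whose embedded agent returns its action immediately.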

I have a large number of words of draft-thoughts exploring this idea in more detail, including a proper sketch of how to construct and , and an exploration of how it seems to handle various decision problems. In particular seems to give an interesting response to Newcomb-like problems (particularly the transparent variant), where it is forced to explicitly consider the possibility that it might be being simulated, and prescribes only a conditional one-boxing in the presence of various background beliefs about why it might be being simulated (and is capable of two-boxing without self-modification if it believes itself to be in a perverse universe where two-boxers predictably do better than one-boxers). But that seems mostly secondary to the intended benefit, which is that none of the reasoning about the counterfactual agents requires counterlogical assumptions - the belief that returns action needn't be assumed, but can be inferred by inspection since is a fully defined function that could literally be written down if required.

Before I try to actually turn those words into something postable, I'd like to make sure (as far as I can) that it's not an obviously flawed idea / hasn't already been done before / isn't otherwise uninteresting, by running it past the community here. Does anyone with expertise in this area have any comments on the viability of this approach?

The problem of logical counterfactuals is exactly about what it means to "replace all instances" of a subprogram in a larger program, when the subprogram can occur in multiple places, in modified but logically equivalent forms, etc.

I agree that that is a problem that both this approach to counterfactuals and the FDT logical-counterfactual approach share. The particular problem I was hoping this approach avoids was the one of assuming mutually-exclusive logical facts, such that all-but-one of them must necessarily be false, and the implications this has for the agent's consistency and reasoning about its actions. Are you saying that they are the same problem, that the second problem is comparatively not worth solving, or something else?

I'm saying they are the same problem. These sentences are logically equivalent:

"A() = 1 provably implies B() = 2"

"There's a program C provably equivalent to B such that, if we replace all exact instances of A inside C with 1, the resulting program will return 2"

I think my core issue with the above is the nature of the specification of the problem of "replacing A in B". Allowing passing to an arbitrary equivalent program C before replacing exact instances of A seems overly permissive, and to allow in exactly the kind of principle-of-explosion issue that logical counterfactuals have. Suppose for instance that A and B both halt with a defined value, say A()=a. As framed above, I can take C to be a program that computes B′()+x(A()−a) (for some x), where B′ is the result of substituting all exact instances of A in B with any program returning a that is not identically A. Then C is (provably) equivalent to B, but if a≠1 I can ensure that the substituted version of C returns any value I like just by varying x. This doesn't say anything interesting about the appearance of A in B - in fact, the replacement of B with B′ is exactly to ensure that the main substitution is a no-op.
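A small Python sketch of this construction, with made-up concrete values: here a = 2, the toy B adds 5 to its subprogram's result, and the names `make_B`, `make_C`, `a_copy`, `const_one` and `x_for_target` are all hypothetical. "Replacing exact instances of A" is modelled by passing a different function in as the subprogram, and "provably equivalent" is only illustrated by the outputs agreeing, not actually proved.

```python
from typing import Callable

Program = Callable[[], int]

A_VALUE = 2  # the value a returned by A; chosen so that a != 1


def A() -> int:
    return A_VALUE


def a_copy() -> int:
    # Returns a, but is a different program from A.
    return A_VALUE


def const_one() -> int:
    # The constant program that A is replaced with.
    return 1


def make_B(sub: Program) -> Program:
    # Toy B: calls a single subprogram and adds 5 to its result.
    def B() -> int:
        return sub() + 5
    return B


def make_C(sub: Program, x: int) -> Program:
    # C() = B'() + x * (sub() - a), where B' no longer mentions A at all.
    b_prime = make_B(a_copy)

    def C() -> int:
        return b_prime() + x * (sub() - A_VALUE)
    return C


def x_for_target(target: int) -> int:
    # After substitution C returns B'() + x * (1 - a); solve that for x.
    return (target - make_B(a_copy)()) // (1 - A_VALUE)


B = make_B(A)
for target in (2, 0, -40, 1234):
    x = x_for_target(target)
    assert make_C(A, x)() == B()               # C agrees with B before substitution
    assert make_C(const_one, x)() == target    # after substitution, anything goes
```

With a = 2 the divisor 1 - a is -1, so every integer target is reachable; for other values of a the same idea works over the rationals.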

I feel like a statement that more accurately corresponds to the substitution I'm trying to do above would be something like "There's a program C that can be obtained from B by replacing subprograms that are provably equivalent to A with the constant program returning 1, such that C() = 2". For one thing, this appears to have the useful property that if B makes no mention of A or anything equivalent to it, then I can't stop such a C from being equivalent (in fact, identical) to B no matter what the value of A() is. Does this still have the same problem?
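As a rough illustration of that restricted substitution, here is a toy Python sketch over a tiny expression language. Everything in it is an assumption made for the example: the `Const`/`Call`/`Add` node types, the `substitute` and `evaluate` helpers, and in particular the `is_equiv_to_A` predicate, which stands in for "provably equivalent to A" and is implemented here as a purely syntactic check. A real equivalence test is exactly the hard part and is not addressed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union


@dataclass(frozen=True)
class Const:
    value: int


@dataclass(frozen=True)
class Call:
    name: str  # a call to a named zero-argument program


@dataclass(frozen=True)
class Add:
    left: "Expr"
    right: "Expr"


Expr = Union[Const, Call, Add]


def substitute(expr: Expr, is_equiv_to_A: Callable[[Expr], bool], value: int) -> Expr:
    # Replace every subprogram the predicate accepts with a constant.
    if is_equiv_to_A(expr):
        return Const(value)
    if isinstance(expr, Add):
        return Add(substitute(expr.left, is_equiv_to_A, value),
                   substitute(expr.right, is_equiv_to_A, value))
    return expr


def evaluate(expr: Expr, env: Dict[str, Callable[[], int]]) -> int:
    if isinstance(expr, Const):
        return expr.value
    if isinstance(expr, Call):
        return env[expr.name]()
    return evaluate(expr.left, env) + evaluate(expr.right, env)


# Stand-in for a provable-equivalence oracle: a syntactic match on A only.
def syntactic_match(expr: Expr) -> bool:
    return expr == Call("A")


env = {"A": lambda: 2, "D": lambda: 10}
b_with_a = Add(Call("A"), Const(5))
b_without_a = Add(Call("D"), Const(5))

c1 = substitute(b_with_a, syntactic_match, 1)
c2 = substitute(b_without_a, syntactic_match, 1)
print(evaluate(c1, env), c2 == b_without_a)
```

In this toy, a B that never mentions A comes back from `substitute` unchanged, which is the property described above.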

Let's say the world is a grid of Game of Life cells evolving in discrete time. What counts as a "subprogram"?

Let's further say that world contains embedded computers that run programs. Will your notion of "subprogram", specified purely in terms of Game of Life cells, be strong enough to capture subprograms of these programs?

Again, I agree that the problem of identifying what logical structures (wherever they occur) count as implementing a particular function is a deep and interesting one, and not one that I am claiming to have solved. But again, I do not agree that it is a problem I have introduced. An FDT agent correctly inferring the downstream causal results of setting FDT(P̲, G̲)=a would, in general, have to identify FDT(P̲, G̲) being computed inside a Game of Life simulation, if and where such a calculation occurred.

While I am indeed interested in exploring the answers to your questions, I don't see that they represent a specific challenge to the idea that the above notion of counterfactuals might be worth exploring further, in the way that your original claim would.

The issue of so-called logical counterfactuals has been discussed here and on the alignment forum quite a few times, including a bunch of posts by Chris_Leong, and at least one by yours truly. Consider browsing through them before embarking on original research:

https://www.google.com/search?q=logical+counterfactuals+site:lesswrong.com

I have indeed read many of those posts already (though I appreciate that some reference to them in the original post would have been sensible, and I apologise). Chris_Leong's Deconfusing Logical Counterfactuals comes pretty close to this - the counterfactual model I'm interested in corresponds to their notion of "Raw Counterfactual", but AFAICT they're going in a somewhat different direction with the notion of "erasure" (I don't think it should be necessary to forget that you've seen a full box in the transparent variant of Newcomb's problem, if you explicitly consider that you might be in Omega's simulation), so they haven't followed the notion through to a full description of what a raw-counterfactual-based decision theory would look like. I can't find any further discussion of the idea, and Chris_Leong now seems to be walking back on it (that's part of the reason I'm asking the question).

I suspect the real underlying issue is that of free will: all decision theories assume we can make different decisions in the EXACT SAME circumstances, whereas from what we understand about the physical world, there is no such thing, and the only non-dualist proposal on the table is that of Scott Aaronson's freebits. I wrote a related post last year. We certainly do have a very realistic illusion of free will, to the degree where any argument to the contrary tends to be rejected, ignored, strawmanned or misinterpreted. If you read through the philosophical writings on compatibilism, people keep talking past each other all the time, never getting to the crux of their disagreement. Not that it (or anything else) matters in the universe where there is no freedom of choice, anyway.

That contains two falsehoods.

Physics has not settled the issue of determinism versus indeterminism. See https://philpapers.org/rec/EARDWW

Scott Aaronson is not the first person in history to propose a nondualistic theory of free will.