(Disclaimer: This post was written as part of the CFAR/MIRI AI Summer Fellows Program, and as a result is a vocalisation of my own thought process rather than a concrete or novel proposal. However, it is an area of research I shall be pursuing in the coming months, and so ought to lead to the germination of more ideas later down the line. Credit goes to Abram Demski for inspiring this particular post, and to Scott Garrabrant for encouraging the AISFP participants to actually post something.)
When reading Soares' and Levinstein's recent publication on decision theory, Cheating Death in Damascus, I was struck by one particular remark:
Unfortunately for us, there is as yet no full theory of counterlogicals [...], and for [functional decision theory (FDT)] to be successful, a more worked out theory is necessary, unless some other specification is forthcoming.
For some background: an FDT agent must consider counterfactuals about what would happen if its decision algorithm on a given input were to output a particular action. If the algorithm is deterministic (or the action under consideration is merely outside of the range of possible outputs of the agent), then it is a logical impossibility that the algorithm produces a different output. Hence the initial relevance of logical counterfactuals or counterlogicals: counterfactuals whose antecedents represent a logical impossibility.
My immediate thought about how to tackle the problem of finding a full theory of counterlogicals was to find the right logical system that captures counterlogical inference in a way adequate to our demands. For example, Soares and Fallenstein show in Toward Idealized Decision Theory that the principle of explosion (from a contradiction, anything can be proved) leads to problematic results for an FDT solution. Why not simply posit that our decision theory uses paraconsistent logic in order to block some inferences?
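For concreteness, the principle of explosion can be stated in a couple of lines of Lean 4. This is only an illustration of the classical inference that a paraconsistent logic would block; the theorem name `explosion` is my own label and is not taken from either paper.

```lean
-- Principle of explosion: from p together with ¬p, an arbitrary q follows.
-- `absurd` is the standard library term of type a → ¬a → b.
theorem explosion (p q : Prop) (hp : p) (hnp : ¬p) : q :=
  absurd hp hnp
```

In a counterlogical setting, `p` would play the role of "the algorithm outputs an action it does not in fact output", which is why unrestricted classical inference makes every consequent provable.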
Instead, and to my surprise, Soares and Levinstein appear to be more concerned about finding a semantics for logical counterfactuals - loosely speaking, they are looking for a uniform interpretation of such statements which shows us what they really mean. This is what I infer from reading the papers they reference on theories of counterlogicals, such as Bjerring's Counterpossibles, which tries to extend Lewis' possible-worlds semantics for counterfactuals to cases of logical counterfactuals.
The approaches Soares and Levinstein make reference to do not suffice for their purposes. However, this is not because they give proof-theoretic considerations a wide berth, as I thought previously; I now believe that there is something that the semantic approach does get right.
Suppose that we found some paraconsistent logical system that permitted precisely the counterlogical inferences that agreed with our intuition: would this theory of logical counterfactuals satisfy the demands of a fully-specified functional decision theory? I would argue not. Specifically, this approach seems to give us a merely a posteriori, ad hoc justification for our theory - given that we are working towards an idealised decision theory, we ought to demand that the theory that supports it is fully motivated. In this case, this cashes out as making sure our counterlogical statements and inferences have a meaning - a meaning we can appeal to in order to settle the truth-value of these counterlogicals.
This is not to say that investigating different logical systems is entirely futile: indeed, if the consideration above were true and a paraconsistent logic could fit the bill, then a uniform semantics for the system would serve as the last piece of the puzzle.
In the next year, I would like to investigate the problem of finding a full theory of logical counterfactuals, such that it may become a tool to be applied to a functional decision theory. This will, of course, involve finding the logical system that best captures our own reasoning about logical counterfactuals. Nonetheless, I will now also seek to find an actual motivation for whichever system seems the most promising, and the best way to find this motivation will be through finding an adequate semantics. I would welcome any suggestions in the comments about where to start looking for such a theory, or if any avenues have thus far proved to be more or less promising.
My guess is that finding a fully satisfactory solution is hopeless, in much the same way as with specifying aligned goals (i.e. no solution is in closed form, without reference to human-derived systems doing decision theory/axiology).
A crucial problem is finding how an agent's decisions influence a given situation, but that situation can include things that reason approximately about the agent, and worse, things that reason about different but similar agents. An agent's decision influences not just precise predictions of itself, but also approximate (and sometimes incorrect) guesses about it, and approximate guesses about similar decisions of similar agents. Judging how a decision influences a system that wrongly guesses the decision of a similar but different agent seems "arbitrary" in the same way as human goals are "arbitrary": that is, not arbitrary at all, but in practice not possible to express without reference to the philosophy of human-derived things.
Another practical solution might be to characterize a class of situations where decision theory is mostly clear, and make sure to keep the world that way until more general decision theory is developed. This direction can benefit from more general decision theories, but they won't be "fully general", just describe more situations or understand the familiar situations better. (See also.)
(Disclaimer: There's a good chance you've already thought about this.)
In general, if you want to understand a system (construal of meaning), forming a model of the output of that system (truth-conditions and felicity judgements) is very helpful. So if you're interested in understanding how counterfactual statements are interpreted, I think the formal semantics literature is the right place to start (try digging through the references here, for example).
The reason we want a description of counterfactuals is to allow for a model of the world where we plug in counterfactual actions and get back the expected outcome, allowing us to choose between actions/strategies. Counterfactuals don't have any reality outside of how we think about them.
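As a rough sketch of what "plug in counterfactual actions and get back the expected outcome" could look like, here is a toy Python version. Everything in it is an assumption made for illustration: the names `expected_utility`, `choose` and `toy_world`, the two actions, and the probabilities and utilities are all made up, and the hard part (where the world model's counterfactuals come from) is simply stipulated.

```python
from typing import Callable, List, Tuple

# One possible outcome of an action: (probability, utility).
Outcome = Tuple[float, float]
# A world model maps a counterfactual action to a distribution over outcomes.
WorldModel = Callable[[str], List[Outcome]]


def expected_utility(outcomes: List[Outcome]) -> float:
    """Probability-weighted sum of utilities."""
    return sum(p * u for p, u in outcomes)


def choose(actions: List[str], world_model: WorldModel) -> str:
    """Return the action whose counterfactual has the highest expected utility."""
    return max(actions, key=lambda a: expected_utility(world_model(a)))


def toy_world(action: str) -> List[Outcome]:
    """Hypothetical world model with made-up numbers."""
    if action == "left":
        return [(0.9, 10.0), (0.1, 0.0)]
    if action == "right":
        return [(0.5, 12.0), (0.5, 2.0)]
    raise ValueError(f"unknown action: {action}")


best = choose(["left", "right"], toy_world)
```

The point of the sketch is only that the counterfactual lives entirely inside `toy_world`: swapping in a different model changes which action looks best, without any appeal to an external fact about what "would" have happened.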
Thus, the motivation for an improvement to the causal-intervention model of counterfactuals is not that it should correspond to some external reality, but that it should help reach good outcomes. We can still try to make a descriptive model of how humans do logical counterfactual reasoning, but our end goal should be to understand why something like that actually leads to good outcomes.
It's important to note that it's okay to use human reasoning to validate something that is supposedly not just a descriptive model of human reasoning. Sure, it creates selection bias, but what other reasoning are we going to use? See Neurath's Boat (improving philosophy is like rebuilding a boat piece by piece while adrift at sea) and Ironism (awareness and acceptance of the contingency of our beliefs).
In the end, I suspect that what counts as a good model for predicting outcomes of actions will vary strongly depending on the environment. See related rambling by me, particularly part 4. This is more related to Scott Garrabrant's logical inductors in hindsight than it was in any kind of foresight.