Optimal and Causal Counterfactual Worlds


Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Let $L$ denote the language of Peano arithmetic. A (counterfactual) world is any subset of $L$. These worlds need not be consistent. Let $\mathcal{W}$ denote the set of all worlds. The actual world $W_\mathbb{N}$ is the world consisting of all sentences that are true about $\mathbb{N}$.

Consider the function $C : L \to \mathcal{W}$ which sends the sentence $\phi$ to the world we get by "correctly" counterfactually assuming $\phi$. The function $C$ is not formally defined, because we do not yet have a satisfactory theory of logical counterfactuals.

Hopefully we all agree that $\phi \in C(\phi)$ and $\phi \in W_\mathbb{N} \Rightarrow C(\phi) = W_\mathbb{N}$.

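Given an (infinite) directed acyclic graph $G$, and a map $v$ from sentences to vertices of $G$, we say that $C$ is consistent with $G$ and $v$ if $\phi \in C(\phi)$ for all $\phi$, and whenever $C(\phi)$ and $W_\mathbb{N}$ disagree on a sentence $\psi$ there must exist some causal chain $\phi_0, \ldots, \phi_n$ such that:

  1. $\phi_0 = \phi$,

  2. $\phi_n = \psi$,

  3. $C(\phi)$ and $W_\mathbb{N}$ disagree on every $\phi_i$, and

  4. $v(\phi_i)$ is a parent of $v(\phi_{i+1})$.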

These conditions give a kind of causal structure in which every difference between $W_\mathbb{N}$ and $C(\phi)$ must propagate through the graph $G$.
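On a finite toy model, the consistency condition can be checked mechanically. The following is a minimal sketch under stated assumptions, not part of the original argument: the three-sentence "language", the names `ACTUAL`, `C`, `v`, `G`, and `is_consistent` are all hypothetical, and a chain of length zero is assumed to be acceptable when the disagreeing sentence is the assumed sentence itself.

```python
from collections import deque

# Hypothetical toy model: "t" is the only true sentence.
ACTUAL = frozenset({"t"})                      # stand-in for the actual world
C = {"t": ACTUAL, "p": frozenset({"p", "q"}), "q": frozenset({"q"})}
v = {"t": "N", "p": "P", "q": "Q"}             # sentence -> vertex
G = {"P": {"Q", "N"}, "Q": {"N"}, "N": set()}  # vertex -> set of children


def is_consistent(C, G, v, actual):
    """Check the causal-chain condition for every sentence in C."""
    for phi, world in C.items():
        if phi not in world:
            return False
        disagree = world ^ actual  # sentences on which the two worlds differ
        if not disagree:
            continue
        # Breadth-first search from phi, moving only through disagreeing
        # sentences and only along parent -> child edges of G.
        reached = set()
        if phi in disagree:
            reached.add(phi)
            queue = deque([phi])
            while queue:
                cur = queue.popleft()
                for nxt in disagree - reached:
                    if v[nxt] in G[v[cur]]:
                        reached.add(nxt)
                        queue.append(nxt)
        if reached != disagree:
            return False
    return True
```

In this toy model every disagreement is reachable, so `is_consistent(C, G, v, ACTUAL)` holds; deleting the edge from `"Q"` to `"N"` breaks the chain from `"q"` to `"t"`.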

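Given a function $f : \mathcal{W} \to \mathbb{R}$, we say that $C$ optimizes $f$ if for all $\phi$ and all $W \neq C(\phi)$ with $\phi \in W$ we have $f(C(\phi)) > f(W)$.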
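For intuition, here is a small sketch that checks the optimization condition by brute force over every world of a finite toy language. Everything in it is an illustrative assumption: the sentence names, the scores, and the helper names `all_worlds` and `optimizes` are hypothetical, and the real setting has infinitely many sentences.

```python
from itertools import chain, combinations

SENTENCES = ("t", "p", "q")   # hypothetical toy "language"
ACTUAL = frozenset({"t"})     # stand-in for the actual world: only "t" is true

# Toy counterfactual map: each sentence belongs to its own world.
C = {"t": ACTUAL, "p": frozenset({"p", "q"}), "q": frozenset({"q"})}


def all_worlds(sentences):
    """Every subset of the sentence set, as frozensets."""
    subsets = chain.from_iterable(
        combinations(sentences, k) for k in range(len(sentences) + 1)
    )
    return [frozenset(s) for s in subsets]


def f(W):
    """Hypothetical score: worlds outside the image of C get 0."""
    scores = {ACTUAL: 3, frozenset({"q"}): 2, frozenset({"p", "q"}): 1}
    return scores.get(W, 0)


def optimizes(C, f, sentences):
    """True iff C[phi] strictly beats every other world containing phi."""
    for phi in sentences:
        for W in all_worlds(sentences):
            if phi in W and W != C[phi] and not f(C[phi]) > f(W):
                return False
    return True
```

With these scores `optimizes(C, f, SENTENCES)` holds, while a score such as `len` fails because larger worlds containing `"q"` outrank `C["q"]`.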

Many approaches to logical counterfactuals can be described either as choosing the optimal world (under some function) in which $\phi$ is true or observing the causal consequences of setting $\phi$ to be true. The purpose of this post is to prove that these frameworks are actually equivalent, and to provide a strategy for possibly showing that no attempt at logical counterfactuals which could be described within either framework could ever be what we mean by "correct" logical counterfactuals.

A nontrivial cycle in $C$ is a list of sentences $\phi_0, \ldots, \phi_n$, such that $\phi_0 = \phi_n$, $\phi_{i+1} \in C(\phi_i)$ for all $i < n$, and the worlds $C(\phi_i)$ are not all the same.
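On a finite toy model, nontrivial cycles can be searched for directly. The sketch below assumes the step relation runs from a sentence to any sentence contained in its counterfactual world, and uses the observation that a nontrivial cycle exists exactly when two mutually reachable sentences have different worlds. The function name and both example maps are hypothetical.

```python
def has_nontrivial_cycle(C):
    """Detect a nontrivial cycle in a finite counterfactual map C."""
    sentences = list(C)
    # reach[a]: sentences reachable from a via steps phi -> psi, psi in C[phi].
    reach = {a: {b for b in sentences if b in C[a]} for a in sentences}
    changed = True
    while changed:  # naive transitive closure, fine for toy sizes
        changed = False
        for a in sentences:
            new = set().union(*(reach[b] for b in reach[a])) - reach[a]
            if new:
                reach[a] |= new
                changed = True
    # Two mutually reachable sentences with different worlds form a cycle.
    return any(
        b in reach[a] and a in reach[b] and C[a] != C[b]
        for a in sentences
        for b in sentences
    )


# Hypothetical examples.
C_ACYCLIC = {
    "t": frozenset({"t"}),
    "p": frozenset({"p", "q"}),
    "q": frozenset({"q"}),
}
C_CYCLIC = {
    "p": frozenset({"p", "q"}),
    "q": frozenset({"p", "q", "r"}),
    "r": frozenset({"r"}),
}
```

In `C_CYCLIC` the sentences `"p"` and `"q"` each appear in the other's world, yet the two worlds differ, which is the simplest shape a nontrivial cycle can take.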

Given a partial order $\prec$ on $\mathcal{W}$, we say that $C$ optimizes $\prec$ if for all $\phi$ and all $W \neq C(\phi)$ with $\phi \in W$ we have $W \prec C(\phi)$.

Our main result is that the following are equivalent:

  1. $C$ optimizes $f$ for some function $f$.

  2. $C$ is consistent with $G$ and $v$ for some DAG $G$ and map $v$.

  3. $C$ has no nontrivial cycles.

  4. $C$ optimizes $\prec$ for some partial order $\prec$.


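$1 \Rightarrow 2$: Construct the graph $G$ with a vertex for every world in the image of $C$. The map $v$ sends $\phi$ to the vertex associated with $C(\phi)$. Insert an edge to the vertex associated with $W_\mathbb{N}$ from every other vertex. Insert an edge from the vertex associated with $W_1$ to the vertex associated with $W_2$ whenever $f(W_1) < f(W_2)$. Clearly $\phi \in C(\phi)$ for all $\phi$.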

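Assume $C(\phi)$ and $W_\mathbb{N}$ disagree on a sentence $\psi$. If $\psi \in W_\mathbb{N}$ then $C(\psi) = W_\mathbb{N}$, so $v(\psi)$ is the vertex associated with $W_\mathbb{N}$, which is a child of every other vertex. Therefore you get a length 1 path from $v(\phi)$ to $v(\psi)$. Otherwise, $\psi \notin W_\mathbb{N}$, so $\psi \in C(\phi)$. In this case, note that since $\psi$ is in both $C(\phi)$ and $C(\psi)$, and these worlds are not the same, it must be that $f(C(\phi)) < f(C(\psi))$. Again, this means that you get a length 1 path from $v(\phi)$ to $v(\psi)$. Therefore $C$ is consistent with $G$ and $v$.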
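The graph construction in this step can be sketched on a finite toy model. This is only an illustration under assumptions: vertices are identified with the worlds themselves, and the toy map `C`, the score `f`, and the helper `build_graph` are hypothetical names.

```python
ACTUAL = frozenset({"t"})  # stand-in for the actual world
C = {"t": ACTUAL, "p": frozenset({"p", "q"}), "q": frozenset({"q"})}


def f(W):
    """Hypothetical score that the toy map C optimizes."""
    scores = {ACTUAL: 3, frozenset({"q"}): 2, frozenset({"p", "q"}): 1}
    return scores.get(W, 0)


def build_graph(C, f, actual):
    """Return (G, v), where G maps each image world to its set of children."""
    image = set(C.values())
    G = {W: set() for W in image}
    for W1 in image:
        for W2 in image:
            # Edge into the actual world from every other vertex, and
            # from lower-scoring worlds to higher-scoring ones.
            if W1 != W2 and (W2 == actual or f(W1) < f(W2)):
                G[W1].add(W2)
    v = dict(C)  # each sentence maps to the vertex of its own world
    return G, v
```

Here the vertex for `{"p", "q"}` gets edges to both `{"q"}` and the actual world, and the actual world has no children, so the result is acyclic in this example.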

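$2 \Rightarrow 3$: Consider a nontrivial cycle $\phi_0, \ldots, \phi_n$. If any of these sentences were in $W_\mathbb{N}$, then they would all be true in $W_\mathbb{N}$, since $\phi_{i+1} \in W_\mathbb{N}$ whenever $\phi_i \in W_\mathbb{N}$. Every $C(\phi_i)$ would then equal $W_\mathbb{N}$, which would contradict the fact that $\phi_0, \ldots, \phi_n$ is a nontrivial cycle.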

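Otherwise, since $\phi_{i+1} \in C(\phi_i)$, there must be a path from $v(\phi_i)$ to $v(\phi_{i+1})$ in $G$, since $C(\phi_i)$ and $W_\mathbb{N}$ differ on $\phi_{i+1}$. Concatenating these paths together would give a cycle in $G$ unless the $v(\phi_i)$ are all the same vertex. However, if all of the $v(\phi_i)$ are the same vertex, then all of the $C(\phi_i)$ would be the same world, which would contradict the fact that $\phi_0, \ldots, \phi_n$ is a nontrivial cycle.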

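$3 \Rightarrow 4$: Consider the partial order $\prec$ on the image of $C$ constructed by saying that $W \prec C(\phi)$ if $\phi \in W \neq C(\phi)$ for some $\phi$, and taking the transitive closure of these rules. If this were not a partial order, it would have to be because we created a cycle of worlds $W_0, \ldots, W_n$ with $W_n = W_0$, such that each $W_i = C(\phi_i)$ with $\phi_{i+1} \in W_i$ and $W_i \neq W_{i+1}$. Taking $\phi_0 = \phi_n$, this would be a nontrivial cycle.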

Extend this partial order to all of $\mathcal{W}$ by saying that if $W$ is in the image of $C$ and $W'$ is not, then $W' \prec W$. Note that $C$ optimizes this partial order by definition.
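As a rough illustration of this step, the relation on the image of a finite toy map can be generated and closed under transitivity, and one can then check whether it is irreflexive. The names `order_from_C` and `is_strict`, and both example maps, are hypothetical; pairs `(W, W2)` are read as "`W` is below `W2`".

```python
def order_from_C(C):
    """Pairs (W, W2), meaning W is below W2, restricted to the image of C."""
    image = set(C.values())
    # Base rule: any other image world containing phi sits below C[phi].
    rel = {
        (W, C[phi])
        for phi in C
        for W in image
        if phi in W and W != C[phi]
    }
    changed = True
    while changed:  # naive transitive closure, fine for toy sizes
        new = {(a, d) for (a, b) in rel for (c, d) in rel if b == c} - rel
        rel |= new
        changed = bool(new)
    return rel


def is_strict(rel):
    """A transitively closed relation is a strict order iff irreflexive."""
    return all(a != b for a, b in rel)


# Hypothetical examples: the first has no nontrivial cycle, the second does.
C_ACYCLIC = {
    "t": frozenset({"t"}),
    "p": frozenset({"p", "q"}),
    "q": frozenset({"q"}),
}
C_CYCLIC = {
    "p": frozenset({"p", "q"}),
    "q": frozenset({"p", "q", "r"}),
    "r": frozenset({"r"}),
}
```

For `C_ACYCLIC` the closure is a strict order, while for `C_CYCLIC` the closure puts a world below itself, mirroring the way a nontrivial cycle obstructs the construction.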

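$4 \Rightarrow 1$: Let $C$ optimize the partial order $\prec$. Consider the restriction of $\prec$ to the image of $C$. This is a partial order on a countable set. Order the worlds in this partial order $W_1, W_2, \ldots$. Embed the partial order into $\mathbb{R}$ by repeatedly defining $f(W_n)$ such that:

  1. $f(W_n) > 0$,

  2. $f(W_n) > f(W_m)$ for all $m < n$ and $W_m \prec W_n$, and

  3. $f(W_n) < f(W_m)$ for all $m < n$ and $W_n \prec W_m$.

Define $f(W) = 0$ if $W$ is not in the image of $C$. Clearly this function is constructed such that $C$ optimizes $f$.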
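The embedding step can be sketched for a finite list of worlds. This is an illustration under assumptions rather than the construction itself: it uses exact rationals, treats 0 as the value reserved for worlds outside the image, and picks the midpoint (or the lower bound plus one) at each step. The names `embed`, `BELOW`, and `less` are hypothetical, and `less` is assumed to be a transitive strict order.

```python
from fractions import Fraction


def embed(worlds, less):
    """Assign positive rationals so that less(a, b) implies f[a] < f[b].

    Worlds are processed in the given order; each new value is squeezed
    between the already-assigned values below and above it.
    """
    f = {}
    for w in worlds:
        lower = max((f[u] for u in f if less(u, w)), default=Fraction(0))
        upper = min((f[u] for u in f if less(w, u)), default=None)
        f[w] = lower + 1 if upper is None else (lower + upper) / 2
    return f


# Hypothetical three-world chain A < B < N, enumerated out of order.
BELOW = {("A", "B"), ("B", "N"), ("A", "N")}


def less(a, b):
    return (a, b) in BELOW


f = embed(["B", "N", "A"], less)
```

Because `"A"` is enumerated last, it has to fit strictly between 0 and the value already given to `"B"`, which is why the sketch takes midpoints instead of assigning integers in enumeration order.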

This result is useful not just for establishing the equivalence of (a certain type of) optimization and acyclic causal networks, but also for providing a strategy for showing that "correct" logical counterfactuals cannot arise as an optimization process or through an acyclic causal network. To show this, one only has to exhibit a single nontrivial cycle. In the simplest case, this can be done by exhibiting a pair of sentences $\phi$ and $\psi$ such that $\phi$ and $\psi$ are clearly counterfactual consequences of each other (that is, $\psi \in C(\phi)$ and $\phi \in C(\psi)$), but do not correspond to identical counterfactual worlds (that is, $C(\phi) \neq C(\psi)$).

Do the "correct" logical counterfactuals exhibit a nontrivial cycle?