Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This post will assume an understanding of the finite factored set ontology. It will be more speculative than the main FFS sequence, and I will leave out some proofs. It seems likely that I will later regret some of the definitions laid out here. I will be particularly sloppy in defining evidential and causal counterfactuals, because this post is not advocating for them, only demonstrating that it is possible to represent them within our ontology. This post was (approximately) written a year ago. I am currently in the process of moving a bunch of my ideas from the last year to LessWrong.

Nontechnical Summary

The main thing in this post is to introduce a new concept of counterfactability. A counterfactable event  is one that screens off its own history (i.e. everything upstream of ) from everything that you care about.

Decision theory is easy when considering choices that represent counterfactable events. When trying to take counterfactuals on non-counterfactable events, things are under-defined. This is because non-counterfactable events have artificially low resolution. It is like asking what would happen if I either took action  or took action . The question does not carve reality at the joints. Different worlds were artificially merged. It is as though details about the event were forgotten.

We can reframe questions in decision theory as follows: When given a non-counterfactable event , how do we add some more details to  to form a counterfactable , while only adding details that feel like they were artificially forgotten? In this new frame, we are settled on the question of how to take counterfactuals on counterfactable things, and are only asking "What is the counterfactable thing we were meant to be counterfacting on?"

Both CDT and EDT can be reframed as providing an answer to this question. They are giving the wrong answers, but this fact shows that the reframe is sufficient to capture both CDT and EDT. Further, we can see that when choosing between counterfactable choices, CDT and EDT give the same answer. I believe the correct way to find the natural events to counterfact on requires introspection on the gears of the agent's cognition, and will not be either of the CDT and EDT extremes.

Finally, I show that this concept of counterfactability is sufficiently natural that it applies outside the context of decision theory, and use it as a lens on the eliciting latent knowledge problem.

Counterfactability

Let  be a finite factored set, and let  be a nonempty proper subset of , and let  be a partition of .

We say that  is counterfactable relative to  if for all , if , then . ( screens off its own history from .)

We will generally think of  as a high level description of the world that contains all of the features we care about.

Whenever  is counterfactable relative to , we can define a counterfactual function  given by , where . In order for  to be well defined, we need this to be independent of the choice of .

Claim: When  is counterfactable relative to  is well defined.

Proof: Let , and let . Observe that . Consider , and let  and . Assume for the purpose of contradiction that . Since , we have . Thus,  and  are disjoint.

Since , and , there must be some  such that , and observe that further, we must have  and .

Since , we have , so , so , so . Note that , since , and , so . Thus , so  is a singleton. This contradicts the fact that  and  are in different parts in 

 here can be thought of as a high level world model up to which we want our counterfactuals to be well defined. If we take , then we will have that  is counterfactable relative to  if and only if there exists an  and a  such that  if and only if  for all .

So, in the FFS framework, we have this simple notion of counterfactability (relative to ), together with a way of counterfacting (up to ) on any counterfactable event.

Extending Beyond the Counterfactable

The question then becomes "How do you counterfact on non-counterfactable events?" I will start by presenting two (in my opinion) bad strategies for extending counterfactuals to (some) non-counterfactable events. These two strategies will give us evidential counterfactuals and causal counterfactuals. Both will require some extra structure beyond the finite factored set structure in order to be defined.

We will see that counterfactability is actually a very strong notion, and for any counterfactable , if you sample an , and then counterfact on , you will get the same distribution on , regardless of whether you use , evidential counterfactuals, or causal counterfactuals. If we take  to be the level sets of a utility function, this leads to the result that (updateless) CDT is the same as (updateless) EDT whenever the agent is choosing between counterfactable events.

However, the interesting part is that both evidential and causal counterfactuals (while agreeing with the standard intuitions) will be defined in terms of . In both cases, we will be given a not necessarily counterfactable , and then (possibly randomly) find a counterfactable , and counterfact on  instead. Thus, the whole of evidential and causal counterfactuals can be summarized as "Given an event , first find the correct counterfactable subset , and then counterfact on ."

Note that the above is not saying much, since we could just take  to be a singleton. The interesting part is that both evidential and causal counterfactuals can be viewed as  for  no later than  (). Thus, we are not just forcing evidential and causal counterfactuals to fit our notion of counterfactuals by counterfacting on an overly specific description of the whole world. We are counterfacting on a local event, no later than  itself.

This gives a new orientation on the problem of counterfactuals. We are given an event  that we want to counterfact on, and unfortunately it is not counterfactable, because it is not specific enough, so we instead need to counterfact on some subset  that is both counterfactable and no later than . Evidential and causal counterfactuals are just (bad) ways of choosing that subset. Now instead of trying to figure out how to define counterfactuals, we can instead think of ways to choose a counterfactable subset of the event we want to counterfact on.

The emphasis on  that are no later than  is important, because it is capturing that non-counterfactable events are somewhat artificially not specific enough to counterfact on. It is as though they were constructed by unioning together counterfactable events. The "were constructed by" is important here. Events can be expressed as unions of counterfactable events in many ways, but it feels more like the resolution was artificially removed when we express the event as a union of events that came weakly earlier. Sometimes, I know how to counterfact on "I do ," and I know how to counterfact on "I do ," but I get confused when I try to counterfact on "I do  or I do ," because the resolution was artificially lowered. This is not to say we can just not lower the resolution. This is the curse of embedded agency.

To make it more pithy, "Counterfactuals are sometimes under-defined for events that are not at their native resolution." I actually like the analogy with native resolution on a monitor here. When the image has less resolution than the monitor, there isn't a well defined best way to display it. If the image has resolution that is an integer fraction of the monitor's (in each dimension), there is a well defined best way to display it, but that is because dividing by an integer corresponds in this analogy to taking out a factor, and thus having a smaller history.

Defining Decision Theories

Evidential Counterfactuals

Defining evidential counterfactuals will require more than just a finite factored set . We will also need a probability distribution  on  that is nowhere zero. Recall that a probability distribution on a finite factored set is the product distribution of a separate probability distribution on each of the factors.
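As a rough illustration of the last sentence, here is a minimal sketch of a product distribution on a toy factored set. The representation (elements as tuples with one coordinate per factor) and all names and numbers are assumptions made for this example only, not part of the FFS definitions.

```python
from itertools import product
from math import prod, isclose

# Hypothetical toy factored set: each element of the underlying set is a tuple
# with one coordinate per factor, and the parts of factor i are the possible
# values of coordinate i.
factor_dists = [
    {"a0": 0.25, "a1": 0.75},            # distribution over the parts of factor 0
    {"b0": 0.5, "b1": 0.3, "b2": 0.2},   # distribution over the parts of factor 1
]

def product_distribution(factor_dists):
    """Return P(s) = product over factors of the probability of s's part."""
    parts_per_factor = [list(d) for d in factor_dists]
    return {
        s: prod(d[part] for d, part in zip(factor_dists, s))
        for s in product(*parts_per_factor)
    }

P = product_distribution(factor_dists)
```

Because every factor distribution here is nowhere zero, the resulting distribution on the whole set is nowhere zero as well.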

Given a nonempty proper subset , we will sample a subset of  as follows. Let . Note that  is a partition of . For each , sample  with probability . Note that the sum of these probabilities will be 1.

Further, note that the  sampled by the above procedure will always be a subset of , and will always satisfy .

Further,  will be counterfactable relative to  for all . This is because if we take  and , we have that  if and only if  for all .

Thus, we have successfully specified a (randomized) procedure, which given an , produces an  that is counterfactable and no later than .

That gives us our evidential counterfactuals: , which are given by setting  to the probability that , where  is defined as above from  and .

Note that even the concept of evidential counterfactuals is going against the native ontology of evidential decision theory. Evidential decision theory doesn't really talk about interventions, and the type signatures above are about taking an intervention on an initial .

However, note that if you sample an  according to , and then sample a  according to , you will end up sampling each  with probability , so you get the same end result as if you just conditioned on .

However, the evidential counterfactuals we define here also have the nice property that counterfacting on  will always leave unchanged all variables that are orthogonal to , so our counterfactuals are local in a sense.

Causal Counterfactuals

We will now define causal counterfactuals similarly to the above evidential counterfactuals. We will again need some extra structure. This time, we will imagine that our finite factored set  was constructed from some Pearlian causal DAG, .

For this, we will first need to describe a procedure for constructing a FFS from a Pearlian DAG. We will take one factor for each node in our DAG. The factor corresponding to the node  will have one part for each function from assignments of states to the parents of  to assignments of a state to . Note that when we construct a factored set in this way, for each , we have a well defined state for each node, which can be recursively defined using the functions you get by projecting onto each factor.

Note that this also means that for each node , we get a function  which takes in an element of , and outputs the state that element assigns to .
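The construction can be sketched on a small example. The two-node DAG, the binary state sets, and every name below (`STATES`, `PARENTS`, `factor_parts`, `node_states`) are hypothetical choices for illustration; the code only mirrors the verbal description above and is not a general FFS library.

```python
from itertools import product

# Hypothetical two-node DAG X -> Y with binary states (illustrative only).
STATES = {"X": (0, 1), "Y": (0, 1)}
PARENTS = {"X": (), "Y": ("X",)}
ORDER = ("X", "Y")  # a topological order of the nodes

def factor_parts(node):
    """All functions from parent-state assignments to states of `node`.

    Each function is stored as a tuple of (parent_assignment, output) pairs
    so that it is hashable.
    """
    inputs = list(product(*(STATES[p] for p in PARENTS[node])))
    return [
        tuple(zip(inputs, outputs))
        for outputs in product(STATES[node], repeat=len(inputs))
    ]

def elements():
    """The underlying set: one choice of function per node."""
    per_node = [factor_parts(v) for v in ORDER]
    return [dict(zip(ORDER, choice)) for choice in product(*per_node)]

def node_states(s):
    """Evaluate the state of every node in element `s`, parents first."""
    state = {}
    for v in ORDER:
        parent_assignment = tuple(state[p] for p in PARENTS[v])
        state[v] = dict(s[v])[parent_assignment]
    return state

S = elements()
```

In this toy case the factor for X has 2 parts (a node with no parents has one function per state), the factor for Y has 4 parts, and `node_states` plays the role of the per-node state functions described above.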

We will not be able to describe causal counterfactuals for general subsets of . For any set of nodes , and any assignment of states , where  is a state of  for each , we can take . We will only be able to define a causal counterfactual for subsets of this form. These can be thought of as events that can be described by assigning states to some set of nodes.

If we have  of this form, we can define  to be the set of all elements such that for all , the factor corresponding to  has the value corresponding to the constant function .

Observe that , that , and that  is counterfactable relative to  for all .

If  as above, let .

This is basically saying that we are taking an  which corresponds to a collection of nodes having a specific assignment of states. We can't directly counterfact on that event, because there are many different assignments of the parents of the nodes that could result in those states, so instead we counterfact on the nodes being constantly equal to those states, independent of the states of their parents.
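A minimal sketch of this move, on a hypothetical DAG X -> Y with binary states. The encoding of elements and the helper names are assumptions for illustration. The code only checks that the "constant function" subset sits inside the original event and is picked out by Y's factor alone; it does not verify counterfactability.

```python
from itertools import product

# Hypothetical DAG X -> Y with binary states. An element picks a state for X
# (X has no parents) and a function f for Y, stored as the pair (f(0), f(1)).
S = [(x, f) for x in (0, 1) for f in product((0, 1), repeat=2)]

def y_state(s):
    """State of Y in element s: apply Y's function to X's state."""
    x, f = s
    return f[x]

def event(y):
    """The event 'node Y has state y', described by node states only."""
    return {s for s in S if y_state(s) == y}

def constant_subset(y):
    """Elements whose Y-factor is the constant function with value y."""
    return {s for s in S if s[1] == (y, y)}

E = event(1)
E_const = constant_subset(1)
```

Here `E` contains elements where Y ends up in state 1 for different reasons (for example, Y copies X and X happens to be 1), while `E_const` keeps only the elements where Y is 1 no matter what X does.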

CDT=EDT (for Counterfactable Events)

We have defined evidential and causal counterfactuals, and we can use them to now define EDT and CDT. Let  be a finite factored set. Let  be a partition of , and let  be a utility function.

(We are assuming here that  has enough resolution to capture the agent's utility. For example, we could start with a utility function on , and define  to be exactly the partition into level sets of that utility function.)
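For concreteness, partitioning a finite set into the level sets of a utility function can be sketched as follows. The toy worlds and the utility function are hypothetical, chosen only to make the example runnable.

```python
from collections import defaultdict

def level_sets(elements, utility):
    """Partition `elements` into the level sets of `utility`.

    Returns a dict mapping each utility value to the frozenset of elements
    that attain it.
    """
    parts = defaultdict(set)
    for s in elements:
        parts[utility(s)].add(s)
    return {value: frozenset(part) for value, part in parts.items()}

# Hypothetical toy example: worlds are the integers 0..5, utility is s mod 3.
worlds = range(6)
partition = level_sets(worlds, lambda s: s % 3)
```

Each part of the resulting partition carries a single utility value, so a utility function on the parts is well defined.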

Let  be a partition of , representing the agent's action. Let  be a distribution on , representing the agent's beliefs. (We are ignoring any observations here, so A is more like a space of updateless policies.)

We can then define the EDT choice, which is the element  that maximizes the expectation of , where  is sampled according to , and  is sampled according to 

Similarly, if we have a Pearlian DAG , and  was generated from , and for all  is of the form , then we can define the CDT choice to be the  which maximizes the expectation of , where  is sampled according to .

Finally, if  is counterfactable for all , we can define a third decision theory, where we choose the , which maximizes the expectation of , where  is sampled according to .

Note that if  is counterfactable for all , then this third decision theory will give the same result as EDT. Further if CDT is also well defined, all three decision theories will give the same result. CDT, EDT, and the third decision theory need not counterfact on the same events (i.e. the  might be different, but the decisions will end up the same).

This is mostly saying that counterfactability is very strong: once you have counterfactable events, decision theory is over-determined.

In all of the above, we are doing updateless versions of the decision theories. Our agent is not making any observations.

Other Counterfactuals

I described evidential and causal counterfactuals above, not because I think they are the right way to take counterfactuals, but because I wanted to demonstrate that they both fit into the framework where when counterfacting on , you apply  for some counterfactable , no later than  itself. The fact that you have to pass to an  comes from the fact that  was not actually at the native resolution of the action being counterfacted on.

There are other ways we could select an , that look more at the gears of the process by which the decision is being made. When I am in a prisoner's dilemma with someone with similar psychology, part of my decision making process is happening on the part of me that is shared with my opponent, while some of my decision making process uses methods that are unique to me. Thus, my opponent's action is partially downstream from the calculation I am currently doing, and partially independent of it. Determining how much of the decision is in each part is difficult, and will not just be one extreme (EDT) or the other (CDT).

Counterfactability and ELK

An event is counterfactable if it screens off its own history from everything you care about. Dealing with non-counterfactable events is confusing, so instead of dealing with a non-counterfactable event , we would rather deal with a counterfactable , and luckily, there is always a counterfactable  that uses no more information than . This story is sufficiently natural that it also applies outside of decision theory.

Say you have some opaque machine learning system that gives some output. Let  partition the possible worlds according to the output of the system. The history of the output is all the information/thinking/computation/knowledge that goes into computing the output. (This is not the history according to FFS taken literally, but there is an analogy here that is deep, and part of the motivation for defining the FFS toy model.)

There are a bunch of details in the history of the output that do not make it into the output. This is fine. However, it is scary when there are details in the history of the output that are about things we care about, that do not make it into the output (or into our understanding of the output). Thus, we would like it if (our understanding of) the output of our ML system successfully screened off its own history from that which we care about.

Unfortunately, we have an opaque ML system that does not have this property, and thus will not have to tell us if it is trying to deceive us. What can we do?

We would like to add some additional notes to the output of the system, without necessarily changing the original system. These notes refine the partition of possible worlds according to output, and thus the worlds corresponding to a given output-notes pair will be a subset of the worlds corresponding to just the output.  We are finding a subset  of our original  by adding more information through the notes.
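The refinement claim can be illustrated with a toy sketch. The worlds, outputs, and notes below are made up for this example, and the helper names are hypothetical; the point is only that grouping worlds by (output, notes) yields parts that sit inside the parts obtained by grouping by output alone.

```python
from collections import defaultdict

def partition_by(worlds, label):
    """Group worlds by the value of `label`, returning a set of frozensets."""
    parts = defaultdict(set)
    for w in worlds:
        parts[label(w)].add(w)
    return {frozenset(p) for p in parts.values()}

def refines(fine, coarse):
    """True if every part of `fine` is contained in some part of `coarse`."""
    return all(any(f <= c for c in coarse) for f in fine)

# Hypothetical worlds: (output, notes, hidden detail) triples.
worlds = [
    ("yes", "n1", 0),
    ("yes", "n1", 1),
    ("yes", "n2", 0),
    ("no", "n1", 0),
]
by_output = partition_by(worlds, lambda w: w[0])
by_output_and_notes = partition_by(worlds, lambda w: w[:2])
```

In this toy case `refines(by_output_and_notes, by_output)` holds, while the converse does not, since the "yes" worlds are split by the notes.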

We would like  to be counterfactable, meaning that it contains all the information in its own history that is relevant to what we care about. Luckily, there is this intuition that this shouldn't be that hard. All the information is already there. We just need to pull it out; we shouldn't need to add any new information to do this.

All this is to say that we would like our system to instead of outputting , output a counterfactable , and luckily there should always be a counterfactable  no later than . (I am being sloppy here with conflating the output and our understanding of the output, but still there is a rhyme in the structure that is hard to deny.)

I think that the thing that is going on here is that FFS gives us a nice definition of a good summary:  is a good summary of  if  screens off  from everything you care about. The goal of informed oversight is to have systems that output good summaries of themselves. Without this, the overseer cannot evaluate the consequences of the output in an unbiased way. Similarly, when considering non-counterfactable actions, an agent cannot judge the consequences of those actions in an unbiased way.

Comments

If I understand you correctly, the reason that this notion of counterfactable connects with what we normally call a counterfactual is that when an event screens off its own history, it's easy to consider other "values" of the "variable" underlying that event without coming into any logical contradictions with other events ("values of other variables") that we're holding fixed.

For example if I try to consider what would have happened if there had been a snow storm in Vermont last night, while holding fixed the particular weather patterns observed in Vermont and surrounding areas on the preceding day, then I'm in kind of a tricky spot, because on the one hand I'm considering the weather patterns from the previous day as fixed (which did not in fact give rise to a snow storm in Vermont last night), and yet I'm also trying to "consider" a snow storm in Vermont. The closer I look into this the more confused I'm going to get, and in the end I'll find that this notion of "consider a snow storm took place in Vermont last night" is a bit ill-defined.

What I would like to say is: let's consider a snow storm in Vermont last night; in order to do that let's forget everything that would mess with that consideration.

My question for you is: in the world we live in, the full causal history of any real event contains almost the whole history of Earth from the time of the event backwards, because the Earth is so small relative to the speed of light, and everything that could have interacted with the event is part of the history of that event. So in practice, won't all counterfactable events need to be more-or-less a full specification of the whole state of the world at a certain point in time?

Yeah, remember the above is all for updateless agents, which are already computationally intractable. For updateful agents, we will want to talk about conditional counterfactability. For example, if you and I are in a prisoner's dilemma, we could condition on all the stuff that happened prior to us being put in separate cells, and given this condition, the histories are much smaller.

Also, we could do all of our reasoning up to a high level world model that makes histories more reasonably sized.

Also, we could think of counterfactability as a spectrum. Some events are especially hard to reason about, because there are lots of different ways we could have done it, and we can selectively add details to make it more and more counterfactable, meaning it approximately screens off its history from that which you care about.

Regarding your point on ELK: to make the output of the opaque machine learning system counterfactable, wouldn't it be sufficient to include the whole program trace? Program trace means the results of all the intermediate computations computed along the way. Yet including a program trace wouldn't help us much if we don't know what function of that program trace will tell us, for example, whether the machine learning system is deliberately deceiving us.

So yes it's necessary to have an information set that includes the relevant information, but isn't the main part of the (ELK) problem to determine what function of that information corresponds to the particular latent variable that we're looking for?

I agree, this is why I said I am being sloppy with conflating the output and our understanding of the output. We want our understanding of the output to screen off the history.