Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Counterfactability

If I understand you correctly, the reason that this notion of counterfactable connects with what we normally call a counterfactual is that when an event screens off its own history, it's easy to consider other "values" of the "variable" underlying that event without running into any logical contradictions with other events ("values of other variables") that we're holding fixed.

For example if I try to consider what would have happened if there had been a snow storm in Vermont last night, while holding fixed the particular weather patterns observed in Vermont and surrounding areas on the preceding day, then I'm in kind of a tricky spot, because on the one hand I'm considering the weather patterns from the previous day as fixed (which did not in fact give rise to a snow storm in Vermont last night), and yet I'm also trying to "consider" a snow storm in Vermont. The closer I look into this the more confused I'm going to get, and in the end I'll find that this notion of "consider a snow storm took place in Vermont last night" is a bit ill-defined.

What I would like to say is: let's consider a snow storm in Vermont last night; in order to do that let's forget everything that would mess with that consideration.

My question for you is: in the world we live in, the full causal history of any real event contains almost the whole history of Earth from the time of the event backwards, because the Earth is so small relative to the speed of light, and everything that could have interacted with the event is part of the history of that event. So in practice, won't all counterfactable events need to be more or less a full specification of the whole state of the world at a certain point in time?

Yeah, remember the above is all for updateless agents, which are already computationally intractable. For updateful agents, we will want to talk about conditional counterfactability. For example, if you and I are in a prisoner's dilemma, we could condition on all the stuff that happened prior to us being put in separate cells, and given this condition, the histories are much smaller.

Also, we could do all of our reasoning up to a high level world model that makes histories more reasonably sized.

Also, we could think of counterfactability as a spectrum. Some events are especially hard to reason about, because there are lots of different ways we could have done it, and we can selectively add details to make it more and more counterfactable, meaning it approximately screens off its history from that which you care about.

Regarding your point on ELK: to make the output of the opaque machine learning system counterfactable, wouldn't it be sufficient to include the whole program trace? Program trace means the results of all the intermediate computations computed along the way. Yet including a program trace wouldn't help us much if we don't know what function of that program trace will tell us, for example, whether the machine learning system is deliberately deceiving us.

So yes it's necessary to have an information set that includes the relevant information, but isn't the main part of the (ELK) problem to determine what function of that information corresponds to the particular latent variable that we're looking for?

I agree, this is why I said I am being sloppy with conflating the output and our understanding of the output. We want our understanding of the output to screen off the history.

This post will assume an understanding of the finite factored set ontology. It will be more speculative than the main FFS sequence, and I will leave out some proofs. It seems likely that I will later regret some of the definitions laid out here. I will be particularly sloppy in defining evidential and causal counterfactuals, because this post is not advocating for them, only demonstrating that it is possible to represent them within our ontology. This post was (approximately) written a year ago. I am currently in the process of moving a bunch of my ideas from the last year to LessWrong.

## Nontechnical Summary

The main thing in this post is to introduce a new concept of counterfactability. A counterfactable event E is one that screens off its own history (i.e. everything upstream of E) from everything that you care about.

Decision theory is easy when considering choices that represent counterfactable events. When trying to take counterfactuals on non-counterfactable events, things are under-defined. This is because non-counterfactable events have artificially low resolution. It is like asking what would happen if I either took action X or took action Y. The question does not carve reality at the joints. Different worlds were artificially merged. It is as though details about the event were forgotten.

We can reframe questions in decision theory as follows: When given a non-counterfactable event E, how do we add some more details to E to form a counterfactable E′, while only adding details that feel like they were artificially forgotten? In this new frame, we are settled on the question of how to take counterfactuals on counterfactable things, and are only asking "What is the counterfactable thing we were meant to be counterfacting on?"

Both CDT and EDT can be reframed as providing an answer to this question. They are giving the wrong answers, but this fact shows that the reframe is sufficient to capture both CDT and EDT. Further, we can see that when choosing between counterfactable choices, CDT and EDT give the same answer. I believe the correct way to find the natural events to counterfact on requires introspection on the gears of the agent's cognition, and will not be either of the CDT and EDT extremes.

Finally, I show that this concept of counterfactability is sufficiently natural that it applies outside the context of decision theory, and use it as a lens on the eliciting latent knowledge problem.

## Counterfactability

Let F=(S,B) be a finite factored set, and let E be a nonempty proper subset of S, and let W be a partition of S.

We say that E is counterfactable relative to W if for all $X \in \mathrm{Part}(S)$, if $X \le^F \{E, S \setminus E\}$, then $X \perp W \mid E$. (E screens off its own history from W.)

We will generally think of W as a high level description of the world that contains all of the features we care about.

Whenever E is counterfactable relative to W, we can define a counterfactual function $\mathrm{do}^W_E : S \to W$ given by $\mathrm{do}^W_E(s) = [\chi^F_{h^F(\{E, S \setminus E\})}(e, s)]_W$, where $e \in E$. In order for $\mathrm{do}^W_E$ to be well defined, we need this to be independent of the choice of $e \in E$.

Claim: When E is counterfactable relative to W, $\mathrm{do}^W_E$ is well defined.

Proof: Let $H = h^F(\{E, S \setminus E\})$, and let $X = \bigvee_S(H)$. Observe that $H = h^F(X)$. Consider $e_0, e_1 \in E$, and let $s_0 = \chi^F_H(e_0, s)$ and $s_1 = \chi^F_H(e_1, s)$. Assume for the purpose of contradiction that $[s_0]_W \ne [s_1]_W$. Since $X \le^F \{E, S \setminus E\}$, we have $X \perp W \mid E$. Thus, $h^F(X|E)$ and $h^F(W|E)$ are disjoint.

Since $[s_0]_W \ne [s_1]_W$, and $s_0, s_1 \in E$, there must be some $b \in h^F(W|E)$ such that $s_0 \not\sim_b s_1$, and observe that further, we must have $b \in H$ and $b \notin h^F(X|E)$.

Since $b \in H$, we have $b \le_S X$, so $b|E \le_E X|E$, so $h^F(b|E) \subseteq h^F(X|E)$, so $b \notin h^F(b|E)$. Note that $B \setminus h^F(b|E) \vdash^F (b|E)$, since $\chi^F_{B \setminus h^F(b|E)}(E, E) = E$, and $b \in B \setminus h^F(b|E)$, so $b|E \le_E \bigvee_S(B \setminus h^F(b|E))|E$. Thus $\{\} = h^F(b|E) \cap (B \setminus h^F(b|E)) \vdash^F (b|E)$, so $b|E$ is a singleton. This contradicts the fact that $s_0$ and $s_1$ are in different parts in $b|E$. □

W here can be thought of as a high level world model up to which we want our counterfactuals to be well defined. If we take $W = \{\{s\} \mid s \in S\}$, then we will have that E is counterfactable relative to W if and only if there exists an $s \in S$ and a $C \subseteq B$ such that $e \in E$ if and only if $e \sim_b s$ for all $b \in C$.

So, in the FFS framework, we have this simple notion of counterfactability (relative to W), together with a way of counterfacting (up to W) on any counterfactable event.
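As a rough illustration of the discrete case above (W is the partition into singletons), here is a minimal sketch on a toy factored set. The factor sizes, the names `chi` and `do_outcomes`, and the two example events are all assumptions made for illustration, not part of the FFS formalism itself. Both events have history {0, 1}, but only the first one fixes a single value on each factor in that history.

```python
from itertools import product

# Hypothetical toy factored set: S is the product of three small factors.
FACTORS = [(0, 1), (0, 1, 2), (0, 1)]
S = list(product(*FACTORS))

def chi(C, x, y):
    """Element that agrees with x on the factors in C and with y elsewhere."""
    return tuple(x[i] if i in C else y[i] for i in range(len(FACTORS)))

def do_outcomes(C, E, s):
    """All values of chi(C, e, s) as e ranges over E.

    A single outcome means counterfacting on E from s is well defined.
    """
    return {chi(C, e, s) for e in E}

# Assumed history of both events below: factors 0 and 1.
C = {0, 1}

# An event that fixes one value on each factor in C (counterfactable).
E_good = [e for e in S if (e[0], e[1]) == (1, 2)]

# A lower-resolution event with the same history: a union of such events.
E_bad = [e for e in S if e[0] == 1 or e[1] == 2]

s = (0, 0, 1)
good = do_outcomes(C, E_good, s)
bad = do_outcomes(C, E_bad, s)
print(good)      # a single outcome, independent of the choice of e
print(len(bad))  # more than one outcome, so the choice of e matters
```

For the second event, the result depends on which e is chosen, which is the sense in which counterfacting on it is under-defined.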

## Extending Beyond the Counterfactable

The question then becomes "How do you counterfact on non-counterfactable events?" I will start by presenting two (in my opinion) bad strategies for extending counterfactuals to (some) non-counterfactable events. These two strategies will give us evidential counterfactuals and causal counterfactuals. Both will require some extra structure beyond the finite factored set structure in order to be defined.

We will see that counterfactability is actually a very strong notion, and for any counterfactable E, if you sample an $s \in S$, and then counterfact on E, you will get the same distribution on W, regardless of whether you use $\mathrm{do}^W_E$, evidential counterfactuals, or causal counterfactuals. If we take W to be the level sets of a utility function, this leads to the result that (updateless) CDT is the same as (updateless) EDT whenever the agent is choosing between counterfactable events.

However, the interesting part is that both evidential and causal counterfactuals (while agreeing with the standard intuitions) will be defined in terms of $\mathrm{do}^W$. In both cases, we will be given a not necessarily counterfactable E, and then (possibly randomly) find a counterfactable E′⊆E, and counterfact on E′ instead. Thus, the whole of evidential and causal counterfactuals can be summarized as "Given an event E, first find the correct counterfactable subset E′, and then counterfact on E′."

Note that the above is not saying much, since we could just take E′ to be a singleton. The interesting part is that both evidential and causal counterfactuals can be viewed as $\mathrm{do}^W_{E'}$ for E′ no later than E ($\{E', S \setminus E'\} \le^F \{E, S \setminus E\}$). Thus, we are not just forcing evidential and causal counterfactuals to fit our notion of counterfactuals by counterfacting on an overly specific description of the whole world. We are counterfacting on a local event, no later than E itself.

This gives a new orientation on the problem of counterfactuals. We are given an event E that we want to counterfact on, and unfortunately it is not counterfactable, because it is not specific enough, so we instead need to counterfact on some subset E′ that is both counterfactable and no later than E. Evidential and causal counterfactuals are just (bad) ways of choosing that subset. Now instead of trying to figure out how to define counterfactuals, we can instead think of ways to choose a counterfactable subset of the event we want to counterfact on.

The emphasis on E′ that are no later than E is important, because it is capturing that non-counterfactable events are somewhat artificially not specific enough to counterfact on. It is as though they were constructed by unioning together counterfactable events. The "were constructed by" is important here. Events can be expressed as unions of counterfactable events in many ways, but it feels more like the resolution was artificially removed when we express the event as a union of events that came weakly earlier. Sometimes, I know how to counterfact on "I do X," and I know how to counterfact on "I do Y," but I get confused when I try to counterfact on "I do X or I do Y," because the resolution was artificially lowered. This is not to say we can just not lower the resolution. This is the curse of embedded agency.

To make it more pithy, "Counterfactuals are sometimes under-defined for events that are not at their native resolution." I actually like the analogy with native resolution on a monitor here. When the image has less resolution than the monitor, there isn't a well defined best way to display it. If the image has resolution that is an integer fraction of the monitor's (in each dimension), there is a well defined best way to display it, but that is because dividing by an integer corresponds in this analogy to taking out a factor, and thus having a smaller history.

## Defining Decision Theories

## Evidential Counterfactuals

Defining evidential counterfactuals will require more than just a finite factored set F=(S,B). We will also need a probability distribution P on F that is nowhere zero. Recall that a probability distribution on a finite factored set is the product distribution of a separate probability distribution on each of the factors.

Given a nonempty proper subset E⊆S, we will sample a subset of E as follows. Let $X_E = \bigvee_S(h^F(\{E, S \setminus E\}))|E$. Note that $X_E$ is a partition of E. Sample each $E' \in X_E$ with probability $P(E'|E)$. Note that the sum of these probabilities will be 1.

Further, note that the E′ sampled by the above procedure will always be a subset of E, and will always satisfy $\{E', S \setminus E'\} \le^F \{E, S \setminus E\}$.

Further, E′ will be counterfactable relative to W for all W∈Part(S). This is because if we take $s \in E'$ and $C = h^F(\{E, S \setminus E\})$, we have that $e \in E'$ if and only if $e \sim_b s$ for all $b \in C$.

Thus, we have successfully specified a (randomized) procedure, which given an E⊆S, produces an E′⊆E that is counterfactable and no later than E.

That gives us our evidential counterfactuals: $EC^{(W,P)}_E : S \to \Delta W$, which are given by setting $EC^{(W,P)}_E(s)(w)$ to the probability that $\mathrm{do}^W_{E'}(s) = w$, where E′ is defined as above from E and P.

Note that even the concept of evidential counterfactuals is going against the native ontology of evidential decision theory. Evidential decision theory doesn't really talk about interventions, and the type signatures above are about taking an intervention on an initial s∈S.

However, note that if you sample an s∈S according to P, and then sample a w∈W according to $EC^{(W,P)}_E(s)$, you will end up sampling each w∈W with probability P(w|E), so you get the same end result as if you just conditioned on E.

However, the evidential counterfactuals we define here also have the nice property that counterfacting on E will always leave unchanged all variables that are orthogonal to {E,S∖E}, so our counterfactuals are local in a sense.
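The sampling procedure in this section can be checked exactly on a small example. The sketch below is only an illustration under assumed inputs: the three binary factors, their marginals, and the helper names `history` and `evidential_distribution` are hypothetical, and W is taken to be the discrete partition. The history is computed by brute force as the set of factors that membership in E depends on, which coincides with the history of {E, S∖E} when S is written as an explicit product.

```python
from fractions import Fraction
from itertools import product

# Hypothetical model: three binary factors with a nowhere-zero product distribution.
MARGINALS = [
    {0: Fraction(1, 3), 1: Fraction(2, 3)},
    {0: Fraction(1, 4), 1: Fraction(3, 4)},
    {0: Fraction(1, 2), 1: Fraction(1, 2)},
]
N = len(MARGINALS)
S = list(product(*(sorted(m) for m in MARGINALS)))

def prob(s):
    """Product-distribution probability of the element s."""
    p = Fraction(1)
    for i, value in enumerate(s):
        p *= MARGINALS[i][value]
    return p

def history(E):
    """Factors that membership in E depends on (brute-force search)."""
    members = set(E)
    H = set()
    for s in S:
        for i in range(N):
            for value in MARGINALS[i]:
                t = s[:i] + (value,) + s[i + 1:]
                if (s in members) != (t in members):
                    H.add(i)
    return H

def evidential_distribution(E):
    """Distribution from: sample s ~ P, sample E' with probability P(E'|E), apply do."""
    H = history(E)
    p_E = sum(prob(e) for e in E)
    out = {s: Fraction(0) for s in S}
    for s in S:
        for e in E:
            # Drawing e ~ P(.|E) and keeping only its H-coordinates selects
            # the cell E' of X_E with probability P(E'|E).
            result = tuple(e[i] if i in H else s[i] for i in range(N))
            out[result] += prob(s) * prob(e) / p_E
    return out

# A non-counterfactable event: a union of several cells over factors 0 and 1.
E = [s for s in S if s[0] == 1 or s[1] == 1]
p_E = sum(prob(e) for e in E)
conditional = {s: (prob(s) / p_E if s in E else Fraction(0)) for s in S}
matches = evidential_distribution(E) == conditional
print(history(E), matches)
```

In this example the resulting distribution agrees with plain conditioning on E, and factor 2, which is outside the history of E, keeps its original marginal.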

## Causal Counterfactuals

We will now define causal counterfactuals similarly to the above evidential counterfactuals. We will again need some extra structure. This time, we will imagine that our finite factored set F=(S,B) was constructed from some Pearlian causal DAG, D.

For this, we will first need to describe a procedure for constructing a FFS from a Pearlian DAG. We will take one factor for each node in our DAG. The factor corresponding to the node v will have one part for each function from assignments of states to the parents of v to assignments of a state to v. Note that when we construct a factored set in this way, for each s∈S, we have a well defined state for each node, which can be recursively defined using the functions you get by projecting onto each factor.

Note that this also means that for each node v, we get a function $f_v$ which takes in an element of S, and outputs the state that element assigns to v.

We will not be able to describe causal counterfactuals for general subsets of S. For any set of nodes V, and any assignment of states t, where t(v) is a state of v for each v∈V, we can take $E(V,t) = \{s \in S \mid \forall v \in V,\ f_v(s) = t(v)\}$. We will only be able to define a causal counterfactual for subsets of this form. These can be thought of as events that can be described by assigning states to some set of nodes.

If we have E=E(V,t) of this form, we can define E′ to be the set of all elements such that for all v∈V, the factor corresponding to v has the value corresponding to the constant function $\mathrm{const}_{t(v)}$.

Observe that E′⊆E, that $\{E', S \setminus E'\} \le^F \{E, S \setminus E\}$, and that E′ is counterfactable relative to W for all W∈Part(S).

If E=E(V,t) as above, let $CC^{(W,D)}_E(s) = \mathrm{do}^W_{E'}(s)$.

This is basically saying that we are taking an E which corresponds to a collection of nodes having a specific assignment of states. We can't directly counterfact on that event, because there are many different assignments of the parents of the nodes that could result in those states, so instead we counterfact on the nodes being constantly equal to those states, independent of the states of their parents.
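To make the construction concrete, here is a small sketch for an assumed two-node DAG A → B with binary states. The encoding of B's factor as a pair (g(0), g(1)) and the names `f_A`, `f_B`, and `do_E_prime` are illustrative choices, not notation from the post.

```python
from itertools import product

# Hypothetical two-node DAG A -> B with binary states.
A_FACTOR = [0, 1]                            # A has no parents: one part per state
B_FACTOR = list(product([0, 1], repeat=2))   # a function g encoded as (g(0), g(1))
S = list(product(A_FACTOR, B_FACTOR))

def f_A(s):
    """State that s assigns to node A."""
    return s[0]

def f_B(s):
    """State that s assigns to node B: apply B's function to A's state."""
    return s[1][f_A(s)]

t = 1  # the state we want to counterfact node B into

# E = E({B}, t): all elements in which node B ends up in state t.
E = [s for s in S if f_B(s) == t]

# E': elements whose B-factor is the constant function with value t.
E_prime = [s for s in S if s[1] == (t, t)]

def do_E_prime(s):
    """Overwrite B's factor with the constant function; keep A's factor."""
    return (s[0], (t, t))

# Membership in E depends on A's factor, so E is not determined by B's factor alone.
depends_on_A = ((0, (1, 0)) in E) and ((1, (1, 0)) not in E)
print(len(E), len(E_prime), depends_on_A)
```

Here E′ is a subset of E that only constrains B's own factor, so counterfacting on it leaves the state of A untouched, which matches the usual picture of cutting the incoming edges of the intervened node.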

## CDT=EDT (for Counterfactable Events)

We have defined evidential and causal counterfactuals, and we can use them to now define EDT and CDT. Let F=(S,B) be a finite factored set. Let W be a partition of S, and let U:W→[0,1] be a utility function.

(We are assuming here that W has enough resolution to capture the agent's utility. For example, we could start with a utility function on S, and define W to be exactly the partition into level sets of that utility function.)

Let A be a partition of S, representing the agent's action. Let P be a distribution on F, representing the agent's beliefs. (We are ignoring any observations here, so A is more like a space of updateless policies.)

We can then define the EDT choice, which is the element E∈A that maximizes the expectation of U(w), where s is sampled according to P, and w is sampled according to $EC^{(W,P)}_E(s)$.

Similarly, if we have a Pearlian DAG D, and F was generated from D, and for all E∈A, E is of the form $E(V_E, t_E)$, then we can define the CDT choice to be the E∈A which maximizes the expectation of $U(CC^{(W,D)}_E(s))$, where s is sampled according to P.

Finally, if E is counterfactable for all E∈A, we can define a third decision theory, where we choose the E∈A which maximizes the expectation of $U(\mathrm{do}^W_E(s))$, where s is sampled according to P.

Note that if E is counterfactable for all E∈A, then this third decision theory will give the same result as EDT. Further if CDT is also well defined, all three decision theories will give the same result. CDT, EDT, and the third decision theory need not counterfact on the same events (i.e. the E′ might be different, but the decisions will end up the same).

This is mostly saying that counterfactability is very strong: once you have counterfactable events, decision theory is over-determined.
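The agreement between conditioning and counterfacting can be seen in a minimal sketch where the action partition is given by a single factor, so every action is counterfactable. The marginals, the utility function, and the function names below are assumptions chosen for illustration; W is again taken to be the discrete partition.

```python
from fractions import Fraction
from itertools import product

# Hypothetical setup: factor 0 is the agent's action, factors 1 and 2 the environment.
MARGINALS = [
    {0: Fraction(2, 5), 1: Fraction(3, 5)},
    {0: Fraction(1, 4), 1: Fraction(3, 4)},
    {0: Fraction(1, 2), 1: Fraction(1, 2)},
]
S = list(product(*(sorted(m) for m in MARGINALS)))

def prob(s):
    """Product-distribution probability of the element s."""
    p = Fraction(1)
    for i, value in enumerate(s):
        p *= MARGINALS[i][value]
    return p

def utility(s):
    """Illustrative utility with values in [0, 1]."""
    return Fraction(s[0] + 2 * s[0] * s[1] + s[2], 4)

def edt_value(a):
    """Expected utility conditional on the event 'factor 0 equals a'."""
    event = [s for s in S if s[0] == a]
    p_event = sum(prob(s) for s in event)
    return sum(prob(s) * utility(s) for s in event) / p_event

def do_value(a):
    """Expected utility after overwriting factor 0 with a and keeping the rest of s."""
    return sum(prob(s) * utility((a,) + s[1:]) for s in S)

actions = sorted(MARGINALS[0])
values = {a: (edt_value(a), do_value(a)) for a in actions}
best_edt = max(actions, key=edt_value)
best_do = max(actions, key=do_value)
print(values, best_edt, best_do)
```

Because the distribution is a product over factors, conditioning on the action factor does not shift the other factors, so the two evaluations coincide for every action in this example.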

In all of the above, we are doing updateless versions of the decision theories. Our agent is not making any observations.

## Other Counterfactuals

I described evidential and causal counterfactuals above, not because I think they are the right way to take counterfactuals, but because I wanted to demonstrate that they both fit into the framework where when counterfacting on E, you apply $\mathrm{do}^W_{E'}$ for some counterfactable $E' \subset E$, no later than E itself. The fact that you have to pass to an E′ comes from the fact that E was not actually at the native resolution of the action being counterfacted on.

There are other ways we could select an E′, that look more at the gears of the process by which the decision is being made. When I am in a prisoner's dilemma with someone with similar psychology, part of my decision making process is happening on the part of me that is shared with my opponent, while some of my decision making process uses methods that are unique to me. Thus, my opponent's action is partially downstream from the calculation I am currently doing, and partially independent of it. Determining how much of the decision is in each part is difficult, and will not just be one extreme (EDT) or the other (CDT).

## Counterfactability and ELK

An event is counterfactable if it screens off its own history from everything you care about. Dealing with non-counterfactable events is confusing, so instead of dealing with a non-counterfactable event E, we would rather deal with a counterfactable E′⊆E, and luckily, there is always a counterfactable E′ that uses no more information than E. This story is sufficiently natural that it also applies outside of decision theory.

Say you have some opaque machine learning system that gives some output. Let $\{E_1, \dots, E_n\}$ partition the possible worlds according to the output of the system. The history of the output is all the information/thinking/computation/knowledge that goes into computing the output. (This is not the history according to FFS taken literally, but there is an analogy here that is deep, and part of the motivation for defining the FFS toy model.)

There are a bunch of details in the history of the output that do not make it into the output. This is fine. However, it is scary when there are details in the history of the output that are about things we care about, that do not make it into the output (or into our understanding of the output). Thus, we would like it if (our understanding of) the output of our ML system successfully screened off its own history from that which we care about.

Unfortunately, we have an opaque ML system that does not have this property, and thus will not have to tell us if it is trying to deceive us. What can we do?

We would like to add some additional notes to the output of the system, without necessarily changing the original system. These notes refine the partition of possible worlds according to output, and thus the worlds corresponding to a given output-notes pair will be a subset of the worlds corresponding to just the output. We are finding a subset $E'_i$ of our original $E_i$ by adding more information through the notes.

We would like $E'_i$ to be counterfactable, meaning that it contains all the information in its own history that is relevant to what we care about. Luckily, there is this intuition that this shouldn't be that hard. All the information is already there. We just need to pull it out; we shouldn't need to add any new information to do this.

All this is to say that we would like our system to instead of outputting E, output a counterfactable E′⊆E, and luckily there should always be a counterfactable E′ no later than E. (I am being sloppy here with conflating the output and our understanding of the output, but still there is a rhyme in the structure that is hard to deny.)

I think that the thing that is going on here is that FFS gives us a nice definition of a good summary: Y is a good summary of X if Y screens off X from everything you care about. The goal of informed oversight is to have systems that output good summaries of themselves. Without this, the overseer cannot evaluate the consequences of the output in an unbiased way. Similarly, when considering non-counterfactable actions, an agent cannot judge the consequences of those actions in an unbiased way.