An environment for studying counterfactuals

by Nisan · 11th Jul 2018


Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Summary: I introduce a decision theory framework where successful agents are those with good counterfactuals.


The problem of logical counterfactuals is how to define probabilities $\Pr(\psi \mid \varphi)$ when $\varphi$ is known to be false. (I'll ignore more general counterfactuals in this post.)

The theory of logical induction provides a joint distribution over sentences, so the problem becomes: How do you condition on $\varphi$ when $\varphi$ has negligible probability?
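
To see why this is hard, recall the ratio definition of conditioning:

$$\Pr(\psi \mid \varphi) = \frac{\Pr(\psi \wedge \varphi)}{\Pr(\varphi)}.$$

When $\Pr(\varphi)$ is negligible, this ratio is dominated by noise in the inductor's estimates, and when $\Pr(\varphi) = 0$ it is undefined outright.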

Exploration tries to solve this by making sure that $\varphi$ never has negligible probability. But it doesn't work in problems like Agent Simulates Predictor that contain predictors who can't tell when the agent explores.
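
For intuition, here is a generic $\varepsilon$-greedy sketch of exploration (my own illustration in Python; exploration for logical inductors is usually driven by pseudorandom clauses rather than a random number generator, but the point is the same: every action retains probability at least $\varepsilon / |\mathcal{A}|$, so conditioning on it is well-defined):

```python
import random

def epsilon_explore(actions, expected_utility, epsilon=0.05, rng=random):
    """With probability epsilon, take a uniformly random action; otherwise
    take the action maximizing expected utility. Every action then has
    probability at least epsilon / len(actions), so conditioning on any
    action is well-defined."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=expected_utility)
```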

A better solution is early exploration, which uses an early stage of the logical inductor to do exploration. But then the later stages of the inductor know that $\varphi$ is false, and we're back where we started.

I'm going to describe an environment that captures these features of the problem — it's got reflection, early exploration, counterfactuals, and a Bayesian update that stands in for the evolution of a logical inductor.

Informal definition

The agent outputs counterfactual distributions $(Q_a)$, one for each action $a$. This determines an expected utility for each action. Most of the time, an action is chosen for the agent that maximizes this expected utility. But a small fraction of the time, an exploration action is chosen instead.

The agent receives an observation $O$ as input, from which it can infer whether exploration will occur. The agent also receives a prior $P$ as input, and this prior accurately reflects the behavior of the agent as a function of $O$ and $P$. (This uses a fixed-point theorem.)

If action $a$ is chosen, then the counterfactual $Q_a$ is factual; the rest are counterfactual. We judge an agent according to how accurate its factual counterfactual is, in addition to how much utility it gets.
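
Putting the informal definition together, one round of the environment might look like the sketch below. Every name here is mine; `sample_E`, `sample_O`, and `sample_U` stand in for the problem-specific likelihoods, and counterfactual distributions are finite dicts mapping utilities to probabilities.

```python
import random

def expected_utility(dist):
    """Expected utility of a finite distribution given as {utility: prob}."""
    return sum(u * p for u, p in dist.items())

def run_round(agent, P, sample_E, sample_O, sample_U, actions):
    """One round: sample exploration, let the agent output counterfactuals,
    choose the action, and sample the utility."""
    E = sample_E()               # an action to force, or None for no exploration
    O = sample_O(E)              # the observation may reveal whether E is None
    R = random.getrandbits(64)   # stand-in for the agent's random bit source
    Q = agent(O, P, R)           # Q[a] is the counterfactual for action a
    if E is None:
        A = max(actions, key=lambda a: expected_utility(Q[a]))  # exploit
    else:
        A = E                    # explore: the action is forced
    U = sample_U(A, O)
    return A, O, U, Q
```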

Here's an agent that does okay in this environment: It adopts $P$ as its epistemic state and ignores $O$. Because of exploration, it can compute counterfactuals by conditioning on each action. This agent does okay but not great, since it ignores $O$.
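
A minimal sketch of this conditioning agent, under the toy assumption that $P$ is handed to the agent as a finite list of `(prob, action, utility)` triples (my own encoding, not the post's):

```python
from collections import defaultdict

def conditioning_agent(O, P, R, actions):
    """Ignore O and R; output Q_a = P(utility | action = a) for each a.
    Because of exploration, each action has positive marginal probability
    under P, so the division below is always defined."""
    Q = {}
    for a in actions:
        marginal = sum(p for p, act, _ in P if act == a)
        dist = defaultdict(float)
        for p, act, u in P:
            if act == a:
                dist[u] += p / marginal
        Q[a] = dict(dist)
    return Q
```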

You could try to make a better agent as follows: Adopt $P$ as a prior and then do a Bayesian update on $O$. But now you've inferred whether exploration occurs, so some actions have probability zero, and it's not clear how to compute counterfactuals.
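
In the toy encoding above, the failure shows up as a literal zero denominator. A sketch, again with my own encoding (`P` now carries observations as `(prob, action, utility, obs)` tuples):

```python
def updating_agent(O, P, R, actions):
    """First update P on the observation, then try to condition on actions."""
    posterior = [(p, a, u) for p, a, u, o in P if o == O]
    norm = sum(p for p, _, _ in posterior)
    posterior = [(p / norm, a, u) for p, a, u in posterior]
    # If O reveals that no exploration occurs, every surviving sample takes
    # the expected-utility-maximizing action. For any other action the
    # marginal below is zero, and Q_a is 0/0: the counterfactual is undefined.
    for a in actions:
        marginal = sum(p for p, act, _ in posterior if act == a)
        assert marginal > 0, f"can't condition on zero-probability action {a!r}"
```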

If you find a good agent for this environment, you'll probably have learned something about making good counterfactuals.

Formal definition

A decision problem consists of a tuple of random variables $(A, E, O, U, P, R, Q)$:

  • $A \in \mathcal{A}$, where $\mathcal{A}$ is a finite set of actions.
  • $E \in \mathcal{A} \cup \{\bot\}$ determines whether the agent explores. If $E = \bot$, no exploration takes place.
  • $O \in \mathcal{O}$, where $\mathcal{O}$ is a finite set of observations.
  • $U \in \mathcal{U}$, where $\mathcal{U}$ is the space of utilities.
  • $P \in \Delta$, where $\Delta$ is the space of distributions over the above variables.
  • $R \in \{0, 1\}^\omega$ represents an infinite source of random bits that the agent can use.
  • $Q = (Q_a)_{a \in \mathcal{A}}$ is the agent's output, representing a counterfactual distribution over $\mathcal{U}$ for each action.

along with some likelihoods:

  • $\Pr(E = a) = \varepsilon / |\mathcal{A}|$ for all $a \in \mathcal{A}$, and $\Pr(E = \bot) = 1 - \varepsilon$, for some choice of $\varepsilon \in (0, 1)$.
  • Likelihoods $\Pr(O \mid E)$, depending on the problem.
  • i.i.d. uniform distributions on each bit of $R$.
  • $A$ is mostly determined by $E$ and $Q$ as follows: If $E = \bot$, then $A = \operatorname{argmax}_{a \in \mathcal{A}} \mathbb{E}_{Q_a}[U]$. (If there is a tie, $A$ is undetermined.) Otherwise, $A = E$.
  • A distribution over $U$ conditional on each value of $A$ and $O$, depending on the problem. (A code sketch of these likelihoods follows this list.)
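
One way to encode this tuple of likelihoods with finite supports and explicit probability tables (all field names are my own; `None` plays the role of $\bot$):

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

Action = str
Obs = str

@dataclass
class DecisionProblem:
    actions: Tuple[Action, ...]        # the finite set of actions
    observations: Tuple[Obs, ...]      # the finite set of observations
    epsilon: float                     # total exploration probability
    obs_given_E: Dict[Optional[Action], Dict[Obs, float]]           # Pr(O | E)
    utility_given_AO: Dict[Tuple[Action, Obs], Dict[float, float]]  # Pr(U | A, O)

    def p_explore(self, e: Optional[Action]) -> float:
        """Pr(E = e): epsilon/|A| for each action, 1 - epsilon for E = None."""
        if e is None:
            return 1.0 - self.epsilon
        return self.epsilon / len(self.actions)
```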

An agent is a function mapping $(O, P, R)$ to $Q$. A decision problem and an agent together almost determine a joint distribution over all the variables. What's missing is $P$ and the tiebreakers for $\operatorname{argmax}$. These are determined by finding a fixed point satisfying:

  • If $\mu$ is the resulting marginal distribution over $(A, E, O, U)$, then $P = \mu$.
  • For each value of $O$ and $R$, there is a distribution over the set $\operatorname{argmax}_{a \in \mathcal{A}} \mathbb{E}_{Q_a}[U]$ such that $A$ is sampled from that distribution. (A naive iteration scheme is sketched after this list.)
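
The post doesn't give a construction for the fixed point, but naive iteration conveys the flavor. In this sketch, `joint_distribution` is a hypothetical routine that computes the joint over $(A, E, O, U)$, as a dict from outcomes to probabilities, from the problem, the agent, and a candidate $P$, breaking $\operatorname{argmax}$ ties uniformly:

```python
def find_fixed_point(problem, agent, P0, joint_distribution,
                     max_iters=1000, tol=1e-9):
    """Iterate P -> induced marginal over (A, E, O, U) until P reproduces
    itself. Plain iteration is not guaranteed to converge; this is only a
    sketch of what a fixed point means."""
    P = P0
    for _ in range(max_iters):
        mu = joint_distribution(problem, agent, P)
        gap = max(abs(P.get(k, 0.0) - mu.get(k, 0.0)) for k in set(P) | set(mu))
        if gap < tol:
            return mu
        P = mu
    raise RuntimeError("no fixed point found within max_iters")
```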

(I might prove the existence of a fixed point in a comment.)

We'll informally say that an agent does well on a decision problem if, for every fixed point, the following are true:

  • $\mathbb{E}[U]$ is high.
  • The factual counterfactual is accurate — say, it's close to the marginal over $U$ conditional on the true action in total variation distance: $d_{\mathrm{TV}}(Q_A, \mu(U \mid A))$ is small. (A helper for computing this distance is sketched below.)
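
For concreteness, here is the standard total variation distance between two finite distributions, represented as dicts from outcomes to probabilities; the accuracy criterion asks that `tv_distance(Q[A], mu_U_given_A)` be small, where `mu_U_given_A` (my name) is the fixed point's distribution over $U$ given the chosen action:

```python
def tv_distance(p, q):
    """Total variation distance: half the L1 distance between two
    probability mass functions. Ranges from 0 (identical) to 1 (disjoint)."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)
```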