# Ω 4

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Summary: Take the argmax of counterfactual expected utility, using a thin counterfactual which itself maximizes a priori expected utility.

In An environment for studying counterfactuals, I described a family of decision problems that elicit counterfactuals from an agent. In this post I'll redefine that family using reflective oracles, and define an optimal agent. I was aided in this work by discussions with Alex Appell, James Brooks, Abram Demski, Linda Linsefors, and Alex Mennen.

# The problem

Here's the (not quite) Bayes net for the problem. I changed it a little from the last post, to make the thin counterfactuals more explicit:

controls exploration and is a hidden state. The observation is isomorphic as a random variable to , but the particular isomorphism used by the problem is unknown to the agent.

The agent is given and a prior ; it outputs a set of counterfactual expected utilities, one for each action. The agent's action is equal to the argmax of these expectations, unless it's overridden by . The action and hidden state determine the utility .

The dotted lines mean that is equal to the marginal over . The machinery of reflective oracles resolves the circularity.

Now I'll define this distribution in detail, using Python-like pseudocode. I ought to use a probabilistic programming language, but I don't know any of those. Note that if you have a reflective oracle, you can compute expectations of distributions in constant time, and you can compute argmaxes over sets of real numbers, with ties broken in an undetermined way. Real numbers and distributions can be represented by programs.

## environment()

The function takes an agent and outputs a sample from a joint distribution over . The agent is a function that takes the environment and an observation and stochastically returns a set of counterfactual expectations, which is itself encoded as a function from actions to real numbers.

The environment samples from a uniform distribution. Usually is equal to the symbol , meaning no exploration; but with probability it's a uniformly random action. It computes using the secret isomorphism . It calls the agent to get its counterfactual expectations. It argmaxes over those expectations to get the action, unless overrides that. Finally, is sampled from a problem-dependent likelihood function that depends on and .

define :

sample

sample

if :

else: sample

sample

return

# The agent

The agent's strategy is to choose a thin counterfactual that maximizes its a priori expected utility. We'll define a couple helper functions first.

## cfact()

The function takes a prior , a thin counterfactual , an observation , and an action , and deterministically outputs the expected utility conditional on and counterfactually conditional on , taking the expectation with respect to and the counterfactual with respect to .

This is how to compute a thin counterfactual distribution: First condition on and use to infer a posterior on . Then forget but retain your posterior on . Then condition on . The reason this works is that is independent of , so there's still a chance for by exploration.

define :

return

## utility()

This function takes a prior and a thin counterfactual, and samples from the marginal on that would result if the agent used that thin counterfactual. The previous sentence uses the word "would", so you might think we need another thin counterfactual to make sense of it; but our agent is a program, and programs come with their own "thick" notion of counterfactual.

The calculation goes like this: Sample and . If makes us explore, then that determines the action and we can sample a utility conditional on and . Otherwise, we assume the action is chosen to maximize .

define :

sample

if : return

else: return

## agent()

The agent is given a prior and an observation, and outputs the expected utility conditional on , for every . This output is encoded in a program that takes an action and outputs a real number. The output may be stochastic because contains an argmax which may be stochastic in case of ties.

The true thin counterfactual determines a projection . The agent can deduce this projection by analyzing the prior: For every , for all but one value of .

The argmax is taken over all isomorphisms that commute with projecting to . I won't bother writing that out in the pseudocode:

define :

return

# Optimality theorems

One of the counterfactual expectations that the agent emits is factual. It's reasonable to ask whether that one is accurate.

Theorem: If is bounded by and , then with probability , if we sample , , , we have

Proof: for some . That equals

where and . With a little algebra, that gives us

By the following lemma, with probability , we have . A little more algebra concludes the proof.

Lemma: Sample from a joint distribution over . With probability , we have .

Now we'll prove that gets an optimal amount of utility. First, a couple lemmas.

Lemma:

Proof: By the definition of ,

By definition of , this is

where the expectation is taken over the output distribution of . By linearity of expectation,

This is just

Recall that is the secret isomorphism . The next lemma says that represents the highest utility that an agent can get:

Lemma:

Proof: Using the fact that

and

and the definition of , we have

.

Plugging that in to the definition of , we get

Finally, this theorem says that the agent gets the highest possible utility:

Theorem:

Proof: Call the right-hand side . It's easy to see that . By the previous two lemmas, we also have

Troll Bridge is a problem that punishes exploration. You can't represent it in this framework because the action screens off exploration. But if you modified the framework to implement Troll Bridge, the agent would fail. (It would be afraid to cross the bridge.)

I suspect the solution to Troll Bridge is to use continuous exploration: Explore at every timestep of a logical inductor. If Troll Bridge punishes exploration at timestep , then exploration at succeeds. I haven't worked this out yet.

This agent doesn't get counterfactually mugged. (This is bad.) This agent only exhibits one of the two types of updatelessness.

# Next steps

The notion of "thin counterfactual" used here is more specific than it needs to be. I'd like to generalize it.

Once I do that, I might be able to formulate the symmetric Prisoner's Dilemma. I'm not sure how this agent does on that problem.

I also want to implement this agent in the setting of logical induction. I'm imagining traders getting licenses to participate in thin counterfactual markets, and the licenses prohibit them from learning the price of . This will also involve some kind of continuous exploration.

Mentioned in
New Comment