# Ω 1

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This idea was informed by discussions with Abram Demski, Scott Garrabrant, and the MIRIchi discussion group.

# Summary

For a logical inductor , define logical counterfactuals by

for a suitable and a random variable independent of with respect to . Using this definition, one can construct agents that perform well in ASP-like problems.

# Motivation

Recall the Agent Simulates Predictor problem:

Naively, we want to solve this by argmaxing:

Hopefully, , , and . Also, two-boxing should be less attractive than one-boxing:

However, if we make this well-defined with -exploration, we'll get

and then the agent will two-box, contradiction. Instead we'd like to use predictable exploration and set

for small enough that the right-hand side is sensible. Let's see how.

# Predictable exploration

Choose so that . Our agent decides whether to explore at stage , and uses its beliefs at stage as a substitute for counterfactuals:

Here are small positive numbers. It's easy to see that, under reasonable assumptions, this agent 1-boxes on Agent Simulates Predictor. But it can't use the full strength of in its counterfactual reasoning, and this is a problem.

# Differential privacy

To illustrate the problem, add a term to the utility function that sometimes rewards two-boxing:

The agent should two-box if and only if . Assuming that's the case, and knows this, we have:

So if , two-boxing is the more attractive option, which is a contradiction. (I'm rounding to zero for simplicity.)

The problem is that the counterfactual has to rely on 's imperfect knowledge of . We want to combine 's ignorance of with 's knowledge of .

If is independent of conditioned on with respect to , then we can do this:

Then replace with :

This is more accurate than , and unbiased.

If is not independent of conditional on , we can introduce an auxilliary variable and construct a version of that is independent. This construction is a solution to the following differential privacy problem: Make a random variable that is a function of and independent randomness, maximizing the mutual conditional information , subject to the constraint that is independent of . Using the identity

we see that the maximum is attained when , which means that is a function of and .

Now here's the construction of :

Let be the finite set of possible values of , and let be the finite set of possible values of . We'll iteratively construct a set and define a random variable taking values in . To start with, let .

Now choose

and for each , choose some such that . Then make a random binary variable such that

Then let be the event defined by

and add to . After repeating this process times, we are done.

We can do this with a logical inductor as well. In general, to get a sentence such that , take .

Now given random variables and , and some informative sentences , let be the random variable encoding the values of . The above construction works approximately and conditional on to give us a random variable that is approximately independent of conditional on with respect to . Now we define

whenever .

This succeeds on the problem at the beginning of this section: Assume , and assume that knows this. Then:

which does not lead to contradiction. In fact, there are agents like this that do at least as well as any constant agent:

## Theorem

Let be a utility function defined with metasyntactic variables , , and . It must be computable in polynomial time as a function of , , and , where can be any polytime functions that doesn't grow too slowly and such that . Then there exists a logical inductor such that for every , there exists , , and a pseudorandom variable such that the agent defined below performs at least as well on as any constant agent, up to a margin of error that approaches as :

## Proof sketch

Choose smaller than the strength parameter of the weakest predictor in . If is the best constant policy for , assume . Since can compute , our agent's factual estimate is accurate, and the counterfactual estimate for is an accurate estimate of the utility assigned to the constant policy , as long as we make rich enough. So the agent will choose . Thus we have an implication of the form "if believes , then is true", and so we can create a logical inductor that always believes that for every by adding a trader with a large budget that bids up the price of .

# Isn't this just UDTv2?

This is much less general than UDTv2. If you like, you can think of this as an agent that at time chooses a program to run, and then runs that program at time , except the program always happens to be "argmax over this kind of counterfactual".

Also, it doesn't do policy selection.

# Next steps

Instead of handing the agent a pseudorandom variable that captures everything important, I'd like to have traders inside a logical inductor figure out what should be on their own.

Also, I'd rather not have to hand the agent an optimal value of .

Also, I hope that these counterfactuals can be used to do policy selection and win at counterfactual mugging.