Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

My counterfactual Oracle design uses a utility function/reward function in order to train it to give the right predictions. Paul Christiano asked whether the whole utility function approach was necessary, and whether it could be replaced with a simple sequence predictor.

It turns out the answers are no, and yes. The counterfactual nature of the approach can be preserved, without needing a utility function.

## Counterfactual sequence prediction

There is a boxed Oracle, $O$, which is fed a series of observations $o_0, o_1, \ldots, o_{n-1}$, and must make a prediction $p_n$ about the next observation.

If the prediction were sent out into the world, then $O$ would be attempting to make $p_n$ into a self-confirming prediction, with all the problems that this could entail.

However, we make $O$ into a counterfactual Oracle; on some occasions, the output $p_n$ is erased, and not seen by anyone. In that case, the Oracle will get a special message $e$ (for "erasure"), which is appended after $o_{n-1}$.

Then the job of the counterfactual Oracle is, given a history $h_{n-1} = o_0 p_1 o_1 \ldots p_{n-1} o_{n-1}$ (including past "$e$"s), to produce a prediction $p_n$ that is the best prediction for $o_n$ given the history $h_{n-1}e$. In words, the Oracle is trying to best predict the next observation, given that erasure is assumed to happen (the definition of "best" depends on the details of the design).

If our setup is well designed and there is no information leakage, then any $o_n$ has a probability that is independent of the prediction $p_n$:

$$\forall o_n, h_{n-1}, p_n, p'_n: \quad P(o_n \mid h_{n-1} e p_n) = P(o_n \mid h_{n-1} e p'_n).$$
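The erasure protocol can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy `world` function, the 0.3 base rate, the 0.9/0.1 "readers make the prediction come true" effect, and the choice of "most likely outcome" as the meaning of "best". The Oracle here fits its prediction only on erased episodes, where $o_n$ cannot depend on $p_n$.

```python
import random

def world(prediction, erased, rng):
    """Hypothetical toy world returning the next observation o_n (0 or 1)."""
    if erased:
        # Nobody reads p_n, so o_n cannot depend on it.
        return 1 if rng.random() < 0.3 else 0
    # Readers act on p_n and tend to make it come true.
    p_one = 0.9 if prediction == 1 else 0.1
    return 1 if rng.random() < p_one else 0

def run_oracle(episodes=20000, erase_prob=0.2, seed=0):
    """Return (frequency of o_n=1 on erased episodes, same on read episodes)."""
    rng = random.Random(seed)
    counts = {True: [0, 0], False: [0, 0]}  # erased? -> [ones, total]
    for _ in range(episodes):
        ones, total = counts[True]
        estimate = ones / total if total else 0.5
        # p_n is fitted on erased episodes only.
        prediction = 1 if estimate >= 0.5 else 0
        erased = rng.random() < erase_prob
        obs = world(prediction, erased, rng)
        counts[erased][0] += obs
        counts[erased][1] += 1
    return tuple(c[0] / c[1] if c[1] else 0.5
                 for c in (counts[True], counts[False]))

erased_freq, read_freq = run_oracle()
```

In this toy setting the erased-episode frequency settles near the assumed base rate, while the frequency on read episodes is pulled towards whatever the Oracle predicted, which is why only the erased episodes are used for fitting.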

## Meaning of predictions

If we ran the Oracle without the assumption that $e$ would be observed, we'd have the standard self-confirming predictor (self-confirming predictions being the only correct ones in that setting); with that assumption, it becomes a counterfactual predictor.

Though counterfactuals are conceptually tricky, it's actually easier to understand what a counterfactual predictor is doing than a general sequence predictor Oracle. We understand that a general sequence predictor is trying to find a self-confirming prediction - one where $p_n$ is the best estimate of $o_n$, given $h_{n-1}p_n$ - but unless we understand the structure of the world very well, we don't have a clue where that self-confirming prediction might be.

In contrast, the counterfactual prediction is easy to interpret: "had we not seen $p_n$, that is what $o_n$ would have been"; there are no loops in the definition.

## Extent of manipulation via prediction

Note that the Oracle can be used to estimate the extent to which the prediction is manipulative, by contrasting its predictions for $o_n$ given $h_{n-1}$ and given $h_{n-1}e$. The gap between the two indicates how much the outcome could change because the prediction is seen.
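One way to turn this contrast into a number is sketched below. The post does not specify a distance measure; total variation distance is an assumption here, as are the hypothetical `toy_predict` stand-in for a trained Oracle and the representation of histories as lists with `"e"` appended for erasure.

```python
def total_variation(p, q):
    """Total variation distance between two distributions given as dicts."""
    outcomes = set(p) | set(q)
    return 0.5 * sum(abs(p.get(o, 0.0) - q.get(o, 0.0)) for o in outcomes)

def manipulation_gap(predict, history):
    """Distance between the forecasts for o_n given h and given h followed by e."""
    return total_variation(predict(history), predict(history + ["e"]))

def toy_predict(history):
    # Hypothetical stand-in for a trained Oracle's predictive distribution.
    if history and history[-1] == "e":
        return {"up": 0.3, "down": 0.7}
    return {"up": 0.9, "down": 0.1}

gap = manipulation_gap(toy_predict, ["o0", "p1", "o1"])
```

With these made-up numbers the gap is about 0.6; a gap of zero would mean the prediction has no effect on the outcome in this toy setting.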

## Non-explicit erasure

If the Oracle isn't given an explicit erasure message, but has a causal model of the world, we can still run the counterfactual Oracle by having it choose $p_n$ to be the best fit for $o_n$, conditional on $h_{n-1}p_n$ and on the erasure happening.

## Factoring out certain routes

Assume now that the AI is an active AI that can take actions in the world, affecting the observation $o_n$. So we'll replace the $p_i$ terminology with $a_i$; note that the predictions $p_i$ were already actions, just actions of a particular type in a particular setup.

We want to have the AI only consider certain routes to influence $o_n$. This might be because there is a reward-function learning process involved, and we don't want the AI to influence the learning process. Or maybe there is a corrigibility button involved, and we don't want the AI to try to ensure it is pressed or not pressed.

In any case, label the banned routes of influence by the variables $b$:

Note that the Oracle case is just a special case of this, where $b$ corresponds to "humans reading the prediction", and there is no direct arrow from $a_n$/$p_n$ to $o_n$.

## Factoring out

Assume that $\hat{a}$ is the default action. Then we want the AI to maximise some expected score $S$ of $o_n$ and $a_n$, but acting as if the values of $b$ were set by $\hat{a}$ rather than by the actual value of $a_n$.

Let $\{\beta_i\}$ be the possible values of $b$. We require one more assumption: that if $b=\beta_i$ is possible given $\hat{a}$, then it is also possible given any other action $a$ (technically, we only need the conditional probabilities $P(o_n \mid b=\beta_i, a_n=a)$ to make sense).

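Then if the AI has sufficient data to estimate expressions like $P(b=\beta_i \mid a)$, we act as if the AI was off-policy where $b$ is concerned. Namely, we get it to compute the action $a$ that maximises:

$$\sum_{\beta_i} P(b=\beta_i \mid a_n=\hat{a}, h_{n-1}) \, \mathbb{E}[S(o_n,a) \mid b=\beta_i, a_n=a, h_{n-1}].$$

Note that if we had $a_n=a$ instead of $a_n=\hat{a}$ in that expression, then this would just be the classical $\mathbb{E}[S(o_n,a) \mid a_n=a, h_{n-1}]$.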
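The reduction to the classical case is the law of total expectation taken over the values of $b$. Written out with the same symbols as above, and assuming the $\beta_i$ are mutually exclusive and exhaustive:

$$\sum_{\beta_i} P(b=\beta_i \mid a_n=a, h_{n-1}) \, \mathbb{E}[S(o_n,a) \mid b=\beta_i, a_n=a, h_{n-1}] = \mathbb{E}[S(o_n,a) \mid a_n=a, h_{n-1}].$$

The only difference in the factored-out objective is that the weights on the left are evaluated at $a_n=\hat{a}$, so the AI gets no credit for shifting the distribution of $b$.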

This construction is essentially the same as the counterfactually uninfluenceable agent design, with $b$ representing the weights of the different reward functions that the AI was to "learn".
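The factored-out objective from this section can be sketched for a small discrete case. The action names, the button interpretation of $b$, and all probabilities and scores below are hypothetical numbers chosen for illustration; the functions simply evaluate the weighted sum, with the weights on $b$ taken from a chosen action.

```python
def factored_out_score(action, weight_action, p_b_given_a, expected_score):
    """Sum over beta of P(b=beta | a_n=weight_action) * E[S | b=beta, a_n=action]."""
    return sum(
        weight * expected_score[(beta, action)]
        for beta, weight in p_b_given_a[weight_action].items()
    )

def best_action(actions, default_action, p_b_given_a, expected_score):
    """Action maximising the score with b distributed as under default_action."""
    return max(actions, key=lambda a: factored_out_score(
        a, default_action, p_b_given_a, expected_score))

def classical_best_action(actions, p_b_given_a, expected_score):
    """Ordinary expected-score maximiser: the weights use the action itself."""
    return max(actions, key=lambda a: factored_out_score(
        a, a, p_b_given_a, expected_score))

# Hypothetical numbers: b records whether a button ends up pressed.
ACTIONS = ["wait", "work", "push_button"]
P_B_GIVEN_A = {
    "wait": {"pressed": 0.5, "not_pressed": 0.5},
    "work": {"pressed": 0.5, "not_pressed": 0.5},
    "push_button": {"pressed": 0.95, "not_pressed": 0.05},
}
EXPECTED_SCORE = {  # E[S(o_n, a) | b, a_n = a]
    ("pressed", "wait"): 1.0, ("not_pressed", "wait"): 0.0,
    ("pressed", "work"): 1.2, ("not_pressed", "work"): 0.2,
    ("pressed", "push_button"): 1.0, ("not_pressed", "push_button"): 0.0,
}

factored_choice = best_action(ACTIONS, "wait", P_B_GIVEN_A, EXPECTED_SCORE)
classical_choice = classical_best_action(ACTIONS, P_B_GIVEN_A, EXPECTED_SCORE)
```

With these numbers the classical maximiser prefers `push_button`, because it is rewarded for changing the distribution of $b$, while the factored-out maximiser with default action `wait` prefers `work`. Note that `push_button` leaves `not_pressed` with non-zero probability, so that every conditional expectation in the sum is defined, as the assumption above requires.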