*A putative new idea for AI control; index here.*

This is part of the process of rigourising and formalising past ideas.

Paul Christiano recently asked why I used utility changes, rather than probability changes, to have an AI believe (or act as if it believed) false things. While investigating that, I developed several different methods for achieving the belief changes that we desired. This post analyses these methods.

## Different models of forced beliefs

Let x and ¬x refer to the future outcome of a binary random variable X (write P(x) as a shorthand for P(X=x), and so on). Assume that we want P(x):P(¬x) to be in the 1:λ ratio for some λ (since the ratio is all that matters, λ=∞ is valid, meaning P(x)=0). Assume that we have an agent, who has utility u, has seen past evidence e, and wishes to assess the expected utility of their action a.

Typically, for expected utility, we sum over the possible worlds. In practice, we almost always sum over sets of possible worlds, the sets determined by some key features of interest. In assessing the quality of health interventions, for instance, we do not carefully and separately treat each possible position of atoms in the sun. Thus let V be the set of variables or values we can about, and v a possible value vector V can take. As usual, we'll be writing P(v) as a shorthand for P(V=v). The utility function u assigns utilities to possible v's.

One of the advantages of this approach is that it can avoid many issues of conditionals like P(A|B) when P(B)=0.

The first obvious idea is to condition on x and ¬x:

- (1) Σ
_{v}u(v)(P(v|x,e,a)+λP(v|¬x,e,a))

The second one is to use intersections rather than conditionals (as in this post):

- (2) Σ
_{v}u(v)(P(v,x|e,a)+λP(v,¬x|e,a))

Finally, imagine that we have a set of variables H, that "screen off" the effects of e and a, up until X. Let h be a set of values H can take. Thus P(x|h,e,a)=P(x|h). One could see H as the full set of possible pre-X histories, but it could be much smaller - maybe just the local environment around X. This gives a third definition:

- (3) Σ
_{v}Σ_{h}u(v)(P(v|h,x,e,a)+λP(v|h,¬x,e,a))P(h|,e,a)

## Changing and unchangeable P(x)

An important thing to note is that *all three definitions are equivalent for fixed P(x)**, up to changes of λ*. The equivalence of (2) and (1) derives from the fact that Σ_{v} u(v)(P(v,x|e,a)+λP(v,¬x|e,a)) = Σ_{v} u(v)(P(x)P(v|x,e,a)+λP(¬x)P(v|¬x,e,a)) (we write P(x) rather than P(x|e,a) since the probability of x is fixed). Thus a type (2) agent with λ is equivalent with a type (1) agent with λ'=λP(x)/P(¬x).

Similarly, P(v|h,x,e,a)=P(v,h,x|e,a)/(P(x|h,e,a)*P(h|e,a)). Since P(x|h,e,a)=P(x), equation (3) reduces to Σ_{v} Σ_{h} u(v)(P(x)P(v,h,x|e,a)+λP(¬x)P(v,h,¬x|e,a)). Summing over h, this becomes Σ_{v} u(v)(P(x)P(v,x|e,a)+λP(¬x)P(v,¬x|e,a))=Σ_{v} u(v)(P(v|x,e,a)+λP(v|¬x,e,a)), ie the same as (1).

What about non-constant x? Let c(x) and c(¬x) be two contracts that pay out under x and ¬x, respectively. If the utility u is defined as 1 if a payout is received (and 0 otherwise), it's clear that both agent (1) and agent (3) assess c(x) as having an expected utility of 1 while c(¬x) has an expected utility of λ. This assessment is unchanging, whatever the probability of x. Therefore agents (1) and (3), in effect, see the odds of x as being a constant ratio 1:λ.

Agent (2), in contrast, gets a one-off artificial 1:λ update to the odds of x and then proceeds to update normally. Suppose that X is a coin toss that the agent believes is fair, having extensively observed the coin. Then it will believe that the odds are 1:λ. Suppose instead that it observes the coin has a λ:1 odd ratio; then it will believe the true odds are 1:1. It will be accurate, with a 1:λ ratio added on.

The effects of this percolate backwards in time from X. Suppose that X was to be determined by the toss of one of two unfair coins, one with odds ε:1 and one with odds 1:ε. The agent would assess the odds of the first coin being used rather than the second as around 1:λ. This update would extend to the process of choosing the coins, and anything that that depended on. Agent (1) is similar, though its update rule always assumes the odds of x:¬x being fixed; thus any information about the processes of coin selection is interpreted as a change in the probability of the processes, not a change in the probability of the outcome.

Agent (3), in contrast, is completely different. It assess the probability of H=h objectively, but then assumes that the odds of x and ¬x, given any h, is 1:λ. Thus if given updates about the probability of which coin is used, it will assess those updates objectively, but then assume that both coins are "really" giving 1:1 odds. It cuts off the update process at h, thus ensuring that it is "incorrect" only about x and its consequences, not its pre-h causes.

## Utility and probability: assessing goal stability

Agents with unstable goals are likely to evolve towards being (equivalent to) expected utility maximisers. The converse is more complicated, but we'll assume here that an agent's goal is stable if it is an expected utility maximiser for some probability distribution.

Which one? I've tended to shy away from changing the probability, preferring to change the utility instead. If we divide the probability in equation (2) by 1+λ, it becomes a u-maximiser with a biased probability distribution. Alternatively, if we defined u'(v,x)=u(v) and u'(v,¬x)=λu(v), then it is a u'-maximiser with an unmodified probability distribution. Since all agents are equivalent for fixed P(x), we can see that in that case, all agents can be seen as expected utility maximisers with the standard probability distribution.

Paul questioned whether the difference was relevant. I preferred the unmodified probability distribution - maybe the agent uses the distribution for induction, maybe having false probability beliefs will interfere with AI self-improvement, or maybe agents with standard probability distributions are easier to corrige - but for agent (2) the difference seems to be arguably a matter of taste.

Note that though agent (2) is stable, it's definition is not translation invariant in u. If we add c to u, we add c(P(x|e,a)+λP(¬x|e,a)) to u'. Thus, if the agent can affect the value of P(x) through its actions, different constants c likely give different behaviours.

Agent (1) is different. Except for the cases λ=0 and λ=∞, the agent cannot be an expected utility maximiser. To see this, just notice that an update about the process that could change the probability of x, gets reinterpreted as an update on the probability of that process. If we have the ε:1 and 1:ε coins, then any update about their respective probabilities of being used gets essentially ignored (as long as the evidence that the coins are biased is much stronger than the evidence as to which coin is used).

In the cases λ=0 and λ=∞, though, agent (1) is a u-maximiser that uses the probability distribution that assumes x or ¬x is certain, respectively. This is the main point of agent (1) - providing a simple maximiser for those cases.

What about agent (3)? Define u' by: u'(v,h,x)=u(v)/P(x|h), and u'(v,h,¬x)=λu(v)/P(¬x|h). Then consider the u'-maximiser:

- (4) Σ
_{v}Σ_{h}u'(v,h,x)P(v,h,x|e,a)+u'(v,h¬x)P(v,h,¬x|e,a)

Now P(v,h,x|e,a)=P(v|h,x,e,a)P(x|h,e,a)P(h|e,a). Because of the screening off assumptions, the middle term is the constant P(x|h). Multiplying this by u'(v,h,x)=u(v)/P(x|h) gives u(v)P(v|h,x,e,a)P(h|e,a). Similarly, the second term becomes λu(v)P(v|h,¬x,e,a)P(h|e,a). Thus a u'-maximiser, with the standard probability distribution, is the same as agent (3), thus proving the stability of that agent type.

## Beyond the future: going crazy or staying sane

What happens after the event X has come to pass? In that case, agent (4), the u'-maximiser will continue as normal. Its behaviour will not be unusual as long as neither λ nor 1/λ is close to 0. The same goes for agent (2).

In contrast, agent (3) will no longer be stable after X, as H no longer screens off evidence after that point. And agent (1) was never stable in the first place, and now it denies all the evidence it sees to determine that impossible events actually happened. But what of those two agents, or the stable ones if λ or 1/λ were close to 0? In particular, what if λ falls below the probability that the agent is deluded in its observation of X?

In those cases, it's easy to argue that the agents would effectively go insane, believing wild and random things to justify their delusions.

But maybe not, in the end. Suppose that you, as a human, believe an untrue fact - maybe that Kennedy was killed on the 23rd of November rather than the 22nd. Maybe you construct elaborate conspiracy theories to account for the discrepancy. Maybe you posit an early mistake by some reporter that was then picked up and repeated. After a while, you discover that all the evidence you can find points to the 22nd. Thus, even though you believe with utter conviction that the assassination was on the 23rd, you learn to expect that the next piece of evidence will point to the 22nd. You look for the date-changing conspiracy, and never discover anything about it; and thus learn to expect they have covered their tracks so well they can't be detected.

In the end, the expectations of this "insane" agent could come to resemble those of normal agents, as long as there's some possibility of a *general* explanation of *all* the normal observations (eg a well-hidden conspiracy) given the incorrect assumption.

Of course, the safer option is just to corrige the agent to some sensible goal soon after X.