Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

In equation form, the AI is maximising

P(¬X)·C + P(X)·u(X,A)

for some constant C, some unlikely event X that the AI cannot affect, some set of relevant descriptors A, and some utility u. Since C is constant and the AI cannot change P(X), this is exactly the same as maximising u(X,A) - the actual value of P(X), as long as it is positive, is irrelevant.
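A quick numeric sketch of this point, with hypothetical values for P(X), C, and the utilities: because C and P(X) do not depend on the action chosen, the action that maximises the combined expression is the same one that maximises u(X,A) alone.

```python
# Illustrative numbers (not from the post): an unlikely event X the AI
# cannot affect, a constant payoff C in the not-X worlds, and utilities
# u(X, a) for a few candidate actions.
P_X = 1e-6
C = 100.0
utilities = {"a1": 3.0, "a2": 7.0, "a3": 5.0}

def combined(a):
    # P(not X) * C + P(X) * u(X, a) -- only the last term varies with a.
    return (1 - P_X) * C + P_X * utilities[a]

best_combined = max(utilities, key=combined)
best_u = max(utilities, key=utilities.get)
print(best_combined, best_u)  # both select "a2"
```

Since P(X) > 0, multiplying by it preserves the ordering of actions, so adding the constant first term changes nothing about which action wins.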

The whole setup described is simply a way to ensure that if W is the likely set of worlds consistent with observations after ¬X/X, then

P(W) ≈ P(¬X) ≈ 1 (we "know" that X doesn't happen and that we end up in W),

while

P(W|X) ≪ 1 (in the worlds it cares about, the AI behaves as if W was incredibly unlikely to come about).
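To see how both conditions can hold at once, here is a sketch with made-up conditional probabilities: W is near-certain given ¬X, rare given X, and since ¬X itself is near-certain, the unconditional P(W) stays close to 1.

```python
# Hypothetical values: P(W | not X) is high, P(W | X) is tiny.
P_X = 1e-6
P_W_given_notX = 0.999
P_W_given_X = 1e-4

# Law of total probability: P(W) = P(W|¬X)P(¬X) + P(W|X)P(X).
P_W = P_W_given_notX * (1 - P_X) + P_W_given_X * P_X
print(P_W)          # ≈ 0.999, so P(W) ≈ P(¬X) ≈ 1
print(P_W_given_X)  # 0.0001, so P(W|X) ≪ 1
```

The near-certainty of ¬X means the first term dominates P(W), while conditioning on X exposes the tiny P(W|X) that the AI's utility actually responds to.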
