# Ω 3

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

We know a decent amount about how to get tiling in proof-based environments where the objective is for the AI to achieve a goal, or to write an AI to achieve a goal, or to write an AI that writes an AI that achieves a goal, and so on, as long as the chain of deferral is finite. The probabilistic setting is far less explored. This post will outline some necessary conditions to achieve tiling of an expected utility maximizer in fully general environments. Some of these conditions are fulfilled by a logical inductor in the limit. It's entirely possible that there are serious problems with this proof, so I'm hoping that it gets a good kicking.

## DEFINITIONS:

is an inputless computation that outputs a bitstring, which will be interpreted as a 2-tuple of an action, and another computation, an element of .

is the n'th computation in the infinite sequence of computations defined by starting at and taking the computation that each computation outputs. When n=1, the superscript will be omitted.

is the action selected by .

is some probability distribution or something that can be used to evaluate complicated computations. Like the n'th stage of a logical inductor, although we aren't necessarily assuming that it has to be a logical inductor.

underlining something refers to replacing the symbol with the thing being described. If the variable is used as a way of abstractly representing some action, then refers to the actual action chosen by the policy.

is the utility function, which is some large computation of type . So this would be the utility function output when it is fed the past actions generated by starting at , the current action, and the infinite sequence of future actions is given by whatever sequence of actions is produced by the self-modification chain starting with . Note that this only explicitly writes the starting code, and the code that might be modified to, not the past or future action sequence! This is important for the agent to be able to reason about this computation, despite it taking an infinite input.

Given some sequence of probability distributions, and some sequence of sets of policies, let be the computation

This is basically maximizing expected utility, conditional on the agent itself taking some action, with the past sequence of actions given by an expected utility maximizer on all previous steps.

Actually, in the proof, we will be assuming a 2-step process where some essential things are proved, and then argmax is done. This is done so that the future probability distribution respects some important information. The three important bits of computation are that the action of the previous computation must be computed/observed, that all the policies in must be emulated to know what policy and action they produce, with all of the produced policies added to , and the proof in the next section must be done to ensure that the conditional probabilities work out appropriately.

Now, time for the assumptions.

The first assumption is that abstract expected utility equals concrete expected utility.

This essentially says that if you're going to actually take action 4, then the expected utility of "I take an action" had better be the same as the expected utility of action 4. This is probably fulfilled by a logical inductor, as it would be possible to buy a conditional share/regular share in the lower-priced utility computation, and sell a conditional share/regular share in the higher-priced utility computation, and when the action is actually taken, the provable equivalence between the action actually taken and the action in the abstract means that the prices converge to each other, and the theorems in the logical induction probably enable profiting off the predictable future equality.

One important note is that CDT spectacularly fails this property. Namely, consider a game of matching pennies against a powerful predictor. Since the environment takes actions as input, it's possible to recompute what would have happened if a different action is plugged in. The CDT agent that keeps losing is going to learn to randomize between actions since it keeps seeing that the action it didn't take would have done better. So it eventually gets to a state where it predicts the reward from "pick heads" and "pick tails" is 0.5 (because there's a 50% chance it doesn't pick heads/tails), but it predicts the reward from "I take an action" is 0, violating this assumption. EDT, however, will predict that picking heads, picking tails, and picking some action all have utility 0, thus fufilling this condition.

The second assumption is that the environment is fair. Namely:

This says something like "after finding out that you take a particular action and policy, that should be the same expected utility as some other policy picking the same action and successor policy", or, more snappily, "finding out that you take a particular action shouldn't be bad news for the expected utility of a particular sequence of actions". The matching pennies predictor that is just penalizing the sequence of actions selected by an argmaxer is unfair in this sense for the argmaxer. This condition (or something like it) seems necessary, in the sense that you can come up with examples where a failure of this condition incentivizes modifying yourself to a different way of selecting actions.

The third assumption (which has already been stated), is that strategy stealing works. More specifically, if , then , and is computed at time n. This is another necessary condition. Consider a fair environment that rewards you when you do the hamster dance. You are pretty sure you remember the steps, but aren't completely sure. There's another policy that will make you correctly perform the motions. The expected utility of "move your arm left" is different than the expected utility of "hamster dance policy does its move". However, if you were able to compute what the hamster dance policy would output as its next move, you could just copy the move, without giving up your argmaxiness property. Therefore, if you don't know what move a policy outputs, there may be an incentive to modify to it, even if the environment is fair.

The fourth assumption is conditional future-trust of the following form:

This is emphatically false in general, but there's a special condition that makes it viable, namely that the distribution at time n is guaranteed to assign probability 1 to iff . My epistemic state about this is "this seems extremely plausible, but I don't know for sure if logical inductors attain this property in the limit"

## PROOF:

We will reason from the perspective of the expected utility maximizer in the proof phase shortly before turn n-1, and show that the future distribution that it trusts endorses expected utility maximization over all other policies, and use conditional self-trust to move this conclusion back to the current maximizer to inform its choice of which policy to pick.

To begin (we aren't reasoning from the perspective of the agen