# Ω 3

Frontpage
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

We know a decent amount about how to get tiling in proof-based environments where the objective is for the AI to achieve a goal, or to write an AI to achieve a goal, or to write an AI that writes an AI that achieves a goal, and so on, as long as the chain of deferral is finite. The probabilistic setting is far less explored. This post will outline some necessary conditions to achieve tiling of an expected utility maximizer in fully general environments. Some of these conditions are fulfilled by a logical inductor in the limit. It's entirely possible that there are serious problems with this proof, so I'm hoping that it gets a good kicking.

## DEFINITIONS:

is an inputless computation that outputs a bitstring, which will be interpreted as a 2-tuple of an action, and another computation, an element of .

is the n'th computation in the infinite sequence of computations defined by starting at and taking the computation that each computation outputs. When n=1, the superscript will be omitted.

is the action selected by .

is some probability distribution or something that can be used to evaluate complicated computations. Like the n'th stage of a logical inductor, although we aren't necessarily assuming that it has to be a logical inductor.

underlining something refers to replacing the symbol with the thing being described. If the variable is used as a way of abstractly representing some action, then refers to the actual action chosen by the policy.

is the utility function, which is some large computation of type . So this would be the utility function output when it is fed the past actions generated by starting at , the current action, and the infinite sequence of future actions is given by whatever sequence of actions is produced by the self-modification chain starting with . Note that this only explicitly writes the starting code, and the code that might be modified to, not the past or future action sequence! This is important for the agent to be able to reason about this computation, despite it taking an infinite input.

Given some sequence of probability distributions, and some sequence of sets of policies, let be the computation

This is basically maximizing expected utility, conditional on the agent itself taking some action, with the past sequence of actions given by an expected utility maximizer on all previous steps.

Actually, in the proof, we will be assuming a 2-step process where some essential things are proved, and then argmax is done. This is done so that the future probability distribution respects some important information. The three important bits of computation are that the action of the previous computation must be computed/observed, that all the policies in must be emulated to know what policy and action they produce, with all of the produced policies added to , and the proof in the next section must be done to ensure that the conditional probabilities work out appropriately.

Now, time for the assumptions.

The first assumption is that abstract expected utility equals concrete expected utility.

This essentially says that if you're going to actually take action 4, then the expected utility of "I take an action" had better be the same as the expected utility of action 4. This is probably fulfilled by a logical inductor, as it would be possible to buy a conditional share/regular share in the lower-priced utility computation, and sell a conditional share/regular share in the higher-priced utility computation, and when the action is actually taken, the provable equivalence between the action actually taken and the action in the abstract means that the prices converge to each other, and the theorems in the logical induction probably enable profiting off the predictable future equality.

One important note is that CDT spectacularly fails this property. Namely, consider a game of matching pennies against a powerful predictor. Since the environment takes actions as input, it's possible to recompute what would have happened if a different action is plugged in. The CDT agent that keeps losing is going to learn to randomize between actions since it keeps seeing that the action it didn't take would have done better. So it eventually gets to a state where it predicts the reward from "pick heads" and "pick tails" is 0.5 (because there's a 50% chance it doesn't pick heads/tails), but it predicts the reward from "I take an action" is 0, violating this assumption. EDT, however, will predict that picking heads, picking tails, and picking some action all have utility 0, thus fufilling this condition.

The second assumption is that the environment is fair. Namely:

This says something like "after finding out that you take a particular action and policy, that should be the same expected utility as some other policy picking the same action and successor policy", or, more snappily, "finding out that you take a particular action shouldn't be bad news for the expected utility of a particular sequence of actions". The matching pennies predictor that is just penalizing the sequence of actions selected by an argmaxer is unfair in this sense for the argmaxer. This condition (or something like it) seems necessary, in the sense that you can come up with examples where a failure of this condition incentivizes modifying yourself to a different way of selecting actions.

The third assumption (which has already been stated), is that strategy stealing works. More specifically, if , then , and is computed at time n. This is another necessary condition. Consider a fair environment that rewards you when you do the hamster dance. You are pretty sure you remember the steps, but aren't completely sure. There's another policy that will make you correctly perform the motions. The expected utility of "move your arm left" is different than the expected utility of "hamster dance policy does its move". However, if you were able to compute what the hamster dance policy would output as its next move, you could just copy the move, without giving up your argmaxiness property. Therefore, if you don't know what move a policy outputs, there may be an incentive to modify to it, even if the environment is fair.

The fourth assumption is conditional future-trust of the following form:

This is emphatically false in general, but there's a special condition that makes it viable, namely that the distribution at time n is guaranteed to assign probability 1 to iff . My epistemic state about this is "this seems extremely plausible, but I don't know for sure if logical inductors attain this property in the limit"

## PROOF:

We will reason from the perspective of the expected utility maximizer in the proof phase shortly before turn n-1, and show that the future distribution that it trusts endorses expected utility maximization over all other policies, and use conditional self-trust to move this conclusion back to the current maximizer to inform its choice of which policy to pick.

To begin (we aren't reasoning from the perspective of the agent yet), assume that there is some pair that achieves a strictly higher expected utility (according to ) than all other pairs where the second element is .

In particular, there is some pair that achieves a higher expected utility than .

Now, reason from the perspective of the agent, and take as a starting assumption that .

To begin with, by the definition of , and because is in because of the strategy-stealing assumption, then for whatever action and policy that the argmax policy picks,

holds, because of the definition of argmax. (Very important note: this inequality isn't necessarily true if we have exploration steps! Exploration steps prevent this tiling argument from going through. However, we require well-defined conditionals for the rest of the reasoning, which is attained by exploration steps, so there is a weird tension here.)

Now, by the argument for the equality of abstract and concrete utility, if there's some policy in particular that is getting a high conditional expected utility, the conditional expected utility of "I self-modify to whatever I'm going to self-modify to" should be the same as the expected utility of "I self-modify to this particular policy that looks really good".

Now, by assumption 1, about the equality of abstract and concrete utility,

And because by assumption, and the future probability distribution will have seen that and assigned it probability 1, (so conditioning on it doesn't change anything)

Now, we'll begin working with the other side of the inequality. By assumption 2, that the environment is fair,

Now, since the agent at time n has computed and proved that

And, by the previous argument about how the action taken is x and the future-agent knows it,

Now, because, the agent has proved that

if the agent responds to proofs of by having the expected value of the LUV conditional on being then in the argmax estimating phase,

And then, by linearity of expectation,

and then, by assumption 4 about conditional self-trust, we get that

And this contradicts the starting assumption.

Therefore, if all the assumptions hold, for all timesteps, as well as stuff like linearity of expectation under conditionals, and the argmax policy is first in the tie-breaking ordering, then if there's an expected-utility maximizer on the first step, it will keep tiling to expected utility maximizers forever afterwards. Note that this also makes the implicit assumption in the argmax that the past action sequence was done by expected utility maximizers true.

## ANALYSIS:

There are a few interesting features of this proof, even though it's ultimately unsatisfactory. One interesting feature is that the argument fails when the agent has exploration steps. This feels tied to the issue where AIXI is optimal for the starting policy distribution, while AIXI+exploration is optimal for the starting policy distribution. The proof is dissatisfactory because it heavily relies on exact equalities, which don't hold for logical inductors (rather, the relevant properties hold in the limit).

Another reason for dissatisfaction is that this is an EDT agent, and it doesn't take into account most of the more exotic decision-theory stuff. In particular, it's quite unclear what an agent of this type would do on Parfit's Hitchiker, and I would at least like to generalize it to handle cases where the future action affects the probability of some observation in the past, even if it can't quite attain the full force of UDT. There's also the issue of the conditional probabilities becoming increasingly nonsensical (and self-fulfilling prophecies showing up, such as the cosmic ray problem) if we throw out exploration steps. After all, the nice properties in the logical inductor paper only apply to sequences of actions where the probability approaches 0 slowly enough that the sum of probabilities across turns limits to infinity (such as the harmonic sequence). Finally, I'm unsure what this does on classic problems such as the procrastination paradox where you have to avoid pressing a button for as many turns as you can, but still press it eventually, and I'd be interested in analyzing that.

There are still some interesting lessons that can be taken from this, however. Namely, that the ability to steal the action of another policy is essential to show that you won't tile to it (because you can just copy the action while keeping your decision-making architecture intact, while otherwise, fair environments can have incentives to self-modify). Also, the expected utility of "I take an action" being equal to the expected utility of "I take this specific action that looks like the best so far" seems to be unusually important, because in conjunction with future-trust, it allows working around the core problem of current you not knowing what future-you will do.

Looking at the problems from Logical Inductor Tiling and Why It's Hard, we can make the following recap:

The first problem of CDT sucking can be overcome by conditioning on "I take this action"

The second problem of being bad at considering sequences of actions can be overcome by just looking one step ahead, and if you want to get certain sequences of actions that look good, you just need to strategy-steal at each step from a policy that enacts that sequence of actions, and this also has the nice property of preserving the freedom of what your action will be in the future if future-you thinks that some other action sequence is actually better.

The third problem is solved by the solution presented in the linked post.

The fourth problem of forgetting information isn't really solved head-on, because the future-trust property probably fails if the future-agent throws away information. However, it's highly suggestive that this tiling result just involves an abstract sequence of past actions, instead of unpacking them into an explicit sequence of past actions, and also this tiling result just looks one step ahead instead of considering a far-future version of itself.

The fifth problem of having strategies that react to the environment isn't really solved, but it does manage to get around the problem of not knowing what some other algorithm will eventually do. All that is really needed is that the future version of the agent knows what the other algorithm do so it can strategy-steal. This is still not a fully satisfactory solution, since there's one policy that's clearly more powerful than the rest of them, but it still seems like substantial progress has been made.

The sixth problem isn't really a fixable problem, it's just a situation where the agent has an incentive to modify its policy to something else because it violates the fairness condition.

The seventh problem about being unable to experiment with the consequences of a code rewrite is partially solved by assuming that the environment only cares about actions, and partially solved by conditioning on actions instead of "I pick this policy" (this conditional is higher probability)

Problem 8 is dodged, and is pretty much the same as the fifth problem.

Problem 9 is not addressed, because this tiling proof assumes exact equalities, and doesn't deal with the noise from logical inductors not being perfect impairing sufficiently long chains of trust. However, the fact that the proof only looks one step ahead feels promising for not having compounding noise over time breaking things.

Frontpage