Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Probabilistic Tiling (Preliminary Attempt)

4AlexMennen

2Raemon

2AlexMennen

1Diffractor

2AlexMennen

1Diffractor

2AlexMennen

1Vanessa Kosoy

New Comment

I suggest stating the result you're proving before giving the proof.

You have some unusual notation that I think makes some of this unnecessarily confusing. Instead of this underlined vs non-underlined thing, you should have different functions $ and , where the first maps action sequences to utilities, and the second maps a pair consisting of an action and a future policy to the utility of the action sequence beginning with , followed by , followed by the action sequence generated by . Your first assumption would then be stated . Your second assumption (fairness of the environment) is implicit in the type signature of the utility function . If your utility depends on something other than the action sequence, then it doesn't make sense to write it as a function of the action sequence. It's good to point out assumptions that are implicit in the formalism you're using, but by the time you identify utility as a function of action sequences, you don't need to assume fairness of the environment as an additional axiom. I do not understand what your third assumption is.

This is emphatically false in general, but there's a special condition that makes it viable, namely that the distribution at time n is guaranteed to assign probability 1 to iff . My epistemic state about this is "this seems extremely plausible, but I don't know for sure if logical inductors attain this property in the limit"

They don't. For instance, let be any true undecidable sentence. The logical inductor does not assign probability 1 to even in the limit. Your fourth assumption does not seem reasonable. Does not give you what you want?

Note that this only explicitly writes the starting code, and the code that might be modified to, not the past or future action sequence! This is important for the agent to be able to reason about this computation, despite it taking an infinite input.

I think this is exactly backwards. The property that makes spaces easy to search through and reason about is compactness, not finiteness. If is finite, then is compact, and thus easy to search through and reason about, provided the relevant functions on it are continuous. But the space of computer programs is an infinite discrete space, hence non-compact, and hard to search through and reason about, except by remembering that the purpose of selecting a program is so that it will generate an element of the nice, easily-searchable compact space of action sequences.

First: That notation seems helpful. Fairness of the environment isn't present by default, it still needs to be assumed even if the environment is purely action-determined, as you can consider an agent in the environment that is using a hardwired predictor of what the argmax agent would do. It is just a piece of the environment, and feeding a different sequence of actions into the environment as input gets a different score, so the environment is purely action-determined, but it's still unfair in the sense that the expected utility of feeding action into the function drops sharply if you condition on the argmax agent selecting action . The third condition was necessary to carry out this step. . The intuitive interpretation of the third condition is that, if you know that policy B selects action 4, then you can step from "action 4 is taken" to "policy B takes the actions it takes", and if you have a policy where you don't know what action it takes (third condition is violated), then "policy B does its thing" may have a higher expected utility than any particular action being taken, even in a fair environment that only cares about action sequences, as the hamster dance example shows.

Second: I think you misunderstood what I was claiming. I wasn't claiming that logical inductors attain the conditional future-trust property, even in the limit, for all sentences or all true sentences. What I was claiming was: The fact that is provable or disprovable in the future (in this case, is ), makes the conditional future-trust property hold (I'm fairly sure), and for statements where there isn't guaranteed feedback, the conditional future-trust property may fail. The double-expectation property that you state does not work to carry the proof through, because the proof (from the perspective of the first agent), takes as an assumption, so the "conditional on " part *has* to be outside of the future expectation, when you go back to what the first agent believes.

Third: the sense I meant for "agent is able to reason about this computation" is that the computation is not extremely long, so logical inductor traders can bet on it.

Ok, understood on the second assumption. is not a function to , but a function to the set of -valued random variables, and your assumption is that this random variable is uncorrelated with certain claims about the outputs of certain policies. The intuitive explanation of the third condition made sense; my complaint was that even with the intended interpretation at hand, the formal statement made no sense to me.

I'm pretty sure you're assuming that is resolved on day , not that it is resolved eventually.

Searching over the set of all Turing machines won't halt in a reasonably short amount of time, and in fact won't halt ever, since the set of all Turing machines is non-compact. So I don't see what you mean when you say that the computation is not extremely long.

Ah, the formal statement was something like "if the policy A isn't the argmax policy, the successor policy B must be in the policy space of the future argmax, and the action selected by policy A is computed so the relevant equality holds"

Yeah, I am assuming fast feedback that it is resolved on day .

What I meant was that the computation isn't extremely long in the sense of description length, not in the sense of computation time. Also, we aren't doing policy search over the set of all turing machines, we're doing policy search over some smaller set of policies that can be guaranteed to halt in a reasonable time (and more can be added as time goes on)

Also I'm less confident in conditional future-trust for all conditionals than I used to be, I'll try to crystallize where I think it goes wrong.

What I meant was that the computation isn't extremely long in the sense of description length, not in the sense of computation time. Also, we aren't doing policy search over the set of all turing machines, we're doing policy search over some smaller set of policies that can be guaranteed to halt in a reasonable time (and more can be added as time goes on)

Wouldn't the set of all action sequences have lower description length than some large finite set of policies? There's also the potential problem that all of the policies in the large finite set you're searching over could be quite far from optimal.

I second the suggestion to state what is being proved before proving it.

One important note is that CDT spectacularly fails this property. Namely, consider a game of matching pennies against a powerful predictor. Since the environment takes actions as input, it's possible to recompute what would have happened if a different action is plugged in. The CDT agent that keeps losing is going to learn to randomize between actions since it keeps seeing that the action it didn't take would have done better. So it eventually gets to a state where it predicts the reward from "pick heads" and "pick tails" is 0.5 (because there's a 50% chance it doesn't pick heads/tails), but it predicts the reward from "I take an action" is 0, violating this assumption.

Note, however, that ordinary Bayes-optimal RL works perfectly (assuming there are no traps in the prior or paranoia is otherwise avoided), since it would believe that taking a certain action *causes* the predictor to make the optimal response. This is similar to RL one-boxing in the repeated Newcomb's problem.

We know a decent amount about how to get tiling in proof-based environments where the objective is for the AI to achieve a goal, or to write an AI to achieve a goal, or to write an AI that writes an AI that achieves a goal, and so on, as long as the chain of deferral is finite. The probabilistic setting is far less explored. This post will outline some necessary conditions to achieve tiling of an expected utility maximizer in fully general environments. Some of these conditions are fulfilled by a logical inductor in the limit. It's entirely possible that there are serious problems with this proof, so I'm hoping that it gets a good kicking.

## DEFINITIONS:

πx is an inputless computation that outputs a bitstring, which will be interpreted as a 2-tuple of an action, and another computation, an element of A×Π .

πnx is the n'th computation in the infinite sequence of computations defined by starting at πx and taking the computation that each computation outputs. When n=1, the superscript will be omitted.

anx is the action selected by πnx.

En is some probability distribution or something that can be used to evaluate complicated computations. Like the n'th stage of a logical inductor, although we aren't necessarily assuming that it has to be a logical inductor.

underlining something refers to replacing the symbol with the thing being described. If the variable aπ is used as a way of abstractly representing some action, then aπ––– refers to the actual action chosen by the policy.

U(a1:n−1st,x,a1:∞π) is the utility function, which is some large computation of type Aω→[0,1]. So this would be the utility function output when it is fed the past actions generated by starting at πstart, the current action, and the infinite sequence of future actions is given by whatever sequence of actions is produced by the self-modification chain starting with π . Note that this only explicitly writes the starting code, and the code that might be modified to, not the past or future action sequence! This is important for the agent to be able to reason about this computation, despite it taking an infinite input.

Given some sequence of probability distributions, and some sequence of sets of policies, let π∗n be the computation

argmaxx,π∈A×ΠnEn(U(a∗1:n−1,x,a1:∞π)|a∗n=x)

This is basically maximizing expected utility, conditional on the agent itself taking some action, with the past sequence of actions given by an expected utility maximizer on all previous steps.

Actually, in the proof, we will be assuming a 2-step process where some essential things are proved, and then argmax is done. This is done so that the future probability distribution respects some important information. The three important bits of computation are that the action of the previous computation must be computed/observed, that all the policies in Πn−1 must be emulated to know what policy and action they produce, with all of the produced policies added to Πn, and the proof in the next section must be done to ensure that the conditional probabilities work out appropriately.

Now, time for the assumptions.

The first assumption is that abstract expected utility equals concrete expected utility.

En(U(a∗1:n−1,a1:∞∗n))=En(U(a∗1:n−1,a∗n––––,a2:∞∗n)|a∗n=a∗n––––)

This essentially says that if you're going to actually take action 4, then the expected utility of "I take an action" had better be the same as the expected utility of action 4. This is probably fulfilled by a logical inductor, as it would be possible to buy a conditional share/regular share in the lower-priced utility computation, and sell a conditional share/regular share in the higher-priced utility computation, and when the action is actually taken, the provable equivalence between the action actually taken and the action in the abstract means that the prices converge to each other, and the theorems in the logical induction probably enable profiting off the predictable future equality.

One important note is that CDT spectacularly fails this property. Namely, consider a game of matching pennies against a powerful predictor. Since the environment takes actions as input, it's possible to recompute what would have happened if a different action is plugged in. The CDT agent that keeps losing is going to learn to randomize between actions since it keeps seeing that the action it didn't take would have done better. So it eventually gets to a state where it predicts the reward from "pick heads" and "pick tails" is 0.5 (because there's a 50% chance it doesn't pick heads/tails), but it predicts the reward from "I take an action" is 0, violating this assumption. EDT, however, will predict that picking heads, picking tails, and picking some action all have utility 0, thus fufilling this condition.

The second assumption is that the environment is fair. Namely:

∀π:En(U(a∗1:n−1,aπ–––,a2:∞π))=En(U(a∗1:n−1,aπ–––,a2:∞π)|a∗n=aπ–––)

This says something like "after finding out that you take a particular action and policy, that should be the same expected utility as some other policy π picking the same action and successor policy", or, more snappily, "finding out that you take a particular action shouldn't be bad news for the expected utility of a particular sequence of actions". The matching pennies predictor that is just penalizing the sequence of actions selected by an argmaxer is unfair in this sense for the argmaxer. This condition (or something like it) seems necessary, in the sense that you can come up with examples where a failure of this condition incentivizes modifying yourself to a different way of selecting actions.

The third assumption (which has already been stated), is that strategy stealing works. More specifically, if π∈Πn−1∧π≠π∗n−1, then π2∈Πn, and a1 is computed at time n. This is another necessary condition. Consider a fair environment that rewards you when you do the hamster dance. You are pretty sure you remember the steps, but aren't completely sure. There's another policy that will make you correctly perform the motions. The expected utility of "move your arm left" is different than the expected utility of "hamster dance policy does its move". However, if you were able to compute what the hamster dance policy would output as its next move, you could just copy the move, without giving up your argmaxiness property. Therefore, if you don't know what move a policy outputs, there may be an incentive to modify to it, even if the environment is fair.

The fourth assumption is conditional future-trust of the following form:

En−1(En(U)|ϕ)=En−1(U|ϕ)

This is emphatically false in general, but there's a special condition that makes it viable, namely that the distribution at time n is guaranteed to assign probability 1 to ϕ iff ϕ . My epistemic state about this is "this seems extremely plausible, but I don't know for sure if logical inductors attain this property in the limit"

PROOF:We will reason from the perspective of the expected utility maximizer in the proof phase shortly before turn n-1, and show that the future distribution that it trusts endorses expected utility maximization over all other policies, and use conditional self-trust to move this conclusion back to the current maximizer to inform its choice of which policy to pick.

To begin (we aren't reasoning from the perspective of the agent yet), assume that there is some pair (x,π?) that achieves a strictly higher expected utility (according to En−1) than all other pairs where the second element is π∗n.

In particular, there is some pair (x,π?) that achieves a higher expected utility than (x,π∗n) .

Now, reason from the perspective of the agent, and take as a starting assumption that a∗n−1=x.

To begin with, by the definition of π∗n, and because (π2?,a1?) is in Πn×A because of the strategy-stealing assumption, then for whatever action and policy that the argmax policy picks,

En(U(a∗1:n−1,a∗n––––,a1:∞π––)|a∗n=a∗n––––)≥En(U(a∗1:n−1,a?–––,a2:∞?)|a∗n=a?–––)

holds, because of the definition of argmax. (Very important note: this inequality isn't necessarily true if we have exploration steps! Exploration steps prevent this tiling argument from going through. However, we require well-defined conditionals for the rest of the reasoning, which is attained by exploration steps, so there is a weird tension here.)

Now, by the argument for the equality of abstract and concrete utility, if there's some policy in particular that is getting a high conditional expected utility, the conditional expected utility of "I self-modify to whatever I'm going to self-modify to" should be the same as the expected utility of "I self-modify to this particular policy that looks really good".

En(U(a∗1:n−1,a∗n––––,a1:∞π––)|a∗n=a∗n––––)=En(U(a∗1:n−1,a∗n––––,a2:∞∗n)|a∗n=a∗n––––)

Now, by assumption 1, about the equality of abstract and concrete utility,

En(U(a∗1:n−1,a∗n––––,a2:∞∗n)|a∗n=a∗n––––)=En(U(a∗1:n−1,a1:∞∗n))

And because π∗n−1=x by assumption, and the future probability distribution will have seen that and assigned it probability 1, (so conditioning on it doesn't change anything)

En(U(a∗1:n−1,a1:∞∗n))=En(U(a∗1:n−2,x,a1:∞∗n))

Now, we'll begin working with the other side of the inequality. By assumption 2, that the environment is fair,

En(U(a∗1:n−1,a?–––,a2:∞?)|a∗n=a?–––)=En(U(a∗1:n−1,a?–––,a2:∞?))

Now, since the agent at time n has computed and proved that a?=a?–––

En(U(a∗1:n−1,a?–––,a2:∞?))=En(U(a∗1:n−1,a1:∞?))

And, by the previous argument about how the action taken is x and the future-agent knows it,

En(U(a∗1:n−1,a1:∞?))=En(U(a∗1:n−2,x,a1:∞?))

Now, because, the agent has proved that

a∗n−1=x→En(U(a∗1:n−2,x,a1:∞∗n))≥En(U(a∗1:n−2,x,a1:∞?))

if the agent responds to proofs of ϕ→X≥c by having the expected value of the LUV X conditional on ϕ being ≥c, then in the argmax estimating phase,

En−1(En(U(a∗1:n−2,x,a1:∞∗n))−En(U(a∗1:n−2,x,a1:∞?))|a∗n−1=x)≥0

And then, by linearity of expectation,

En−1(En(U(a∗1:n−2,x,a1:∞∗n))|a∗n−1=x)≥En−1(En(U(a∗1:n−2,x,a1:∞?))|a∗n−1=x)

and then, by assumption 4 about conditional self-trust, we get that

En−1(U(a∗1:n−2,x,a1:∞∗n)|a∗n−1=x)≥En−1(U(a∗1:n−2,x,a1:∞?)|a∗n−1=x)

And this contradicts the starting assumption. □

Therefore, if all the assumptions hold, for all timesteps, as well as stuff like linearity of expectation under conditionals, and the argmax policy is first in the tie-breaking ordering, then if there's an expected-utility maximizer on the first step, it will keep tiling to expected utility maximizers forever afterwards. Note that this also makes the implicit assumption in the argmax that the past action sequence was done by expected utility maximizers true.

## ANALYSIS:

There are a few interesting features of this proof, even though it's ultimately unsatisfactory. One interesting feature is that the argument fails when the agent has exploration steps. This feels tied to the issue where AIXI is optimal for the starting policy distribution, while AIXI+exploration is not optimal for the starting policy distribution. The proof is dissatisfactory because it heavily relies on exact equalities, which don't hold for logical inductors (rather, the relevant properties hold in the limit).

Another reason for dissatisfaction is that this is an EDT agent, and it doesn't take into account most of the more exotic decision-theory stuff. In particular, it's quite unclear what an agent of this type would do on Parfit's Hitchiker, and I would at least like to generalize it to handle cases where the future action affects the probability of some observation in the past, even if it can't quite attain the full force of UDT. There's also the issue of the conditional probabilities becoming increasingly nonsensical (and self-fulfilling prophecies showing up, such as the cosmic ray problem) if we throw out exploration steps. After all, the nice properties in the logical inductor paper only apply to sequences of actions where the probability approaches 0 slowly enough that the sum of probabilities across turns limits to infinity (such as the harmonic sequence). Finally, I'm unsure what this does on classic problems such as the procrastination paradox where you have to avoid pressing a button for as many turns as you can, but still press it eventually, and I'd be interested in analyzing that.

There are still some interesting lessons that can be taken from this, however. Namely, that the ability to steal the action of another policy is essential to show that you won't tile to it (because you can just copy the action while keeping your decision-making architecture intact, while otherwise, fair environments can have incentives to self-modify). Also, the expected utility of "I take an action" being equal to the expected utility of "I take this specific action that looks like the best so far" seems to be unusually important, because in conjunction with future-trust, it allows working around the core problem of current you not knowing what future-you will do.

Looking at the problems from Logical Inductor Tiling and Why It's Hard, we can make the following recap:

The first problem of CDT sucking can be overcome by conditioning on "I take this action"

The second problem of being bad at considering sequences of actions can be overcome by just looking one step ahead, and if you want to get certain sequences of actions that look good, you just need to strategy-steal at each step from a policy that enacts that sequence of actions, and this also has the nice property of preserving the freedom of what your action will be in the future if future-you thinks that some other action sequence is actually better.

The third problem is solved by the solution presented in the linked post.

The fourth problem of forgetting information isn't really solved head-on, because the future-trust property probably fails if the future-agent throws away information. However, it's highly suggestive that this tiling result just involves an abstract sequence of past actions, instead of unpacking them into an explicit sequence of past actions, and also this tiling result just looks one step ahead instead of considering a far-future version of itself.

The fifth problem of having strategies that react to the environment isn't really solved, but it does manage to get around the problem of not knowing what some other algorithm will eventually do. All that is really needed is that the future version of the agent knows what the other algorithm do so it can strategy-steal. This is still not a fully satisfactory solution, since there's one policy that's clearly more powerful than the rest of them, but it still seems like substantial progress has been made.

The sixth problem isn't really a fixable problem, it's just a situation where the agent has an incentive to modify its policy to something else because it violates the fairness condition.

The seventh problem about being unable to experiment with the consequences of a code rewrite is partially solved by assuming that the environment only cares about actions, and partially solved by conditioning on actions instead of "I pick this policy" (this conditional is higher probability)

Problem 8 is dodged, and is pretty much the same as the fifth problem.

Problem 9 is not addressed, because this tiling proof assumes exact equalities, and doesn't deal with the noise from logical inductors not being perfect impairing sufficiently long chains of trust. However, the fact that the proof only looks one step ahead feels promising for not having compounding noise over time breaking things.