Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Policy Selection Solves Most Problems

2paulfchristiano

2abramdemski

0Stuart_Armstrong

0abramdemski

New Comment

4 comments, sorted by Click to highlight new comments since: Today at 4:21 PM

Without reading closely, this seems very close to UDT2. Is there a problem that this gets right which UDT2 gets wrong (or for which there is ambiguity about the specification of UDT2?)

Without thinking too carefully, I don't believe the troll bridge argument. We have to be super careful about "sufficiently large," and about Lob's theorem. To see whether the proof goes through, it seems instructive to consider the case where a trader with 90% of the initial mass really wants to cross the bridge. What happens when they try?

The differences between this and UDT2:

- This is something we can define precisely, whereas UDT2 isn't.
- Rather than being totally updateless, this is just mostly updateless, with the parameter determining how updateless it is.

I don't think there's a problem this gets right which we'd expect UDT2 to get wrong.

If we're using the version of logical induction where the belief jumps to 100% as soon as something gets proved, then a weighty trader who believes crossing the bridge is good will just get knocked out immediately if the theorem prover starts proving that crossing is bad (which helps that step inside the Löbian proof go through). (I'd be surprised if the analysis turns out much different for the kind of LI which merely rapidly comes to believe things which get proved, *but* I can see how that distinction might block the proof.) But certainly it would be good to check this more thoroughly.

policy selection converges to giving Omega the money so long as the difficulty of computing the coin exceeds the power of the market at time.

Would it be sensible to just look for muggings (and ASPs) at the very beginning of the process, and then decide immediately what to do as soon as one is detected?

Come to think of that, precommitting to ignoring knowledge about the result of the coin seems to be the best strategy here; does this cash out into anything useful in this formalism?

Looking "at the very beginning" won't work -- the beliefs of the initial state of the logical inductor won't be good enough to sensibly detect these things and decide what to do about them.

While ignoring the coin is OK as special-case reasoning, I don't think everything falls nicely into the bucket of "information you want to ignore" vs "information you want to update on". The more general concept which captures both is to ask "how do I want to react to thin information, in terms of my action?" -- which is of course the idea of policy selection.

It seems like logically updateless reasoning is what we would want in order to solve many decision-theory problems. I show that several of the problems which seem to require updateless reasoning can instead be solved by selecting a policy with a logical inductor that's run a small amount of time. The policy specifies how to make use of knowledge from a logical inductor which is run longer. This addresses the difficulties which seem to block logically updateless decision theory in a fairly direct manner. On the other hand, it doesn't seem to hold much promise for the kind of insights which we would want from a real solution.

## LI Policy Selection

Rather than running a logical inductor all the way to Pn and making a decision via the expected utilities implied by that distribution, we want to first run it to Pf(n) for f(n)<<n, and use

thatdistribution to make a decision about how to utilize Pn to determine which action to take. (For example, we could choose f(n)=log(n).)For simplicity, I will assume logical induction over exponential-time traders rather than the usual poly-time assumption. I also assume that numbers can be represented efficiently in the language of the logical inductor, IE, the written length is logarithmic in the number.

We can select from a set Pn of policies, each of which is a program taking the market state Pn and outputting an action. With m=f(n) in the following:

bestn:=argmaxp∈PnEm(U(x)|f(x)=m∧policy(x)=p)

policyn:=if explorationn then random(Pn) else bestn

A(n):=policyn(Pn)

The utility U(n) is a function of n because it points to the universe where A(n) lives. The argmax is taken over U(x), where x is uncertain, because n may be too large for Pm to have good beliefs about; the sentences involving n can be so large that they aren't being traded on the market yet. However, since we are using a market which can't be exploited by exp-time traders, Pm

musthave good beliefs about sentences whose length is asymptotically a multiple of the length of m. (The uncertainty should not harm us much, since the policy can look at sentences involving the true n and act accordingly.)The values of explorationn and randomn are made pseudorandom relative to the market state at f(n) by diagonalizing against it, like the exploration defined here. The subscript n is on Pn because we need to grow the policy set, to ensure that the optimal policy is eventually included. This must be done somewhat carefully. If Pn grows too quickly, then it could be that at

anyn the majority of policies have very inaccurate expected utility estimates, even though it converges to an accurate estimate for each individual policy. This is easily prevented, however. If Pn did indeed grow too quickly, then it could be predicted that the expected value of the strategy chosen by the argmax would be systematically lower than its market estimate. Suppose we knew it was systematically overestimated by δ. Although the traders at f(n) might not be able to predict precisely which policy will be chosen by the argmax (due to imprecision in the continuous-function trading strategies), a trader can find the set of policies whose expected values are within some ϵ of the maximum. If ϵ<δ, that trader can make money by betting that all of the top policies are overestimated.So, we just need to grow Pn slowly enough that a trader can implement such a strategy in poly time (f(poly(n)), that is, since we're doing this at the early market time), with shrinking ϵ as n gets larger. The details of what speed this implies seem a bit tedious to work out, since I'd have to dig into the formalism for expressing trading strategies and the runtimes involved. But, it seems quite likely that there's some rate of growth at which this works.

This is not

quitejust a reformulation of son-of-X. It's true that if we're letting the early market state choose any program to run, it doesn't matter much that we feed Pn to that program -- it could recreate Pn itself, or any number of alternate belief structures which might be preferred. However, most of the interesting policies in cases we've thought about seem to in fact be fairly restricted in runtime themselves, letting Pn do most of the thinking. For example, implementing the LIDT policy of taking the max-expected-utility action just requires comparing the expected value of each action. Implementing fairbot in prisoner's dilemma just requires looking at the probability that the other player cooperates, and the probability of a self-diagonalization sentence for pseudorandomness. And so on. (Although it makes sense to put runtime restrictions on Pn, I won't specify any here, since it wouldn't really help achieve any interesting properties. Ideally we would want some sort ofstructureto the good policies to render them legible, rather than just blindly choosing programs.)On the positive side, this sort of policy selection is similar to the idea of observation-counterfactuals mentioned in the happy dance problem; the market state Pn is the observation. It's also an implementation of predictable exploration, so there's some intuitive grounds to think that the counterfactuals will make more sense than those of regular epsilon-exploration, and be more useful for game theory (at least, I think so).

I also think there's a sense in which this approach recovers the hopes of ADT, providing a way to get Agent Simulates Predictor right without getting game theory horribly wrong. In effect, the approach here uses the early-stage logical inductor as the embedding function; its state of ignorance allows it to believe in the correlations needed to see the implications of a policy choice, unlike LIDT. Certainly it is a better approach to solving the problems average decision theory was trying to solve, since it produces "updateless-like" behavior without an arbitrary placement of the decision problem in a sequence of decision problems.

On the other hand, policy selection is just a dirty trick which doesn't provide any insight and which in practice would depend on careful choice of the function f(n), so that the policy selection is done after beliefs have stabilized to a "sensible" degree but before the critical information which we'd like to behave updatelessly about arrives. No principled solution to this problem is offered here; only asymptotic results. I will be speaking of what strategy policy selection "will converge to"; but note, of course, that we're slowing down convergence by the function f(n).

Nonetheless, a dirty hack that cleanly solves more problems at once that any other dirty hack we know of is worth recognizing as such. So, let's get on with the show.

## 5 and 10

This one barely merits mentioning, but, it solves the 5 and 10 problem: if there is an option where it gets utility 5, and an option where it gets utility 10, converges to taking the utility 10 option. This is because its exploration ensures that it eventually has correct beliefs about the result of using different policies; it can't think taking the 10 gives -10 utility, never take the 10, and then never learn better.

## Counterfactual Mugging (with a logical coin)

This one is also pretty straightforward. In counterfactual mugging with a logical coin, policy selection converges to giving Omega the money so long as the difficulty of computing the coin exceeds the power of the market at f(n) time. This is because, before we know which way the coin lands, the expected value of giving Omega the money is higher than that of not doing so.

One might object that the cases where the coin is computable at f(n) time aren't solved by this approach. However, you can always choose a slower-growing f to include more cases. The trade-off is that slower-growing f will take longer to converge to reasonable policies. As far as I'm concerned, this is a real trade-off, reflecting the fact that to accept counterfactual mugging for easier-to-compute logical coins, you have to accept crazier possibilities. The problem with policy selection is more that it provides no guidance with respect to choice of f.

## Agent Simulates Predictor

In proof-based DT, ASP is set up by giving the agent more processing power than the predictor, but giving the predictor more proof-theoretic strength than the agent. This allows the predictor to prove what the agent will do, while allowing the agent to find the output of the predictor by simulating it accurately.

In the realm of logical induction, the two dimensions collapse into one, since a logical inductor with enough processing power will learn to trust the outputs of stronger proof systems. However, we can still set up a scenario similar to ASP. The agent has time to compute Pn, while the predictor only has time to run pg(n), for g(n)<n. The predictor then fills the box based on its expectation that the agent will one-box. It could have some cutoff, filling the box if the agent one-boxes with 95% probability or greater, say. Or, it could fill the box with money

proportionalto its belief that the agent will one-box.Although the predictor may not be exactly accurate, it will be a good predictor in an asymptotic sense: if the agent converges to either one-boxing or two-boxing, the predictor eventually predicts this correctly all the time. On the other hand, as long as g(n) is small enough, the agent can see exactly what the predictor will do.

The idea, of course, is that policy selection does well here as long as f(n) is sufficiently small compared to g(n). If the predictor can simulate the computation which selects the policy, then it will respond to that policy. So, the policy selection can get the best score by predictably one-boxing. If the predictor uses a cutoff like the 95% mentioned earlier, the policy selection can cooperate with a probability just slightly higher than that. Or, if the predictor fills the box with money proportional to its trust in the agent, the agent will select the policy which one-boxes all the way. In other words, policy selection seems to just do the right thing.

## Planting Flowers

Suppose that the agent is created for the purpose of planting flowers, but is given a utility function in some arbitrary way. However, the creator examined the code very carefully, and only ran the agent if it would predictably plant flowers. Planting flowers is costly, but not more costly than never having existed. Should the agent plant flowers?

This is very similar to the previous problem, except the agent is reasoning about whether it exists, rather than whether a box has some money in it. f(n) must be chosen to be small enough that the agent doesn't know yet whether it exists, at policy-selection time. Whether this is feasible depends on the problem statement. If the entire problem is packed into U(n), like we've been doing so far, then the agent's existence isn't implied; U(n) must also evaluate the utility of worlds where the agent isn't created. Then, the problem seems just like Agent Simulates Predictor: it can be solved by choosing f(n) slow enough that the creator's decision is not known at policy-selection time. On the other hand, it might seem very reasonable to put information in the deductive state from which the agent can infer its own existence, such as axioms about itself, or sense-data. This would prevent policy selection from solving the policy so easily, since the policy must be selected before self-existence can be inferred.

## XOR Blackmail

Setting up XOR blackmail is a bit tricky, especially the case with a perfect predictor (or any predictor who knows the agent better than the agent knows itself, really).

In order for the agent to know whether it got a letter, the letter has to be written directly to the deductive state of Pn. But, also, the predictor needs to know what the agent's response to that is in order to send the letter or not. So, we have a simulated agent A′(n) whose deductive state includes the letter, whether the letter is really sent or not. The response of the simulated agent is observed by the predictor, as is the information concerning whether the rare disaster has occurred; then, the predictor sends the letter in the case that the disaster will occur xor the agent will pay up in response to the letter.

If the policy selection chooses not to respond to the letter, the predictor would expect this, and not send the letter except in the case of disaster. On the other hand, if the policy does respond to the letter, then the letter is sent except in case of disaster. Either way, the probability of disaster does not change. So, policy selection will converge to choosing a policy which doesn't respond to the letter.

## Prisoner's Dilemma

I don't know what space of behaviors is really possible in this case. But, a plausible outcome is the NicerBot strategy: cooperate with probability slightly higher than your Pn belief that the other player will cooperate. Cooperating with the probability that they do incentivises cooperation, and adding a little probability of cooperation on top of this ensures convergence to cooperation rather than other fixed-points when both players play NicerBot. (This isn't actually an equilibrium; if one player plays NicerBot, the other is incentivised to cooperate with slightly

lessprobability. But, it's plausible that it bounces around near NicerBot.) The randomness for this can come from self-diagonalizing sentences, like the randomness for exploration.In general, there could be strategies which access a lot of other sentences in the market state, rather than just the ones relevant to the game. NicerBot is a policy which is a function of the other player's probability of cooperation. I could

alsohave a policy which is a function of some features of the other player'spolicy. There's some resemblance to the meta-threat hierarchy, though it likely differs greatly in detail and would need modifications (such as some kind of multi-stage meta-policy selection) in order to mimic meta-threat.There are a lot of other logical propositions which the agent's strategy can utilize, too, beyond anything about the other player's strategy or even anything going on in the game. This makes it seem like the situation will resemble correlated equilibria more than Nash equilibria.

The situation will also likely shift dramatically if the agents have different f(n). Overall, I'm quite curious what the game theory between policy-selection agents will look like. (Granted, we don't even know very well what the game theory between

LIDTagents looks like.)## Troll Bridge

This deserves its own more detailed write-up. I have not checked the proof sketch in detail. However, I'm fairly sure policy selection does not solve Troll Bridge. (Ideas in this section are mainly due to Sam. Oversimplifications in this section are mainly due to me.)

Suppose that the agent wants to cross a bridge, but there is a troll who blows up the bridge exactly when the exploration clause in the agent is active. In our case, explorationn. Obviously, this poses a problem for learning the right counterfactuals. The agent could easily start (due to a poor choice of prior) thinking that it is a bad idea to cross the bridge. Early policies would refuse to cross the bridge except on exploration steps. Experience would seem to reinforce the bad prior, ensuring that the agent always thinks crossing the bridge is a bad idea.

However, the situation is actually much worse than this. Even with a "good" prior which initially thinks it can safely cross the bridge on non-exploration rounds, it seems we will converge to refusing to cross the bridge.

Let A(n)=1 indicate crossing the bridge, and A(n)=0 indicate not crossing. Let □nϕ mean that the logical inductor's deductive state has a proof of ϕ by stage n. U(n) is defined as follows:

U(n):=if A(n)=0 then 5 else if explorationn then 0 else 10

Here's the proof sketch, using indentation to notate reasoning under a supposition, and de-indentation to come back out:

So, even if the agent starts with a "good" distribution on traders which predicts than crossing the bridge on non-exploration rounds is the best policy, at some point the agent finds the above proof. From then on, it avoids the bridge.

It might be tempting to say that policy selection improves the situation somewhat; small enough f(n) can prevent the proof from going through, so isn't it just a tradeoff of slow-growing f solving more decision problems asymptotically but having worse beliefs at small runtimes, as with the decision problems considered earlier? No. No matter what f you choose, so long as it does keep growing, it'll eventually start seeing the above proof.

## Reflective Consistency

In order to talk about whether a policy-selection agent would try to replace itself with a different kind of agent, we have to decide how to represent sequential decisions. The sequence of decision problems U(n) is

notsupposed to represent a sequential decision problem: the agent A(n) is only supposed to care about optimizing the single instance U(n), not any kind of aggregation over the sequence.We could select a single policy for a whole sequential decision problem. Rather than the agent A(n) determining a single action, we introduce A(n,i), and also a sequence of observations On,i, so A(n,i)=policyn(On,i). On,i could include a logical inductor state which has been run for ni steps, plus sensory observations (although we could and probably should fold the sensory observations into the inductive state). Even this case is difficult to analyse. We need to set aside the question of self-modification for simple code-optimization reasons, and situations where Omega offers a large reward for self-modification. Could there be cases where the policy initially selected would want to replace itself with a different policy? It seems likely that some kind of temporal consistency result could be proved, because any policy which would want to replace itself with a similarly fast-running policy could instead look at the logical inductor's beliefs about what that policy would do, and do that. (Note that this result would likely need to make use of the resource bound on policies, which I mentioned in the beginning as being intuitively appealing but not important to anything I show in this post.)

This might not be very satisfying, though, because it may not represent a solution to the problem of allowing a system to think longer about what to do while remaining reflectively consistent. The policy is locked-in. The policy

maybe clever enough to think about what other policies might do better and strategy-steal from them, but it might not -- one would want an analysis of how it decides to do that, to see that it's reasonable. It would be more compelling to have something that could keep thinking longer about what policy to use, and still have some kind of reflective consistency. This seems unlikely to fall out of the approach.## Counterfactuals

(Edited to add this section.)

The problem of counterfactual reasoning seems almost impossible to solve, because our intuitions about "what would happen if only the agent did X" largely come from what

wecan see standingoutside ofthe decision problem. From our perspective, a decision problem actuallydoeshave a functional form: we can substitute in different agents to see what happens. We can design a solution. From an agent's perspective, situated inside a decision problem, itdoes nothave a functional form; things are just the way they are, so it makes significantly less sense to talk about how things could be if only the agent acted differently.From that perspective, CDT-like proposals to solve decision theory problems with revised notions of counterfactual feel fake: using the word "counterfactual" gives you enough freedom to claim just about anything, so that you can pack what

youthink the agent should be thinking into the counterfactuals, and have some false hope that one day you'll figure out how to formally specify counterfactuals which have all the properties you claim.The policy-selection perspective is that EDT-like conditioning is all you'll have, but we

canin fact get the "functional form" of the problem mostly as we want it by backing up to an early state and reasoning about what to do from there. From this perspective, there is not one correct "functional form" to be discovered and embodied in the right theory of counterfactuals. Rather, there's a series of functional forms which you get by viewing the problem from different levels of processing power. Unfortunately, none of these have a distinguishing character of "correctness". Nonetheless, certain stagesdoseem to have most of the properties we would intuitively want of a notion of counterfactual.## Conclusion

Policy selection seems to solve the decision problems which call for updateless reasoning in a direct way, so long as f(n) is chosen well. However, it does fail trickier problems like Troll Bridge, and it doesn't seem to provide as much insight into reflective consistency as we would want.

It's still worth thinking more about how much reflective consistency can be gotten -- the story is far from clear to me at the moment, and it could be that nice results actually can come from it.

I'm also interested in thinking about what the notion of "carefully chosen f(n)" actually means. Can anything formal be said about where the sweet spot is? Is there a principled way to avoid the chaos of a too-early market state while also steering clear of knowledge we need to be updateless toward? It would be surprising if there were, but it would also be surprising if there were

nothingformal to say about it.I'm more excited about approaches which somehow break the concepts used here.