# Ω 3

Personal Blog
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Scott describes two obstacles for logical inductor decision theory: reasoning sanely about untaken actions, and getting a working notion of logical updatelessness. Here, I consider the question: if we could solve the first problem, could we solve the second? I provide a somewhat plausible substitute for updatelessness which relies on taking a logical counterfactual.

## Motivation

Suppose we have a notion of logical counterfactual, , represented by a hypothetical axiomatic system which is also rich enough to talk about computations. We trust ; specifically, if it is a theorem of that an agent would have achieved a particular utility had it followed an alternative strategy, we put as much or more faith in this than our own intuition about what could have been. The 5-and-10 problem is solved by decree: if an agent reasoning with thinks taking the $10 would be worse, then thinks so because it's true. I take it that this does not solve problems which involve updateless-like reasoning about logical uncertainty. In counterfactual mugging with a logical coin, the agent can see that counterfactually giving Omega$100 doesn't get it anything in the real mathematical universe; it doesn't seem like the correct hypothetical would be anything like "If I give Omega $100 having seen that the 1000th digit of pi is even, then the digit might turn out to be odd rather than even, so I might get$10000.". This is the realm of updateless reasoning rather than counterfactuals; but, we have no satisfactory account of logical updatelessness.

So, we can't directly say that it would be better to give Omega $100. However, in a sequence of scenarios using consecutive digits of pi, we can see that the strategy which always gives the$100 when Omega asks for it will do better in the average case than the strategy which refuses; half the time, we get $10K, if we're the sort of person who gives up$100 the other half of the time.

## Average DT

I'll represent a decision problem as a computation which takes no arguments and yields a rational number; The evaluation of will be written . A sequence of decision problems, , can be defined by a function from natural numbers to computations. The average utility up to is . An agent will be a function of . We'll also consider the utility which might have been achieved had agent acted like agent , which is the value such that . ( is supposed to be a directional assignment from the language of , which says that outputs what normally would for all . Hopefully can define such a concept.) I make the assumption that is computable.

Definition: Average DT. Take a sequence of decision problems , an accuracy , and a finite set of alternative strategies . The average-DT agent computes the strategy which selected; call it . It then computes for all . If for any , then selects to determine the output of this round; . Otherwise, .

The idea is that a good way to achieve high utility for large is to choose outputs according to the strategy which does best on average in the limit. We can eventually do that if we keep switching to whatever strategy currently does best on average (but with "stickiness" to prevent us from oscillating strategies forever). This only works under some conditions, though.

Definition: Independent. A sequence of decision problems is independent if implies . That is, hypothetically changing the output of the agent on has no effect on the utility at .

Definition: Tail-dependent. Call a pair of strategies convergent if there exists an such that for all , . A sequence of decision problems is tail-dependent if, for a convergent pair , if exists, then also exists, and .

Tail-dependence is a more general condition than independence. Whereas independence says that results at only depend on actions at , tail-dependence allows to depend on , , and so on, as long as the effects of any initial mistakes can eventually be washed out by following an optimal strategy in the limit. Independence is only defined here because I think of it as the more typical case. The subscript does not represent time; each instance is thought of as a decision problem of its own, and the aggregation of the utility which average-DT does is only in service of getting high utility on individual instances. If there is some dependence between instances, it is to be thought of as interaction with a separate version of oneself.

Theorem. For an average-decision-theory agent , if is tail-dependent, and the limits exist for all in the set of strategies, then the limit exists, and furthermore for all .

Proof. In the tail-dependent case, since the limits of the exist, eventually will vary by less than for all . At that point, the agent will never switch strategies again, because the running average for the strategy currently in use, , will never fall more than below the running average for any other. This means is convergent with . Since exists, does as well, and the two are equal.

(If we knew that only one of the is optimal, we could allow , since the ordering of the running-average utility of strategies can only oscillate forever when the limits are the same.)

Note that the definition of \$A(n) computes rather than . This was because, at least in the independent case, will typically involve itself; it seems potentially unrealistic to assume that this case can be computed by (although the fact that we're only evaluating it under the hypothetical case that we act like may make this feasible). It also costs us nothing, in terms of the optimality guarantee. However, it may be that even is too intractable for to realistically evaluate. We can solve this in theory by using logical induction to approximate for policy selection. No matter how slowly the become computable, , and the same optimality proof applies.

## Discussion

This is very similar in flavor to asymptotic decision theory. I'll use the abbreviations AsDT and AvDT. Like AsDT, AvDT has an accuracy parameter,has a finite set of strategies which we choose from, treats the method of taking counterfactuals as given, deals with infinite sequences of decisions, and cares only about properties in the limit. Unlike AsDT, I choose the highest average reward over all rounds so far rather than the highest expected reward on this round. AsDT is also an epsilon-exploration approach, whereas AvDT relies on to give the correct expected utility alternate actions would have yielded.

These differences make AvDT behave in a more updateless manner than AsDT. In counterfactual mugging with a logical coin, AsDT always uses a logical inductor's best-estimate of the utility it would get right now, so it sees the coin as already determined, and sees no benefit from giving Omega money in the cases where Omega asks for money. AvDT uses logical induction (or simple evaluation) to determine the average utility which a policy gets, so it sees giving Omega money as beneficial.

I doubt that an approach to learning analogous to AsDT's optimistic learning of counterfactuals will work very well here, but that is mostly because I don't think it works very well for AsDT in the first place. (It does not learn the counterfactual structure we would like it to when it plays prisoner's dilemma against a copy of itself, for example.)

A big limitation of this approach is the need to represent decision problems as part of some infinite sequence. Different infinite sequences will yield different 'optimal' approaches for the same decision problem. For example, AvDT's strategy for counterfactual mugging would be much different if were the subsequence of cases where the digit of pi is even. On the one hand, this is like saying that Omega only goes through the whole procedure in cases where it will make Omega money. But on the other hand, we're facing the exact same situation in the world -- it's just the representation as part of a sequence of problems which has changed. It seems the sequence plays a role similar to the prior in UDT, telling us what spread of possibilities to prepare for.

Perhaps there are some conditions under which we can guarantee good asymptotic behavior for not only one sequence of decision problems, but many. If none of the sequences are subsequences of each other, then it seems possible to guarantee eventual optimality for all of them. However, that's a very limiting condition, as illustrated by the counterfactual mugging example in the previous paragraph; we need to know which sequence to pick when one is a subsequence of the other.

But my uneasiness here may mainly be due to the fact that I am accustomed to an agent requiring an arbitrary prior, but unaccustomed to an agent requiring an arbitrary problem sequence. As with the problem of selecting a prior, it seems a good rule of thumb will be to pick the broader one, in absence of contrary information. I'm not holding my breath for a good notion of "universal problem sequence" here, but it would be nice if there could be such a concept.

The desirability of the approach will also depend on whether the dependence on and the restriction to finite sets of strategies can be eliminated. I'm skeptical it will be possible to get rid of either one.

Oh, and another problem: the behavior breaks down quickly outside the optimality conditions. If decision of even one early instance matters forever, the agent will ignore this in its choices. This seems problematic.

Despite all these issues, my current intuition is that this approach has "the right flavor" in some sense: an optimality condition showing that we converge to the best strategy in the limit seems about as good a property as we can hope for in a decision theory compatible with logical uncertainty.

# Ω 3

Pingbacks
New Comment

In counterfactual mugging with a logical coin, AsDT always uses a logical inductor’s best-estimate of the utility it would get right now, so it sees the coin as already determined, and sees no benefit from giving Omega money in the cases where Omega asks for money.

The way I would think about what's going on is that if the coin is already known at the time that the expectations are evaluated, then the problem isn't convergent in the sense of AsDT. The agent that pays up whenever asked has a constant action, but it doesn't receive a constant expected utility. You can think of the averaging as introducing artificial logical uncertainty to make more things convergent, which is why it's more updateless. (My understanding is that this is pretty close to how you're thinking of it already.)

I think AsDT has a limited notion of convergent problem. It can only handle situations where the optimal strategy is to make the same move each time. Tail-dependence opens this up, partly by looking at the limit of average payoff rather than the limit of raw payoff. This allows us to deal with problems where the optimal strategy is complicated (and even somewhat dependent on what's done in earlier instances in the sequence).

I wasn't thinking of it as introducing artificial logical uncertainty, but I can see it that way.

Yeah, I like tail dependence.

There's this question of whether for logical uncertainty we should think of it more as trying to "un-update" from a more logically informed perspective rather than trying to use some logical prior that exists at the beginning of time. Maybe you've heard such ideas from Scott? I'm not sure if that's the right perspective, but it's what I'm alluding to when I say you're introducing artificial logical uncertainty.

I don't think it's much like un-updating. Un-updating takes a specific fact we'd like to pretend we don't know. Plus, the idea there is to back up the inductor. Here I'm looking at average performance as estimated by the latest stage of the inductor. The artificial uncertainty is more like pretending you don't know which problem in the sequence you're at.