Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Logical Counterfactuals Consistent Under Self-Modification

3Vanessa Kosoy

1Vanessa Kosoy

New Comment

2 comments, sorted by Click to highlight new comments since: Today at 6:25 PM

I think it's possible to interpret the metathreat approach to superrationality along the same lines. Namely, the metathreat is a "policy" that maps possible "logical universes" (the unknown lower level metathreats of the other players) into actions (lower level metathreats of the agent).

I argue that the correct notion of logical counterfactual is the one which an agent would self-modify to have if the agent lacked it, and that this significantly constrains the possibilities. The argument is rather informal, and I'm not really sure if many of the points are new, so I'm posting it in Discussion.

## Where the idea is coming from

(This section is an extended motivation, and may be skipped.)

At the most recent MIRIxLA, Alex Mennen led a discussion attempting to apply conservation of expected evidence to the Demski prior. The basic idea was that approximations of the Demski prior discussed so far can do quite unreasonable things before converging to good probabilities, even those that are uniformly coherent. Can we describe a property for approximations such that the probabilities given at finite times are in some sense reasonable? Based on the idea of conservation of expected evidence, is it possible for an approximation to give finite probabilities such that (in some sense) the probability that the estimate will drift up later is balanced by the probability it will drift down?

We didn't come up with any formulations of this idea which seemed both possible and nontrivial, and we came away with the feeling that something like what we were after would not be possible in the end. The version which we spent the longest time discussing was this one:

(1) ∀t,a≤b:Eϕ(P(ϕ)|Pt(ϕ)∈[a,b])∈[a,b]

Pt() represents the probability estimate at time t, while P() is the probability reached in the limit of the approximation process. The expectation is being taken over ϕ; we suppose a probability distribution μ(ϕ) which we're sampling sentences from. At any time t, sampling only ϕ whose probability

estimatelands in a range [a,b], we want the expected value of theeventualprobabilities to be in that same range. (Although inspired by conservation of expected evidence, this might also be seen as a notion of calibration.)Although we convinced ourselves that this was very likely impossible, we didn't come up with a proof. None of our variations to the definition seemed very promising either.

Somewhat facetiously, I offered the following resolution to our difficulties. One thing which seems artificial about (1) is the expectation over ϕ based on μ. What this accomplishes is to inject a kind of made-up uncertainty into the logically certain world; if we turned a blind eye to the exact sentence ϕ and only considered the current probability estimate, would it be too high or too low on average? We're acting as if we forgot which sentence we're dealing with.

What we

reallywant is an expectation in ourrealuncertainty, namely, our logical uncertainty. Therefore, let's explicitly model our uncertainty about the consistency of theories. We can pretend that the consistency check is a stochastic process which has some chance of eventually declaring sets of sentences inconsistent. Approximate the Demski prior by a sampling procedure as usual, except incorporating these random declarations of inconsistency as well as the usual, so that sentences have some chance of being thrown out. Pt(ϕ) runs the real proof checker for time t, but additionally declares some fraction of sentence-sets inconsistent at random; this fraction decreases with t so that we still accept consistent sentences in the limit. This turns Pt(ϕ) into a computable distribution.Now, with respect to this distribution, conservation of expected evidence and other typical properties will hold. This doesn't

reallyget us anything nice, though -- it's a bit like saying you're well-calibrated when things are sampled according to your own prior. The consistency check isn't anywhere near random, so although our belief structure follows conservation of expected evidence with respect to the approximation procedure (equally expecting probabilities to shift up or down over time), theactualaverage movement based on therealconsistency checking may well be much more predictable in practice.The approach is quite odd, in that I'm suggesting that the way to get a good approximate probability distribution over logic is to start with a good approximate probability distribution over logic. (Consistency facts are, after all, just logical facts.)

I first formulated this "cheater's solution" when I was first thinking about the two update problem. It seemed as if the two-update problem was pointing at different updates for observing X vs observing a proof of X. Rather than keeping possibilities until they're shown to be inconsistent and then throwing them out, we maintain some uncertainty about this. This turns the "second update" from a non-Bayesian surgery on beliefs into a Bayesian update in an expanded belief structure, thanks to the explicit model of consistency.

Again, this doesn't seem to

reallysolve anything for the two-update problem: by modeling the consistency check as a probabilistic variable, I "bayesianize" the second update; it now looks like we're conditioning on information in a proper manner. But, what is really gained? The probabilistic model of consistency checks is bound to be poor, so the probabilities coming out of the approximation won't be very meaningful. It seemed more interesting to look at the proposed solutions (such as backup machines) which might lead to a more powerful prior, rather than the same prior approximated another way. And soon, we moved on from the two-update problem to more fruitful things.One thought which I entertained

verybriefly was to get logical counterfactuals out of an approach like this, by running a version of updateless decision theory using a prior which is logically ignorant, by modeling consistency checks like this. I now think this idea may be a good one, as I'll argue in the remainder of the post.## Motivation

The original TDT paper argued that CDT agents who considered decision-theoretic problems such as Newcomb's problem, and who had the option to self-modify to TDT, would do so.

Roughly speaking, there are two separate claims being made by TDT/UDT: (A) decisions should be made in an timeless/updateless manner, and (B) different actions should be considered according to logical counterfactuals. (In TDT, this means a causal diagram of mathematics; in UDT, this was initially called a "mathematical intuition model" to make sure it was obvious that it was a blank to be filled in.)

It's straightforward to see why an agent given the opportunity to self-modify to conform to point (A) would do so. The very idea of (A) is to choose the actions you

would have wanted yourself to chooseaccording to your prior. The argument for (B) is, I think, less obvious. The point of the current post is to argue that a CDT agent or other sufficiently reasonable agent will in fact self-modify to conform to (B) as well; furthermore, this tells us something about what makes a good or bad logical counterfactual.As in that paper, I'll ignore details of self-modification such as the Lobian obstacle and reason informally, under the assumption that the agent is capable of making reasonable predictions about the results of self-modification.

## Betting on logical coins

The idea came out of a conversation with Scott about betting on logical coin-flips. Suppose that Sandy and Omega are making bets on digits of pi. The digits are far enough out in the sequence that even Omega cannot predict them at better than chance, and Omega can predict Sandy perfectly before running a long computation to determine the outcome. Omega gives Sandy favorable odds to make it worth Sandy's time. However, Omega lacks an enforcement mechanism for the bets, and must trust Sandy to pay out. If Omega predicts that Sandy won't pay for losses, Omega will not offer to make a bet with Sandy. (Notice this means Omega possesses a notion of counterfactual.) This provides an incentive to be a trustworthy betting partner.

Suppose that Sandy accepts arguments for (A), and so computes actions in an updateless manner, but does not accept (B). (Expected utility is evaluated by conditioning on actions as in EDT, or causal conditioning as in CDT, or perhaps something else, but not logical conditioning.) Sandy is a trustworthy betting partner for non-logical (that is, empirical) bets; if the optimal strategy was to commit to the bet prior to seeing the outcome, the optimal strategy will be the same after seeing the outcome, so Sandy pays up without any commitment mechanism. For a logical coin, however, Sandy might falter: discovering new logical information allows Sandy to compute a better approximation of the updateless decision. If the bet is lost, Sandy may now logically rule out the case where the bet was won. (This is especially tempting if Omega provides a proof along with the correct answer.) If so, Sandy will now compute a negative utility for paying out, and refuse to do so. Hence, Omega never offers Sandy a bet.

The conclusion is that Sandy, realizing such problems, will accept (B) as well as (A).

I can see two objections to this argument. The first is that I haven't specified Sandy's decision algorithm at all well, so it's not really clear that she fails to be a good betting partner. The second is that Omega already has a notion of counterfactual, and might be essentially forcing it on the agent; perhaps we can get Sandy to accept any notion of counterfactual we choose, depending on the notion Omega employs. I think both objections can be addressed with a more general argument.

## Time-consistent logical gambles

Sandy's mistake, on my analysis, was to modify the approximation of the updateless decision. Although the updateless expected utility must be approximated (in logically uncertain situations), any

changeto this approximation can lead to temporal inconsistency of decisions, and therefore, reflective inconsistency (Sandy wishes to make decisions in a more temporally consistent way). So, for the same reason that a self-modifying agent will tend to accept (A) and switch to an updateless strategy to improve future expected utility, it will also wish to freeze its approximation Pt(ϕ) at the current value of t.Yet, we would not want to fix one approximation at the beginning and run with that; we

needto improve our approximation of the logical prior over time, if only to reach probability 1 for tautologies and probability 0 for contradictions. What do we do?Just as UDT refuses to update on observations, but instead expresses the policy as a function from observations to actions, we can refuse to update on our consistency check, but express the policy as a function from consistency information

andobservations. Consistency bits are treated like observations!More specifically, my proposal is this: as foreshadowed in the opening section, approximate the Demski prior with a sort of "pre-Demski" prior as follows. Build a distribution over random theories

andthe consistency checks, by sampling sentences as usual for the Demski prior, but modeling consistency checks as a random process check(Γ,t) which has a probability of outputting "inconsistent" at each time t for a given set of sentences Γ. For example, the probability could start at .25 and half with each increment of time, so that the total probability that any set of sentences is inconsistent is .5. (It may be desirable to make this more realistic, for example ensuring that Γ∪{ϕ} is more likely to be inconsistent than Γ alone, but it will always be a very rough approximation so it's unclear how much we ought to do.)Policies are specified as a function from a sequence of observations and consistency checker behaviors to actions; the agent implementation attempts to take actions from the max-utility policy, as computed via this "pre-prior". (A difficulty: there may not be a unique maximum. I'm not sure what to do about that at the moment.)

This approach optimizes strategies over all logical (im)possibilities, putting impossible possible worlds on equal footing with possible possible worlds.

It may be tempting to think this will yield terrible results analogous to using the approximation at time 1, and therefore we need to pack as much information into the initial approximation as we can. However, the result will usually be close to what we would get by the normal process of improving our approximation over time. Just as under certain conditions an updateless theory will give the same answer as a thing that updates on evidence, my suggested procedure would often generate the same answer as something which updated on consistency information.

## What remains

I don't expect writing down the formulas and algorithms of my proposal will be particularly hard, but I haven't done it here. I think there may be some regularities we need to impose on the stochastic consistency model, but I'm not sure. I don't expect to be able to formalize reflective consistency itself without also utilizing a solution to the probabilistic Tiling agents problem, or something close to that; in other words, without solutions to other significant problems. However, a formal description of a class of problems for which the decision procedure is optimal might become possible.

It would be awesome to produce a "logical Dutch book argument" to illustrate problems which arise if logical uncertainty isn't based on a probability distribution over possible impossible worlds, if that idea makes sense.

It might be that we can do

significantlybetter with a more interesting prior on the consistency checks; the best we can hope for is something which learns regularities in logical space. Keep in mind that this "learning" won't be changing the prior we use to evaluate expected utilities; it only modifies the expectations down the branch of possible worlds which our agent follows as it discovers logical facts. (It might also be that this ends up not mattering.)An interesting idea in connection with reflective consistency is that the beliefs on consistency might obey some kind of reflection principle relating them to beliefs about consistency expressed in first-order logic. It stands to reason that an agent considering its own choice of distribution over consistency-check behavior would prefer this distribution to match its logical uncertainty about consistency.