Why conditioning on "the agent takes action a" isn't enough

So8res

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This post expands a bit on a point that I didn't have enough space to make in the paper Toward Idealized Decision Theory.

Say we have a description of an agent program, and a description of a universe program $U()$, and a set of actions $A$, and a Bayesian probability distribution over propositions about the world. Say further that for each $a \in A$ we can form the proposition "the agent takes action $a$".

Part of the problem with EDT is that we can't, in fact, use this to evaluate $\mathbb{E}[U() \mid \text{the agent takes action } a]$. Why not? Because the probability that the agent takes action $a$ may be zero (if the agent does not in fact take action $a$), and so evaluating the above might require conditioning on an event of probability zero.
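To spell out the failure, here is the elementary definition of the conditional expectation, writing $A_a$ as shorthand (introduced here for illustration) for the event "the agent takes action $a$" and $\mathbf{1}_{A_a}$ for its indicator:

$$\mathbb{E}[U() \mid A_a] = \frac{\mathbb{E}[U() \cdot \mathbf{1}_{A_a}]}{P(A_a)}$$

If $P(A_a) = 0$, then the numerator is also zero, and the ratio is the indeterminate form $0/0$: the distribution simply does not say what happens in that case.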

There are two common reflexive responses: one is to modify the agent so that there is no action which will definitely not be taken (say, by adding code to the agent which iterates over each action, checks whether the probability of executing that action is zero, and then executes the action if it is definitely not going to be executed). The second response is to say "Yeah, but no Bayesian would be certain that an action won't be taken, in reality. There's always some chance of cosmic rays, and so on. So these events will never actually have probability zero."

But while both of these responses work -- in the sense that in most realistic universes, $v_{a} := \mathbb{E}[U() \mid \text{the agent takes action } a]$ will be defined for all actions $a$ -- they do not fix the problem. You'll be able to get a value $v_{a}$ for each action $a$, perhaps, but this value will not necessarily correspond to the utility that the agent would get if it did take that action.

Why not? Because conditioning on unlikely events can put you into very strange parts of the probability space.

Consider a universe where the agent first has to choose between a red box (worth $1) and a green box (worth $100), and then must decide whether or not to pay $1000 to meticulously go through its hardware and correct for bits flipped by cosmic rays. Say that this agent reasons according to EDT. It may be the case that this agent has extremely high probability mass on choosing "red" but nonzero mass on choosing "green" (because it might get hit by cosmic rays). But if it chooses green, it expects that it would notice that this only happens when it's been hit by cosmic rays, and so would pay $1000 to get its hardware checked. That is, $v_{\text{red}} = 1$ and $v_{\text{green}} = -900$.
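The arithmetic can be checked with a toy model. The sketch below is only an illustration under assumptions that are not in the original: the cosmic-ray probability `EPS` is an arbitrary small number, and the world is collapsed into two cases (no ray: take red and skip the check; ray: take green and pay for the check).

```python
from fractions import Fraction

# Assumed, illustrative probability of a cosmic-ray hit (not from the post).
EPS = Fraction(1, 10**9)

# Each entry: (probability, box taken, total utility in dollars).
WORLDS = [
    (1 - EPS, "red", 1),           # no ray: takes red, does not pay for the check
    (EPS, "green", 100 - 1000),    # ray: takes green, then pays $1000 for the check
]

def conditional_value(worlds, box):
    """Expected utility conditioned on the agent taking `box`."""
    mass = sum(p for p, b, _ in worlds if b == box)
    if mass == 0:
        raise ZeroDivisionError("conditioning on a probability-zero event")
    return sum(p * u for p, b, u in worlds if b == box) / mass

v_red = conditional_value(WORLDS, "red")
v_green = conditional_value(WORLDS, "green")
print(v_red, v_green)
```

Note that the result does not depend on how small `EPS` is, as long as it is nonzero: conditioning renormalizes the tiny "green" region, so all of its mass sits on the cosmic-ray world.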

What went wrong? In brief, "green" having nonzero probability does not imply that conditioning on "the agent takes the green box" is the same as the counterfactual assumption that the agent takes the green box. The conditional probability distribution may be very different from the unconditioned probability distribution (as in the example above, where conditioned on "the agent takes the green box", the agent would expect that it had been hit by cosmic rays). More generally, conditioning the distribution on "the agent takes the green box" may introduce spurious correlations with explanations for the action (e.g., cosmic rays), and therefore $v_{a}$ does not measure the counterfactual value that the agent would get if it did take the green box "of its own volition" / "for good reasons".

Roughly speaking, evidential decision theory has us look at the probability distribution where the agent does in fact take a particular action, whereas (when doing decision theory) we want the probability distribution over what would happen if the agent did take the action. Forcing the event "the agent takes action $a$" to have positive probability does not make the former distribution look like the latter distribution: indeed, if the event has positive probability for strange reasons (cosmic rays, a small probability that reality is a hallucination, or because you played chicken with your distribution) then it's quite unlikely that the conditional distribution will look like the desired counterfactual distribution.

We don't want to ask "tell me about the (potentially crazy) corner of the probability distribution where the agent actually does take action $a$"; we want to ask "tell me about the probability distribution that is as close as possible to the current world model, except imagining that the agent takes action $a$."
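To make the contrast concrete, the sketch below compares conditioning with one crude stand-in for "the current world model, except that the agent takes action $a$": keep the prior over cosmic rays fixed and simply override the box choice. This override is an assumption made for illustration, not a proposed formalization of the counterfactual; the names `EPS`, `default_box`, and `overridden_value` are hypothetical, and the toy model assumes the agent pays for the hardware check exactly when it has actually been hit.

```python
from fractions import Fraction

EPS = Fraction(1, 10**9)                  # assumed cosmic-ray probability
RAY_PRIOR = {True: EPS, False: 1 - EPS}   # prior over "was hit by a cosmic ray"

def default_box(ray):
    """Box the agent takes when nothing overrides it (green only if hit)."""
    return "green" if ray else "red"

def utility(ray, box):
    """Prize from the box, minus $1000 if the agent was hit (assumed: pays iff hit)."""
    prize = 100 if box == "green" else 1
    return prize - (1000 if ray else 0)

def conditional_value(box):
    """EDT-style: restrict to worlds where the agent takes `box`, then renormalize."""
    hits = [(p, ray) for ray, p in RAY_PRIOR.items() if default_box(ray) == box]
    mass = sum(p for p, _ in hits)
    return sum(p * utility(ray, box) for p, ray in hits) / mass

def overridden_value(box):
    """Stand-in counterfactual: keep the ray prior, force the box choice."""
    return sum(p * utility(ray, box) for ray, p in RAY_PRIOR.items())

print(conditional_value("green"), overridden_value("green"))
```

Under these assumptions, conditioning on "green" gives -900, while the override gives a value just under 100, because the override does not drag the cosmic-ray explanation along with the action.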

The latter thing is still vague and underspecified, of course; figuring out how to formalize it is pretty much our goal with studying decision theory.

Comment from Diffractor (9y ago):

UDT has this same problem, though. In UDT, model uncertainty is being exploited instead of environmental uncertainty, but conditioning on "Agent takes action A" introduces spurious correlations with features of the model where it takes action A.

In particular, only one of the actions will happen in the models where con(PA) is true, so the rest of the actions occur in models where con(PA) is false, and this causes problems as detailed in "The Odd Counterfactuals of Playing Chicken" and the comments on "An Informal Conjecture on Proof Length and Logical Counterfactuals".

I suspect this may also be relevant to non-optimality when the environment is proving things about the agent. The heart of doing well on those sorts of problems seems to be the agent trusting that the predictor will correctly predict its decision, but of course, a PA-based version of UDT can't know that a PA or ZFC-based proof searcher will be sound regarding its own actions.
