This post collects some needed concepts and discussion that didn't cleanly fit into any other section, so it might be a bit of a rambly mess.

Specifically, this post splits into two parts. One is assorted musings about when to defer to past-you vs current-you when making decisions, if we're permitting unplannable observations. The other is looking at Vanessa's dynamically consistent update rule, and combining that with our earlier reasoning on affineness to reduce the complexity of the finite-time-horizon version of UDT down from exponential to a smaller exponential.

Why Defer To Past-You?

So, one of the major features of the standard model (an environmental observation space $O$, an action space $A$, finite sequences of environmental observations as our plannables, and a function that's affine in its second argument), which makes things a lot easier than they'd be in practice, is that there's an objective answer to what the environment is. All the versions of you agree on which tree they're in and how their actions affect how probability-mass flows through it. There's an objective answer to "how much does playing a bit more of this action over here affect what observation happens over there?"

But, if our information about how actions influence results doesn't descend from heaven at the start of time, things get more complicated. If "my beliefs about how actions affect other places" are unplanned observations that you receive after thinking for a while, you can't just go "have me-at-the-start-of-time say what to do".

If you're having to learn which tree/environment you're in, and that learning consists of unplanned observations, and your actions affect how probability-mass flows through that tree, then this raises the problem that the yous which have made different observations can disagree on how their actions affect how probability-mass flows through the tree, and act at cross-purposes.

Is there any principled way of figuring out which sorts of things you should defer to past-you's wishes on, or which sorts of things you should delegate to future-you, when you allow unplanned observations to be something that happens sometimes? Well, to be honest, I'm still sorta confused about this, but I feel more confident than I used to be.

One of the standard arguments trotted out in UDT discussions is that if you're not behaving in an updateless way, the early versions of you would predictably regret future-you's decision-making, and act so as to seal off your future decision-making by precommitting to what to do ahead of time. An agent where the early version of the agent never wants to seal off the ability of its future-self to freely act is called "dynamically consistent": it doesn't ever think the future version of itself is making a predictable mistake.

But, on the other hand, as we saw in the last post, the power of waiting-till-later to maximize is that the choice of maximizing action can correlate itself with more things if it's inside more expectations. This includes unplanned observations! The argument for waiting till later is that unexpected stuff might show up, which early-you didn't plan for in advance, and the rigid plans that early-you would have wanted to precommit to aren't flexible enough to deal with the situation you've (unplannably) observed yourself to be in.

In fact, it's possible to show that (modulo some assumptions which all the interesting philosophy is hiding within), logical inductors will converge to thinking it's better to think longer before making a decision, for the usual reason of "maximization does better the more expectations it's inside". More specifically, the critical reasoning step is something like "if you're picking over a small-enough set of actions, then for every specific action A, you know

$$\max_{a}\mathbb{E}_{\text{later}}[U \mid a] \ge \mathbb{E}_{\text{later}}[U \mid A]$$

So,

$$\mathbb{E}_{\text{now}}\!\left[\max_{a}\mathbb{E}_{\text{later}}[U \mid a]\right] \ge \mathbb{E}_{\text{now}}\!\left[\mathbb{E}_{\text{later}}[U \mid A]\right]$$

And, if the laws of iterated expectations hold for your actions (they do, in the limit), you can collapse the latter term down to "$\mathbb{E}_{\text{now}}[U \mid A]$", ie, the expected score of deciding now to play action A. Since this holds for every action A, deferring does at least as well as the best decision you could make now.
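As a sanity check on that reasoning, here's a tiny numeric illustration (all numbers invented, not from the original argument): two equally likely observations, two actions, and we compare committing now against waiting to see the observation first.

```python
# Toy check that deferring does at least as well as committing now.
# Two equally likely observations; utilities U[obs][action] are made up.
U = {"o1": {"A": 1.0, "B": 0.0},
     "o2": {"A": 0.0, "B": 1.0}}
p = {"o1": 0.5, "o2": 0.5}

# Deciding now: commit to a single action before seeing anything.
decide_now = max(sum(p[o] * U[o][a] for o in p) for a in ("A", "B"))

# Deferring: see the observation, then maximize.
defer = sum(p[o] * max(U[o].values()) for o in p)

print(decide_now, defer)  # 0.5 1.0 -- max_A E[U|A] <= E[max_a E_later[U|a]]
```

The gap between the two numbers is exactly the value of letting the maximization sit inside the expectation over observations.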

The fiddly complications on that basic reasoning come around when you're trying to keep careful track of all the conditionals you're operating within, and how to link everything to "expected score of me making a decision now" and "expected score of me deferring so future-me makes the decision later". When you do this, the most critical and dubious assumption you have to make is:

expected utility of me deciding now to play action A $\approx$ my expectation of (future me's expected utility of A) if I defer

This assumption gets broken in cases where you're up against a kinda dumb opponent who can see commitments you make now, but who isn't smart enough to correlate their action with what future-you will do. It also gets broken if you think future-you's estimates of action consequences are predictably screwed up.
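As a minimal sketch of the first failure mode (a Chicken-flavored toy with entirely invented payoffs): the opponent backs down only when it can see a visible early commitment, and can't simulate what an uncommitted future-you would pick, so the two sides of the "$\approx$" above come apart.

```python
# Toy Chicken-style game (all payoffs invented).
# The opponent sees *whether you committed early*, but cannot predict
# what an uncommitted future-you will choose.
PAYOFF = {("dare", "dare"): -100, ("dare", "swerve"): 10,
          ("swerve", "dare"): 0, ("swerve", "swerve"): 5}

def opponent_move(your_visible_commitment):
    # Backs off only when it can see an early commitment to dare.
    return "swerve" if your_visible_commitment == "dare" else "dare"

# Expected utility of committing NOW to dare:
eu_commit_now = PAYOFF[("dare", opponent_move("dare"))]          # 10

# If I defer, no commitment is visible, so the opponent dares; future-me's
# expected utility of then playing dare:
eu_future_me_plays_dare = PAYOFF[("dare", opponent_move(None))]  # -100

print(eu_commit_now, eu_future_me_plays_dare)  # the two sides of the "≈" diverge
```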

How UDT1.01 Handles This

In fact, this basic reasoning (deferral is good except to the extent that you think future-you will predictably misestimate the true quantity relative to your own estimates) is sort of what UDT1.01 does. UDT1.01 (fully expanded instead of presented in a recursive way) pretty much says...

There are four sorts of terms present. One set of terms corresponds to the "causal effects", where you acting now has effects on the future. The second set of terms corresponds to the "retrocausal effects", where you acting now has effects on the past. The third set of terms corresponds to the "acausal effects", where you acting now has effects on different branches. And then there's an unexpected fourth batch of terms in UDT1.01, which weren't in my original version. They correspond to a fudge factor of "past-me thought future-me would predictably screw up, because early commitments have their own special power, and future-me won't take that into account".

The second, third, and fourth terms (the retrocausal, acausal, and "future me is predictably dumb" terms) are all phrased in terms of influence measures, so as long as you think such weird decision theory issues arise rarely and you have a limited budget for saying "do this thing for weird decision theory reasons", your decision is (probably) controlled by the causal effects.
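Schematically, and very much hedged (this is my own paraphrase of the shape of the four-term breakdown, with every name below a stand-in rather than anything from the actual UDT1.01 definition), the score of an action looks something like:

```python
# A stand-in sketch of the four-term structure, not the real UDT1.01 formula.
# `beliefs`, `influence`, and `situations` are hypothetical objects supplying
# the relevant estimates; the point is only the shape of the sum.

def udt101_style_score(a, s, beliefs, influence, situations):
    # Causal: effects downstream of acting now, using current beliefs.
    causal = beliefs.current_eu(s, a)
    # Retrocausal: influence-weighted effects on past situations.
    retro = sum(influence(s_past, s, a) * beliefs.past_eu(s_past, s, a)
                for s_past in situations.past_of(s))
    # Acausal: influence-weighted effects on other branches.
    acausal = sum(influence(s_other, s, a) * beliefs.branch_eu(s_other, s, a)
                  for s_other in situations.other_branches(s))
    # Fudge factor: past-you's correction for future-you underrating the
    # special power of early commitments.
    fudge = sum(influence(s_past, s, a) * beliefs.commitment_correction(s_past, s, a)
                for s_past in situations.past_of(s))
    return causal + retro + acausal + fudge
```

Since the retrocausal, acausal, and fudge terms all carry influence-measure weights, they stay small unless the weird decision-theory situations are judged to matter, which is the "limited budget" point above.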

For the causal/retrocausal/acausal terms, you defer to the beliefs of the most recent version of you that would have both the ability-to-precommit-to-future-actions and opinions-on-the-result-of-that-precommitment, and it's the most recent such version because decisions made with more knowledge are better.

Deferring to any other version of you (and there are four options) will have problems.

One possibility is deferring to the beliefs of a version of you in an alternate timeline. This doesn't work because figuring out the beliefs of an alternate version of you may be outright impossible, particularly for mathematical alternatives. For instance, there's no objective fact about "what my beliefs would be if the digits of pi were different". More importantly, those alternate versions of you have no power to bind your decision by precommitments, because they're not in your past, so why care about their opinion?

The second possibility is deferring to your future beliefs. You can't do that because you aren't future-you (yet), and don't have access to their beliefs.

The third possibility is deferring to your more recent past beliefs. Those versions of you will go "well, the event you're asking about already happened/didn't happen, so why care about it?" and further-past-you (with more precommitment power) wouldn't want you to use that reasoning.

And the fourth possibility is deferring to more distant past beliefs, who would go "Well, I can always gain more information and make the precommitment later, while better-informed", and punt the decision off till later. So, defer to the beliefs of the most-recent you with both precommitment-ability, and who hasn't seen the effects of your decision yet.

For causal effects, this means "use your own beliefs". For retrocausal effects, this means "use the beliefs of the past-you that thinks they're being affected by your decision now". For acausal effects, this means "use the beliefs of the latest version of you that didn't know which way things would go".

One neat feature is that all of these are soft constraints, not hard constraints, that leave room for going "Surprisingly, the causal effects look way more important than past-me thought the causal effects would look when they were making their promise. Screw the rules, I'm doing what's right!". Put more intuitively, promises rated for "save a life"-level circumstances aren't necessarily going to hold in "save the world"-level circumstances, though if there's some clever additional plan that keeps the promise and incurs less than 1 life worth of marginal cost, you'll take the clever additional plan.

There are critical assumptions you need to make for this breakdown to hold, which will be discussed later.

And now for something completely different!

Dynamically Consistent Updating

Let's look at the one example we have of "updating while being dynamically consistent", because it points towards a rather important simplification of the problem of "how do I act in a way that takes the complicated mess of retrocausal and acausal effects into account". And this simplification brings the finite-time-horizon version of the problem down even further in complexity.

Specifically, Vanessa's InfraBayes setting has a dynamically consistent update rule, where the past agent perfectly agrees with the future agent on what to do.

It's worth going into further detail to understand how the dynamically consistent update rule works, because it makes a lot of what UDT1.01 is doing more legible.

In Vanessa's setting, environments are just the ordinary sort of environment that only reacts to your action, not what you would do in different scenarios. However, you're dealing with sets of environments, and planning for the worst-case. And this lets policy correlations arise. The (worst-case) score for a given policy $\pi$ is

$$\inf_{e \in \mathcal{E}} \mathbb{E}_{\pi,e}[U]$$

Ie, the worst-case expected utility (within the set $\mathcal{E}$) of an environment $e$, interacting with your policy $\pi$.

The way this incorporates policy-correlation effects is that the $\inf$ lets the choice of environment vary depending on your overall policy $\pi$, and this can capture some problems where the environment depends on your policy instead of on your specific actions.
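Here's a tiny worked example of that point (environments and payoffs invented): each environment in the set just decides which observation you see, but because the score takes a minimum over the set, which environment gets "selected" depends on your whole policy.

```python
from itertools import product

# Two "ordinary" environments: each one just decides which observation you get.
# (Payoffs and environments are invented for illustration.)
ENVS = {"e_x": "x", "e_y": "y"}
U = {("x", 0): 3, ("x", 1): 0, ("y", 0): 0, ("y", 1): 3}

# A policy maps the observation to an action.
policies = [dict(zip(("x", "y"), acts)) for acts in product((0, 1), repeat=2)]

for pi in policies:
    # Worst-case score: min over the environment set of expected utility.
    scores = {name: U[(obs, pi[obs])] for name, obs in ENVS.items()}
    worst_env = min(scores, key=scores.get)
    print(pi, "score:", scores[worst_env], "minimizing env:", worst_env)
# The minimizing environment depends on the whole policy, which is how
# worst-case planning over a set sneaks in policy-dependent effects.
```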

Updating is interesting, though. Properly speaking, when you update an environment in the infraBayes setting, you get a generalization of an environment, of the form $(p, e, b)$. The interpretation is that $p$ is "probability of getting into the situation I'm in in the first place", $e$ is "the environment describing what happens from here on out", and $b$ is the chunk of expected utility that arises from not getting into your situation in the first place.

So, if your policy is $\pi$, your utility function is $U$, the environment is $e$, and you update on the history $h$, then the generalized-environment

$$(1,\ e,\ 0)$$

would update to become

$$\Big(\mathbb{P}_{\pi,e}(h),\ \ e|h,\ \ \mathbb{P}_{\pi,e}(\neg h)\cdot\mathbb{E}_{\pi,e}[U \mid \neg h]\Big)$$

The probability of getting into the situation you're in, in the first place, is "probability of $h$ occurring". The environment you're in turns into $e|h$, ie, "normally update on having seen history $h$". And the chunk of expected utility coming from never getting into your situation in the first place is "probability of $h$ not occurring, times the expected utility if $h$ doesn't occur".

And this is how we get dynamic consistency to hold, because after seeing an observation, instead of just paying attention to what happens from here on out (the environment part $e|h$, which encodes the causal effects), we also pay attention to the probability of getting into this situation in the first place (the retrocausal effects, ie, the number $\mathbb{P}_{\pi,e}(h)$), and the expected utility if we didn't get into this situation in the first place (the acausal effects, ie, the number $\mathbb{P}_{\pi,e}(\neg h)\cdot\mathbb{E}_{\pi,e}[U \mid \neg h]$), and this gets correct behavior in a bunch of decision-theory toy problems, and turns dynamic consistency from a nearly-unattainable desideratum into a surprisingly trivial theorem.

But wait, don't you need to know your own policy in order to compute this update? Not entirely. You don't actually need to know what your policy does after you get to history $h$. You only need to know what your policy does in the situations where $h$ never happened, in order to accurately compute the quantities "probability of $h$ occurring in the first place" and "expected utility if $h$ never occurred in the first place".
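Here's a toy check of that claim (a made-up two-step environment, with notation loosely matching the update above): the quantities $p$ and $b$ are computed from the first action and the behavior on the branch where $h$ didn't happen, and the consistency identity then holds no matter what you decide to do after $h$.

```python
# Toy check of the dynamically consistent update (all numbers invented).
# Structure: one action, then an observation in {"h", "n"}, then a second action.

def obs_probs(a1):
    # The environment: observation probabilities depend on the first action.
    return {"h": 0.7, "n": 0.3} if a1 == 0 else {"h": 0.2, "n": 0.8}

UTIL = {("h", 0): 1.0, ("h", 1): 4.0, ("n", 0): 2.0, ("n", 1): 0.0}

def total_eu(a1, second_action):  # second_action: obs -> action
    return sum(p * UTIL[(o, second_action[o])] for o, p in obs_probs(a1).items())

# Fix only the "off-h" part of the policy: the first action, and what to do after "n".
a1, a2_after_n = 0, 0

# The update after seeing the history (a1, "h") needs just that off-h information:
p_h = obs_probs(a1)["h"]                            # probability of reaching h
b   = obs_probs(a1)["n"] * UTIL[("n", a2_after_n)]  # EU chunk from the not-h branch

# Dynamic consistency: for ANY choice of what to do after "h",
# prior EU of the full policy == p_h * (EU from h onward) + b.
for a2_after_h in (0, 1):
    lhs = total_eu(a1, {"h": a2_after_h, "n": a2_after_n})
    rhs = p_h * UTIL[("h", a2_after_h)] + b
    assert abs(lhs - rhs) < 1e-12
print("dynamic consistency check passed")
```

Past-you's evaluation of the full policy and updated-you's evaluation of the continuation differ only by the fixed affine transformation $x \mapsto p \cdot x + b$, so they rank the continuations identically, which is the dynamic consistency theorem in miniature.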

Actually, that's an idea for simplifying things considerably! Maybe we don't need to figure out our entire policy in all situations, as long as we've got a reasonable estimate of "expected utility if I don't get into this situation in the first place".

Expected Utility as Complexity Shield

So, we saw from post 2 that affineness in probabilities (probabilities of upcoming observations vary linearly with changes in your action probabilities) didn't automatically imply affineness in expected utility (expected utility varies linearly with changes in your action probabilities). However, we can still take a derivative of a non-affine function to make an affine approximation. Our step from general policy-selection environments to environments where our probabilities of observations were affine brought the complexity down from double-exponential to exponential. So maybe there's a similar savings to be found by stepping from that, to environments where our expected utilities are affine as well.
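To illustrate the "take a derivative to get an affine approximation" move (a made-up one-parameter example, not the actual construction from the earlier posts): if the probability of playing an action enters the expected utility at two correlated spots, expected utility is quadratic rather than affine in that probability, but its derivative at the current policy gives a locally good affine stand-in.

```python
# Sketch of differentiating a non-affine expected utility to get an affine
# approximation (a made-up one-parameter example).
# Suppose the probability q of playing action A matters at two situations,
# so EU(q) = q*q is non-affine in q.

def eu(q):
    return q * q

q0 = 0.5
eps = 1e-6
slope = (eu(q0 + eps) - eu(q0 - eps)) / (2 * eps)   # numerical derivative at q0

def eu_affine(q):
    # First-order (affine) approximation around the current policy q0.
    return eu(q0) + slope * (q - q0)

for q in (0.4, 0.5, 0.6):
    print(q, eu(q), round(eu_affine(q), 6))
# The affine approximation is accurate for small policy changes around q0,
# which is what the "locally affine" environments below exploit.
```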

We want some sort of dynamic consistency result, where the past agent endorses the future-agent's decision making. The one example of a dynamically consistent update rule we have involves figuring out our entire policy and how it behaves in all the situations we aren't in. However, it's only using that information to figure out "the chunk of expected utility that arises if I'm not in the situation I'm in" and "the probability of the situation I'm in", so shouldn't those quantities be the core thing to look at?

And so, it makes sense to look at how expected utility changes as our action varies, as a way of simplifying things even further. In fact, this will let us get the complexity of specifying a "policy selection environment" down from an exponential to a smaller exponential.

More specifically, from the perspective of early-you (which UDT1.01 defers to for figuring out acausal and retrocausal effects), you lack the computational power to figure out precisely what future-you's decision is affecting. You could start off with beliefs about how future-you is affecting every other situation. But it seems perfectly coherent to have vaguer beliefs than that. An example of this reasoning is "I'm not entirely sure, but it seems that having a policy of refusing to reveal secret information even when it looks like a good idea is the sort of thing that benefits me in general, though I don't have the power to figure out precisely which situations benefit from me being like that".

In a sense, "my expectation of utility if I do action A in situation S" is a computation that lets you blackbox all the complicated fiddly bits of precisely how doing A in situation S benefits you. Maybe you could explicitly compute exactly which situations benefit from that action, and how they affect probabilities of all the other situations. Or maybe you're sorta dumb, but have experimented a decent amount, and you go "this sort of thing is expected to be about this helpful". Basically, expected utility is acting as a complexity shield. You don't need to figure out exactly how an action is good in order to know that you get good results empirically from doing it.

With locally affine environments, we were previously at "there's a single belief state, which tells you how actions at every situation affect every other situation", which takes about $2^{2n}$ numbers to describe.

And so, if you go from "there's a single belief state, which tells you how actions at every situation affect every other situation" to "there's many belief states, but you only have to worry about the belief states of the past yous, and the belief states of the past yous don't track all the influences, just the influences on current expected utilities and probabilities", then this should get you something on the order of $2^n$ numbers to keep track of everything. I might have forgotten some multiplicative constants here, but I'm very confident that the number in the exponent is only an additive constant away from $n$, and definitely isn't $2n$.

Pretty much, when you have to keep track of how behavior in every situation affects results at every other situation, and there are about $2^n$ situations, you should expect about $2^{2n}$ numbers are needed to keep track of everything. But if you're just keeping track of how behavior at every future situation affects your current expected utility, then you need about $2^n$ numbers to keep track of that, and you can abstract away from a bunch of fiddly details of exactly how the acausal effects work, and instead shrug and go "seems like the sort of thing that works out on average."
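For concreteness (assuming, as in the rough counts above, about $2^n$ situations at time horizon $n$, with constants ignored):

```python
# Rough bookkeeping for the counting argument (constants and bases are rough).
n = 20
situations = 2 ** n                     # ~10^6 situations

pairwise_influences = situations ** 2   # every situation's effect on every other: ~2^(2n)
eu_and_prob_only = 2 * situations       # just effects on current EU and probability: ~2^(n+1)

print(f"{pairwise_influences:.2e}")     # ~1.10e+12
print(f"{eu_and_prob_only:.2e}")        # ~2.10e+06
```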

The bulk of the complexity now is in figuring out, for a wide variety of future scenarios, their effects on current expected utility. If you "chunked" the future scenarios into polynomially many "bins", where distinct future scenarios could be thrown together as being analogous, you could bring things down to polynomial complexity!

Except that I don't know how to do that chunking in a principled way, so I'll content myself with the exponential for now.

Next post, we'll cover how to get logical inductors to hold implicit beliefs about a really large set of possibilities that can be retroactively queried, so a logical inductor can have sane beliefs about exponentially many things. This wasn't quite enough to let me plug a logical inductor into UDT1.01, but it feels like a critical step along the pathway, and one that's worth discussing.
