The Omnipresence of Unplanned Observations

Time to introduce some more concepts. If an observation is "any data you can receive which affects your actions", then there seem to be two sorts of observations. A plannable observation is the sort of observation where you could plan ahead of time how to react to it. An unplanned observation is the sort which you can't (or didn't) write a lookup-table style policy for.

Put another way, if a policy tells you how to map histories of observations to actions, those "histories" are the plannables. However, to select that policy in the first place, over its competitors, you probably had to do some big computation to find some numbers like "expected utility if I prepare a sandwich when I'm in the kitchen but not hungry", or "the influence of my decisions in times of war on the probability of war in the first place", or "the probability distribution on what the weather will be if I step outside", or "my own default policy about revealing secret information". These quantities affect your choice of action. If they were different, your action would be different. In some sense you're observing these numbers in order to pick your action. And yet, the lookup-table style policies which UDT produces are phrased entirely in terms of environmental observations. You can write a lookup-table style policy about how to react to environmental observations, but these beliefs about the environment aren't the sort of observation that's present in our lookup table. You aren't planning in advance how to react to these observations, you're just reacting to them, so they're unplanned.

Yeah, you could shove everything in your prior. But to have a sufficiently rich prior, which catches on to highly complex patterns, including patterns in what your own policy ends up being... well, unfolding that prior probably requires a bunch of computational work, and observing the outputs of long computations. These outputs of long computations that you see when you're working out your prior would, again, be unplanned observations.

If you do something like "how about we run a logical inductor for a while, and then ask the logical inductor to estimate these numbers, and freeze our policy going forward from there?", then the observations from the environment would be the plannables, and the observations from the logical inductor state would be the unplanned observations.
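To make that split concrete, here's a minimal sketch (my own illustration, in Python, with the logical inductor abstracted into a made-up `estimate` black box) of the "deliberate, then freeze" scheme. Everything the estimator says during deliberation is an unplanned observation; the only plannable observations are the environment bits that the frozen table keys on.

```python
from itertools import product

def deliberate_then_freeze(estimate, n, actions=("a", "b")):
    """Hypothetical sketch: `estimate(history, action)` stands in for querying
    a logical inductor (or any slow deliberation process) for the expected
    utility of playing `action` after seeing `history`.  Everything it returns
    is an *unplanned* observation: we react to it, but we never wrote a lookup
    table over its possible values."""
    policy = {}
    # The keys of this table (bitstrings of length < n) are the *plannable*
    # observations: we decide, in advance, how to react to each of them.
    for length in range(n):
        for history in product("01", repeat=length):
            history = "".join(history)
            policy[history] = max(actions, key=lambda a: estimate(history, a))
    return policy  # frozen from here on: no further updating on deliberation

# Toy usage, with an arbitrary estimator standing in for the inductor's state.
toy_estimate = lambda h, a: h.count("1") + (0.5 if a == "a" else 0.0)
print(deliberate_then_freeze(toy_estimate, n=3))
```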

The fundamental obstacle to making updatelessness work with logical uncertainty (being unsure about the outputs of long computations) is this general pattern. In order to have decent beliefs about long computations, you have to think for a while. The outputs of that thinking also count as observations. You could try being updateless about them and treat them as plannable observations, but then you'd end up with an even bigger lookup table to write.

Going back to our original problem, where we'll be seeing n observations/binary bits, and have to come up with a plan for how to react to the bitstrings... Those bitstrings are our plannable observations. However, in the computation for how to react to all those situations, we see a bunch of other data in the process. Maybe these observations come from a logical inductor or something. We could internalize these as additional plannable observations, to go from "we can plan over environmental observations" to "we can plan over environmental observations, and math observations". But then that would make our tree of (plannable) observations dramatically larger and more complex. And doing that would introduce even more unplanned observations, like "what's the influence of action A in 'the world where I observe that I think the influence of action A on my utility is 0.03' on the expected utility in 'the world where I observe that I think the influence of action A on my utility is -0.02'". Don't worry, I had difficulty parsing that sentence too.

You can't plan a policy for how to react to everything, because crafting a policy of that form that isn't garbage requires you to see a bunch of data of the form "what precisely happens if I run this policy?" without having planned in advance how to react to that data.

Similarly, for our "local affineness" assumption, to write a lookup table for the roughly $2^n$ situations requires knowing an environment which takes about $4^n$ numbers to describe.
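(Quick sanity check on that count, using my own reading of local affineness from the earlier posts rather than anything official: with $n$ binary observations there are roughly $2^n$ histories, and local affineness says the probability of each history is an affine function of your action-probabilities at every history, so pinning down the environment means pinning down roughly $2^n \times 2^n = 4^n$ coefficients.)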

Let's take our learning-theory view. Assume there's a bunch of (sufficiently nice) environments that depend on your policy, and we want a policy that converges to optimal behavior in all of them, where "optimal behavior" involves taking advantage of retrocausal effects and predictability and commitments and cross-branch effects and stuff like that. Any nice policy we'd get would probably look like "figure out which world you're in, and with this information, come up with advance plans for situations. These advance plans won't involve you reasoning 'I'm in branch X, time to optimize for only this branch', and will involve optimizing for global utility instead".

Basically, I think we're going to have to learn how to act like we aren't learning, and update towards a good way to be updateless.

So Now What??

I should take this moment to go over why updating is good, even. Back to the very very basics, here.

Let's say we have a function of two variables, $f(x,y)$. You can call $x$ "nature", and $y$ "actions", if you feel like it. If there's a probability distribution $\mu$ over $x$, then we have

$$\max_y\ \mathbb{E}_{x\sim\mu}[f(x,y)]\ \le\ \mathbb{E}_{x\sim\mu}[\max_y f(x,y)]$$

Maximization gets more powerful the more layers of expectations it's inside, because the "choice" of a number that maximization makes can now depend on more things. The power of updating is that it lets you correlate "what happens" and "what I do about it", and gives you access to more options than just the simple plans which discard such correlations.

The ability of lookup-table policies to emulate maximization depends on

$$\mathbb{E}_{x\sim\mu}[\max_y f(x,y)]\ =\ \max_{\pi: X \to Y}\ \mathbb{E}_{x\sim\mu}[f(x,\pi(x))]$$

Now, this equality depends on $\mu$ being a probability distribution that isn't entangled with your choice of function $\pi$. If $\mu$ depended on your choice of $\pi$, the lookup table would be more powerful. But in order to get a good lookup table, guess what, it requires you to correlate "my choice of policy" with "how well that policy does". Being able to pick a good policy depends on you knowing the function $f$ and how $\mu$ depends on $\pi$ (or doesn't). And this might take a bit of experimentation to figure out...
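A tiny numeric check of both displayed facts, with made-up numbers (nature and action both binary, and $\mu$ independent of the lookup table, as the equality requires):

```python
import itertools

# Made-up payoff table f(x, y) and distribution mu over nature x.
f = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 2.0}
mu = {0: 0.5, 1: 0.5}

# max_y E_x[f(x, y)]: commit to one action before seeing nature.
best_fixed = max(sum(mu[x] * f[(x, y)] for x in mu) for y in (0, 1))

# E_x[max_y f(x, y)]: pick the best action after seeing nature.
expected_best = sum(mu[x] * max(f[(x, y)] for y in (0, 1)) for x in mu)

# max over lookup tables pi: X -> Y of E_x[f(x, pi(x))].
best_table = max(
    sum(mu[x] * f[(x, pi[x])] for x in mu)
    for pi in ({0: y0, 1: y1} for y0, y1 in itertools.product((0, 1), repeat=2))
)

print(best_fixed, expected_best, best_table)  # 1.0 <= 1.5 == 1.5
```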

However, even if we despair about our ability to be completely updateless... There's still a nontrivial desideratum to be attained, here.

Namely, regardless of the plannable situation that you're in, even if the way you react to observations (both plannable and unplanned) is beaten by some "broad policy" (some other way of reacting to both plannable and unplanned observations), you really shouldn't lose to any precommitted action that only depends on plannable information.

An agent shouldn't regret its own strength, it shouldn't flinch away from wielding its power of thought. If it thinks that its mode-of-thinking, which has the power of entangling itself with reality in ways that pre-planned strategies can't, loses out to some pre-planned strategy... Well, something's going wrong somewhere.

And now I can finally sort of say what UDT1.01 is. It was my attempt at writing a broad policy/algorithm for how to respond to unplanned information, and wield that information to do well in environments which might demand updateless behavior, without having to plan everything in advance at the start of time, and without losing to "just precommit and ignore the unplanned information, lol".

Well, actually, my first (cringeworthy) attempt was trying to directly write such a thing. The second, much better, attempt, was just doing the really-dumb-gradient ascent that I talked about last post, at a higher meta level, and taking the algorithm that dropped out of it.

More specifically, if you try to work out what a "gradient in broad-policy-space" looks like, and throw a moderately concerning number of assumptions at unpacking the resulting equation, you can get a really interesting result.

If $H$ is our set of plannable observations, and $G$ is our set of broad policies/ways of reacting to observations (plannable or unplanned), and $(\Delta G)^H$ is our set of "metapolicies" (policies of the form "if I see the plannable event $h$, sample an algorithm from the distribution $\Pi(h)$ and run it on the unplanned data coming in from my beliefs")...

Then you can show that, for any metapolicy $\Pi$, plannable event $h$, and broad policy $g$, "$\Pi$ but with a little bit of UDT1.01 added into the mix" will match or outperform "$\Pi$ but with a little bit of algorithm $g$ added into the mix".
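In symbols, and with the caveat that this is my gloss on the shape of the claim rather than its official statement (in particular, $U$ here stands for expected utility as evaluated by the agent's own epistemic state, and $\delta_g$ is the point-distribution on the broad policy $g$):

$$\forall\,\Pi\in(\Delta G)^H,\ h\in H,\ g\in G:\quad U\Big(\Pi\big[h\mapsto(1-\epsilon)\Pi(h)+\epsilon\,\delta_{\mathrm{UDT1.01}}\big]\Big)\ \ge\ U\Big(\Pi\big[h\mapsto(1-\epsilon)\Pi(h)+\epsilon\,\delta_{g}\big]\Big)$$

for sufficiently small $\epsilon>0$, where $\Pi[h\mapsto\ldots]$ means "the metapolicy $\Pi$, modified only at the plannable event $h$".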

Put another way, no matter how you react to plannable events by running algorithms on unplanned information, and no matter the plannable event, you'll think "yup, I should increase how much I run UDT1.01 here, it's a better algorithm to play here than any of its competitors". If the optimal improvement to a meal, regardless of meal, is "replace a tiny little bit of it with salt", then the optimal meal is a giant salt cube.

Amusingly enough, for an algorithm that fell out of "do dumb-gradient-ascent in algorithm space to figure out which algorithm to run in a plannable situation", UDT1.01 itself is mostly "Using the unplanned information, do dumb-gradient-ascent in action space, to figure out which action to play in a plannable situation".
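As a cartoon of that inner loop (again my own sketch, not the actual algorithm: it assumes the epistemic state hands us, for each action, an estimated influence on global expected utility, and it leaves out the weird correction terms mentioned below):

```python
def gradient_ascent_in_action_space(influence, actions, steps=100, lr=0.1):
    """Cartoon sketch: nudge a distribution over actions toward higher
    estimated global expected utility.  `influence(dist)` is a hypothetical
    stand-in for the unplanned data: it maps the current action distribution
    at this plannable event to a dict of estimated utility gradients."""
    dist = {a: 1.0 / len(actions) for a in actions}
    for _ in range(steps):
        grad = influence(dist)
        for a in actions:
            dist[a] = max(0.0, dist[a] + lr * grad[a])
        total = sum(dist.values()) or 1.0
        dist = {a: p / total for a, p in dist.items()}  # renormalize
    return dist

# Toy usage: a made-up influence estimate that favors action "b".
toy_influence = lambda dist: {"a": -0.1, "b": 0.2}
print(gradient_ascent_in_action_space(toy_influence, ("a", "b")))
```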

Except that there were a few weird terms showing up in it which weren't in my initial attempt, and which I didn't see coming ahead of time. They wound up teaching some novel lessons about dynamic consistency, and not regretting future-you's actions.

So I'm pretty happy with it, because it feels more like a preexisting thing I unearthed than something I constructed.

However, UDT1.01 did run into the problem that I couldn't plug a logical inductor directly into it and have it work (you run into a whole bunch of interesting problems when you do that), and the assumptions needed for its optimality were concerningly strong. I'll flag and discuss those assumptions, don't worry.

The Tree Of Observations, Expanded and Contracted

Let's introduce the setting that the UDT1.01 proof works in. $O$ is the finite set of environmental observations, $A$ is the finite set of actions. We receive an infinite sequence of environmental observations. Our plannables are $H = O^{<\omega}$, aka histories, aka finite sequences of environmental observations. The space of narrow policies (that only depend on plannable information) is $(\Delta A)^{H}$, functions from histories to distributions over actions. After every plannable observation/finite history $h$, we receive some unplanned observations, our "epistemic state at $h$", which contains data such as expected utilities, and guesses as to our policy, and guesses as to how various actions at various places influence the probabilities of upcoming environmental observations.
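Restated as (entirely hypothetical) types, just to keep the layers straight; none of these names are from the proof itself:

```python
from typing import Callable, Dict, Tuple

Obs = str                    # an environmental observation, drawn from the finite set O
History = Tuple[Obs, ...]    # a plannable observation: a finite sequence of observations
Action = str                 # an action, drawn from the finite set A
Dist = Dict[Action, float]   # a probability distribution over actions

# Unplanned data received at a history: expected utilities, guesses about our
# own policy, estimated influences of actions here on probabilities elsewhere...
EpistemicState = Dict[str, float]

NarrowPolicy = Callable[[History], Dist]                   # only sees plannable data
BroadPolicy = Callable[[History, EpistemicState], Action]  # also reacts to unplanned data
```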

The Tree Of Observations (Contracted) is the tree of all the finite sequences of environmental observations/tree of all the plannable observations/tree of finite histories.

A key note here is that, in the original setting from the first two posts, there was some objective fact about how the environment reacted to your policy, and how probability flowed through the tree, and how your actions at some places influenced probabilities at other places. That's not the case here! There isn't an objective way that probability flows through this tree, though at each of the histories/plannables $h$, your epistemic state does have some opinions about how probability will continue to flow through the tree. As well as opinions about how your actions in the future of the tree will affect expected utilities and probabilities now, opinions about what your policy does at other places, et cetera.

The Tree Of Observations (Expanded) is the tree of all finite sequences of environmental observation/epistemic state pairs. Ie, a path through this tree is like "our first epistemic state is $\beta_0$. Our first environmental observation is $o_1$, our second epistemic state is $\beta_1$, our second environmental observation is $o_2$, our third epistemic state is $\beta_2$...". It's got branches for the unplanned observations/information we get from thinking more, as well as the plannable observations. As you can imagine, this tree has a really ridiculously huge branching factor. Another thing to note is that "history $h$", while it's a single node in the contracted tree of observations, is a huge equivalence class of nodes in the expanded tree of observations. So a broad-policy on the Tree of Observations (Expanded), if you tried to crunch it down to a narrow-policy on the Tree of Observations (Contracted), would look probabilistic. If you're gonna do some fancy computations depending on unplanned information when you get to a certain plannable event $h$, from past-you's vantage point, that will look like uncertainty about what future-you will do when it gets to $h$.
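A toy illustration of that last point, with made-up numbers: marginalize a deterministic broad policy over your uncertainty about which epistemic state you'll land in at $h$, and what's left is a distribution over actions, i.e. a probabilistic narrow policy.

```python
from collections import defaultdict

def crunch_down(broad_policy, history, epistemic_state_dist):
    """Marginalize a broad policy over uncertainty about the epistemic state
    reached at `history`, yielding a distribution over actions (what the broad
    policy looks like from the contracted tree's point of view)."""
    action_dist = defaultdict(float)
    for state, prob in epistemic_state_dist:
        action_dist[broad_policy(history, state)] += prob
    return dict(action_dist)

# Toy usage: the broad policy keys off a (hypothetical) utility estimate.
broad = lambda h, state: "act" if state["utility_if_act"] >= 0.5 else "wait"
print(crunch_down(broad, ("0", "1"), [({"utility_if_act": 0.2}, 0.3),
                                      ({"utility_if_act": 0.9}, 0.7)]))
# {'wait': 0.3, 'act': 0.7}
```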
