Coherence of Caches and Agents

Comments: Steven Byrnes, tailcalled, Sheikh Abdur Raheem Ali, johnswentworth, martinkunev, lukehmiles, dx26


> Now, a system which doesn't satisfy the coherence conditions could still maximize some other kind of utility function - e.g. utility over whole trajectories, or some kind of discounted sum of utility at each time-step, rather than utility over end states. But that's not very interesting, in general; any old system can be interpreted as maximizing some utility function over whole trajectories (i.e. the utility function which assigns high score to whatever the system actually does, and low score to everything else).

It’s probably not intended, but I think this wording vaguely implies a false dichotomy between “a thing (approximately) coherently pursues a long-term goal” and “an uninteresting thing like a rock”. There are other options like “Bob wants to eventually get out of debt, but Bob also wants to always act with honor and integrity”. See my post Consequentialism & Corrigibility.

Relatedly, I don’t think memetics is the only reason humans don’t approximately-coherently pursue states of the world in the distant future. (You didn’t say it was, but sorta gave that vibe.) For one thing, something can be pleasant or unpleasant right now. For another thing, the value function is defined and updated in conjunction with a flawed and incomplete world-model, as in your Pointers Problem post.

> For agents, the "large-scale property" of interest is maximizing utility over some stuff "far away" - e.g. far in the future, for the examples in this post.

One consideration that coherence theorems often seem to lack:

It seems to me that often, optimizers establish a boundary and do most of their optimization within that boundary. E.g. animals have a skin within which they maintain homeostasis, companies have offices and factories where they perform their work, states have borders and people have homes.

These don't entirely dodge coherence theorems - typically a substantial part of the point of these boundaries is to optimize some other thing in the future. But they do set something up, I feel.

I don’t understand this part:

> "any value function can be maximized by some utility function over short-term outcomes."

what is the difference between far in the future and near in the future?

Here's what it would typically look like in a control theory problem.

There's a long-term utility $u_{long}(x_T)$ which is a function of the final state $x_T$, and a short-term utility $u_{short}(t, x_t, a_t)$ which is a function of time $t$, the state $x_t$ at time $t$, and the action $a_t$ at time $t$. (Often the problem is formulated with a discount rate $\gamma$, but in this case we're allowing time-dependent short-term utility, so we can just absorb the discount rate into $u_{short}$.) The objective is then to maximize

$$u_{long}(x_T) + \sum_t u_{short}(t, x_t, a_t)$$

In that case, the value function is a max over trajectories starting at $(x_t, t)$:

$$V(x_t, t) = \max_{a_t, \ldots, a_{T-1}} \left[ u_{long}(x_T) + \sum_{t' = t}^{T-1} u_{short}(t', x_{t'}, a_{t'}) \right]$$

or, written recursively,

$$V(x_t, t) = \max_{a_t} \left[ u_{short}(t, x_t, a_t) + V(x_{t+1}, t+1) \right]$$

The key thing to notice is that we can solve that equation for $u_{short}$:

$$u_{short}(t, x_t, a_t) = V(x_t, t) - V(x_{t+1}, t+1)$$

So given an arbitrary value function $V$, we can find a short-term utility function $u_{short}$ which produces that value function, by using that equation to compute $u_{short}$ starting from the last timestep and working backwards.

Thus the claim from the post: for any value function, there exists a short-term utility function which induces that value function.

What if we restrict to only consider long-term utility, i.e. set $u_{short} = 0$? Well, then the value function is no longer so arbitrary. That's the case considered in the post, where we have constraints which the value function must satisfy regardless of $u_{long}$.

Did that clarify?
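The construction in this reply can be sketched numerically. Everything below is illustrative: the dynamics `f`, the state/action sets, and the pseudo-random value function are toy stand-ins I made up, not from the comment. The point is only that for an *arbitrary* value function, defining the short-term utility as the difference of consecutive values makes the Bellman equation hold.

```python
# Toy deterministic chain: pick an arbitrary value function V, solve the
# Bellman equation for u_short, and verify the equation then holds.
T = 4                      # number of timesteps
STATES = range(5)          # states 0..4
ACTIONS = [-1, 0, 1]       # move left, stay, move right

def f(x, a):
    """Deterministic dynamics: move within bounds."""
    return min(max(x + a, 0), 4)

# An arbitrary (pseudo-random but fixed) value function V[(x, t)].
V = {(x, t): (7 * x + 13 * t) % 5 for x in STATES for t in range(T + 1)}

def u_short(t, x, a):
    # Solved from the Bellman equation: u_short = V(x, t) - V(f(x, a), t+1).
    return V[(x, t)] - V[(f(x, a), t + 1)]

# Check: the Bellman equation holds at every state and time, for ANY V.
for t in range(T):
    for x in STATES:
        assert V[(x, t)] == max(u_short(t, x, a) + V[(f(x, a), t + 1)]
                                for a in ACTIONS)
print("Bellman equation satisfied for an arbitrary V")
```

Note the trick: since `u_short + V(next)` equals `V(x, t)` for *every* action, the max over actions is trivially `V(x, t)` regardless of which value function we started from.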

"Reward function" is a much more general term, which IMO has been overused to the point where it arguably doesn't even have a clear meaning. "Utility function" is less general: it always connotes an optimization objective, something which is being optimized for directly. And that basically matches the usage here.

I had to mull over it for five days, hunt down some background materials to fill in context, write follow up questions to a few friends (reviewing responses over phone while commuting), and then slowly chew through the math on pencil and paper when I could get spare time... but yes I understand now!

It might be relevant to note that the meaningfulness of this coherence definition depends on the chosen environment. For instance, in a deterministic forest MDP where an agent can never return to a previously-visited state and there is only one path between any two states, suppose we have a deterministic policy $\pi$ starting from $s_0$, and let $s_1 = \pi(s_0)$, $s_2 = \pi(s_1)$, etc. Then for the zero-current-payoff Bellman equations, we only need that $V(s_1) \geq V(s')$ for any successor $s'$ of $s_0$, $V(s_2) \geq V(s'')$ for any successor $s''$ of $s_1$, etc. We can achieve this easily by, for example, letting all values except $V(s_1), V(s_2), \ldots$ be near-zero; since $s_j$ is a successor of $s_i$ iff $j = i + 1$ (as otherwise there would be a cycle), this fits our criterion. Thus, every $\pi$ is coherent in this environment. (I haven't done the explicit math here, but I suspect that this also works for non-deterministic and non-stochastic MDPs.)

Importantly, using the common definition of language models in an RL setting where each state represents a sequence of tokens and each action adds a token to the end of a sequence of length $n$ to produce a sequence of length $n+1$, the environment is a deterministic forest, as there is only one way to "go between" two sequences (if one is a prefix of the other, choose the remaining tokens in order). Thus, *any language model is coherent*, which seems unsatisfying. We could try using a different environment, but this risks losing stochasticity (as the output logits of an LM are determined by its input sequence) and gets complicated pretty quickly (use natural abstractions/world model as states?).
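The construction in this comment can be sketched concretely. This is a toy version with my own names (`TOKENS`, `policy`, `V` are all illustrative): in a tree-shaped sequence MDP, assign high value along whatever trajectory the policy actually takes and near-zero value everywhere else, and the policy comes out greedy-consistent with that value function.

```python
# Toy sequence MDP: states are token strings, actions append one token.
# Any deterministic policy is made greedy-consistent by valuing its own
# trajectory at 1.0 and everything else at 0.0.
TOKENS = "ab"
T = 4

def policy(seq):                 # an arbitrary deterministic policy
    return "a" if seq.count("a") % 2 == 0 else "b"

# Collect the states the policy actually visits from the empty sequence.
visited = set()
seq = ""
for _ in range(T):
    visited.add(seq)
    seq += policy(seq)
visited.add(seq)

def V(seq):
    return 1.0 if seq in visited else 0.0

# Greediness check: at every visited state, the policy's chosen successor
# has (weakly) maximal value among all successors.
ok = all(V(s + policy(s)) >= V(s + tok)
         for s in visited if len(s) < T for tok in TOKENS)
print(ok)
```

Because the state space is a tree, the off-trajectory states are never revisited, so their near-zero values impose no constraint — which is the comment's point about why this environment makes every policy coherent.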

There's a lot of confusion about what coherence means for agents, and what "coherence theorems" do and don't say about agents. In this post, I'll talk about some particularly simple notions of coherence in a particularly simple setting. We'll see what nontrivial things coherence has to say, at least in a simple kind of environment, starting with an analogous notion of coherence for caches.

## What Kind Of "Coherence" We're Talking About Here

Let’s start with a standard CS-101-style example. We write a recursive python function to compute fibonacci numbers:
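The original code isn't reproduced here, but a minimal version of the recursive function described (following the post's convention that the sequence starts 1, 1) might look like:

```python
# Naive recursive fibonacci: exponential runtime, since fib(n) recomputes
# the same subproblems over and over.
def fib(n):
    if n < 2:
        return 1
    return fib(n - 1) + fib(n - 2)

print([fib(n) for n in range(6)])  # [1, 1, 2, 3, 5, 8]
```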

We pass in n = 0, then n = 1, then 2, then 3, etc. It spits out 1, 1, 2, 3, 5, 8, .... Great. Buuuuut it gets very slow very quickly as n increases; the runtime is exponential in n.

So, standard simple improvement: memoize. The first time `fib(n)` is computed for each value of n, cache it (i.e. "make a memo" of the result). Now the recursive calculation will only happen once for each value of n, so runtime is linear in n.
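A sketch of the memoized version (an assumed implementation, not necessarily the post's exact code):

```python
# Memoized fibonacci: each fib(n) is computed once and stored, so runtime
# is linear in n.
cache = {}

def fib(n):
    if n not in cache:
        cache[n] = 1 if n < 2 else fib(n - 1) + fib(n - 2)
    return cache[n]

print(fib(90))  # fast now, despite the large n
```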

Ok, that's the CS 101 part. Now on to coherence.

Imagine that the cache in our fibonacci program gets corrupted somehow. Maybe I mess around in the debugger and stick a few wrong numbers into it, maybe some other thread writes into it, whatever. Somehow, incorrect values end up in that cache.

Key point: we can notice the cache corruption "locally", i.e. by only looking at a small subset of the cache. Say, for instance, that cache[6] is corrupted - it should be 8 (the sixth fibonacci number), but instead let's say it's 11, and let's assume for now that the rest of the cache is fine. So we're looking in the cache, and we see:
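Concretely, the three relevant entries might look like this (a sketch following the post's numbers, where the corrupted slot holds 11 instead of 3 + 5 = 8):

```python
# Three adjacent entries of the (corrupted) cache:
cache = {4: 3, 5: 5, 6: 11}   # cache[6] should be 3 + 5 = 8

# The local fibonacci constraint fails at n = 6:
print(cache[4] + cache[5] == cache[6])  # False
```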

Well, just from those three entries we can tell that something's wrong, because 3 + 5 is not 11. It's supposed to be the case that cache[n] = cache[n-1] + cache[n-2] for any n bigger than 1, but that equation is not satisfied by these three cache entries. Our cache must be corrupt. And notice that we did not need to look at the rest of the cache in order to tell; we just needed to look at these three entries. That's what I mean when I say we can notice the cache corruption "locally".

We'll want a word for when that sort of thing isn't happening, i.e. a word which says that cache[n] is equal to cache[n-1] + cache[n-2] (in this particular example). For that, we'll use the word "coherence".

More generally: we'll say that a cache is coherent when small parts of the cache (like cache[n], cache[n-1], and cache[n-2] in this case) all locally satisfy some relationship (like cache[n] = cache[n-1] + cache[n-2]) which they're supposed to satisfy if everything is working correctly.

(Note that our usage here is a lot more general than the most common usage of "coherence" in CS; it's most similar to the use of "coherence" in formal logic. "Coherence" in CS is usually about the more specific case where different threads/processes/servers each have their own caches of the same information which might not match. That's a special case of the more general notion of "coherence" we'll use in this post.)

In the fibonacci example, if the whole cache is coherent, i.e. cache[n] = cache[n-1] + cache[n-2] for every n greater than 1, and cache[0] = cache[1] = 1, then the whole cache contains the values it's supposed to. In that case, the final cache entry, say e.g. cache[100], contains the result of fib(100).

More generally, we're typically interested in "coherence" in cases where all the local constraints together yield some useful property "at the large scale". In logic, that might be a property like truth-preservation: put true assumptions in, get true conclusions out. In our fibonacci example, the useful "large scale" property is that the cache in fact contains the fibonacci sequence, all the way out to its largest entry. And for agents, the "large scale" property will be that the agents maximize or apply a lot of optimization pressure to something far in their future.

## Value Cache

Moving one step closer to agents, let's talk about an optimization problem, and how to use a cache in order to solve it efficiently.

Here's the problem: at each timestep, I have a whole number between 0 and 99 (inclusive). At each time, I can choose one of the following operations to apply to my number:

Each number is worth some number of points ("utility") at the end, and I get 10 turns to transform my starting number into the highest-utility number I can. (Whatever intermediate numbers I calculate along the way don't matter for scoring, only the final number.)

We can efficiently solve the problem via memoization or dynamic programming (which, for our current purposes, are the same thing). For each number at each time, we'll calculate the best score we can get starting from that number at that time; that's the "value" of the number at that time. For instance, suppose I get n points if my final number is n, i.e. utility(n) = n. Then the value of 97 in the second-to-last round is 99, since the best I can do is add 2 to end with 99.

We first define the value function recursively (note that I'm using some weird utilities here, not utility(n) = n):
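Since the post's code isn't included here, below is a sketch of what the recursive value function might look like. The operation list and the "weird" utility are hypothetical stand-ins: the operations are chosen so that 23's successors are 23, 25, 22, and 70, matching the cache entries discussed later, but they are not confirmed to be the post's actual rules.

```python
# Hypothetical reconstruction of the recursive value function.
T = 10  # number of turns

# Candidate operations (mod 100 keeps the number in 0..99):
ops = [
    lambda n: n,                  # do nothing
    lambda n: (n + 2) % 100,      # add 2
    lambda n: (n - 1) % 100,      # subtract 1
    lambda n: (3 * n + 1) % 100,  # triple and add 1
]

def utility(n):
    return (n * 37) % 100         # placeholder "weird" utility

def value(n, t):
    """Best achievable final utility, starting from number n at time t."""
    if t == T:                    # game over: score the final number
        return utility(n)
    return max(value(op(n), t + 1) for op in ops)
```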

Walking through that code:

This value function will be exponentially slow with respect to T (not too bad if the game only goes 10 steps, but much worse if we increase that number). By memoizing, we can make the runtime linear with respect to T:
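A memoized version might look like the following (again with hypothetical operations and utility, since the post's aren't shown here). Each (n, t) pair is computed once, so runtime is linear in T rather than exponential:

```python
# Memoized (dynamic programming) value function.
T = 10
ops = [lambda n: n, lambda n: (n + 2) % 100,
       lambda n: (n - 1) % 100, lambda n: (3 * n + 1) % 100]

def utility(n):
    return (n * 37) % 100             # placeholder utility

cache = {}

def value(n, t):
    if (n, t) not in cache:
        if t == T:
            cache[(n, t)] = utility(n)   # boundary condition
        else:
            cache[(n, t)] = max(value(op(n), t + 1) for op in ops)
    return cache[(n, t)]
```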

Much like the fibonacci example, we now imagine that the cache might get corrupted somehow - e.g. I might mess with it in the debugger, or some other thread/process might write into it. Then, it makes sense to talk about coherence of the cache.

What would cache coherence mean here, in a similar sense to the fibonacci example? Well, consider five entries of the cache:

Those five entries should satisfy a local constraint:

cache[(n = 23, t = 5)] = max([cache[(n = 23, t = 6)], cache[(n = 25, t = 6)], cache[(n = 22, t = 6)], cache[(n = 70, t = 6)]])

That constraint says that the entry in cache[(n = 23, t = 5)] is in fact the max of the cache entries of each number reachable in the next timestep from 23.

"Coherence" for this cache, in the sense we're using the term, would mean that all such constraints are satisfied throughout the whole cache. As with the fibonacci example, if the whole cache is coherent and end-values match the utilities, then that fully determines all of the values in the cache.

What "large scale" property do we get, if the whole cache is coherent? Well, in that case, each entry (n, t) of the cache tells us the best utility which can be achieved (at the "much later" final time), starting from n at time t. To actually achieve that utility, at each timestep we can look at the values achievable in the next timestep, and choose whichever operation yields the highest value in the next timestep. That policy will then achieve the global best-possible utility over the whole game; that's the "large scale" property here.
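That greedy-on-values policy can be sketched in code. As before, the operations and utility below are hypothetical stand-ins (the post's actual rules aren't shown here), and `play` is my own name for the policy:

```python
# Act on a coherent value cache: at each step, apply whichever operation
# leads to the number with the highest cached value at the next timestep.
T = 10
ops = [lambda n: n, lambda n: (n + 2) % 100,
       lambda n: (n - 1) % 100, lambda n: (3 * n + 1) % 100]

def utility(n):
    return (n * 37) % 100

cache = {}

def value(n, t):                  # build a coherent cache by memoized DP
    if (n, t) not in cache:
        cache[(n, t)] = utility(n) if t == T else max(
            value(op(n), t + 1) for op in ops)
    return cache[(n, t)]

def play(n0):
    n = n0
    for t in range(T):
        # choose the op whose result has the highest value at t+1
        n = max((op(n) for op in ops), key=lambda m: value(m, t + 1))
    return n

# Greedy-on-values achieves the best possible utility from the start state:
print(utility(play(23)) == value(23, 0))  # True
```

The equality at the end is exactly the "large scale" property: because the cache is coherent, the value is preserved along the greedy trajectory, and at the final time the value *is* the utility.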

## Generalization: Agent's Value Function/Cache

We can generalize the value cache example to other agents which aim to maximize something. The agent has some terminal utility, i.e. the thing it's ultimately trying to optimize. In order to optimize that thing, it's useful to keep around a value function/cache, which represents the instrumental value of various things for achieving the terminal utility.

If we're e.g. using dynamic programming to build a controller, then we'd have a value cache much like the above example. If we're doing Q-learning, then we'd instead train a neural net to calculate the value function. In humans, it seems like we do something qualitatively like Q-learning: over the course of our lives, we learn to attach instrumental value to various things. Though notably, unlike typical Q-learning setups, we humans can also write things directly into our value-functions by talking to or watching other humans - e.g. maybe someone tells me that broccoli is healthy, so my brain reinforces a relatively-high value for broccoli in my learned value-function.

In context, it's pretty clear how the learned value function could end up incoherent: random stuff happens all the time which might reinforce "wrong" instrumental values! Memetics are a prime example of this: lots of potent memes boil down to "spreading the word about X is very good and important". If I hear that, and train it into my value function, then I'll behave to spread the meme - without the meme necessarily being tied to my actual terminal utility. It's like a virus copying itself around in humans' learned value functions.

And if the value function/cache is corrupted, then an agent acting according to those values won't actually maximize its terminal goals, whenever its trajectory runs into the corrupted parts.

## Coherence Is Not Utility-Dependent

Key thing to notice in our value cache example: the terminal utility function only shows up in the "boundary condition", i.e. the values at the very last timestep. The coherence conditions for the rest of the problem - i.e. the decisions at all earlier timesteps - are the same regardless of the terminal utility function. The values themselves might be different, but they'll all satisfy e.g.

cache[(n = 23, t = 5)] = max([cache[(n = 23, t = 6)], cache[(n = 25, t = 6)], cache[(n = 22, t = 6)], cache[(n = 70, t = 6)]])

regardless of what the utility is. In other words, the coherence conditions are a property of the environment, not the agent's specific goals.

Why is that interesting? Well, suppose we have some system sitting around, and it uses a value function/cache - for instance, maybe we did some Q-learning, and out popped a system which always takes whichever available action gets the highest output from the learned value function. Does that system maximize utility, for some utility function over the end-state? Well, as a necessary condition, we can check whether the values satisfy the coherence conditions (typically called a Bellman Equation, in this context). If those conditions aren't satisfied, then the system doesn't maximize utility for any utility function over the end-state.

Now, a system which doesn't satisfy the coherence conditions could still maximize some other kind of utility function - e.g. utility over whole trajectories, or some kind of discounted sum of utility at each time-step, rather than utility over end states. But that's not very interesting, in general; any old system can be interpreted as maximizing some utility function over whole trajectories (i.e. the utility function which assigns high score to whatever the system actually does, and low score to everything else). As we said earlier: coherence is interesting mainly when local coherence conditions add up to some cool large-scale property. For agents, the "large-scale property" of interest is maximizing utility over some stuff "far away" - e.g. far in the future, for the examples in this post. In other words, it's long-range planning that's of interest, not short-range planning; long-range planning is where coherence gives nontrivial constraints.

## Coherence of Policies

Recap of some key points so far:

- A cache is coherent when its small, local parts all satisfy the relationships they're supposed to satisfy.
- For a value cache/function, those coherence conditions depend only on the environment, not on the terminal utility function.
- An agent acting on an incoherent value function/cache does not maximize any utility function over the outcomes.

This required assuming a lot of structure in our agents: they need to have a value cache/function, which they use in a particular way in order to choose actions. Now it's time to drop that assumption, and instead talk about coherence of a policy - i.e. the function which maps each state-of-the-world to the choice the agent makes in that state.

Here's the key question: given some policy, is there any coherent value function which is consistent with that policy? We'll operationalize that two different ways:

1. A value function is consistent with a policy iff the policy always chooses the action with highest value according to the value function, and randomizes between highest-value options in case of a tie.
2. A value function is consistent with a policy iff the policy never chooses an action with lower value over one with higher value according to the value function.

These two operationalizations differ only in how they handle ties. Under the second operationalization, every policy is consistent with the "trivial value function" which assigns the same value to everything; so to get a nontrivial statement, we need to assume away that case. The first operationalization handles that problem by requiring the policy to randomize in case of ties, so if the policy doesn't randomize then there can't be any ties.

With those in mind, let's look at a policy which is incoherent - i.e. a policy which is not consistent with any coherent value function. We'll use the classic case of circular revealed preferences: at time t=2, the policy picks the cookie over the pizza, the pizza over the carrot, and the carrot over the cookie.

Now suppose there's some value function consistent with these choices, under operationalization 1 (ties must randomize). Then we must have:

value(cookie, t=2) > value(pizza, t=2)
value(pizza, t=2) > value(carrot, t=2)
value(carrot, t=2) > value(cookie, t=2)

Put all that together, and we get

value(cookie, t=2) > value(pizza, t=2) > value(carrot, t=2) > value(cookie, t=2)

i.e. value(cookie, t=2) > value(cookie, t=2), which is a contradiction. So, under operationalization 1, there is no value function consistent with this policy.
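The contradiction can also be checked mechanically: strict revealed preferences admit a consistent value assignment exactly when the preference graph has no cycle. A minimal sketch (the function and approach are mine, not the post's):

```python
# Strict preferences are consistent with SOME value function iff we can
# repeatedly peel off "maximal" items (items nothing remaining is strictly
# preferred to). Getting stuck means there's a cycle, so no value function
# exists under operationalization 1.
def strict_values_exist(prefs):
    items = {x for pair in prefs for x in pair}
    while items:
        maximal = [x for x in items
                   if not any(a in items and b == x for (a, b) in prefs)]
        if not maximal:
            return False          # cycle: no consistent strict ordering
        items -= set(maximal)
    return True

circular = {("cookie", "pizza"), ("pizza", "carrot"), ("carrot", "cookie")}
acyclic = {("cookie", "pizza"), ("pizza", "carrot")}
print(strict_values_exist(circular), strict_values_exist(acyclic))  # False True
```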

How about operationalization 2? Operationalization 2 works exactly the same way in this case, except the strict inequalities become non-strict:

value(cookie, t=2) ≥ value(pizza, t=2) ≥ value(carrot, t=2) ≥ value(cookie, t=2)

... which implies value(cookie, t=2) = value(pizza, t=2) = value(carrot, t=2), i.e. they all have the same value. Now, in this case there could still be a nontrivial value function consistent with the policy, if there's lots of other stuff which doesn't all have the same value. But the circular preferences forced the value function to be "a little more trivial" - i.e. to assign the same value to at least those three things. If there are enough circular preferences between enough different things, then all the values will be forced to be equal, which is the trivial case.

Key takeaway here: though different operationalizations differ in the details (specifically for indifference), the traditional example of circular preferences indeed rules out all nontrivial value functions, if there's sufficient circularity. So with enough circularity, a policy cannot maximize any nontrivial utility function over final-time outcomes.

## Summary and Takeaways

We started with a notion of "coherence" for a cache similar to the concept used in formal logic - i.e. local parts of the cache satisfy some relationships, such that the whole cache globally ends up doing something useful (like e.g. storing values of the fibonacci sequence).

We then applied that notion of coherence for caches to a value cache - i.e. a cache of instrumental values, of the sort computed in dynamic programming. That notion generalizes nicely to value functions, e.g. of the sort trained in Q-learning. We noted that the coherence constraints on a value function are independent of the terminal utility function - implying that an agent acting according to an incoherent value function does not maximize any utility function over final-time outcomes. The "over final-time outcomes" was important, though: we also claimed that, insofar as coherence is interesting for agents at all, it's relevant to long-term planning, not short-term planning; any value function can be maximized by some utility function over short-term outcomes.

Finally we moved to discussing coherence of policies, and saw that the classic case of a policy with sufficiently circular revealed preferences is indeed inconsistent with any nontrivial value function, and therefore does not maximize any nontrivial long-term utility function. (Here "trivial" referred to the trivial utility/value function which assigns the same constant value to everything.)

For people who are at least somewhat familiar with coherence, I expect the most important takeaway is that coherence is nontrivial for long-term planning specifically; it's short-term utility maximization which is consistent with e.g. the behavior of a rock.

Lastly, I'll emphasize that we talked about neither probability/nondeterminism, nor approximation. Intuitively, it seems clear that the arguments here should "weaken well", e.g. if the value function or policy isn't approximately coherent in some sense then it won't approximately maximize any utility function. But we didn't actually operationalize any of that.