Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Let’s start with the simplest coherence theorem: suppose I’ll pay to upgrade pepperoni pizza to mushroom, pay to upgrade mushroom to anchovy, and pay to upgrade anchovy to pepperoni. This does not bode well for my bank account balance. And the only way to avoid having such circular preferences is if there exists some “consistent preference ordering” of the three toppings - i.e. some ordering such that I will only pay to upgrade to a topping later in the order, never earlier. That ordering can then be specified as a utility function: a function which takes in a topping, and gives the topping’s position in the preference order, so that I will only pay to upgrade to a topping with higher utility.
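To make this concrete, here's a minimal sketch (my own illustration, not anything from the original argument) of how pairwise "will pay to upgrade" data either yields a utility function (position in a consistent ordering) or reveals circularity. The toppings and preferences are just examples.

```python
# Minimal sketch: check whether pairwise "will pay to upgrade X -> Y" preferences
# admit a consistent ordering, and if so read off a utility function as position
# in that ordering. Toppings and preferences here are illustrative.
from graphlib import TopologicalSorter, CycleError

def utility_from_preferences(upgrades):
    """upgrades: iterable of (worse, better) pairs, meaning 'will pay to upgrade worse -> better'."""
    graph = {}
    for worse, better in upgrades:
        graph.setdefault(better, set()).add(worse)  # 'worse' must come earlier in the ordering
        graph.setdefault(worse, set())
    try:
        order = list(TopologicalSorter(graph).static_order())
    except CycleError:
        return None  # circular preferences: no consistent ordering, hence no utility function over these options
    return {option: rank for rank, option in enumerate(order)}

# Circular preferences -> no consistent ordering (money pump possible).
print(utility_from_preferences([("pepperoni", "mushroom"),
                                ("mushroom", "anchovy"),
                                ("anchovy", "pepperoni")]))   # None

# Acyclic preferences -> utility = position in the preference order.
print(utility_from_preferences([("pepperoni", "mushroom"),
                                ("pepperoni", "anchovy"),
                                ("anchovy", "mushroom")]))    # {'pepperoni': 0, 'anchovy': 1, 'mushroom': 2}
```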

More advanced coherence theorems remove a lot of implicit assumptions (e.g. I could learn over time, and I might just face various implicit tradeoffs in the world rather than explicit offers to trade), and add more machinery (e.g. we can incorporate uncertainty and derive expected utility maximization and Bayesian updates). But they all require something-which-works-like-money.

Money has two key properties in this argument:

  • Money is additive across decisions. If I pay $1 to upgrade anchovy to pepperoni, and another $1 to upgrade pepperoni to mushroom, then I have spent $1 + $1 = $2.
  • All else equal, more money is good. If I spend $3 trading anchovy -> pepperoni -> mushroom -> anchovy, then I could have just stuck with anchovy from the start and had strictly more money, which would be better.

These are the conditions which make money a “measuring stick of utility”: more money is better (all else equal), and money adds. (Indeed, these are also the key properties of a literal measuring stick: distances measured by the stick along a straight line add, and bigger numbers indicate more distance.)
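A tiny sketch of those two properties in code (the toppings and dollar amounts are made up for illustration):

```python
# The two measuring-stick properties, operationalized:
# (1) additivity: money spent across a sequence of decisions just sums;
# (2) more is better: with the topping held fixed, the outcome with more money wins.

def total_spent(payments):
    # Property 1: the measuring-stick reading of a whole sequence is the sum of its parts.
    return sum(payments)

def better_outcome(a, b):
    # Property 2: compare (topping, money) outcomes when the topping is the same.
    (topping_a, money_a), (topping_b, money_b) = a, b
    if topping_a != topping_b:
        return None  # the measuring stick alone doesn't rank different toppings
    return a if money_a > money_b else b

print(total_spent([1, 1, 1]))  # 3: the anchovy -> pepperoni -> mushroom -> anchovy round trip
print(better_outcome(("anchovy", 10 - 3), ("anchovy", 10)))  # ('anchovy', 10): just keeping anchovy dominates
```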

Why does this matter?

There’s a common misconception that every system can be interpreted as a utility maximizer, so coherence theorems don’t say anything interesting. After all, we can always just pick some “utility function” which is maximized by whatever the system actually does. It’s the measuring stick of utility which makes coherence theorems nontrivial: if I spend $3 trading anchovy -> pepperoni -> mushroom -> anchovy, then it implies that either (1) I don’t have a utility function over toppings (though I could still have a utility function over some other silly thing, like e.g. my history of topping-upgrades), or (2) more money is not necessarily better, given the same toppings. Sure, there are ways for that system to “maximize a utility function”, but it can’t be a utility function over toppings which is measured by our chosen measuring stick.

Another way to put it: coherence theorems assume the existence of some resources (e.g. money), and talk about systems which are pareto optimal with respect to those resources - e.g. systems which “don’t throw away money”. Implicitly, we're assuming that the system generally "wants" more resources (instrumentally, not necessarily as an end goal), and we derive the system's "preferences" over everything else (including things which are not resources) from that. The agent "prefers" X over Y if it expends resources to get from Y to X. If the agent reaches a world-state which it could have reached with strictly less resource expenditure in all possible worlds, then it's not an expected utility maximizer - it "threw away money" unnecessarily. We assume that the resources are a measuring stick of utility, and then ask whether the system maximizes any utility function over the given state-space measured by that measuring stick.
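Here is a minimal version of that "threw away money" test, as a sketch only: it ignores uncertainty (the full criterion quantifies over all possible worlds), and the trajectory data is made up.

```python
# Sketch of the dominance test: an agent "threw away money" if it ended up in a
# world-state it could have reached with strictly less resource expenditure.
# (Deterministic toy version; the post's criterion is over all possible worlds.)

def threw_away_resources(actual, reachable):
    """actual: (end_state, resources_spent); reachable: other attainable (end_state, spent) pairs."""
    end_state, spent = actual
    return any(state == end_state and cost < spent for state, cost in reachable)

# The $3 topping round trip ends on anchovy, but "do nothing" also ends on anchovy for $0.
print(threw_away_resources(("anchovy", 3), [("anchovy", 0), ("mushroom", 1)]))   # True
print(threw_away_resources(("mushroom", 1), [("anchovy", 0), ("mushroom", 1)]))  # False
```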

Ok, but what about utility functions which don’t increase with resources?

As a general rule, we don’t actually care about systems which are “utility maximizers” in some trivial sense, like the rock which “optimizes” for sitting around being a rock. These systems are not very useful to think of as optimizers. We care about things which steer some part of the world into a relatively small state-space.

To the extent that we buy instrumental convergence, using resources as a measuring stick is very sensible. There are various standard resources in our environment, like money or energy, which are instrumentally useful for a very wide variety of goals. We expect a very wide variety of optimizers to “want” those resources, in order to achieve their goals. Conversely, we intuitively expect that systems which throw away such resources will not be very effective at steering parts of the world into relatively small state-space. They will be limited to fewer bits of optimization than systems which use those same resources pareto optimally.

So there’s an argument to be made that we don’t particularly care about systems which “maximize utility” in some sense which isn’t well measured by resources. That said, it’s an intuitive, qualitative argument, not really a mathematical one. What would be required in order to make it into a formal argument, usable for practical quantification and engineering?

The Measuring Stick Problem

The main problem is: how do we recognize a “measuring stick of utility” in the wild, in situations where we don’t already think of something as a resource? If somebody hands me a simulation of a world with some weird physics, what program can I run on that simulation to identify all the “resources” in it? And how does that notion of “resources” let me say useful, nontrivial things about the class of utility functions for which those resources are a measuring stick? These are the sorts of questions we need to answer if we want to use coherence theorems in a physically-reductive theory of agency.

If we could answer that question, in a way derivable from physics without any “agenty stuff” baked in a priori, then the coherence theorems would give us a nontrivial sense in which some physical systems do contain embedded agents, and other physical systems don’t. It would, presumably, allow us to bound the number of bits-of-optimization which a system can bring to bear, with more-coherent-as-measured-by-the-measuring-stick systems able to apply more bits of optimization, all else equal.

Comments

Very very good question IMO. Thank you for this.

Consider a person who comes to a very clear understanding of the world, such that they are extremely capable of designing things, building things, fundraising, collaborating, and so on. Consider a moment where this person is just about to embark on a project but has not yet acquired any resources, perhaps has not even made any connections with anyone at all, yet is highly likely to succeed in their project when they do embark. Would you say this person has “resources”? If not, there is a kind of continuous trading-in that will take place as this person exchanges their understanding of things for resources, then later for final outcomes. Is there really a line between understanding-of-things, resources, and outcomes? The interesting part is the thing that gives this person power, and that seems to be their understanding-of-things.

I came up with a handful of ideas, but one problem my ideas ran into was that strictly speaking, utility is not monotonically increasing with resources. Rather, utility is monotonically increasing with resources under your control; if a resource is controlled by an enemy, then that decreases your expected utility.

So that raises the question of whether we should define "resources" abstractly/generally, or "resources under your control" specifically. I suspect the former could be derived from the latter by considering something like causality or analogy. And the latter would need to start out with defining "you", which could presumably be done in the style of your "optimization at a distance".

And the latter would need to start out with defining "you", which could presumably be done in the style of your "optimization at a distance".

Yup, next post will argue that a natural solution to the Measuring Stick Problem pops out of optimization-at-a-distance.

Money is an abstraction. A useful abstraction of low complexity that people can coordinate on. Calling it a resource like energy makes sense to me on that abstraction level. But there are Weird Things About Money. People are optimizing relative wealth, not absolute wealth, and happiness is proportionate to wealth relative to peers (Envy Dominates Greed). Thus coordinating on this specific resource is sub-optimal globally even if it is efficient for individual participants locally. 

The same should hold for energy as long as you can convert money to energy.

Another way to put it: coherence theorems assume the existence of some resources (e.g. money), and talk about systems which are pareto optimal with respect to those resources - e.g. systems which “don’t throw away money”. Implicitly, we're assuming that the system generally "wants" more resources (instrumentally, not necessarily as an end goal), and we derive the system's "preferences" over everything else (including things which are not resources) from that. The agent "prefers" X over Y if it expends resources to get from Y to X. If the agent reaches a world-state which it could have reached with strictly less resource expenditure in all possible worlds, then it's not an expected utility maximizer - it "threw away money" unnecessarily. We assume that the resources are a measuring stick of utility, and then ask whether the system maximizes any utility function over the given state-space measured by that measuring stick.

That is the best explanation of "what the tool of coherence theorems does and is useful for" that I've ever read, including discussions with you. Thanks for that!

The key properties that make money a useful measuring stick are:

  • It's fungible: agents can purchase things they want through money (including other resources).
  • Agent preferences over money are monotonically nondecreasing as a function of quantity: more money is never worse (though it may not always be strictly better, e.g. some agents may have bounded utility functions). I think you alluded to this by describing money as "additive", but I don't think that's quite the right property; an additive resource isn't a useful measuring stick if the agent doesn't want more of it.

Something that you did not address is how the measuring stick relates to the utility function.

Consider several resources, and for each a function giving the utility to the agent of a given quantity of that resource, with the functions growing at different rates (say, one linear in quantity and another logarithmic).

Which of these resources is a measuring stick of utility? Are all of them?

I don't think this is a significant hindrance. If we know how utility of the resource grows as a function of quantity, we could back out a linear measure of utility by applying said function to the quantity of the commodity possessed.
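To make that concrete, here is a minimal sketch, assuming (purely for illustration) a log-shaped utility of money:

```python
# Sketch of "backing out a linear measure of utility" from a nonlinear resource.
# The log-shaped utility of money below is an assumed example, not from the comment.
import math

def u_money(q):
    return math.log(1 + q)   # diminishing returns: more money is better, but sublinearly

def utils_gained(q_before, q_after):
    # Measure a trade in "utils" rather than raw dollars; in these transformed
    # units the measuring stick is linear in utility by construction.
    return u_money(q_after) - u_money(q_before)

# Two successive gains add in utils even though utility is nonlinear in dollars.
a = utils_gained(10, 20)
b = utils_gained(20, 40)
print(abs((a + b) - utils_gained(10, 40)) < 1e-12)   # True
```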

 

As for the question of how to identify measuring sticks of utility in the wild, I think the two conditions highlighted earlier purchase a lot of mileage:

A commodity is a candidate measuring stick if:

  1. It can be traded for known resources (e.g. energy, money [specifically, it's commensurable with every other resource]) or is otherwise fungible for things the agent cares about
    • This fungibility should be in such a manner as to preserve the "monotonically nondecreasing as a function of quantity" property
  2. Agent preferences over the commodity are monotonically nondecreasing as a function of the quantity of the commodity possessed

Note that the measuring stick for utility does NOT require a common attribute of measuring sticks for length - every dollar is better to have than not, but does not need to be better by the same amount as other dollars.  It's perfectly acceptable to have a ranking of outcomes ( mushroom > pepperoni > anchovy > nothing ), with no numbers attached, and without implying that the pepperoni->mushroom upgrade is worse than two anchovy->pepperoni upgrades.  

How do you decide between a certainty of pepperoni and a coinflip of mushroom or anchovy?

Your ordinal preferences as stated don't answer that question - you decide by having more terms in your utility function. Assuming independence of outcome (which is NEVER true, but really handy when analyzing toys), you might be able to assign a numeric marginal value to each upgrade. Or you might just have an additional preference for certain pepperoni over 50/50 mushroom/anchovy. Heck, you might have a preference for certain anchovy over 50/50 mushroom/pepperoni (though this implies that uncertainty itself has negative value to you, in addition to your pizza-topping valuation).

Note that if you prefer A to B to C, but prefer "certain C" to "coinflip A or B", then you don't have a utility function over ABC.

You might, as in OP, have a utility function over something else. Like maybe "ABC plus a history of correctly predicting which of ABC I'd have".
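To spell out the arithmetic behind the "no utility function over ABC" claim (just the standard expected-utility check): for any utility function u over {A, B, C} with u(A) > u(B) > u(C),

$$\tfrac{1}{2}u(A) + \tfrac{1}{2}u(B) \;>\; \tfrac{1}{2}u(C) + \tfrac{1}{2}u(C) = u(C),$$

so the 50/50 lottery over A and B always beats certain C in expectation, and no assignment of utilities to A, B, C alone can reproduce the stated preference.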

Summary

Epistemic status: I may misunderstand parts of the post.

Utility functions are inherently coherent:

  • More utility is better
  • Twice as much is twice as good.

One problem is that we usually do not know the utility function of a system. Knowing this could be useful, for example, if we want to talk about what an advanced AI is optimizing for.

So how do we figure out the utility of a system? One solution is to have a measuring stick of it, i.e., a resource such that:

  • The system finds the resource only instrumentally valuable.
  • The system can trade the resource in equal proportions to utility

Possible examples of such a resource:

  • Money
  • Energy

If we have found such a measuring stick for a certain utility maximizer, then we can use it to determine the utility function by observing which trades the system would make. Additionally, we may measure the extent to which the system is a utility maximizer at all by measuring how coherent its trades with the resource are. 
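As a rough sketch of that last step, one could fit utilities to observed trades and use the fit error as a coherence score. The trade data and the least-squares approach below are my own illustrative assumptions, not anything specified above.

```python
# Read off a utility function from observed trades, and score how coherent the
# system is with respect to the measuring stick (illustrative sketch).
import numpy as np

states = ["anchovy", "pepperoni", "mushroom"]
# Each observed trade (from_state, to_state, price_paid) should satisfy
# u(to) - u(from) = price, if money really is a measuring stick of utility.
trades = [("anchovy", "pepperoni", 1.0),
          ("pepperoni", "mushroom", 1.0),
          ("anchovy", "mushroom", 2.0)]

idx = {s: i for i, s in enumerate(states)}
A = np.zeros((len(trades) + 1, len(states)))
b = np.zeros(len(trades) + 1)
for row, (src, dst, price) in enumerate(trades):
    A[row, idx[dst]] = 1.0
    A[row, idx[src]] = -1.0
    b[row] = price
A[-1, idx["anchovy"]] = 1.0  # pin u(anchovy) = 0; utility is only defined up to an additive constant

u, residuals, *_ = np.linalg.lstsq(A, b, rcond=None)
print({s: round(float(v), 3) for s, v in zip(states, u)})
# {'anchovy': 0.0, 'pepperoni': 1.0, 'mushroom': 2.0}

# Residual ~0 means the trades are coherent with respect to this measuring stick;
# a large residual means some money is being "thrown away".
print(float(residuals[0]) if residuals.size else 0.0)  # ~0.0
```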

The measuring stick problem asks: how do we find a measuring stick in the wild, when we're handed a universe?


Opinion/Confusions

There are some things that confuse me about this and that make me wonder if this is the right frame. There seem to be some hidden assumptions about optimization in the post that are not transparent to me personally. Some of the confusions may be answered in the post itself, but they do not… "stick".

  • John seems to assume that there is one resource that is equally valuable for maximizing utility for all systems. Why is that? Resources like energy may seem useful for transforming the whole universe, but maybe my utility function is not about making such large-scale transformations.
    • Take humans as examples: most humans experience strong diminishing returns in almost all (or maybe even all) resources. 
  • Possibly John’s answer would be “well, if you don’t want to make large-scale transformations, then you don’t have a utility function, because utility functions are about transforming a large system into a small set of states”. 
    • John’s earlier post on utility maximization being description length minimization seems to support that view. 
    • There is, however, an intuition in myself that finds that post unnatural, especially since the mapping between utility function and probability distribution had a free parameter
  • What if a system has a utility function, but it’s simply not about states in the same space that John cares about? 
    • E.g., some entity may care about optimizing trajectories, not states. 
      • In that example, we would end up with a distribution over trajectories instead of states. I would imagine that a different type of resource would help in optimizing that than in optimizing states themselves. 

Utility functions are inherently coherent:

  • More utility is better
  • Twice as much is twice as good.

I'm not sure that "twice as much is twice as good" is a necessary property of utility functions. Though I guess it may be true by definition?

IIRC, utility functions are only unique up to positive affine transformations. And also I think only the gaps in utility between pairs of options are meaningful.

100/200 utility is meaningless without a "baseline" with which to evaluate it. And even then, it's the gap between the baseline and a given utility value that would be meaningful not the utility value itself.

I suspect your last claim about the coherence of utility functions is just wrong.

 

The system can trade the resource in equal proportions to utility

Human utility functions over money are very much nonlinear (closer to logarithmic, in fact), and utility functions over energy may also be sublinear.

The law of diminishing marginal utility suggests that sublinear utility over any resource is the norm for humans.

 

  • John seems to assume that there is one resource that is equally valuable for maximizing utility for all systems. Why is that?
    • Resources like energy may seem useful for transforming the whole universe, but maybe my utility function is not about making such large-scale transformations. Take humans as examples: most humans experience strong diminishing returns in almost all (or maybe even all) resources.

I think this is only an issue because you assume that for a resource to qualify as a "measuring stick", the quantity of the resource possessed must be a linear function of utility.

I think that's an unnecessary assumption and not very sensible because as you said, diminishing marginal returns on resources is nigh universal for humans.

Also, I don't think making large-scale changes is relevant/load-bearing for a resource to be a measuring stick of utility.

I think the only requirements are that:

  1. The resource is fungible/can be traded for other resources or things the agent cares about
  2. Agent preferences over the resource are monotonically nondecreasing as a function of the quantity of the resource possessed (this is IMO the property that Wentworth was gesturing at with "additive")

See also my top level comment.

If the rate of entropy generation were increasing within a physical system, wouldn't that indicate an optimization process? As the physical system went through a narrower and narrower band of state space, that would have to be the result of some optimizer.

 The utility function could be represented by the path that is taken through state space. Each next step would be higher in the optimizer's preferences, or it is "throwing away money". 

I do not know how to reverse engineer a utility function from a path through state space, though. (Could resources be represented by the amount of entropy you're willing to generate to move from your current state to a different one?)

Epistemic status: probably wrong; intuitively, I feel like I'm onto something but I'm too uncertain about this framing to be confident in it

I refer to optimizers which can be identified by a measuring stick of utility as "agenty optimizers".

The measuring stick is optimization power. In particular, in the spirit of this sequence, it is the correlation between local optimization and optimization far away. If I have 4 basic actions available to me and each performs two bits of optimization on the universe, I am maximally powerful (for a structure with 4 basic actions) and most definitely either an agent or constructed by one. I speak and the universe trembles.
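A quick numerical gloss on the "4 actions, two bits" claim (the toy action-to-outcome maps below are my own assumption): with 4 equally likely basic actions, the choice of action can carry at most log2(4) = 2 bits about distant outcomes, and it carries the full 2 bits exactly when every action leads to a different distant outcome.

```python
# Mutual information (in bits) between a uniformly random action and the distant outcome,
# as a toy measure of "bits of optimization applied far away".
import math
from collections import Counter

def bits_of_optimization(action_to_outcome):
    """action_to_outcome: dict mapping each basic action to the distant outcome it produces."""
    n = len(action_to_outcome)
    h_action = math.log2(n)  # entropy of a uniform choice among n actions
    outcome_counts = Counter(action_to_outcome.values())
    # H(action | outcome): actions mapping to the same outcome are indistinguishable from far away.
    h_action_given_outcome = sum((c / n) * math.log2(c) for c in outcome_counts.values())
    return h_action - h_action_given_outcome

print(bits_of_optimization({"a1": "w1", "a2": "w2", "a3": "w3", "a4": "w4"}))  # 2.0 (maximal for 4 actions)
print(bits_of_optimization({"a1": "w1", "a2": "w1", "a3": "w2", "a4": "w2"}))  # 1.0
```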

One might look at life on Earth and see that it is unusually structured and unusually purposeful and conclude that it is the work of an agenty optimizer. And they would be wrong.

But if they looked closer, at the pipelines and wires and radio waves on Earth, they might conclude that they were the work of an agenty optimizer because they turn small actions (flipping a switch, pressing a key) into large, distant effects (water does or doesn't arrive at a village, a purchase is confirmed and a bushel of apples is shipped across the planet). And they would be correct.

In this framing, resources under my control are structures which propagate and amplify my outputs out into large, distant effects (they needn't be friendly, per se, they just have to be manipulable). Thus, a dollar (+ Amazon + a computer + ...) is an invaluable resource because, with it, I can cause any of literally millions of distinct objects to move from one part of the world to another by moving my fingers in the right way. And I can do that because the world has been reshaped to bend to my will in a way that clearly indicates agency to anyone who knows how to look.

However, I haven't the slightest idea how to turn this framework into a method for actually identifying agents (or resources) in a universe with weird physics.

Also, I have a sense that there is an important difference between accumulating asymmetric power (allies, secret AI) and creating approximately symmetrically empowering infrastructure (Elicit), which is not captured by this framework. Maybe the former is evidence of instrumental resource accumulation whereas the latter provides specific information about the creator's goals? But both *are* clear signs of agenty optimization, so maybe it's not relevant to this context?

Also possibly of note is that more optimization power is not strictly desirable because having too many choices might overwhelm your computational limitations.

I'm not sure if this is helpful, but I tend to think of the term "resources" as referring to things that are expended when used (like dollars or fuel cells). I think of reusable tools (like screwdrivers or light-switches) as being in a different category.

(I realize that approximately all tools will in fact wear out after some amount of use, but these still feel like naturally-distinct categories for most things in my actual life.  I do not think of my screwdriver or my measuring tape or my silverware as having some finite number of charges that are being expended each time I use them.)

EDIT:  Reworded for clarity.

I was going to comment that this older post might provide insight into this problem, and then realized that you wrote that older post...

And retroactively incorporated it as the first post of this sequence, too!

Typo:  reverse X and Y in one of these.

The agent "prefers" X over Y if it expends resources to get from X to Y.

Fixed, thanks.

Why is a resource central here? Consider (if it helps, also change "upgrade" to "switch"):

Let’s start with the simplest coherence theorem: suppose I’ll pay to upgrade pepperoni pizza to mushroom, pay to upgrade mushroom to anchovy, and pay to upgrade anchovy to pepperoni. This does not bode well for my bank account balance. And the only way to avoid having such circular preferences is if there exists some “consistent preference ordering” of the three toppings - i.e. some ordering such that I will only pay to upgrade to a topping later in the order, never earlier. That ordering can then be specified as a utility function: a function which takes in a topping, and gives the topping’s position in the preference order, so that I will only pay to upgrade to a topping with higher utility.

Surely one can notice the circularity problem without using a measuring stick.

If there's no resource loss in that cycle, why is it a problem at all?

Transitivity is a fundamental axiom necessary for a consistent utility function, which is central to rational choice theory. Sure, the potential for resource loss makes it more problematic for the agents you're studying, but if you don't have a consistent utility function to support your modeling in the first place, it's already problematic for your studying of the agents. Put another way, you don't even need to "reach" the coherence argument if you can't get over the consistency bar.

 

Suppose an agent is thinking about whether to cooperate in a Prisoner's Dilemma. In the counterfactual where it cooperates, it might naturally deduce that other agents like it would also cooperate. Therefore we could hand it a world with weird physics, and see whether in the counterfactual where it cooperates, it can deduce more about that world. Then it has presumably found agents like itself in that world.

Prisoner's Dilemma? Counterfactual? Agent? Cooperation? We're talking about starting from low-level physics; there isn't even a built-in place to introduce these very-high-level concepts!

The agent I'm talking about is separate from your physics-based world. It's from toy setups like Robust Cooperation in the Prisoner's Dilemma. If it can reason about statements like "If my algorithm returns that I cooperate, then I get 3 utility.", then there may be some property p for which it can prove "If my algorithm returns that I cooperate, then this strange hypothetical-physics-based world has property p." but not "This strange hypothetical-physics-based world has property p.". This would indicate that that strange world contains agents about which that premise is useful, so we can use modal combatants as agent detectors.