Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Let’s start with the simplest coherence theorem: suppose I’ll pay to upgrade pepperoni pizza to mushroom, pay to upgrade mushroom to anchovy, and pay to upgrade anchovy to pepperoni. This does not bode well for my bank account balance. And the only way to avoid having such circular preferences is if there exists some “consistent preference ordering” of the three toppings - i.e. some ordering such that I will only pay to upgrade to a topping later in the order, never earlier. That ordering can then be specified as a utility function: a function which takes in a topping, and gives the topping’s position in the preference order, so that I will only pay to upgrade to a topping with higher utility.
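The money pump above can be sketched in a few lines (a toy illustration; the names and prices are hypothetical, chosen to match the example):

```python
# Circular preferences from the example: I'll pay $1 for each "upgrade".
pay_to_upgrade = {
    ("pepperoni", "mushroom"): 1,
    ("mushroom", "anchovy"): 1,
    ("anchovy", "pepperoni"): 1,
}

def run_money_pump(start_topping, n_trades):
    """Repeatedly offer me the upgrade I'm willing to pay for; track my losses."""
    topping, spent = start_topping, 0
    for _ in range(n_trades):
        for (frm, to), price in pay_to_upgrade.items():
            if frm == topping:
                topping, spent = to, spent + price
                break
    return topping, spent

final_topping, total_spent = run_money_pump("pepperoni", 300)
# After 300 trades I'm right back at pepperoni, $300 poorer.
```

Any agent with a cycle in its "will pay to upgrade" relation can be driven around that cycle indefinitely, losing money on every lap.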

More advanced coherence theorems remove a lot of implicit assumptions (e.g. I could learn over time, and I might just face various implicit tradeoffs in the world rather than explicit offers to trade), and add more machinery (e.g. we can incorporate uncertainty and derive expected utility maximization and Bayesian updates). But they all require something-which-works-like-money.

Money has two key properties in this argument:

  • Money is additive across decisions. If I pay $1 to upgrade anchovy to pepperoni, and another $1 to upgrade pepperoni to mushroom, then I have spent $1 + $1 = $2.
  • All else equal, more money is good. If I spend $3 trading anchovy -> pepperoni -> mushroom -> anchovy, then I could have just stuck with anchovy from the start and had strictly more money, which would be better.

These are the conditions which make money a “measuring stick of utility”: more money is better (all else equal), and money adds. (Indeed, these are also the key properties of a literal measuring stick: distances measured by the stick along a straight line add, and bigger numbers indicate more distance.)
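Concretely, a consistent preference ordering exists exactly when the "will pay to upgrade" relation is acyclic, and in that case ranking items by a topological order gives the utility function described above. A minimal sketch (Python 3.9+; illustrative names):

```python
from graphlib import TopologicalSorter, CycleError

def utility_from_preferences(upgrades):
    """Given pairs (worse, better) meaning 'I'll pay to upgrade worse -> better',
    return a utility function mapping each item to its rank in a consistent
    preference ordering, or None if the preferences are circular."""
    deps = {}  # item -> set of items it is preferred over
    for worse, better in upgrades:
        deps.setdefault(better, set()).add(worse)
        deps.setdefault(worse, set())
    try:
        order = list(TopologicalSorter(deps).static_order())
    except CycleError:  # circular preferences: no consistent ordering exists
        return None
    return {item: rank for rank, item in enumerate(order)}

# Consistent preferences yield a utility function...
u = utility_from_preferences([("pepperoni", "mushroom"), ("mushroom", "anchovy")])
# ...while the circular preferences from the opening example yield None.
u_circular = utility_from_preferences(
    [("pepperoni", "mushroom"), ("mushroom", "anchovy"), ("anchovy", "pepperoni")])
```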

Why does this matter?

There’s a common misconception that every system can be interpreted as a utility maximizer, so coherence theorems don’t say anything interesting. After all, we can always just pick some “utility function” which is maximized by whatever the system actually does. It’s the measuring stick of utility which makes coherence theorems nontrivial: if I spend $3 trading anchovy -> pepperoni -> mushroom -> anchovy, then it implies that either (1) I don’t have a utility function over toppings (though I could still have a utility function over some other silly thing, like e.g. my history of topping-upgrades), or (2) more money is not necessarily better, given the same toppings. Sure, there are ways for that system to “maximize a utility function”, but it can’t be a utility function over toppings which is measured by our chosen measuring stick.

Another way to put it: coherence theorems assume the existence of some resources (e.g. money), and talk about systems which are pareto optimal with respect to those resources - e.g. systems which “don’t throw away money”. Implicitly, we're assuming that the system generally "wants" more resources (instrumentally, not necessarily as an end goal), and we derive the system's "preferences" over everything else (including things which are not resources) from that. The agent "prefers" X over Y if it expends resources to get from Y to X. If the agent reaches a world-state which it could have reached with strictly less resource expenditure in all possible worlds, then it's not an expected utility maximizer - it "threw away money" unnecessarily. We assume that the resources are a measuring stick of utility, and then ask whether the system maximizes any utility function over the given state-space measured by that measuring stick.
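The "threw away money" criterion can be phrased as a simple dominance check. Here's a deterministic simplification (the full version quantifies over all possible worlds; names are hypothetical):

```python
def threw_away_money(agent_outcome, alternative_outcomes):
    """Each outcome is an (end_state, resources_spent) pair. The agent fails
    the coherence test if some alternative reaches the same end state with
    strictly less resource expenditure."""
    end_state, spent = agent_outcome
    return any(alt_end == end_state and alt_spent < spent
               for alt_end, alt_spent in alternative_outcomes)

# The circular trader ends up holding anchovy having spent $3,
# but "do nothing" reaches anchovy for $0:
wasted = threw_away_money(("anchovy", 3), [("anchovy", 0), ("mushroom", 1)])
```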

Ok, but what about utility functions which don’t increase with resources?

As a general rule, we don’t actually care about systems which are “utility maximizers” in some trivial sense, like the rock which “optimizes” for sitting around being a rock. These systems are not very useful to think of as optimizers. We care about things which steer some part of the world into a relatively small state-space.

To the extent that we buy instrumental convergence, using resources as a measuring stick is very sensible. There are various standard resources in our environment, like money or energy, which are instrumentally useful for a very wide variety of goals. We expect a very wide variety of optimizers to “want” those resources, in order to achieve their goals. Conversely, we intuitively expect that systems which throw away such resources will not be very effective at steering parts of the world into relatively small state-space. They will be limited to fewer bits of optimization than systems which use those same resources pareto optimally.

So there’s an argument to be made that we don’t particularly care about systems which “maximize utility” in some sense which isn’t well measured by resources. That said, it’s an intuitive, qualitative argument, not really a mathematical one. What would be required in order to make it into a formal argument, usable for practical quantification and engineering?

The Measuring Stick Problem

The main problem is: how do we recognize a “measuring stick of utility” in the wild, in situations where we don’t already think of something as a resource? If somebody hands me a simulation of a world with some weird physics, what program can I run on that simulation to identify all the “resources” in it? And how does that notion of “resources” let me say useful, nontrivial things about the class of utility functions for which those resources are a measuring stick? These are the sorts of questions we need to answer if we want to use coherence theorems in a physically-reductive theory of agency.

If we could answer that question, in a way derivable from physics without any “agenty stuff” baked in a priori, then the coherence theorems would give us a nontrivial sense in which some physical systems do contain embedded agents, and other physical systems don’t. It would, presumably, allow us to bound the number of bits-of-optimization which a system can bring to bear, with more-coherent-as-measured-by-the-measuring-stick systems able to apply more bits of optimization, all else equal.

Comments

I came up with a handful of ideas, but one problem my ideas ran into was that strictly speaking, utility is not monotonically increasing with resources. Rather, utility is monotonically increasing with resources under your control; if a resource is controlled by an enemy, then that decreases your expected utility.

So that raises the question of whether we should define "resources" abstractly/generally, or "resources under your control" specifically. I suspect the former could be derived from the latter by considering something like causality or analogy. And the latter would need to start out with defining "you", which could presumably be done in the style of your "optimization at a distance".

And the latter would need to start out with defining "you", which could presumably be done in the style of your "optimization at a distance".

Yup, next post will argue that a natural solution to the Measuring Stick Problem pops out of optimization-at-a-distance.

Money is an abstraction - a useful, low-complexity abstraction that people can coordinate on. Calling it a resource like energy makes sense to me on that abstraction level. But there are Weird Things About Money: people optimize relative wealth, not absolute wealth, and happiness is proportional to wealth relative to peers (Envy Dominates Greed). Thus coordinating on this specific resource is sub-optimal globally even if it is efficient for individual participants locally.

The same should hold for energy as long as you can convert money to energy.

Another way to put it: coherence theorems assume the existence of some resources (e.g. money), and talk about systems which are pareto optimal with respect to those resources - e.g. systems which “don’t throw away money”. Implicitly, we're assuming that the system generally "wants" more resources (instrumentally, not necessarily as an end goal), and we derive the system's "preferences" over everything else (including things which are not resources) from that. The agent "prefers" X over Y if it expends resources to get from Y to X. If the agent reaches a world-state which it could have reached with strictly less resource expenditure in all possible worlds, then it's not an expected utility maximizer - it "threw away money" unnecessarily. We assume that the resources are a measuring stick of utility, and then ask whether the system maximizes any utility function over the given state-space measured by that measuring stick.

That is the best explanation of "what the tool of coherence theorems does and is useful for" that I've ever read, including discussions with you. Thanks for that!

Note that the measuring stick for utility does NOT require a common attribute of measuring sticks for length - every dollar is better to have than not, but does not need to be better by the same amount as other dollars. It's perfectly acceptable to have a ranking of outcomes (mushroom > pepperoni > anchovy > nothing), with no numbers attached, and without implying that the pepperoni->mushroom upgrade is worse than two anchovy->pepperoni upgrades.

How do you decide between a certainty of pepperoni and a coinflip of mushroom or anchovy?

Your ordinal preferences as stated don't answer that question - you decide by having more terms in your utility function. Assuming independence of outcomes (which is NEVER true, but really handy when analyzing toys), you might be able to assign a numeric marginal value to each upgrade. Or you might just have an additional preference for certain pepperoni over 50/50 mushroom/anchovy. Heck, you might have a preference for certain anchovy over 50/50 mushroom/pepperoni (though then the uncertainty itself carries negative value for you, in addition to your pizza-topping valuation).
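To see how the ordinal ranking underdetermines the lottery choice, note that different cardinal assignments, all consistent with mushroom > pepperoni > anchovy, give opposite answers (illustrative numbers only):

```python
# Two cardinal utility assignments with the same ordinal ranking.
u_steep = {"mushroom": 10, "pepperoni": 2, "anchovy": 1}
u_flat = {"mushroom": 3, "pepperoni": 2.9, "anchovy": 1}

def prefers_coinflip(u):
    """Compare a 50/50 mushroom/anchovy lottery against certain pepperoni."""
    return 0.5 * u["mushroom"] + 0.5 * u["anchovy"] > u["pepperoni"]

# u_steep takes the coinflip (EV 5.5 > 2); u_flat keeps the pepperoni (EV 2.0 < 2.9).
```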

Note that if you prefer A to B to C, but prefer "certain C" to "coinflip A or B", then you don't have a utility function over ABC.

You might, as in OP, have a utility function over something else. Like maybe "ABC plus a history of correctly predicting which of ABC I'd have".
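A quick check of that claim: for any utility assignment respecting A > B > C, the coinflip's expected utility (u(A) + u(B))/2 necessarily exceeds u(C), so preferring certain C contradicts maximizing any such function. A sketch (illustrative numbers):

```python
def coinflip_beats_certain_c(u_a, u_b, u_c):
    """For any utilities with u_a > u_b > u_c, the 50/50 A-or-B lottery has
    expected utility (u_a + u_b)/2, which necessarily exceeds u_c."""
    assert u_a > u_b > u_c
    return (u_a + u_b) / 2 > u_c

# Holds for every assignment respecting the ordering, e.g.:
examples = [(3, 2, 1), (100, 1.1, 1), (0.0, -1.0, -2.0)]
results = [coinflip_beats_certain_c(*e) for e in examples]
```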

If the rate of entropy generation were increasing within a physical system, wouldn't that indicate an optimization process? As the physical system passed through a narrower and narrower band of state space, that would have to be the result of some optimizer.

The utility function could be represented by the path taken through state space. Each next step would be higher in the optimizer's preferences, or else it would be "throwing away money".

I do not know how to reverse-engineer a utility function from a path through state space, though. (Could resources be represented by the amount of entropy you're willing to generate to move from your current state to a different one?)

Epistemic status: probably wrong; intuitively, I feel like I'm onto something but I'm too uncertain about this framing to be confident in it

I refer to optimizers which can be identified by a measuring stick of utility as agenty optimizers.

The measuring stick is optimization power. In particular, in the spirit of this sequence, it is the correlation between local optimization and optimization far away. If I have 4 basic actions available to me and each performs two bits of optimization on the universe, I am maximally powerful (for a structure with 4 basic actions) and most definitely either an agent or constructed by one. I speak and the universe trembles.
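One simple way to cash out "bits of optimization" here, assuming a uniform measure over states (a toy formalization, not something from the comment itself):

```python
import math

def bits_of_optimization(n_prior_states, n_posterior_states):
    """Bits of optimization applied, measured as the log2 shrinkage of the
    (uniformly weighted) reachable state space."""
    return math.log2(n_prior_states / n_posterior_states)

# An action steering a 16-state space down to 4 states applies 2 bits;
# 4 such actions each applying 2 bits is the "maximally powerful" case above.
two_bits = bits_of_optimization(16, 4)
```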

One might look at the life on Earth and see that it is unusually structured and unusually purposeful and conclude that it is the work of an agenty optimizer. And they would be wrong.

But if they looked closer, at the pipelines and wires and radio waves on Earth, they might conclude that they were the work of an agenty optimizer because they turn small actions (flipping a switch, pressing a key) into large, distant effects (water does or doesn't arrive at a village, a purchase is confirmed and a bushel of apples is shipped across the planet). And they would be correct.

In this framing, resources under my control are structures which propagate and amplify my outputs out into large, distant effects (they needn't be friendly, per se, they just have to be manipulable). Thus, a dollar (+ Amazon + a computer + ...) is an invaluable resource because, with it, I can cause any of literally millions of distinct objects to move from one part of the world to another by moving my fingers in the right way. And I can do that because the world has been reshaped to bend to my will in a way that clearly indicates agency to anyone who knows how to look.

However, I haven't the slightest idea how to turn this framework into a method for actually identifying agents (or resources) in a universe with weird physics.

Also, I have a sense that there is an important difference between accumulating asymmetric power (allies, secret AI) and creating approximately symmetrically empowering infrastructure (Elicit), which is not captured by this framework. Maybe the former is evidence of instrumental resource accumulation whereas the latter provides specific information about the creator's goals? But both *are* clear signs of agenty optimization, so maybe it's not relevant to this context?

Also possibly of note is that more optimization power is not strictly desirable because having too many choices might overwhelm your computational limitations.

I'm not sure if this is helpful, but I tend to think of the term "resources" as referring to things that are expended when used (like dollars or fuel cells). I think of reusable tools (like screwdrivers or light-switches) as being in a different category.

(I realize that approximately all tools will in fact wear out after some amount of use, but these still feel like naturally-distinct categories for most things in my actual life.  I do not think of my screwdriver or my measuring tape or my silverware as having some finite number of charges that are being expended each time I use them.)

EDIT:  Reworded for clarity.

I was going to comment that this older post might provide insight into this problem, and then realized that you wrote that older post...

And retroactively incorporated it as the first post of this sequence, too!

Typo:  reverse X and Y in one of these.

The agent "prefers" X over Y if it expends resources to get from X to Y.

Why is a resource central here? Consider (if it helps, also change "upgrade" to "switch"):

Let’s start with the simplest coherence theorem: suppose I’ll pay to upgrade pepperoni pizza to mushroom, pay to upgrade mushroom to anchovy, and pay to upgrade anchovy to pepperoni. This does not bode well for my bank account balance. And the only way to avoid having such circular preferences is if there exists some “consistent preference ordering” of the three toppings - i.e. some ordering such that I will only pay to upgrade to a topping later in the order, never earlier. That ordering can then be specified as a utility function: a function which takes in a topping, and gives the topping’s position in the preference order, so that I will only pay to upgrade to a topping with higher utility.

Surely one can notice the circularity problem without using a measuring stick.

If there's no resource loss in that cycle, why is it a problem at all?

Transitivity is a fundamental axiom necessary for a consistent utility function, which is central to rational choice theory. Sure, the potential for resource loss makes it more problematic for the agents you're studying, but if you don't have a consistent utility function to support your modeling in the first place, it's already problematic for your studying of the agents. Put another way, you don't even need to "reach" the coherence argument if you can't get over the consistency bar.

 

Very very good question IMO. Thank you for this.

Consider a person who comes to a very clear understanding of the world, such that they are extremely capable of designing things, building things, fundraising, collaborating, and so on. Consider a moment where this person is just about to embark on a project but has not yet acquired any resources, perhaps has not even made any connections with anyone at all, yet is highly likely to succeed in their project when they do embark. Would you say this person has “resources”? If not, there is a kind of continuous trading-in that will take place as this person exchanges their understanding of things for resources, then later for final outcomes. Is there really a line between understanding-of-things, resources, and outcomes? The interesting part is the thing that gives this person power, and that seems to be their understanding-of-things.

Suppose an agent is thinking about whether to cooperate in a Prisoner's Dilemma. In the counterfactual where it cooperates, it might naturally deduce that other agents like it would also cooperate. Therefore we could hand it a world with weird physics, and see whether in the counterfactual where it cooperates, it can deduce more about that world. Then it has presumably found agents like itself in that world.

Prisoner's Dilemma? Counterfactual? Agent? Cooperation? We're talking about starting from low-level physics, there isn't even a built-in place to introduce these very-high-level concepts!

The agent I'm talking about is separate from your physics-based world. It's from toy setups like Robust Cooperation in the Prisoner's Dilemma. If it can reason about statements like "If my algorithm returns that I cooperate, then I get 3 utility.", then there may be a property p for which it can prove "If my algorithm returns that I cooperate, then this strange hypothetical-physics-based world has property p." but not "This strange hypothetical-physics-based world has property p.". This would indicate that the strange world contains agents about which that premise is useful, so we can use modal combatants as agent detectors.