When Most VNM-Coherent Preference Orderings Have Convergent Instrumental Incentives

26Rob Bensinger

11Edouard Harris

3TurnTrout

1Edouard Harris


Copying over a Slack comment from Abram Demski:

I think this post could be pretty important.

It offers a formal treatment of "goal-directedness" and its relationship to coherence theorems such as VNM, a topic which has seen some past controversy but which has -- till now -- been dealt with only quite informally. Personally I haven't known how to engage with the whole goal-directedness debate, and I think part of the reason for that is the vagueness of the idea. Goal-directedness doesn't seem that cruxy for most of my thinking, but some other people seem to really strongly perceive it as a crux for MIRI-type thought, and sometimes as a crux for AI risk more generally. (I once made a "tool AI" argument against AI risk myself, although in hindsight I would say that was all motivated cognition, which ignored the idea that even tool AI has to optimize strongly in order to have high capabilities.)

So, as I see it, there's been something of a stalemate between people who think the "goal-directed AI" vs "non-goal-directed AI" distinction is important for one reason or another, vs people who don't think that.

Alex Turner seems to give real technical meaning to this distinction, showing that most VNM-coherent preferences are indeed "goal directed" in the sense of acting broadly like we expect agents to act (that is, behaving in ways consistent with instrumental convergence). However, he also gives a class of VNM-coherent preferences which are *not* goal-directed in this sense, instead exhibiting essentially random behavior. This gives us a plausible formal proxy for the "goal directed vs not goal directed" distinction!

I'm not sure how it can/should carry the broader conversation forward, yet, but it seems like something to think about.

Thanks for writing this.

I have one point of confusion about some of the notation that's being used to prove Lemma 3. Apologies for the detail, but the mistake could very well be on my end so I want to make sure I lay out everything clearly.

First, $\phi$ is being defined here as an *outcome* permutation. Presumably this means that 1) for some $i, j$, $\phi(o_i) = o_j$; and 2) $\phi$ admits a unique inverse $\phi^{-1}$. That makes sense.

We also define lotteries over outcomes, presumably as, e.g., $L = \sum_i p_i o_i$, where $p_i$ is the probability of outcome $o_i$. Of course we can interpret the $o_i$ geometrically as mutually orthogonal unit vectors, so this lottery defines a point on the $n$-simplex. So far, so good.

But the thing that's confusing me is what this implies for the definition of $\phi(L)$. Because $\phi$ is defined as a permutation over *outcomes* (and not over *probabilities* of outcomes), we should expect this to be $\phi(L) = \sum_i p_i \, \phi(o_i)$.

The problem is that this seems to give a different EV from the lemma:

(Note that I'm using as the dummy variable rather than , but the LHS above should correspond to line 2 of the proof.) Doing the same thing for the lottery gives an analogous result. And then looking at the inequality that results suggests that lemma 3 should actually be " induces " as opposed to " induces ".

(As a concrete example, suppose we have a lottery with the permutation , , . Then and our EV is

Yet which appears to contradict the lemma as stated.)

Note that even if this analysis is correct, it doesn't invalidate your main claim. You only really care about the *existence* of a bijection rather than *what* that bijection is — the fact that your outcome space is finite ensures that the proportion of orbit elements that incentivize power seeking remains the same either way. (It could have implications if you try to extend this to a metric space, though.)

Again, it's also possible I've just misunderstood something here — please let me know if that's the case!

Thanks! I think you're right. I think I actually should have defined $\prec_\phi$ differently, because writing it out, it isn't what I want. Having written out a small example, intuitively, $L \prec_\phi M$ should hold iff $\phi(L) \prec \phi(M)$, which will also induce $u(\phi(o_i))$ as we want.

I'm not quite sure what the error was in the original proof of Lemma 3; I think it may be how I converted to and interpreted the vector representation. Probably it's more natural to represent as , which makes your insight obvious.

The post is edited and the issues should now be fixed.

No problem! Glad it was helpful. I think your fix makes sense.

> I'm not quite sure what the error was in the original proof of Lemma 3; I think it may be how I converted to and interpreted the vector representation.

Yeah, I figured maybe it was because the dummy variable was being used in the EV to sum over outcomes, while the vector was being used to represent the probabilities associated with those outcomes. Because and are similar it's easy to conflate their meanings, and if you apply to the wrong one by accident that has the same effect as applying to the other one. In any case though, the main result seems unaffected.

Cheers!

This post explains a formal link between "what kinds of instrumental convergence exist?" and "what does VNM-coherence tell us about goal-directedness?". It turns out that VNM-coherent preference orderings have the *same* statistical incentives as utility functions; most such orderings will incentivize power-seeking in the settings covered by the power-seeking theorems.

In certain contexts, coherence theorems *can* have non-trivial implications, in that they provide Bayesian evidence about what the coherent agent will probably do. In the situations where the power-seeking theorems apply, coherent preferences *do* suggest some degree of goal-directedness. Somewhat more precisely, VNM-coherence is Bayesian evidence that the agent prefers to stay alive, keep its options open, etc.

However, VNM-coherence over *action-observation histories* tells you nothing about what behavior to expect from the coherent agent, *because* there is no instrumental convergence for generic utility functions over action-observation histories!

## Intuition

The result follows because the VNM utility theorem lets you consider VNM-coherent preference orderings to be isomorphic to their induced utility functions (with equivalence up to positive affine transformation), and so these preference orderings will have the same generic incentives as the utility functions themselves.

## Formalism

Let $o_1, \ldots, o_n$ be outcomes, in a sense which depends on the context; outcomes could be world-states, universe-histories, or one of several fruits. Outcome lotteries are probability distributions over outcomes, and can be represented as elements of the $n$-dimensional probability simplex (i.e., as element-wise non-negative vectors whose entries sum to 1).

A preference ordering $\prec$ is a binary relation on lotteries; it need not be, e.g., complete (defined for all pairs of lotteries).

*VNM-coherent* preference orderings are those which obey the VNM axioms. By the VNM utility theorem, coherent preference orderings induce consistent utility functions over outcomes, and consistent utility functions conversely imply a coherent preference ordering.

**Definition 1: Permuted preference ordering.** Let $\phi \in S_n$ be an outcome permutation, and let $\prec$ be a preference ordering. $\prec_\phi$ is the preference ordering such that for any lotteries $L, M$: $L \prec_\phi M$ if and only if $\phi(L) \prec \phi(M)$.

EDIT: Thanks to Edouard Harris for pointing out that Definition 1 and Lemma 3 were originally incorrect.
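Definition 1 can be made concrete for a small outcome set. The sketch below is a minimal illustration rather than the post's formalism: it assumes that $\phi$ acts on a lottery by relabeling outcomes (outcome $\phi(o_i)$ receives the probability that $L$ gave to $o_i$), and that $\prec$ is the ordering induced by a utility vector `u`. The function names and the fruit example are hypothetical.

```python
from fractions import Fraction

def permute_lottery(phi, lottery):
    """Relabel outcomes: outcome phi[i] receives the probability of outcome i."""
    permuted = [Fraction(0)] * len(lottery)
    for i, p in enumerate(lottery):
        permuted[phi[i]] = p
    return permuted

def expected_utility(u, lottery):
    """Expected utility of a lottery under the utility vector u."""
    return sum(p * x for p, x in zip(lottery, u))

def prefers(u, L, M):
    """L < M in the ordering induced by u: M has strictly greater expected utility."""
    return expected_utility(u, L) < expected_utility(u, M)

def prefers_permuted(u, phi, L, M):
    """Definition 1: L <_phi M if and only if phi(L) < phi(M)."""
    return prefers(u, permute_lottery(phi, L), permute_lottery(phi, M))

# Hypothetical example with three outcomes: apple (0), banana (1), cherry (2).
u = [Fraction(1), Fraction(2), Fraction(3)]
phi = (1, 2, 0)  # apple -> banana, banana -> cherry, cherry -> apple
L = [Fraction(1), Fraction(0), Fraction(0)]  # apple for certain
M = [Fraction(0), Fraction(0), Fraction(1)]  # cherry for certain

print(prefers(u, L, M))                # True: cherry beats apple under u
print(prefers_permuted(u, phi, L, M))  # False: phi(L) is banana, phi(M) is apple
```

Exact `Fraction` arithmetic is used so that comparisons between expected utilities are not affected by floating-point rounding.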

**Definition 2: Orbit of a preference ordering.** Let $\prec$ be any preference ordering. Its orbit $S_n \cdot \prec$ is the set $\{\prec_\phi \mid \phi \in S_n\}$.

The orbits of coherent preference orderings are basically all the preference orderings induced by "relabeling" which outcomes are which. This is made clear by the following result:

**Lemma 3: Permuting coherent preferences permutes the induced utility function.** Let $\prec$ be a VNM-coherent preference ordering which induces VNM-utility function $u$, and let $\phi \in S_n$. Then $\prec_\phi$ induces VNM-utility function $u'(o_i) = u(\phi(o_i))$, where $o_i$ is any outcome.

*Proof.* Let $L, M$ be any lotteries. QED.
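The chain of equivalences behind Lemma 3 can plausibly be written out as follows. This is a reconstruction from Definition 1 and the lemma statement, not the original derivation, and it assumes that $\phi$ acts on a lottery by relabeling outcomes, so that $\phi(L)$ gives $\phi(o_i)$ the probability that $L$ gives $o_i$.

```latex
\begin{align*}
L \prec_\phi M
&\iff \phi(L) \prec \phi(M)
  && \text{(Definition 1)} \\
&\iff \mathbb{E}_{o \sim \phi(L)}[u(o)] < \mathbb{E}_{o \sim \phi(M)}[u(o)]
  && \text{($\prec$ induces $u$)} \\
&\iff \mathbb{E}_{o \sim L}[u(\phi(o))] < \mathbb{E}_{o \sim M}[u(\phi(o))]
  && \text{(relabeling outcomes)} \\
&\iff \mathbb{E}_{o \sim L}[u'(o)] < \mathbb{E}_{o \sim M}[u'(o)]
  && \text{(with $u' := u \circ \phi$)}.
\end{align*}
```

Under that assumption, the third line is the only step that uses the action of $\phi$ on lotteries: drawing $o$ from $\phi(L)$ is the same as drawing $o$ from $L$ and then applying $\phi$.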

As a corollary, this lemma implies that if $\prec$ is VNM-coherent, so is $\prec_\phi$, since it induces a consistent utility function over outcomes.
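Lemma 3 can also be checked numerically on small cases. The sketch below assumes, as an illustration only, that $\phi$ relabels outcomes (outcome $\phi(o_i)$ receives the probability of $o_i$), and it compares the expected utility of $\phi(L)$ under $u$ with the expected utility of $L$ under $u \circ \phi$ for every permutation and a batch of random lotteries. The helper names are hypothetical.

```python
import itertools
import random
from fractions import Fraction

def permute_lottery(phi, lottery):
    """Relabel outcomes: outcome phi[i] receives the probability of outcome i."""
    permuted = [Fraction(0)] * len(lottery)
    for i, p in enumerate(lottery):
        permuted[phi[i]] = p
    return permuted

def expected_utility(u, lottery):
    return sum(p * x for p, x in zip(lottery, u))

def random_lottery(n, rng):
    """A random lottery with exact rational probabilities that sum to 1."""
    weights = [rng.randint(1, 10) for _ in range(n)]
    total = sum(weights)
    return [Fraction(w, total) for w in weights]

def lemma3_holds(u, n_trials=50, seed=0):
    """Check E_{phi(L)}[u] == E_L[u o phi] for all phi and n_trials random lotteries each."""
    rng = random.Random(seed)
    n = len(u)
    for phi in itertools.permutations(range(n)):
        u_prime = [u[phi[i]] for i in range(n)]  # u'(o_i) = u(phi(o_i))
        for _ in range(n_trials):
            L = random_lottery(n, rng)
            lhs = expected_utility(u, permute_lottery(phi, L))
            rhs = expected_utility(u_prime, L)
            if lhs != rhs:
                return False
    return True

print(lemma3_holds([1, 5, -2, 0]))  # True
```

Because the two expected utilities agree for every lottery, the strict inequalities that define $\prec_\phi$ agree as well, which is what the lemma asserts.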

Consider the orbit of any $\prec$. By the VNM utility theorem, each preference ordering can be considered isomorphic to its induced utility function (with equivalence up to positive affine transformation).

Then let $u$ be any utility function compatible with $\prec$. By the above lemma, there is a natural bijection between the (preference ordering) orbit of $\prec$ and the (utility function) orbit of $u$, where $\{\prec_\phi \mid \phi \in S_n\} \leftrightarrow \{u \circ \phi \mid \phi \in S_n\}$. [Footnote: representative]
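For small $n$, the utility-function side of this bijection can be enumerated directly, which makes orbit-level statements easy to inspect. The sketch below is a toy illustration and not the formal power-seeking theorems: it lists every $u \circ \phi$ and counts the fraction of orbit elements for which the best lottery in a set `A` strictly beats the best lottery in a set `B`. The function names and the example sets are hypothetical.

```python
import itertools
from fractions import Fraction

def expected_utility(u, lottery):
    return sum(p * x for p, x in zip(lottery, u))

def orbit(u):
    """All relabelings u o phi of the utility vector u, one per phi in S_n."""
    n = len(u)
    return [tuple(u[phi[i]] for i in range(n))
            for phi in itertools.permutations(range(n))]

def best_in(u, lotteries):
    """Highest expected utility achievable within a set of lotteries."""
    return max(expected_utility(u, L) for L in lotteries)

def fraction_preferring(u, A, B):
    """Fraction of orbit elements whose best option in A strictly beats their best in B."""
    elements = orbit(u)
    wins = sum(1 for v in elements if best_in(v, A) > best_in(v, B))
    return Fraction(wins, len(elements))

# Hypothetical example: A offers a choice between two sure outcomes, B offers one.
A = [[1, 0, 0], [0, 1, 0]]
B = [[0, 0, 1]]
print(fraction_preferring([1, 2, 3], A, B))  # 2/3
```

By Lemma 3, each $u \circ \phi$ in this list is induced by the corresponding $\prec_\phi$, so the same fraction describes the preference-ordering orbit.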

When my theorems on power-seeking are applicable, some proportion of the right-hand side is guaranteed to make (formal) power-seeking optimal. But by the bijection and by the fact that the preference orderings incentivize the same things (by the VNM theorem in the reverse direction), the (preference ordering) orbit must have the *exact same proportion of elements* for which (lotteries representing formal) power-seeking are optimal.

Conversely, if we know that some set $A$ of lotteries tends to be preferred over another set $B$ of lotteries (in the preference order orbit sense), then the same argument shows that $A$ tends to have greater expected utility than $B$ (in the utility function orbit sense). This holds for all (utility function) orbits, because every utility function corresponds to a VNM-coherent preference ordering.

So: orbit-level instrumental convergence for utility functions is *equivalent* to orbit-level instrumental convergence for VNM-coherent preference orderings.

## Implications

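- VNM-coherence over action-observation histories tells you nothing about what behavior to expect from the coherent agent, *because* there is no instrumental convergence in that setting!
- In certain contexts, coherence theorems *can* have non-trivial implications, in that they provide Bayesian evidence about what the coherent agent will probably do.
- In the situations where the power-seeking theorems apply, coherent preferences *do* suggest some degree of goal-directedness.

## The quest for better convergence theorems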

Goal-directedness seems to more naturally arise from coherence over resources. (I think the word 'resources' is slightly imprecise here, because resources are only resources in the normal context of human life; money is useless when alone in Alpha Centauri, but time to live is not. So we want coherence over things-which-are-locally-resources, perhaps.)

In his review of *Seeking Power is Often Convergently Instrumental in MDPs*, John Wentworth wrote:

In his review of *Coherent decisions imply consistent utilities*, John wrote:

This seems on-track to me. We now know what instrumental convergence looks like in unstructured environments, and how structural assumptions on utility functions affect the shape and strength of that instrumental convergence, and this post explains the precise link between "what kinds of instrumental convergence exist?" and "what does VNM-coherence tell us about goal-directedness?". I'd be excited to see what instrumental convergence looks like in more structured models.

Footnote (representative): In terms of instrumental convergence, positive affine transformation never affects the optimality probability of different lottery sets. For each (preference ordering) orbit element $\prec_\phi$, it therefore doesn't matter which representative we select from each equivalence class over induced utility functions; we may as well pick $u \circ \phi$!
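The invariance this footnote relies on is easy to check on an example. The sketch below is illustrative only, with hypothetical names and numbers: it ranks a few lotteries by expected utility under a utility vector `u` and under a positive affine transformation `a*u + b`, and confirms that the two rankings coincide.

```python
from fractions import Fraction

def expected_utility(u, lottery):
    return sum(p * x for p, x in zip(lottery, u))

def ranking(u, lotteries):
    """Indices of the lotteries sorted by expected utility under u (ties broken by index)."""
    return sorted(range(len(lotteries)),
                  key=lambda k: (expected_utility(u, lotteries[k]), k))

def affine(u, a, b):
    """Positive affine transformation a*u + b; a must be strictly positive."""
    if a <= 0:
        raise ValueError("a must be positive")
    return [a * x + b for x in u]

u = [Fraction(1), Fraction(4), Fraction(2)]
lotteries = [
    [Fraction(1, 2), Fraction(1, 2), Fraction(0)],
    [Fraction(0), Fraction(0), Fraction(1)],
    [Fraction(1, 3), Fraction(1, 3), Fraction(1, 3)],
]

same = ranking(u, lotteries) == ranking(affine(u, Fraction(3), Fraction(-7)), lotteries)
print(same)  # True
```

Since probabilities sum to 1, the expected utility of any lottery under `a*u + b` is `a` times its expected utility under `u`, plus `b`, so every comparison between lotteries is preserved when `a > 0`.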