Coherence arguments do not entail goal-directed behavior

One of the most pleasing things about probability and expected utility theory is that there are many coherence arguments that suggest that these are the “correct” ways to reason. If you deviate from what the theory prescribes, then you must be executing a dominated strategy. There must be some other strategy that never does any worse than your strategy, but does strictly better than your strategy with certainty in at least one situation. There’s a good explanation of these arguments here.

We shouldn’t expect mere humans to be able to notice any failures of coherence in a superintelligent agent, since if we could notice these failures, so could the agent. So we should expect that powerful agents appear coherent to us. (Note that it is possible that the agent doesn’t fix the failures because it would not be worth it -- in this case, the argument says that we will not be able to notice any exploitable failures.)

Taken together, these arguments suggest that we should model an agent much smarter than us as an expected utility (EU) maximizer. And many people agree that EU maximizers are dangerous. So does this mean we’re doomed? I don’t think so: it seems to me that the problems about EU maximizers that we’ve identified are actually about goal-directed behavior or explicit reward maximizers. The coherence theorems say nothing about whether an AI system must look like one of these categories. This suggests that we could try building an AI system that can be modeled as an EU maximizer, yet doesn’t fall into one of these two categories, and so doesn’t have all of the problems that we worry about.

Note that there are two different flavors of arguments that the AI systems we build will be goal-directed agents (which are dangerous if the goal is even slightly wrong):

  • Simply knowing that an agent is intelligent lets us infer that it is goal-directed. (EDIT: See these comments for more details on this argument.)
  • Humans are particularly likely to build goal-directed agents.

I will only be arguing against the first claim in this post, and will talk about the second claim in the next post.

All behavior can be rationalized as EU maximization

Suppose we have access to the entire policy of an agent, that is, given any universe-history, we know what action the agent will take. Can we tell whether the agent is an EU maximizer?

Actually, no matter what the policy is, we can view the agent as an EU maximizer. The construction is simple: the agent can be thought of as optimizing the utility function U, where U(h, a) = 1 if the policy would take action a given history h, else 0. Here I’m assuming that U is defined over histories that are composed of states/observations and actions. The actual policy gets 1 utility at every timestep; any other policy gets less than this, so the given policy perfectly maximizes this utility function. This construction has been given before, e.g. at the bottom of page 6 of this paper. (I think I’ve seen it before too, but I can’t remember where.)
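The construction can be checked by brute force in a toy setting. The following is a minimal sketch, not anything from the cited paper: the action and observation names, the horizon, and `arbitrary_policy` are all hypothetical, and it assumes the observation sequence is fixed independently of the actions taken. It enumerates every action sequence and confirms that only the policy's own choices reach the maximum total utility.

```python
from itertools import product

# Hypothetical toy setting: names and sizes are illustrative assumptions.
ACTIONS = ("left", "right", "stay")
OBSERVATIONS = ("o0", "o1")
HORIZON = 3

def arbitrary_policy(history):
    """A deliberately pointless deterministic policy over histories."""
    return ACTIONS[sum(len(item) for item in history) % len(ACTIONS)]

def make_indicator_utility(policy):
    """U(h, a) = 1 if the policy would take action a given history h, else 0."""
    def utility(history, action):
        return 1 if policy(history) == action else 0
    return utility

def total_utility(actions, utility, observations):
    """Sum U(h, a) over timesteps for fixed observation and action sequences."""
    history, total = (), 0
    for observation, action in zip(observations, actions):
        history += (observation,)
        total += utility(history, action)
        history += (action,)
    return total

def actions_of(policy, observations):
    """The action sequence the policy itself produces on these observations."""
    history, chosen = (), []
    for observation in observations:
        history += (observation,)
        action = policy(history)
        chosen.append(action)
        history += (action,)
    return tuple(chosen)

def policy_is_unique_maximizer(policy):
    """True if, on every observation sequence, only the policy's own actions
    score the maximum of HORIZON under the policy's indicator utility."""
    utility = make_indicator_utility(policy)
    for observations in product(OBSERVATIONS, repeat=HORIZON):
        own = actions_of(policy, observations)
        for actions in product(ACTIONS, repeat=HORIZON):
            score = total_utility(actions, utility, observations)
            if (score == HORIZON) != (actions == own):
                return False
    return True

print(policy_is_unique_maximizer(arbitrary_policy))  # prints True
```

Any deviation loses a point at the first timestep where it differs from the policy, which is why the check passes for whatever function is substituted for `arbitrary_policy`.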

But wouldn’t this suggest that the VNM theorem has no content? Well, we assumed that we were looking at the policy of the agent, which led to a universe-history deterministically. We didn’t have access to any probabilities. Given a particular action, we knew exactly what the next state would be. Most of the axioms of the VNM theorem make reference to lotteries and probabilities -- if the world is deterministic, then the axioms simply say that the agent must have transitive preferences over outcomes. Given that we can only observe the agent choose one history over another, we can trivially construct a transitive preference ordering by saying that the chosen history is higher in the preference ordering than the one that was not chosen. This is essentially the construction we gave above.

What then is the purpose of the VNM theorem? It tells you how to behave if you have probabilistic beliefs about the world, as well as a complete and consistent preference ordering over outcomes. This turns out to be not very interesting when “outcomes” refers to “universe-histories”. It can be more interesting when “outcomes” refers to world states instead (that is, snapshots of what the world looks like at a particular time), but utility functions over states/snapshots can’t capture everything we’re interested in, and there’s no reason to take as an assumption that an AI system will have a utility function over states/snapshots.

There are no coherence arguments that say you must have goal-directed behavior

Not all behavior can be thought of as goal-directed (primarily because I allowed the category to be defined by fuzzy intuitions rather than something more formal). Consider the following examples:

  • A robot that constantly twitches
  • The agent that always chooses the action that starts with the letter “A”
  • The agent that follows the policy <policy> where for every history the corresponding action in <policy> is generated randomly.

These are not goal-directed by my “definition”. However, they can all be modeled as expected utility maximizers, and there isn’t any particular way that you can exploit any of these agents. Indeed, it seems hard to model the twitching robot or the policy-following agent as having any preferences at all, so the notion of “exploiting” them doesn’t make much sense.
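To make the "can all be modeled as expected utility maximizers" claim concrete, here is a rough sketch of the three example agents as policies. The action names, observation symbols, and the seeded random table are assumptions made up for illustration. Each policy turns out to be the greedy maximizer of its own indicator utility on every history checked.

```python
import random
from itertools import product

# Hypothetical action set; exactly one action starts with the letter "a".
ACTIONS = ("advance", "brake", "coast")

def twitching_robot(history):
    """Twitches back and forth regardless of what has been observed."""
    return ACTIONS[len(history) % 2]

def a_agent(history):
    """Always chooses the action that starts with the letter 'a'."""
    return next(a for a in ACTIONS if a.startswith("a"))

def make_random_table_policy(seed=0):
    """A policy whose action for each history is drawn at random once,
    then stored so that the policy stays fixed afterwards."""
    rng = random.Random(seed)
    table = {}
    def policy(history):
        if history not in table:
            table[history] = rng.choice(ACTIONS)
        return table[history]
    return policy

def indicator_utility(policy):
    """U(h, a) = 1 if the policy takes a at h, else 0."""
    return lambda history, action: 1 if policy(history) == action else 0

def is_greedy_maximizer(policy, histories):
    """True if at every history the policy's action is the argmax of U(h, .)."""
    utility = indicator_utility(policy)
    return all(
        max(ACTIONS, key=lambda a: utility(h, a)) == policy(h)
        for h in histories
    )

# All observation histories of length 0 to 3 over two hypothetical symbols.
HISTORIES = [h for n in range(4) for h in product(("x", "y"), repeat=n)]

for name, policy in [
    ("twitching robot", twitching_robot),
    ("'A' agent", a_agent),
    ("random policy", make_random_table_policy(seed=0)),
]:
    print(name, is_greedy_maximizer(policy, HISTORIES))
```

Nothing in the check depends on what the policy is for, which is the point: the utility function is read off the behavior rather than explaining it.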

You could argue that none of these agents is intelligent, and we’re only concerned with superintelligent AI systems. I don’t see why these agents could not in principle be intelligent: perhaps the agent knows how the world would evolve, and how to intervene on the world to achieve different outcomes, but it does not act on these beliefs. Perhaps if we peered into the inner workings of the agent, we could find some part of it that allows us to predict the future very accurately, but it turns out that these inner workings did not affect the chosen action at all. Such an agent is in principle possible, and it seems like it is intelligent.

(If not, it seems as though you are defining intelligence to also be goal-driven, in which case I would frame my next post as arguing that we may not want to build superintelligent AI, because there are other things we could build that are as useful without the corresponding risks.)

You could argue that while this is possible in principle, no one would ever build such an agent. I wholeheartedly agree, but note that this is now an argument based on particular empirical facts about humans (or perhaps agent-building processes more generally). I’ll talk about those in the next post; here I am simply arguing that merely knowing that an agent is intelligent, with no additional empirical facts about the world, does not let you infer that it has goals.

As a corollary, since all behavior can be modeled as maximizing expected utility, but not all behavior is goal-directed, it is not possible to conclude that an agent is goal-driven if you only know that it can be modeled as maximizing some expected utility. However, if you know that an agent is maximizing the expectation of an explicitly represented utility function, I would expect that to lead to goal-driven behavior most of the time, since the utility function must be relatively simple if it is explicitly represented, and simple utility functions seem particularly likely to lead to goal-directed behavior.

There are no coherence arguments that say you must have preferences

This section is another way to view the argument in the previous section, with “goal-directed behavior” now being operationalized as “preferences”; it is not saying anything new.

Above, I said that the VNM theorem assumes both that you use probabilities and that you have a preference ordering over outcomes. There are lots of good reasons to assume that a good reasoner will use probability theory. However, there’s not much reason to assume that there is a preference ordering over outcomes. The twitching robot, “A”-following agent, and random policy agent from the last section all seem like they don’t have preferences (in the English sense, not the math sense).

Perhaps you could define a preference ordering by saying “if I gave the agent lots of time to think, how would it choose between these two histories?” However, you could apply this definition to anything, including, e.g., a thermostat or a rock. You might argue that a thermostat or rock can’t “choose” between two histories; but then it’s unclear how to define how an AI “chooses” between two histories without that definition also applying to thermostats and rocks.

Of course, you could always define a preference ordering based on the AI’s observed behavior, but then you’re back in the setting of the first section, where all observed behavior can be modeled as maximizing an expected utility function and so saying “the AI is an expected utility maximizer” is vacuous.

Convergent instrumental subgoals are about goal-directed behavior

One of the classic reasons to worry about expected utility maximizers is the presence of convergent instrumental subgoals, detailed in Omohundro’s paper The Basic AI Drives. The paper itself is clearly talking about goal-directed AI systems:

To say that a system of any design is an “artificial intelligence”, we mean that it has goals which it tries to accomplish by acting in the world.

It then argues (among other things) that such AI systems will want to “be rational” and so will distill their goals into utility functions, which they then maximize. And once they have utility functions, they will protect them from modification.

Note that this starts from the assumption of goal-directed behavior and derives that the AI will be an EU maximizer along with the other convergent instrumental subgoals. The coherence arguments all imply that AIs will be EU maximizers for some (possibly degenerate) utility function; they don’t prove that the AI must be goal-directed.

Goodhart’s Law is about goal-directed behavior

A common argument for worrying about AI risk is that we know that a superintelligent AI system will look to us like an EU maximizer, and if it maximizes a utility function that is even slightly wrong we could get catastrophic outcomes.

By now you probably know my first response: that any behavior can be modeled as EU maximization, and so this argument proves too much, suggesting that any behavior causes catastrophic outcomes. But let’s set that aside for now.

The second part of the claim comes from arguments like Value is Fragile and Goodhart’s Law. However, if we consider utility functions that assign value 1 to some histories and 0 to others, then if you accidentally assign a history where I needlessly stub my toe a 1 instead of a 0, that’s a slightly wrong utility function, but it isn’t going to lead to catastrophic outcomes.
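As a toy illustration of the 0/1 case, the sketch below uses three made-up history labels (purely hypothetical) and flips the value of one of them. The set of histories the slightly wrong utility function treats as optimal grows by exactly the mislabeled history and nothing else.

```python
# Hypothetical histories; the labels are illustrative assumptions only.
TRUE_UTILITY = {
    "ordinary day": 1,
    "ordinary day, needless stubbed toe": 0,
    "catastrophe": 0,
}

def optimal_histories(utility):
    """All histories that attain the maximum value under this utility."""
    best = max(utility.values())
    return {history for history, value in utility.items() if value == best}

# The "slightly wrong" utility: one history is accidentally given 1, not 0.
slightly_wrong = dict(TRUE_UTILITY)
slightly_wrong["ordinary day, needless stubbed toe"] = 1

newly_acceptable = optimal_histories(slightly_wrong) - optimal_histories(TRUE_UTILITY)
print(newly_acceptable)  # prints {'ordinary day, needless stubbed toe'}
```

The error stays local because the utility function is defined history by history, with no high-level concept that the mistake could generalize through.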

The worry about utility functions that are slightly wrong holds water when the utility functions are wrong about some high-level concept, like whether humans care about their experiences reflecting reality. This is a very rarefied, particular distribution of utility functions, that are all going to lead to goal-directed or agentic behavior. As a result, I think that the argument is better stated as “if you have a slightly incorrect goal, you can get catastrophic outcomes”. And there aren’t any coherence arguments that say that agents must have goals.

Wireheading is about explicit reward maximization

There are a few papers that talk about the problems that arise with a very powerful system with a reward function or utility function, most notably wireheading. The argument that AIXI will seize control of its reward channel falls into this category. In these cases, typically the AI system is considering making a change to the system by which it evaluates goodness of actions, and the goodness of the change is evaluated by the system after the change. Daniel Dewey argues in Learning What to Value that if the change is evaluated by the system before the change, then these problems go away.
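The before/after distinction can be sketched in a few lines. This is a toy model under stated assumptions, not Dewey's formalism: the outcomes, the two evaluators, and the tie-breaking in `predicted_outcome` are all hypothetical. The only thing it shows is that the same proposed self-modification is accepted when judged by the evaluator that would exist after the change and rejected when judged by the current one.

```python
# Hypothetical outcomes; ordering matters only for tie-breaking below.
OUTCOMES = ("nothing done", "task done")

def current_evaluator(outcome):
    """The agent's evaluator before any self-modification."""
    return 1.0 if outcome == "task done" else 0.0

def wireheaded_evaluator(outcome):
    """A modified evaluator that rates every outcome as maximally good."""
    return 10.0

def predicted_outcome(evaluator):
    """Assumed world model: an agent steering by `evaluator` ends up in the
    outcome it rates highest (ties go to the first listed outcome)."""
    return max(OUTCOMES, key=evaluator)

def accepts_change_judged_after(current, proposed):
    """Score each future with the evaluator that will be in place in it."""
    keep = current(predicted_outcome(current))
    change = proposed(predicted_outcome(proposed))
    return change > keep

def accepts_change_judged_before(current, proposed):
    """Score both futures with the evaluator the agent has right now."""
    keep = current(predicted_outcome(current))
    change = current(predicted_outcome(proposed))
    return change > keep

print(accepts_change_judged_after(current_evaluator, wireheaded_evaluator))   # prints True
print(accepts_change_judged_before(current_evaluator, wireheaded_evaluator))  # prints False
```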

I think of these as problems with reward maximization, because typically when you phrase the problem as maximizing reward, you are maximizing the sum of rewards obtained in all timesteps, no matter how those rewards are obtained (i.e. even if you self-modify to make the reward maximal). It doesn’t seem like AI systems have to be built this way (though admittedly I do not know how to build AI systems that reliably avoid these problems).

Summary

In this post I’ve argued that many of the problems we typically associate with expected utility maximizers are actually problems with goal-directed agents or with explicit reward maximization. Coherence arguments only entail that a superintelligent AI system will look like an expected utility maximizer, but this is actually a vacuous constraint, and there are many potential utility functions for which the resulting AI system is neither goal-directed nor explicit-reward-maximizing. This suggests that we could try to build AI systems of this type, in order to sidestep the problems that we have identified so far.

Comments

I think that strictly speaking this post (or at least the main thrust) is true, and proven in the first section. The title is arguably less true: I think of 'coherence arguments' as including things like 'it's not possible for you to agree to give me a limitless number of dollars in return for nothing', which does imply some degree of 'goal-direction'.

I think the post is important, because it constrains the types of valid arguments that can be given for 'freaking out about goal-directedness', for lack of a better term. In my mind, it provokes various follow-up questions:

  1. What arguments would imply 'goal-directed' behaviour?
  2. With what probability will a random utility maximiser be 'goal-directed'?
  3. How often should I think of a system as a utility maximiser in resources, perhaps with a slowly-changing utility function?
  4. How 'goal-directed' are humans likely to make systems, given that we are making them in order to accomplish certain tasks that don't look like random utility functions?
  5. Is there some kind of 'basin of goal-directedness' that systems fall in if they're even a little goal-directed, causing them to behave poorly?

Off the top of my head, I'm not familiar with compelling responses from the 'freak out about goal-directedness' camp on points 1 through 5, even though as a member of that camp I think that such responses exist. Responses from outside this camp include Rohin's post 'Will humans build goal-directed agents?'. Another response is Brangus' comment post, although I find its theory of goal-directedness uncompelling.

I think that it's notable that Brangus' post was released soon after this was announced as a contender for Best of LW 2018. I think that if this post were added to the Best of LW 2018 Collection, the 'freak out' camp might produce more of these responses and move the dialogue forward. As such, I think it should be added, both because of the clear argumentation and because of the response it is likely to provoke.

Putting my cards on the table, this is my guess at the answers to the questions that I raise:

  1. I don't know.
  2. Low.
  3. Frequent if it's an 'intelligent' one.
  4. Relatively. You probably don't end up with systems that resist literally all changes to their goals, but you probably do end up with systems that resist most changes to their goals, barring specific effort to prevent that.
  5. Probably.

That being said, I think that a better definition of 'goal-directedness' would go a long way in making me less confused by the topic.

I have no idea why I responded 'low' to 2. Does anybody think that's reasonable and fits in with what I wrote here, or did I just mean high?

"random utility-maximizer" is pretty ambiguous; if you imagine the space of all possible utility functions over action-observation histories and you imagine a uniform distribution over them (suppose they're finite, so this is doable), then the answer is low.

Heh, looking at my comment it turns out I said roughly the same thing 3 years ago.

I pretty strongly agree with this review (and jtbc it was written without any input from me, even though Daniel and I are both at CHAI).

I think of 'coherence arguments' as including things like 'it's not possible for you to agree to give me a limitless number of dollars in return for nothing', which does imply some degree of 'goal-direction'.

Yeah, maybe I should say "coherence theorems" to be clearer about this? (Like, it isn't a theorem that I shouldn't give you limitless number of dollars in return for nothing; maybe I think that you are more capable than me and fully aligned with me, and so you'd do a better job with my money. Or maybe I value your happiness, and the best way to purchase it is to give you money no strings attached.)

Responses from outside this camp

Fwiw, I do in fact worry about goal-directedness, but (I think) I know what you mean. (For others, I think Daniel is referring to something like "the MIRI camp", though that is also not an accurate pointer, and it is true that I am outside that camp.)

My responses to the questions:

  1. The ones in Will humans build goal-directed agents?, but if you want arguments that aren't about humans, then I don't know.
  2. Depends on the distribution over utility functions, the action space, etc, but e.g. if it uniformly selects a numeric reward value for each possible trajectory (state-action sequence) where the actions are low-level (e.g. human muscle control), astronomically low.
  3. That will probably be a good model for some (many?) powerful AI systems that humans build.
  4. I don't know. (I think it depends quite strongly on the way in which we train powerful AI systems.)
  5. Not likely at low levels of intelligence, plausible at higher levels of intelligence, but really the question is not specified enough.

it was written without any input from me

Well, I didn't consult you in the process of writing the review, but we've had many conversations on the topic which presumably have influenced how I think about the topic and what I ended up writing in the review.

I think of 'coherence arguments' as including things like 'it's not possible for you to agree to give me a limitless number of dollars in return for nothing', which does imply some degree of 'goal-direction'.

Yeah, maybe I should say "coherence theorems" to be clearer about this?

Sorry, I meant theorems taking 'no limitless dollar sink' as an axiom and deriving something interesting from that.

In this essay, Rohin sets out to debunk what ey perceive as a prevalent but erroneous idea in the AI alignment community, namely: "VNM and similar theorems imply goal-directed behavior". This is placed in the context of Rohin's thesis that solving AI alignment is best achieved by designing AI which is not goal-directed. The main argument is: "coherence arguments" imply expected utility maximization, but expected utility maximization does not imply goal-directed behavior. Instead, it is a vacuous constraint, since any agent policy can be regarded as maximizing the expectation of some utility function.

I have mixed feelings about this essay. On the one hand, the core argument that VNM and similar theorems do not imply goal-directed behavior is true. To the extent that some people believed the opposite, correcting this mistake is important. On the other hand, (i) I don't think the claim Rohin is debunking is the claim Eliezer had in mind in those sources Rohin cites, and (ii) I don't think that the conclusions Rohin draws or at least implies are the right conclusions.

The actual claim that Eliezer was making (or at least my interpretation of it) is, coherence arguments imply that if we assume an agent is goal-directed then it must be an expected utility maximizer, and therefore EU maximization is the correct mathematical model to apply to such agents.

Why do we care about goal-directed agents in the first place? The reason is, on the one hand goal-directed agents are the main source of AI risk, and on the other hand, goal-directed agents are also the most straightforward approach to solving AI risk. Indeed, if we could design powerful agents with the goals we want, these agents would protect us from unaligned AIs and solve all other problems as well (or at least solve them better than we can solve them ourselves). Conversely, if we want to protect ourselves from unaligned AIs, we need to generate very sophisticated long-term plans of action in the physical world, possibly restructuring the world in a rather extreme way to safe-guard it (compare with Bostrom's arguments for mass surveillance). The ability to generate such plans is almost by definition goal-directed behavior.

Now, knowing that goal-directed agents are EU maximizers doesn't buy us much. As Rohin justly observes, without further constraints it is a vacuous claim (although the situation becomes better if we constrain ourselves to instrumental reward functions). Moreover, the model of reasoning in complex environments that I'm advocating myself (quasi-Bayesian reinforcement learning) doesn't even look like EU maximization (technically there is a way to interpret it as EU maximization but it underspecifies the behavior). This is a symptom of the fact that the setting and assumptions of VNM and similar theorems are not good enough to study goal-directed behavior. However, I think that it can be an interesting and important line of research, to try and figure out the right setting and assumptions.

This last point is IMO the correct takeaway from Rohin's initial observation. In contrast, I remain skeptical about Rohin's thesis that we should dispense with goal-directedness altogether, for the reason I mentioned before: powerful goal-directed agents seem necessary or at least very desirable to create a defense system against unaligned AI. Moreover, the study of goal-directed agents is important to understand the impact of any powerful AI system on the world, since even a system not designed to be goal-directed can develop such agency (due to reasons like malign hypotheses, mesa-optimization and self-fulfilling prophecies).

This is placed in the context of Rohin's thesis that solving AI alignment is best achieved by designing AI which is not goal-directed.
[...]
I remain skeptical about Rohin's thesis that we should dispense with goal-directedness altogether

Hmm, perhaps I believed this when I wrote the sequence (I don't think so, but maybe?), but I certainly don't believe it now. I believe something more like:

  • Humans have goals and want AI systems to help them achieve them; this implies that the human-AI system as a whole should be goal-directed.
  • One particular way to do this is to create a goal-directed AI system, and plug in a goal that (we think) we want. Such AI systems are well-modeled as EU maximizers with "simple" utility functions.
  • But there could plausibly be AI systems that are not themselves goal-directed, but nonetheless the resulting human-AI system is sufficiently goal-directed. For example, a "genie" that properly interprets your instructions based on what you mean and not what you say seems not particularly goal-directed, but when combined with a human giving instructions becomes goal-directed.
  • One counterargument is that in order to be competitive, you must take the human out of the loop. I don't find this compelling, for a few reasons. First, you can interpolate between lots of human feedback (the human says "do X for a minute" every minute to the "genie") and not much human feedback (the human says "pursue my CEV forever") depending on how competitive you need to be. This allows you to trade off between competitiveness and how much of the goal-directedness remains in the human. Second, you can help the human to provide more efficient and effective feedback (see e.g. recursive reward modeling). Finally, laws and regulations can be effective at reducing competition.
  • Nonetheless, it's not obvious how to create such non-goal-directed AI, and the AI community seems very focused on building goal-directed AI, and so there's a good chance we will build goal-directed AI and will need to focus on alignment of goal-directed AI systems.
  • As a result, we should be thinking about non-goal-directed AI approaches to alignment, while also working on alignment of goal-directed systems.

I think when I wrote the sequence, I thought the "just do deep RL" approach to AGI wouldn't work, and now I think it has more of a chance, and this has updated me towards powerful AI systems being goal-directed. (However, I do not think it is clear that "just do deep RL" approaches lead to goal-directed systems.)

I think that the discussion might be missing a distinction between different types or degrees of goal-directedness. For example, consider Dialogic Reinforcement Learning. Does it describe a goal-directed agent? On the one hand, you could argue it doesn't, because this agent doesn't have fixed preferences and doesn't have consistent beliefs over time. On the other hand, you could argue it does, because this agent is still doing long-term planning in the physical world. So, I definitely agree that aligned AI systems will only be goal-directed in the weaker sense that I alluded to, rather than in the stronger sense, and this is because the user is only goal-directed in the weak sense emself.

If we're aiming at "weak" goal-directedness (which might be consistent with your position?), does it mean studying strong goal-directedness is redundant? I think the answer is clearly no. Strong goal-directed systems are a simpler special case on which to hone our theories of intelligence. Trying to understand weak goal-directed agents without understanding strong goal-directed agents seems to me like trying to understand molecules without understanding atoms.

On the other hand, I am skeptical about solutions to AI safety that require the user doing a sizable fraction of the actual planning. I think that planning does not decompose into an easy part and a hard part (which is not essentially planning in itself) in a way which would enable such systems to be competitive with fully autonomous planners. The strongest counterargument to this position, IMO, is the proposal to use counterfactual oracles or recursively amplified versions thereof in the style of IDA. However, I believe that such systems will still fail to be simultaneously safe and competitive because (i) forecasting is hard if you don't know which features are important to forecast, and becomes doubly hard if you need to impose a confidence threshold to avoid catastrophic errors and in particular malign hypotheses (thresholds of the sort used in delegative RL), (ii) it seems plausible that competitive AI would have to be recursively self-improving (I updated towards this position after coming up with Turing RL) and that might already necessitate long-term planning, and (iii) such systems are vulnerable to attacks from the future and to attacks from counterfactual scenarios.

I think when I wrote the sequence, I thought the "just do deep RL" approach to AGI wouldn't work, and now I think it has more of a chance, and this has updated me towards powerful AI systems being goal-directed. (However, I do not think it is clear that "just do deep RL" approaches lead to goal-directed systems.)

To be clear, my own position is not strongly correlated with whether deep RL leads to AGI (i.e. I think it's true even if deep RL doesn't lead to AGI). But also, the question seems somewhat underspecified, since it's not clear which algorithmic innovation would count as still "just deep RL" and which wouldn't.

On the other hand, I am skeptical about solutions to AI safety that require the user doing a sizable fraction of the actual planning.

Agreed (though we may be using the word "planning" differently, see below).

If we're aiming at "weak" goal-directedness (which might be consistent with your position?)

I certainly agree that we will want AI systems that can find good actions, where "good" is based on long-term consequences. However, I think counterfactual oracles and recursive amplification also meet this criterion; I'm not sure why you think they are counterarguments. Perhaps you think that the AI system also needs to autonomously execute the actions it finds, whereas I find that plausible but not necessary?

To be clear, my own position is not strongly correlated with whether deep RL leads to AGI

Yes, that's what I thought. My position is more correlated because I don't see (strong) goal-directedness as a necessity, but I do think that deep RL is likely (though not beyond reasonable doubt) to lead to strongly goal-directed systems.

I certainly agree that we will want AI systems that can find good actions, where "good" is based on long-term consequences. However, I think counterfactual oracles and recursive amplification also meet this criterion; I'm not sure why you think they are counterarguments. Perhaps you think that the AI system also needs to autonomously execute the actions it finds, whereas I find that plausible but not necessary?

Maybe we need to further refine the terminology. We could say that counterfactual oracles are not intrinsically goal-directed. Meaning that, the algorithm doesn't start with all the necessary components to produce good plans, but instead tries to learn these components by emulating humans. This approach comes with costs that I think will make it uncompetitive compared to intrinsically goal-directed agents, for the reasons I mentioned before. Moreover, I think that any agent which is "extrinsically goal-directed" rather than intrinsically goal-directed will have such penalties.

In order for an agent to gain strategic advantage it is probably not necessary for it to be powerful enough to emulate humans accurately, reliably and significantly faster than real-time. We can consider three possible worlds:

World A: Agents that aren't powerful enough for even a limited scope short-term emulation of humans can gain strategic advantage. This world is a problem even for Dialogic RL, but I am not sure whether it's a fatal problem.

World B: Agents that aren't powerful enough for a short-term emulation of humans cannot gain strategic advantage. Agents that aren't powerful enough for a long-term emulation of humans (i.e. high bandwidth and faster than real-time) can gain strategic advantage. This world is good for Dialogic RL but bad for extrinsically goal-directed approaches.

World C: Agents that aren't powerful enough for a long-term emulation of humans cannot gain strategic advantage. In this world delegating the remaining part of the AI safety problem to extrinsically goal-directed agents is viable. However, if unaligned intrinsically goal-directed agents are deployed before a defense system is implemented, they will probably still win, because of their more efficient use of computing resources and lower risk-aversiveness, because even a sped-up version of the human algorithm might still have suboptimal sample complexity, and because of attacks from the future. Dialogic RL will also be disadvantaged compared to unaligned AI (because of risk-aversiveness) but at least the defense system will be constructed faster.

Allowing the AI to execute the actions it finds is also advantageous because of higher bandwidths and shorter reaction times. But this concerns me less.

I think I don't understand what you mean here. I'll say some things that may or may not be relevant:

I don't think the ability to plan implies goal-directedness. Tabooing goal-directedness, I don't think an AI that can "intrinsically" plan will necessarily pursue convergent instrumental subgoals. For example, the AI could have "intrinsic" planning capabilities, that find plans that when executed by a human lead to outcomes the human wants. Depending on how it finds such plans, such an AI may not pursue any of the convergent instrumental subgoals. (Google Maps would be an example of such an AI system, and by my understanding Google Maps has "intrinsic" planning capabilities.)

I also don't think that we will find the one true algorithm for planning (I agree with most of Richard's positions in Realism about rationality).

I don't think that my intuitions depend on an AI's ability to emulate humans (e.g. Google Maps does not emulate humans).

Google Maps is not a relevant example. I am talking about "generally intelligent" agents. That is, these agents construct sophisticated models of the world starting from a relatively uninformed prior (comparably to humans or more so)(fn1)(fn2). This is in sharp contrast to Google Maps, which operates strictly within the model it was given a priori. General intelligence is important, since without it I doubt it will be feasible to create a reliable defense system. Given general intelligence, convergent instrumental goals follow: any sufficiently sophisticated model of the world implies that achieving convergent instrumental goals is instrumentally valuable.

I don't think it makes that much difference whether a human executes the plan or the AI itself. If the AI produces a plan that is not human comprehensible and the human follows it blindly, the human effectively becomes just an extension of the AI. On the other hand, if the AI produces a plan which is human comprehensible, then after reviewing the plan the human can just as well delegate its execution to the AI.

I am not sure what the significance of "one true algorithm for planning" is in this context. My guess is that there is a relatively simple qualitatively optimal AGI algorithm(fn3), and then there are various increasingly complex quantitative improvements of it, which take into account specifics of computing hardware and maybe our priors about humans and/or the environment. That is the way algorithms for most natural problems behave, I think. But improvements also probably stop mattering beyond the point where the AGI can come up with them on its own within a reasonable time frame. And I dispute Richard's position. But then again, I don't understand the relevance.

(fn1) When I say "construct models" I am mostly talking about the properties of the agent rather than the structure of the algorithm. That is, the agent can effectively adapt to a large class of different environments or exploit a large class of different properties the environment can have. In this sense, model-free RL is also constructing models. Although I'm also leaning towards the position that explicitly model-based approaches are more likely to scale to AGI.

(fn2) Even if you wanted to make a superhuman AI that only solves mathematical problems, I suspect that the only way it could work is by having the AI generate models of "mathematical behaviors".

(fn3) As an analogy, a "qualitatively optimal" algorithm for a problem in P is just any polynomial time algorithm. In the case of AGI, I imagine a similar computational complexity bound plus some (also qualitative) guarantee(s) about sample complexity and/or query complexity. By "relatively simple" I mean something like: it can be described within 20 pages given that we can use algorithms for other natural problems.

A year later, I continue to agree with this post; I still think its primary argument is sound and important. I'm somewhat sad that I still think it is important; I thought this was an obvious-once-pointed-out point, but I do not think the community actually believes it yet.

I particularly agree with this sentence of Daniel's review:

"I think the post is important, because it constrains the types of valid arguments that can be given for 'freaking out about goal-directedness', for lack of a better term."

"Constraining the types of valid arguments" is exactly the right way to describe the post. Many responses to the post have been of the form "this is missing the point of EU maximization arguments", and yes, the post is deliberately missing that point. The post is not saying that arguments for AI risk are wrong, just that they are based on intuitions and not provable theorems. While I do think that we are likely to build goal-directed agents, I do not think the VNM theorem and similar arguments support that claim: they simply describe how a goal-directed agent should think.

However, talks like AI Alignment: Why It’s Hard, and Where to Start and posts like Coherent decisions imply consistent utilities seem to claim that "VNM and similar theorems" implies "goal-directed agents". While there has been some disagreement over whether this claim is actually present, it doesn't really matter -- readers come away with that impression. I see this post as correcting that claim; it would have been extremely useful for me to read this post a little over two years ago, and anecdotally I have heard that others have found it useful as well.

I am somewhat worried that readers who read this post in isolation will get the wrong impression, since it really was meant as part of the sequence. For example, I think Brangus' comment post is proposing an interpretation of "goal-directedness" that I proposed and argued against in the previous post (see also my response, which mostly quotes the previous post). Similarly, I sometimes hear the counterargument that there will be economic pressures towards goal-directed AI, even though this position is compatible with the post and addressed in the next post. I'm not sure how to solve this though, without just having both the previous and next posts appended to this post. (Part of the problem is that different people have different responses to the post, so it's hard to address all of them without adding a ton of words.) ETA: Perhaps adding the thoughts in this comment?

+1, I would have written my own review, but I think I basically just agree with everything in this one (and to the extent I wanted to further elaborate on the post, I've already done so here).

Rohin Shah

(This comment provides more intuition pumps for why it is invalid to argue "math implies AI risk". This is not a controversial point -- the critical review agrees that this is true -- but I figured it was worth writing down for anyone who might still find it confusing, or feel like my argument in the post is "too clever".)

It should seem really weird to you on a gut level to hear the claim that the VNM theorem, and only the VNM theorem, implies that AI systems would kill us all. Like, really? From just the assumption that we can't steal resources from the AI system with certainty [1], we can somehow infer that the AI must kill us all? Just by knowing that the AI system calculates the value of uncertain cases by averaging the values of outcomes based on their probabilities [2], we can infer that the AI system will take over the world?

But it's even worse than that. Consider the following hopefully-obvious claims:

  • The argument for AI risk should still apply if the universe is deterministic.
  • The argument for AI risk should still apply if the agent is made more intelligent.

If you believe that, then you should believe that the argument for AI risk should also work in a deterministic universe in which the AI can perfectly predict exactly what the universe does. However, in such a universe, the VNM theorem is nearly contentless -- the AI has no need of probability, and most of the VNM axioms are irrelevant. All you get with the VNM theorem in such a universe is that the AI's ordering over outcomes is transitive: If it chooses A over B and B over C, then it also chooses A over C. Do you really think that just from transitivity you can argue for AI x-risk? Something must have gone wrong somewhere.


I think the way to think about the VNM theorem is to see it as telling you how to compactly describe choice-procedures.

Suppose there are N possible outcomes in the world. Then one way to describe a choice-procedure is to describe, for all N(N-1)/2 pairs of outcomes, which outcome the choice-procedure chooses. This description has size O(N^2).

If you assume that the choice-procedure is transitive (choosing A over B and choosing B over C implies choosing A over C), then you can do better: you can provide a ranking of the options (e.g. B, A, C). This description has size O(N).

The VNM theorem deals with the case where you introduce lotteries over outcomes, e.g. a 50% chance of A, 20% chance of B, and 30% chance of C, and now you have to choose between lotteries. While there were only N outcomes, there are uncountably infinitely many lotteries, so simply writing down what the choice-procedure does in all cases would require an uncountably infinitely large description.

The VNM theorem says that if the choice-procedure satisfies a few intuitive axioms, then you can still have an O(N) size description, called a utility function. This function assigns a number to each of the N outcomes (hence the O(N) size description) [3]. Then, to compute what the choice-procedure would say for a pair of lotteries, you simply compute the expected utility for each lottery, and say that the choice-procedure would choose the one that is higher.

Notably, the utility function can be arbitrarily complicated, in the sense that it can assign any number to each outcome, independently of all the other outcomes. People then impose other conditions, like "utility must be monotonically increasing in the amount of money you have", and get stronger conclusions, but these are not implied by the VNM theorem. Ultimately the VNM theorem is a representation theorem telling you how to compactly represent a choice-procedure.
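As a rough illustration of the "compact description" reading above, here is a minimal Python sketch. The outcome names and utility values are hypothetical, chosen only so that the ranking comes out as B, A, C like the example ranking in the text; nothing here is specific to any real system.

```python
from itertools import combinations

# Hypothetical utilities for N = 3 outcomes (illustrative numbers only).
utility = {"A": 1.0, "B": 3.0, "C": 0.0}

def expected_utility(lottery, utility):
    """Probability-weighted average utility of a lottery {outcome: probability}."""
    return sum(p * utility[outcome] for outcome, p in lottery.items())

def choose(lottery_1, lottery_2, utility):
    """Return the lottery with higher expected utility (ties go to the first)."""
    if expected_utility(lottery_1, utility) >= expected_utility(lottery_2, utility):
        return lottery_1
    return lottery_2

# O(N^2) description: an explicit choice for each of the N(N-1)/2 pairs.
pairwise_table = {
    (x, y): x if utility[x] >= utility[y] else y
    for x, y in combinations(utility, 2)
}

# O(N) description under transitivity: a single ranking of the outcomes.
ranking = sorted(utility, key=utility.get, reverse=True)

# Lotteries: the O(N) utility table still settles every comparison.
mixed = {"A": 0.5, "B": 0.2, "C": 0.3}  # the lottery used as an example in the text
sure_a = {"A": 1.0}
preferred = choose(mixed, sure_a, utility)
```

Note that the three numbers in `utility` could be replaced by any other numbers without violating anything the theorem says, which is the sense in which the representation puts no constraint on how particular outcomes are ranked.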

It seems to me that AI risk is pretty straightforwardly about how the choice-procedures that we build rank particular outcomes, as opposed to different lotteries over outcomes. The VNM theorem / axioms say ~nothing about that, so you shouldn't expect it to add anything to the argument for AI risk.


  1. The VNM axioms are often justified on the basis that if you don't follow them, you can be Dutch-booked: you can be presented with a series of situations where you are guaranteed to lose utility relative to what you could have done. So on this view, we have "no Dutch booking" implies "VNM axioms" implies "AI risk". ↩︎

  2. The conclusion of the VNM theorem is that you must maximize expected utility, which means that your "better-than" relation is done by averaging the utilities of outcomes weighted by their probabilities, and then using the normal "better-than" relation on numbers (i.e. higher numbers are better than lower numbers). ↩︎

  3. Technically, since each of the numbers is a real number, it still requires infinite memory to write down this description, but we'll ignore that technicality. ↩︎
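The Dutch-book justification mentioned in footnote 1 can be sketched with a toy money pump. This is only an illustration under assumed, hypothetical preferences: an agent with a cyclic "chosen-over" relation pays a small fee for each swap it prefers and ends up holding the same item with less money, while a transitive agent offered the same swaps loses nothing.

```python
def run_money_pump(start_item, offers, prefers, fee, money):
    """Offer a sequence of swaps; the agent pays `fee` for every swap it prefers.

    `prefers` is a set of pairs (x, y) meaning "x is chosen over y".
    """
    item = start_item
    for offered in offers:
        if (offered, item) in prefers:
            item = offered
            money -= fee
    return item, money

# Hypothetical cyclic preferences: A over B, B over C, C over A.
cyclic = {("A", "B"), ("B", "C"), ("C", "A")}
# Hypothetical transitive preferences: A over B, B over C, A over C.
transitive = {("A", "B"), ("B", "C"), ("A", "C")}

offers = ["C", "B", "A"]
pumped = run_money_pump("A", offers, cyclic, fee=1, money=10)
unpumped = run_money_pump("A", offers, transitive, fee=1, money=10)
```

The sketch treats "A now" and "A three trades later" as the same outcome. That identification is an assumption of the toy model, not something the behavior itself establishes.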

I finally read Rational preference: Decision theory as a theory of practical rationality, and it basically has all of the technical content of this post; I'd recommend it as a more in-depth version of this post. (Unfortunately I don't remember who recommended it to me, whoever you are, thanks!) Some notable highlights:

It is, I think, very misleading to think of decision theory as telling you to maximize your expected utility. If you don't obey its axioms, then there is no utility function constructable for you to maximize the expected value of. If you do obey the axioms, then your expected utility is always maximized, so the advice is unnecessary. The advice, 'Maximize Expected Utility' misleadingly suggests that there is some quantity, definable and discoverable independent of the formal construction of your utility function, that you are supposed to be maximizing. That is why I am not going to dwell on the rational norm, Maximize Expected Utility! Instead, I will dwell on the rational norm, Attend to the Axioms!

Very much in the spirit of the parent comment.

Unfortunately, the Fine Individuation solution raises another problem, one that looks deeper than the original problems. The problem is that Fine Individuation threatens to trivialize the axioms.

(Fine Individuation is basically the same thing as moving from preferences-over-snapshots to preferences-over-universe-histories.)

All it means is that a person could not be convicted of intransitive preferences merely by discovering things about her practical preferences. [...] There is no possible behavior that could reveal an impractical preference

His solution is to ask people whether they were finely individuating, and if they weren't, then you can conclude they are inconsistent. This is kinda sorta acknowledging that you can't notice inconsistency from behavior ("practical preferences" aka "choices that could actually be made"), though that's a somewhat inaccurate summary.

There is no way that anyone could reveal intransitive preferences through her behavior. Suppose on one occasion she chooses X when the alternative was Y, on another she chooses Y when the alternative was Z, and on a third she chooses Z when the alternative was X. But that is nonsense; there is no saying that the Y she faced in the first occasion was the same as the Y she faced on the second. Those alternatives could not have been just the same, even leaving aside the possibility of individuating them by reference to what else could have been chosen. They will be alternatives at different times, and they will have other potentially significant differentia.

Basically making the same point with the same sort of construction as the OP.

Note that this starts from the assumption of goal-directed behavior and derives that the AI will be an EU maximizer along with the other convergent instrumental subgoals.

The result is actually stronger than that, I think: if the AI is goal-directed at least in part, then that part will (tend to) purge the non-goal directed behaviours and then follow the EU path.

I wonder if we could get theorems as to what kinds of minimal goal directed behaviour will result in the agent becoming a completely goal-directed agent.

Seems like it comes down to the definition of goal-directed. Omohundro uses a chess-playing AI as a motivating example, and intuitively, a chess-playing AI seems "fully goal-directed". But even as chess- and Go-playing AIs have become superhuman, and found creative plans humans can't find, we haven't seen any examples of them trying to e.g. kill other processes on your computer so they can have more computational resources and play a better game. A theory which can't explain these observations doesn't sound very useful.

Maybe this discussion is happening on the wrong level of abstraction. All abstractions are leaky, and abstractions like "intelligent", "goal-oriented", "creative plans", etc. are much leakier than typical computer science abstractions. An hour of looking at the source code is going to be worth five hours of philosophizing. The most valuable thing the AI safety community can do might be to produce a checklist for someone creating the software architecture or reading the source code for the first AGI, so they know what failure modes to look for.

A chess tree search algorithm would never hit upon killing other processes. An evolutionary chess-playing algorithm might learn to do that. It's not clear whether goal-directed is relevant to that distinction.

gwern

That's not very imaginative. Here's how a chess tree search algorithm - let's take AlphaZero for concreteness - could learn to kill other processes, even if it has no explicit action which corresponds to interaction with other processes and is apparently sandboxed (aside from the usual sidechannels like resource use). It's a variant of the evolutionary algorithm which learned to create a board so large that its competing GAs crashed/were killed while trying to deal with it (the Tic-tac-toe memory bomb). In this case, position evaluations can indirectly reveal that an exploration strategy caused enough memory use to trigger the OOM, killing rival processes, and freeing up resources for the tree search to get a higher win rate by more exploration:

  1. one of the main limits to tree evaluation is memory consumption, due to the exponential growth of breadth-first memory requirements (this is true regardless of whether an explicit tree or implicit hash-based representation is used); to avoid this, memory consumption is often limited to a fixed amount of memory or a mix of depth/breadth-first strategies are used to tame memory growth, even though this may not be optimal, as it may force premature stopping to expansion of the game tree (resorting to light/heavy playouts) or force too much exploitation depthwise along a few promising lines of play and too little exploration etc. (One of the criticisms of AlphaZero, incidentally, was that too little RAM was given to the standard chess engines to permit them to reach their best performance.)

  2. when a computer OS detects running out of memory, it'll usually invoke an 'OOM killer', which may or may not kill the program which makes the request which uses up the last of free memory

  3. so, it is possible that if a tree search algorithm exhausts memory (because the programmer didn't remember to include a hard limit, the hard limit turns out to be incorrect for the machine being trained on, the limit is defined wrong like in terms of max depth instead of total nodes, etc), it may not crash or be killed but other programs, using unknown & potentially large percentages of memory, may be killed instead to free up memory. (I've observed this on Linux, to my frustration, where the programs I don't want killed get killed by the OOM reaper instead of the haywire program.)

  4. once other programs are killed to free up memory, all that memory is now available for the tree search algorithm to use; using this memory will increase performance by allowing more of the game tree to be explicitly evaluated, either wider or deeper.

  5. in AlphaZero, the choice of widening or deepening is inherently controlled by the NN, which is trained to predict the result of the final values of each position and increase win probabilities.

  6. reaching a position (which can be recognized by its additional complexity, indicating it lies at a certain additional depth in the tree and thus indirectly reveals how much memory is being used by the NN's cumulative exploration) which triggers an OOM killing other programs will result in more accurate position evaluations, leading to higher values/higher win probability; so it will reinforce a strategy where it learns to aggressively widen early in the game to exhaust memory, waits for an OOM to happen, and then in the rest of the game proceeds to explore more aggressively (rather than depth-first exploit) given the new memory.

    (Depending on the exact details of how the tree expansion & backups are done, it's possible that the AlphaZero NN couldn't observe the benefits of wide-then-deep - it might just look like noise in value estimates - but there are expert iteration variants where the NN directly controls the tree expansion rather than merely providing value estimates for the MCTS algorithm to explore using, and those should be able to observe indirect benefits of exploration strategies over a game.)

At no point does it interact directly with other processes, or even know that they exist; it just implicitly learns that expanding a decision tree in a particular wide-then-deep fashion leads to better evaluations more consistent with the true value and/or end-game result (because of side-effects leading to increased resource consumption leading to better performance). And that's how a tree-search algorithm can hit upon killing other processes.

This story seems to reinforce my "leaky abstraction" point. The story hinges on nitty gritty details of how the AI is implemented and how the operating system manages resources. There's no obvious usefulness in proving theorems and trying to make grand statements about utility maximizers, optimizers, goal-oriented systems, etc. I expect that by default, a programmer who tried to apply a theorem of Stuart's to your chess system would not think to consider these details related to memory management (formally verifying a program's source code says nothing about memory management if that happens lower in the stack). But if they did think to consider these details of memory management, having no access to Stuart's theorem, they'd still have a good shot at preventing the problem (by changing the way the NN controls tree expansion or simply capping the program's memory use).

Leaky abstractions are a common cause of computer security problems also. I think this is a big reason why crypto proofs fail so often. A proof is a tower on top of your existing set of abstractions; it's fairly useless if your existing abstractions are faulty.
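To make the "cap the program's memory use" fix concrete, here is a minimal sketch of a breadth-first expansion with a hard budget on total stored nodes, as opposed to a limit stated only in terms of depth (one of the misconfigurations mentioned in the story above). The function names and the toy binary game are hypothetical; this is not how AlphaZero or any real engine manages memory.

```python
from collections import deque

def expand_breadth_first(root, children, max_nodes):
    """Expand a tree breadth-first, never storing more than `max_nodes` nodes."""
    visited = [root]
    frontier = deque([root])
    while frontier:
        node = frontier.popleft()
        for child in children(node):
            if len(visited) >= max_nodes:
                return visited  # hard budget on total nodes reached
            visited.append(child)
            frontier.append(child)
    return visited

def binary_children(node, max_depth=20):
    """Hypothetical toy game: each position above depth 20 has two successors."""
    depth, index = node
    if depth >= max_depth:
        return []
    return [(depth + 1, 2 * index), (depth + 1, 2 * index + 1)]

# A depth limit of 20 alone would admit 2**21 - 1 nodes in this toy tree;
# the node budget bounds memory directly, whatever the branching factor is.
nodes = expand_breadth_first((0, 0), binary_children, max_nodes=1000)
deepest = max(depth for depth, _ in nodes)
```

An OS-level limit (for example Python's `resource.setrlimit` on Unix) would be a complementary safeguard, since an in-program budget only counts what the program itself tracks.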

What I like about this thread, and why I'm worried about people reading this post and updating away from thinking that sufficiently powerful processes that don't look like what we think of as dangerous are safe, is that it helps make clear that Rohin seems to be making an argument that hinges on leaky or even confused abstractions. I'm not sure any of the rest of us have much better abstractions to offer that aren't leaky, and I want to encourage what Rohin does in this post of thinking through the implications of the abstractions he's using to draw conclusions that are specific enough to be critiqued, because through a process like this we can get a clearer idea of where we have shared confusion and then work to resolve it.

The argument Rohin is responding to also rests on leaky abstractions, I would argue.

At the end of the day sometimes the best approach, if there aren't any good abstractions in a particular domain, is to set aside your abstractions and look directly at the object level.

If there is a fairly simple, robust FAI design out there, and we rule out the swath of design space it resides in based on an incorrect inference from a leaky abstraction, that would be a bad outcome.

we haven't seen any examples of them trying to e.g. kill other processes on your computer so they can have more computational resources and play a better game.

It's a good point, but... we won't see examples like this if the algorithms that produce this kind of behavior take longer to produce the behavior than the amount of time we've let them run.

I think there are good reasons to view the effective horizon of different agents as part of their utility function. Then I think a lot of the risk we incur is because humans act as if we have short effective horizons. But I don't think we *actually* do have such short horizons. In other words, our revealed preferences are more myopic than our considered preferences.

Now, one can say that this actually means we don't care that much about the long-term future, but I don't agree with that conclusion; I think we *do* care (at least, I do), but aren't very good at acting as if we(/I) do.

Anyways, if you buy this line of argument about effective horizons, then you should be worried that we will easily be outcompeted by some process/entity that behaves as if it has a much longer effective horizon, so long as it also finds a way to make a "positive-sum" trade with us (e.g. "I take everything after 2200 A.D., and in the meanwhile, I give you whatever you want").

===========================

I view the chess-playing algorithm as either *not* fully goal directed, or somehow fundamentally limited in its understanding of the world, or level of rationality. Intuitively, it seems easy to make agents that are ignorant or indifferent(/"irrational") in such a way that they will only seek to optimize things within the ontology we've provided (in this case, of the chess game), instead of outside (i.e. seizing additional compute). However, our understanding of such things doesn't seem mature... at least I'm not satisfied with my current understanding. I think Stuart Armstrong and Tom Everitt are the main people who've done work in this area, and their work on this stuff seems quite underappreciated.

Intuitively, it seems easy to make agents that are ignorant or indifferent(/"irrational") in such a way that they will only seek to optimize things within the ontology we've provided (in this case, of the chess game), instead of outside (i.e. seizing additional compute)

It isn't obvious to me that specifying the ontology is significantly easier than specifying the right objective. I have an intuition that ontological approaches are doomed. As a simple case, I'm not aware of any fundamental progress on building something that actually maximizes the number of diamonds in the physical universe, nor do I think that such a thing has a natural, simple description.

Diamond maximization seems pretty different from winning at chess. In the chess case, we've essentially hardcoded a particular ontology related to a particular imaginary universe, the chess universe. This isn't a feasible approach for the diamond problem.

In any case, the reason this discussion is relevant, from my perspective, is because it's related to the question of whether you could have a system which constructs its own superintelligent understanding of the world (e.g. using self-supervised learning), and engages in self-improvement (using some process analogous to e.g. neural architecture search) without being goal-directed. If so, you could presumably pinpoint human values/corrigibility/etc. in the model of the world that was created (using labeled data, active learning, etc.) and use that as an agent's reward function. (Or just use the self-supervised learning system as a tool to help with FAI research/make a pivotal act/etc.)

It feels to me as though the thing I described in the previous paragraph is amenable to the same general kind of ontological whitelisting approach that we use for chess AIs. (To put it another way, I suspect most insights about meta-learning can be encoded without referring to a lot of object level content about the particular universe you find yourself building a model of.) I do think there are some safety issues with the approach I described, but they seem fairly possible to overcome.

I strongly agree.

I should've been more clear.

I think this is a situation where our intuition is likely wrong.

This sort of thing is why I say "I'm not satisfied with my current understanding".

we won't see examples like this if the algorithms that produce this kind of behavior take longer to produce the behavior than the amount of time we've let them run.

Are you suggesting that Deep Blue would behave in this way if we gave it enough time to run? If so, can you explain the mechanism by which this would occur?

I think Stuart Armstrong and Tom Everitt are the main people who've done work in this area, and their work on this stuff seems quite underappreciated.

Can you share links?

I don't know how Deep Blue worked. My impression was that it doesn't use learning, so the answer would be no.

A starting point for Tom and Stuart's works: https://scholar.google.com/scholar?rlz=1C1CHBF_enCA818CA819&um=1&ie=UTF-8&lr&cites=1927115341710450492


This seems right (though I have some apprehension around talking about "parts" of an AI). From the perspective of proving a theorem, it seems like you need some sort of assumption on what the rest of the AI looks like, so that you can say something like "the goal-directed part will outcompete the other parts". Though perhaps you could try defining goal-directed behavior as the sort of behavior that tends to grow and outcompete things -- this could be a useful definition? I'm not sure.

Actually, no matter what the policy is, we can view the agent as an EU maximizer. The construction is simple: the agent can be thought of as optimizing the utility function U, where U(h, a) = 1 if the policy would take action a given history h, else 0. Here I’m assuming that U is defined over histories that are composed of states/observations and actions.
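The quoted construction can be written out in a few lines. This is a minimal sketch assuming a deterministic policy and a finite action set; the twitching policy, the action names, and the sample histories are hypothetical stand-ins.

```python
def make_utility(policy):
    """Build U(h, a) = 1 if `policy` would take action `a` after history `h`, else 0."""
    def utility(history, action):
        return 1 if policy(history) == action else 0
    return utility

def eu_maximizing_action(utility, history, actions):
    """Pick the action with the highest utility given the history so far."""
    return max(actions, key=lambda action: utility(history, action))

ACTIONS = ["twitch", "move", "wait"]

def twitch_policy(history):
    """Hypothetical policy: always twitch, whatever has been observed."""
    return "twitch"

def cycling_policy(history):
    """Hypothetical policy: cycle through the actions based on history length."""
    return ACTIONS[len(history) % len(ACTIONS)]

histories = [(), ("obs1",), ("obs1", "move", "obs2")]

# For either policy, maximizing its constructed U reproduces the policy exactly.
agreement = all(
    eu_maximizing_action(make_utility(policy), history, ACTIONS) == policy(history)
    for policy in (twitch_policy, cycling_policy)
    for history in histories
)
```

Because `make_utility` accepts any policy at all, being describable as maximizing some U of this type places no restriction on behavior.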

This is not the type signature for a utility function that matters for the coherence arguments (by which I don't mean VNM - see this comment). It does often fit the type signature in the way those arguments are formulated/formalised, but intuitively, it's not getting at the point of the theorems. I suggest you consider utility functions defined as functions of the state of the world only, not including the action taken. (Yes I know actions could be logged in the world state, the agent is embedded in the state, etc. - this is all irrelevant for the point I'm trying to make - I'm suggesting to consider the setup where there's a Cartesian boundary, an unknown transition function, and environment states that don't contain a log of actions.) I don't think the above kind of construction works in that setting. I think that's the kind of setting it's better to focus on.

Have you seen this post, which looks at the setting you mentioned?

From my perspective, I want to know why it makes sense to assume that the AI system will have preferences over world states, before I start reasoning about that scenario. And there are reasons to expect something along these lines! I talk about some of them in the next post in this sequence! But I think once you've incorporated some additional reason like "humans will want goal-directed agents" or "agents optimized to do some tasks we write down will hit upon a core of general intelligence", then I'm already on board that you get goal-directed behavior, and I'm not interested in the construction in this post any more. The only point of the construction in this post is to demonstrate that you need this additional reason.

One additional source that I found helpful to look at is the paper "Formalizing Convergent Instrumental Goals" by Tsvi Benson-Tilsen and Nate Soares, which tries to formalize Omohundro's instrumental convergence idea using math. I read the paper quickly and skipped the proofs, so I might have misunderstood something, but here is my current interpretation.

The key assumptions seem to appear in the statement of Theorem 2; these assumptions state that using additional resources will allow the agent to implement a strategy that gives it strictly higher utility (compared to the utility it could achieve if it didn't make use of the additional resources). Therefore, any optimal strategy will make use of those additional resources (killing humans in the process). In the Bit Universe example given in the paper, if the agent doesn't terminally care what happens in some particular region h (I guess they chose this letter because it's supposed to represent where humans are), but that region contains resources that can be burned to increase utility in other regions, the agent will burn those resources.

Both Rohin's and Jessica's twitching robot examples seem to violate these assumptions (if we were to translate them into the formalism used in the paper), because the robot cannot make use of additional resources to obtain a higher utility.

For me, the upshot of looking at this paper is something like:

  • MIRI people don't seem to be arguing that expected utility maximization alone implies catastrophe.
  • There are some additional conditions that, when taken together with expected utility maximization, seem to give a pretty good argument for catastrophe.
  • These additional conditions don't seem to have been argued for (or at least, this specific paper just assumes them).

See also Alex Turner's work on formalizing instrumentally convergent goals, and his walkthrough of the MIRI paper.

Can you say more about Alex Turner's formalism? For example, are there conditions in his paper or post similar to the conditions I named for Theorem 2 above? If so, what do they say and where can I find them in the paper or post? If not, how does the paper avoid the twitching robot from seeking convergent instrumental goals?

Sure, I can say more about Alex Turner's formalism! The theorems show that, with respect to some distribution of reward functions and in the limit of farsightedness (as the discount rate goes to 1), the optimal policies under this distribution tend to steer towards parts of the future which give the agent access to more terminal states.

Of course, there exist reward functions for which twitching or doing nothing is optimal. The theorems say that most reward functions aren't like this.
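
As a rough illustration of this claim (a toy sketch, not Alex Turner's actual formalism), consider a deterministic environment assumed here purely for illustration: from the start state, one action leads to a single terminal state, and another leads to a hub from which three terminal states are reachable. If each terminal state's reward is sampled i.i.d. from a uniform distribution and the agent is fully farsighted, the optimal policy heads for the hub exactly when the best terminal state lies behind it. The function name `fraction_preferring_hub`, its parameters, and the environment itself are hypothetical.

```python
import random


def fraction_preferring_hub(n_hub_terminals=3, n_samples=100_000, seed=0):
    """Estimate how often an optimal farsighted agent heads for the hub.

    Toy environment (an assumption for illustration, not from the paper):
    - action A leads to one terminal state,
    - action B leads to a hub with `n_hub_terminals` terminal states.
    Terminal rewards are drawn i.i.d. from Uniform(0, 1).
    """
    rng = random.Random(seed)
    hub_wins = 0
    for _ in range(n_samples):
        lone_reward = rng.random()
        hub_rewards = [rng.random() for _ in range(n_hub_terminals)]
        # A farsighted optimal agent picks the branch holding the best terminal state.
        if max(hub_rewards) > lone_reward:
            hub_wins += 1
    return hub_wins / n_samples


if __name__ == "__main__":
    print(fraction_preferring_hub())
```

By symmetry, each of the four terminal states is equally likely to carry the highest reward, so the estimate should land near 3/4. Reward functions for which the lone terminal state is optimal exist, but under this sampling assumption they are the minority.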

I encourage you to read the post and/or paper; it's quite different from the one you cited in that it shows how instrumental convergence and power-seeking arise from first principles. Rather than assuming "resources" exist, whatever that means, resource acquisition is explained as a special case of power-seeking.

ETA: Also, my recently completed sequence focuses on formally explaining and deeply understanding why catastrophic behavior seems to be incentivized. In particular, see The Catastrophic Convergence Conjecture.

I read the post and parts of the paper. Here is my understanding: conditions similar to those in Theorem 2 above don't exist, because Alex's paper doesn't take an arbitrary utility function and prove instrumental convergence; instead, the idea is to set the rewards for the MDP randomly (by sampling i.i.d. from some distribution) and then show that in most cases, the agent seeks "power" (states which allow the agent to obtain high rewards in the future). So it avoids the twitching robot not by saying that it can't make use of additional resources, but by saying that the twitching robot has an atypical reward function. So even though there aren't conditions similar to those in Theorem 2, there are still conditions analogous to them (in the structure of the argument "expected utility/reward maximization + X implies catastrophe"), namely X = "the reward function is typical". Does that sound right?

Writing this comment reminded me of Oliver's comment where X = "agent wasn't specifically optimized away from goal-directedness".

because Alex's paper doesn't take an arbitrary utility function and prove instrumental convergence;

That's right; that would prove too much.

namely X = "the reward function is typical". Does that sound right?

Yeah, although note that I proved asymptotic instrumental convergence for typical functions under iid reward sampling assumptions at each state, so I think there's wiggle room to say "but the reward functions we provide aren't drawn from this distribution!". I personally think this doesn't matter much, because the work still tells us a lot about the underlying optimization pressures.

The result is also true in the general case of an arbitrary reward function distribution; you just don't know in advance which terminal states the distribution prefers.

Yeah, that upshot sounds pretty reasonable to me. (Though idk if it's reasonable to think of that as endorsed by "all of MIRI".)

Therefore, any optimal strategy will make use of those additional resources (killing humans in the process).

Note that this requires the utility function to be completely indifferent to humans (or actively against them).

An agent that constantly twitches could still be a threat if it were trying to maximise the probability that it would actually twitch in the future. For example, if it were to break down, it wouldn't be able to twitch, so it might want to gain control of resources.

I don't suppose you could clarify exactly how this agent that is twitching is defined? In particular, how does its utility accumulate over time? Do you get 1 utility for each point in time where you twitch, and is your total utility the undiscounted sum of these utilities?

I don't suppose you could clarify exactly how this agent that is twitching is defined? In particular, how does its utility accumulate over time? Do you get 1 utility for each point in time where you twitch, and is your total utility the undiscounted sum of these utilities?

I am not defining this agent using a utility function. It turns out that because of coherence arguments and the particular construction I gave, I can view the agent as maximizing some expected utility.

I like Gurkenglas's suggestion of a random number generator hooked up to motor controls; let's go with that.

An agent that constantly twitches could still be a threat if it were trying to maximise the probability that it would actually twitch in the future. For example, if it were to break down, it wouldn't be able to twitch, so it might want to gain control of resources.

Yeah, but it's not trying to maximize that probability. I agree that a superintelligent agent that is trying to maximize the amount of twitching it does would be a threat, possibly by acquiring resources. But motor controls hooked up to random numbers certainly won't do that.

If your robot powered by random numbers breaks down, it indeed will not twitch in the future. That's fine, clearly it must have been maximizing a utility function that assigned utility 1 to it breaking at that exact moment in time. Jessica's construction below would also work, but it's specific to the case where you take the same action across all histories.
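
A simplified sketch of this kind of after-the-fact rationalization, assuming a finite horizon and a small discrete action set (both assumptions made here for illustration): the robot's behavior comes from a seeded random number generator, and the utility function is built afterwards to assign 1 to exactly the action sequence the robot produced and 0 to everything else. The names `random_robot`, `rationalizing_utility`, and the action labels are hypothetical.

```python
import itertools
import random

ACTIONS = ("twitch_left", "twitch_right", "stay_still")  # hypothetical action set


def random_robot(horizon, seed):
    """Motor controls hooked up to a random number generator."""
    rng = random.Random(seed)
    return tuple(rng.choice(ACTIONS) for _ in range(horizon))


def rationalizing_utility(observed):
    """Utility that is 1 on the observed action sequence and 0 everywhere else."""
    def utility(action_sequence):
        return 1 if tuple(action_sequence) == observed else 0
    return utility


if __name__ == "__main__":
    behavior = random_robot(horizon=4, seed=0)
    u = rationalizing_utility(behavior)
    # Brute-force search over every possible action sequence of the same length.
    best = max(itertools.product(ACTIONS, repeat=4), key=u)
    print(best == behavior)  # prints True: the random behavior is the unique maximizer
```

The point of the sketch is that nothing in the robot computes a utility; the utility function is constructed from the behavior, so "maximizes some utility function" places no constraint on what the robot does.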

Presumably, it is a random number generator hooked up to motor controls. There is no explicit calculation of utilities that tells it to twitch.

It can maximize the utility function given by the discounted sum, over time steps t, of 1 if I take the twitch action in time step t and 0 otherwise. In a standard POMDP setting this always takes the twitch action.
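
One way to make a utility function of this kind concrete, as a sketch under stated assumptions: give reward 1 at each time step where the twitch action is taken, and discount geometrically. The discount factor of 1/3 below is an assumption, chosen so that a twitch now outweighs all later twitches combined; it is not taken from the comment. A brute-force check over a finite horizon then confirms that always twitching is the unique optimum. The names `utility` and `best_sequences` are hypothetical.

```python
import itertools
from fractions import Fraction

DISCOUNT = Fraction(1, 3)  # assumed value; any positive factor below 1/2 gives the same comparison
ACTIONS = ("twitch", "other")


def utility(action_sequence, discount=DISCOUNT):
    """Discounted count of the time steps at which the twitch action is taken."""
    return sum(discount ** t for t, a in enumerate(action_sequence) if a == "twitch")


def best_sequences(horizon):
    """All action sequences of the given length that maximize the utility."""
    sequences = list(itertools.product(ACTIONS, repeat=horizon))
    top = max(utility(s) for s in sequences)
    return [s for s in sequences if utility(s) == top]


if __name__ == "__main__":
    print(best_sequences(5))
    # Twitching only now versus twitching at every later step but not now.
    now_only = utility(("twitch",) + ("other",) * 9)
    later_only = utility(("other",) + ("twitch",) * 9)
    print(now_only > later_only)
```

Exact fractions are used instead of floats so that the equality test in `best_sequences` is not affected by rounding.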

Oh that's interesting, so you've chosen a discount rate such that twitching now is always more important than twitching for the rest of time. And presumably it can't both twitch AND take other actions in the world in the same time-step, as that'd make it an immediate threat.

Such a utility maximiser might become dangerous if it were broken in such a way that it wasn't allowed to take the twitch action for a long period of time including the current time step, in which case it would take whatever actions would allow itself to twitch again as soon as possible. I wonder how dangerous such a robot would be?

On one hand, the goal of resuming twitching as soon as possible would seem to require only a limited amount of power to be accumulated. On the other hand, any resources accumulated in this process would then be deployed to maximising its utility. For example, it might have managed to gain control of a repair drone, and this could now operate independently even if the original could now only twitch and nothing else. Even then, it'd likely be less of a threat: if the repair drone tried to leave to do anything, there would be a chance that the original robot would break down and the repair would be delayed. Then again, perhaps the repair drone can hack other systems without moving. This might result in resource accumulation.

In a POMDP there is no such thing as not being able to take a particular action at a particular time. You might have some other formalization of agents in mind; my guess is that, if this formalization is made explicit, there will be an obvious utility function that rationalizes the "always twitch" behavior.

POMDP is an abstraction. Real agents can be interfered with.

AI agents are designed using an agency abstraction. The notion of an AI "having a utility function" itself only has meaning relative to an agency abstraction. There is no such thing as a "real agent" independent of some concept of agency.

All the agency abstractions I know of permit taking one of some specified set of actions at each time step, which can easily be defined to include the "twitch" action. If you disagree with my claim, you can try formalizing a natural one that doesn't have this property. (There are trivial ways to restrict the set of actions, but then you could use a utility function to rationalize "twitch if you can, take the lexicographically first action you can otherwise")
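
The parenthetical can be sketched as follows, assuming (for illustration only) that actions are named by strings and that the environment hands the agent a set of currently available actions at each step. A per-step utility ranks "twitch" above everything else and otherwise prefers lexicographically earlier names, so maximizing it reproduces "twitch if you can, take the lexicographically first action you can otherwise". The names `step_utility`, `choose`, and the action labels are hypothetical.

```python
ALL_ACTIONS = ("grab", "move", "twitch", "wait")  # hypothetical action names


def step_utility(action, all_actions=ALL_ACTIONS):
    """Twitch scores highest; other actions score higher the earlier they sort."""
    if action == "twitch":
        return len(all_actions) + 1
    return len(all_actions) - sorted(all_actions).index(action)


def choose(available, all_actions=ALL_ACTIONS):
    """Pick the available action with the highest per-step utility."""
    return max(available, key=lambda a: step_utility(a, all_actions))


if __name__ == "__main__":
    print(choose(["wait", "twitch", "grab"]))  # twitch is available, so it is chosen
    print(choose(["wait", "move"]))            # twitch is blocked, so the earliest name wins
```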

How do you imagine the real agent working? Can you describe the process by which it chooses actions?

Presumably twitching requires sending a signal to a motor control, and the connection here can be broken.

Sorry, I wasn't clear enough. What is the process which both:

  • Sends the signal to the motor control to twitch, and
  • Infers that it could break or be interfered with, and sends signals to the motor controls that cause it to be in a universe-state where it is less likely to break or be interfered with?

I claim that for any such reasonable process, if there is a notion of a "goal" in this process, I can create a goal that rationalizes the "always-twitch" policy. If