Comment on Coherence arguments do not imply goal directed behavior

by Ronny · 5 min read · 6th Dec 2019 · 8 comments


Decision Theory
Frontpage
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

In "Coherence arguments do not imply goal directed behavior", Rohin Shah argues that a system's merely being at all model-able as an EU maximizer does not imply that it has "goal directed behavior". The argument, as I understand it, runs something like this:

1: Any behavior whatsoever maximizes some utility function.

2: Not all behaviors are goal directed.

Conclusion: A system's behavior maximizing some utility function does not imply that its behavior is goal directed.

I think this argument is technically sound, but misses an important connection between VNM coherence and goal directed behavior.

Shah does not give a formal definition of "goal directed behavior", but it is basically what you intuitively think it is. Goal directed behavior is the sort of behavior that seems like it is aimed at accomplishing some goal. Shah correctly points out that what makes a system dangerous is its being goal directed and good at accomplishing its goal, not merely its being good at maximizing some utility function. Every object in the universe perfectly maximizes the utility function that assigns 1 to all of the actual causal consequences of its behavior, and 0 to any other causal consequences its behavior might have had.

Shah seems to suggest that being model-able as an EU maximizer is not very closely related to goal directed behavior. Sure, having goal directed behavior implies that you are model-able as an EU maximizer, but so does having any kind of behavior whatsoever.

The implication does not run the other way, according to Shah: something's being an EU maximizer for some utility function, even a perfect one, does not imply that its behavior is goal directed. I think this is right, but I will argue that, nonetheless, if it is a good idea for you to model an agent as an EU maximizer, then its behavior will seem goal directed (at least to you).

Shah gives the example of a twitching robot. This is not a robot that maximizes the probability of its twitching, or that wants to twitch as long as possible. Shah agrees that a robot that maximized those things would be dangerous. Rather, this is a robot that just twitches. Such a robot maximizes a utility function that assigns 1 to whatever the actual consequences of its actual twitching behaviors are, and 0 to anything else that the consequences might have been.
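
The degenerate utility function in play here can be written out explicitly. Writing \(h^*\) for the actual history of consequences produced by the robot's twitching, a minimal formalization of the description above (the notation is mine, not the post's) is:

```latex
U(h) =
\begin{cases}
1 & \text{if } h = h^* \\
0 & \text{otherwise}
\end{cases}
```

The robot's actual behavior produces \(h^*\) by definition, so it scores 1, the maximum possible value. It is thus a perfect maximizer of \(U\), even though \(U\) places no constraint whatsoever on what the behavior looks like.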

This system is a perfect EU maximizer for that utility function, but it is not an optimization process for any utility function. For a system to be an optimization process it must be that it is more efficient to predict it by modeling it as an optimization process than by modeling it as a mechanical system. Another way to put it is that it must be a good idea for you to model it as an EU maximizer.

This might be true in two different ways. It might be more efficient in terms of time or compute. My predictions of the behavior when I model the system as an EU maximizer might not be as good as my predictions of the behavior when I model it as a mechanical system, but the reduced accuracy is worth it, because modeling the system mechanically would take me much longer or be otherwise costly. Think of predicting a chess playing program. Even though I could predict the next move by learning its source code and computing it by hand on paper, I would be better off in most contexts just thinking about what I would do in its circumstances if I were trying to win at chess.

Another related but distinct sense in which it might be more efficient is that modeling the system as an EU maximizer might allow me to compress its behavior more than modeling it as a mechanical system. Imagine if I had to send someone a python program that makes predictions about the behavior of the twitching robot. I could write a program that just prints "twitch" over and over again, or I could write a program that models the whole world and picks the behavior that best maximizes the expected value of a utility function that assigns 1 to whatever the actual consequences of the twitching are, and 0 to whatever else they might have been. I claim that the second program would be longer. It would not however allow the receiver of my message to predict the behavior of the robot any more accurately than a program that just prints "it twitches again" over and over.

Maybe the exact twitching pattern is complicated, or maybe it stops at some particular time, and in that case the first program would have to be more complicated, but as long as the twitching does not seem goal directed, I claim that a python program that predicts the robot's behavior by modeling the universe and the counterfactual consequences of different kinds of possible twitching will always be longer than one that predicts the twitching by exploiting regularities that follow from the robot's mechanical design. I think this might be what it means for a system to be goal directed.
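
The two candidate predictor programs can be sketched in Python (a toy illustration with hypothetical names, not code from the post): both predict the robot's next action equally well, but the EU-style predictor hauls along a world model and a utility function that buy it nothing.

```python
# Toy comparison of the two predictor programs for the twitching robot.
# All names here are hypothetical illustrations, not code from the post.

def mechanical_predictor(step):
    """Exploit the robot's mechanical regularity directly: it just twitches."""
    return "twitch"

def eu_predictor(step, world_model, utility):
    """Model the world, score each candidate action by the utility of its
    predicted consequences, and return the argmax -- far more machinery."""
    actions = ["twitch", "stay_still"]
    return max(actions, key=lambda action: utility(world_model(step, action)))

# Stand-in world model, and the degenerate utility from the post: 1 for the
# consequences of the actual twitching, 0 for anything else.
def world_model(step, action):
    return ("consequence_of", action)

def utility(outcome):
    return 1 if outcome == ("consequence_of", "twitch") else 0

# The elaborate predictor never out-predicts the two-line one.
for step in range(5):
    assert eu_predictor(step, world_model, utility) == mechanical_predictor(step)
```

The compression argument turns on the length gap between the two definitions, not on any gap in accuracy; for a genuinely goal directed system, the ordering of program lengths would reverse.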

(Might also be worth pointing out that knowing that there is a utility function which the twitching robot is a perfect optimizer relative to does not allow us to predict its behavior in advance. "It optimizes the utility function that assigns 1 to the consequences of its behavior and 0 to everything else" is a bad theory of the twitching robot in the same way that "the lady down the street is a witch; she did it" is a bad theory of anything.)

A system seems goal directed to you if the best way you have of predicting it is by modeling it as an EU maximizer with some particular utility function and credence function. (Actually, the particulars of the EU formalism might not be very relevant to what makes humans think of a system's behavior as goal directed. Its being a good idea to model the system as having something like preferences, and some sort of reasonably accurate model of the world that supports counterfactual reasoning, is probably good enough.) This conception of goal-directedness is somewhat awkward because the first notion of "efficiently model" is relative to your capacities and goals, and the second notion is relative to the programming language we choose, but I think it is basically right nonetheless. Luckily, we humans have relatively similar capacities and goals, and it can be shown that using the second notion of "efficiently model" we will only disagree about how agenty different systems are by at most some additive constant, regardless of what programming languages we choose.
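
The additive constant here is the invariance theorem from Kolmogorov complexity. If \(K_U(x)\) denotes the length of the shortest program in universal language \(U\) that outputs \(x\), then for any two universal languages \(U\) and \(V\) there is a constant \(c_{UV}\), depending only on the two languages and not on \(x\), such that:

```latex
\lvert K_U(x) - K_V(x) \rvert \le c_{UV}
```

So, measured by shortest-program length, two people's judgments of how agenty a given system is can differ by at most a fixed amount, whatever languages they pick.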

One argument that what it means for a system's behavior to seem goal directed to you is just for it to be best for you to model it as an EU maximizer is that if it were a better idea for you to model it some other way, that is probably how you would model it instead. This is why we do not model bottle caps as EU maximizers but do model chess programs as (something at least a lot like) EU maximizers. This is also why the twitching robot does not seem intelligent to us, absent other subsystems that we should model as EU maximizers, but that's a story for a different post.

I think we should expect most systems that it is a good idea for us to model as EU maximizers to pursue convergent instrumental goals like acquiring computational power, ensuring their survival, etc. If I know the utility function of an EU maximizer better than I know its specific behavior, often the best way for me to predict its behavior is by imagining what I would do in its circumstances if I had the same goal. Take a complicated utility function like the one that assigns 1 to whatever the actual consequences of the twitching robot's twitches are and 0 to anything else. Imagine that I did not have the utility function specified that way, which hides all of the complexity in "whatever the actual consequences are." Rather, imagine I had the utility function specified as an extremely specific description of the world that gets scored above all else, without reference to the actual twitching pattern of the robot. If maximizing that utility function were my goal, it would seem like a good idea to me to get more computational power for predicting the outcomes of my available actions, to make sure that I am not turned off prematurely, and to try to get as accurate a model of my environment as possible.

In conclusion, I agree with Shah that being able to model a system as an EU maximizer at all does not imply that its behavior is goal directed, but I think that sort of misses the point. If the best way for you to model a system is to model it as an EU maximizer, then its behavior will seem goal directed to you, and if the shortest program that predicts a system's behavior does so by modeling it as an EU maximizer, then its behavior will be goal directed (or at least, goal directed up to an additive constant). I think the best way for you to model most systems that are more intelligent than you will be to model them as EU maximizers, or something close, but again, that's a story for a different post.





8 comments

I think this formulation of goal-directedness is pretty similar to one I suggested in the post before the coherence arguments post (Intuitions about goal-directed behavior, section "Our understanding of the behavior"). I do think this is an important concept to explain our conception of goal-directedness, but I don't think it can be used as an argument for AI risk, because it proves too much. For example, for many people without technical expertise, the best model they have for a laptop is that it is pursuing some goal (at least, many of my relatives frequently anthropomorphize their laptops). Should they worry that their laptops are going to take over the world?

For a deeper response, I'd recommend Intuitions about goal-directed behavior. I'll quote some of the relevant parts here:

There is a general pattern in which as soon as we understand something, it becomes something lesser. As soon as we understand rainbows, they are relegated to the “dull catalogue of common things”. This suggests a somewhat cynical explanation of our concept of “intelligence”: an agent is considered intelligent if we do not know how to achieve the outcomes it does using the resources that it has (in which case our best model for that agent may be that it is pursuing some goal, reflecting our tendency to anthropomorphize). That is, our evaluation about intelligence is a statement about our epistemic state.
[... four examples ...]
To the extent that the Misspecified Goal argument relies on this intuition, the argument feels a lot weaker to me. If the Misspecified Goal argument rested entirely upon this intuition, then it would be asserting that because we are ignorant about what an intelligent agent would do, we should assume that it is optimizing a goal, which means that it is going to accumulate power and resources and lead to catastrophe. In other words, it is arguing that assuming that an agent is intelligent definitionally means that it will accumulate power and resources. This seems clearly wrong; it is possible in principle to have an intelligent agent that nonetheless does not accumulate power and resources.
Also, the argument is not saying that in practice most intelligent agents accumulate power and resources. It says that we have no better model to go off of other than “goal-directed”, and then pushes this model to extreme scenarios where we should have a lot more uncertainty.

See also the summary of that post:

“From the outside”, it seems like a goal-directed agent is characterized by the fact that we can predict the agent’s behavior in new situations by assuming that it is pursuing some goal, and as a result it acquires power and resources. This can be interpreted either as a statement about our epistemic state (we know so little about the agent that our best model is that it pursues a goal, even though this model is not very accurate or precise) or as a statement about the agent (predicting the behavior of the agent in new situations based on pursuit of a goal actually has very high precision and accuracy). These two views have very different implications on the validity of the Misspecified Goal argument for AI risk.

But also, even ignoring all of that, I see this post as compatible with my post. My goal was for people to premise their AI safety risk arguments on the concept of goal-directedness, rather than utility maximization, and this post does exactly that.

I do think this is an important concept to explain our conception of goal-directedness, but I don't think it can be used as an argument for AI risk, because it proves too much. For example, for many people without technical expertise, the best model they have for a laptop is that it is pursuing some goal (at least, many of my relatives frequently anthropomorphize their laptops).

This definition is supposed to also explain why a mouse has agentic behavior, and I would consider it a failure of the definition if it implied that mice are dangerous. I think a system becomes more dangerous as the optimization power of your best model of it as an optimizer increases.

NB: I've not made a post about this point, but your thoughts made me think of it, so I'll bring it up here. Sorry if I left a comment elsewhere making this same point previously and I forgot about it. Also this is not really a direct response to your post, which I'm not explicitly agreeing or disagreeing with in this comment, but more a riff on the same ideas because you got me thinking about them.

I think much of the confusion around goals and goal directed behavior and what constitutes it and what doesn't lies in the fact that goals, as we are treating them here, are teleological, viz. they are defined in context of what we care about. Another way to say this is that goals are a way we anthropomorphize things, thinking of them as operating the same way we experience our own minds operating.

To see this, we can simply shift our perspective to think of anything as being goal directed. Is a twitching robot goal directed? Sure, if I created the robot to twitch, it's doing a great job of achieving its purpose. Is a bottle cap goal directed? Sure, it was created to keep stuff in, and it keeps doing a fine job of that. Conversely, am I goal directed? Maybe not: I just keep doing stuff and it's only after the fact that I can construct a story that says I was aiming to some goal. Is a paperclip maximizer goal directed? Maybe not: it just makes paperclips because it's programmed to and has no idea that that's what it's doing, no more than the bottle cap knows it's holding in liquid or the twitch robot knows it's twitching.

This doesn't mean goals are not important; I think goals matter a lot when we think about alignment because they are a construct that falls out of how humans make sense of the world and their own behavior, but they are interesting for that reason, not because they are a natural part of the world that exists prior to our creation of them in our ontologies, i.e. goals are a feature of the map, not the unmapped territory.

You seem to be using the words "goal-directed" differently than the OP.

And in different ways throughout your comment.

Is a bottle cap goal directed? Sure, it was created to keep stuff in, and it keeps doing a fine job of that.

It is achieving a purpose. (State of the world.)

Conversely, am I goal directed? Maybe not: I just keep doing stuff and it's only after the fact that I can construct a story that says I was aiming to some goal.

You seem to have a higher standard for people. I imagine you exhibit goal-directed behavior with the aim of maintaining certain equilibria/homeostasis - eating, sleeping, as well as more complicated behaviors to enable those. This is more than a bottle cap does, and more difficult a job than performed by a thermostat.

Is a paperclip maximizer goal directed? Maybe not: it just makes paperclips because it's programmed to and has no idea that that's what it's doing, no more than the bottle cap knows it's holding in liquid or the twitch robot knows it's twitching.

This sounds like a machine that makes paperclips without optimizing - not a maximizer. (Unless the twitching robot is a maximizer.) "Opt" means "to make a choice (from a range of possibilities)" - you do this, the other things not so much.

goals are a feature of the map, not the unmapped territory.

You don't think that a map of the world (including the agents in it) would include goals? (I can imagine a counterfactual where someone is put in different circumstances, but continues to pursue the same ends, at least at a basic level - eating, sleeping, etc.)

You seem to be using the words "goal-directed" differently than the OP.
And in different ways throughout your comment.

That's a manifestation of my point: what it would mean for something to be a goal seems to be able to shift depending on what it is you think is an important feature of the thing that would have the goal.

Planned summary for the newsletter:

I <@have argued@>(@Coherence arguments do not imply goal-directed behavior@) that coherence arguments that argue for modeling rational behavior as expected utility maximization do not add anything to AI risk arguments. This post argues that there is a different way in which to interpret these arguments: we should only model a system to be an EU maximizer if it was the result of an optimization process, such that the EU maximizer model is the best model we have of the system. In this case, the best way to predict the agent is to imagine what we would do if we had its goals, which leads to the standard convergent instrumental subgoals.

Planned opinion:

This version of the argument seems to be more a statement about our epistemic state than about actual AI risk. For example, I know many people without technical expertise who anthropomorphize their laptops as though they were pursuing some goal, but they don't (and shouldn't) worry that their laptops are going to take over the world. More details in this comment.

This theory of goal-directedness has the virtue of being closely tied to what we care about:

--If a system is goal-directed according to this definition, then (probably) it is the sort of thing that might behave as if it has convergent instrumental goals. It might, for example, deceive us and then turn on us later. Whereas if a system is not goal-directed according to this definition, then absent further information we have no reason to expect those behaviors.

--Obviously we want to model things efficiently. So we are independently interested in what the most efficient way to model something is. So this definition doesn't make us go that far out of our way to compute, so to speak.

On the other hand, I think this definition is not completely satisfying, because it doesn't help much with the most important questions:

--Given a proposal for an AGI architecture, is it the sort of thing that might deceive us and then turn on us later? Your definition answers: "Well, is it the sort of thing that can be most efficiently modelled as an EU-maximizer? If yes, then yes; if no, then no." The problem with this answer is that trying to see whether or not we can model the system as an EU-maximizer involves calculating out the system's behavior and comparing it to what an EU-maximizer (worse, to a range of EU-maximizers with various relatively simple or salient utility and credence functions) would do, and if we are doing that we can probably just answer the will-it-deceive-us question directly. Alternatively, perhaps we could look at the structure of the system--the architecture--and say "see, this here is similar to the EU-max algorithm." But if we are doing that, then again, maybe we don't need this extra step in the middle; maybe we can jump straight from looking at the structure of the system to inferring whether or not it will act like it has convergent instrumental goals.