Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This post is a result of numerous discussions with other participants and organizers of the MIRI Summer Fellows Program 2019.


I recently (hopefully) dissolved some of my confusion about agency. In the first part of the post, I describe a concept that I believe to be central to most debates around agency. I then briefly list some questions and observations that remain interesting to me. The gist of the post should make sense without reading any of the math.

Anthropomorphization, but with architectures that aren't humans

Architectures

Consider the following examples of "architectures":

Example (architectures)

  1. Architectures I would intuitively call "agenty":
    1. Monte Carlo tree search algorithm, parametrized by the number of rollouts made per move and by the utility function (or heuristic) used to evaluate positions.
    2. (semi-vague) "Classical AI-agent" with several interconnected modules (utility function and world model, actions, planning algorithm, and observations used for learning and updating the world model).
    3. (vague) Human parametrized by their goals, knowledge, and skills (and, of course, many other details).
  2. Architectures I would intuitively call "non-agenty":
    1. A hard-coded sequence of actions.
    2. Look-up table.
    3. Random generator (outputting a sample from P on every input, for some probability distribution P).
  3. Multi-agent architectures[1]:
    1. Ant colony.
    2. Company (consisting of individual employees, operating within an economy).
    3. Comprehensive AI services.

Working definition: An architecture A is some model, parametrizable by parameters θ from some set Θ, that receives inputs, produces outputs, and possibly keeps an internal state. We denote specific instances of A as A(θ).
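As a minimal sketch of this working definition (the class and field names below are my own, not an existing API), an architecture can be written as a parametrized object with a step function and an optional internal state:

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Architecture:
    """A model A, parametrized by theta; A(theta) is one specific instance."""
    theta: Any                                 # e.g. (num_rollouts, eval_fn) for MCTS
    step_fn: Callable[[Any, Any, Any], tuple]  # (theta, state, observation) -> (output, new_state)
    state: Any = None                          # optional internal state

    def step(self, observation):
        output, self.state = self.step_fn(self.theta, self.state, observation)
        return output

# A "non-agenty" instance (Example 2.2): a look-up table from inputs to outputs.
lookup = Architecture(theta={"ping": "pong"},
                      step_fn=lambda theta, state, obs: (theta.get(obs), state))
print(lookup.step("ping"))  # -> "pong"
```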

Generalizing anthropomorphization

Throughout the post, X will refer to some object, process, entity, etc., whose behavior we want to predict or understand. Examples include rocks, wind, animals, humans, AGIs, economies, families, or the universe.

A standard item in the human mental toolbox is anthropomorphization: modeling various things as humans (specifically, ourselves) with "funny" goals or abilities. We can make the same mental move for architectures other than humans:

Working definition (A(Θ)-morphization): Let A be an architecture. Then any[2] model of X as some instance A(θ) is an A(Θ)-morphization of X.

Anthropomorphization makes good predictions for other humans and some animals (curiosity, fear, hunger). On the other hand, it doesn't work so well for rocks, lightning, and AGIs --- not that this prevents us from using it anyway. We can measure the usefulness of A(Θ)-morphization by the degree to which it makes good predictions:

Working definition (prediction error): Suppose X exists in a world W and V = (V_1, V_2, ...) is a sequence of variables (events about X) that we want to predict. Suppose that v is how V actually unfolds and p is the prediction obtained by A(Θ)-morphizing X as A(θ). The prediction error of A(θ) (w.r.t. V and X in W) is the expected Brier score of p with respect to v.

Informally, we say that A(Θ)-morphizing X as A(θ) is accurate if the corresponding prediction error is low.[3]
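To make the prediction-error definition concrete, here is a small sketch (toy code of my own, with made-up probabilities and outcomes) that scores an A(θ)-morphization's probabilistic predictions against what actually happened, using the Brier score:

```python
def brier_score(prediction: dict, outcome) -> float:
    """Mean squared error between predicted probabilities and the realized outcome."""
    return sum((p - (1.0 if event == outcome else 0.0)) ** 2
               for event, p in prediction.items()) / len(prediction)

def prediction_error(predictions, outcomes) -> float:
    """Average Brier score over a sequence of events about X."""
    return sum(brier_score(p, o) for p, o in zip(predictions, outcomes)) / len(outcomes)

# Anthropomorphizing a dog as a "hungry agent": predictions for two events about it.
predictions = [{"eats": 0.9, "ignores_food": 0.1},
               {"fetches": 0.6, "sleeps": 0.4}]
outcomes = ["eats", "sleeps"]
print(prediction_error(predictions, outcomes))  # low value = accurate morphization
```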

When do we call things agents?

Main claim:

  1. I claim that the question "Is X an agent?" is without substance, and we should instead be asking "From the point of view of some external observer H, does X seem to exhibit agent-like behavior?".
  2. Moreover, "agent-like behavior" also seems ill-defined, because what we associate with "agency" is subjective. I propose to explicitly operationalize the question as "Is A(Θ)-morphizing X accurate?".

(A related question is how difficult it is for us to "run" A(θ). Indeed, we anthropomorphize so many things precisely because it is cheap for us to do so.)

Relatedly, I believe we already implicitly do this operationalization: Suppose you talk to your favorite human H about agency. H will likely subconsciously associate agency with certain architectures, maybe such as those in Examples 1.1-1.3. Moreover, H will ascribe varying degrees of agency to different architectures --- for me, 1.3 seems more agenty than 1.1. Similarly, there are some architectures that H will associate with "definitely not an agent". I conjecture that some X exhibits agent-like behavior according to H if it can be accurately predicted via A(Θ)-morphization for some agenty-to-H architecture A. Similarly, H would say that X exhibits non-agenty behavior if we can accurately predict it using some non-agenty-to-H architecture.

Critically, exhibiting agent-like-to-H behavior and exhibiting non-agenty-to-H behavior are not mutually exclusive, and I think this causes most of the confusion around agency. Indeed, we humans seem very agenty, but, at the same time, determinism implies that there exists some hard-coded behavior that we enact. A rock rolling downhill can be viewed as merely obeying the non-agenty laws of physics, but what if it "wants to" get as low as possible? As a result, we sometimes go "Humans are definitely agents, and rocks are definitely non-agents... although, wait, are they?".
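As a toy illustration of this non-exclusivity (a sketch of my own; the observations, models, and accuracy threshold are all made up, and squared error stands in for the Brier score), the same X can be accurately predicted by both a non-agenty and an agenty morphization:

```python
def is_accurate(morphization, history, threshold=0.1):
    """A morphization is accurate (to us) if its prediction error stays below a threshold."""
    errors = [(morphization(t) - height) ** 2 for t, height in history]
    return sum(errors) / len(errors) < threshold

# X: a rock rolling downhill, observed as (time, height) pairs (made-up data).
history = [(0, 10.0), (1, 7.4), (2, 5.1), (3, 2.4), (4, 0.0)]

# Non-agenty morphization: a hard-coded look-up table of "what physics does".
physics_table = {0: 10.0, 1: 7.4, 2: 5.1, 3: 2.4, 4: 0.0}
non_agenty = lambda t: physics_table[t]

# Agenty morphization: "the rock wants to get as low as possible, at a steady pace".
agenty = lambda t: max(0.0, 10.0 - 2.5 * t)

print(is_accurate(non_agenty, history))  # True
print(is_accurate(agenty, history))      # True: both views are accurate at once
```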

If we ban the concept of agency, which interesting problems remain?

"Agency" often comes up when discussing various alignment-related topics, such as the following:

Optimizer?

How do we detect whether X performs (or is capable of performing) optimization? How do we detect this from X's architecture (or causal origin) rather than by looking at its behavior? (This seems central to the topic of mesa-optimization.)

Agent-like behavior vs agent-like architecture

Consider the following conjecture: "Suppose some X exhibits agent-like behavior. Does it follow that X physically contains an agent-like architecture, such as the one from Example 1.2?". This conjecture is false --- as an example, Q-learning is a "fairly agenty" architecture that leads to intelligent behavior. However, the resulting RL "agent" has a fixed policy and thus functions as a large look-up table. A better question would thus be whether there exists an agent-like architecture causally upstream of X. This question also has a negative answer, as witnessed by the example of an ant colony --- agent-like behavior without agent-like architecture, produced by a "non-agenty" optimization process of evolution. Nonetheless, a general version of the question remains: If some X exhibits agent-like behavior, does it follow that there exists some interesting physical structure[4] causally upstream of X?[5]
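To make the Q-learning point concrete, here is a minimal sketch (a toy chain environment and hyperparameters of my own invention, not any particular RL library): the training loop looks "fairly agenty", but the deployed policy is literally a look-up table:

```python
import random

# Toy chain environment: states 0..4, actions -1/+1, reward 1 for reaching state 4.
def env_step(s, a):
    s_next = max(0, min(4, s + a))
    return s_next, (1.0 if s_next == 4 else 0.0)

Q = {(s, a): 0.0 for s in range(5) for a in (-1, 1)}
for _ in range(2000):  # the "fairly agenty" part: an explicit learning loop
    s = random.randrange(5)
    a = random.choice((-1, 1)) if random.random() < 0.2 \
        else max((-1, 1), key=lambda act: Q[(s, act)])
    s_next, r = env_step(s, a)
    Q[(s, a)] += 0.1 * (r + 0.9 * max(Q[(s_next, -1)], Q[(s_next, 1)]) - Q[(s, a)])

# The deployed "agent": a fixed dictionary from states to actions, i.e. a look-up table.
policy = {s: max((-1, 1), key=lambda act: Q[(s, act)]) for s in range(5)}
print(policy)  # typically {0: 1, 1: 1, 2: 1, 3: 1, 4: 1}
```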

Moral standing

Suppose there is some X, which I model as having some goals. When taking actions, should I give weight to those goals? (The answer to this question seems more related to consciousness than to A(Θ)-morphization. Note also that a particularly interesting version of the question can be obtained by replacing "I" with "AGI"...)

PC or NPC?

When making plans, should we model X as a part of the environment, or does X enter our game-theoretic considerations? Is X able to model us?

Creativity, unbounded goals, environment-generality

In some sense, AlphaZero is an extremely capable game-playing agent. On the other hand, if we gave it access to the internet[6], it wouldn't do anything with it. The same cannot be said for humans and unaligned AGIs, who would not only be able to orient themselves in this new environment but would eagerly execute elaborate plans to increase their influence. How can we tell whether some X is more like the former or the latter?

To summarize, I believe that many arguments and confusions surrounding agency can disappear if we explicitly use A(Θ)-morphization. This should allow us to focus on the problems listed above. Most of the definitions I gave are either semi-formal or informal, but I believe they could be made fully formal in more specific cases.

Regarding feedback: Suggestions for a better name for "A(Θ)-morphization" are super-welcome! If you know of an application for which such a formalization would be useful, please do let me know. Pointing out places where you expect a useful formalization to be impossible is also welcome.


  1. You might also view these multi-agent systems as monolithic agents, but this view might often give you wrong intuitions. I am including this category as an example that -- intuitively -- belongs to neither the "agent" nor the "not-agent" category. ↩︎

  2. By default, we do not assume that an A(Θ)-morphization of X is useful in any way, or even the most useful among all instances of A. This goes against the intuition according to which we would pick some A(θ) that is close to optimal (among A(Θ)) for predicting X. I am currently unsure how to formalize this intuition, apart from requiring that A(θ) is optimal (which seems too strong a condition). ↩︎

  3. Distinguishing between "small enough" and "too big" prediction errors seems non-trivial, since some environments are naturally more difficult to predict than others. Formalizing this will likely require additional insights. ↩︎

  4. An example of such "interesting physical structure" would be an implementation of an optimization architecture. ↩︎

  5. Even if true, this conjecture will likely require some additional assumptions. Moreover, I expect "randomly-generated look-up tables that happen to stumble upon AGI by chance" to serve as a particularly relevant counterexample. ↩︎

  6. Whatever that means in this case. ↩︎

Comments (9)

I think that the concept of "agency" (although maybe "intelligence" would be a better word?), in the context of AI alignment, implies the ability to learn the environment and exploit this knowledge towards a certain goal. The only way to pursue a goal effectively without learning is having hard-coded knowledge of the environment. But, where would this knowledge come from? For complex environments, it is only likely to come from learning algorithms upstream.

So, a rock is definitely not an agent since there is nothing it learns about its environment (I am not even sure what the input/output channels of a rock are supposed to be). Q-learning is an agent, but the resulting policy is not an agent in itself. Similarly, AlphaGo is a sort of agent when regarded together with the training loop (it can in principle learn to play different games), but not when disconnected from it. Evolution is an agent, even if not a very powerful one. An ant colony is probably a little agentic because it can learn something, although I'm not sure how much.

Yep, that totally makes sense.

Observations inspired by your comment: While this shouldn't necessarily be so, it seems the particular formulations make a lot of difference when it comes to exchanging ideas. If I read your comment without the

(although maybe "intelligence" would be a better word?)

bracket, I immediately go "aaa, this is so wrong!". And if I substitute "intelligent" for "agent", I totally agree with it. Not sure whether this is just me, or whether it generalizes to other people.

More specifically, I agree that, of the different concepts in the vicinity of "agency", "the ability to learn the environment and exploit this knowledge towards a certain goal" seems particularly important to AI alignment. I think the word "agency" is perhaps not well suited for this particular concept, since it comes with so many other connotations. But "intelligence" seems quite right.

I am not even sure what the input/output channels of a rock are supposed to be

I guess you imagine that the input is the physical forces affecting the rock and the output is the forces the rock exerts on the environment. Obviously, this is very much not useful for anything. But it suddenly becomes non-trivial if you consider something like a billiard-ball computer (which seems like a theoretical construct, and I'm not sure anybody has actually built one... but it seems like a relevant example anyway).

You mention the distinction between agent-like architecture and agent-like behavior (which I find similar to my distinction between selection and control), but how does the concept of A(Θ)-morphism account for this distinction? I have a sense that (formalized) versions of A(Θ)-morphism are going to be more useful (or easier?) for the behavioral side, though it isn't really clear.

I have a sense that (formalized) versions of A(Θ)-morphism are going to be more useful (or easier?) for the behavioral side, though it isn't really clear.

I think A(Θ)-morphization is primarily useful for describing what we often mean when we say "agency". In particular, I view this as distinct from the question of which concepts we should be thinking about in this space. (I think the promising candidates include the learning that Vanessa points to in her comment, optimization, search, and the concepts in the second part of my post.)

However, I think it might also serve as a useful part of the language for describing (non-)agent-like behavior. For example, we might want to SGD-morphize an E. coli bacterium independently of whether it actually implements some form of stochastic gradient descent w.r.t. the concentration of some chemicals in the environment.
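To give a flavor of this (a toy sketch with made-up dynamics and data, not a claim about real chemotaxis), SGD-morphizing the bacterium could mean modeling its next position as a gradient step toward higher concentration and checking how well that predicts its observed path:

```python
# Hypothetical concentration field of whatever the bacterium "likes"; peaks at x = 3.
concentration = lambda x: -(x - 3.0) ** 2

def sgd_morphization(x, lr=0.05, eps=1e-3):
    """Predict the next position as one gradient step toward higher concentration
    (deterministic here, for simplicity)."""
    grad = (concentration(x + eps) - concentration(x - eps)) / (2 * eps)
    return x + lr * grad

# Made-up observed positions of the bacterium; compare them to the model's predictions.
observed = [0.0, 0.31, 0.58, 0.82, 1.05]
predicted = [sgd_morphization(x) for x in observed[:-1]]
errors = [(p - o) ** 2 for p, o in zip(predicted, observed[1:])]
print(sum(errors) / len(errors))  # low error = the SGD-morphization is accurate here
```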

You mention the distinction between agent-like architecture and agent-like behavior (which I find similar to my distinction between selection and control), but how does the concept of A(Θ)-morphism account for this distinction?

I think of agent-like architectures as something objective, or related to the territory. In contrast, agent-like behavior is something subjective, something in the map. Importantly, agent-like behavior, or the lack of it, of some X is something that exists in the map of some entity Y (where often Y≠X).

The selection/control distinction seems related, but not quite similar to me. Am I missing something there?

I think of agent-like architectures as something objective, or related to the territory. In contrast, agent-like behavior is something subjective, something in the map. Importantly, agent-like behavior, or the lack of it, of some X is something that exists in the map of some entity Y (where often Y≠X).
The selection/control distinction seems related, but not quite similar to me. Am I missing something there?

A(Θ)-morphism seems to me to involve both agent-like architecture and agent-like behavior, because it just talks about prediction generally. Mostly I was asking if you were trying to point it one way or the other (we could talk about prediction-of-internals exclusively, to point at structure, or prediction-of-external exclusively, to talk about behavior -- I was unsure whether you were trying to do one of those things).

Since you say that you are trying to formalize how we informally talk, rather than how we should, I guess you weren't trying to make A(Θ)-morphism get at this distinction at all, and were separately mentioning the distinction as one which should be made.

I agree with your summary :). The claim was that humans often predict behavior by assuming that something has a particular architecture.

(And some confusions about agency seem to appear precisely because of not making the architecture/behavior distinction.)

This question also has a negative answer, as witnessed by the example of an ant colony --- agent-like behavior without agent-like architecture, produced by a "non-agenty" optimization process of evolution. Nonetheless, a general version of the question remains: If some X exhibits agent-like behavior, does it follow that there exists some interesting physical structure causally upstream of X?

Neat example! But for my part, I'm confused about this last sentence, even after reading the footnote:

An example of such "interesting physical structure" would be an implementation of an optimization architecture.

For one thing, I'm not sure I have much intuition about what is meant by "optimization architecture". For instance, I would not know how to begin answering the question:

Does optimization behavior imply optimization architecture?

And I have even less of a clue what is intended by "interesting physical structure" (perhaps facetiously, any process that causes agent-like behavior to arise sounds "interesting" for that reason alone).

In your ant colony example, is evolution the "interesting physical structure", and if so, how is it a physical structure?

First off, while I feel somewhat de-confused about X-like behavior, I don't feel very confident about X-like architectures. Maybe the meaning is somewhat clear on higher levels of abstraction (e.g., if my brain goes "realize I want to describe a concept --> visualize several explanations and judge each for suitability --> pick the one that seems the best --> send a signal to start typing it down", then this would be a kind of search/optimization-thingy). But on the level of physics, I don't really know what an architecture means. So take this with a grain of salt.

Maybe the term "physical structure" is misleading. The thing I was trying to point at is the distinction between being able to accurately model Y using model X, and Y actually being X. In the sense that there might be a giant look-up table (GLUT) that accurately predicts your behavior, but on no level of abstraction is it correct to say that you actually are a GLUT. Whereas modelling you as having some goals, planning, etc. might be less accurate but somewhat more, hm, true. I realize this isn't very precise, but I guess you can see what I mean.

That being said, I suppose that what I meant by "optimization architecture" is, for example, stochastic gradient descent, with the emphasis on "this is the input", "this is the part of the algorithm that does the calculation", and "this is the output". An "implementation of an optimization architecture" would be... well, the atoms of your computer that perform SGD, or maybe some simple bacterium that moves in the direction where the concentration of whatever-it-likes is highest (not that anything I know of implements precisely SGD, but still).

Ad "interesting physical structure" behind the ant-colony: If by "evolution" we mean the atoms that the world is made of, as they changed over time until your ant colony emerged...then yeah, this is a physical structure causally upstream of the ant colony, and one that is responsible for the ant colony behaving the way it does. I wouldn't say it is interesting (to me, and w.r.t. the ant colony) though, since it is totally incomprehensible to me. (But maybe "interestingness" doesn't really make sense on the level of physics, and is only relevant in relation to our abstract world-models and their understanding.)

Finally, the ideal thing that an "X-like behavior ==> Y-like architecture" theorem would cash out into is a criterion that you can actually check and use to say with certainty that the thing will not exhibit X-like behavior. (Whether this is reasonable to hope for is another matter.) So, even if all that I have written in this comment turns out to be nonsense, getting such a criterion is what we are after :-).