This post is a result of numerous discussions with other participants and organizers of the MIRI Summer Fellows Program 2019.
I recently (hopefully) dissolved some of my confusion about agency. In the first part of the post, I describe a concept that I believe to be central to most debates around agency. I then briefly list some questions and observations that remain interesting to me. The gist of the post should make sense without reading any of the math.
Anthropomorphization, but with architectures that aren't humans
Consider the following examples of "architectures":
1. Architectures I would intuitively call "agenty":
   1. Monte Carlo tree search algorithm, parametrized by the number of rollouts made each move and the utility function (or heuristic) used to evaluate positions.
   2. (semi-vague) "Classical AI-agent" with several interconnected modules (utility function and world model, actions, planning algorithm, and observations used for learning and updating the world model).
   3. (vague) Human parametrized by their goals, knowledge, and skills (and, of course, many other details).
2. Architectures I would intuitively call "non-agenty":
   1. A hard-coded sequence of actions.
   2. Look-up table.
   3. Random generator (outputting x ∼ π on every input, for some probability distribution π).
3. Multi-agent architectures:
   1. Ant colony.
   2. Company (consisting of individual employees, operating within an economy).
   3. Comprehensive AI services.
Working definition: An architecture A(Θ) is some model parametrizable by θ ∈ Θ that receives inputs, produces outputs, and possibly keeps an internal state. We denote specific instances of A(Θ) as A(θ).
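To make the working definition concrete, here is a minimal Python sketch. It assumes that an architecture can be represented as a class whose constructor arguments play the role of the architecture's parameters, and that "receiving inputs and producing outputs" is a single `step` call. The names `Architecture`, `LookupTable`, `HardCodedSequence`, and `step` are illustrative choices, not part of the definition above.

```python
from typing import Any, Hashable, Protocol, Sequence


class Architecture(Protocol):
    """One instance of an architecture: input in, output out, optional state."""

    def step(self, observation: Any) -> Any: ...


class LookupTable:
    """Non-agenty example: the parameter is the table itself; no internal state."""

    def __init__(self, table: dict[Hashable, Any], default: Any = None) -> None:
        self.table = table
        self.default = default

    def step(self, observation: Hashable) -> Any:
        return self.table.get(observation, self.default)


class HardCodedSequence:
    """Non-agenty example: the parameter is a fixed action list; state is a counter."""

    def __init__(self, actions: Sequence[Any]) -> None:
        self.actions = list(actions)
        self.t = 0  # internal state: how many outputs have been produced so far

    def step(self, observation: Any) -> Any:
        action = self.actions[self.t % len(self.actions)]  # the input is ignored
        self.t += 1
        return action


table = LookupTable({"red": "stop", "green": "go"}, default="wait")
seq = HardCodedSequence(["left", "right"])
outputs = [seq.step(None) for _ in range(3)]
```

Different constructor arguments give different instances of the same architecture, which is all the definition asks for.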
Throughout the post, X will refer to some object, process, entity, etc., whose behavior we want to predict or understand. Examples include rocks, wind, animals, humans, AGIs, economies, families, or the universe.
A standard item in the human mental toolbox is anthropomorphization: modeling various things as humans (specifically, ourselves) with "funny" goals or abilities. We can make the same mental move for architectures other than humans:
Working definition (A(Θ)-morphization): Let A(Θ) be an architecture. Then any model A(θ) is an A(Θ)-morphization of X.
Anthropomorphization makes good predictions for other humans and some animals (curiosity, fear, hunger). On the other hand, it doesn't work so well for rocks, lightning, and AGIs, not that this prevents us from using it anyway. We can measure the usefulness of A(Θ)-morphization by the degree to which it makes good predictions:
Working definition (prediction error): Suppose X exists in a world W and E = (E_1, ..., E_n) is a sequence of variables (events about X) that we want to predict. Suppose that e = (e_1, ..., e_n) is how E actually unfolds and p = (p_1, ..., p_n) is the prediction obtained by A(Θ)-morphizing X as A(θ). The prediction error of A(θ) (w.r.t. X and E in W) is the expected Brier score of p with respect to e.
Informally, we say that A(Θ)-morphizing X is accurate if the corresponding prediction error is low.
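The prediction error can be sketched in a few lines of Python. This sketch assumes that each event is categorical, that each prediction is a probability distribution over that event's possible outcomes, and that the expectation is approximated by averaging over sampled histories. The function names and the rock example are illustrative assumptions, not part of the definition.

```python
from typing import Mapping, Sequence

Forecast = Mapping[str, float]  # outcome label -> predicted probability


def brier_score(forecasts: Sequence[Forecast], outcomes: Sequence[str]) -> float:
    """Mean multi-category Brier score: 0 is perfect, 2 is the worst case."""
    if len(forecasts) != len(outcomes):
        raise ValueError("need exactly one forecast per event")
    total = 0.0
    for forecast, actual in zip(forecasts, outcomes):
        labels = set(forecast) | {actual}
        total += sum(
            (forecast.get(label, 0.0) - (1.0 if label == actual else 0.0)) ** 2
            for label in labels
        )
    return total / len(outcomes)


def expected_brier_score(forecasts, sampled_histories) -> float:
    """Monte Carlo estimate of the expectation over how the events unfold."""
    scores = [brier_score(forecasts, history) for history in sampled_histories]
    return sum(scores) / len(scores)


# Two candidate models of a rock on a slope, each predicting two events.
confident = [{"rolls": 1.0, "stays": 0.0}, {"rolls": 1.0, "stays": 0.0}]
uninformed = [{"rolls": 0.5, "stays": 0.5}, {"rolls": 0.5, "stays": 0.5}]
observed = ["rolls", "rolls"]

confident_error = brier_score(confident, observed)
uninformed_error = brier_score(uninformed, observed)
```

Here the confident model gets a lower error than the uninformed one, so under this sketch it would count as the more accurate model of the rock. Where to draw the line between "low" and "high" error is left open, as in the text.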
When do we call things agents?
- I claim that the question "Is X an agent?" is without substance, and we should instead be asking "From the point of view of some external observer Y, does X seem to exhibit agent-like behavior?".
- Moreover, "agent-like behavior" also seems ill-defined, because what we associate with "agency" is subjective. I propose to explicitly operationalize the question as "Is A(Θ)-morphizing X accurate?".
(A related question is how difficult it is for us to "run" A(θ). Indeed, we anthropomorphize so many things precisely because it is cheap for us to do so.)
Relatedly, I believe we already implicitly do this operationalization: Suppose you talk to your favorite human H about agency. H will likely subconsciously associate agency with certain architectures, maybe such as those in Examples 1.1-3. Moreover, H will ascribe varying degrees of agency to different architectures (for me, 1.3 seems more agenty than 1.1). Similarly, there are some architectures that H will associate with "definitely not an agent". I conjecture that some X exhibits agent-like behavior according to H if it can be accurately predicted via A(Θ)-morphization for some agenty-to-H architecture A(Θ). Similarly, H would say that X exhibits non-agenty behavior if we can accurately predict it using some non-agenty-to-H architecture.
Critically, exhibiting agent-like-to-H and non-agenty-to-H behavior is not mutually exclusive, and I think this causes most of the confusion around agency. Indeed, we humans seem very agenty but, at the same time, determinism implies that there exists some hard-coded behavior that we enact. A rock rolling downhill can be viewed as merely obeying the non-agenty laws of physics, but what if it "wants to" get as low as possible? And, as a result, we sometimes go "Humans are definitely agents, and rocks are definitely non-agents...although, wait, are they?".
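A toy illustration of this non-exclusivity, using the rolling rock: the sketch below predicts the same trajectory twice, once with an "agenty" model (a greedy utility maximizer) and once with a "non-agenty" model (a hard-coded move list). The 1-D terrain, the utility function, and all names are hypothetical choices made for the example.

```python
heights = [5, 4, 3, 1, 2, 6]  # terrain profile; the rock starts at index 0


def greedy_agent(position: int, steps: int) -> list[int]:
    """Agenty model: each step, pick the move in {-1, 0, +1} that maximizes
    utility(p) = -heights[p], i.e. the rock "wants" to get as low as possible."""
    trajectory = [position]
    for _ in range(steps):
        candidates = [position + move for move in (0, -1, 1)
                      if 0 <= position + move < len(heights)]
        position = max(candidates, key=lambda p: -heights[p])
        trajectory.append(position)
    return trajectory


def hard_coded(position: int, moves: list[int]) -> list[int]:
    """Non-agenty model: replay a fixed list of moves, ignoring the terrain."""
    trajectory = [position]
    for move in moves:
        position += move
        trajectory.append(position)
    return trajectory


agenty_prediction = greedy_agent(0, steps=5)
non_agenty_prediction = hard_coded(0, moves=[1, 1, 1, 0, 0])
```

Both models output the same trajectory, so both are equally accurate for this rock on this terrain. They only come apart once the terrain changes, which is one way to see why the two descriptions are not in conflict.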
If we ban the concept of agency, which interesting problems remain?
"Agency" often comes up when discussing various alignment-related topics, such as the following:
How do we detect whether X performs (or is capable of performing) optimization? How do we detect this from X's architecture (or causal origin) rather than by looking at its behavior? (This seems central to the topic of mesa-optimization.)
Agent-like behavior vs agent-like architecture
Consider the following conjecture: "Suppose some X exhibits agent-like behavior. Does it follow that X physically contains agent-like architecture, such as the one from Example 1.2?". This conjecture is false: as an example, Q-learning is a "fairly agenty" architecture that leads to intelligent behavior. However, the resulting RL "agent" has a fixed policy and thus functions as a large look-up table. A better question would thus be whether there exists an agent-like architecture causally upstream of X. This question also has a negative answer, as witnessed by the example of an ant colony: agent-like behavior without agent-like architecture, produced by a "non-agenty" optimization process of evolution. Nonetheless, a general version of the question remains: If some X exhibits agent-like behavior, does it follow that there exists some interesting physical structure causally upstream of X?
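The Q-learning point can be made concrete with a small sketch. It assumes tabular Q-learning on a hypothetical five-state chain, with a uniformly random behavior policy during training; the environment, hyperparameters, and names are all illustrative. What matters is the last line: once training stops, what remains is a dictionary from states to actions.

```python
import random

N_STATES, GOAL = 5, 4          # states 0..4 on a line; reaching 4 ends the episode
ACTIONS = (-1, +1)             # move left or right
ALPHA, GAMMA = 0.5, 0.9        # learning rate and discount factor


def step(state: int, action: int) -> tuple[int, float]:
    """Deterministic chain dynamics with reward 1.0 on reaching the goal."""
    next_state = min(max(state + action, 0), GOAL)
    return next_state, (1.0 if next_state == GOAL else 0.0)


def train(episodes: int, seed: int = 0) -> dict[tuple[int, int], float]:
    """Tabular Q-learning with a uniformly random behavior policy."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state = 0
        while state != GOAL:
            action = rng.choice(ACTIONS)
            next_state, reward = step(state, action)
            best_next = 0.0 if next_state == GOAL else max(
                q[(next_state, a)] for a in ACTIONS)
            q[(state, action)] += ALPHA * (reward + GAMMA * best_next - q[(state, action)])
            state = next_state
    return q


q_values = train(episodes=500)
# Freezing the greedy policy leaves nothing but a state -> action look-up table.
policy = {s: max(ACTIONS, key=lambda a: q_values[(s, a)]) for s in range(GOAL)}
```

The learning loop is the "agenty" part, and it sits causally upstream of `policy`. The deployed artifact itself is just a table.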
Suppose there is some X, which I model as having some goals. When taking actions, should I give weight to those goals? (The answer to this question seems more related to consciousness than to A(Θ)-morphization. Note also that a particularly interesting version of the question can be obtained by replacing "I" by "AGI"...)
PC or NPC?
When making plans, should we model X as a part of the environment, or does it enter our game-theoretical considerations? Is X able to model us?
Creativity, unbounded goals, environment-generality
In some sense, AlphaZero is an extremely capable game-playing agent. On the other hand, if we gave it access to the internet, it wouldn't do anything with it. The same cannot be said for humans and unaligned AGIs, who would not only be able to orient themselves in this new environment but would eagerly execute elaborate plans to increase their influence. How can we tell whether some X is more like the former or the latter?
To summarize, I believe that many arguments and confusions surrounding agency can disappear if we explicitly use A(Θ)-morphization. This should allow us to focus on the problems listed above. Most definitions I gave are either semi-formal or informal, but I believe they could be made fully formal in more specific cases.
Regarding feedback: Suggestions for a better name for "A(Θ)-morphization" are super-welcome! If you know of an application for which such a formalization would be useful, please do let me know. Pointing out places where you expect a useful formalization to be impossible is also welcome.
You might also view these multi-agent systems as monolithic agents, but this view might often give you wrong intuitions. I am including this category as an example that, intuitively, doesn't belong to either of the "agent" and "not-agent" categories. ↩︎
By default, we do not assume that an A(Θ)-morphization A(θ) of X is useful in any way, or even the most useful among all instances of A(Θ). This goes against the intuition according to which we would pick some A(θ) that is close to optimal (among A(Θ)) for predicting X. I am currently unsure how to formalize this intuition, apart from requiring that A(θ) is optimal (which seems too strong a condition). ↩︎
Distinguishing between "small enough" and "too big" prediction errors seems non-trivial since different environments are naturally more difficult to predict than others. Formalizing this will likely require additional insights. ↩︎
An example of such "interesting physical structure" would be an implementation of an optimization architecture. ↩︎
Even if true, this conjecture will likely require some additional assumptions. Moreover, I expect "randomly-generated look-up tables that happen to stumble upon AGI by chance" to serve as a particularly relevant counterexample. ↩︎
Whatever that means in this case. ↩︎